Thesis
The highest-integrity facilities no longer run an on-prem "IT stack." They run plant (BMS, controllers, access control) on site, and move everything else—monitoring, analytics, orchestration, ticketing, NMS, visualization—into a hardened cloud control plane. That makes your data center simpler to operate, easier to secure, and measurably more reliable. This paper lays out a full, defensible blueprint: what remains on site, how it connects, how the cloud side is built, and which controls are mandatory for a provider entrusted with this level of access.
1) First principles
Minimal attack surface on site. Keep no general-purpose servers in the facility. Only deterministic systems live there: the BMS, microcontrollers/PLCs for power/cooling/fire, physical access control hardware, cameras, and network/telecom equipment.
Engineered connectivity, not air-gaps. OT and management must talk to the cloud for telemetry, analytics, firmware, and remote ops—but only through explicit, authenticated, least-privilege paths.
Identity everywhere. Humans, services, and devices authenticate strongly (MFA + PAM for humans; mTLS + short-lived tokens for services; X.509 certificates for devices).
Observability as a safety system. Time-series, logs, and events are a first-class workload with SLOs and DR, not an afterthought.
Separation of concerns. On-site systems measure and actuate; the cloud decides, visualizes, and governs.
Compliance is table stakes. The provider's platform operates under SOC 2 Type II and/or ISO/IEC 27001, with transparent subprocessors and auditable controls.
2) On-site: what stays, how it's wired, how it's secured
2.1 The asset classes that remain on site
Plant controls & sensing: BMS head-end (if required by vendor), PLCs for chillers/CDUs, PDUs/UPSes/switchgear IEDs, fire alarm control, leak detection, environmental sensors.
Physical security: Access control panels/readers/strikes, mantraps, tailgate sensors, cameras/NVRs (NVR optional if cloud VMS is used), intercoms.
Network & telecom: Core/aggregation/access switches, routers/firewalls, OOB console servers, LTE failover, structured cabling, GPS/PTP (where needed).
Operator endpoints: Security/operations workstations, NOC displays, staff laptops/tablets.
Zero "general IT" servers: No AD, no app servers, no NMS servers, no visualization servers on site.
2.2 Network segmentation (three planes + OOB)
OT plane (VRF-OT): Fieldbus/PLC/BMS/IED networks. No internet egress. Strict allow-lists to Fabric plane brokers/gateways only.
Fabric plane (VRF-FABRIC): Site message brokers (or lightweight gateways), protocol translators (Modbus/BACnet/OPC UA → MQTT), webhook terminators (if any), log forwarders. This is the sole path northbound.
Enterprise plane (VRF-ENT): Staff endpoints (wired/Wi-Fi), guest internet, office printers, etc. No path to OT; tightly restricted path to Fabric services.
Out-of-band (OOB): Isolated console network (RS-232/IPMI/USB-serial) with LTE or separate fiber uplink; used for break-glass and zero-touch.
Controls: Default-deny ACLs between VRFs; 802.1X on access ports; management addresses in a dedicated VRF; MACsec on critical uplinks; DHCP snooping + IP source guard in ENT.
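To make the default-deny posture concrete, here is a minimal policy-as-data sketch. The VRF names follow the planes above, but the specific allowed flows, ports, and the Python rendering are illustrative assumptions; real enforcement lives in switch and firewall ACLs.

```python
# Illustrative only: real enforcement lives in switch/firewall ACLs.
# The VRF names match the planes above; the allowed flows are assumptions for the sketch.
from typing import NamedTuple

class Flow(NamedTuple):
    src_vrf: str
    dst_vrf: str
    dst_port: int
    proto: str

# Explicit allow-list; anything not listed is denied by default.
ALLOWED_FLOWS = {
    Flow("VRF-OT",  "VRF-FABRIC", 8883, "tcp"),  # gateway -> fabric translator (MQTT over TLS)
    Flow("VRF-ENT", "VRF-FABRIC", 443,  "tcp"),  # operator endpoints -> fabric services (HTTPS)
}

def is_permitted(flow: Flow) -> bool:
    """Default-deny: only explicitly allow-listed inter-VRF flows pass."""
    return flow in ALLOWED_FLOWS

assert is_permitted(Flow("VRF-OT", "VRF-FABRIC", 8883, "tcp"))
assert not is_permitted(Flow("VRF-ENT", "VRF-OT", 502, "tcp"))  # ENT may never reach OT
```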
2.3 Message fabric (GridFabric pattern, serverless on site)
Gateways (embedded or appliance form-factor) sit at the edge of the OT plane, timestamp readings, and publish typed MQTT v5 messages to a cloud ingress (no heavy site broker required).
Store-and-forward on the gateways buffers 48–96 h of telemetry/event traffic and flushes in order when the WAN returns; message expiry avoids replaying stale, high-rate telemetry (see the buffering sketch after this list).
API-only controllers (e.g., access control/VMS that live in a vendor cloud) are integrated via cloud-side connectors; no adapters are hosted on site.
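A minimal sketch of the store-and-forward behavior described above, assuming an in-memory queue and a 72-hour expiry; a production gateway would persist the buffer to flash or disk and publish over MQTT v5 with message expiry set.

```python
# Minimal store-and-forward sketch: buffer while the WAN is down, flush in order when it returns,
# and drop anything past its expiry so stale, high-rate telemetry is never replayed.
import collections, time

class StoreAndForward:
    def __init__(self, max_age_s: float = 72 * 3600):
        self.max_age_s = max_age_s
        self.queue = collections.deque()  # (timestamp, topic, payload), oldest first

    def enqueue(self, topic: str, payload: bytes) -> None:
        self.queue.append((time.time(), topic, payload))

    def flush(self, publish) -> int:
        """Replay buffered messages in order, skipping anything past its expiry."""
        sent = 0
        now = time.time()
        while self.queue:
            ts, topic, payload = self.queue.popleft()
            if now - ts > self.max_age_s:
                continue                   # expired: do not replay stale telemetry
            publish(topic, payload)        # e.g., an MQTT v5 publish with message expiry set
            sent += 1
        return sent

# Usage: buffer during a WAN outage, then flush once connectivity returns.
buf = StoreAndForward()
buf.enqueue("site1/cooling/cdu-03/supply_temp_c", b'{"v": 17.2}')
buf.flush(lambda topic, payload: print("publish", topic, payload))
```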
2.4 External access & ZTNA
No inbound holes to site networks. All remote operations ride over outbound, mTLS-pinned tunnels from the Fabric plane to the provider's ZTNA broker.
Operators connect to a cloud bastion (SSO+MFA+PAM), then pivot down via just-in-time, audited sessions (SSH/RDP/serial proxy) that are recorded.
Break-glass OOB exists, but is powered down and requires dual approval to enable.
2.5 Wireless realities (BLE/Wi-Fi on gear)
Commission radios, then disable them or lock them to pairing-only mode where possible.
Where a radio must remain: isolate to dedicated SSIDs/VLANs with EAP-TLS (802.1X), low power, and strict geofencing; treat BLE/Wi-Fi keys like credentials (rotate, audit).
2.6 Physical ↔ logical interlocks
Mantrap tailgate → inner reader inhibit via a hardwired emergency input plus cloud policy; dock door open → reader disable deeper in the facility; camera bookmarks on controller alarms. (All are surfaced as typed events to the cloud.)
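As a sketch of what "typed events" can look like on the wire, the snippet below serializes a hypothetical tailgate event; the schema name and fields are assumptions, not a fixed standard.

```python
# A typed interlock event with a simple envelope: schema name, source, timestamp, payload.
import json, dataclasses, datetime

@dataclasses.dataclass
class TypedEvent:
    schema: str    # e.g., "physical_security.tailgate.v1" (illustrative schema name)
    site: str
    source: str    # originating device/controller
    ts: str        # RFC 3339 UTC timestamp
    payload: dict

    def to_json(self) -> str:
        return json.dumps(dataclasses.asdict(self), separators=(",", ":"))

event = TypedEvent(
    schema="physical_security.tailgate.v1",
    site="site1",
    source="mantrap-02",
    ts=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    payload={"inner_reader_inhibited": True, "camera_bookmark": "cam-14"},
)
print(event.to_json())  # published northbound to the tenant bus
```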
2.7 On-site resilience targets
Power loss: Door hardware fails safe per code; gateways ride a DC bus or UPS; OOB remains up.
WAN loss: Site runs autonomously on local policies; gateways buffer; commands fail fast with a clear "remote unavailable" error.
Clock discipline: NTP/NTS on the Fabric plane; PTP locally if needed; any drift >1 s raises a warning.
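A minimal drift-check sketch under two assumptions: an SNTP source reachable on the Fabric plane (the hostname below is a placeholder) and a 1 s warning threshold as above.

```python
# Query an SNTP server and warn if the local clock has drifted more than 1 s.
import socket, struct, time

NTP_UNIX_OFFSET = 2_208_988_800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def sntp_time(server: str, timeout: float = 2.0) -> float:
    packet = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client request)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
    secs = struct.unpack("!I", data[40:44])[0]  # Transmit Timestamp, integer seconds
    return secs - NTP_UNIX_OFFSET

drift = abs(time.time() - sntp_time("ntp.fabric.local"))  # placeholder hostname
if drift > 1.0:
    print(f"WARNING: clock drift {drift:.2f}s exceeds 1s threshold")
```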
3) Cloud control plane: everything you removed from the building
The cloud side is a product, not a pile of VMs. Below is a provider-neutral reference architecture (it maps onto AWS, Azure, or GCP with equivalent services).
3.1 Multi-tenant isolation model
Org tenancy: Each customer gets an isolated account/subscription/project (hard boundary), or a VPC-per-tenant with strict controls if accounts are impractical.
Network: Hub-and-spoke, with a Transit Hub VPC/VNet for shared egress and security tooling and per-tenant Spoke VPCs for data and apps. No tenant-to-tenant routing.
Keys & secrets: Tenant-scoped KMS keys; separate secrets vault namespaces; no cross-tenant KMS reuse.
3.2 Ingestion & messaging
MQTT Ingress (managed or containerized brokers): terminates device mTLS; validates topics and quotas (see the sketch after this list); enforces Clean Start=false and Last Will semantics.
API/Webhook ingress: Internet-facing gateway validates HMAC signatures, enforces mTLS for vendors that support it, and maps vendor events to typed messages on the bus.
Streaming spine: Messages flow into a durable log (e.g., Kafka-class or managed pub/sub) for fan-out to time-series, SIEM, and rule engines.
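A sketch of the topic and quota validation step at the MQTT ingress, assuming a tenant/site/class/device/metric topic convention and a per-device publish quota; both are illustrative choices, not a prescribed format.

```python
# Reject malformed topics and devices exceeding their publish quota at the ingress.
import re, time, collections

TOPIC_RE = re.compile(r"^[a-z0-9-]+/[a-z0-9-]+/[a-z0-9_]+/[a-z0-9-]+/[a-z0-9_]+$")
MAX_MSGS_PER_MIN = 600  # illustrative quota

_window = collections.defaultdict(list)  # device_id -> recent publish timestamps

def accept(topic: str, device_id: str) -> bool:
    """True only for well-formed topics from devices inside their quota."""
    if not TOPIC_RE.match(topic):
        return False
    now = time.time()
    recent = [t for t in _window[device_id] if now - t < 60]
    if len(recent) >= MAX_MSGS_PER_MIN:
        return False
    recent.append(now)
    _window[device_id] = recent
    return True

assert accept("tenant-a/site1/cooling/cdu-03/supply_temp_c", "cdu-03")
assert not accept("bad topic with spaces", "cdu-03")
```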
3.3 Observability & analytics
Time-series: Prometheus-compatible remote-write at the tenant edge feeding a long-retention TSDB per tenant; downsampled roll-ups for 12+ months.
Dashboards & alerts: Grafana per tenant; alert rules checked into Git; on-call escalation with runbook links.
Object storage: Raw JSON/event archives for forensics; lifecycle policies tune cost.
SIEM/SOAR: Central log lake with per-tenant partitions; detections for anomalous device behavior, policy violations, and ZTNA events; automated response (quarantine, token revocation).
3.4 Network & access
ZTNA / Bastion: Zero-trust gateway with SSO (OIDC/SAML) + MFA; PAM brokering all privileged sessions; SCIM for lifecycle.
RBAC/ABAC: Scopes for Environment (prod/stage), Tenant, Site, and Role; break-glass roles are time-boxed and require dual approval (see the sketch below).
Outbound policy: Egress only to allow-listed vendor clouds for API connectors and firmware repositories; outbound TLS inspection where legally permissible.
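A sketch of how the scope model above might be evaluated, including a time-boxed break-glass grant; the exact scope names and the dual-approval workflow itself are assumptions left out of the snippet.

```python
# Scope-based authorization with a time-boxed break-glass grant.
import dataclasses, datetime

@dataclasses.dataclass
class Grant:
    environment: str   # "prod" | "stage"
    tenant: str
    site: str
    role: str          # e.g., "operator", "breakglass-admin"
    expires_at: datetime.datetime | None = None  # set only for break-glass grants

def authorized(grant: Grant, environment: str, tenant: str, site: str, action_role: str) -> bool:
    """All scopes must match, and break-glass grants must still be inside their window."""
    if grant.expires_at and datetime.datetime.now(datetime.timezone.utc) > grant.expires_at:
        return False  # break-glass window has closed
    return ((grant.environment, grant.tenant, grant.site) == (environment, tenant, site)
            and grant.role == action_role)

bg = Grant("prod", "tenant-a", "site1", "breakglass-admin",
           expires_at=datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=2))
print(authorized(bg, "prod", "tenant-a", "site1", "breakglass-admin"))  # True within the window
```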
3.5 Service plane for tooling
NMS / Topology / IPAM: Runs as a cloud service per tenant (no SNMP from cloud to site; telemetry comes from Fabric).
Vendor-specific apps (DCIM extensions, controller GUIs): Deployed as immutable images behind ZTNA, not exposed to the internet; SSO enforced; session recording for admin access.
Rule engines & digital twins: Stateless services that subscribe to telemetry and publish advisory commands (e.g., pump bump +5% on ΔT collapse).
3.6 Software delivery & data protection
Everything-as-Code: IaC for VPCs, brokers, exporters, dashboards; config changes via PRs, not consoles.
CI/CD with attestations: Signed artifacts, SBOMs, image scanning; only signed images may run.
Backups & DR: Cross-region replication for TSDB + object storage; tested RTO/RPO per tenant; quarterly failover drills.
Data residency: Pin storage/compute to the customer's region; vendor subprocessors documented; residency honored contractually.
4) Mandatory security controls for the provider
Frameworks: Maintain SOC 2 Type II and/or ISO/IEC 27001 with the platform in scope (ingress, ZTNA, TSDB, SIEM, brokers, storage). Provide reports under NDA.
Identity: SSO, MFA, SCIM, PAM; quarterly access reviews; separation of duties (SoD) between engineering and operations.
Crypto: TLS 1.2+ everywhere; device mTLS with private PKI; envelope encryption with per-tenant KMS keys; HSM-backed root for the CA.
Endpoint & HR security: Background-checked staff; hardened workstations (EDR, disk encryption, screen lock, up-to-date OS); least-privilege support accounts; secure SDLC training.
Change & incident: Ticketed change control; peer review; observability on changes; 24×7 incident response with customer comms SLAs; post-incident RCAs with action items.
Supply chain: Approved subprocessors list; DPIAs where required; vendor pen tests; quick revocation path for compromised components.
Customer isolation: Technical (accounts/VPCs/keys) and organizational (on-call separation); chaos tests proving one tenant cannot impact another.
5) End-to-end flows (how it actually behaves)
5.1 Cooling telemetry & advisory control
A CDU gateway reads supply/return temperatures and flow every second, signs an MQTT message with its device certificate, and sends it to the cloud ingress.
The time-series store ingests it; Grafana trends ΔT. The rule engine notices ΔT < 3 °C sustained for 20 seconds and publishes a command "bump setpoint −1 °C" with a request ID and user=ruleengine.
The gateway validates the command against local policy, applies envelope constraints (min/max), actuates, and emits cmd_ack. If the WAN is down, it does nothing and logs "cloud absent."
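A sketch of the gateway-side handling just described, assuming a JSON command envelope and illustrative setpoint limits; the local clamp is what keeps cloud-issued commands inside a safe envelope.

```python
# Gateway command handling: validate locally, clamp to the envelope, actuate, acknowledge.
import json, time

SETPOINT_MIN_C, SETPOINT_MAX_C = 12.0, 24.0  # local envelope; the cloud cannot exceed it

def handle_command(raw: bytes, current_setpoint_c: float, actuate, publish, wan_up: bool):
    cmd = json.loads(raw)
    if not wan_up:
        print("cloud absent: command ignored")          # fail fast, no blind actuation
        return
    requested = current_setpoint_c + cmd["delta_c"]      # e.g., delta_c = -1.0
    applied = min(max(requested, SETPOINT_MIN_C), SETPOINT_MAX_C)  # clamp to the envelope
    actuate(applied)
    publish("tenant-a/site1/cooling/cdu-03/cmd_ack", json.dumps({
        "request_id": cmd["request_id"],
        "requested_c": requested,
        "applied_c": applied,
        "ts": time.time(),
    }))
```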
5.2 Access control (API-only)
An operator clicks "Unlock 8 s" in a Grafana panel (the button posts to a command API behind ZTNA).
The per-tenant cloud API connector uses OAuth2 to call the vendor's /unlock endpoint. The vendor webhooks back access_granted, then state.locked=false.
The webhook gateway validates the HMAC and timestamp, maps the payload to a typed event/state, and publishes it to the tenant bus; dashboards and audit trails update in near real time.
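A sketch of that validation and mapping step, assuming an HMAC-SHA256 signature over "timestamp.body" and a five-minute replay window; header names and the signing scheme vary by vendor.

```python
# Webhook ingress: check the timestamp window, verify the HMAC, map to a typed event.
import hmac, hashlib, json, time

MAX_SKEW_S = 300  # reject deliveries older than five minutes

def verify_and_map(body: bytes, signature_hex: str, timestamp: str, secret: bytes) -> dict | None:
    if abs(time.time() - float(timestamp)) > MAX_SKEW_S:
        return None                                   # stale or replayed delivery
    expected = hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        return None                                   # signature mismatch
    vendor_event = json.loads(body)
    return {                                          # typed event for the tenant bus
        "schema": "access_control.door_state.v1",     # illustrative schema name
        "source": vendor_event.get("device_id"),
        "event": vendor_event.get("type"),            # e.g., "access_granted"
        "ts": timestamp,
    }
```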
5.3 Firmware update
The provider publishes a signed firmware image to the tenant's staging repo.
Device-class policy allows a maintenance window of 23:00–03:00 local time.
Gateways fetch via allow-listed egress, verify signature, and stage atomically.
On reboot, the device reports the new version in its birth payload; health probes watch for regressions; automatic rollback if probes fail.
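A sketch of digest verification and atomic staging on the gateway; the manifest format is an assumption, and a real pipeline would also verify a detached signature against a pinned vendor public key before trusting the manifest.

```python
# Verify the downloaded image against the manifest digest, then stage it atomically.
import hashlib, json, os, tempfile

def stage_firmware(image: bytes, manifest_json: str, staging_path: str) -> bool:
    manifest = json.loads(manifest_json)   # assumed already signature-checked upstream
    digest = hashlib.sha256(image).hexdigest()
    if digest != manifest["sha256"]:
        return False                       # corrupt or tampered download: refuse to stage
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(staging_path))
    with os.fdopen(fd, "wb") as f:
        f.write(image)
        f.flush()
        os.fsync(f.fileno())               # make sure the bytes hit storage before the swap
    os.replace(tmp, staging_path)          # atomic: the old image stays valid until this point
    return True
```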
6) Operations SLOs and runbooks
Ingestion availability: 99.95% per tenant region; p99 end-to-end event latency < 2 s (see the percentile sketch below).
Command round-trip: p95 ≤ 2 s to cmd_ack for API-backed devices; ≤ 1 s for direct gateway devices.
Time sync SLO: 99.9% of site devices within ±0.5 s of UTC.
Buffer durability: ≥ 72 h at typical rates; visible backlog gauges.
Runbooks: WAN loss (site autonomy), vendor cloud outage (graceful degradation), PKI compromise (rotate trust chain), OOB activation (dual-auth, time-boxed, recorded).
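For the latency SLOs above, a small sketch of how p95/p99 checks could be computed from measured samples; the sample values are illustrative.

```python
# Check measured latency samples against the SLO targets in Section 6.
import statistics

def p95(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=20)[18]    # 19 cut points; index 18 is the 95th percentile

def p99(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=100)[98]   # 99 cut points; index 98 is the 99th percentile

cmd_roundtrips_s = [0.4, 0.7, 1.1, 0.9, 1.8, 0.6, 2.3, 0.8]   # command -> cmd_ack samples
event_latencies_s = [0.3, 0.5, 1.2, 0.4, 0.9, 1.7, 0.6, 2.4]  # end-to-end event samples

print("command p95 within 2 s:", p95(cmd_roundtrips_s) <= 2.0)
print("event p99 within 2 s:", p99(event_latencies_s) < 2.0)
```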
7) Migration path (90-day, zero-server destination)
Days 1–15 — Discover & separate. Inventory every manageable device. Draw VRFs, cut undocumented flows, stand up ZTNA/PAM, disable unnecessary radios.
Days 16–45 — Fabric thin slice. Deploy gateways on two subsystems (e.g., cooling + power). Stream to cloud ingress, stand up Prom/Grafana per tenant, validate typed events, and build the first alarms.
Days 46–70 — Replace local IT. Decommission on-site NMS/Grafana; move vendor GUIs behind ZTNA in cloud; enable API connectors for access/VMS. Prove store-and-forward and DR of time-series.
Days 71–90 — Drill & document. Run two failure drills (WAN cut; vendor cloud outage). Close gaps, set SLOs, finalize contracts (SOC 2/ISO 27001, subprocessors, data residency).
8) Anti-patterns to avoid
Shadow servers ("just a small VM under the desk for NMS"). If it's worth running, run it in the cloud control plane.
Inbound management VPNs into OT. Replace with outbound ZTNA and PAM-recorded sessions.
Opaque vendor clouds. No SOC 2/ISO 27001, no HMAC on webhooks, no SSO—no go.
One-off schemas. Typed messages or bust; free-text JSON is how dashboards and alarms rot.
No OOB. You will eventually need serial/console out-of-band access; plan and test it before you need it.
9) What the provider must promise (and prove)
Audit reports (SOC 2/ISO 27001) available under NDA; scope maps to the services you consume.
Tenant isolation tests and chaos drills; documented results.
Security transparency: SBOMs for the platform, routine pen tests, vulnerability disclosures with SLAs, subprocessor notices.
Exit plan: You can export your descriptors, telemetry, events, and dashboards and stand up elsewhere within a defined window.
10) The payoff
With zero on-prem IT servers, the facility becomes predictable: fewer moving parts, smaller blast radius, faster recovery. The cloud control plane gives you uniform dashboards, uniform alarms, uniform runbooks across sites. Security is visible and auditable: identity is centralized; device traffic is typed, signed, and least-privileged; providers are certified under SOC 2/ISO 27001; and failure modes are drilled, not guessed at. Most importantly, the people who keep power and cooling running can focus on plant, not babysit a zoo of little servers.
If you want a starting point, adopt the GridFabric message conventions on site (typed MQTT, store-and-forward, strict topics) and land them in a cloud platform that treats security, isolation, and observability as first-class features. That's how you run a modern facility in a world where "just don't connect it to the Internet" is neither realistic nor safe—and still sleep at night.