Thesis
The highest-integrity facilities no longer run an on-prem "IT stack." They run plant (BMS, controllers, access control) on site, and move everything else—monitoring, analytics, orchestration, ticketing, NMS, visualization—into a hardened cloud control plane. That makes your data center simpler to operate, easier to secure, and measurably more reliable. This paper lays out a full, defensible blueprint: what remains on site, how it connects, how the cloud side is built, and which controls are mandatory for a provider entrusted with this level of access.
1) First principles
Minimal attack surface on site. Keep no general-purpose servers in the facility. Only deterministic systems live there: the BMS, microcontrollers/PLCs for power/cooling/fire, physical access control hardware, cameras, and network/telecom equipment.
Engineered connectivity, not air-gaps. OT and management must talk to the cloud for telemetry, analytics, firmware, and remote ops—but only through explicit, authenticated, least-privilege paths.
Identity everywhere. Humans, services, and devices authenticate strongly (MFA + PAM for humans; mTLS + short-lived tokens for services; X.509 certificates for devices).
Observability as a safety system. Time-series, logs, and events are a first-class workload with SLOs and DR, not an afterthought.
Separation of concerns. On-site systems measure and actuate; the cloud decides, visualizes, and governs.
Compliance is table stakes. The provider's platform operates under SOC 2 Type II and/or ISO/IEC 27001, with transparent subprocessors and auditable controls.
2) On-site: what stays, how it's wired, how it's secured
2.1 The asset classes that remain on site
Plant controls & sensing: BMS head-end (if required by vendor), PLCs for chillers/CDUs, PDUs/UPSes/switchgear IEDs, fire alarm control, leak detection, environmental sensors.
Physical security: Access control panels/readers/strikes, mantraps, tailgate sensors, cameras/NVRs (NVR optional if cloud VMS is used), intercoms.
Network & telecom: Core/aggregation/access switches, routers/firewalls, OOB console servers, LTE failover, structured cabling, GPS/PTP (where needed).
Operator endpoints: Security/operations workstations, NOC displays, staff laptops/tablets.
Zero "general IT" servers: No AD, no app servers, no NMS servers, no visualization servers on site.
2.2 Network segmentation (three planes + OOB)
OT plane (VRF-OT): Fieldbus/PLC/BMS/IED networks. No internet egress. Strict allow-lists to Fabric plane brokers/gateways only.
Fabric plane (VRF-FABRIC): Site message brokers (or lightweight gateways), protocol translators (Modbus/BACnet/OPC UA → MQTT), webhook terminators (if any), log forwarders. This is the sole path northbound.
Enterprise plane (VRF-ENT): Staff endpoints (wired/Wi-Fi), guest internet, office printers, etc. No path to OT; tightly restricted path to Fabric services.
Out-of-band (OOB): Isolated console network (RS-232/IPMI/USB-serial) with LTE or separate fiber uplink; used for break-glass and zero-touch.
Controls: Default-deny ACLs between VRFs; 802.1X on access ports; management addresses in a dedicated VRF; MACsec on critical uplinks; DHCP snooping + IP source guard in ENT.
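To make the default-deny posture concrete, here is a minimal policy-as-data sketch. The VRF names follow the planes above, but the specific allowed flows, ports, and the Python rendering are illustrative assumptions; real enforcement lives in switch and firewall ACLs.

```python
# Illustrative only: real enforcement lives in switch/firewall ACLs.
# The VRF names match the planes above; the allowed flows are assumptions for the sketch.
from typing import NamedTuple

class Flow(NamedTuple):
    src_vrf: str
    dst_vrf: str
    dst_port: int
    proto: str

# Explicit allow-list; anything not listed is denied by default.
ALLOWED_FLOWS = {
    Flow("VRF-OT",  "VRF-FABRIC", 8883, "tcp"),  # gateway -> fabric translator (MQTT over TLS)
    Flow("VRF-ENT", "VRF-FABRIC", 443,  "tcp"),  # operator endpoints -> fabric services (HTTPS)
}

def is_permitted(flow: Flow) -> bool:
    """Default-deny: only explicitly allow-listed inter-VRF flows pass."""
    return flow in ALLOWED_FLOWS

assert is_permitted(Flow("VRF-OT", "VRF-FABRIC", 8883, "tcp"))
assert not is_permitted(Flow("VRF-ENT", "VRF-OT", 502, "tcp"))  # ENT may never reach OT
```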
2.3 Message fabric (GridFabric pattern, serverless on site)
Gateways (embedded or appliance form-factor) sit at the edge of the OT plane, timestamp readings, and publish typed MQTT v5 messages to a cloud ingress (no heavy site broker required).
Store-and-forward on the gateways buffers 48–96 h of telemetry/event traffic and flushes in order when the WAN returns; message expiry avoids replaying stale, high-rate telemetry (see the buffering sketch after this list).
API-only controllers (e.g., access control/VMS that live in a vendor cloud) are integrated via cloud-side connectors; no adapters are hosted on site.
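A minimal sketch of the store-and-forward behavior described above, assuming an in-memory queue and a 72-hour expiry; a production gateway would persist the buffer to flash or disk and publish over MQTT v5 with message expiry set.

```python
# Minimal store-and-forward sketch: buffer while the WAN is down, flush in order when it returns,
# and drop anything past its expiry so stale, high-rate telemetry is never replayed.
import collections, time

class StoreAndForward:
    def __init__(self, max_age_s: float = 72 * 3600):
        self.max_age_s = max_age_s
        self.queue = collections.deque()  # (timestamp, topic, payload), oldest first

    def enqueue(self, topic: str, payload: bytes) -> None:
        self.queue.append((time.time(), topic, payload))

    def flush(self, publish) -> int:
        """Replay buffered messages in order, skipping anything past its expiry."""
        sent = 0
        now = time.time()
        while self.queue:
            ts, topic, payload = self.queue.popleft()
            if now - ts > self.max_age_s:
                continue                   # expired: do not replay stale telemetry
            publish(topic, payload)        # e.g., an MQTT v5 publish with message expiry set
            sent += 1
        return sent

# Usage: buffer during a WAN outage, then flush once connectivity returns.
buf = StoreAndForward()
buf.enqueue("site1/cooling/cdu-03/supply_temp_c", b'{"v": 17.2}')
buf.flush(lambda topic, payload: print("publish", topic, payload))
```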
2.4 External access & ZTNA
No inbound holes to site networks. All remote operations ride over outbound, mTLS-pinned tunnels from the Fabric plane to the provider's ZTNA broker.
Operators connect to a cloud bastion (SSO+MFA+PAM), then pivot down via just-in-time, audited sessions (SSH/RDP/serial proxy) that are recorded.
Break-glass OOB exists, but is powered down and requires dual approval to enable.
2.5 Wireless realities (BLE/Wi-Fi on gear)
Commission radios, then disable them or lock them to pairing-only mode where possible.
Where a radio must remain: isolate to dedicated SSIDs/VLANs with EAP-TLS (802.1X), low power, and strict geofencing; treat BLE/Wi-Fi keys like credentials (rotate, audit).
2.6 Physical ↔ logical interlocks
Mantrap tailgate → inner reader inhibit via a hardwired emergency input plus cloud policy; dock door open → reader disable deeper in the facility; camera bookmarks on controller alarms. (All are surfaced as typed events to the cloud.)
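As a sketch of what "typed events" can look like on the wire, the snippet below serializes a hypothetical tailgate event; the schema name and fields are assumptions, not a fixed standard.

```python
# A typed interlock event with a simple envelope: schema name, source, timestamp, payload.
import json, dataclasses, datetime

@dataclasses.dataclass
class TypedEvent:
    schema: str    # e.g., "physical_security.tailgate.v1" (illustrative schema name)
    site: str
    source: str    # originating device/controller
    ts: str        # RFC 3339 UTC timestamp
    payload: dict

    def to_json(self) -> str:
        return json.dumps(dataclasses.asdict(self), separators=(",", ":"))

event = TypedEvent(
    schema="physical_security.tailgate.v1",
    site="site1",
    source="mantrap-02",
    ts=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    payload={"inner_reader_inhibited": True, "camera_bookmark": "cam-14"},
)
print(event.to_json())  # published northbound to the tenant bus
```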
2.7 On-site resilience targets
Power loss: Door hardware fails safe per code; gateways ride a DC bus or UPS; OOB remains up.
WAN loss: Site runs autonomously on local policies; gateways buffer; commands fail fast with a clear "remote unavailable" error.
Clock discipline: NTP/NTS on the Fabric plane; PTP locally if needed; any drift >1 s raises a warning.
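A minimal drift-check sketch under two assumptions: an SNTP source reachable on the Fabric plane (the hostname below is a placeholder) and a 1 s warning threshold as above.

```python
# Query an SNTP server and warn if the local clock has drifted more than 1 s.
import socket, struct, time

NTP_UNIX_OFFSET = 2_208_988_800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def sntp_time(server: str, timeout: float = 2.0) -> float:
    packet = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client request)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
    secs = struct.unpack("!I", data[40:44])[0]  # Transmit Timestamp, integer seconds
    return secs - NTP_UNIX_OFFSET

drift = abs(time.time() - sntp_time("ntp.fabric.local"))  # placeholder hostname
if drift > 1.0:
    print(f"WARNING: clock drift {drift:.2f}s exceeds 1s threshold")
```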
3) Cloud control plane: everything you removed from the building
The cloud side is a product, not a pile of VMs. Below is a provider-neutral reference architecture (it maps onto AWS, Azure, or GCP with equivalent services).
3.1 Multi-tenant isolation model
Org tenancy: Each customer gets an isolated account/subscription/project (hard boundary), or a VPC-per-tenant with strict controls if accounts are impractical.
Network: Hub-and-spoke, with a Transit Hub VPC/VNet for shared egress and security tooling and per-tenant Spoke VPCs for data and apps. No tenant-to-tenant routing.
Keys & secrets: Tenant-scoped KMS keys; separate secrets vault namespaces; no cross-tenant KMS reuse.
3.2 Ingestion & messaging
MQTT Ingress (managed or containerized brokers): terminates device mTLS; validates topics and quotas (see the sketch after this list); enforces Clean Start=false and Last Will semantics.
API/Webhook ingress: Internet-facing gateway validates HMAC signatures, enforces mTLS for vendors that support it, and maps vendor events to typed messages on the bus.
Streaming spine: Messages flow into a durable log (e.g., Kafka-class or managed pub/sub) for fan-out to time-series, SIEM, and rule engines.
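A sketch of the topic and quota validation step at the MQTT ingress, assuming a tenant/site/class/device/metric topic convention and a per-device publish quota; both are illustrative choices, not a prescribed format.

```python
# Reject malformed topics and devices exceeding their publish quota at the ingress.
import re, time, collections

TOPIC_RE = re.compile(r"^[a-z0-9-]+/[a-z0-9-]+/[a-z0-9_]+/[a-z0-9-]+/[a-z0-9_]+$")
MAX_MSGS_PER_MIN = 600  # illustrative quota

_window = collections.defaultdict(list)  # device_id -> recent publish timestamps

def accept(topic: str, device_id: str) -> bool:
    """True only for well-formed topics from devices inside their quota."""
    if not TOPIC_RE.match(topic):
        return False
    now = time.time()
    recent = [t for t in _window[device_id] if now - t < 60]
    if len(recent) >= MAX_MSGS_PER_MIN:
        return False
    recent.append(now)
    _window[device_id] = recent
    return True

assert accept("tenant-a/site1/cooling/cdu-03/supply_temp_c", "cdu-03")
assert not accept("bad topic with spaces", "cdu-03")
```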
3.3 Observability & analytics
Time-series: Prometheus-compatible remote-write at the tenant edge feeding a long-retention TSDB per tenant; downsampled roll-ups for 12+ months.
Dashboards & alerts: Grafana per tenant; alert rules checked into Git; on-call escalation with runbook links.
Object storage: Raw JSON/event archives for forensics; lifecycle policies tune cost.
SIEM/SOAR: Central log lake with per-tenant partitions; detections for anomalous device behavior, policy violations, and ZTNA events; automated response (quarantine, token revocation).
3.4 Network & access
ZTNA / Bastion: Zero-trust gateway with SSO (OIDC/SAML) + MFA; PAM brokering all privileged sessions; SCIM for lifecycle.
RBAC/ABAC: Scopes for Environment (prod/stage), Tenant, Site, and Role; break-glass roles are time-boxed and require dual approval (see the sketch below).
Outbound policy: Egress only to allow-listed vendor clouds for API connectors and firmware repositories; outbound TLS inspection where legally permissible.
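A sketch of how the scope model above might be evaluated, including a time-boxed break-glass grant; the exact scope names and the dual-approval workflow itself are assumptions left out of the snippet.

```python
# Scope-based authorization with a time-boxed break-glass grant.
import dataclasses, datetime

@dataclasses.dataclass
class Grant:
    environment: str   # "prod" | "stage"
    tenant: str
    site: str
    role: str          # e.g., "operator", "breakglass-admin"
    expires_at: datetime.datetime | None = None  # set only for break-glass grants

def authorized(grant: Grant, environment: str, tenant: str, site: str, action_role: str) -> bool:
    """All scopes must match, and break-glass grants must still be inside their window."""
    if grant.expires_at and datetime.datetime.now(datetime.timezone.utc) > grant.expires_at:
        return False  # break-glass window has closed
    return ((grant.environment, grant.tenant, grant.site) == (environment, tenant, site)
            and grant.role == action_role)

bg = Grant("prod", "tenant-a", "site1", "breakglass-admin",
           expires_at=datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(hours=2))
print(authorized(bg, "prod", "tenant-a", "site1", "breakglass-admin"))  # True within the window
```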
3.5 Service plane for tooling
NMS / Topology / IPAM: Runs as a cloud service per tenant (no SNMP from cloud to site; telemetry comes from Fabric).
Vendor-specific apps (DCIM extensions, controller GUIs): Deployed as immutable images behind ZTNA, not exposed to the internet; SSO enforced; session recording for admin access.
Rule engines & digital twins: Stateless services that subscribe to telemetry and publish advisory commands (e.g., pump bump +5% on ΔT collapse).
3.6 Software delivery & data protection
Everything-as-Code: IaC for VPCs, brokers, exporters, dashboards; config changes via PRs, not consoles.
CI/CD with attestations: Signed artifacts, SBOMs, image scanning; only signed images may run.
Backups & DR: Cross-region replication for TSDB + object storage; tested RTO/RPO per tenant; quarterly failover drills.
Data residency: Pin storage/compute to the customer's region; vendor subprocessors documented; residency honored contractually.
4) Mandatory security controls for the provider
Frameworks: Maintain SOC 2 Type II and/or ISO/IEC 27001 with the platform in scope (ingress, ZTNA, TSDB, SIEM, brokers, storage). Provide reports under NDA.
Identity: SSO, MFA, SCIM, PAM; quarterly access reviews; separation of duties (SoD) between engineering and operations.
Crypto: TLS 1.2+ everywhere; device mTLS with private PKI; envelope encryption with per-tenant KMS keys; HSM-backed root for the CA.
Endpoint & HR security: Background-checked staff; hardened workstations (EDR, disk encryption, screen lock, up-to-date OS); least-privilege support accounts; secure SDLC training.
Change & incident: Ticketed change control; peer review; observability on changes; 24×7 incident response with customer comms SLAs; post-incident RCAs with action items.
Supply chain: Approved subprocessors list; DPIAs where required; vendor pen tests; quick revocation path for compromised components.
Customer isolation: Technical (accounts/VPCs/keys) and organizational (on-call separation); chaos tests proving one tenant cannot impact another.
5) End-to-end flows (how it actually behaves)
5.1 Cooling telemetry & advisory control
A CDU gateway reads supply/return temperatures and flow every second, signs an MQTT message with its device certificate, and sends it to the cloud ingress.
The time-series store ingests it; Grafana trends ΔT. The rule engine notices ΔT < 3 °C sustained for 20 seconds and publishes a command "bump setpoint −1 °C" with a request ID and user=ruleengine.
The gateway validates the command against local policy, applies envelope constraints (min/max), actuates, and emits cmd_ack. If the WAN is down, it does nothing and logs "cloud absent."
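A sketch of the gateway-side handling just described, assuming a JSON command envelope and illustrative setpoint limits; the local clamp is what keeps cloud-issued commands inside a safe envelope.

```python
# Gateway command handling: validate locally, clamp to the envelope, actuate, acknowledge.
import json, time

SETPOINT_MIN_C, SETPOINT_MAX_C = 12.0, 24.0  # local envelope; the cloud cannot exceed it

def handle_command(raw: bytes, current_setpoint_c: float, actuate, publish, wan_up: bool):
    cmd = json.loads(raw)
    if not wan_up:
        print("cloud absent: command ignored")          # fail fast, no blind actuation
        return
    requested = current_setpoint_c + cmd["delta_c"]      # e.g., delta_c = -1.0
    applied = min(max(requested, SETPOINT_MIN_C), SETPOINT_MAX_C)  # clamp to the envelope
    actuate(applied)
    publish("tenant-a/site1/cooling/cdu-03/cmd_ack", json.dumps({
        "request_id": cmd["request_id"],
        "requested_c": requested,
        "applied_c": applied,
        "ts": time.time(),
    }))
```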
5.2 Access control (API-only)
An operator clicks "Unlock 8 s" in a Grafana panel (the button posts to a command API behind ZTNA).
The per-tenant cloud API connector uses OAuth2 to call the vendor's /unlock endpoint. The vendor webhooks back access_granted, then state.locked=false.
The webhook gateway validates the HMAC and timestamp, maps the payload to a typed event/state, and publishes it to the tenant bus; dashboards and audit trails update in near real time.
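A sketch of that validation and mapping step, assuming an HMAC-SHA256 signature over "timestamp.body" and a five-minute replay window; header names and the signing scheme vary by vendor.

```python
# Webhook ingress: check the timestamp window, verify the HMAC, map to a typed event.
import hmac, hashlib, json, time

MAX_SKEW_S = 300  # reject deliveries older than five minutes

def verify_and_map(body: bytes, signature_hex: str, timestamp: str, secret: bytes) -> dict | None:
    if abs(time.time() - float(timestamp)) > MAX_SKEW_S:
        return None                                   # stale or replayed delivery
    expected = hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        return None                                   # signature mismatch
    vendor_event = json.loads(body)
    return {                                          # typed event for the tenant bus
        "schema": "access_control.door_state.v1",     # illustrative schema name
        "source": vendor_event.get("device_id"),
        "event": vendor_event.get("type"),            # e.g., "access_granted"
        "ts": timestamp,
    }
```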
5.3 Firmware update
The provider publishes a signed firmware image to the tenant's staging repo.
Device-class policy allows a maintenance window of 23:00–03:00 local time.
Gateways fetch via allow-listed egress, verify signature, and stage atomically.
On reboot, the device reports the new version in its birth payload; health probes watch for regressions; automatic rollback if probes fail.
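A sketch of digest verification and atomic staging on the gateway; the manifest format is an assumption, and a real pipeline would also verify a detached signature against a pinned vendor public key before trusting the manifest.

```python
# Verify the downloaded image against the manifest digest, then stage it atomically.
import hashlib, json, os, tempfile

def stage_firmware(image: bytes, manifest_json: str, staging_path: str) -> bool:
    manifest = json.loads(manifest_json)   # assumed already signature-checked upstream
    digest = hashlib.sha256(image).hexdigest()
    if digest != manifest["sha256"]:
        return False                       # corrupt or tampered download: refuse to stage
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(staging_path))
    with os.fdopen(fd, "wb") as f:
        f.write(image)
        f.flush()
        os.fsync(f.fileno())               # make sure the bytes hit storage before the swap
    os.replace(tmp, staging_path)          # atomic: the old image stays valid until this point
    return True
```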
6) Operations SLOs and runbooks
Ingestion availability: 99.95% per tenant region; p99 end-to-end event latency < 2 s (see the percentile sketch below).
Command round-trip: p95 ≤ 2 s to cmd_ack for API-backed devices; ≤ 1 s for direct gateway devices.
Time sync SLO: 99.9% of site devices within ±0.5 s of UTC.
Buffer durability: ≥ 72 h at typical rates; visible backlog gauges.
Runbooks: WAN loss (site autonomy), vendor cloud outage (graceful degradation), PKI compromise (rotate trust chain), OOB activation (dual-auth, time-boxed, recorded).
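For the latency SLOs above, a small sketch of how p95/p99 checks could be computed from measured samples; the sample values are illustrative.

```python
# Check measured latency samples against the SLO targets in Section 6.
import statistics

def p95(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=20)[18]    # 19 cut points; index 18 is the 95th percentile

def p99(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=100)[98]   # 99 cut points; index 98 is the 99th percentile

cmd_roundtrips_s = [0.4, 0.7, 1.1, 0.9, 1.8, 0.6, 2.3, 0.8]   # command -> cmd_ack samples
event_latencies_s = [0.3, 0.5, 1.2, 0.4, 0.9, 1.7, 0.6, 2.4]  # end-to-end event samples

print("command p95 within 2 s:", p95(cmd_roundtrips_s) <= 2.0)
print("event p99 within 2 s:", p99(event_latencies_s) < 2.0)
```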
7) Migration path (90-day, zero-server destination)
Days 1–15 — Discover & separate. Inventory every manageable device. Draw VRFs, cut undocumented flows, stand up ZTNA/PAM, disable unnecessary radios.
Days 16–45 — Fabric thin slice. Deploy gateways on two subsystems (e.g., cooling + power). Stream to cloud ingress, stand up Prom/Grafana per tenant, validate typed events, and build the first alarms.
Days 46–70 — Replace local IT. Decommission on-site NMS/Grafana; move vendor GUIs behind ZTNA in cloud; enable API connectors for access/VMS. Prove store-and-forward and DR of time-series.
Days 71–90 — Drill & document. Run two failure drills (WAN cut; vendor cloud outage). Close gaps, set SLOs, finalize contracts (SOC 2/ISO 27001, subprocessors, data residency).
8) Anti-patterns to avoid
Shadow servers ("just a small VM under the desk for NMS"). If it's worth running, run it in the cloud control plane.
Inbound management VPNs into OT. Replace with outbound ZTNA and PAM-recorded sessions.
Opaque vendor clouds. No SOC 2/ISO 27001, no HMAC on webhooks, no SSO—no go.
One-off schemas. Typed messages or bust; free-text JSON is how dashboards and alarms rot.
No OOB. You will eventually need serial/console out-of-band access; plan and test it before you need it.
9) What the provider must promise (and prove)
Audit reports (SOC 2/ISO 27001) available under NDA; scope maps to the services you consume.
Tenant isolation tests and chaos drills; documented results.
Security transparency: SBOMs for the platform, routine pen tests, vulnerability disclosures with SLAs, subprocessor notices.
Exit plan: You can export your descriptors, telemetry, events, and dashboards and stand up elsewhere within a defined window.
10) The payoff
With zero on-prem IT servers, the facility becomes predictable: fewer moving parts, smaller blast radius, faster recovery. The cloud control plane gives you uniform dashboards, uniform alarms, uniform runbooks across sites. Security is visible and auditable: identity is centralized; device traffic is typed, signed, and least-privileged; providers are certified under SOC 2/ISO 27001; and failure modes are drilled, not guessed at. Most importantly, the people who keep power and cooling running can focus on plant, not babysit a zoo of little servers.
If you want a starting point, adopt the GridFabric message conventions on site (typed MQTT, store-and-forward, strict topics) and land them in a cloud platform that treats security, isolation, and observability as first-class features. That's how you run a modern facility in a world where "just don't connect it to the Internet" is neither realistic nor safe—and still sleep at night.