
Securing What Can't Be Air-Gapped: A Modern Playbook for Critical Infrastructure

"Just don't connect it to the Internet" is no longer a security strategy—it's a nostalgia pose. Real security comes from architecture and governance, not from hope.

Published August 2025 · 15 min read

Thesis

"Just don't connect it to the Internet" is no longer a security strategy—it's a nostalgia pose. Modern facilities depend on remote monitoring, cloud analytics, remote firmware distribution, and third-party operations. Meanwhile, the gear itself now ships with management radios (BLE, Wi-Fi, sometimes LTE) and soft controllers that beg to be administered. Real security comes from architecture and governance, not from hope. This article lays out a pragmatic, operator-first model for protecting critical infrastructure—data halls, substations, micro-grids, campus plants—when remote connectivity is required and attack surfaces are everywhere.

1) Why "air-gap" is a myth (and what replaces it)

Air-gaps fail in practice because business needs sneak around them: a vendor arrives with a laptop and a 4G dongle, a UPS exposes Bluetooth for commissioning, a BMS uploads logs to a cloud portal, or an engineer plugs a jump drive into a PLC. And even if you perfectly air-gapped the process network, physical and wireless entry points remain: roll-up doors, badge readers, tailgate risks, BLE maintenance radios.

The modern replacement for air-gap is intentional connectivity: explicit paths, strong identity, least privilege, constant verification, and controlled failure modes. In short: Zero Trust for facilities.

2) Treat every manageable thing as part of the security estate

If a device can be configured, it must be governed. That includes the obvious (switchgear relays, UPS, PDUs, chillers, fire panels) and the "IT-adjacent" (switches, access points, cameras, badge readers, intercoms, out-of-band modems), plus the forgotten bits (environmental sensors, smart breakers, rooftop antennas). Put them all in one authoritative inventory with:

  • Unique ID (serial + logical name), physical location, owner, and criticality.
  • Management interfaces and media (Ethernet mgmt port, serial console, BLE, Wi-Fi, vendor cloud).
  • Auth method (local accounts, X.509, API tokens), firmware version, SBOM where possible.
  • Baseline configuration reference and last compliance date.

No inventory, no control.
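
If the inventory lives in code (or exports to it), compliance checks and drift reports become queries rather than spreadsheets. A minimal sketch of one inventory record, assuming a simple Python dataclass; every field name here is illustrative, not a prescribed schema:

from dataclasses import dataclass, field
from datetime import date

@dataclass
class ManagedDevice:
    """One row in the authoritative inventory (illustrative fields only)."""
    serial: str                      # unique hardware identifier
    logical_name: str                # e.g. "ups-hall2-a"
    location: str                    # room / rack / panel
    owner: str                       # accountable team or person
    criticality: str                 # e.g. "safety", "high", "standard"
    mgmt_interfaces: list[str] = field(default_factory=list)  # "eth-mgmt", "serial", "ble", "wifi", "vendor-cloud"
    auth_method: str = "local"       # "local", "x509", "api-token"
    firmware_version: str = "unknown"
    sbom_ref: str | None = None      # link to an SBOM artifact where available
    baseline_ref: str | None = None  # pointer to the golden config for this class
    last_compliance_check: date | None = None

# Example entry: a PDU whose BLE commissioning radio is still enabled.
pdu = ManagedDevice(
    serial="PDU-77412", logical_name="pdu-rowC-03", location="Hall 2 / Row C",
    owner="facilities-ops", criticality="high",
    mgmt_interfaces=["eth-mgmt", "ble"], auth_method="x509",
    firmware_version="4.2.1", last_compliance_check=date(2025, 7, 30),
)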

3) Network architecture: segment like you mean it

Design three distinct planes, each with its own security policy:

OT Plane (Process)

Controllers, sensors, drives, safety systems. No direct internet. Only allow-listed east–west and north–south paths to the fabric plane.

Fabric Plane (Management)

Brokers, gateways, jump hosts, API connectors, logging and monitoring collectors. The only plane allowed egress to the internet, and only to allow-listed FQDNs.

Enterprise/Internet Plane

User endpoints, SaaS access, general IT.

Enforce separation with VRFs/VLANs and default-deny ACLs between planes. Where feasible, deploy data diodes / one-way gateways for high-consequence zones (e.g., process → analytics). Use 802.1X on switching to bind ports to identity and posture; unmanaged serial links terminate on hardened terminal servers in the fabric plane.
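
The plane separation is easier to reason about, and to audit, when the allowed flows are written down as data. A minimal, illustrative sketch of a default-deny policy matrix between planes; the plane and service names are placeholders, and real enforcement lives in your switches and firewalls, not in Python:

# Default-deny policy between planes, expressed as an explicit allow-list.
# Anything not listed here is denied; names and flows are illustrative.
ALLOWED_FLOWS = {
    ("ot", "fabric"): {"syslog", "telemetry", "firmware-pull"},
    ("fabric", "ot"): {"config-push", "jump-host-ssh"},
    ("fabric", "internet"): {"vendor-api", "log-export"},  # allow-listed FQDNs only
    # note: no ("ot", "internet") entry at all
}

def is_allowed(src_plane: str, dst_plane: str, service: str) -> bool:
    """Return True only if the flow is explicitly allow-listed."""
    return service in ALLOWED_FLOWS.get((src_plane, dst_plane), set())

assert is_allowed("fabric", "internet", "vendor-api")
assert not is_allowed("ot", "internet", "dns")  # OT never talks to the internet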

4) Identity and access: strong by default, temporary by design

For humans:

SSO with MFA, device posture checks, and privileged access management (PAM) for anything that can change the state of the facility. Elevation is just-in-time (minutes to hours), recorded, and revoked automatically. Adopt hardware-backed MFA wherever practical.

For devices and services:

Mutual TLS with device identities (X.509 or equivalent), short-lived tokens for APIs, and narrow scopes. No shared local accounts; if a vendor demands one, escrow it in PAM and rotate it after use.
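
For device-to-service connections, "mutual TLS with device identities" reduces to loading a private CA and a per-device certificate. A minimal sketch using Python's standard ssl module, with placeholder certificate paths; the same context can back an MQTT client, an HTTPS session, or a raw socket:

import ssl

# Trust only the facility's private CA, not the public trust store.
ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile="/etc/pki/facility-ca.pem")
ctx.minimum_version = ssl.TLSVersion.TLSv1_2

# Present this device's own certificate and key (the "device identity").
ctx.load_cert_chain(certfile="/etc/pki/device-cert.pem", keyfile="/etc/pki/device-key.pem")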

Lifecycle:

SCIM or an equivalent process ensures joiners/movers/leavers de-provision cleanly across all platforms—including vendor clouds and field devices.

5) Wired, wireless, and "accidental RF"

You'll find wireless management where you least expect it: BLE on UPS and PDUs, Wi-Fi on cameras and access points, sometimes LoRa/900 MHz on meters. The rules:

  • Disable radios post-commission where possible, or restrict them to "pairing only" modes that require physical presence.
  • If a radio must remain active, fence it: dedicated SSIDs, 802.1X/EAP-TLS, hidden behind an RF-isolated cabinet if practical.
  • Treat BLE keys and Wi-Fi PSKs as secrets with rotation and audit.
  • Scan the RF environment quarterly; treat any undocumented beacon you discover as a security event (see the sketch below).
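
Closing the loop on the quarterly scan is a set comparison, assuming the scan results and the inventory both expose radio identifiers (here, hypothetical BLE MAC addresses):

def undocumented_radios(scanned: set[str], inventoried: set[str]) -> set[str]:
    """Radios seen on the air that no inventory record claims."""
    return scanned - inventoried

# Illustrative data: BLE MACs from a walk-through scan vs. the inventory.
seen_on_air = {"C4:7D:CC:01:AA:10", "C4:7D:CC:01:AA:11", "DE:AD:BE:EF:00:42"}
declared    = {"C4:7D:CC:01:AA:10", "C4:7D:CC:01:AA:11"}

strays = undocumented_radios(seen_on_air, declared)
if strays:
    # Raise a security event, not a maintenance ticket.
    print(f"SECURITY EVENT: undocumented radios detected: {sorted(strays)}")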

6) Cloud is not the enemy—opaque cloud is

Remote analytics, central orchestration, and 24×7 managed operations are only realistic with cloud. The question is which cloud services and how they are run. Require, at minimum:

Security Posture

SOC 2 Type II and/or ISO/IEC 27001 certification in scope for the exact services you will consume (not a company-wide marketing badge). Ask for the report, not the logo.

Supplier Transparency

Up-to-date subprocessor list, data flow diagrams, data residency options.

Operational Controls

Formal incident response, regular third-party penetration testing, vulnerability disclosure/bug bounty, encryption at rest and in transit, customer-managed key options where needed.

Access Discipline

Support for SSO, SCIM, RBAC/ABAC, API tokens with scopes and expirations, per-tenant isolation controls.

7) Configuration baselines and patch hygiene (the unsexy win)

Define golden configs for each device class (NTP/NTS, syslog/SNMP targets, allowed ciphers, login banners, failed-login lockouts, TLS only, BLE/Wi-Fi off unless required).

Enforce with configuration management as code; detect drift in hours, not quarters.
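
Drift detection itself can be simple once the golden config is data. A minimal sketch, assuming per-device-class baselines expressed as key/value pairs; the keys below are illustrative:

GOLDEN = {
    "ntp_server": "ntp.fabric.local",
    "syslog_target": "logs.fabric.local:6514",
    "tls_only": True,
    "ble_enabled": False,
    "failed_login_lockout": 5,
}

def drift(running: dict, golden: dict = GOLDEN) -> dict:
    """Keys where the running config differs from the golden baseline."""
    return {k: (running.get(k), v) for k, v in golden.items() if running.get(k) != v}

# Example: a UPS that came back from vendor service with BLE re-enabled.
running_config = {**GOLDEN, "ble_enabled": True}
for key, (actual, expected) in drift(running_config).items():
    print(f"DRIFT {key}: running={actual!r} expected={expected!r}")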

Patch on a rhythm: small and frequent beats big and rare. Maintain a pre-prod bench (spares or virtual twins) to burn in firmware before you touch production.

Keep a known-good image and an offline recovery path; never depend on the vendor cloud to recover a bricked controller.

8) Telemetry, detection, and evidence

You cannot defend what you cannot see. Stream device telemetry, events, and commands into a single, typed bus (MQTT, Syslog, OT-aware IDS) and project the essentials into your SIEM and time-series systems. Normalize around a fixed schema: timestamps, device identity, severity, correlation IDs. Build detections for the real attack paths:

  • New management radio appears; BLE pairing outside maintenance window.
  • Controller config drift; unsigned firmware attempted; abnormal coils/points toggled.
  • East–west traffic from OT plane to Internet plane; unexpected DNS.
  • Unusual badge patterns at high-security doors paired with reader disable events.
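
To make those detections cheap to write, the normalization step matters more than the tooling. A sketch of a normalized event envelope and one of the detections above (OT-plane traffic headed for the Internet plane); the field names are illustrative, not a standard schema:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class FacilityEvent:
    """Normalized envelope every collector emits, regardless of source."""
    ts: datetime
    device_id: str
    plane: str             # "ot", "fabric", or "enterprise"
    severity: str          # "info", "warning", "critical"
    correlation_id: str
    kind: str              # e.g. "flow", "config-drift", "ble-pairing"
    detail: dict

def detect_ot_egress(event: FacilityEvent) -> bool:
    """OT-plane flow whose destination is outside the OT or fabric planes."""
    return (event.kind == "flow"
            and event.plane == "ot"
            and event.detail.get("dst_plane") not in ("ot", "fabric"))

evt = FacilityEvent(
    ts=datetime.now(timezone.utc), device_id="plc-chiller-02", plane="ot",
    severity="critical", correlation_id="c-8841", kind="flow",
    detail={"dst_plane": "internet", "dst_fqdn": "unknown-host.example"},
)
assert detect_ot_egress(evt)  # this should page someone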

Retain enough detail to reconstruct incidents—config versions, session recordings, camera bookmarks for physical alarms—but prune aggressively elsewhere to control cost.

9) Remote access that fails safely

Design remote access so that losing it degrades you to safe local control, not to panic:

  • A hardened bastion is the single external entry, with MFA and PAM, into the fabric plane. No direct inbound to OT.
  • Out-of-band path (e.g., LTE) exists for break-glass but is locked behind the same controls and is off by default.
  • If the vendor cloud is unavailable, local operators can maintain safe operations using cached credentials, local policies, and store-and-forward telemetry (sketched below).
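
Store-and-forward is the piece most often left to the vendor. A minimal sketch of the behavior to demand or build, assuming an in-memory buffer for brevity (production would persist to disk):

import json, queue

class StoreAndForward:
    """Buffer telemetry locally while the upstream link is down; drain in order when it returns."""

    def __init__(self, maxsize: int = 10_000):
        self._buf = queue.Queue(maxsize=maxsize)

    def record(self, sample: dict) -> None:
        line = json.dumps(sample)
        try:
            self._buf.put_nowait(line)
        except queue.Full:
            self._buf.get_nowait()   # drop oldest rather than block the control loop
            self._buf.put_nowait(line)

    def drain(self, publish) -> int:
        """Call publish(line) for each buffered sample once connectivity returns."""
        sent = 0
        while not self._buf.empty():
            publish(self._buf.get_nowait())
            sent += 1
        return sent

saf = StoreAndForward()
saf.record({"device": "ups-hall2-a", "metric": "battery_soc", "value": 98})
print(f"{saf.drain(publish=print)} sample(s) forwarded")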

Regularly tabletop and physically test the failure of your remote providers: what still works, what gracefully pauses, and what you can recover from in minutes vs hours.

10) Third-party risk: write it into the contract

Security by aspiration fails at the first outage. Bake your requirements into SOWs:

  • Provider must maintain SOC 2 Type II and/or ISO 27001 (and align to IEC 62443 where OT is in scope).
  • 24×7 incident response with defined RTO/RPO for their platform; you are notified within X minutes of material incidents.
  • Right to audit or obtain independent assessment summaries; pen test reports annually.
  • Key management & crypto standards; API scopes; idle session timeouts; log retention and export.
  • Background checks, secure SDLC, and offboarding controls for personnel.
  • Subprocessor change notification with opt-out rights for high-risk additions.
  • Data ownership and exit: you can export all logs/configs in a documented format.

11) Physical + logical interlock: the forgotten synergy

Tie your physical security and logical controls together. If a mantrap tailgate alarm fires, treat it as a logical lockdown on associated consoles. If a dock door is open, inhibit badge readers deeper into the facility and escalate access approvals in software. Cameras should bookmark feeds automatically on controller alarms. Security is best when physics and software agree.
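
The interlock logic does not need to be sophisticated; it needs to exist and be testable. An illustrative sketch of a physical-to-logical mapping, assuming hypothetical event and action names that would hand off to your PACS, PAM, and VMS APIs:

# Illustrative mapping from physical alarms to logical actions.
# Event names and actions are placeholders for whatever your systems expose.
INTERLOCKS = {
    "mantrap_tailgate": ["lock_consoles:hall2", "require_reapproval:hall2-admins"],
    "dock_door_open":   ["inhibit_badge_readers:inner-zone", "bookmark_cameras:dock"],
    "controller_alarm": ["bookmark_cameras:affected-zone"],
}

def on_physical_event(event: str) -> list[str]:
    """Return the logical actions to trigger for a physical alarm (empty if unmapped)."""
    actions = INTERLOCKS.get(event, [])
    for action in actions:
        print(f"dispatching {action} for {event}")  # hand off to PAM / PACS / VMS APIs
    return actions

on_physical_event("mantrap_tailgate")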

12) A 90-day modernization plan

Days 1–15
See and separate

Inventory everything that can be managed. Draw the three planes and cut ad-hoc cross-flows. Stand up a bastion and PAM. Disable stray radios.

Days 16–45
Baselines and identity

Apply golden configs. Move admin accounts behind SSO/MFA. Issue device identities and turn on mTLS for management APIs. Start streaming normalized telemetry.

Days 46–70
Cloud due diligence

Assess current SaaS/remote providers against SOC 2/ISO 27001; close gaps or plan exits. Enable SCIM, role clean-ups, and log export. Add a pre-prod bench for firmware.

Days 71–90
Drill and document

Tabletop the loss of remote access and of a vendor cloud. Run a physical-logical interlock test (mantrap/dock). Finalize runbooks and audit artifacts.

13) How this aligns with GridFabric (optional tie-in)

If you adopt a facility message bus with typed payloads and strict topics, you get a single control point for telemetry, events, and commands. Gate all northbound access through that plane, terminate identities there, and give your SOC one place to detect anomalies. The specifics—MQTT with mTLS, device descriptors, store-and-forward, Prometheus/Grafana—are less important than the principle: one fabric, many devices, one security posture.
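
As an illustration of "typed payloads and strict topics" (not GridFabric's actual format), a sketch of a topic grammar plus a minimal per-channel schema check applied before a message reaches subscribers:

import json, re

# Strict topic grammar: site/plane/device-class/device-id/channel
TOPIC_RE = re.compile(r"^[a-z0-9-]+/(ot|fabric)/[a-z0-9-]+/[a-z0-9-]+/(telemetry|event|cmd)$")

# Minimal per-channel schema: required fields and their types.
SCHEMAS = {
    "telemetry": {"ts": str, "metric": str, "value": (int, float)},
    "event":     {"ts": str, "severity": str, "kind": str},
}

def accept(topic: str, payload: bytes) -> bool:
    """Reject anything off-grammar or off-schema before it reaches subscribers."""
    if not TOPIC_RE.match(topic):
        return False
    schema = SCHEMAS.get(topic.rsplit("/", 1)[-1])
    if schema is None:
        return False  # commands would get their own, stricter check
    try:
        doc = json.loads(payload)
    except ValueError:
        return False
    return all(isinstance(doc.get(k), t) for k, t in schema.items())

assert accept("campus-a/ot/ups/ups-hall2-a/telemetry",
              b'{"ts": "2025-08-01T12:00:00Z", "metric": "battery_soc", "value": 98}')
assert not accept("campus-a/ot/ups/ups-hall2-a/debug", b"{}")  # off-grammar channel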

14) The boring, defensible conclusion

Critical infrastructure will be connected. The choice is between accidental connectivity (shadow radios, vendor laptops, undocumented clouds) and engineered connectivity (segmented planes, strong identity, audited providers, observable operations). Pick the latter.

Insist that every manageable device is in inventory and under policy. Demand SOC 2/ISO 27001 from anyone who runs your control paths. Make remote access a feature that fails safe. And rehearse it all until the difference between "normal day" and "Internet down" is an SRE note, not an incident bridge.

Related Resources

  • MQTT-first framework that unifies field buses and API-only systems with secure connectors and clear security boundaries.
  • Vendor-neutral reference architecture for facility monitoring networks with security, segmentation, and redundancy patterns.