GridFabric v1.0 — A Practical, Secure Fabric for Site-Wide Communications
A unifying layer that makes mixed fleets—edge data centers, micro-DCs, industrial pods—operate like one system you can automate, secure, and certify.
Published: August 2025
Most facilities already run three incompatible conversations at once. In the equipment rooms, legacy controls speak field buses—Modbus RTU/TCP, BACnet, sometimes OPC UA. In the "smart" subsystems, modern controllers and SaaS platforms speak only web APIs and webhooks. And in the NOC and cloud, operations teams need a single, durable stream of telemetry, events, and commands that they can reason about, alert on, and audit. GridFabric proposes a unifying layer that is deliberately boring on the wire—MQTT 5.0 with strict conventions—and deliberately opinionated at the edges: protocol gateways for field buses, API connectors for web-only systems, and device descriptors that make capabilities explicit. The aim is not to chase fashion. It is to make a mixed fleet—edge data centers, micro-DCs, industrial pods—operate like one system you can automate, secure, and certify.
Why MQTT, and why now
MQTT is small, stateful, and pervasive. With QoS, retained messages, will messages, and shared subscriptions, it maps well to facilities realities: lossy links, devices that nap and wake, and control loops that must be conservative. Kafka and AMQP have their place in analytics and streaming backbones; GridFabric is the layer closer to the steel. It is the place where a chilled-water setpoint change, an access control "unlock for 8 seconds," or a "door forced" alarm is carried with clear semantics, security, and auditability.
The design rule is simple: everything that can talk MQTT, talks MQTT; everything that cannot is made to look like it does. Field buses come in through gateways; API-only platforms—think access control systems, video management systems, power meters that only expose REST, even UniFi-class controllers—come in through API connectors that translate web semantics into GridFabric topics and back.
Architecture in practice
At each site, the fabric centers on two MQTT brokers presented behind a single FQDN or VIP. Clients never hard-code IP addresses; they connect to mqtt.site-id.example and let DNS health checks and broker bridging or clustering do the rest. Every device—whether a physical sensor via a gateway, or a logical "device" that is really an API endpoint—is represented by a Device Descriptor: a signed JSON document that says what it is, where it is, and what signals, commands, and events it supports. Descriptors publish on a well-known topic and are cached locally in a Directory. Gateways and connectors discover descriptors at boot and refuse to publish anything that lacks one, which prevents silent, undocumented streams from creeping into production.
Field buses enter through Gateways that speak Modbus RTU/TCP, BACnet/IP, OPC UA, SNMP, and serial driver stacks. A gateway does three things well: it timestamps at the edge, it buffers for at least 48–96 hours when the broker is absent, and it emits GridFabric payloads that carry type, units, and data quality flags rather than raw register values. On the other side of the house, API Connectors speak the languages SaaS and controller platforms expect. They subscribe to write commands on the broker, then call the vendor API with least-privilege OAuth2 tokens; they receive webhooks through a hardened gateway that verifies HMAC signatures and timestamps; they poll REST endpoints with ETag/If-Modified-Since when a vendor cannot push events, and they always reconcile on reconnect using cursors. In other words, they act like good citizens of the web while presenting the same contracts as their field-bus cousins inside the fabric.
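For the polling case, a minimal sketch of the conditional-GET loop might look like the following (Python with the requests library). The /events path, the after cursor parameter, and the response shape are hypothetical stand-ins for whatever a given vendor actually exposes.

import requests

# Sketch: conditional polling with a cursor against a hypothetical vendor API.
# Assumes GET /events?after=<cursor>, ETag / If-None-Match support, and a
# response body of the form {"events": [...], "cursor": "..."}.
def poll_events(base_url, token, state):
    headers = {"Authorization": f"Bearer {token}"}
    if state.get("etag"):
        headers["If-None-Match"] = state["etag"]          # conditional GET
    resp = requests.get(
        f"{base_url}/events",
        params={"after": state.get("cursor", "")},
        headers=headers,
        timeout=10,
    )
    if resp.status_code == 304:
        return []                                          # nothing new, no body transferred
    resp.raise_for_status()
    state["etag"] = resp.headers.get("ETag", state.get("etag"))
    body = resp.json()
    state["cursor"] = body.get("cursor", state.get("cursor"))
    return body.get("events", [])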
Northbound, sites bridge selected topics to a regional or cloud Ingress broker using mutual TLS, and observability systems subscribe once to the aggregate stream rather than to thousands of small sites. The time-series story is intentionally orthodox: a Prometheus-compatible exporter translates telemetry messages into metrics with predictable names and labels, and either scrapes locally or remote-writes to long-term storage. Grafana handles dashboards and alerting. Nothing prevents you from also streaming raw JSON to object storage for forensics; the point is that a single wire format feeds both operator glass and data science.
Naming, semantics, and message life cycle
GridFabric uses one namespace pattern for everything:
org/<org-id>/site/<site-id>/dev/<device-id>/<channel>
The channel encodes intent, not just direction. telemetry
is for sampled values—temperatures, flows, wattages—and it is almost always sent at QoS 1 with no retain flag (QoS 0 is acceptable for very high-rate streams you can drop on congestion). event
is for discrete things with meaning—door_forced, trip, filter_dp_high—and it also travels at QoS 1, carrying an idempotency key so consumers can confidently deduplicate. state
is for the small set of keys where the latest value is all that matters—locked, enabled, mode—so it is retained: a late-joining client immediately receives the last known state without waiting. command
is for writes. It always carries a request ID and a user identity (or service account) so acknowledgments can be correlated and audits can be run. Birth and death messages—MQTT's retained "I'm here" and broker-sent "I disappeared"—round out the lifecycle and make heartbeat dashboards trivial to author.
The payloads are small and explicit. Every sample carries a timestamp in UTC (ts), a signal name (sig), a value (val), and an optional quality flag (q) with a short error text when applicable. Units are not guessed at; they are encoded per signal in the descriptor using UCUM symbols (°C, kPa, l/min, W). Events carry an evt string, a severity, contextual key-value data, and the iid used for idempotency. Commands carry a cmd, argument object, request ID, and the identity on whose behalf the action is taken. Command outcomes are reported as events—cmd_ack or cmd_nack—with the original request ID so workflows can watch a single topic rather than poll. MQTT 5 user properties are used sparingly for cross-cutting concerns such as correlation IDs when an event was born in a webhook and traversed an API connector.
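For illustration, the four payload shapes might be written as follows. The values are invented, and metadata field names such as sev and data are examples rather than the normative schema.

# Illustrative payload shapes; values and the metadata field names are examples.
telemetry = {"ts": "2025-08-12T14:03:27.412Z", "sig": "supply_temp", "val": 18.4, "q": "good"}
event = {"ts": "2025-08-12T14:03:29.001Z", "evt": "door_forced", "sev": "high",
         "data": {"door": "east-01"}, "iid": "9f3c2ab1d4e07a66"}
state = {"ts": "2025-08-12T14:03:29.020Z", "sig": "locked", "val": False}   # published retained
command = {"cmd": "unlock", "args": {"duration_s": 8}, "req": "4b92c7", "user": "sec-op-7"}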
MQTT's delivery knobs are set for operational reality, not lab perfection. Devices set Clean Start = false, so sessions persist. Gateways honor Receive Maximum and Server Keep Alive to shed load when brokers are busy. Message expiry is set for transient streams (there is no point replaying stale 1-second samples after a long outage), but is never set for events and state. Consumers scale out with shared subscriptions, which lets a pool of analytics workers process a hot topic without fighting over messages.
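A minimal client sketch under these settings, assuming the paho-mqtt library (1.6-style constructor); the hostnames, client ID, and certificate paths are examples.

import paho.mqtt.client as mqtt
from paho.mqtt.properties import Properties
from paho.mqtt.packettypes import PacketTypes

# Persistent MQTT 5 session with mutual TLS.
client = mqtt.Client(client_id="gw-dfw-03-cdu-01", protocol=mqtt.MQTTv5)
client.tls_set(ca_certs="ca.pem", certfile="gw.crt", keyfile="gw.key")
client.will_set("org/acme/site/dfw-03/dev/cdu-01/state",
                '{"sig":"online","val":false}', qos=1, retain=True)   # broker-sent "I disappeared"

connect_props = Properties(PacketTypes.CONNECT)
connect_props.SessionExpiryInterval = 86400               # keep session state across reconnects
client.connect("mqtt.dfw-03.example", 8883, keepalive=60,
               clean_start=False, properties=connect_props)
client.loop_start()

# Telemetry carries a message expiry so stale samples are not replayed after an
# outage; events and state omit the property.
pub_props = Properties(PacketTypes.PUBLISH)
pub_props.MessageExpiryInterval = 300
client.publish("org/acme/site/dfw-03/dev/cdu-01/telemetry",
               '{"ts":"2025-08-12T14:03:27Z","sig":"supply_temp","val":18.4}',
               qos=1, retain=False, properties=pub_props)

# A pool of analytics workers shares one subscription; the broker spreads messages across them.
client.subscribe("$share/analytics/org/acme/site/+/dev/+/event", qos=1)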
Device descriptors and the capability taxonomy
A Device Descriptor is the contract. In English it states: "this thing is an access.door; it lives in the lobby; it can emit door_forced and access_granted; it exposes a boolean locked state and a percentage battery_pct; it accepts the unlock{duration_s} and lock{} commands." In JSON, the same reads as a small, signed document with fields for identity, location, firmware, interfaces, capabilities, signals, commands, and events. Descriptors are versioned with semver, posted on a retained cfg topic, and validated against schema at publish time. A site's Directory is simply the set of descriptors currently in force, and the Registry at the regional layer is the union across sites with some additional governance: new kinds and signals enter the world through small pull requests, so two teams don't invent temp and temperature_C for the same physical phenomenon.
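A compressed illustration of the idea, using the jsonschema package; the field names and the toy schema here are examples for the access.door case, not the normative GridFabric schema.

import json
import jsonschema

# Toy descriptor schema: the real one is stricter and versioned in the Registry.
DESCRIPTOR_SCHEMA = {
    "type": "object",
    "required": ["id", "kind", "version", "location", "signals", "commands", "events"],
    "properties": {
        "id": {"type": "string"},
        "kind": {"type": "string"},                      # e.g. access.door, cooling.cdu
        "version": {"type": "string"},                   # semver
        "location": {"type": "object"},
        "signals": {"type": "array"},
        "commands": {"type": "array"},
        "events": {"type": "array"},
    },
}

descriptor = {
    "id": "door-east-01",
    "kind": "access.door",
    "version": "1.2.0",
    "location": {"site": "dfw-03", "area": "lobby"},
    "signals": [
        {"name": "locked", "type": "bool"},
        {"name": "battery_pct", "type": "number", "unit": "%"},
    ],
    "commands": [{"name": "unlock", "args": {"duration_s": "integer"}}, {"name": "lock"}],
    "events": ["door_forced", "access_granted"],
}

jsonschema.validate(descriptor, DESCRIPTOR_SCHEMA)       # raises ValidationError on a bad document
print(json.dumps(descriptor, indent=2))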
Typing matters because it lets downstream systems work without brittle per-device parsers. GridFabric restricts types (bool, integer, number, string, enum, timestamp, object) and requires units up front. Data quality also matters; a value with "q":"bad" is not silently graphed. The taxonomy is deliberately pragmatic: power.meter, cooling.cdu, cooling.tank, access.door, access.reader, camera, environment.sensor, network.device. If you need something exotic—say diesel.genset—you write down the signals and commands and submit the addition; backward-compatible growth is encouraged, renames are not.
The API-only problem, solved properly
A modern site contains controllers that will never speak MQTT. They expose REST and WebSocket APIs, they fire webhooks, and they expect OAuth2 with rotating tokens. GridFabric's API Connector turns that into a first-class citizen. The connector subscribes to the command channel for the logical devices it represents and, when an operator issues unlock on a door, it uses a service account with the minimal scope to call the vendor's API. It emits cmd_ack or cmd_nack with the HTTP status and a redacted error body. For incoming events, it exposes a public webhook endpoint that terminates on a webhook gateway enforcing TLS, HSTS, HMAC signatures, and timestamp windows. Events carry vendor IDs; the connector maps them to GridFabric event names and emits them with a deterministic iid so deduplication is cheap across retries. When a vendor only offers polling, the connector maintains a cursor and uses ETag/If-Modified-Since to avoid waste. When a WebSocket or SSE feed is available, the connector prefers it and actively reconciles on reconnect by walking the REST API from the last cursor to "now."
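The webhook-verification and idempotency pieces are small enough to sketch. The header names and the signing scheme (HMAC-SHA256 over timestamp plus body) are assumptions; each vendor documents its own variant, and the connector adapts per platform.

import hashlib
import hmac
import time

# Sketch: verify a vendor webhook (signature plus timestamp window) and derive a
# deterministic iid so retried deliveries deduplicate cleanly.
def verify_webhook(secret: bytes, headers: dict, body: bytes, max_skew_s: int = 300) -> bool:
    ts = headers.get("X-Vendor-Timestamp", "")
    sig = headers.get("X-Vendor-Signature", "")
    try:
        skew = abs(time.time() - float(ts))
    except ValueError:
        return False                                     # missing or malformed timestamp
    if skew > max_skew_s:
        return False                                     # replayed or badly skewed delivery
    expected = hmac.new(secret, f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)            # constant-time comparison

def make_iid(vendor_event_id: str, device_id: str) -> str:
    # The same vendor event always yields the same iid, regardless of retries.
    return hashlib.sha256(f"{device_id}:{vendor_event_id}".encode()).hexdigest()[:16]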
Two rules keep API connectors safe. First, tokens live in a secrets manager and rotate automatically; the connector never carries long-lived credentials in plain files. Second, no direct internet egress exists from the OT networks. Connectors and brokers live in a fabric VRF that can egress only to allow-listed domains; field devices never see the internet. The result is that an access control platform that is entirely "API-only" behaves inside the fabric like any other device. It publishes events on event, accepts command, and exposes its last known state as a retained key.
Security and isolation without heroics
The network story is short: three VRFs and the strict minimum of pinholes. OT devices sit alone. The fabric—brokers, gateways, connectors—sits in its own VRF. Northbound egress has its own VRF. The default between VRFs is drop; the exceptions are explicit. MQTT clients authenticate with mutual TLS issued by a private CA. Webhooks terminate behind a reverse proxy that can do mTLS when vendors support it and always does HMAC verification. Operator-initiated commands come through a bastion with MFA and privileged access management; dangerous commands can require a two-person rule encoded as a policy on the connector. Every message that changes state carries a user identity and a request ID; every such message is journaled with an append-only log and retained long enough to satisfy auditors.
Time is part of safety. Gateways and connectors sync time with authenticated NTP/NTS; timestamp drift over a second triggers warnings because stale or future-dated samples create nasty illusions in dashboards. Where OT islands demand it, PTP is used locally, but everything published on the fabric includes ISO-8601 UTC timestamps and origin vs. ingest times to make causality plain after an outage.
Observability that doesn't surprise your SREs
Operators live in Prometheus and Grafana; GridFabric treats that as a constraint rather than an afterthought. A small exporter subscribes to telemetry and exposes /metrics. Mapping is mechanical: org, site, device, kind, area become labels; supply_temp_°C becomes gf_cooling_cdu_supply_temp_celsius (names sanitized and de-duplicated). Booleans surface as 0/1 gauges, and enums as labeled gauges. Cardinality is capped up front—device IDs are labels; raw event IDs are not. The same mapping can be done at the regional layer if you prefer to avoid scrapes at the edge; the point is that an SLO like "CDU loop availability" or "Access denied rate" requires no custom glue. Grafana dashboards carry the usual suspects—golden curves, deltas, alerts—and you can still store the raw JSON if a forensics workflow asks for more.
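The mapping itself is small enough to sketch. The unit table and the gf_ prefix below follow the supply_temp_°C example above; a real exporter would derive both from the descriptors rather than a hard-coded table.

import re

# Sketch: turn a GridFabric kind and signal name into a Prometheus metric name.
UNIT_SUFFIXES = {"°C": "celsius", "kPa": "kilopascals", "l/min": "liters_per_minute", "W": "watts"}

def metric_name(kind: str, signal: str) -> str:
    for unit, suffix in UNIT_SUFFIXES.items():
        if signal.endswith("_" + unit):
            signal = signal[: -(len(unit) + 1)] + "_" + suffix
            break
    raw = f"gf_{kind}_{signal}"
    return re.sub(r"[^a-zA-Z0-9_]", "_", raw).lower()    # Prometheus-safe characters only

# metric_name("cooling.cdu", "supply_temp_°C") -> "gf_cooling_cdu_supply_temp_celsius"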
High availability, back-pressure, and life without WAN
A site-local fabric must keep making sense when links are poor. Brokers run as a pair—clustered when the product supports it, bridged when it does not—and the client side never changes its URI. Store-and-forward is non-negotiable: gateways and connectors write to disk when they cannot publish, and they flush in timestamp order when they return, with a visible distinction between origin time and ingest time. MQTT 5 back-pressure is honored; when the broker lowers the Receive Maximum or shortens keep-alives, producers slow themselves rather than crash the box. Telemetry carries message expiry so nobody replays a day's worth of stale 1-second samples after a fiber cut; events and state never expire. Consumers at regional scale use shared subscriptions to process hot topics horizontally.
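A minimal sketch of the store-and-forward discipline follows; the single-file spool layout and the publish callable are simplifications for illustration, and a production gateway would use a more robust on-disk format.

import json
import time
from pathlib import Path

# Sketch: spool while the broker is unreachable, replay in timestamp order on
# reconnect, and drop telemetry older than its expiry window instead of
# flooding the NOC. Origin time and ingest time stay distinct.
class Spool:
    def __init__(self, path: str, telemetry_expiry_s: int = 300):
        self.file = Path(path)
        self.telemetry_expiry_s = telemetry_expiry_s

    def append(self, topic: str, payload: dict, origin_ts: float):
        record = {"topic": topic, "payload": payload, "origin_ts": origin_ts}
        with self.file.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def flush(self, publish):
        records = [json.loads(line) for line in self.file.read_text().splitlines() if line]
        for rec in sorted(records, key=lambda r: r["origin_ts"]):     # timestamp order
            is_telemetry = rec["topic"].endswith("/telemetry")
            too_old = time.time() - rec["origin_ts"] > self.telemetry_expiry_s
            if is_telemetry and too_old:
                continue                                              # expired sample: drop locally
            rec["payload"]["ingest_ts"] = time.time()
            publish(rec["topic"], json.dumps(rec["payload"]), qos=1)
        self.file.write_text("")                                      # spool drained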
Capacity planning is banal by design. You count signals, sample intervals, payload sizes, and retention, and you can estimate broker throughput in messages per second and disk in GiB/day. You push that through sustained tests—WAN impairment, packet loss, broker restarts—and you document how long the site can be offline before buffers fill. The answer should be measured in days, not hours.
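With hypothetical numbers, the arithmetic is a few lines.

# Back-of-the-envelope sizing with invented inputs: 2,000 signals sampled every
# second at roughly 200 bytes per message, retained for 90 days.
signals = 2000
interval_s = 1
payload_bytes = 200
retention_days = 90

msgs_per_s = signals / interval_s                         # 2,000 msg/s at the broker
bytes_per_day = msgs_per_s * payload_bytes * 86400        # raw, before compression
gib_per_day = bytes_per_day / 2**30
print(f"{msgs_per_s:.0f} msg/s, {gib_per_day:.1f} GiB/day, "
      f"{gib_per_day * retention_days:.0f} GiB for {retention_days} days (uncompressed)")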
Commissioning and change the way operations want it
GridFabric treats commissioning as part of operations rather than a construction ritual. A site goes live only when a conformance suite passes: descriptors validate against schema, topic names are correct, QoS and retain flags match policy, and security posture is clean. A golden hour of samples is captured as a fingerprint for a handful of signals—loop ΔT, pump duty, power factor—so anomaly detectors have a baseline that is not fiction. Change follows GitOps: descriptors, gateway maps, connector configs live in version control; releases roll out in rings (canary, cohort, fleet) with a simple rollback if message loss or error rates are out of bounds. Certificates and OAuth2 secrets rotate on a calendar, not after a breach.
Brownfield migration without breaking glass
Most fleets start messy. The migration path is well-trodden: deploy the brokers and a single gateway; model one subsystem in a descriptor; publish its telemetry and events; mirror them into Prometheus and Grafana; and use that thin wedge to get the runbooks and alerts right. Then add the API connector for the first controller—access control is an excellent candidate because it demonstrates events, commands, and audit all at once. From there, grow horizontally: cooling, power, cameras, network devices. Every addition comes with descriptors, tests, and dashboards. Nothing is forced to switch overnight; the value compounds as the number of bespoke one-offs declines.
A concrete flow, end-to-end
An operator in the NOC needs to let a contractor through an exterior door for eight seconds. They authenticate to a bastion with MFA; their client publishes to
org/acme/site/dfw-03/dev/door-east-01/command
with {"cmd":"unlock","args":{"duration_s":8},"req":"4b92c7","user":"sec-op-7"}
. The access control connector receives the command, checks the policy (is this user allowed to act on this door at this time?), exchanges its client credentials for a short-lived token, and calls the vendor's /doors/{id}/unlock API. When the vendor returns 200, the connector emits cmd_ack with the original request ID. The door hardware reports access_granted and state.locked=false via the vendor webhook; the webhook gateway verifies the HMAC, the connector maps it to a GridFabric event and state change, and both appear on the broker for analytics, alerting, and audit. Five seconds later, the door's "propped" timer fires; if the leaf is still open, a door_propped event appears with severity and duration. All of this works the same way when the site's WAN is flaky; the command fails fast if the vendor cloud is unreachable, and the door still functions on cached permissions and local policy.
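Sketched as connector-side code, the command leg of this flow might look like the following. The vendor URL, token handling, and the publish callable are placeholders; the req, user, and cmd_ack conventions follow the text above.

import json
import requests

# Sketch: handle an unlock command for one logical door, call the vendor API,
# and report the outcome as a correlated event on the same device's topic tree.
def handle_command(topic: str, payload: bytes, publish, token: str):
    cmd = json.loads(payload)
    device_id = topic.split("/dev/")[1].split("/")[0]     # e.g. door-east-01
    event_topic = topic.rsplit("/", 1)[0] + "/event"
    try:
        resp = requests.post(
            f"https://vendor.example/doors/{device_id}/unlock",
            json={"duration": cmd["args"]["duration_s"]},
            headers={"Authorization": f"Bearer {token}"},
            timeout=5,
        )
        outcome = "cmd_ack" if resp.ok else "cmd_nack"
        detail = {"status": resp.status_code}
    except requests.RequestException as exc:              # vendor cloud unreachable: fail fast
        outcome, detail = "cmd_nack", {"error": type(exc).__name__}
    publish(event_topic, json.dumps({
        "evt": outcome, "req": cmd["req"], "user": cmd["user"], "data": detail,
    }), qos=1)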
A second example from cooling shows the other half of the world. The CDU gateway polls Modbus registers every second, converts raw values to typed signals with units, and publishes supply_temp_°C, return_temp_°C, and flow_lpm. If ΔT collapses under load, the rule engine at the site can raise pump RPM by 5% inside a safe envelope while emitting an advisory event; the NOC sees the event, the derivative of ΔT in Grafana crosses a threshold, and an automated ticket is raised with the last minute of samples attached. If the broker disappears during this dance, the gateway caches to disk and flushes later; nothing floods the NOC with ghosts when the link returns because telemetry messages beyond the expiry window are dropped locally.
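A minimal version of that site-local rule is sketched below. The thresholds, step size, RPM ceiling, the set_pump_rpm command name, and the fixed topics are all illustrative, not part of the specification.

# Sketch: nudge pump RPM inside a safe envelope when loop delta-T collapses,
# and emit an advisory event so the NOC sees the intervention.
def delta_t_rule(supply_temp_c, return_temp_c, pump_rpm, publish,
                 min_delta_t=4.0, rpm_step_pct=5, rpm_max=3600):
    delta_t = return_temp_c - supply_temp_c
    if delta_t >= min_delta_t:
        return pump_rpm                                   # loop is healthy; do nothing
    new_rpm = min(pump_rpm * (1 + rpm_step_pct / 100), rpm_max)   # stay inside the envelope
    publish("org/acme/site/dfw-03/dev/cdu-01/event",
            {"evt": "delta_t_low", "sev": "advisory",
             "data": {"delta_t": round(delta_t, 2), "rpm": new_rpm}})
    publish("org/acme/site/dfw-03/dev/cdu-01/command",
            {"cmd": "set_pump_rpm", "args": {"rpm": new_rpm},
             "req": "auto", "user": "rule-engine"})
    return new_rpm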
The boring, defensible conclusion
GridFabric is not a bet on a single vendor or a fashionable data bus. It is a set of conservative, exacting conventions that make a heterogeneous estate behave like a single, auditable system: MQTT 5.0 on the wire, strict topics and QoS, descriptors that say what a device is and can do, gateways for field buses, connectors for API-only platforms, store-and-forward for bad days, mTLS and OAuth2 where they belong, Prometheus and Grafana for the glass, and small governance so the language stays coherent as you grow. With that in place, operators can write one runbook, one set of alerts, one set of SLOs, and expect them to hold across 10 or 10,000 sites. That is the difference between a clever integration and a fabric.