Tools of the Trade
Designing a Bulletproof Out-of-Band (OOB) Network for 10,000 Sites
TL;DR: Treat OOB like a product, not an afterthought. Give it its own VRF, identity, routing, security policy, and change cadence. Engineer for brownout conditions (carrier loss, high latency, bad power) and for scale (10k+ nodes) from day one.
1) What “good OOB” means at 10,000 sites
- Always reachable: survives primary WAN failure, misconfig pushes, and site power events long enough to act.
- Operator-first: one-click reach from a bastion to any device console with full audit.
- Least privilege: OOB can reach devices, not the internet; operators reach OOB, not production.
- Boring to run: uniform design, zero-touch provisioning (ZTP), ringed rollouts, and metrics that predict trouble.
2) Design principles (non-negotiable)
- Separate plane: OOB lives in its own VRF (or VLAN/overlay) with unique addressing, ACLs, and routing.
- Dual independence: Two independent paths where economically feasible—e.g., primary LTE carrier + secondary LTE/eSIM or fixed broadband; separate power feeds if you can.
- Hub-and-spoke with regional hubs: 4–8 regional hubs (two headends each, A/B) + 2 global cores; BGP over IPsec/WG tunnels; anycast bastions.
- Identity-first: device certificates, per-user MFA, per-session authorization; no shared creds.
- Read-only by default: elevate to write/console with JIT approval; log everything.
- Store-and-forward: telemetry buffers >48h on site; resume uploads on reconnection.
- Automate everything: ZTP, config-as-code, golden images, cohort rollouts, self-tests.
- Assume bad conditions: high RTT, jitter, CGNAT, captive portals, low power, wet/cold/hot.
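The store-and-forward principle above can be sketched as a small on-site buffer. This is a minimal illustration, not a reference implementation: the class name `TelemetryBuffer`, the `send` callback, and the exact 48-hour default are assumptions made for the example (the text only asks for more than 48 h of retention).

```python
import time
from collections import deque


class TelemetryBuffer:
    """Hold samples while the uplink is down; drop only what exceeds retention."""

    def __init__(self, retention_s=48 * 3600, clock=time.time):
        self.retention_s = retention_s  # assumed default: 48 hours
        self.clock = clock              # injectable so the buffer can be tested
        self._items = deque()           # (timestamp, payload), oldest first

    def append(self, payload):
        self._items.append((self.clock(), payload))
        self._expire()

    def _expire(self):
        # Discard samples older than the retention window.
        cutoff = self.clock() - self.retention_s
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def flush(self, send):
        """Send oldest-first; stop at the first failure and keep the rest."""
        self._expire()
        sent = 0
        while self._items:
            if not send(self._items[0][1]):
                break  # uplink still bad: retry on the next reconnection
            self._items.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._items)
```

A sample is removed only after `send` reports success, so a tunnel that drops mid-upload resumes from the oldest unsent sample.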
3) Reference architectures
3.1 Small edge cabinet (single-carrier, cost-sensitive)
- OOB router with LTE/5G + Wi-Fi client for failover to facility SSID.
- Console/OOB switch powering: serial console server (USB-serial is fine), KVM/IP for one host, smart PDU.
- Power: on a UPS circuit separate from the one feeding the IT gear’s PDU.
- Tunnel: route-based IPsec or WireGuard to two regional hubs (active/standby).
- Use case: retail store, micro-cell site, small POP.
3.2 Medium site (dual-carrier)
- OOB router with dual modems / eSIM; SIMs from two carriers.
- Private APN/VPN preferred; else CGNAT + headend-initiated tunnels.
- Serial & relay: 8–16-port console server; smart PDU for cold-reboot.
- Local agent: on-box or tiny VM for tests, buffering, and checks.
- Use case: edge compute rooms, warehouses, regional offices.
3.3 Campus or data hall (hybrid)
- Primary OOB: fiber to separate ISP or dark fiber to a “safe room” L3, plus LTE as tertiary.
- Dual OOB routers (HA), separate PDUs and antennas, diverse rooftop paths.
- Out-of-band fabric: dedicated OOB access switches per row/pod.
- Use case: converted crypto site, hyperscale hall, small DC.
4) Addressing & routing at fleet scale
IPv4 (RFC1918)
- Per-site supernet: /27 to /24 depending on device count.
- Example scheme:
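10.SS.RR.0/24, where SS = site index (0–255) and RR = region (0–31). This yields 256 × 32 = 8,192 /24s, so a 10,000-site fleet needs smaller per-site prefixes or a wider site field.
- Gateways use .1, console servers .10–.19, PDUs .20–.29, IPMI/iLO .30–.99 (ranges assume a /24).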
IPv6
- Assign a /64 per site OOB if you have GUA; otherwise ULA (e.g., fdaa:RR:SS::/64).
- Prefer IPv6 inside the tunnel; dual-stack devices when possible.
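As a rough sketch of the addressing scheme above, the per-site plan can be derived from the two indices with Python's standard `ipaddress` module. The function name `site_plan`, the output keys, and the choice to write the ULA groups as the hexadecimal form of the region and site integers are assumptions for illustration only.

```python
import ipaddress


def site_plan(site_index, region):
    """Derive the OOB plan for one site from 10.SS.RR.0/24 and fdaa:RR:SS::/64."""
    if not 0 <= site_index <= 255:
        raise ValueError("site index must fit in one octet (0-255)")
    if not 0 <= region <= 31:
        raise ValueError("region must be 0-31")

    v4 = ipaddress.ip_network(f"10.{site_index}.{region}.0/24")
    # Assumption: ULA groups are the hex form of the integers.
    v6 = ipaddress.ip_network(f"fdaa:{region:x}:{site_index:x}::/64")

    def host_range(first, last):
        base = v4.network_address
        return (str(base + first), str(base + last))

    return {
        "oob_supernet_v4": str(v4),
        "oob_gateway": str(v4.network_address + 1),
        "console_servers": host_range(10, 19),
        "pdus": host_range(20, 29),
        "bmc": host_range(30, 99),
        "oob_supernet_v6": str(v6),
    }


plan = site_plan(site_index=147, region=14)
print(plan["oob_supernet_v4"], plan["oob_gateway"])  # 10.147.14.0/24 10.147.14.1
```

Generating the plan from indices rather than typing addresses by hand keeps every site's layout identical, which is what makes fleet-wide ACLs and monitoring templates possible.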
Routing
- Headend design: 4–8 regional hubs each with two headends (A/B), plus 2 global cores.
- Tunnels: IPsec (route-based) or WireGuard; BGP over tunnels for summarization.
- Summaries: each site advertises only its OOB supernet; hubs advertise aggregated regionals upstream.
- Anycast bastion: /32 (v4) and /128 (v6) announced from all hubs; operators SSH/RDP to one name, land at nearest healthy hub.
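Hubs can advertise an aggregate only when the site prefixes behind them are contiguous. The sketch below uses `ipaddress.collapse_addresses` to show the difference; the prefixes are made-up examples, and `summarize` is a hypothetical helper name. One observation worth checking against your own plan: if the region sits in the third octet (10.SS.RR.0/24), the sites of one region differ in the second octet and do not collapse.

```python
import ipaddress


def summarize(prefixes):
    """Collapse a list of prefixes into the fewest covering networks."""
    nets = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]


# Four sites allocated contiguously within a region block (hypothetical).
contiguous = [f"10.64.{i}.0/24" for i in range(4)]
# Four sites of region 14 under a site-major scheme (10.SS.14.0/24).
scattered = [f"10.{i}.14.0/24" for i in range(4)]

print(summarize(contiguous))  # ['10.64.0.0/22']
print(summarize(scattered))   # four separate /24s, nothing to aggregate
```

Running a check like this over the planned allocations before deployment shows how many routes each hub would actually have to carry upstream.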
5) Security model (practical & strict)
- Device identity: each OOB router + console server holds a device certificate (mTLS).
- User access: VPN → bastion (MFA, device posture) → per-session JIT approval → jump to target.
- RBAC: roles for NetEng, SysEng, Facilities, OEM Vendor; least privilege enforced in PAM.
- Egress: OOB VRF egress allows only: headend IPs, time (NTP/NTS), CRL/OCSP, update repos, and optional SMS if used for break-glass.
- North-south: deny internet inbound except headend.
- East-west: OOB can reach mgmt IPs of devices; devices cannot initiate to OOB (except syslog/telemetry).
- Logging: PAM logs, command transcripts, serial captures, config diffs → append-only store with retention & legal hold.
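The access rules above (MFA first, role-based permissions, read-only unless a JIT approval exists) can be expressed as a small decision function. This is only a sketch: the role names come from the list above, but the action names, the role-to-action mapping, and the `Request` type are hypothetical, and a real deployment would enforce this in the PAM product rather than in custom code.

```python
from dataclasses import dataclass

# Hypothetical mapping of roles to the actions they may ever perform.
ROLE_ACTIONS = {
    "NetEng": frozenset({"read", "console", "power"}),
    "SysEng": frozenset({"read", "console"}),
    "Facilities": frozenset({"read", "power"}),
    "OEM Vendor": frozenset({"read"}),
}


@dataclass(frozen=True)
class Request:
    role: str
    action: str
    mfa_ok: bool
    jit_approved: bool = False


def authorize(req):
    """Return (allowed, reason) for one session request."""
    if not req.mfa_ok:
        return (False, "mfa required")
    if req.action not in ROLE_ACTIONS.get(req.role, frozenset()):
        return (False, "role not permitted")
    # Read-only by default: anything beyond "read" needs a JIT approval.
    if req.action != "read" and not req.jit_approved:
        return (False, "jit approval required")
    return (True, "ok")


print(authorize(Request("SysEng", "console", mfa_ok=True)))
# (False, 'jit approval required')
```

Returning a reason alongside the decision makes it straightforward to write the outcome to the append-only audit log.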
6) Physical & RF considerations
- Antennas: roof or exterior wall; MIMO; surge protectors; weatherproofing.
- Cabling: keep coax runs under 15–20 m where possible; if longer, consider powered repeaters or different placement.
- Power: OOB on a separate UPS string; PDUs on different breakers; document breaker IDs.
- Environmental: enclosures for −20°C…+50°C; heater strips or fans if needed.
- Tamper: lockable cabinets; camera view on OOB gear; seal antennas to deter theft.
7) Provisioning at 10,000 sites (no heroics)
Inventory contract (minimum fields):
{
"site_id": "chi-0147",
"region": "central",
"oob_supernet_v4": "10.147.14.0/24",
"oob_gateway": "10.147.14.1",
"oob_supernet_v6": "fdaa:2:147:14::/64",
"router_serial": "R-7F3C-22A1",
"router_imei": "3560...123",
"modems": [{"carrier":"A","iccid":"8914...001"},{"carrier":"B","iccid":"8914...002"}],
"console_serial": "CS-883201",
"pdu_serials": ["PDU-A-114","PDU-B-297"],
"headends": ["hub-cen-a","hub-cen-b"],
"antenna": {"type":"mimo","mount":"roof","azimuth":92},
"install_photos": ["url1","url2","url3"],
"last_ztp": "2025-08-09T14:01:33Z",
"notes": "antenna 10m coax via ladder tray A"
}
ZTP flow: device boots → DHCP options or QR code → fetch config bundle (signed) → build tunnels → enroll into inventory → run self-tests → report status.
- Kitting: pre-label SIMs, ports, antenna direction; include photo checklist.
- SIM policy: eUICC (eSIM) where available; carrier A primary, carrier B fallback; automated swap if sustained poor KPIs.
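An inventory contract is only useful if records are checked before ZTP consumes them. The following sketch validates a few of the fields from the sample record; which fields count as required, the /24 to /27 bound (taken from the addressing section), and the function name `validate_site` are assumptions for illustration.

```python
import ipaddress

# Assumed minimum set; extend to match your own contract.
REQUIRED_FIELDS = (
    "site_id", "region", "oob_supernet_v4",
    "oob_gateway", "router_serial", "headends",
)


def validate_site(record):
    """Return a list of problems; an empty list means the record passes."""
    errors = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in record]
    if errors:
        return errors
    try:
        net = ipaddress.ip_network(record["oob_supernet_v4"])
        gw = ipaddress.ip_address(record["oob_gateway"])
    except ValueError as exc:
        return [f"bad address: {exc}"]

    if net.version != 4 or not net.is_private:
        errors.append("oob_supernet_v4 must be a private IPv4 prefix")
    if not 24 <= net.prefixlen <= 27:
        errors.append("oob_supernet_v4 must be between /24 and /27")
    if gw not in net:
        errors.append("gateway is outside the site supernet")
    if len(set(record["headends"])) < 2:
        errors.append("need two distinct headends")
    return errors
```

Running this in the pipeline's lint stage catches a mistyped gateway before a technician is standing in front of a router that cannot build its tunnels.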
8) Config-as-code & change rings
- Repos: oob-global (policies), oob-region-X (headend prefixes), oob-site-SSS (locals).
- Pipelines: lint → lab → 1% sites (canary) → 20% cohort → 50% → fleet.
- Guardrails: max change batch, pause on error rate >1%, auto-rollback on health fail.
- Golden images: versioned router OS, console OS, bastion images; drift detection hourly.
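The ring progression and the 1% error guardrail can be sketched as follows. The ring sizes are read as cumulative percentages of the fleet, which is one interpretation of the pipeline above; `apply_change` is a hypothetical callback that pushes the change to one site and returns whether its health check passed.

```python
RING_PERCENTS = (1, 20, 50, 100)  # cumulative share of the fleet per ring


def plan_rings(sites, percents=RING_PERCENTS):
    """Split sites into rings; each ring holds only the newly added sites."""
    rings, done = [], 0
    for pct in percents:
        target = (len(sites) * pct + 99) // 100  # integer ceiling
        rings.append(sites[done:target])
        done = max(done, target)
    return rings


def rollout(sites, apply_change, max_error_rate=0.01):
    """Push ring by ring; pause when a ring's failure rate exceeds the limit."""
    changed = []
    for ring in plan_rings(sites):
        if not ring:
            continue
        failures = 0
        for site in ring:
            changed.append(site)
            if not apply_change(site):
                failures += 1
        if failures / len(ring) > max_error_rate:
            # Caller decides whether to roll back everything in `changed`.
            return ("paused", changed)
    return ("complete", changed)


fleet = [f"site-{i:05d}" for i in range(10_000)]
print([len(r) for r in plan_rings(fleet)])  # [100, 1900, 3000, 5000]
```

With a 100-site canary, a single failure is exactly 1% and does not trip a strictly-greater-than guardrail; two failures do. Whether that is the intended sensitivity is a policy decision.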
9) Monitoring & alerting (what matters)
Per site KPIs
- Tunnel health (to A and B), jitter/latency/loss, rekey counters.
- SIM/carrier status, RSRP/RSRQ/SINR, cell ID changes.
- Console reach (TCP/22/443), serial alive, PDU RPC.
- Time sync, disk space (buffer), CPU/mem, temperature.
- Power state (UPS on battery), door alarms if available.
Fleet dashboards
- Reachability heatmap by region/carrier.
- Top flapping sites, top low-signal sites, tunnel churn per hour.
- Firmware compliance & drift.
- Mean time to repair (MTTR) via OOB, and truck rolls avoided.
Alert rules (examples)
- P1: both tunnels down > 90s; P1: serial & PDU unreachable; P1: OOB power lost.
- P2: jitter > 80 ms for 5 min; P2: SIM stuck in register; P2: buffer > 60%.
- P3: OS drift, cert expiring < 14 days, antenna RSSI below site baseline for 24h.
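The example rules above translate directly into a small evaluator. All field names in the metrics dictionary are hypothetical, and "SIM stuck in register" is reduced to a boolean flag because the text gives no duration threshold for it.

```python
def evaluate(m):
    """Map one site's observations to a list of (priority, reason) alerts."""
    alerts = []

    # P1: loss of reach or power.
    if m.get("tunnel_a_down_s", 0) > 90 and m.get("tunnel_b_down_s", 0) > 90:
        alerts.append(("P1", "both tunnels down > 90s"))
    if not m.get("serial_reachable", True) and not m.get("pdu_reachable", True):
        alerts.append(("P1", "serial and PDU unreachable"))
    if not m.get("oob_power_ok", True):
        alerts.append(("P1", "OOB power lost"))

    # P2: degraded but reachable.
    if m.get("jitter_over_80ms_s", 0) >= 300:
        alerts.append(("P2", "jitter > 80 ms for 5 min"))
    if m.get("sim_stuck_registering", False):
        alerts.append(("P2", "SIM stuck in register"))
    if m.get("buffer_used_pct", 0) > 60:
        alerts.append(("P2", "buffer > 60%"))

    # P3: hygiene.
    if m.get("os_drift", False):
        alerts.append(("P3", "OS drift"))
    if m.get("cert_days_left", 10**6) < 14:
        alerts.append(("P3", "cert expiring < 14 days"))
    if m.get("rssi_below_baseline_h", 0) >= 24:
        alerts.append(("P3", "RSSI below baseline for 24h"))
    return alerts


def highest_priority(alerts):
    """Return 'P1', 'P2', 'P3', or None when nothing fired."""
    return min((p for p, _ in alerts), default=None)
```

Missing fields default to the healthy value here, so a site that stops reporting raises nothing from this function; absence of data needs its own rule.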
10) Runbooks (one-click where possible)
- Promote backup tunnel: shift primary headend, then carrier.
- Cold-reboot a device: PDU outlet cycle with safeguards; confirm ping/SSH returns.
- Roll back a bad change: revert config tag; push; verify canary; expand.
- Replace SIM remotely: eUICC profile switch; verify attach; update inventory.
- Recover bricked router: fall back to ROMMON/loader over serial; TFTP/USB image; re-enroll.
11) Quarterly drills (not theater)
- Carrier brownout: enforce rate-limit and test voice priority (if provided); verify operator can still reach serial at >300 ms RTT.
- Headend loss: turn down a regional hub; sessions re-anchor to neighbor; SLOs hold.
- Power cut: simulate UPS transfer; verify OOB stays up long enough to drain traffic or move BGP weights; log timings.
- Break-glass: assume bastion outage; use secondary bastion or emergency dial-in path; verify access list works; document outcomes.
12) Example ACLs & flows (pseudo-config)
OOB router VRF “oob”:
ip vrf oob
 rd 65000:10
 route-target export 65000:10
 route-target import 65000:10
ip access-list extended OOB_EGRESS
 permit ip any host <HEADEND_A_PUBLIC>
 permit ip any host <HEADEND_B_PUBLIC>
 permit udp any any eq 123 ! NTP/NTS via proxy (match on destination port)
 ! Add permits for CRL/OCSP and update repos (section 5) before the denies
 deny ip any 10.0.0.0 0.255.255.255 log
 deny ip any 172.16.0.0 0.15.255.255 log
 deny ip any 192.168.0.0 0.0.255.255 log
 deny ip any any log
! Bastion flow (PAM-enforced):
! User MFA → PAM → ephemeral credentials → SSH jump to target mgmt IP
! Session recording → auto-revoke
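First-match ordering in a list like OOB_EGRESS is easy to get wrong, so it can help to model it before pushing. The sketch below assumes NTP is matched on destination port 123 and uses documentation addresses (198.51.100.10, 203.0.113.10) as stand-ins for the headend placeholders; `build_acl` and `egress_action` are hypothetical helper names.

```python
import ipaddress

ANY = ipaddress.ip_network("0.0.0.0/0")


def build_acl(headend_a, headend_b):
    """Rules as (action, protocol, destination network, destination port)."""
    return [
        ("permit", "ip", ipaddress.ip_network(f"{headend_a}/32"), None),
        ("permit", "ip", ipaddress.ip_network(f"{headend_b}/32"), None),
        ("permit", "udp", ANY, 123),
        ("deny", "ip", ipaddress.ip_network("10.0.0.0/8"), None),
        ("deny", "ip", ipaddress.ip_network("172.16.0.0/12"), None),
        ("deny", "ip", ipaddress.ip_network("192.168.0.0/16"), None),
        ("deny", "ip", ANY, None),
    ]


def egress_action(acl, proto, dst, dport=None):
    """Return the action of the first rule that matches the packet."""
    dst_ip = ipaddress.ip_address(dst)
    for action, rule_proto, net, rule_port in acl:
        if rule_proto != "ip" and rule_proto != proto:
            continue  # "ip" matches any protocol
        if dst_ip not in net:
            continue
        if rule_port is not None and rule_port != dport:
            continue
        return action
    return "deny"  # implicit deny at the end of the list


acl = build_acl("198.51.100.10", "203.0.113.10")
print(egress_action(acl, "tcp", "198.51.100.10", 443))  # permit
print(egress_action(acl, "udp", "10.1.2.3", 123))       # permit
print(egress_action(acl, "tcp", "10.1.2.3", 22))        # deny
```

The second result is worth noticing: because the NTP permit precedes the RFC 1918 denies, NTP to a private address is allowed. If that is not intended, restrict the NTP rule to the proxy's address.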
13) Cost & capacity math (quick reality)
- Per-site OOB data: 100–500 MB/month typical (telemetry + occasional console), spikes during troubleshooting/firmware.
- Plan tiers: light (250 MB), standard (1 GB), heavy (5 GB); pool at regional level.
- Headends: size for 2–3× expected concurrent tunnels per region; prefer stateless workers + sticky control; keep certificate ops off the hot path.
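The sizing arithmetic above is simple enough to script. In this sketch the tier sizes come from the plan tiers listed above (treating 1 GB as 1,000 MB); the site counts per tier, the two-tunnels-per-site figure, and the 3× headroom are assumptions chosen for the example.

```python
TIER_MB = {"light": 250, "standard": 1000, "heavy": 5000}


def regional_pool_gb(sites_by_tier):
    """Monthly pooled data allowance for one region, in GB."""
    total_mb = sum(TIER_MB[tier] * count for tier, count in sites_by_tier.items())
    return total_mb / 1000


def headend_tunnel_capacity(sites, tunnels_per_site=2, headroom=3):
    """Tunnels a region should be sized for: expected tunnels times headroom."""
    return sites * tunnels_per_site * headroom


# Hypothetical region of 2,000 sites.
mix = {"light": 1200, "standard": 700, "heavy": 100}
print(regional_pool_gb(mix))          # 1500.0
print(headend_tunnel_capacity(2000))  # 12000
```

Pooling at the regional level means the 100 heavy sites can borrow from the light ones during a firmware push without any single SIM hitting a cap.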
14) 30/60/90 rollout plan
- Day 0–30: finalize schema, headend build, ZTP pipeline; pilot 50 sites across 3 regions; validate dual-tunnel & carrier metrics; write drills.
- Day 31–60: expand to 1,000 sites; enable canary rings; integrate PAM/SIEM; train NOC; tune alerts to ≤ 1 page/site/month.
- Day 61–90: scale to 5,000–10,000; add second SIM carrier where weak; quarterly drills; publish SLOs & monthly “OOB health” reports.
15) What to avoid
Shared local admin passwords; unmanaged modem/APN chaos; NAT hairpins through production; per-site snowflake configs; “manual” provisioning; unmanaged SMS control; no photo evidence of antenna/install.
16) Tie-in: how GridSite helps (light touch)
- Selection & design: choose routers, antennas, SIM/eSIM policies, private APN/VPN designs; solidify the VRF/ACL pattern.
- Ordering & provisioning at scale: kitting, SIM activation, ZTP pipelines, photo-verified installs, and inventory systems that won’t rot.
- Headends & bastions: build/operate regional hubs, anycast bastions, PAM, logging, and DR.
- Operations: 24×7 monitoring, change rings, firmware cadences, and quarterly drills—plus remote-hands dispatch through the marketplace when a truck roll really is needed.
- Integrations: wire OOB alarms to your ITSM, CMDB, SIEM; keep audit & compliance happy.
Bottom line: The best OOB is invisible until you need it—and utterly dependable when you do. Build it like a product, automate it like software, and measure it like an SLO.