Tools of the Trade

Designing a Bulletproof Out-of-Band (OOB) Network for 10,000 Sites

TL;DR: Treat OOB like a product, not an afterthought. Give it its own VRF, identity, routing, security policy, and change cadence. Engineer for brownout conditions (carrier loss, high latency, bad power) and for scale (10k+ nodes) from day one.

1) What “good OOB” means at 10,000 sites

  • Always reachable: survives primary WAN failure, misconfig pushes, and site power events long enough to act.
  • Operator-first: one-click reach from a bastion to any device console with full audit.
  • Least privilege: OOB can reach devices, not the internet; operators reach OOB, not production.
  • Boring to run: uniform design, zero-touch provisioning (ZTP), ringed rollouts, and metrics that predict trouble.

2) Design principles (non-negotiable)

  • Separate plane: OOB lives in its own VRF (or VLAN/overlay) with unique addressing, ACLs, and routing.
  • Dual independence: Two independent paths where economically feasible—e.g., primary LTE carrier + secondary LTE/eSIM or fixed broadband; separate power feeds if you can.
  • Hub-and-spoke with regional hubs: 4–8 regional hubs (each with A/B headends) + 2 core headends; BGP over IPsec/WireGuard tunnels; anycast bastions.
  • Identity-first: device certificates, per-user MFA, per-session authorization; no shared creds.
  • Read-only by default: elevate to write/console with JIT approval; log everything.
  • Store-and-forward: buffer more than 48 h of telemetry on site; resume uploads on reconnection.
  • Automate everything: ZTP, config-as-code, golden images, cohort rollouts, self-tests.
  • Assume bad conditions: high RTT, jitter, CGNAT, captive portals, low power, wet/cold/hot.
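
The store-and-forward principle can be sketched as a small time-bounded queue. This is a minimal in-memory illustration, not a production buffer: the class name, the 72 h retention value (chosen only because it exceeds 48 h), and the `send` callback are all assumptions, and a real site agent would persist to disk.

```python
from collections import deque

RETENTION_S = 72 * 3600  # assumed value; anything above the 48 h floor would do


class TelemetryBuffer:
    """In-memory stand-in for an on-site store-and-forward buffer (hypothetical)."""

    def __init__(self, retention_s: int = RETENTION_S):
        self.retention_s = retention_s
        self._items = deque()  # (timestamp, sample) pairs, oldest first

    def record(self, ts: float, sample: dict) -> None:
        """Append a sample, then drop anything older than the retention window."""
        self._items.append((ts, sample))
        cutoff = ts - self.retention_s
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def flush(self, send) -> int:
        """Upload oldest-first; stop at the first failure so nothing is lost."""
        sent = 0
        while self._items:
            ts, sample = self._items[0]
            if not send(ts, sample):
                break  # link dropped again; keep the sample for the next attempt
            self._items.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._items)
```

A sample is removed only after `send` reports success, so a tunnel that drops mid-upload resumes from the same point on reconnection.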

3) Reference architectures

3.1 Small edge cabinet (single-carrier, cost-sensitive)

  • OOB router with LTE/5G + Wi-Fi client for failover to facility SSID.
  • Console/OOB switch connecting a serial console server (USB-serial is fine), KVM/IP for one host, and a smart PDU.
  • Power: on a UPS circuit separate from the one feeding the IT gear’s PDU.
  • Tunnel: route-based IPsec or WireGuard to two regional hubs (active/standby).
  • Use case: retail store, micro-cell site, small POP.

3.2 Medium site (dual-carrier)

  • OOB router with dual modems / eSIM; SIMs from two carriers.
  • Private APN/VPN preferred; otherwise expect CGNAT and use site-initiated tunnels to the headend, since the headend cannot open connections inbound through CGNAT.
  • Serial & relay: 8–16-port console server; smart PDU for cold-reboot.
  • Local agent: on-box or tiny VM for tests, buffering, and checks.
  • Use case: edge compute rooms, warehouses, regional offices.

3.3 Campus or data hall (hybrid)

  • Primary OOB: fiber to separate ISP or dark fiber to a “safe room” L3, plus LTE as tertiary.
  • Dual OOB routers (HA), separate PDUs and antennas, diverse rooftop paths.
  • Out-of-band fabric: dedicated OOB access switches per row/pod.
  • Use case: converted crypto site, hyperscale hall, small DC.

4) Addressing & routing at fleet scale

IPv4 (RFC1918)

  • Per-site supernet: /27 to /24 depending on device count.
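  • Example scheme: 10.SS.RR.0/24, where SS = site index (0–255) and RR = region (0–31). That yields 256 × 32 = 8,192 /24s, so a 10,000-site fleet needs more regions or smaller per-site prefixes. If hubs are to advertise regional aggregates, put the region in the higher-order octet (10.RR.SS.0/24) so each region's sites are contiguous.
  • Gateways use .1, console servers .10–.19, PDUs .20–.29, IPMI/iLO .30–.99.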
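
As a rough illustration of the IPv4 scheme, the sketch below derives a site's /24 and its per-role host ranges with Python's standard `ipaddress` module. The function names and the `ROLE_OFFSETS` table are hypothetical; only the octet layout and the role ranges come from the list above.

```python
import ipaddress

# Host offsets per role, taken from the ranges listed above.
ROLE_OFFSETS = {
    "gateway": range(1, 2),     # .1
    "console": range(10, 20),   # .10-.19
    "pdu": range(20, 30),       # .20-.29
    "bmc": range(30, 100),      # .30-.99 (IPMI/iLO)
}


def site_supernet(site_index: int, region: int) -> ipaddress.IPv4Network:
    """Build 10.SS.RR.0/24 for the example scheme."""
    if not 0 <= site_index <= 255:
        raise ValueError("site index must be 0-255")
    if not 0 <= region <= 31:
        raise ValueError("region must be 0-31")
    return ipaddress.ip_network(f"10.{site_index}.{region}.0/24")


def role_addresses(net: ipaddress.IPv4Network, role: str) -> list:
    """Return the addresses reserved for a role inside a site's /24."""
    base = net.network_address
    return [base + offset for offset in ROLE_OFFSETS[role]]


net = site_supernet(147, 14)
print(net, role_addresses(net, "gateway")[0])
```

Generating addresses from the site index rather than typing them by hand is one way to keep thousands of sites uniform.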

IPv6

  • Assign a /64 per site OOB if you have GUA; otherwise ULA (e.g., fdaa:RR:SS::/64).
  • Prefer IPv6 inside the tunnel; dual-stack devices when possible.

Routing

  • Headend design: 4–8 regional hubs each with two headends (A/B), plus 2 global cores.
  • Tunnels: IPsec (route-based) or WireGuard; BGP over tunnels for summarization.
  • Summaries: each site advertises only its OOB supernet; hubs advertise aggregated regionals upstream.
  • Anycast bastion: /32 (v4) and /128 (v6) announced from all hubs; operators SSH/RDP to one name, land at nearest healthy hub.
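
Whether a hub can really advertise one regional aggregate depends on how site prefixes are laid out. The sketch below uses `ipaddress.collapse_addresses` from the standard library to compare two layouts. The region-major layout (10.RR.SS.0/24) is an assumption introduced for comparison; the site-major layout is the example scheme from the addressing section.

```python
import ipaddress


def region_major(region: int, sites) -> list:
    """Assumed layout 10.RR.SS.0/24: a region's sites are contiguous."""
    return [ipaddress.ip_network(f"10.{region}.{s}.0/24") for s in sites]


def site_major(region: int, sites) -> list:
    """Example-scheme layout 10.SS.RR.0/24: a region's sites are scattered."""
    return [ipaddress.ip_network(f"10.{s}.{region}.0/24") for s in sites]


def aggregates(prefixes) -> list:
    """Smallest set of prefixes covering exactly the given site prefixes."""
    return list(ipaddress.collapse_addresses(prefixes))


sites = range(256)
print(len(aggregates(region_major(14, sites))))  # one /16 for the region
print(len(aggregates(site_major(14, sites))))    # no adjacent prefixes to merge
```

With the region in the second octet, the 256 site prefixes collapse to a single 10.14.0.0/16. With the site index first, none of them are adjacent, so the hub would have to advertise each /24 or a much coarser covering prefix.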

5) Security model (practical & strict)

  • Device identity: each OOB router + console server holds a device certificate (mTLS).
  • User access: VPN → bastion (MFA, device posture) → per-session JIT approval → jump to target.
  • RBAC: roles for NetEng, SysEng, Facilities, OEM Vendor; least privilege enforced in PAM.
  • Egress: the OOB VRF may reach only headend IPs, time (NTP/NTS), CRL/OCSP, update repos, and, optionally, SMS if used for break-glass.
  • North-south: deny internet inbound except headend.
  • East-west: OOB can reach mgmt IPs of devices; devices cannot initiate to OOB (except syslog/telemetry).
  • Logging: PAM logs, command transcripts, serial captures, config diffs → append-only store with retention & legal hold.
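
The access path (MFA, role check, per-session JIT approval, short-lived grant) might be modeled roughly as follows. This is a sketch, not any PAM product's API: the role-to-target mapping, the 15-minute TTL, and the rule that a write session needs a second person's approval are all illustrative assumptions.

```python
from dataclasses import dataclass

# Hypothetical mapping of the four roles to device kinds they may reach.
ROLE_TARGETS = {
    "NetEng": {"router", "switch", "console"},
    "SysEng": {"bmc", "console"},
    "Facilities": {"pdu"},
    "OEM Vendor": {"console"},
}


@dataclass(frozen=True)
class Grant:
    user: str
    target_kind: str
    mode: str          # "read" or "write"
    expires_at: float  # epoch seconds


def authorize(user, role, target_kind, mode, now, mfa_ok,
              approved_by=None, ttl_s=900):
    """Issue a short-lived grant, or raise PermissionError."""
    if mode not in ("read", "write"):
        raise ValueError("mode must be 'read' or 'write'")
    if not mfa_ok:
        raise PermissionError("MFA required")
    if target_kind not in ROLE_TARGETS.get(role, set()):
        raise PermissionError(f"{role} may not reach {target_kind}")
    # Read-only by default; write access needs approval from someone else.
    if mode == "write" and approved_by in (None, user):
        raise PermissionError("write access needs a second approver")
    return Grant(user, target_kind, mode, now + ttl_s)


def is_valid(grant: Grant, now: float) -> bool:
    """Grants expire on their own, which models auto-revoke."""
    return now < grant.expires_at
```

Every decision point in such a flow is also a natural place to emit an audit record.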

6) Physical & RF considerations

  • Antennas: roof or exterior wall; MIMO; surge protectors; weatherproofing.
  • Cabling: coax length <15–20 m where possible; if longer, consider powered repeaters or different placement.
  • Power: OOB on a separate UPS string; PDUs on different breakers; document breaker IDs.
  • Environmental: enclosures for −20°C…+50°C; heater strips or fans if needed.
  • Tamper: lockable cabinets; camera view on OOB gear; seal antennas to deter theft.

7) Provisioning at 10,000 sites (no heroics)

Inventory contract (minimum fields):

{
  "site_id": "chi-0147",
  "region": "central",
  "oob_supernet_v4": "10.147.14.0/24",
  "oob_gateway": "10.147.14.1",
  "oob_supernet_v6": "fdaa:14:147::/64",
  "router_serial": "R-7F3C-22A1",
  "router_imei": "3560...123",
  "modems": [{"carrier":"A","iccid":"8914...001"},{"carrier":"B","iccid":"8914...002"}],
  "console_serial": "CS-883201",
  "pdu_serials": ["PDU-A-114","PDU-B-297"],
  "headends": ["hub-cen-a","hub-cen-b"],
  "antenna": {"type":"mimo","mount":"roof","azimuth":92},
  "install_photos": ["url1","url2","url3"],
  "last_ztp": "2025-08-09T14:01:33Z",
  "notes": "antenna 10m coax via ladder tray A"
}
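
A record like this is only useful if it is checked before ZTP consumes it. The sketch below shows one possible validator using the standard `ipaddress` module. The specific checks (required fields, gateway inside the supernet, two distinct headends, unique ICCIDs) are assumptions about what a fleet might enforce, not a fixed contract.

```python
import ipaddress

REQUIRED_FIELDS = (
    "site_id", "region", "oob_supernet_v4", "oob_gateway",
    "oob_supernet_v6", "router_serial", "modems", "headends",
)


def validate_site(record: dict) -> list:
    """Return human-readable problems; an empty list means the record passes."""
    errors = [f"missing field: {name}"
              for name in REQUIRED_FIELDS if name not in record]
    if errors:
        return errors
    try:
        v4 = ipaddress.ip_network(record["oob_supernet_v4"])
        v6 = ipaddress.ip_network(record["oob_supernet_v6"])
        gateway = ipaddress.ip_address(record["oob_gateway"])
    except ValueError as exc:
        return [f"bad address or prefix: {exc}"]
    if v4.version != 4 or not 24 <= v4.prefixlen <= 27:
        errors.append("oob_supernet_v4 must be an IPv4 /24 to /27")
    if v6.version != 6 or v6.prefixlen != 64:
        errors.append("oob_supernet_v6 must be an IPv6 /64")
    if gateway not in v4:
        errors.append("oob_gateway is outside oob_supernet_v4")
    if len(set(record["headends"])) < 2:
        errors.append("need two distinct headends")
    iccids = [m.get("iccid") for m in record["modems"]]
    if len(iccids) != len(set(iccids)):
        errors.append("duplicate SIM ICCID")
    return errors
```

Running a check like this in the inventory pipeline catches typos before a technician is on site with a router that cannot enroll.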

ZTP flow: device boots → DHCP options or QR code → fetch config bundle (signed) → build tunnels → enroll into inventory → run self-tests → report status.
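
The ZTP flow is essentially an ordered pipeline that must stop at the first failed step and report where it stopped. A minimal sketch follows, with hypothetical step names and a `handlers` mapping standing in for the real device actions:

```python
# Step names are illustrative labels for the stages in the flow above.
ZTP_STEPS = (
    "discover",          # DHCP options or QR code
    "fetch_bundle",      # download the config bundle
    "verify_signature",  # refuse unsigned or tampered bundles
    "build_tunnels",
    "enroll",
    "self_test",
)


def run_ztp(handlers: dict) -> dict:
    """Run each step in order; stop and report at the first failure."""
    completed = []
    for step in ZTP_STEPS:
        try:
            ok = bool(handlers[step]())
            error = None
        except Exception as exc:  # a crashing step counts as a failure
            ok, error = False, str(exc)
        if not ok:
            return {"status": "failed", "failed_step": step,
                    "error": error, "completed": completed}
        completed.append(step)
    return {"status": "ok", "failed_step": None,
            "error": None, "completed": completed}
```

Because verification sits before tunnel build, a bundle that fails its signature check never gets the chance to configure the device.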

  • Kitting: pre-label SIMs, ports, antenna direction; include photo checklist.
  • SIM policy: eUICC (eSIM) where available; carrier A primary, carrier B fallback; automated swap if sustained poor KPIs.

8) Config-as-code & change rings

  • Repos: oob-global (policies), oob-region-X (headend prefixes), oob-site-SSS (locals).
  • Pipelines: lint → lab → 1% sites (canary) → 20% cohort → 50% → fleet.
  • Guardrails: max change batch, pause on error rate >1%, auto-rollback on health fail.
  • Golden images: versioned router OS, console OS, bastion images; drift detection hourly.
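
To make the ring sizes concrete, the sketch below splits a fleet into cohorts and applies the 1% error guardrail. It assumes the percentages in the pipeline are cumulative (1%, then up to 20%, then up to 50%, then everyone), which is one reading of the pipeline above; the function names are hypothetical.

```python
RING_PCTS = (1, 20, 50, 100)  # cumulative share of the fleet after each ring
MAX_ERROR_PCT = 1             # pause when the error rate exceeds this


def ring_plan(site_ids: list, ring_pcts=RING_PCTS) -> list:
    """Split sites into successive cohorts using integer arithmetic."""
    total = len(site_ids)
    plan, start = [], 0
    for pct in ring_pcts:
        end = (total * pct + 99) // 100  # ceiling of total * pct / 100
        plan.append(site_ids[start:end])
        start = end
    return plan


def should_pause(attempted: int, failed: int,
                 max_error_pct: int = MAX_ERROR_PCT) -> bool:
    """True when the failure rate is strictly above the guardrail."""
    return attempted > 0 and failed * 100 > attempted * max_error_pct


fleet = [f"site-{i:05d}" for i in range(10_000)]
print([len(cohort) for cohort in ring_plan(fleet)])
```

For a 10,000-site fleet this gives cohorts of 100, 1,900, 3,000, and 5,000 sites, so the canary ring can tolerate exactly one failed site before the rollout pauses.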

9) Monitoring & alerting (what matters)

Per site KPIs

  • Tunnel health (to A and B), jitter/latency/loss, rekey counters.
  • SIM/carrier status, RSRP/RSRQ/SINR, cell ID changes.
  • Console reach (TCP/22/443), serial alive, PDU RPC.
  • Time sync, disk space (buffer), CPU/mem, temperature.
  • Power state (UPS on battery), door alarms if available.

Fleet dashboards

  • Reachability heatmap by region/carrier.
  • Top flapping sites, top low-signal sites, tunnel churn per hour.
  • Firmware compliance & drift.
  • Mean time to repair (MTTR) via OOB, and truck rolls avoided.

Alert rules (examples)

  • P1: both tunnels down > 90s; P1: serial & PDU unreachable; P1: OOB power lost.
  • P2: jitter > 80 ms for 5 min; P2: SIM stuck in registration; P2: buffer > 60%.
  • P3: OS drift, cert expiring < 14 days, antenna RSSI below site baseline for 24h.
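
A few of these rules can be expressed as a small classifier over a per-site metrics snapshot. The field names below are hypothetical, and "for 5 min" is approximated by requiring the minimum jitter over the last five minutes to exceed the threshold; only the thresholds themselves come from the rules above.

```python
def classify(site: dict) -> list:
    """Return (priority, reason) pairs for one site's metrics snapshot."""
    alerts = []
    if site["both_tunnels_down_s"] > 90:
        alerts.append(("P1", "both tunnels down > 90 s"))
    if not site["serial_ok"] and not site["pdu_ok"]:
        alerts.append(("P1", "serial and PDU unreachable"))
    if not site["oob_power_ok"]:
        alerts.append(("P1", "OOB power lost"))
    if site["min_jitter_ms_5m"] > 80:
        alerts.append(("P2", "jitter > 80 ms for 5 min"))
    if site["buffer_used_pct"] > 60:
        alerts.append(("P2", "buffer > 60%"))
    if site["cert_days_left"] < 14:
        alerts.append(("P3", "cert expiring in < 14 days"))
    return alerts


def page_priority(alerts: list):
    """Highest priority present ("P1" sorts before "P2"), or None."""
    return min((priority for priority, _ in alerts), default=None)
```

Keeping the rules in code alongside the configs lets alert changes go through the same rings as any other change.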

10) Runbooks (one-click where possible)

  • Promote backup tunnel: shift primary headend, then carrier.
  • Cold-reboot a device: PDU outlet cycle with safeguards; confirm ping/SSH returns.
  • Roll back a bad change: revert config tag; push; verify canary; expand.
  • Replace SIM remotely: eUICC profile switch; verify attach; update inventory.
  • Recover bricked router: fall back to ROMMON/loader over serial; TFTP/USB image; re-enroll.

11) Quarterly drills (not theater)

  • Carrier brownout: enforce rate-limit and test voice priority (if provided); verify operator can still reach serial at >300 ms RTT.
  • Headend loss: turn down a regional hub; sessions re-anchor to neighbor; SLOs hold.
  • Power cut: simulate UPS transfer; verify OOB stays up long enough to drain traffic or move BGP weights; log timings.
  • Break-glass: assume bastion outage; use secondary bastion or emergency dial-in path; verify access list works; document outcomes.

12) Example ACLs & flows (pseudo-config)

OOB router VRF “oob”:

ip vrf oob
  rd 65000:10
  route-target export 65000:10
  route-target import 65000:10

ip access-list extended OOB_EGRESS
  permit ip any host <HEADEND_A_PUBLIC>
  permit ip any host <HEADEND_B_PUBLIC>
  ! NTP via proxy: match destination port 123 (NTS also needs TCP/4460)
  permit udp any any eq 123
  deny   ip any 10.0.0.0 0.255.255.255 log
  deny   ip any 172.16.0.0 0.15.255.255 log
  deny   ip any 192.168.0.0 0.0.255.255 log
  deny   ip any any log

! Bastion flow (PAM-enforced):
! User MFA → PAM → ephemeral credentials → SSH jump to target mgmt IP
! Session recording → auto-revoke

13) Cost & capacity math (quick reality)

  • Per-site OOB data: 100–500 MB/month typical (telemetry + occasional console), spikes during troubleshooting/firmware.
  • Plan tiers: light (250 MB), standard (1 GB), heavy (5 GB); pool at regional level.
  • Headends: size for 2–3× expected concurrent tunnels per region; prefer stateless workers + sticky control; keep certificate ops off the hot path.
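
The sizing arithmetic can be checked with a few lines of code. In the sketch below the tier sizes and the 2–3× headroom come from the list above; the tier mix, the five-region split, the 1 GB = 1,000 MB conversion, and the function names are assumptions for illustration.

```python
import math

TIER_MB = {"light": 250, "standard": 1000, "heavy": 5000}


def pooled_plan_gb(sites_by_tier: dict) -> float:
    """Total pooled monthly allowance in GB (assuming 1 GB = 1,000 MB)."""
    total_mb = sum(TIER_MB[tier] * count for tier, count in sites_by_tier.items())
    return total_mb / 1000


def headend_tunnel_target(sites_in_region: int, headroom: float = 3.0) -> int:
    """Tunnels one headend should be sized for: one per site, times headroom."""
    return math.ceil(sites_in_region * headroom)


# Hypothetical mix for a 10,000-site fleet spread evenly over 5 regions.
mix = {"light": 6000, "standard": 3500, "heavy": 500}
print(pooled_plan_gb(mix))           # pooled GB per month across the fleet
print(headend_tunnel_target(2000))   # per-headend tunnel target at 3x
```

Under these assumptions the fleet pools 7,500 GB per month, and each headend in a 2,000-site region is sized for 4,000 to 6,000 tunnels depending on the headroom chosen.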

14) 30/60/90 rollout plan

  • Day 0–30: finalize schema, headend build, ZTP pipeline; pilot 50 sites across 3 regions; validate dual-tunnel & carrier metrics; write drills.
  • Day 31–60: expand to 1,000 sites; enable canary rings; integrate PAM/SIEM; train NOC; tune alerts to ≤ 1 page/site/month.
  • Day 61–90: scale to 5,000–10,000; add second SIM carrier where weak; quarterly drills; publish SLOs & monthly “OOB health” reports.

15) What to avoid

  • Shared local admin passwords.
  • Unmanaged modem/APN chaos.
  • NAT hairpins through production.
  • Per-site snowflake configs.
  • “Manual” provisioning.
  • Unmanaged SMS control.
  • No photo evidence of antenna/install.

16) Tie-in: how GridSite helps (light touch)

  • Selection & design: choose routers, antennas, SIM/eSIM policies, private APN/VPN designs; solidify the VRF/ACL pattern.
  • Ordering & provisioning at scale: kitting, SIM activation, ZTP pipelines, photo-verified installs, and inventory systems that won’t rot.
  • Headends & bastions: build/operate regional hubs, anycast bastions, PAM, logging, and DR.
  • Operations: 24×7 monitoring, change rings, firmware cadences, and quarterly drills—plus remote-hands dispatch through the marketplace when a truck roll really is needed.
  • Integrations: wire OOB alarms to your ITSM, CMDB, SIEM; keep audit & compliance happy.

Bottom line: The best OOB is invisible until you need it—and utterly dependable when you do. Build it like a product, automate it like software, and measure it like an SLO.