Security
Technical Guide
January 2025 · 18 min read

Confidential Inference on Shared Edge Hardware

If your model, inputs, or outputs would embarrass you on a billboard, they shouldn’t sit unencrypted in someone else’s RAM. Confidential inference makes “data in use” as private as data at rest—even on shared edge boxes.

What this article covers
  • A crisp threat model for multi-tenant edge compute
  • The building blocks: CPU TEEs (AMD SEV-SNP, Intel TDX, Arm CCA), GPU isolation, signed artifacts, remote attestation
  • A reference architecture and protocol flow for confidential inference
  • Kubernetes patterns (Confidential Containers, GPU isolation/MIG)
  • Performance, side-channel considerations, ops runbook
  • How this maps to the Availability Standard (CAM) so you can certify resiliency

1) Threat model: what you’re defending against

Setting: You run inference on shared edge hardware—micro-DC racks, MEC sites, retail pods—where other tenants or operators exist.

Adversaries & risks

  • Malicious co-tenant: attempts DMA reads, VRAM scraping, cache snooping, or kernel exploits.
  • Curious operator: has root on the host, can read plain RAM, attach a debugger, dump PCIe traffic.
  • Network observer: sniffs prompts or outputs in flight.
  • Supply-chain risk: modified container image/model artifact.
  • Forensic residue: secrets or prompts leaking via logs, crash dumps, swap.

Goal: Make model weights, prompts, and outputs opaque to host & co-tenants; provide verifiable attestation to clients that the right code is running in the right environment before they send secrets.

2) Building blocks (toolbox you’ll actually use)

CPU Trusted Execution Environments (TEEs)

AMD SEV-SNP and Intel TDX (and Arm CCA on Arm edge silicon) protect guest memory from the host/hypervisor and produce attestation reports (signed measurements of firmware, boot state, and policy). Run your inference service inside a confidential VM (CVM), or in a “confidential container” that runs on a TEE VM under the hood.

GPU isolation & confidential modes

  • GPU pass-through to a TEE VM or exclusive MIG slice (for NVIDIA) to prevent co-resident sharing of artifacts.
  • Where available, enable confidential computing features for GPU: measured firmware, encrypted command buffers and memory channels (established via SPDM secure session between TEE VM and device), and protected VRAM regions that the host cannot read.
  • If the platform lacks device-side CC, pair pass-through with IOMMU + PCIe DMA protections and avoid multi-tenant MIG on sensitive jobs.

Remote attestation

A Remote Attestation Service (RAS) validates TEE reports and issues a short-lived attestation token (JWT/COSE) with claims like:

  • TEE type and TCB version
  • Measured boot hash / PCR values
  • Policy ID (which images/models are allowed)
  • Optional GPU device identity & confidential-compute mode flag
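Checking those claims against a release policy is only a few lines. A minimal sketch in Python, where the claim names (`tee`, `tcb`, `policy_id`) mirror the token contents above but are illustrative, and the `ALLOWED` table stands in for a real RAS policy:

```python
# Claim names and the ALLOWED policy table are illustrative assumptions,
# not a real RAS schema.
ALLOWED = {
    "tee": {"SEV-SNP", "TDX", "CCA"},
    "min_tcb": (3, 7),
    "policy_ids": {"edge-inference-v2"},
}

def claims_ok(claims: dict) -> bool:
    # Compare the TCB as a dotted tuple so "3.10" sorts above "3.9".
    tcb = tuple(int(x) for x in claims["tcb"].split("."))
    return (
        claims["tee"] in ALLOWED["tee"]
        and tcb >= ALLOWED["min_tcb"]
        and claims["policy_id"] in ALLOWED["policy_ids"]
    )
```

The tuple comparison matters: string comparison would call TCB "3.10" older than "3.9".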

Key release & sealing

A Key Broker Service (KBS) releases model decryption keys only when the attestation token passes policy. Weights and secrets are sealed (enveloped) to policy so they decrypt only inside the approved TEE on a machine with the right measurements.
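The “sealed to policy, not machine” idea can be illustrated with stdlib primitives. This is a toy sketch only (HMAC as a KDF, XOR as the wrap; real KBS deployments use HPKE or TEE-bound asymmetric keys): the wrapping key is derived from the approved measurement, so presenting a different measurement unwraps to garbage.

```python
import hashlib, hmac

# Toy envelope sealing: NOT production crypto. The wrapping key is bound
# to the approved measurement; model_key must be <= 32 bytes here.
def wrap_key(master: bytes, measurement: bytes, model_key: bytes) -> bytes:
    kek = hmac.digest(master, measurement, hashlib.sha256)
    return bytes(a ^ b for a, b in zip(model_key, kek))

def unwrap_key(master: bytes, measurement: bytes, wrapped: bytes) -> bytes:
    # Same derivation; a wrong measurement yields a wrong KEK, hence garbage.
    kek = hmac.digest(master, measurement, hashlib.sha256)
    return bytes(a ^ b for a, b in zip(wrapped, kek))
```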

Signed everything

  • Container images signed (Sigstore/COSIGN).
  • Model artifacts signed (same).
  • Policies signed (OPA bundle signatures). Admission controllers verify before admit.

Transport

mTLS everywhere (service mesh or Envoy). Prefer HTTP/3 (QUIC) at the edge; it degrades more gracefully over lossy links.

3) Reference architecture: confidential inference on a shared edge node

Architecture (ASCII)
                                    +---------------- Core / KMS / RAS ----------------+
                                    | 1) Verify attestation; 2) Release keys; 3) Audit |
                                    +------------------------▲-------------------------+
                                                             |  attestation + key unwrap (mTLS)
                                                             |
    Users ── mTLS/QUIC ─▶  [Anycast Ingress (Envoy)] ─▶ [Edge Node]
                                               ┌───────────────────────────┐
                                               │  Confidential VM (CVM)    │  <= SEV-SNP / TDX / CCA
                                               │  ───────────────────────  │
                                               │  Inference Service        │
                                               │  (Triton / vLLM / TGI)    │
                                               │      ▲          ▲         │
                                               │      |          |         │
                                               │   sealed   signed model   │
                                               │    keys    weights        │
                                               │      |          |         │
                                               │   (KBS)    (object store) │
                                               └──────┼──────────┼─────────┘
                                                      │ SPDM-secure channel
                                               ┌──────▼────────────────────┐
                                               │        GPU Device         │
                                               │ (exclusive or MIG slice)  │
                                               └───────────────────────────┘

High level: Users hit an Anycast IP; L7 pins sessions. Inside a TEE VM, your inference server attests to RAS, gets a token, and unseals model keys from KBS. It then establishes a secure device session (SPDM) with the GPU, loads the decrypted weights into protected memory, and serves traffic. Host root can’t see prompts, keys, or weights.

4) Protocol flow (step-by-step)

  1. Measured boot: Platform boots with secure/verified boot; TEE firmware creates attestation evidence.
  2. TEE provisioning: Orchestrator schedules a confidential VM; the image digest is known and signed.
  3. Remote attestation: The CVM requests an attestation token from RAS; RAS verifies the quote (SNP/TDX/CCA report) and issues a short-lived JWT/COSE with claims: tee=SEV-SNP, tcb>=X.Y, image_digest=…, policy_id=….
  4. Key release: CVM presents the token to KBS; if the claims satisfy policy (right image, right TCB, right region), KBS releases the model key (wrapped to the TEE).
  5. Artifact verification: CVM pulls model weights from the object store, verifies the COSIGN signature against the transparency log, and decrypts inside the TEE VM.
  6. GPU secure session: CVM establishes SPDM with GPU firmware and enables protected memory. Command buffers and PCIe/NVLink channels are encrypted/authenticated.
  7. Serve: L7 proxy pins streams; clients may request the attestation token (or a derivative) from the service to verify before sending sensitive prompts.
  8. Rotate & revoke: Tokens and keys are short-lived (minutes). On TCB downgrade/CVE, RAS stops issuing tokens; running pods fail closed and roll to a safe image.
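The attestation and key-release portion of this flow, including the fail-closed behavior on a stale TCB, can be sketched with the RAS and KBS round-trips abstracted as injected callables (claim names illustrative, not a real token schema):

```python
# "attest" and "release" stand in for the RAS and KBS round-trips; the
# "tcb_ok" claim name is an illustrative assumption.
def unseal_model_key(attest, release, quote: bytes) -> bytes:
    token = attest(quote)                     # quote -> short-lived token
    if not token.get("tcb_ok"):               # stale TCB: fail closed
        raise RuntimeError("attestation rejected; refusing to fetch keys")
    return release(token)                     # token -> policy-gated model key
```

The point of the structure: key material is only ever reachable through the attestation gate, so revocation at RAS starves running workloads of keys automatically.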

5) Kubernetes deployment patterns (what to actually configure)

Confidential Containers (CoCo / Kata):

  • Node label/taint: confidential-compute=true.
  • RuntimeClass: kata-qemu-tdx or kata-qemu-sev (names vary by distro).
  • Ensure containerd/CRI-O configured for CC runtime.

GPU isolation

  • Use exclusive GPU or exclusive MIG instance per confidential pod.
  • Disable peer-to-peer if isolation requires it; ensure IOMMU is on.
  • NodeFeatureDiscovery can advertise gpu.cc=true so the scheduler picks capable nodes.

Admission control

Gate workloads on COSIGN verification and RuntimeClass. OPA policy ensures only signed images/models deploy to CC nodes.

OPA policy sketch (admission)
package admission

# verify_cosign is assumed to be supplied by your admission controller
# (e.g. an external-data builtin); it is not part of core Rego.
default allow = false

allow {
  input.request.kind.kind == "Pod"
  input.request.object.spec.runtimeClassName == "kata-cc"
  verify_cosign(input.request.object.spec.containers[_].image)
  input.request.object.metadata.labels["confidential"] == "true"
}

6) Client verification: “don’t send secrets unless it’s really a TEE”

Expose a /attest endpoint that returns a signed token binding together:

  • the TEE attestation claims (a subset),
  • the public key the service uses for mTLS,
  • the model digest & policy ID.

Clients verify the signature chain to the RAS root, token freshness, the claims, and that the mTLS key matches the session’s peer cert.

Client pseudo-code
import requests

tok = requests.get("https://edge.example.com/attest").json()
assert verify_signature(tok, RAS_root)                 # chain to RAS root; check freshness
claims = tok["claims"]
assert claims["tcb_ok"] and claims["policy_id"] in ALLOWED_POLICIES
assert cert_pubkey(session) == claims["mtls_pubkey"]   # token binds the session's TLS key

# OK to send sensitive prompt
resp = requests.post(
    "https://edge.example.com/v1/chat",
    json=prompt,
    headers={"Attest": tok["jws"]},
)

7) Performance notes (what to tell your SREs and CFO)

  • CPU TEE overhead: often single-digit to low-teens percent on CPU-bound paths; GPU-bound inference usually unaffected in throughput.
  • Encrypted device/channel overhead: microseconds per call; negligible versus token generation time.
  • Pinned memory & NUMA: CVMs change topology—pin IRQs and ensure locality.
  • Rule of thumb: TTFT may rise slightly; TBT (time between tokens) is usually unchanged. Benchmark, don’t guess.
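Measuring this is straightforward once you log token-arrival timestamps per request. A small helper, assuming arrival times in seconds relative to request send:

```python
# The first arrival is TTFT; gaps between subsequent arrivals give the
# mean time between tokens (TBT).
def latency_stats(arrivals: list[float]) -> dict:
    ttft = arrivals[0]
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "mean_tbt": mean_tbt}
```

Track both before and after enabling the TEE: a CVM typically nudges TTFT (extra CPU work on the prefill path) while leaving the GPU-bound decode cadence alone.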

8) Side-channels & co-tenancy: how to not get cute

  • No shared MIG for highly sensitive jobs; allocate exclusive GPU.
  • Disable guest-visible perf counters and mitigate clock/frequency-throttling side-channels where the platform supports it.
  • Constant-time crypto in the enclave; avoid logging high-res timing tied to secrets.
  • Rate-limit externally visible timing to reduce information leaks.
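One simple way to implement the last point is to quantize response timing: pad every externally visible completion to the next multiple of a fixed quantum, so fine-grained timing tied to secrets rounds away. A sketch (the 50 ms quantum is an illustrative choice, not a recommendation):

```python
import math

QUANTUM = 0.050  # seconds; illustrative, tune against your latency budget

def quantize_delay(elapsed: float, quantum: float = QUANTUM) -> float:
    """Extra sleep needed so total visible time is a multiple of quantum."""
    return math.ceil(elapsed / quantum) * quantum - elapsed
```

Call it with the measured handler time and `time.sleep()` the result before responding; the coarser the quantum, the less an observer learns per request.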

9) Secrets, logging, and forensics (boring and critical)

  • Keep keys ephemeral (minutes). Seal long-lived secrets to policy, not machines.
  • No core dumps on CC nodes; crash → sealed telemetry only.
  • Logs: redact prompts & PII at source; export aggregates only.
  • Encrypted snapshots with strict access control; never export live RAM.
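Redaction at source can be as blunt as dropping the prompt field outright and scrubbing obvious PII patterns before export. A minimal sketch (the `prompt` key and the single email regex are illustrative, not a complete PII taxonomy):

```python
import re

# Illustrative: one pattern, one dropped field. Real deployments layer
# more patterns and allow-list what may leave the node.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(record: dict) -> dict:
    out = {k: v for k, v in record.items() if k != "prompt"}
    return {k: EMAIL.sub("[email]", v) if isinstance(v, str) else v
            for k, v in out.items()}
```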

10) Incident response playbook (pin this to the wall)

  • CVE in TEE firmware: RAS flips policy to require patched TCB; key release stops; orchestrator drains pods; rollout patched image; re-attest; reinstate keys.
  • Compromised operator account: Host root sees nothing; rotate RAS/KBS creds; audit failures.
  • Model leak suspicion: Rotate model key & digest; invalidate old sealed weights; move to new policy ID; re-sign.

11) CAM mapping: what this does for your resiliency score

Confidential inference primarily lifts I-CTRL and I-DATA posture and helps with I-NWK (mTLS & identity).

  • I-CTRL +1–2: Signed artifacts, attestation gates, independent release keys; break-glass still offline.
  • I-DATA +1: Sealed model weights, immutable backups, key release policy tied to TEE reports.
  • I-NWK +1: Mandatory mTLS, SPIFFE identities; optional QUIC.
  • I-PWR / I-COOL: No change (still plan UPS/autonomy right).

For an A2/A3 workload with network & data already strong, confidential inference often keeps you at CAM Tier 3 while shifting sensitive tenants onto shared edge iron safely—lowering cost without lowering tier.

12) Minimal example: vLLM inside a confidential VM (conceptual)

Kubernetes Pod (abridged)
apiVersion: v1
kind: Pod
metadata:
  name: vllm-cc
  labels:
    confidential: "true"
spec:
  runtimeClassName: kata-cc           # maps to SEV-SNP/TDX CVM
  nodeSelector:
    gpu.cc: "true"                    # only schedule where GPU CC is available
  containers:
  - name: vllm
    image: ghcr.io/yourorg/vllm:1.0
    args: ["--model", "TheOrg/awesome-13b", "--enable-chunked-prefill"]
    env:
      - name: ATTEST_URL
        value: https://ras.example.com/verify
      - name: KBS_URL
        value: https://kbs.example.com/release
    volumeMounts:
      - name: weights
        mountPath: /models
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
    - name: weights
      csi:
        driver: csi.s3.object
        volumeAttributes:
          bucket: "models"
          readonly: "true"
Startup (inside CVM, pseudo-bash)
# 1. Attest
TOK=$(curl -s --cert client.pem --key client.key $ATTEST_URL)

# 2. Verify local image digest & policy id match token claims
verify_local_measurements "$TOK" || exit 1

# 3. Request key
KEY=$(curl -s -H "Authorization: Bearer $TOK" $KBS_URL)

# 4. Verify signature, then decrypt (openssl enc has no AEAD/GCM mode:
#    use CTR and rely on the cosign signature for integrity; $IV is
#    delivered alongside $KEY by the KBS)
cosign verify-blob --key cosign.pub --signature /models/weights.sig /models/weights.enc || exit 1
openssl enc -d -aes-256-ctr -K "$KEY" -iv "$IV" -in /models/weights.enc -out /models/weights.bin

# 5. Establish GPU secure session (driver/tooling specific)
enable_gpu_confidential_mode || echo "fallback: exclusive MIG only"

# 6. Launch server
exec python -m vllm.entrypoints.openai.api_server --model /models/weights.bin "$@"

13) Compliance notes

  • PCI / HIPAA / GDPR: Confidential inference reduces audit scope by protecting data in use; still enforce data minimization and differential privacy where appropriate.
  • Audit evidence: Keep RAS decision logs, KBS key-release logs, COSIGN verification logs, and attestation tokens (hashed) per deployment.
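Hashing the attestation tokens for the audit trail is worth getting right: store the digest plus decision metadata, never the raw credential. A sketch with illustrative field names:

```python
import hashlib, time

# Field names are illustrative. The hash is enough to match a deployment
# to a RAS decision later without retaining a replayable credential.
def audit_record(token_jws: str, decision: str, policy_id: str) -> dict:
    return {
        "ts": time.time(),
        "token_sha256": hashlib.sha256(token_jws.encode()).hexdigest(),
        "decision": decision,
        "policy_id": policy_id,
    }
```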

14) Checklist (laminate this)

  • Pick TEE (SEV-SNP / TDX / CCA) and verify cloud/edge hardware support
  • Stand up RAS and KBS (or use a managed equivalent)
  • Convert models to signed+sealed artifacts
  • Enable Confidential Containers runtime & node labels
  • Enforce COSIGN and OPA gates in CI/CD
  • Configure exclusive GPU or MIG with IOMMU; enable GPU CC features if available
  • Implement /attest endpoint & client verification
  • Benchmark TTFT/TBT and tune batching; monitor p95/p99
  • Drill revocation/rotation (firmware, keys, models)
  • Map design to CAM pillars and certify Tier

Takeaway

Shared edge hardware doesn’t have to mean shared secrets. Put inference in a TEE, sign and seal your models, attest before you decrypt, and—where supported—lock the GPU path with device-level confidentiality. You’ll keep your prompts private, your models valuable, and your auditors happy—without giving up the economics of the edge.