- A crisp threat model for multi-tenant edge compute
- The building blocks: CPU TEEs (AMD SEV-SNP, Intel TDX, Arm CCA), GPU isolation, signed artifacts, remote attestation
- A reference architecture and protocol flow for confidential inference
- Kubernetes patterns (Confidential Containers, GPU isolation/MIG)
- Performance, side-channel considerations, ops runbook
- How this maps to the Availability Standard (CAM) so you can certify resiliency
1) Threat model: what you’re defending against
Setting: You run inference on shared edge hardware—micro-DC racks, MEC sites, retail pods—where other tenants or operators exist.
Adversaries & risks
- Malicious co-tenant: attempts DMA reads, VRAM scraping, cache snooping, or kernel exploits.
- Curious operator: has root on the host, can read plain RAM, attach a debugger, dump PCIe traffic.
- Network observer: sniffs prompts or outputs in flight.
- Supply-chain risk: modified container image/model artifact.
- Forensic residue: secrets or prompts leaking via logs, crash dumps, swap.
Goal: Make model weights, prompts, and outputs opaque to host & co-tenants; provide verifiable attestation to clients that the right code is running in the right environment before they send secrets.
2) Building blocks (toolbox you’ll actually use)
CPU Trusted Execution Environments (TEEs)
AMD SEV-SNP and Intel TDX (and Arm CCA on Arm edge silicon) protect guest memory from the host/hypervisor; produce attestation reports (signed measurements of firmware/boot & policy). Run your inference service inside a confidential VM (CVM) or “confidential container” that leverages a TEE VM under the hood.
GPU isolation & confidential modes
- GPU pass-through to a TEE VM or exclusive MIG slice (for NVIDIA) to prevent co-resident sharing of artifacts.
- Where available, enable confidential computing features for GPU: measured firmware, encrypted command buffers and memory channels (established via SPDM secure session between TEE VM and device), and protected VRAM regions that the host cannot read.
- If the platform lacks device-side CC, pair pass-through with IOMMU + PCIe DMA protections and avoid multi-tenant MIG on sensitive jobs.
Remote attestation
A Remote Attestation Service (RAS) validates TEE reports and issues a short-lived attestation token (JWT/COSE) with claims like the following (a decoded example appears after this list):
- TEE type and TCB version
- Measured boot hash / PCR values
- Policy ID (which images/models are allowed)
- Optional GPU device identity & confidential-compute mode flag
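For concreteness, a decoded token payload might look like the sketch below; the claim names are illustrative, since each RAS defines its own schema.

# Illustrative decoded attestation-token payload (claim names are
# hypothetical; every RAS defines its own schema).
example_claims = {
    "tee": "SEV-SNP",                  # TEE type
    "tcb_version": "1.55",             # reported TCB level
    "measurement": "sha384:9f2c...",   # measured boot hash
    "policy_id": "inference-prod-v3",  # which images/models are allowed
    "gpu_cc": True,                    # GPU confidential-compute mode engaged
    "exp": 1735689600,                 # short expiry: minutes, not days
}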
Key release & sealing
A Key Broker Service (KBS) releases model decryption keys only when the attestation token passes policy. Weights and secrets are sealed (enveloped) to policy so they decrypt only inside the approved TEE on a machine with the right measurements.
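A minimal sketch of the KBS decision, assuming the token's signature has already been verified; the policy table and claim names here are illustrative, not a real KBS API.

# Minimal KBS release check (illustrative): release the wrapped model key
# only when verified attestation claims satisfy the sealed-to policy.
import time

POLICY = {
    "inference-prod-v3": {
        "allowed_images": {"sha256:abc123"},
        "min_tcb": (1, 55),
        "regions": {"us-east-edge"},
    }
}

def release_key(claims: dict, wrapped_keys: dict) -> bytes:
    pol = POLICY[claims["policy_id"]]
    assert claims["exp"] > time.time(), "token expired"
    assert claims["image_digest"] in pol["allowed_images"]
    assert tuple(claims["tcb"]) >= pol["min_tcb"]
    assert claims["region"] in pol["regions"]
    # the key is wrapped to the TEE's public key, so only that CVM can unwrap it
    return wrapped_keys[claims["policy_id"]]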
Signed everything
- Container images signed (Sigstore/COSIGN).
- Model artifacts signed (same; a signing/verification sketch follows this list).
- Policies signed (OPA bundle signatures). Admission controllers verify before admit.
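As a concrete sketch of the model-artifact half, signing and verifying a blob with the cosign CLI from CI; the key and path names are illustrative.

# Sign and verify a model artifact with the cosign CLI (paths illustrative).
import subprocess

def sign_model(path: str) -> None:
    # writes a detached signature next to the artifact
    subprocess.run(
        ["cosign", "sign-blob", "--key", "cosign.key",
         "--output-signature", path + ".sig", path],
        check=True,
    )

def verify_model(path: str) -> None:
    subprocess.run(
        ["cosign", "verify-blob", "--key", "cosign.pub",
         "--signature", path + ".sig", path],
        check=True,
    )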
Transport
mTLS everywhere (service mesh or Envoy). Prefer HTTP/3/QUIC at the edge; it holds up better over lossy links.
3) Reference architecture: confidential inference on a shared edge node
+---------------- Core / KMS / RAS ----------------+
| 1) Verify attestation; 2) Release keys; 3) Audit |
+---------------------▲-----------------------------+
| attestation + key unwrap (mTLS)
|
Users ── mTLS/QUIC ─▶ [Anycast Ingress (Envoy)] ─▶ [Edge Node]
┌───────────────────────────┐
│ Confidential VM (CVM) │ <= SEV-SNP / TDX / CCA
│ ─────────────────────── │
│ Inference Service │
│ (Triton / vLLM / TGI) │
│ ▲ ▲ │
│ | | │
│ sealed signed model │
│ keys weights │
│ | | │
│ (KBS) (object store) │
└──────┼──────────┼─────────┘
│ SPDM-secure channel
┌──────▼────────────────────┐
│ GPU Device │
│ (exclusive or MIG slice) │
└───────────────────────────┘
High level: Users hit an Anycast IP; L7 pins sessions. Inside a TEE VM, your inference server attests to RAS, gets a token, and unseals model keys from KBS. It then establishes a secure device session (SPDM) with the GPU, loads the decrypted weights into protected memory, and serves traffic. Host root can’t see prompts, keys, or weights.
4) Protocol flow (step-by-step)
- Measured boot: Platform boots with secure/verified boot; TEE firmware creates attestation evidence.
- TEE provisioning: Orchestrator schedules a confidential VM; the image digest is known and signed.
- Remote attestation: The CVM requests an attestation token from RAS; RAS verifies the quote (SNP/TDX/CCA report) and issues a short-lived JWT/COSE with claims: tee=SEV-SNP, tcb>=X.Y, image_digest=…, policy_id=… (the RAS side is sketched after this list).
- Key release: CVM presents token to KBS; if claims satisfy policy (right image, right TCB, right region), KBS releases model key (wrapped to the TEE).
- Artifact verification: CVM pulls model weights from object store; verifies COSIGN signature against transparency log; decrypts inside the TEE VM.
- GPU secure session: CVM establishes SPDM with GPU firmware; enables protected memory. Command buffers and PCIe/NVLink channels are encrypted/authenticated.
- Serve: L7 proxy pins streams; client may request the attestation token (or a derivative) from the service to verify before sending sensitive prompts.
- Rotate & revoke: Tokens and keys are short-lived (minutes). On TCB downgrade/CVE, RAS stops issuing tokens; running pods fail closed and roll to safe image.
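To make steps 3 and 4 concrete, here is a minimal sketch of the RAS side. verify_quote, MIN_TCB, and lookup_policy are hypothetical stand-ins for real report verification against the vendor certificate chain (e.g., AMD's VCEK chain for SNP); token issuance uses PyJWT.

# RAS sketch: verify the TEE quote, then issue a short-lived JWT.
# verify_quote(), MIN_TCB, and lookup_policy() are hypothetical; real
# verification walks the vendor cert chain and checks the report signature.
import time
import jwt  # PyJWT

TOKEN_TTL = 300  # seconds; keep tokens short-lived

def issue_token(quote: bytes, signing_key: str) -> str:
    report = verify_quote(quote)  # raises if the quote is invalid
    if report.tcb_version < MIN_TCB:
        raise PermissionError("TCB below policy minimum; refuse to attest")
    claims = {
        "tee": report.tee_type,
        "image_digest": report.measurement,
        "policy_id": lookup_policy(report),
        "iat": int(time.time()),
        "exp": int(time.time()) + TOKEN_TTL,
    }
    return jwt.encode(claims, signing_key, algorithm="ES256")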
5) Kubernetes deployment patterns (what to actually configure)
Confidential Containers (CoCo / Kata):
- Node label/taint: confidential-compute=true.
- RuntimeClass: kata-qemu-tdx or kata-qemu-sev (names vary by distro).
- Ensure containerd/CRI-O configured for CC runtime.
GPU isolation
- Use exclusive GPU or exclusive MIG instance per confidential pod.
- Disable peer-to-peer if isolation requires it; ensure IOMMU is on.
- NodeFeatureDiscovery can advertise gpu.cc=true so the scheduler picks capable nodes.
Admission control
Gate workloads on COSIGN verification and RuntimeClass. OPA policy ensures only signed images/models deploy to CC nodes.
package admission

default allow = false

allow {
  input.request.kind.kind == "Pod"
  input.request.object.spec.runtimeClassName == "kata-cc"
  # verify_cosign is not a Rego built-in; assume an external plugin or data hook
  verify_cosign(input.request.object.spec.containers[_].image)
  input.request.object.metadata.labels["confidential"] == "true"
}
6) Client verification: “don’t send secrets unless it’s really a TEE”
Expose a /attest endpoint that returns a signed token binding together: a subset of the TEE attestation claims, the public key the service uses for mTLS, and the model digest & policy ID. Clients verify the signature chain to the RAS root, freshness, the claims, and that the mTLS key matches the session's peer cert. A client-side check looks like the snippet below; a server-side sketch follows it.
import requests  # verify_signature, cert_pubkey, session, RAS_root are illustrative

tok = requests.get("https://edge.example.com/attest").json()
assert verify_signature(tok, RAS_root)                 # chain to RAS root + freshness
claims = tok["claims"]
assert claims["tcb_ok"] and claims["policy_id"] in ALLOWED_POLICIES
assert cert_pubkey(session) == claims["mtls_pubkey"]   # bind token to this TLS session
# OK to send sensitive prompt
resp = requests.post("https://edge.example.com/v1/chat", json=prompt,
                     headers={"Attest": tok["jws"]})
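On the service side, the /attest handler can stay small. This sketch assumes a FastAPI app and a hypothetical get_attestation_token() helper that returns the cached RAS token, refreshed before expiry.

# Sketch of the service-side /attest endpoint; get_attestation_token()
# is a hypothetical helper returning the cached, still-fresh RAS token.
from fastapi import FastAPI

app = FastAPI()

@app.get("/attest")
def attest():
    tok = get_attestation_token()
    # the token's claims already bind the service's mTLS public key,
    # so clients can match it against the live session's peer cert
    return {"jws": tok.compact, "claims": tok.claims}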
7) Performance notes (what to tell your SREs and CFO)
- CPU TEE overhead: often single-digit to low-teens percent on CPU-bound paths; GPU-bound inference usually unaffected in throughput.
- Encrypted device/channel overhead: microseconds per call; negligible versus token generation time.
- Pinned memory & NUMA: CVMs change topology—pin IRQs and ensure locality.
- Rule of thumb: TTFT may rise slightly; TBTT usually unchanged. Benchmark, don’t guess.
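A quick way to measure both numbers against a streaming endpoint; the URL and payload are illustrative.

# Measure time-to-first-token (TTFT) and mean time-between-tokens (TBTT)
# against a streaming endpoint; URL and payload are illustrative.
import time
import requests

def measure(url: str, payload: dict) -> tuple[float, float]:
    t0 = time.perf_counter()
    ttft, gaps, last = float("nan"), [], None
    with requests.post(url, json=payload, stream=True) as resp:
        for chunk in resp.iter_lines():
            if not chunk:
                continue
            now = time.perf_counter()
            if last is None:
                ttft = now - t0          # first token arrived
            else:
                gaps.append(now - last)  # inter-token gap
            last = now
    return ttft, sum(gaps) / max(len(gaps), 1)

ttft, tbtt = measure("https://edge.example.com/v1/chat", {"prompt": "ping"})
print(f"TTFT {ttft*1000:.1f} ms, TBTT {tbtt*1000:.1f} ms")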
8) Side-channels & co-tenancy: how to not get cute
- No shared MIG for highly sensitive jobs; allocate exclusive GPU.
- Disable unprivileged perf counters, and mitigate clock/frequency-throttling side-channels where supported.
- Constant-time crypto in the enclave; avoid logging high-res timing tied to secrets.
- Rate-limit externally visible timing to reduce information leaks.
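One cheap implementation of that last point: quantize externally visible latency to a coarse grid, so fine-grained timing carries less signal. A minimal sketch:

# Round response latency up to a coarse bucket boundary so externally
# observable timing leaks less about secret-dependent work.
import math
import time

BUCKET = 0.050  # seconds; the throughput-vs-leakage knob

def serve_quantized(handler, request):
    t0 = time.perf_counter()
    response = handler(request)
    elapsed = time.perf_counter() - t0
    time.sleep(math.ceil(elapsed / BUCKET) * BUCKET - elapsed)
    return response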
9) Secrets, logging, and forensics (boring and critical)
- Keep keys ephemeral (minutes). Seal long-lived secrets to policy, not machines.
- No core dumps on CC nodes; crash → sealed telemetry only.
- Logs: redact prompts & PII at source; export aggregates only (a redaction sketch follows this list).
- Encrypted snapshots with strict access control; never export live RAM.
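Redaction at source can be a plain logging filter; the pattern below is illustrative and should be extended to your own PII taxonomy.

# Scrub simple PII patterns before a log record leaves the process;
# the regex set here is illustrative, not exhaustive.
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class RedactFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = EMAIL.sub("[email]", record.getMessage())
        record.msg, record.args = msg, None
        return True

logging.getLogger().addFilter(RedactFilter())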
10) Incident response playbook (pin this to the wall)
- CVE in TEE firmware: RAS flips policy to require patched TCB; key release stops; orchestrator drains pods; roll out the patched image; re-attest; reinstate keys.
- Compromised operator account: Host root sees nothing; rotate RAS/KBS creds; audit failures.
- Model leak suspicion: Rotate model key & digest; invalidate old sealed weights; move to new policy ID; re-sign.
11) CAM mapping: what this does for your resiliency score
Confidential inference primarily lifts I-CTRL and I-DATA posture and helps with I-NWK (mTLS & identity).
- I-CTRL +1–2: Signed artifacts, attestation gates, independent release keys; break-glass still offline.
- I-DATA +1: Sealed model weights, immutable backups, key release policy tied to TEE reports.
- I-NWK +1: Mandatory mTLS, SPIFFE identities; optional QUIC.
- I-PWR / I-COOL: No change (still plan UPS/autonomy right).
For an A2/A3 workload with network & data already strong, confidential inference often keeps you at CAM Tier 3 while shifting sensitive tenants onto shared edge iron safely—lowering cost without lowering tier.
12) Minimal example: vLLM inside a confidential VM (conceptual)
apiVersion: v1
kind: Pod
metadata:
  name: vllm-cc
  labels:
    confidential: "true"
spec:
  runtimeClassName: kata-cc        # maps to SEV-SNP/TDX CVM
  nodeSelector:
    gpu.cc: "true"                 # only schedule where GPU CC is available
  containers:
  - name: vllm
    image: ghcr.io/yourorg/vllm:1.0
    args: ["--model", "TheOrg/awesome-13b", "--enable-chunked-prefill"]
    env:
    - name: ATTEST_URL
      value: https://ras.example.com/verify
    - name: KBS_URL
      value: https://kbs.example.com/release
    volumeMounts:
    - name: weights
      mountPath: /models
    resources:
      limits:
        nvidia.com/gpu: 1
  volumes:
  - name: weights
    csi:
      driver: csi.s3.object
      volumeAttributes:
        bucket: "models"
        readonly: "true"
The pod's entrypoint then runs a conceptual startup script; helper names like verify_local_measurements and decrypt_gcm are placeholders for your own tooling.
# 1. Attest
TOK=$(curl -s --cert client.pem --key client.key "$ATTEST_URL")
# 2. Verify local image digest & policy id match token claims
verify_local_measurements "$TOK" || exit 1
# 3. Request key (released only if the token satisfies KBS policy)
KEY=$(curl -s -H "Authorization: Bearer $TOK" "$KBS_URL")
# 4. Verify signature, then decrypt. Note: `openssl enc` does not support
#    AEAD modes like GCM, so use a small helper for the unwrap
#    (a Python sketch of decrypt_gcm follows this script).
cosign verify-blob --key cosign.pub --signature /models/weights.sig /models/weights.enc || exit 1
decrypt_gcm "$KEY" /models/weights.enc /models/weights.bin || exit 1
# 5. Establish GPU secure session (driver/tooling specific)
enable_gpu_confidential_mode || echo "fallback: exclusive MIG only"
# 6. Launch server
exec python -m vllm.entrypoints.openai.api_server --model /models/weights.bin "$@"
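The decrypt_gcm helper above is not a stock tool; here is a minimal Python sketch using the cryptography library, assuming a 12-byte nonce is prepended to the ciphertext.

# Sketch of the hypothetical decrypt_gcm helper: AES-256-GCM unwrap with
# the `cryptography` library; assumes a 12-byte nonce prefixes the blob.
import sys
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def decrypt_gcm(key_hex: str, src: str, dst: str) -> None:
    blob = open(src, "rb").read()
    nonce, ciphertext = blob[:12], blob[12:]
    plaintext = AESGCM(bytes.fromhex(key_hex)).decrypt(nonce, ciphertext, None)
    with open(dst, "wb") as f:
        f.write(plaintext)

if __name__ == "__main__":
    decrypt_gcm(*sys.argv[1:4])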
13) Compliance notes
- PCI / HIPAA / GDPR: Confidential inference reduces scope by protecting data-in-use; still enforce data minimization and DP where appropriate.
- Audit evidence: Keep RAS decision logs, KBS key-release logs, COSIGN verification logs, and attestation tokens (hashed) per deployment.
14) Checklist (laminate this)
- Pick TEE (SEV-SNP / TDX / CCA) and verify cloud/edge hardware support
- Stand up RAS and KBS (or use a managed equivalent)
- Convert models to signed+sealed artifacts
- Enable Confidential Containers runtime & node labels
- Enforce COSIGN and OPA gates in CI/CD
- Configure exclusive GPU or MIG with IOMMU; enable GPU CC features if available
- Implement /attest endpoint & client verification
- Benchmark TTFT/TBTT and tune batching; monitor p95/p99
- Drill revocation/rotation (firmware, keys, models)
- Map design to CAM pillars and certify Tier
Takeaway
Shared edge hardware doesn’t have to mean shared secrets. Put inference in a TEE, sign and seal your models, attest before you decrypt, and—where supported—lock the GPU path with device-level confidentiality. You’ll keep your prompts private, your models valuable, and your auditors happy—without giving up the economics of the edge.