Day 5 · Student handout

Docker, K8s, GPU sizing, and vLLM

Learners connect containers, deployment manifests, model weights, KV cache, context length, concurrency, VRAM, and p50/p95 latency.

June 2026 · Canonical in the 7-day tutorial · Full local lesson

Day 5: Docker, K8s, GPU sizing, vLLM

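Today's goal

You should be able to answer, at a beginner level but correctly: how is this system deployed, how many GPUs does it need, how do you estimate that, and how do you verify it?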

Docker

Docker packages a service together with its dependencies and runtime.

Minimal service breakdown:

api-gateway
asr-service
llm-service
tts-service
postgres
redis
observability

You do not have to get everything running in the first week, but the README must clearly state what each service is responsible for.

Kubernetes

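K8s objects you need to know:

| Resource | In plain terms | Use in an AI system |
| --- | --- | --- |
| Pod | Smallest unit that runs containers | inference service pod |
| Deployment | Manages multiple Pods and rolling updates | updating the model server |
| Service | Stable internal entry point | gateway calls the model service |
| Ingress | External HTTP/HTTPS | customer or demo entry point |
| ConfigMap | Non-sensitive configuration | model name, feature flags |
| Secret | Sensitive configuration | API key, DB password |
| PVC | Persistent storage | model cache, logs |
| Resource limits | CPU/memory/GPU requirements | nvidia.com/gpu: 1 |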

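GPUs do not show up in K8s automatically. You usually need a device plugin so that the kubelet knows the node has GPUs and can advertise them as a schedulable resource.

Minimal inference deployment: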
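One way to confirm that the device plugin is working is to look at the node's allocatable resources. The sketch below assumes you already have the node object as a dict, for example parsed from the output of `kubectl get node <node-name> -o json`. The helper name `allocatable_gpus` and the sample nodes are made up for illustration.

```python
def allocatable_gpus(node: dict, resource: str = "nvidia.com/gpu") -> int:
    """Return how many GPUs the node advertises as schedulable (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports extended resource quantities as strings, e.g. "1".
    return int(allocatable.get(resource, "0"))


# Hypothetical node objects, trimmed to the fields this check reads.
gpu_node = {"status": {"allocatable": {"cpu": "16", "nvidia.com/gpu": "1"}}}
cpu_node = {"status": {"allocatable": {"cpu": "16"}}}

print(allocatable_gpus(gpu_node))  # 1
print(allocatable_gpus(cpu_node))  # 0
```

If the count is 0 on a node that physically has a GPU, the device plugin (or the driver it depends on) is the first thing to check.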

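apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-service
  template:
    metadata:
      labels:
        app: vllm-service
    spec:
      containers:
        - name: vllm
          # For real rollouts, pin a specific image tag instead of latest.
          image: vllm/vllm-openai:latest
          # vLLM does not read MODEL_NAME by itself, so pass it to `vllm serve`.
          # The shell expands $MODEL_NAME from the env block below.
          command: ["/bin/sh", "-c"]
          args: ["vllm serve $MODEL_NAME"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              # Only schedulable when the node's GPU device plugin is running.
              nvidia.com/gpu: 1
          env:
            - name: MODEL_NAME
              value: "replace-with-model-name"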

GPU sizing

Don't just say "an A6000 should be enough." Break the estimate down into formulas.

Rough estimate of model weights:

FP16 / BF16: params * 2 bytes
INT8: params * 1 byte + overhead
INT4: params * 0.5 byte + overhead
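The three rules above can be turned into a small calculator. This is a minimal sketch: the function name `weight_gb` and the `overhead_fraction` parameter are assumptions for illustration, and the 10% used in the example is a placeholder, not a measured value. It uses 1 GB = 1e9 bytes.

```python
# Bytes per parameter for each precision, taken from the rules above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str,
              overhead_fraction: float = 0.0) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes).

    overhead_fraction is a hypothetical knob for the "+ overhead" term of
    quantized formats (scales, zero points); measure it for your model.
    """
    raw_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision.lower()]
    return raw_bytes * (1.0 + overhead_fraction) / 1e9


print(weight_gb(7, "fp16"))  # 14.0
print(round(weight_gb(7, "int4", overhead_fraction=0.10), 2))
```

The result goes into the weight_GB column of the sizing spreadsheet.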

Rough KV cache estimate:

KV cache bytes
≈ 2 * num_layers * num_kv_heads * head_dim * context_length * concurrency * bytes_per_element
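The formula above translates directly into code. The leading 2 counts the key tensor and the value tensor. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration; read the real values from your model's config file. The sketch uses 1 GB = 1e9 bytes.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                context_length: int, concurrency: int,
                bytes_per_element: int = 2) -> float:
    """Worst-case KV cache size in GB if every sequence fills the context."""
    total_bytes = (
        2                      # one key tensor + one value tensor
        * num_layers
        * num_kv_heads
        * head_dim
        * context_length       # tokens per sequence
        * concurrency          # sequences held at the same time
        * bytes_per_element    # 2 for FP16/BF16 cache entries
    )
    return total_bytes / 1e9


# Hypothetical model shape, 4096-token context, 4 concurrent requests.
estimate = kv_cache_gb(num_layers=32, num_kv_heads=8, head_dim=128,
                       context_length=4096, concurrency=4)
print(f"{estimate:.2f} GB")  # 2.15 GB
```

Doubling either the context length or the concurrency doubles the estimate, which is why both need to be fixed before you pick a GPU.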

Total VRAM:

total VRAM
≈ model weights
+ KV cache
+ activation / runtime overhead
+ CUDA graph / framework overhead
+ fragmentation buffer
+ safety margin
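A small sketch of the sum above, with one field per term so that no term is silently dropped. The class name `VramEstimate` and every number in the example are placeholders for illustration; replace them with your own estimates and measurements.

```python
from dataclasses import dataclass


@dataclass
class VramEstimate:
    """One field per term of the total VRAM formula, all in GB."""
    weights_gb: float
    kv_cache_gb: float
    runtime_overhead_gb: float    # activations / runtime
    framework_overhead_gb: float  # CUDA graphs / framework
    fragmentation_gb: float
    safety_margin_gb: float

    def total_gb(self) -> float:
        return (self.weights_gb + self.kv_cache_gb
                + self.runtime_overhead_gb + self.framework_overhead_gb
                + self.fragmentation_gb + self.safety_margin_gb)

    def fits(self, gpu_vram_gb: float) -> bool:
        """True if the estimate is within the GPU's VRAM."""
        return self.total_gb() <= gpu_vram_gb


# Placeholder numbers only; they are not measurements of any real model.
example = VramEstimate(weights_gb=3.5, kv_cache_gb=2.1,
                       runtime_overhead_gb=1.0, framework_overhead_gb=1.0,
                       fragmentation_gb=0.5, safety_margin_gb=2.0)
print(f"total = {example.total_gb():.1f} GB, fits in 24 GB: {example.fits(24)}")
```

The `fits` result is still only an estimate; confirm it by loading the model and watching actual memory use under load.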

vLLM settings you need to know:

gpu_memory_utilization
max_num_seqs
max_num_batched_tokens
max_model_len
quantization
tensor_parallel_size
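Each of the settings above is available as a `vllm serve` command-line flag with the underscores replaced by dashes. The sketch below builds such a command line from sizing decisions. The helper `build_vllm_serve_args` is made up for this handout and the values in the example are placeholders, not recommendations.

```python
# The six settings listed above, in the same order.
VLLM_SETTINGS = (
    "gpu_memory_utilization",  # fraction of GPU memory vLLM may use
    "max_num_seqs",            # max sequences scheduled together
    "max_num_batched_tokens",  # max tokens processed per scheduler step
    "max_model_len",           # max context length per request
    "quantization",            # quantization method of the checkpoint
    "tensor_parallel_size",    # number of GPUs to shard the model across
)


def build_vllm_serve_args(model: str, **settings) -> list[str]:
    """Build a `vllm serve` argv list from the settings above."""
    unknown = set(settings) - set(VLLM_SETTINGS)
    if unknown:
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    argv = ["vllm", "serve", model]
    for name in VLLM_SETTINGS:
        value = settings.get(name)
        if value is not None:  # unset settings keep vLLM's defaults
            argv += [f"--{name.replace('_', '-')}", str(value)]
    return argv


# Placeholder values that mirror the first spreadsheet row (4096 context, 4 requests).
argv = build_vllm_serve_args("replace-with-model-name",
                             gpu_memory_utilization=0.9,
                             max_num_seqs=4,
                             max_model_len=4096,
                             tensor_parallel_size=1)
print(" ".join(argv))
```

Keeping `max_model_len` and `max_num_seqs` equal to the context and concurrency in your spreadsheet makes the running server match the numbers you sized for.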

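Today's deliverables

Build a GPU sizing spreadsheet:

| model | params_B | precision | weight_GB | context | concurrency | kv_cache_GB | overhead_GB | safety_GB | total_GB | GPU | fits | p50 | p95 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 7 | INT4 | | 4096 | 4 | | | | | 24GB | | | |
| 31B | 31 | INT4 | | 8192 | 2 | | | | | 48GB | | | |
| 70B | 70 | INT4 | | 8192 | 4 | | | | | H100/H200 | | | |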
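The p50 and p95 columns come from measured request latencies, not from a formula. Below is a minimal sketch using the nearest-rank percentile definition (other definitions interpolate and give slightly different numbers). The latency samples are invented for illustration.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented end-to-end latencies in milliseconds for ten requests.
latencies_ms = [120, 95, 110, 400, 105, 98, 130, 102, 115, 101]

print("p50:", percentile(latencies_ms, 50))  # p50: 105
print("p95:", percentile(latencies_ms, 95))  # p95: 400
```

With only ten samples, p95 is simply the slowest request, so collect many more requests before writing a number into the spreadsheet.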

Build a K8s checklist:

pod starts
readiness passes
liveness passes
service routes
ingress reaches service
secret loaded but not logged
config visible
GPU resource request documented
logs include request_id, model, latency, error_code
rollback path documented
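For the logging item, one structured line per request makes the four fields easy to search. This is a minimal sketch: the helper name `make_log_line`, the JSON layout, and the choice to record latency in milliseconds as `latency_ms` are assumptions for illustration, not a required format.

```python
import json
import time
import uuid
from typing import Optional


def make_log_line(model: str, latency_ms: float,
                  error_code: Optional[str] = None,
                  request_id: Optional[str] = None) -> str:
    """Return one JSON log line with the four checklist fields."""
    record = {
        "request_id": request_id or str(uuid.uuid4()),
        "model": model,
        "latency_ms": round(latency_ms, 1),
        "error_code": error_code,  # None means the request succeeded
    }
    # Secrets such as API keys are deliberately never part of the record.
    return json.dumps(record, sort_keys=True)


start = time.perf_counter()
# ... call the model service here ...
elapsed_ms = (time.perf_counter() - start) * 1000
print(make_log_line("replace-with-model-name", elapsed_ms))
```

Passing the same `request_id` through api-gateway and llm-service lets you join their log lines when debugging a slow request.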