LLM Inference Optimization 2026: Serving, Batching, KV Cache

Q: vLLM or TensorRT-LLM for production?

Default to vLLM for most production deployments. It's faster to set up, more flexible, broadly supported, and produces 80-90% of TensorRT-LLM's peak performance with much less engineering investment. Move to TensorRT-LLM when you have a stable model, are NVIDIA-locked, and need the last 10-20% performance for cost reasons.

Q: How much can I improve inference performance through optimization?

From a naive deployment to a properly tuned one: typically 5-15x throughput improvement on the same hardware. From a properly tuned one to a highly tuned one: another 1.5-2x. Beyond that, the gains require specialized techniques (custom kernels, model architecture changes) that aren't worth the effort for most teams.

Q: What's the minimum hardware to run a 70B model?

FP16: 2x H100 (160 GB total memory). FP8: 1x H100 (80 GB). INT4 AWQ: 1x H100 with significant headroom. For comfortable production deployment with longer contexts and reasonable batch sizes: 2x H100 with tensor parallelism, FP8 quantization, prefix caching enabled.
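The GPU counts above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic, assuming a round 70e9 parameters and 80 GB per GPU; the function names are illustrative, and the estimate deliberately ignores KV cache, activations, and quantization metadata such as scales:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and quantization metadata (scales, zero points).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(num_params: float, precision: str, gpu_memory_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory exceeds the weights alone."""
    needed = weight_memory_gb(num_params, precision)
    count = 1
    while count * gpu_memory_gb <= needed:
        count += 1
    return count


for precision in ("fp16", "fp8", "int4"):
    gb = weight_memory_gb(70e9, precision)
    print(f"{precision}: ~{gb:.0f} GB of weights, at least {min_gpus(70e9, precision)} x 80 GB GPU")
```

Whatever memory is left after the weights is the budget for KV cache and activations, which is why the FP8 single-GPU case is tight.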

Q: Is it worth self-hosting if I'm using less than X tokens per day?

Below ~100M tokens per month, API providers are almost always cheaper than self-hosting once you account for engineering time, operational overhead, and hardware idle time. Above ~500M tokens per month, self-hosting becomes competitive. Between those is a gray zone where the answer depends on your specific economics, regulatory requirements, and team expertise.

Q: How do I benchmark properly?

Use a realistic request distribution (not just one prompt repeated). Match production concurrency. Measure TTFT (time to first token), ITL (inter-token latency), throughput, and error rate. Run for at least an hour to capture warmup and steady state. Tools: genai-perf, vLLM's bundled benchmark scripts, locust, custom load generators. Don't trust the throughput numbers shown in marketing materials; they come from idealized workloads that don't match production.
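If you write a custom load generator, the metrics above can be derived from per-token arrival timestamps. The sketch below is one possible way to do it, not the method used by any of the tools listed; all function names and the record layout are illustrative assumptions:

```python
import math
import statistics


def summarize_request(send_time: float, token_times: list[float]) -> dict:
    """TTFT and inter-token latencies (seconds) for one streamed response.

    send_time: when the request was sent.
    token_times: arrival timestamp of each output token, in order.
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - send_time
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "itls": itls, "num_tokens": len(token_times)}


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, pct in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def summarize_run(requests: list[dict], wall_clock_s: float) -> dict:
    """Aggregate per-request summaries into run-level metrics."""
    ttfts = [r["ttft"] for r in requests]
    itls = [gap for r in requests for gap in r["itls"]]
    total_tokens = sum(r["num_tokens"] for r in requests)
    return {
        "ttft_p50": percentile(ttfts, 50),
        "ttft_p99": percentile(ttfts, 99),
        "itl_mean": statistics.fmean(itls) if itls else 0.0,
        "throughput_tok_s": total_tokens / wall_clock_s,
    }
```

Reporting percentiles rather than means matters here because tail latency is usually what violates an SLO.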

Q: How does prompt caching at the API level work?

For OpenAI: prompts over 1024 tokens are eligible; cached prefixes get discounted input pricing (typically 50%). For Anthropic: explicit cache_control markers in the prompt; cached content gets 90% discount on input pricing after initial cache write. For Google: cache feature is available with specific configuration. All have minimum prompt sizes and cache TTLs you should learn for your provider.

LLM inference optimization is the technical discipline that separates teams running large language models profitably from teams burning money on idle GPUs. As of 2026, the gap between a naively deployed model and a properly optimized inference stack is roughly 10x — same hardware, same model, ten times the throughput, one tenth the per-token cost. The techniques that produce that gap — PagedAttention, continuous batching, prefix caching, speculative decoding, quantization, disaggregated serving — have stabilized into a known toolkit, but applying them correctly takes engineering judgment that doesn’t come from reading marketing material. This eguide is the working engineer’s guide to LLM inference optimization in 2026: what each technique actually does, when it helps and when it doesn’t, what the production-grade serving stacks (vLLM, TensorRT-LLM, SGLang, TGI) get right, and how to measure whether your optimization is actually working. Written for ML platform engineers, infrastructure teams running self-hosted models, SREs scaling inference traffic, and technical architects choosing an inference stack. No assumption that you’ve already built one; every concept introduced before it’s used.

The inference cost reality in 2026 — why this matters
Inference fundamentals — prefill, decode, and the autoregressive loop
KV cache — what it is, why it dominates memory
Continuous batching — beyond static batching
PagedAttention and memory management
Speculative decoding — draft models and accept/reject
Quantization for inference — INT8, FP8, INT4, AWQ, GPTQ
Long-context inference — chunking, sliding window, ring attention
Multi-GPU and multi-node — tensor, pipeline, expert parallelism
Caching strategies — prompt caching, prefix sharing, semantic cache
Streaming inference and time-to-first-token
Inference engines compared — vLLM, TensorRT-LLM, SGLang, TGI
Hardware choices in 2026 — H100, H200, B200, MI300, Groq
Production deployment — autoscaling, monitoring, SLOs
Common mistakes in inference deployments
FAQ

Chapter 1: The inference cost reality in 2026 — why this matters

The economics of running large language models have changed character through 2024-2026. Training a frontier model still costs hundreds of millions of dollars, but training is a one-time event amortized over the model’s commercial lifetime. Inference is the recurring cost that scales with usage. For a model deployed at meaningful scale, the cumulative inference cost crosses the training cost within months and continues compounding for years.

Concrete numbers ground this. A Llama-3.1-70B model deployed on a single H100 GPU naively — no continuous batching, no PagedAttention, no quantization — produces roughly 100-300 tokens per second for a single user. The same model on the same hardware with a properly tuned vLLM deployment serves 30-40 concurrent users at sustained throughput of 3,000-5,000 tokens per second aggregate. That’s a 10-50x throughput improvement on identical hardware. The per-token cost difference is proportional. A team that ships the naive version pays 10-50x more per token than a team that ships the optimized version. At meaningful scale — millions of tokens per day — the difference is the difference between a viable product and a money-losing one.
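To see how throughput translates into unit cost, here is a minimal sketch of the arithmetic. The GPU hourly price is a made-up placeholder, not a quoted rate, and the two throughput figures are simply midpoints of the ranges above:

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6


GPU_HOURLY_USD = 3.00  # hypothetical price, substitute your own

naive = cost_per_million_tokens(GPU_HOURLY_USD, tokens_per_second=200)
tuned = cost_per_million_tokens(GPU_HOURLY_USD, tokens_per_second=4000)
print(f"naive: ${naive:.2f} per 1M tokens")
print(f"tuned: ${tuned:.2f} per 1M tokens")
print(f"ratio: {naive / tuned:.0f}x")
```

Because the hardware cost per hour is fixed, the cost ratio is exactly the throughput ratio; idle time and engineering cost are not modeled here.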

The other reality: inference optimization is not free. The optimized stacks require engineering investment to set up correctly, tune to your workload, and monitor in production. Misconfigured optimizations can produce worse results than no optimization at all. A wrong quantization level can degrade output quality below acceptable thresholds; a wrong batching configuration can introduce latency spikes that violate your SLOs; a wrong cache eviction policy can blow out memory under specific traffic patterns. The discipline is technical, and the failure modes are subtle.

Who needs this. Teams self-hosting models — Llama, Mistral, Qwen, DeepSeek, Granite, or fine-tuned variants — absolutely need inference optimization. Teams using API providers (OpenAI, Anthropic, Google) don’t manage inference themselves but should understand the underlying techniques because they affect everything from latency to pricing to context-window behavior. Teams considering a build-vs-buy decision between self-hosting and API consumption need to understand what inference optimization can and cannot do for the unit economics of self-hosting.

What this guide covers. The fundamental mechanics of inference, the optimization techniques that have stabilized into production-grade practice, the serving stacks that implement them, the hardware that hosts them, the deployment patterns that operate them, and the common mistakes that bite teams shipping their first optimized inference stack. The intent is that an engineer who works through this guide and applies the techniques to their actual workload should be able to extract most of the theoretically available performance from their setup.

What this guide does not cover in depth. Training optimization, fine-tuning techniques (covered in the LLM Fine-Tuning 2026 eguide), and the upstream choices of base model selection. Those topics intersect with inference but are separate disciplines. This guide assumes you have selected a model and need to serve it efficiently; everything else flows from that starting point.

The state of the field in mid-2026. vLLM is the de-facto open-source standard for self-hosted LLM serving, with PagedAttention as its defining innovation. NVIDIA’s TensorRT-LLM has the highest peak performance on NVIDIA hardware but with steeper engineering investment. SGLang offers structured generation patterns. TGI from Hugging Face remains a strong choice for teams in the Hugging Face ecosystem. Hardware-wise, NVIDIA H100/H200/B200 dominate production deployments; AMD MI300 is gaining traction; specialized inference chips (Groq, Cerebras, SambaNova) serve specific use cases at extreme latency or throughput.

Chapter 2: Inference fundamentals — prefill, decode, and the autoregressive loop

To optimize inference, you have to understand what inference does. The basic mechanics are simple at the conceptual level and have profound implications for performance at the implementation level.

An LLM is a function: given a sequence of tokens, produce a probability distribution over the next token. To generate text, you sample from that distribution to get the next token, append it to your sequence, and call the function again. Repeat until you reach a stop condition. This is autoregressive generation, and it has a specific computational shape that drives all of inference optimization.

The forward pass through the model has two phases for inference: prefill and decode. During prefill, the model processes the entire input prompt at once — every token in the prompt goes through every layer in parallel, computing attention against every other token. Prefill is compute-bound; the GPU is fully utilized doing matrix multiplications across the full prompt. The output of prefill is the model’s representation of the prompt plus the first generated token, along with the KV cache that summarizes the prompt’s attention state.

During decode, the model generates one token at a time. Each decode step takes the most recent token, computes attention against all prior tokens (via the KV cache), and produces the next token’s probability distribution. Decode is memory-bandwidth-bound; the GPU spends most of its time fetching the KV cache from memory rather than doing useful computation. This asymmetry — compute-bound prefill, memory-bound decode — is the central performance dynamic of LLM inference.

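# Conceptual inference loop (model is a placeholder object, not a real library API)
def generate(model, prompt_tokens, eos_token, max_new_tokens=512):
    # Prefill: process the entire prompt in one parallel pass.
    # Returns the KV cache for the prompt and the first generated token.
    kv_cache, first_token = model.prefill(prompt_tokens)

    output_tokens = [first_token]
    # Decode: one token per step, until EOS or the length limit.
    # Checking the last token also handles an EOS produced by prefill.
    while output_tokens[-1] != eos_token and len(output_tokens) < max_new_tokens:
        next_token, kv_cache = model.decode(output_tokens[-1], kv_cache)
        output_tokens.append(next_token)
    return output_tokens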

Implications of this two-phase structure. Time-to-first-token (TTFT) is determined by prefill latency. Inter-token latency is determined by decode speed. Total latency for a response is roughly TTFT + (output_length × inter_token_latency). For short prompts and short responses, TTFT dominates. For short prompts and long responses, decode dominates. For long prompts (think RAG with retrieved documents), TTFT can be significant and decode-time depends on the full context.
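The latency approximation above is easy to turn into a quick estimator. A minimal sketch; the TTFT and inter-token figures in the scenarios are invented for illustration and should be replaced with measurements from your own stack:

```python
def estimate_latency_s(ttft_s: float, output_tokens: int, itl_s: float) -> float:
    """Approximate end-to-end latency: TTFT + output_length x inter-token latency."""
    return ttft_s + output_tokens * itl_s


def ttft_share(ttft_s: float, output_tokens: int, itl_s: float) -> float:
    """Fraction of total latency spent waiting for the first token."""
    return ttft_s / estimate_latency_s(ttft_s, output_tokens, itl_s)


# (label, TTFT seconds, output tokens, ITL seconds) -- hypothetical values
scenarios = [
    ("short prompt, short reply", 0.15, 20, 0.03),
    ("short prompt, long reply", 0.15, 1000, 0.03),
    ("long RAG prompt, short reply", 2.00, 100, 0.04),
]
for label, ttft, n_out, itl in scenarios:
    total = estimate_latency_s(ttft, n_out, itl)
    print(f"{label}: {total:.2f} s total, TTFT share {ttft_share(ttft, n_out, itl):.0%}")
```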

The KV cache. During prefill, the model computes a key and value vector for every token at every layer; these get cached. During decode, each new token only computes its own key/value, then attends against all cached keys/values. Without the KV cache, every decode step would have to recompute attention against the entire history — quadratic in sequence length and catastrophically slow. With the KV cache, decode is linear in sequence length, but cache size grows linearly too, which becomes a memory bottleneck.

Concrete numbers. For Llama-3.1-70B at FP16: per-token KV cache is roughly 320 KB across all layers. A 4K-token context uses ~1.3 GB of KV cache per request. At batch size 32 (32 concurrent users), that’s 40+ GB just for KV cache, far more than a single H100 has left once the model weights are loaded, unless you manage it carefully. This is why KV cache management is the central concern of inference optimization.

One more fundamental: arithmetic intensity. The ratio of compute to memory access for an operation. Modern GPUs have huge compute capacity (FLOPS) but proportionally limited memory bandwidth. Operations with high arithmetic intensity (matrix-matrix multiplies on big matrices) hit peak FLOPS. Operations with low arithmetic intensity (matrix-vector multiplies during decode) are memory-bound and run at a fraction of peak FLOPS. The goal of many optimizations is to raise arithmetic intensity by batching more work per memory access.

The roofline model. A useful mental model: plot operations on a roofline chart with arithmetic intensity on the x-axis and achievable performance on the y-axis. The “roof” is two lines — a sloping line for memory-bandwidth-bound regime, a flat line at peak FLOPS for compute-bound regime. An operation’s achievable performance is the lower of the two at its arithmetic intensity. Decode operations sit far left on the chart (low arithmetic intensity, memory-bound). Prefill on long sequences sits far right (high arithmetic intensity, compute-bound). The job of inference optimization is to move operations rightward toward higher arithmetic intensity where the GPU can deliver more.
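The roofline rule reduces to a single `min`. The sketch below uses round placeholder hardware numbers (1e15 FLOP/s peak, 3e12 bytes/s bandwidth) rather than figures from any datasheet. It also assumes, as a simplification, that a batched decode step over FP16 weights has an arithmetic intensity of about one FLOP per byte per request in the batch (2 FLOPs per 2-byte weight read, with each weight reused once per request):

```python
def attainable_flops(intensity_flop_per_byte: float,
                     peak_flops: float,
                     bandwidth_bytes_s: float) -> float:
    """Roofline: the lower of the compute roof and the memory-bandwidth slope."""
    return min(peak_flops, intensity_flop_per_byte * bandwidth_bytes_s)


def ridge_point(peak_flops: float, bandwidth_bytes_s: float) -> float:
    """Arithmetic intensity at which an operation stops being memory-bound."""
    return peak_flops / bandwidth_bytes_s


PEAK_FLOPS = 1.0e15   # placeholder, not a datasheet value
BANDWIDTH = 3.0e12    # placeholder, bytes per second

print(f"ridge point: ~{ridge_point(PEAK_FLOPS, BANDWIDTH):.0f} FLOP/byte")
for batch_size in (1, 8, 64, 512):
    intensity = float(batch_size)  # simplification: ~1 FLOP/byte per batched request
    fraction = attainable_flops(intensity, PEAK_FLOPS, BANDWIDTH) / PEAK_FLOPS
    print(f"batch {batch_size:>3}: {fraction:.1%} of peak FLOPS attainable")
```

Under these assumptions a single-request decode step can reach only a fraction of a percent of peak compute, which is the quantitative case for batching.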

Sampling. Once the model produces logits (the unnormalized probability distribution over the vocabulary at each position), sampling chooses a token. Greedy sampling picks the highest-probability token deterministically. Temperature sampling rescales the distribution and samples randomly. Top-k and top-p (nucleus) sampling restrict the candidate set to the highest-probability tokens. The sampling step is fast (small matrix-vector op on a vector of size vocab) and rarely the bottleneck. The choice affects output quality and diversity but not performance.
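The sampling strategies above can be sketched in a few lines of standard-library Python. This is an illustrative implementation over a plain list of logits, not the code any serving engine uses; production engines run the same logic on the GPU over batched tensors:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
    """Keep the top-k tokens and the smallest nucleus reaching top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Return a token index; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_k_top_p_filter(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible, which helps when comparing quality across configurations.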

Why this matters for optimization. Every optimization technique covered in later chapters targets one of these fundamentals: reducing KV cache size or compute, shifting decode operations to higher arithmetic intensity through batching, reducing memory bandwidth requirements through quantization, parallelizing across multiple GPUs to spread the memory pressure. The techniques are different; the underlying targets are the same. An engineer who understands the fundamentals can predict whether a new optimization technique is likely to help on their workload before measuring; an engineer who doesn’t has to measure everything blindly.

Chapter 3: KV cache — what it is, why it dominates memory

The KV cache is the single largest memory consumer in LLM inference and the central object that optimization techniques target. Understanding its structure clarifies why specific techniques work and where their limits lie.

Structure. At each transformer layer, attention computes key (K) and value (V) projections from the input. For each token at each layer, K and V are vectors of dimension equal to the model’s per-head dimension times the number of heads. For Llama-3.1-70B with 80 layers, 64 attention heads, and head dimension 128, the KV cache would hold 80 × 2 (K and V) × 64 × 128 ≈ 1.3M values per token if every head kept its own K and V, which at FP16 is ~2.6 MB per token. At 4K tokens of context: ~11 GB per request. At 32K context: ~86 GB for a single request. At batch size 8: roughly 690 GB. Obviously infeasible on a single GPU.

Most models use grouped-query attention (GQA) or multi-query attention (MQA) to reduce the KV cache footprint. With GQA, multiple query heads share KV vectors. Llama-3.1-70B uses GQA with 8 KV heads (down from 64 query heads), reducing per-token KV cache by 8x to ~320 KB. This is what makes long contexts feasible at all; without GQA, modern context windows would be impractical.

# KV cache size estimation
def kv_cache_size_per_token_bytes(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Estimate KV cache size per token, in bytes."""
    return num_layers * 2 * num_kv_heads * head_dim * dtype_bytes

# Llama-3.1-70B with GQA (8 KV heads)
per_token_bytes = kv_cache_size_per_token_bytes(
    num_layers=80,
    num_kv_heads=8,
    head_dim=128,
    dtype_bytes=2  # FP16
)
print(f"Per-token KV: {per_token_bytes:,} bytes")  # 327,680 bytes (320 KiB)

# For 4K context, 32 concurrent users:
total_kv_gb = (per_token_bytes * 4096 * 32) / 1e9
print(f"Total KV cache: {total_kv_gb:.1f} GB")  # ~43 GB

Memory bandwidth dominates decode time. During each decode step, the model must read the entire KV cache from memory to compute attention. An H100 has roughly 3 TB/s of HBM bandwidth (3.35 TB/s on the SXM variant). Reading 42 GB of KV cache takes ~14 ms even at full bandwidth. That’s already several times longer than the compute itself takes. This is why decode is memory-bandwidth-bound and why anything that reduces KV cache size (quantization, eviction, sharing) directly improves decode throughput.
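The bandwidth arithmetic above generalizes to a simple lower bound on decode step time. A minimal sketch, using the round 3 TB/s figure as the default; the `weight_bytes` term is an addition not spelled out in the text (each step also streams the model weights once), and the result is a floor that real kernels will not reach:

```python
def decode_step_floor_ms(weight_bytes: float,
                         kv_bytes: float,
                         bandwidth_bytes_s: float = 3.0e12) -> float:
    """Lower bound on one decode step: bytes that must be read / memory bandwidth."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_s * 1e3


# KV cache alone, as in the example above
print(f"KV only: {decode_step_floor_ms(0, 42e9):.1f} ms")

# Halving the KV cache (for example with an 8-bit KV dtype) halves that floor
print(f"KV halved: {decode_step_floor_ms(0, 21e9):.1f} ms")
```

With tensor parallelism, apply the function per GPU using that GPU's share of the weights and KV cache.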

Fragmentation problems. Naive KV cache management allocates contiguous memory per request, sized for the maximum expected context. A request that uses 100 tokens of a 32K-allowed context wastes 31,900 tokens’ worth of memory. When requests have variable lengths and batch composition changes over time, this fragmentation can waste 60-80% of available memory. PagedAttention (chapter 5) addresses this directly.

KV cache quantization. KV cache values can be quantized to FP8 or INT8 with relatively small quality impact, halving memory consumption relative to FP16. vLLM and other modern engines support KV cache quantization. The trade-off: quantization adds compute overhead and can degrade output quality on edge cases. Most production deployments use FP16 KV cache for safety; quantization is a tool when memory pressure forces it.

Sharing the KV cache across requests. Multiple requests with the same prefix can share the KV cache for that prefix. This is prefix caching (chapter 10) and dramatically reduces effective memory consumption when many requests share common context (e.g., a chatbot with a fixed system prompt).

Why KV cache size is the constraint that matters. Model parameters are a fixed cost: you load them once and serve all requests from the same weights. KV cache is per-request: it grows with the number of concurrent users and the length of each context. The available memory for KV cache (total GPU memory minus model weights minus activations) determines how many concurrent requests you can serve. For a 70B model in FP16 (140 GB) on a single H100 (80 GB), you can’t fit the model. With tensor parallelism across 2 H100s (160 GB), you have ~20 GB left for KV cache after model + activations. At ~320 KB per token that is about 61K tokens of cache: roughly 15 concurrent requests at 4K context, or about 50 at ~1.2K tokens each. With FP8 quantization (70 GB model), a single H100 has ~10 GB for KV cache, supporting fewer concurrent users but at lower hardware cost. These trade-offs cascade through every other decision.
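The capacity question can be checked with a few lines of arithmetic. A minimal sketch, assuming the per-token KV size computed earlier for Llama-3.1-70B with GQA at FP16 (327,680 bytes) and treating the KV budget as a given input; the function name is illustrative:

```python
def max_concurrent_requests(kv_budget_gb: float,
                            context_tokens: int,
                            per_token_kv_bytes: int = 327_680) -> int:
    """How many requests of a given context length fit in the KV cache budget."""
    bytes_per_request = context_tokens * per_token_kv_bytes
    return int(kv_budget_gb * 1e9 // bytes_per_request)


for budget_gb in (20, 10):
    for context in (1024, 4096, 16384):
        n = max_concurrent_requests(budget_gb, context)
        print(f"{budget_gb} GB budget, {context:>5}-token contexts: {n} concurrent requests")
```

The answer swings widely with context length, so a concurrency figure only means something when it is quoted together with the context length it assumes.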

Memory profile through a request. At request start, prefill allocates KV cache for the prompt’s tokens. During decode, KV cache grows by one entry per token generated. At request end, KV cache for that request is freed. Production engines manage this allocation/free cycle aggressively; the engine that doesn’t free promptly leaks memory across requests and eventually crashes.

Future-proofing KV cache considerations. Models are evolving to reduce KV cache pressure: GQA (already standard), MLA (Multi-head Latent Attention, in DeepSeek-V3 and newer models), state-space approaches that don’t have traditional KV cache. As these architectures become more common, the optimization techniques will shift. The fundamental engineering principles (manage memory, batch effectively, exploit shared context) will remain; the specific techniques will evolve.

Chapter 4: Continuous batching — beyond static batching

Batching is the most basic optimization in LLM inference and the place where naive implementations leave the most performance on the table. Understanding what continuous batching actually does requires understanding what static batching does badly.

Static batching is the simple version. Wait for N requests to arrive; process them together; return all responses; start the next batch. The problem: requests have variable output lengths. If one request in the batch generates 500 tokens and the others generate 50, every slot does useful work for the first 50 steps; for the remaining 450 steps the finished slots sit idle (or compute padding) while the one long request completes. GPU utilization plummets.
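The waste in that example can be quantified. A minimal sketch that measures slot utilization for a static batch, assuming a batch of eight requests (the batch size is an illustrative choice, not a figure from the text):

```python
def static_batch_utilization(output_lengths: list[int]) -> float:
    """Fraction of batch slots doing useful work when the batch runs until
    its longest request finishes."""
    steps = max(output_lengths)              # the batch runs this many decode steps
    useful_slots = sum(output_lengths)       # slots that produced a needed token
    total_slots = steps * len(output_lengths)
    return useful_slots / total_slots


# One 500-token response batched with seven 50-token responses
lengths = [500] + [50] * 7
print(f"utilization: {static_batch_utilization(lengths):.1%}")
```

This counts slots rather than FLOPs, so it is a proxy for GPU utilization, but it shows how a single long request drags down the whole batch.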

Continuous batching (also called inflight batching, rolling batching, or dynamic batching) solves this. The batch is not a fixed set of requests; it’s a rolling set. When any request finishes, it’s removed from the batch immediately, and a new request can join. The GPU is always working on as many requests as the batch can fit, with no idle generation.

# Conceptual continuous batching loop (simplified from vLLM; model and request objects are placeholders)
from collections import deque

class ContinuousBatcher:
    def __init__(self, model, max_batch_size):
        self.model = model  # must expose decode_batch(requests)
        self.active_requests = []  # in-flight requests
        self.waiting_requests = deque()  # queued, not yet started
        self.max_batch_size = max_batch_size

    def step(self):
        # Admit queued requests while the batch has room
        while len(self.active_requests) < self.max_batch_size and self.waiting_requests:
            self.active_requests.append(self.waiting_requests.popleft())

        if not self.active_requests:
            return []  # nothing to do this step

        # Run one decode step across all active requests
        outputs = self.model.decode_batch(self.active_requests)

        # Return finished requests to their clients and free their batch slots
        still_running = []
        for r in self.active_requests:
            if r.is_done():
                r.return_to_client()
            else:
                still_running.append(r)
        self.active_requests = still_running

        return outputs

The throughput gain from continuous batching is dramatic. Static batching with mixed request lengths achieves 30-50% GPU utilization in typical workloads. Continuous batching can sustain 80-95% GPU utilization. The result: 2-3x throughput improvement on the same hardware, sometimes more depending on request length distribution.

Prefill in continuous batches. Prefill is heavy (processes many tokens at once); decode is light (one token per request). Naively mixing prefill and decode in the same batch creates inefficiency. Two patterns address this: chunked prefill (break large prefills into chunks that fit alongside decode work) and disaggregated serving (separate prefill servers from decode servers, route requests between them). vLLM supports chunked prefill in modern versions; disaggregated serving is supported by some advanced setups but requires more infrastructure.

Configuration that matters. max_num_seqs in vLLM controls the maximum batch size. Higher values increase throughput at the cost of more memory and potentially worse latency. max_num_batched_tokens controls the total tokens processed per step, which interacts with prefill chunking. Tune both for your workload: high concurrency + short responses tolerates larger batches; high latency sensitivity + long responses wants smaller batches.
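As a starting point, the two parameters can be captured as keyword arguments in the shape vLLM's engine accepts (for example `vllm.LLM(model=..., **engine_args)`). The values below are illustrative placeholders rather than recommendations, and `check_engine_args` is a hypothetical helper, not part of vLLM:

```python
# Illustrative profiles; tune the numbers against your own traffic.
interactive_args = {
    "max_num_seqs": 64,               # smaller batch, lower per-request latency
    "max_num_batched_tokens": 4096,   # token budget per scheduler step
    "enable_chunked_prefill": True,   # let prefill chunks share steps with decode
}

offline_batch_args = {
    "max_num_seqs": 256,              # larger batch, higher aggregate throughput
    "max_num_batched_tokens": 16384,
    "enable_chunked_prefill": True,
}


def check_engine_args(args: dict) -> None:
    """Sanity check: every running sequence needs at least one token of budget per step."""
    if args["max_num_batched_tokens"] < args["max_num_seqs"]:
        raise ValueError(
            "max_num_batched_tokens must be at least max_num_seqs, "
            f"got {args['max_num_batched_tokens']} < {args['max_num_seqs']}"
        )


for profile in (interactive_args, offline_batch_args):
    check_engine_args(profile)
```

Keeping the two profiles as separate dictionaries mirrors the common practice of running interactive and offline workloads on differently configured servers.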

The latency-throughput trade-off. Larger batches improve throughput but increase per-request latency. For interactive applications (chatbots with users waiting), you want lower batch sizes and faster response. For batch jobs (offline document processing), you can afford larger batches and accept higher latency. Most production systems separate these workloads onto different inference servers configured differently.

When continuous batching breaks down. Workloads with extreme heterogeneity — some requests with tiny inputs and outputs, others with huge ones — produce headaches even for continuous batching. The heavy requests dominate memory, the light requests can't fully utilize the freed compute. Disaggregated serving or routing by request size helps; expect to need workload-specific tuning rather than one-size-fits-all configuration.

Priority and admission control. With finite batch capacity, you may need to decide which requests get served first when demand exceeds capacity. Priority queues let higher-priority users (paid tier vs. free tier, real-time vs. batch) jump ahead. Admission control rejects requests that would push system beyond healthy limits. Both are layered on top of continuous batching as policy decisions rather than mechanism changes.
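Both policies fit in a small queue that sits in front of the batcher. The sketch below is one possible shape, assuming a numeric priority where lower means more urgent and a fixed queue-depth limit as the admission rule; the class and method names are illustrative:

```python
import heapq
import itertools


class AdmissionQueue:
    """Priority queue with a depth limit, placed in front of the batcher."""

    def __init__(self, max_queue_depth: int):
        self.max_queue_depth = max_queue_depth
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, request_id: str, priority: int) -> bool:
        """Queue a request. Lower priority number is served first.
        Returns False when the request is rejected by admission control."""
        if len(self._heap) >= self.max_queue_depth:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))
        return True

    def next_request(self):
        """Pop the most urgent queued request, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = AdmissionQueue(max_queue_depth=3)
queue.submit("free-1", priority=1)
queue.submit("paid-1", priority=0)
queue.submit("free-2", priority=1)
print(queue.submit("free-3", priority=1))  # rejected: queue is full
print(queue.next_request())                # paid tier jumps ahead
```

A real deployment would more likely reject on estimated wait time or memory headroom than on raw queue depth, but the mechanism is the same.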

Prefix-aware scheduling. The most sophisticated continuous batchers schedule requests to maximize KV cache reuse opportunities. A request that shares a prefix with one already in the batch can be admitted at lower marginal memory cost. Engines like SGLang and recent vLLM versions implement variants of this; it's especially valuable for chatbot workloads with heavy system-prompt sharing.

Effect of batch composition on tail latency. A batch dominated by requests with long outputs creates tail-latency pressure on requests with short outputs that joined late — they wait until the batch processes through enough steps for them to complete. Production observability should track tail latency by request type; if short-output requests have unacceptable tail latency, you may need separate inference pools.

Tuning max_num_batched_tokens. This parameter caps the total token-step work per scheduler iteration. Higher values let prefill chunks coexist with decode work; lower values reduce the time slice of any single scheduler step. The sweet spot is typically 4-8x your model's per-step batch size token budget. Too low: prefill chunks become too small, increasing per-token prefill overhead. Too high: long prefill chunks block ongoing decode and inflate ITL.

Chapter 5: PagedAttention and memory management

PagedAttention is the algorithm that put vLLM on the map and remains the defining innovation in LLM inference memory management. Understanding it makes the difference between configuring an inference stack well and configuring it poorly.

The problem PagedAttention solves. Traditional KV cache allocation reserves contiguous memory per request, sized for the maximum expected sequence length. When a request only uses a fraction of that allocation, the rest is wasted. Across many requests with variable lengths, total waste typically reaches 60-80% of allocated memory. The GPU has memory; the inference engine just can't use it for new requests because no single contiguous block is available.

PagedAttention adapts ideas from operating-system virtual memory. Memory is divided into fixed-size blocks (typical block size: 16 tokens worth of KV cache). Each request gets logical blocks; the engine maintains a mapping from logical blocks to physical blocks. Physical blocks come from a shared pool. When a request needs more KV cache, it allocates another physical block from the pool; when it finishes, blocks return to the pool for reuse.

The result: near-zero fragmentation. A request that uses 100 tokens uses 7 blocks (16 tokens each, the last one holding only 4 tokens); the rest of memory is available for other requests. Aggregate batch sizes can be 2-4x larger than with contiguous allocation, directly improving throughput.

# PagedAttention concept (simplified)
class OutOfMemory(Exception):
    """Raised when the block pool cannot satisfy an allocation."""

class KVCacheManager:
    def __init__(self, total_blocks, block_size=16):
        self.block_pool = list(range(total_blocks))  # available physical blocks
        self.block_size = block_size  # tokens per block
        self.request_block_map = {}  # request_id -> [physical blocks]

    def allocate_for_request(self, request_id, num_tokens):
        # Ceiling division: a partially filled last block still needs a whole block
        blocks_needed = (num_tokens + self.block_size - 1) // self.block_size
        if blocks_needed > len(self.block_pool):
            raise OutOfMemory("Not enough blocks for request")
        blocks = [self.block_pool.pop() for _ in range(blocks_needed)]
        self.request_block_map[request_id] = blocks
        return blocks

    def free_request(self, request_id):
        # Return the request's blocks to the shared pool for reuse
        blocks = self.request_block_map.pop(request_id, [])
        self.block_pool.extend(blocks)
PagedAttention enables other capabilities. With block-level memory, the engine can share blocks across requests for common prefixes (the foundation of prefix caching). It can swap blocks to CPU memory when GPU memory is full (allowing larger virtual capacity at the cost of swap latency). It can implement beam search and parallel sampling efficiently because the multiple beams share the prefix's blocks. All of these become natural with paged memory; all are awkward without it.

Block size considerations. Larger blocks waste more memory per request (the last block is partially full); smaller blocks add more per-block overhead. vLLM defaults to 16 tokens per block and that's been validated across many workloads. Tuning the block size for specific workloads can give marginal wins but is rarely worth the effort.

CPU offload (a.k.a. KV swap). When GPU memory is exhausted, vLLM can swap blocks to CPU memory and bring them back when needed. This expands effective memory capacity but at significant latency cost (PCIe bandwidth is much slower than HBM). Useful as a safety valve for memory pressure; not a path to high throughput.

PagedAttention in other engines. The technique has been adopted by TensorRT-LLM, SGLang, and others. vLLM remains the reference implementation. If you're evaluating engines, ask explicitly whether they implement paged KV cache management — it's table stakes for serious inference in 2026.

Common misconfigurations. Setting gpu_memory_utilization too high leaves no headroom for activations and can cause OOM. Setting it too low wastes available capacity. vLLM's default is 0.9. The right value depends on the model, sequence length, and concurrency. Typical sweet spot for production: 0.85-0.92, with monitoring on actual peak memory usage.

Beam search and parallel sampling with PagedAttention. A single user request that asks for N parallel completions (e.g., top-3 candidates) can share KV cache for the prompt across all N completions. The N completions diverge after the prompt; each diverged tail uses its own blocks. This is much more memory-efficient than running N independent requests. vLLM supports this natively via n in sampling params.
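The sharing described above is typically implemented with reference counts on physical blocks. The sketch below shows the bookkeeping only; it is an illustrative structure, not vLLM's implementation, and it omits the copy-on-write step needed when a shared, partially filled block is appended to:

```python
class SharedBlockPool:
    """Physical KV blocks with reference counts, so sequences can share a prefix."""

    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.ref_count = {}  # block id -> number of sequences using it

    def allocate(self, num_blocks: int) -> list[int]:
        if num_blocks > len(self.free_blocks):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        for block in blocks:
            self.ref_count[block] = 1
        return blocks

    def share(self, blocks: list[int]) -> list[int]:
        """Give another sequence a reference to existing blocks (no copy)."""
        for block in blocks:
            self.ref_count[block] += 1
        return list(blocks)

    def release(self, blocks: list[int]) -> None:
        """Drop one reference per block; a block returns to the pool at zero."""
        for block in blocks:
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free_blocks.append(block)


pool = SharedBlockPool(total_blocks=100)
prompt_blocks = pool.allocate(64)  # prompt KV cache, computed once
# Three parallel completions: each shares the prompt and owns a 2-block tail
completions = [pool.share(prompt_blocks) + pool.allocate(2) for _ in range(3)]
pool.release(prompt_blocks)        # drop the original reference
print(f"blocks in use: {100 - len(pool.free_blocks)} (vs {3 * 66} for independent requests)")
```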

Preemption and rescheduling. Under memory pressure, vLLM can preempt in-flight requests (freeing their GPU blocks, either by writing the KV cache to CPU or by discarding it and recomputing it later, depending on version and configuration) and reschedule them later. This is a tail-latency event for the preempted request but keeps the system from rejecting new ones. Production deployments should monitor preemption rate; significant preemption indicates undersized infrastructure for the workload.

The relationship between block size and memory efficiency. Each request's last block is partially full on average; with block size 16, the average waste is 8 tokens per request. With 100 concurrent requests, that's 800 tokens of wasted KV cache space, roughly 260 MB at Llama-70B GQA dimensions (~320 KB per token). Not material next to a KV cache budget measured in tens of gigabytes. Block size 64 would waste 4x more but reduce per-block metadata overhead. The defaults are sensible; don't tune unless profiling indicates a specific bottleneck.

How PagedAttention interacts with quantization. The KV cache itself can be quantized while still being managed in blocks. Compressed KV cache + PagedAttention = even higher effective batch sizes. vLLM supports FP8 KV cache via kv_cache_dtype=fp8; this halves KV cache memory at small quality impact. Combined with FP8 weights, you can roughly double concurrent throughput vs. an FP16-everywhere baseline.

Chapter 6: Speculative decoding — draft models and accept/reject

Speculative decoding is the optimization that pushes decode throughput beyond what KV cache and batching alone can achieve. Conceptually simple, devastatingly effective when the workload is right.

The idea. Decode is slow because each token requires a forward pass through the entire model. A small "draft model" can generate candidate tokens much faster than the full model. If the candidates match what the full model would have generated, we accept them without paying the full-model cost. The full model only runs as a verifier — checks the draft's predictions in parallel — which is computationally efficient because verifying many tokens in parallel uses the same hardware path as processing a long prompt.

Mechanically. The draft model generates K candidate tokens (typically K=4-8). The full model evaluates the K tokens in a single forward pass, producing the actual probability distribution at each position. Compare draft predictions against full-model predictions; accept the leading prefix where they agree, reject from the first disagreement. The full model effectively generates 1 + (accepted draft tokens) tokens per step, instead of 1 token per step.

Performance impact. When draft and target agree often (typical for non-creative tasks), speculative decoding can produce 2-3x decode speedup. When draft and target disagree often (creative or open-ended tasks), the speedup is smaller (1.2-1.5x) or even negative (if draft is too inaccurate, the overhead of running it isn't worth it).

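# Speculative decoding concept (simplified; greedy verification)
import numpy as np

def speculative_decode(target_model, draft_model, prompt_kv, num_draft=4):
    # Step 1: Draft model generates K candidate tokens fast
    draft_tokens = []
    draft_kv = prompt_kv.copy()
    for _ in range(num_draft):
        token, draft_kv = draft_model.decode_one(draft_kv)
        draft_tokens.append(token)

    # Step 2: Target model verifies all K candidates in one parallel pass.
    # Assumed interface: target_probs[i] is the target's distribution for
    # position i given the prompt plus draft_tokens[:i], so K + 1 rows come back.
    target_probs = target_model.forward_parallel(draft_tokens, prompt_kv)

    # Step 3: Accept leading prefix where draft matches the target's greedy pick
    # (sampled decoding needs rejection sampling instead of an equality check)
    accepted = []
    for i, draft_tok in enumerate(draft_tokens):
        target_tok = int(np.argmax(target_probs[i]))
        if draft_tok == target_tok:
            accepted.append(draft_tok)
        else:
            # First disagreement: keep target's choice and stop
            accepted.append(target_tok)
            break
    else:
        # All K drafts accepted: the same pass already scored the next
        # position, so the target contributes one bonus token
        accepted.append(int(np.argmax(target_probs[num_draft])))
    return accepted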

Choosing a draft model. Smaller is faster but less accurate. A 1B-parameter draft for a 70B target is a common ratio. The draft can be a distillation of the target, a smaller checkpoint from the same family, or a separately trained small model. The key requirement: the draft's predictions must correlate with the target's predictions. Random small models work poorly; well-matched drafts work well.

Engineered drafts. Medusa, EAGLE, and similar techniques attach small prediction heads to the target model itself, sharing most of the computation. These methods can give 2-3x speedup without maintaining a separate draft model, at the cost of training overhead and architectural complexity. They've moved from research to early production in 2025-2026.

Multi-token prediction. Variations like Medusa-2 or pipeline-parallel decoding generate multiple speculative tokens in a single target-model pass, increasing the parallelism. The frontier of this technique continues to advance through 2026.

When not to use speculative decoding. Workloads where the target model is small (the overhead of speculation is significant vs. just running the target). Workloads with highly variable token-level outputs where draft accuracy is poor (e.g., creative writing, code generation in unusual languages). Heavily batched, throughput-saturated deployments where the GPU is already compute-bound: verification needs spare compute, and at large batch sizes there is none, so speculation mostly adds work. The technique helps most at low concurrency, where decode is memory-bound and the GPU has idle compute.

Production support. vLLM, TensorRT-LLM, and SGLang all support speculative decoding. Configuration involves choosing a draft model (or draft heads) and setting the number of speculative tokens; option names differ across engines and releases, so check the documentation for your version. Defaults work for most cases. Measure actual throughput before and after enabling: speculation should make things faster; if it doesn't, the draft is wrong for the workload.

Acceptance rate as the key metric. Speculation's value depends entirely on how often the draft's predictions match the target's. Production systems should measure acceptance rate as a first-class metric. Acceptance > 70%: significant net speedup. 40-70%: modest speedup, worth keeping. < 40%: probably losing performance overall; revisit the draft model choice or disable speculation.
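
To see why acceptance rate matters so much, a rough model helps. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification, and it expresses the draft model's per-token cost as a fraction of one target pass (draft_cost_ratio, a hypothetical parameter you would measure yourself). Under those assumptions the expected tokens per target pass is a geometric sum.

```python
def expected_tokens_per_step(accept_rate, num_draft):
    """Expected tokens per target pass: accepted drafts plus one target token."""
    if accept_rate >= 1.0:
        return float(num_draft + 1)
    # 1 + a + a^2 + ... + a^K, written in closed form
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate, num_draft, draft_cost_ratio):
    """Speedup over plain decoding, with one target pass normalized to cost 1."""
    step_cost = 1.0 + num_draft * draft_cost_ratio
    return expected_tokens_per_step(accept_rate, num_draft) / step_cost


for rate in (0.3, 0.5, 0.7, 0.9):
    speedup = estimated_speedup(rate, num_draft=4, draft_cost_ratio=0.1)
    print(f"acceptance {rate:.0%}: ~{speedup:.2f}x")
```

The break-even acceptance rate moves with the draft's relative cost, so the thresholds that matter for a given deployment are best confirmed by measurement rather than read off this model.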

Speculative decoding with quantization. Quantized drafts (smaller and faster) combined with higher-precision targets give particularly large speedups because the draft cost is dominated by memory bandwidth, which quantization reduces directly. INT4 drafts paired with FP8 targets are a common pattern.

Variable speculation length. Adaptive techniques adjust K (number of speculative tokens) based on recent acceptance rates. When acceptance is high, try longer speculations; when it dips, shorten. This squeezes additional performance over fixed-K speculation. Most production engines now implement some form of adaptive speculation by default in 2026.

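Latency vs. throughput trade-off in speculation. Speculation trades extra compute for fewer sequential target-model steps. At low to moderate concurrency, where decode is memory-bound, that is a win for both per-request latency and output tokens per GPU-hour. At high batch sizes the GPU is already compute-bound, and rejected draft tokens become wasted work that can lower aggregate throughput. On the latency side, speculation reduces total wall-clock time for a complete response but does not reduce TTFT, and inter-token latency can become uneven (multiple tokens emitted in a burst, then a verification step). For applications where smooth streaming matters more than total time, careful tuning is needed.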

Chapter 7: Quantization for inference — INT8, FP8, INT4, AWQ, GPTQ

Quantization reduces the precision of model weights from FP16 (16-bit floats) to lower-precision representations: FP8, INT8, INT4. The benefits are large: roughly proportional reductions in memory and (for the right hardware) compute. The risks are also real: quantization can degrade output quality, and the impact varies by model, task, and quantization scheme.

Why quantization helps inference. Decode is memory-bandwidth-bound; reading FP16 weights from memory takes twice as long as reading INT8 weights. A weight-only quantized model (INT8 weights, FP16 activations) achieves nearly 2x decode throughput on memory-bound hardware. INT4 quantization achieves nearly 4x. The model still does FP16 math inside the layers; the savings come from fetching less data per token.
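
The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below assumes a single sequence, ignores KV cache reads and kernel overhead, and uses illustrative numbers (an 8B-parameter model and roughly 3.35 TB/s of HBM bandwidth, in the range of an H100 SXM); real throughput lands below these ceilings.

```python
def decode_tokens_per_second(n_params, bytes_per_weight, bandwidth_bytes_per_s):
    """Upper bound for one sequence: every weight is read once per token."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_weight)


N_PARAMS = 8e9        # assumed 8B-parameter model
BANDWIDTH = 3.35e12   # assumed ~3.35 TB/s of memory bandwidth

ceilings = {
    name: decode_tokens_per_second(N_PARAMS, nbytes, BANDWIDTH)
    for name, nbytes in {"fp16": 2.0, "int8": 1.0, "int4": 0.5}.items()
}
for name, tps in ceilings.items():
    print(f"{name}: ~{tps:.0f} tokens/s ceiling")
```

The ratios between the three ceilings are exactly 2x and 4x, which is where the "nearly 2x" and "nearly 4x" figures come from; dequantization overhead accounts for the "nearly".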

Activation quantization. Quantizing not just weights but activations (the intermediate tensors flowing through layers) brings additional benefits — less memory pressure, potentially faster compute on hardware with INT8 tensor cores. Activation quantization is harder than weight quantization because activations have wider dynamic range and quantization errors compound through layers.

# Loading a quantized model with vLLM (AWQ example)
from vllm import LLM, SamplingParams

# AWQ-quantized model: INT4 weights, FP16 activations.
# The repo ID is an example; substitute the AWQ checkpoint you actually deploy.
llm = LLM(
    model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    quantization="awq",
    dtype="float16",
    max_model_len=4096,
)

# Same call pattern as FP16, just with the quantized weights loaded
outputs = llm.generate(
    ["Explain quantization in two sentences."],
    SamplingParams(max_tokens=200),
)
print(outputs[0].outputs[0].text)

Quantization methods overview. GPTQ is post-training weight quantization that uses calibration data to choose quantization parameters; produces good quality at INT4. AWQ (Activation-aware Weight Quantization) is similar in goal but uses activation statistics to identify important weights and protect them from quantization; typically slightly better quality than GPTQ at INT4. SmoothQuant applies a mathematical transformation to make activation quantization easier, supporting full INT8 (weights and activations). FP8 (E4M3 or E5M2 formats) on H100 and newer GPUs gives near-FP16 quality with significant speed gains, increasingly the default for production deployments. NF4 from QLoRA is mostly used for fine-tuning rather than inference.

Quality impact. Production-quality INT8 weight-only quantization typically loses <1% on benchmarks. INT4 with AWQ loses 1-3% on most benchmarks. FP8 typically loses <0.5%. Aggressive quantization (INT4 weight + INT8 activation) can lose 3-5% on harder benchmarks. The loss varies dramatically by model and task — coding tasks tend to be more sensitive than chat; long-context tasks tend to be more sensitive than short-context. Always benchmark your specific workload before committing to a quantization level.

Quantization on different hardware. NVIDIA H100/H200/B200 have native FP8 tensor cores. INT8 tensor cores date back to Turing (T4) and are well supported on A100. Weight-only INT4 does not run INT4 math: kernels dequantize the weights to FP16 on the fly and multiply in FP16, so the gain comes from reduced memory traffic rather than faster arithmetic. AMD MI300 has FP8 support; Groq's hardware uses INT8. Hardware match matters: FP8 on H100 is roughly free; INT4 on A100 helps memory but doesn't speed compute proportionally.

Calibration data choice. GPTQ and AWQ both use calibration data to estimate quantization parameters. The calibration data should reflect your inference workload. Off-the-shelf quantized models from Hugging Face are calibrated on generic English text; if your workload is code, scientific text, or non-English, recalibrating with representative data can improve quality.

Practical recommendation for 2026. For H100 deployment, use FP8 weights + FP8 activations as the default; it's near-free in quality and gives meaningful speed gains. For memory-constrained deployments (running a 70B model on hardware that can't fit FP16), use INT4 AWQ; accept the 1-3% quality loss for the ability to fit the model. For older hardware, INT8 is the workhorse: W8A8 with SmoothQuant on A100, which has INT8 tensor cores, and weight-only INT8 on GPUs such as V100 that lack them. Neither needs FP8 hardware support.

Per-layer quantization sensitivity. Not all layers tolerate quantization equally. Attention output projections and MLP gate layers tend to be more sensitive than embedding layers or MLP up-projections. Advanced quantization methods (AWQ, GPTQ with mixed precision) detect sensitive layers and keep them at higher precision while aggressively quantizing the rest. This gives better quality at the same average bit-width than uniform quantization.

Quantization-aware fine-tuning. If you control the model lifecycle, training (or fine-tuning) with quantization in mind produces better quantized inference quality than post-training quantization. QLoRA and similar techniques fine-tune models in a way that's compatible with INT4 inference. For mission-critical workloads where quantization quality matters, the additional training investment is often worthwhile.

Outlier handling. The hardest part of activation quantization is outliers — occasional very large activation values that, if quantized normally, produce big errors. SmoothQuant addresses this by mathematically moving the difficulty from activations into weights (which quantize more easily). LLM.int8() takes a different approach: detect outliers at runtime and handle them in higher precision. Production stacks use one approach or the other; both have matured.

Per-token vs. per-tensor quantization. Per-tensor quantization (one scale per tensor) is simpler and faster but loses quality on tensors with wide value ranges. Per-channel quantization (scale per channel of the tensor) preserves more information. Per-token quantization (recompute scale for each token) is the most accurate but adds overhead. Modern engines support per-channel for weights and per-tensor or per-token for activations.
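
A small numerical experiment illustrates the granularity trade-off. The sketch below uses NumPy with synthetic weights, symmetric INT8 rounding, and one deliberately wide channel standing in for an outlier; the shapes and scale factors are arbitrary choices for illustration, not values from any real model.

```python
import numpy as np


def quantize_dequantize(w, scale):
    # Symmetric INT8: round to the nearest step, clip to [-127, 127], map back
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 256))  # 8 output channels
w[3] *= 50.0                              # one channel with a much wider range

# Per-tensor: a single scale, dictated by the widest channel
per_tensor_scale = np.abs(w).max() / 127.0
# Per-channel: one scale per row, so narrow channels keep fine resolution
per_channel_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0

err_tensor = np.abs(quantize_dequantize(w, per_tensor_scale) - w).mean()
err_channel = np.abs(quantize_dequantize(w, per_channel_scale) - w).mean()

print(f"mean abs error, per-tensor:  {err_tensor:.6f}")
print(f"mean abs error, per-channel: {err_channel:.6f}")
```

With a single scale, the wide channel forces a coarse step size on every other channel; per-channel scales confine the damage to the channel that needs the range.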

Chapter 8: Long-context inference — chunking, sliding window, ring attention

Long-context inference — running models with 32K, 100K, or 1M token contexts — has specific performance characteristics that don't appear at shorter contexts. The optimization techniques used for typical 4K-8K inference still apply but become inadequate alone.

The cost of long context. KV cache scales linearly with context length. Compute for prefill scales quadratically (attention is O(n²)). For a 32K-token prompt, prefill alone can take seconds even on an H100. For a 1M-token prompt, naive prefill takes minutes. The optimizations that handle long context efficiently are not optional at scale.
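
The split between linear and quadratic cost can be estimated with two standard approximations: about 2 FLOPs per parameter per token for the weight matmuls, and about 2 × layers × d_model × n² for causal attention (QK^T and the weighted sum over V, halved by the causal mask). The sketch below applies them to assumed Llama-70B-style dimensions; it ignores kernel efficiency, so it shows how cost grows rather than predicting wall-clock time.

```python
def prefill_flops(n_params, n_layers, d_model, n_tokens):
    linear = 2 * n_params * n_tokens                    # grows linearly with n
    attention = 2 * n_layers * d_model * n_tokens ** 2  # grows quadratically with n
    return linear, attention


def crossover_tokens(n_params, n_layers, d_model):
    # Prompt length at which attention FLOPs equal the weight-matmul FLOPs
    return n_params / (n_layers * d_model)


MODEL = dict(n_params=70e9, n_layers=80, d_model=8192)  # assumed dimensions

for n in (4_096, 32_768, 131_072, 1_000_000):
    linear, attention = prefill_flops(n_tokens=n, **MODEL)
    share = attention / (linear + attention)
    print(f"{n:>9} tokens: {linear + attention:.2e} FLOPs, attention share {share:.0%}")

print(f"attention overtakes the matmuls near {crossover_tokens(**MODEL):,.0f} tokens")
```

Below the crossover, prefill cost is dominated by the model weights and scales almost linearly; above it, the quadratic attention term takes over.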

Flash attention. The standard attention computation has memory complexity O(n²) — the full attention matrix is materialized in memory. Flash attention restructures the computation to use O(n) memory by tiling and recomputing intermediate values. The math is identical; the implementation is much more efficient. Modern inference engines (vLLM, TensorRT-LLM, SGLang) use Flash Attention 2 or 3 by default. Without it, long-context inference is impractical.
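
The core trick, processing keys and values tile by tile while keeping a running softmax, can be shown in a few lines of NumPy. This is a conceptual sketch only: it is non-causal, tiles only the key/value dimension, and runs on the CPU, whereas the real kernels also tile queries and keep the tiles in on-chip GPU memory.

```python
import numpy as np


def naive_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])  # full (n_q, n_k) matrix in memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def tiled_attention(q, k, v, tile=64):
    n_q, d = q.shape
    running_max = np.full((n_q, 1), -np.inf)
    running_sum = np.zeros((n_q, 1))
    out = np.zeros((n_q, v.shape[-1]))
    for start in range(0, k.shape[0], tile):
        k_t, v_t = k[start:start + tile], v[start:start + tile]
        scores = q @ k_t.T / np.sqrt(d)             # only (n_q, tile) at a time
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)  # rescale earlier partial sums
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_t
        running_max = new_max
    return out / running_sum


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(5, 16)), rng.normal(size=(200, 16)), rng.normal(size=(200, 8))
max_diff = np.abs(naive_attention(q, k, v) - tiled_attention(q, k, v)).max()
print(f"max difference between naive and tiled: {max_diff:.2e}")
```

The two functions agree to floating-point precision, which is the sense in which the math is identical while the peak memory is not.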

# Chunked prefill in vLLM (managed automatically; this shows the concept)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,  # process this many tokens per step
)

# A 32K prompt would be processed in ~4 chunks instead of one giant batch.
# This keeps ongoing decode work flowing alongside the prefill instead of blocking on it.

Chunked prefill. Long prompts are broken into chunks; each chunk is processed alongside ongoing decode work in the continuous batch. This keeps GPU utilization high during long prefills (avoiding the situation where one huge prompt blocks all decode work) and improves TTFT for other requests. vLLM enables this with enable_chunked_prefill=True.

Sliding window attention. Some models (Mistral 7B, and the local-attention layers of Gemma 2) use sliding window attention where each token only attends to the previous N tokens (typical N: 4096) rather than the entire history. This bounds KV cache and attention compute per step regardless of total sequence length. Trade-off: information from outside the window is lost. For tasks where recency dominates (chatbots, real-time monitoring), sliding window is a strong fit. For tasks needing long-range recall (document QA), it's inappropriate.
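
The windowing rule is easiest to see as an attention mask. The sketch below builds one with NumPy for a toy sequence; the function name and sizes are illustrative, and a production kernel would never materialize the full mask.

```python
import numpy as np


def sliding_window_mask(seq_len, window):
    """True where query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Causal (j <= i) and within the last `window` positions (j > i - window)
    return (j <= i) & (j > i - window)


mask = sliding_window_mask(seq_len=12, window=4)
print(mask.astype(int))

# Keys visible per query never exceed the window, so the KV cache a
# windowed layer needs is bounded by min(seq_len, window) tokens.
print("max keys per query:", int(mask.sum(axis=1).max()))
```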

Ring attention and context parallelism. For ultra-long contexts that don't fit on a single GPU, ring attention distributes the sequence across multiple GPUs and rotates KV-cache shards in a ring pattern. Gemini 1.5/2.0's 1M-2M token contexts are widely assumed to rely on techniques in this family, though the implementation details have not been published. Implementation complexity is high; most teams won't build this themselves but may use providers (Google's Gemini API, Anthropic's long-context Claude) that offer long context as a managed feature.

Context compression. Various techniques compress long contexts before feeding to the model: extractive summarization, learned compression tokens (Gisting), retrieval-augmented chunk selection. These shift the problem from "run the model on a 1M context" to "select the most relevant 32K of context to feed the model". For most use cases, this is more practical than scaling raw context.

KV cache compression for long contexts. Techniques like H2O (Heavy Hitter Oracle) keep only the most important tokens in the KV cache, evicting less-important ones. StreamingLLM keeps the first few attention sinks plus a recent window. These methods extend usable context beyond the model's "trained" context length with controlled quality loss.

Latency implications. A 100K-token prefill on H100 hardware takes anywhere from a few seconds to tens of seconds depending on model size, GPU count, and optimization. For interactive UX, this is too slow; for batch document processing, it's fine. Architect accordingly: long-context tasks may need separate latency budgets, may benefit from pre-processing and caching, may need to communicate progress to the user. Don't treat long-context tasks the same as short ones.

Choosing between long context and RAG. The classic trade-off: 1M-token context vs. RAG over a chunked corpus. Long context wins when the relevant information is spread across the input and you can't predict in advance what's needed. RAG wins on cost (you only process retrieved chunks) and latency (smaller prompt = faster prefill). For most production use cases, well-tuned RAG outperforms long-context inference on the same hardware budget.

Position embedding tricks. Models have been trained with specific position encodings; running them on contexts longer than their training context can produce quality degradation. RoPE (Rotary Position Embedding) extensions like YaRN and dynamic NTK scaling extend usable context length without retraining. The trade-off: extended-context inference may have lower per-token accuracy on tasks requiring precise position understanding.

Long-context evaluation. Standard benchmarks don't fully capture long-context performance. "Needle in a haystack" tests (place a specific fact at a random position in a long context, ask the model to retrieve it) check whether the model uses the entire context. Tasks like long-document QA, multi-document summarization, and code-base understanding measure practical long-context capability. Evaluate your specific use case before committing to a long-context architecture.

Operational considerations. Long-context inference burns more memory per request, meaning fewer concurrent users. Long-context requests have higher tail latency, complicating SLO design. Some workloads benefit from a dedicated long-context inference pool separate from the main pool, sized for the specific request profile.

Cost per token at long contexts. As context grows, total cost per useful output token rises (prefill cost amortizes over fewer output tokens for short responses to long inputs). For pricing-sensitive products, this asymmetry matters — bill users by input+output token, not just output, or risk margin compression on long-context workloads.

Chapter 9: Multi-GPU and multi-node — tensor, pipeline, expert parallelism

Models too large to fit on a single GPU need to be distributed. The three classes of parallelism — tensor, pipeline, expert — each have specific characteristics, trade-offs, and configurations.

Tensor parallelism. Each layer's weights are sharded across multiple GPUs; each GPU holds part of each layer. During the forward pass, GPUs communicate to exchange the partial results that each layer requires. Tensor parallelism keeps every GPU busy on every layer; the latency penalty is the communication overhead. It works best on GPUs with high-bandwidth interconnects (NVLink within a node; 600+ GB/s) and worse on lower-bandwidth links (PCIe within a node that lacks NVLink at ~50-100 GB/s, or InfiniBand/Ethernet between nodes at ~50 GB/s per 400G link).

# Tensor parallelism with vLLM
from vllm import LLM

# 8-GPU tensor parallel for Llama-3.1-405B (FP8 checkpoint).
# FP8 weights alone are ~405 GB, so they need a full 8x80 GB node, not 4 GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",
    tensor_parallel_size=8,  # shard across 8 GPUs
    dtype="bfloat16",        # activation dtype; the weights stay FP8
)

# Inference still feels single-GPU from the user's perspective;
# vLLM handles the parallelism internally.

Pipeline parallelism. Different layers go to different GPUs. GPU 0 holds layers 1-20; GPU 1 holds layers 21-40; etc. The forward pass flows sequentially through the GPUs. Pipeline parallelism is bandwidth-efficient (each GPU passes only the activation tensor between layers, not weight shards) but creates pipeline bubbles — most GPUs are idle most of the time unless you carefully overlap multiple requests in the pipeline. It's the standard choice for multi-node deployment where inter-node bandwidth is limited.
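
The size of the bubble can be estimated with the standard formula for a synchronous, GPipe-style schedule: with p stages and m micro-batches in flight, the idle fraction is (p - 1) / (m + p - 1). The sketch below assumes equal time per stage; inference engines that stream requests continuously behave somewhat differently, so treat this as a rough guide.

```python
def pipeline_bubble_fraction(num_stages, num_microbatches):
    """Fraction of GPU time spent idle in a synchronous pipeline schedule."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


for m in (1, 4, 16, 64):
    idle = pipeline_bubble_fraction(num_stages=4, num_microbatches=m)
    print(f"4 stages, {m:>2} micro-batches in flight: {idle:.0%} idle")
```

With a single request in a 4-stage pipeline, three of the four GPUs are idle at any moment; keeping many requests in flight is what amortizes the bubble away.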

Expert parallelism (for MoE models). Mixture-of-Experts models like Mixtral 8x22B or DeepSeek-V3 have many "experts" — separate networks that the routing layer selects between. Expert parallelism distributes experts across GPUs; each GPU holds a subset of experts. The routing layer determines which GPU(s) each token needs and the tokens are dispatched accordingly. This is efficient when the token-to-expert traffic is balanced; can create hot spots when popular experts overload specific GPUs.

Choosing the right parallelism. For models that fit on a single node (8 H100s = 640 GB), tensor parallelism is usually best. For models requiring multiple nodes, pipeline parallelism across nodes + tensor parallelism within nodes is the standard pattern. For MoE models, expert parallelism plus tensor parallelism is the typical combo.

Communication backends. NCCL (NVIDIA Collective Communications Library) is the standard for multi-GPU communication on NVIDIA hardware. RCCL is the AMD equivalent. Specialized libraries (Megatron-LM, DeepSpeed-Inference) provide higher-level abstractions on top.

Specific operational concerns. Multi-GPU and multi-node deployments add operational complexity: GPU failures during inference, node failures requiring restart, version skew between nodes, network partition recovery. Production deployments need health checks, graceful degradation patterns, and operational runbooks for the failure modes that don't exist in single-GPU setups.

Costs and ROI of multi-GPU. A 4-GPU tensor parallel deployment is ~3.5x the cost of a 1-GPU deployment for ~3.0-3.5x the throughput (some overhead from communication). The marginal benefit decreases with each GPU added; 8-GPU parallel typically achieves ~6.5x throughput of 1-GPU. Beyond 8 GPUs, communication costs dominate unless you have specialized hardware (NVLink switch, InfiniBand 400G).

When to use API providers instead. For models that require multi-node deployment (405B, very large MoE), the operational complexity is significant. Many teams find that using an API provider (OpenAI, Anthropic, Google, Together, Fireworks) is more cost-effective than self-hosting at the largest model scales. The break-even depends on your specific token volume; calculate carefully before committing.

Sequence parallelism. A complementary technique that shards the sequence dimension across GPUs (rather than the model dimension as in tensor parallelism). For very long contexts, sequence parallelism is essential because the activation memory per token (separate from KV cache) becomes substantial. Modern engines combine sequence parallelism with tensor parallelism for long-context inference.

NCCL tuning. Inter-GPU communication via NCCL has many tunable parameters that affect performance. Tree vs. ring algorithms, buffer sizes, and topology configurations can produce 10-20% performance differences. For large multi-GPU deployments, dedicating engineering effort to NCCL tuning is worthwhile. Tools like nccl-tests benchmark your specific topology.

Network topology constraints. NVLink connects GPUs within a node at 600+ GB/s. PCIe between GPUs in nodes without NVLink provides ~50-100 GB/s. Between nodes, InfiniBand 400G provides 400 Gbps = 50 GB/s per link; 800G hardware is rolling out. Tensor parallelism requires high bandwidth; pipeline parallelism tolerates lower bandwidth. Architect your parallelism strategy to match your hardware topology.

Operational maturity differences. Single-GPU deployments are essentially a single Python process; failures are simple. Multi-GPU deployments require distributed process management; one GPU failure typically takes down all GPUs in that tensor-parallel group. Multi-node deployments add the complexity of node-to-node networking, distributed state, and partial-failure modes. Plan operations investment proportional to deployment complexity.

Chapter 10: Caching strategies — prompt caching, prefix sharing, semantic cache

Caching can dramatically reduce both latency and cost when workloads have repeated patterns. Three distinct caching levels — KV cache reuse, prompt caching, semantic caching — operate at different granularity and have different applicability.

Prefix caching (KV cache reuse). When multiple requests share a common prefix, the KV cache for that prefix can be computed once and reused. For a chatbot with a 500-token system prompt, the server pays the prefill cost for those 500 tokens once instead of on every request. Savings can be 30-90% of prefill time for workloads with significant shared prefix.

# Prefix caching in vLLM (enabled with config)
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    enable_prefix_caching=True,  # shared KV cache across requests
)

# Two requests with the same system prompt:
system = "You are a helpful assistant specialized in..."  # 500 tokens
req1 = system + "\n\nUser: Question 1"
req2 = system + "\n\nUser: Question 2"

# vLLM detects the shared prefix and reuses KV cache for the system portion.
# Only the user-specific portion needs prefill compute.

Prompt caching at the API level. OpenAI, Anthropic, and Google all offer prompt caching as an API feature. Depending on the provider, you either mark sections of the prompt as cacheable (Anthropic) or the provider caches matching prefixes automatically (OpenAI); subsequent requests with the same prefix get a discount (typically 50-90% off the input token price) and faster response. Anthropic's Claude prompt cache is particularly mature; OpenAI launched prompt caching as a default feature in 2024 and refined it through 2025-2026.

Cache invalidation rules. KV cache reuse requires an exact token match for the prefix. A single differing token invalidates the cache from that position onward, so everything before the difference can still be reused but nothing after it. The implication: keep system prompts truly static; if you must include dynamic data (current date, user info), put it at the end of the system prompt rather than the beginning to maximize the cacheable prefix.
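
The exact-match rule follows from how block-level prefix caches are typically keyed: each full block of tokens is hashed together with the hash of the block before it. The sketch below shows that general idea with the standard library only; it is not vLLM's actual implementation, and the block size, helper names, and token IDs are illustrative.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so a hash identifies the whole prefix up to it."""
    hashes, parent = [], b""
    n_full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, n_full, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids, cache, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are already in the cache."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # first miss ends the reusable prefix
        hits += 1
    return hits * block_size


system = list(range(64))                # stand-in for a tokenized system prompt
first = system + [1000] * 20
cache = set(block_hashes(first))        # blocks left behind by the first request

same_prefix = system + [2000] * 20      # dynamic content at the end
date_first = [999] + system + [2000] * 20  # dynamic content at the start

print(cached_prefix_tokens(same_prefix, cache))  # system prompt blocks are reused
print(cached_prefix_tokens(date_first, cache))   # one leading token shifts every block
```

Because each hash depends on all earlier blocks, a change near the start of the prompt changes every subsequent hash, which is why dynamic fields belong at the end.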

Semantic cache. A different kind of cache — store completed query-response pairs; for a new query, check if a semantically similar query has been answered before; if yes, return the cached response. Implemented with embeddings: embed the query, find nearest neighbors in the cache, return the cached response if similarity is high enough. Useful for FAQ-style workloads with high redundancy. Risky for workloads where small query differences should produce different responses; the cache match might be too aggressive.

Cache eviction policies. Limited cache memory means old entries need to be evicted. LRU (least recently used) is the standard choice; LFU (least frequently used) can be better for workloads with hot prefixes. Some engines support TTL-based eviction (cache entries expire after N seconds). Production-grade cache management measures actual hit rates and adjusts policies based on observed traffic.

Cache observability. Monitor cache hit rate as a first-class metric. A workload that should hit the cache often but doesn't indicates a configuration problem (likely prefixes diverging) or a workload that's less redundant than expected. Cache hit rate trending down over time indicates drift in user behavior or prompt structure changes.

Multi-tenant caching considerations. In a multi-tenant deployment, sharing the cache across tenants can leak information through timing (a tenant can detect whether another tenant has run a similar query by latency). For security-sensitive deployments, partition the cache by tenant. Cost: less effective caching. Benefit: no cross-tenant information leakage.

Cache warming. For workloads with predictable hot prompts (e.g., a daily report generation that always uses the same system prompt), pre-populating the cache before peak traffic reduces cold-start cost. This is a cheap operational practice with no downside.

Cache hierarchy. Production-grade deployments often have multiple cache tiers: hot tier in GPU memory (newest, most frequently used), warm tier in CPU memory (recently used but evicted from GPU), cold tier in distributed cache (Redis, Memcached) for cross-instance sharing. Each tier has different latency and capacity characteristics. The hierarchy is similar to CPU memory hierarchies and trades capacity for access speed.

Semantic cache pitfalls. The hardest part of semantic caching is getting the similarity threshold right. Too aggressive (low similarity required for cache hit): produces incorrect responses for similar-looking but meaningfully different queries. Too conservative (high threshold): misses obvious cache opportunities. Production semantic caches typically combine threshold tuning with confidence-based fallback — if cache match is borderline, route to the model instead of returning the cached answer.
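
A minimal semantic cache fits in a few lines once embeddings are available. The sketch below uses NumPy and takes embeddings as plain vectors, so the embedding model is out of scope; the class name, the 0.9 threshold, and the two-dimensional toy vectors are illustrative assumptions rather than recommended values.

```python
import numpy as np


class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.keys = []
        self.values = []

    def put(self, embedding, response):
        v = np.asarray(embedding, dtype=float)
        self.keys.append(v / np.linalg.norm(v))  # store unit vectors
        self.values.append(response)

    def get(self, embedding):
        if not self.keys:
            return None
        q = np.asarray(embedding, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q           # cosine similarity to every entry
        best = int(np.argmax(sims))
        # Below the threshold, report a miss so the caller routes to the model
        return self.values[best] if sims[best] >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "cached answer about password resets")

print(cache.get([0.99, 0.05]))  # near-duplicate query
print(cache.get([0.5, 0.5]))    # related but different query
```

Lowering the threshold makes the second query a hit as well, which is exactly the failure mode described above; a linear scan is fine for a sketch, while larger caches would use a vector index.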

Cache invalidation on model changes. When you upgrade the model, all cache entries become stale. Plan for cache flush on model deployment; warm the cache after deploy if cold-start latency matters; expect a temporary cost spike from rebuilding the cache.

Privacy considerations in caching. Caches often contain user queries and responses. For privacy-sensitive deployments, cache content needs the same protections as the underlying data: access control, encryption at rest, retention limits. Some regulated workloads disable certain cache types (semantic cache especially) because of cross-query information leakage risk.

Chapter 11: Streaming inference and time-to-first-token

Streaming — sending tokens to the client as they're generated rather than waiting for the full response — has become standard for chat applications. Implementing it well requires understanding the latency components and the network layer.

The latency components. Time-to-first-token (TTFT) is the wall-clock time from request submission to first response token. Inter-token latency (ITL) is the time between successive tokens once streaming begins. Total wall-clock time is TTFT + (output_tokens × ITL). For a chat experience to feel responsive, TTFT should be under 500ms; ITL should be under 50ms (20+ tokens per second).

TTFT optimization. TTFT is dominated by prefill time, which scales with input length. To minimize TTFT for short responses with long prompts: prompt caching (most impactful), chunked prefill (so other requests don't have to wait), prefix sharing across requests (for shared system prompts), and on the architectural side, putting the prefill servers on faster hardware while decode runs on cheaper hardware (disaggregated serving).

# Streaming completion from vLLM via OpenAI-compatible API
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum tunneling briefly."}],
    stream=True,
    max_tokens=500,
)

for chunk in stream:
    # Some chunks (e.g., a trailing usage chunk) carry no choices or no text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

ITL optimization. ITL is dominated by decode time per token. To minimize: speculative decoding (when applicable), KV cache reduction (quantization), and reducing model size (smaller model = faster decode). Batching helps throughput but can hurt ITL — a larger batch has more sequences competing for memory bandwidth, slowing each individual stream. Production deployments balance latency-sensitive and throughput-sensitive workloads across separate inference fleets.

Transport protocols. Server-Sent Events (SSE) is the standard transport for streaming LLM responses; it works over plain HTTP. WebSockets are another option, more complex but bidirectional. gRPC streaming offers high performance but runs over HTTP/2 with generated client stubs, which browsers and some proxies don't handle without a translation layer. For most use cases, SSE over HTTP is the right choice: it's what OpenAI's API uses and the tooling is mature.

Buffer and timeout tuning. Aggressive output buffering (waiting for many tokens before flushing to the client) can stall the user experience even if generation is fast. Configure HTTP servers to flush small chunks immediately. Some load balancers and CDNs buffer aggressively by default; configure them to pass streaming traffic through unbuffered. Tools like Cloudflare and nginx need specific configuration to handle SSE correctly; in nginx, for example, that means turning proxy_buffering off for the streaming route.

Streaming with structured outputs. Generating JSON or other structured formats with streaming complicates things — the client typically can't parse partial JSON. Solutions: stream tokens but only deliver complete structured chunks; use JSON streaming formats (JSONL or NDJSON) where each line is a complete object; or accept that structured outputs need batch (non-streaming) responses.
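
The NDJSON option is straightforward to consume on the client: buffer incoming text and parse only when a full line has arrived. The sketch below uses the standard library and simulates network chunks with a list of strings; the function name is illustrative.

```python
import json


def iter_ndjson(chunks):
    """Yield one parsed object per complete line, buffering partial lines."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)


# Network chunks rarely align with object boundaries
incoming = ['{"step": 1, "text": "Hel', 'lo"}\n{"step": 2,', ' "text": "world"}\n']
for obj in iter_ndjson(incoming):
    print(obj)
```

Each yielded object is complete and valid, so downstream code never has to handle partial JSON.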

Cancellation. When a user cancels a long-running stream, the server should stop the inference work — not just close the connection while the GPU keeps generating tokens. Production engines should respect HTTP connection close as a cancellation signal; verify your stack does this, especially behind proxies.

Streaming SLOs. P50 TTFT, P95 TTFT, P50 ITL, P95 ITL are the four key streaming SLOs. P50 numbers describe the typical experience; P95 numbers describe the tail, the experience of the slowest 5% of requests, which is not the same as the worst case. Production deployments should track all four and alert when any degrades.
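
All four numbers can be derived from two kinds of timestamps: when each request was sent and when each token arrived. The sketch below uses NumPy and hypothetical timestamps in seconds; in practice these would come from client-side or gateway logs.

```python
import numpy as np


def stream_latencies(request_time, token_times):
    """Return (ttft, itl_array) from a request timestamp and token arrival times."""
    token_times = np.asarray(token_times, dtype=float)
    ttft = float(token_times[0] - request_time)
    itls = np.diff(token_times)  # gaps between successive tokens
    return ttft, itls


def percentile_report(values):
    return {
        "p50": float(np.percentile(values, 50)),
        "p95": float(np.percentile(values, 95)),
    }


# Hypothetical timestamps for three streamed responses
requests = [
    (0.0, [0.30, 0.34, 0.38, 0.42]),
    (0.0, [0.45, 0.49, 0.53, 0.80]),  # one visible stall before the last token
    (0.0, [1.20, 1.24, 1.28, 1.32]),  # slow first token
]

ttfts, all_itls = [], []
for start, times in requests:
    ttft, itls = stream_latencies(start, times)
    ttfts.append(ttft)
    all_itls.extend(itls)

ttft_report = percentile_report(ttfts)
itl_report = percentile_report(all_itls)
print("TTFT:", ttft_report)
print("ITL: ", itl_report)
```

In this toy data the median ITL looks healthy while the P95 exposes the stall, which is why the tail percentiles need their own alerts.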

Token timing variance. Even in a well-tuned system, ITL is not constant — token N+1 takes slightly different time than token N depending on batch composition, memory access patterns, and other concurrent requests. Variance is fine; spikes that produce noticeable pauses in the stream are user-experience problems. Track the standard deviation of ITL across tokens, not just the mean.

Streaming and structured tool calls. Modern LLM applications often involve tool calls — the model produces structured output that triggers a function. Streaming such output is complex: you want to start executing the tool as soon as its arguments are complete, not wait for the entire response. Frameworks like LangChain and LlamaIndex implement streaming tool calls; SGLang has native primitives for this pattern.

Backpressure handling. If a client consumes the stream slowly (slow network, slow rendering), the server must either buffer tokens (memory pressure) or stop generating (waste of GPU work). Production streaming implementations handle backpressure by signaling the generation engine to pause when buffers fill. This is a subtle implementation detail that catches many production deployments off-guard.
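
The pause-when-full behavior can be sketched with a bounded `asyncio.Queue`: the producer's `put` simply waits while the buffer is full, so generation cannot run ahead of the client by more than the buffer size. This is a toy model under stated assumptions; the producer stands in for the generation engine, and real engines expose their own scheduling hooks.

```python
import asyncio
from typing import List, Tuple


async def producer(queue: asyncio.Queue, n_tokens: int, depth_log: List[int]) -> None:
    """Stand-in for the generation loop; waits whenever the buffer is full."""
    for i in range(n_tokens):
        await queue.put(i)              # suspends here while the queue is full
        depth_log.append(queue.qsize())
    await queue.put(None)               # end-of-stream sentinel


async def slow_consumer(queue: asyncio.Queue, delay: float) -> List[int]:
    """Stand-in for a slow client that takes `delay` seconds per token."""
    received = []
    while True:
        token = await queue.get()
        if token is None:
            return received
        received.append(token)
        await asyncio.sleep(delay)


async def run_stream(n_tokens: int = 20, max_buffer: int = 4) -> Tuple[List[int], int]:
    """Run both sides; return the tokens received and the peak buffer depth."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_buffer)
    depth_log: List[int] = []
    task = asyncio.create_task(producer(queue, n_tokens, depth_log))
    received = await slow_consumer(queue, delay=0.001)
    await task
    return received, max(depth_log)
```

The peak depth never exceeds `max_buffer`, which bounds per-stream memory regardless of how slow the client is.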

Stream resumability. Long-running streams can drop due to network issues. A robust streaming API supports resumption — the client provides a token offset, the server continues from that point. Implementing this requires storing the generated state long enough for clients to resume. Worth the complexity for high-value use cases (long-form content generation, complex agent runs); overkill for chat.
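
A minimal sketch of the server-side state that resumption needs is shown below. `ResumableStreamStore` is a hypothetical in-memory helper; a production version would add expiry and shared storage, which are omitted here.

```python
from typing import Dict, List


class ResumableStreamStore:
    """In-memory log of generated tokens per stream (illustrative only)."""

    def __init__(self) -> None:
        self._tokens: Dict[str, List[str]] = {}

    def append(self, stream_id: str, token: str) -> None:
        """Record a newly generated token for the stream."""
        self._tokens.setdefault(stream_id, []).append(token)

    def resume(self, stream_id: str, offset: int) -> List[str]:
        """Return the tokens the client has not yet seen, starting at offset."""
        tokens = self._tokens.get(stream_id)
        if tokens is None:
            raise KeyError(f"unknown or expired stream: {stream_id}")
        if offset < 0 or offset > len(tokens):
            raise ValueError("offset out of range")
        return tokens[offset:]
```

The client tracks how many tokens it has received and sends that count as the offset when it reconnects.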

Chapter 12: Inference engines compared — vLLM, TensorRT-LLM, SGLang, TGI

The choice of inference engine drives most of the achievable performance and operational characteristics. Four major engines dominate the production landscape in 2026, with overlapping but distinct strengths.

Engine	Primary author	Strengths	Trade-offs
vLLM	UC Berkeley / open-source community	PagedAttention, OpenAI-compatible API, best general-purpose throughput, broad model support	Single-engine focus; less peak performance than TensorRT-LLM on NVIDIA-specific paths
TensorRT-LLM	NVIDIA	Highest peak performance on NVIDIA H100/H200/B200; deeply optimized FP8	NVIDIA-only; steeper engineering investment; less flexible model support
SGLang	UC Berkeley / Stanford	Structured generation, constrained decoding, fast for JSON/regex outputs	Smaller community; newer; less mature operational tooling
TGI	Hugging Face	Tight Hugging Face ecosystem integration; production tooling; multi-model support	Performance has trailed vLLM in 2025-2026 benchmarks

vLLM. The de-facto default for self-hosted LLM serving in 2026. PagedAttention is its defining innovation. Continuous batching, prefix caching, FP8 quantization, speculative decoding, tensor parallelism are all supported. The OpenAI-compatible API makes drop-in deployment straightforward. Community is active; releases are frequent; documentation is comprehensive. For 80% of production deployments, vLLM is the right starting point.

# Production-ready vLLM startup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --quantization fp8 \
  --port 8000

TensorRT-LLM. NVIDIA's first-party inference library. Achieves the highest peak performance on NVIDIA hardware — typically 10-30% faster than vLLM on specific models well-tuned for TensorRT-LLM. Trade-offs: NVIDIA-only (won't run on AMD or other accelerators); engine compilation takes time and is model-specific; supports a narrower set of models than vLLM; less ergonomic for rapid iteration. Use TensorRT-LLM when you have a stable production model and want every last bit of performance.

SGLang. Differentiated by structured generation — constrained decoding for JSON, regex, grammar-bounded outputs. Faster than other engines on these specific workloads because it integrates the constraint checking into the decoding step. For agent applications heavy on tool calls and structured outputs, SGLang is increasingly the default choice.

TGI. Hugging Face's inference server. Strong in the HF ecosystem; tight integration with HF Hub, datasets, and the broader stack. Performance has trailed vLLM in 2025-2026 head-to-head benchmarks, though the gap has narrowed with TGI 3.0. For teams already deep in the HF tooling stack, TGI provides operational comfort that may outweigh raw performance differences.

Other engines worth knowing. LMDeploy (Shanghai AI Lab) is strong for Chinese-language deployments. DeepSpeed-Inference (Microsoft) has specific strengths in extreme model scaling. OpenLLM (BentoML) wraps multiple backends with deployment tooling. The community continues to fragment; expect new entrants and consolidation through 2026.

How to choose. Start with vLLM unless you have specific reasons to choose otherwise. If you need maximum performance and are NVIDIA-locked, evaluate TensorRT-LLM. If your workload is structured output heavy, evaluate SGLang. If you're deep in HF tooling and the performance gap doesn't matter for your scale, TGI is fine. Avoid changing engines mid-project; the operational tax of migration is significant.

Benchmarking your specific workload. Engine benchmarks in the wild are useful but workload-dependent. Always benchmark on your actual model with your actual request distribution. Tools like genai-perf, vllm-benchmark, and Open Perflab provide standardized benchmarking. Plan a one-week benchmarking exercise before committing to an engine for production.

Engine update cadence. vLLM ships major releases roughly monthly; TensorRT-LLM has slower release cycles tied to NVIDIA's TensorRT updates; SGLang ships frequently; TGI follows Hugging Face's broader cadence. Production deployments should pin specific engine versions and update on a tested cadence rather than chasing latest. Pinning prevents regression surprises; updating periodically captures performance improvements.

Engine compatibility with new models. New models sometimes need engine updates for proper support — new architectures (MLA, mixture-of-depths), new tokenizers, new normalization schemes. The engine that supports the model you want to deploy on day one matters; check ahead before model selection.

Custom kernels. The absolute frontier of inference performance lives in custom CUDA kernels written for specific operations. Most teams don't write these directly; the major engines bundle high-performance kernels (Flash Attention, RMSNorm fused with matrix multiply, etc.). For teams pushing the limits, engaging with engine maintainers or writing custom kernels yields the last few percent of performance.

Engine choice and tooling ecosystem. Around each engine sits an ecosystem of monitoring tools, deployment templates, Kubernetes operators, and best-practice configurations. vLLM has the broadest ecosystem in 2026; TensorRT-LLM has NVIDIA's official enterprise tools; TGI has Hugging Face Inference Endpoints. Don't underweight ecosystem when selecting; the supporting tools save real engineering time.

Chapter 13: Hardware choices in 2026 — H100, H200, B200, MI300, Groq

Hardware selection drives more of inference economics than any other single choice. The 2026 landscape has more options than 2024 had but remains NVIDIA-dominated for production deployments.

Hardware	HBM	HBM bandwidth	Peak FP8 performance	Typical role
NVIDIA H100 (SXM)	80 GB	3.35 TB/s	~4 PFLOPS (sparse)	Production workhorse; widely available
NVIDIA H200	141 GB	4.8 TB/s	~4 PFLOPS (sparse)	H100 successor; better for memory-heavy workloads
NVIDIA B200	192 GB	~8 TB/s	~9 PFLOPS (sparse)	Blackwell flagship; deploying through 2026
AMD MI300X	192 GB	5.3 TB/s	~2.6 PFLOPS (dense)	NVIDIA alternative; growing software support
Groq LPU	~230 MB SRAM	~80 TB/s on-chip	N/A	Extreme low-latency inference; specific use cases
NVIDIA A100	80 GB	2.0 TB/s	No FP8	Older fleet; cost-effective for INT8 workloads

NVIDIA H100. The workhorse of production inference in 2026. Wide software support (vLLM, TensorRT-LLM, TGI, SGLang all tune for H100), broad cloud availability, mature operational stack. FP8 tensor cores, which A100 lacks, make FP8 inference on H100 roughly 2x faster than 16-bit inference on A100. Pricing has dropped through 2025-2026 as supply caught up; spot pricing for H100 instances ranges $2-$4 per GPU-hour on most clouds. For most teams, H100 is the default.

NVIDIA H200. The "H100 with more memory" upgrade. 141 GB HBM vs. H100's 80 GB enables longer contexts and larger batches without sharding. Compute throughput is comparable to H100. Pricing typically ~20-30% above H100. Right choice for memory-heavy workloads (long context, large models) where H100 hits memory limits.

NVIDIA B200. The Blackwell-generation flagship. 192 GB HBM, ~2x the FP8 performance of H100. Started shipping in volume late 2025 / early 2026; cloud availability ramping through 2026. Pricing is 2-3x H100 currently; expected to normalize as supply scales. The right choice for new builds targeting the highest performance and pricing predictability for 2027+.

AMD MI300X. AMD's challenger. 192 GB HBM is a meaningful spec advantage over H100. Software support has improved dramatically in 2025-2026 — vLLM, TGI, and llama.cpp all support MI300 now with reasonable performance. Performance gap vs. H100 has narrowed but typically remains 10-20% behind on equivalent workloads. Pricing tends to undercut H100 by 20-30%. Worth evaluating for cost-optimized self-hosted deployments; expect to do more engineering work than on NVIDIA.

Groq LPU. Specialized inference hardware that achieves very low latency by keeping the entire model in SRAM. Token generation latency: ~10-15 ms vs. ~30-50 ms on H100. Trade-off: small per-chip memory means you need many Groq chips to host a single large model (hundreds of chips for 70B+ models). Pricing through Groq's API service is competitive for the specific use case of low-latency inference. Not a general replacement for GPU-based deployments.

Other specialized hardware. Cerebras (very large wafer-scale chips, good for training), SambaNova (full-stack AI systems for enterprise), Tenstorrent (RISC-V-based, evolving software stack), Intel Gaudi 3 (still building software ecosystem). All have specific positioning; none have broad production traction comparable to NVIDIA in 2026.

Cloud vs. on-premises. The major clouds (AWS, GCP, Azure) all offer H100 and H200 instances. AMD MI300 is available on selected providers. Spot/preemptible pricing can be significantly cheaper than on-demand. On-premises deployment makes sense at very high sustained utilization (roughly 60%+ per GPU, around the clock) where the multi-year amortization beats cloud rates. Below that threshold, cloud is more cost-effective.

Selecting for your workload. Start with the question: what model do I need to serve, at what concurrency, at what latency SLO? Calculate the memory needed for the model + KV cache + activations. Calculate the throughput needed. Map both to hardware options. Often the answer is "H100 with appropriate parallelism"; sometimes it's "H200 because we need the memory"; sometimes it's "B200 because we're sizing for 2027"; sometimes it's "Groq because latency dominates everything else". Don't over-optimize; the right hardware is usually obvious once requirements are specified.
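
The memory side of that calculation can be sketched in a few lines. The formulas are the standard ones: weights take parameters times bytes per parameter, and the KV cache stores one key and one value vector per layer per token. The example shape (80 layers, 8 KV heads, head dimension 128) is Llama-3.1-70B's published configuration. Activation memory and framework overhead are not modeled, so treat the result as a lower bound.

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Weight footprint in GB (1e9 bytes): 2 bytes/param for FP16, 1 for FP8."""
    return n_params_billion * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache bytes for one token; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, batch_size: int, bytes_per_value: int = 2) -> float:
    """Worst-case KV cache in GB if every sequence fills the whole context."""
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value)
    return per_token * context_len * batch_size / 1e9
```

For the 70B shape with a 16-bit KV cache, each token costs 327,680 bytes. At an 8192-token context and 64 concurrent sequences, the worst case is about 171.8 GB of KV cache on top of 70 GB of FP8 weights, which is why such a configuration is spread over several 80 GB GPUs.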

GPU generation transitions. New GPU generations bring meaningful performance gains. H100 → H200 added memory (same compute). H100 → B200 added both memory and compute. Each transition cycle (~18-24 months) shifts the cost-performance frontier. Production fleets should plan for hardware refreshes; an older fleet still running on V100 or A100 is paying a significant performance tax relative to current hardware.

Supply chain and availability. Through 2024, H100 was severely supply-constrained; through 2025, supply normalized; in 2026, supply is good but pricing remains elevated relative to historical GPU economics. B200 supply is constrained in 2026 and will normalize through 2027. Plan procurement with realistic supply timelines.

Power and cooling. H100/H200 each draw ~700W; B200 ~1000W. At fleet scale, power and cooling infrastructure matters. Data centers designed for older GPU generations may not have the power density for modern hardware. On-premises deployments must plan facility upgrades; cloud deployments inherit the cloud provider's infrastructure investments.

Specialized inference hardware economics. Groq, Cerebras, and similar specialized hardware can deliver dramatic results for the right workload but at premium pricing. The economics work when the latency or throughput improvement directly creates business value (real-time applications, time-sensitive batch jobs). For typical workloads, NVIDIA GPUs win on cost-per-token; specialized hardware wins on latency or peak throughput for specific scenarios.

Chapter 14: Production deployment — autoscaling, monitoring, SLOs

An optimized inference engine is necessary but not sufficient for production. The operational layer — autoscaling, monitoring, SLOs, incident response — turns inference capacity into a reliable service. The patterns differ from typical web services because of LLM-specific characteristics.

Autoscaling. LLM inference has slow cold starts (loading a 70B model takes 30-120 seconds) and high per-instance cost. Naive horizontal autoscaling that adds and removes instances reactively wastes money and produces request failures during scale-up. Better: predictive autoscaling based on traffic patterns; pre-warmed pools of standby instances; queue-based admission control that absorbs traffic spikes.

# Queue-based admission with HPA on Kubernetes (simplified)
# Custom metric: pending requests in queue
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: pending_requests
      target:
        type: AverageValue
        averageValue: "5"  # scale up if avg queue depth >5 per pod
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30  # respond to spikes
    scaleDown:
      stabilizationWindowSeconds: 600  # don't churn on dips
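
To reason about how the HPA will react to a given queue depth, it helps to reproduce its scaling rule. The sketch below follows the formula in the Kubernetes documentation, desired = ceil(current replicas x current metric / target metric), with the default 10% tolerance band and the min/max bounds from the manifest above. The function name is hypothetical, and stabilization windows are not modeled.

```python
import math


def desired_replicas(current_replicas: int, current_avg_queue_depth: float,
                     target_avg: float = 5.0, min_replicas: int = 2,
                     max_replicas: int = 20, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for an AverageValue pod metric."""
    ratio = current_avg_queue_depth / target_avg
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 pods averaging 12 pending requests each against a target of 5, the rule asks for 10 pods. Given 30-120 second model load times, that jump is where pre-warmed standby capacity matters.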

Monitoring. Beyond standard service metrics (request rate, error rate, latency), inference-specific signals matter. GPU utilization (target: 80%+ during peak; below 50% means you're paying for idle capacity). KV cache utilization (target: 60-80%; near 100% causes evictions and slows everything). Batch size achieved (target: close to max_num_seqs during peak). TTFT and ITL P50/P95/P99 (the latency story users actually experience).

SLO definition. Sample SLOs for an interactive chat application: P95 TTFT < 1 second; P95 ITL < 80 ms; error rate < 0.5%; availability > 99.9% over rolling 30-day window. For a batch document processing job: throughput > 1M tokens per hour per instance; cost per million tokens < $5; success rate > 99%. Choose SLOs that match how the service is actually used; don't copy-paste numbers from someone else's reference architecture.

Cost tracking. Per-request cost depends on input tokens, output tokens, and the hardware running the inference. Production systems should attribute cost to tenants, request types, or business lines. Token-level cost tracking lets you identify expensive workloads, set per-tenant budgets, and price your downstream product appropriately. Cost should be a first-class observability signal alongside latency.
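
A minimal sketch of per-tenant cost attribution follows. The `UsageRecord` shape and the per-million-token prices are illustrative assumptions; in practice the records would come from request logs and the prices from your own hardware cost model.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    tenant: str
    input_tokens: int
    output_tokens: int


def cost_by_tenant(records: Iterable[UsageRecord],
                   input_price_per_m: float,
                   output_price_per_m: float) -> Dict[str, float]:
    """Sum request costs per tenant; prices are in dollars per million tokens."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        cost = (rec.input_tokens * input_price_per_m
                + rec.output_tokens * output_price_per_m) / 1_000_000
        totals[rec.tenant] += cost
    return dict(totals)
```

Pricing input and output tokens separately matters because decode is typically far more expensive per token than prefill.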

Capacity planning. Inference capacity needs lead time to provision. For owned GPUs, lead time is 8-16 weeks. For reserved cloud instances, weeks. For on-demand cloud, minutes (but at higher cost). Map expected demand 6-12 months out; reserve baseline capacity; use on-demand for peaks. Misestimating demand in either direction is expensive — over-provisioning burns money; under-provisioning produces incidents.

Multi-region. For global services, deploy inference in multiple regions to reduce user latency. Route requests to the nearest region. Coordinate model deployments across regions to maintain version parity. The operational complexity increases significantly; only adopt multi-region when latency requirements clearly justify it.

Model lifecycle. Models get updated; the production rollout of a new model version needs careful management. Blue-green deployment (run old and new versions in parallel; gradually shift traffic). A/B testing (route a fraction of traffic to the new version; compare metrics). Canary release (deploy to a small fraction of users first; monitor; expand). Production-grade inference services support all three patterns.
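
For canary and A/B rollouts, a common approach is to assign users to a version deterministically by hashing a stable identifier, so the same user always sees the same model. The sketch below is one way to do this; the function name and the version labels are hypothetical.

```python
import hashlib


def assign_version(user_id: str, canary_fraction: float) -> str:
    """Map a user to 'canary' or 'stable' using a stable hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "canary" if bucket < canary_fraction else "stable"
```

Because each user's bucket is fixed, raising `canary_fraction` from 0.05 to 0.25 keeps the original canary users on the new version and only adds more, which keeps metric comparisons clean during the ramp.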

Disaster recovery. What happens when a GPU fails? When a node fails? When a region fails? Test the failure modes; have runbooks. The mean time to failure for a single GPU under sustained load is finite: think months to years, not decades. At fleet scale, that means you'll see failures regularly; the question is whether they cause user-visible incidents.

Routing and traffic management. Production deployments often have multiple model variants (different sizes, different quantizations) optimized for different use cases. A request router examines the request — input length, expected output length, user tier, latency requirements — and routes to the appropriate backend. Smart routing extracts significant cost savings; lazy routing pays for the most expensive backend on every request.

# Simple request routing example
class RequestRouter:
    def __init__(self, backends):
        # backends = {"fast": small_model_endpoint, "smart": large_model_endpoint}
        self.backends = backends

    def route(self, request):
        input_length = len(request.prompt.split())
        complexity = self.estimate_complexity(request.prompt)

        if input_length < 100 and complexity < 0.3:
            return self.backends["fast"]  # use small model for easy queries
        elif complexity > 0.7 or request.priority == "high":
            return self.backends["smart"]  # use large model for hard queries
        else:
            return self.backends["fast"]  # default to fast

    def estimate_complexity(self, prompt):
        # Simple heuristic; production would use a classifier
        signals = [
            len(prompt.split()) > 500,
            "code" in prompt.lower(),
            "analyze" in prompt.lower(),
            "compare" in prompt.lower(),
        ]
        return sum(signals) / len(signals)

Graceful degradation. When the inference service can't meet SLOs (overloaded, partial failure, dependency issues), the right behavior is graceful degradation rather than hard failure. Options: shed lower-priority traffic; respond with a smaller faster model; cache previous responses; return a degraded but useful answer. The exact strategy depends on use case; the principle is universal — failures should be soft rather than hard.

Model versioning and A/B testing. Production inference often runs multiple model versions simultaneously to test improvements. The serving layer needs to support: tagging traffic by version, comparing metrics across versions, rolling out new versions gradually. Tooling exists (Seldon Core, BentoML, vLLM-Router) but most production deployments build custom routing layers tuned to their specific needs.

Auditability requirements. Regulated industries (finance, healthcare) need to log every inference request and response for audit. Log retention can become a significant cost (LLM responses can be long; high-volume services produce TB of logs per day). Plan storage tiers, retention policies, and search infrastructure for the audit data; don't treat it as an afterthought.

Multi-model serving. Many production deployments serve more than one model. Routing requests to the right model, sharing GPU capacity efficiently across models, and managing model lifecycle (loading, unloading, replacing) becomes substantial operational work. Tools like Ray Serve and TorchServe specialize in this; vLLM's multi-model support has matured through 2025-2026 but remains less developed than single-model deployment.

Chapter 15: Common mistakes in inference deployments

Recurring failure patterns across teams deploying optimized LLM inference. Knowing them in advance avoids weeks of rework.

Mistake 1: skipping benchmarking. Teams adopt vLLM, deploy it, never measure what they actually achieve. Six months later they discover they could be serving 3x the traffic on the same hardware with proper tuning. Benchmark systematically before going to production; benchmark periodically once in production.

Mistake 2: maximum context length set too high. Setting max_model_len to the model's full capability (e.g., 128K for Llama-3.1) when your actual usage is 4K wastes memory and reduces batch sizes dramatically. Set it to your actual P99 usage plus a buffer; revisit when usage patterns change.
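
A simple way to derive the setting from logs is sketched below. It assumes you have recorded the total tokens per request (prompt plus generated, since max_model_len covers both); the 25% buffer and the rounding to a multiple of 256 are arbitrary illustrative choices.

```python
import math
from typing import Sequence


def suggest_max_model_len(observed_lengths: Sequence[int],
                          buffer_fraction: float = 0.25,
                          round_to: int = 256) -> int:
    """Nearest-rank P99 of observed request lengths, padded and rounded up."""
    ordered = sorted(observed_lengths)
    n = len(ordered)
    rank = max(1, (99 * n + 99) // 100)   # integer ceil(0.99 * n)
    p99 = ordered[rank - 1]
    padded = p99 * (1 + buffer_fraction)
    return int(math.ceil(padded / round_to) * round_to)
```

Requests longer than the resulting limit will be rejected, so pair this with monitoring of rejected-request counts.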

Mistake 3: gpu_memory_utilization too high. Setting vLLM's GPU memory utilization to 0.95+ leaves no headroom for activations and produces OOM under load. Stay in the 0.85-0.92 range and monitor actual peak usage.

Mistake 4: ignoring prefix caching. Workloads with shared system prompts can get 50-90% prefill cost reduction with prefix caching enabled — but it's off by default in some configurations. Verify it's on; verify it's actually hitting.

Mistake 5: wrong quantization. Aggressive quantization (INT4) deployed without quality benchmarking can produce subtle output degradation that doesn't show up in basic tests. Test quantized models on your specific workload, not just standard benchmarks.

Mistake 6: speculative decoding with the wrong draft. A draft model that doesn't correlate with the target produces speculation overhead with no acceptance benefit. Measure acceptance rate; if it's low (<40%), the draft is wrong.
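
To see why a low acceptance rate erases the benefit, the sketch below uses the standard estimate from the speculative decoding literature: if each draft token is accepted independently with probability alpha and the draft proposes k tokens per step, the expected tokens emitted per target-model pass is (1 - alpha^(k+1)) / (1 - alpha). The independence assumption is a simplification, and engines may define their reported acceptance metric slightly differently.

```python
def acceptance_rate(accepted_draft_tokens: int, proposed_draft_tokens: int) -> float:
    """Fraction of proposed draft tokens that the target model accepted."""
    if proposed_draft_tokens <= 0:
        raise ValueError("no draft tokens proposed")
    return accepted_draft_tokens / proposed_draft_tokens


def expected_tokens_per_step(alpha: float, draft_len: int) -> float:
    """Expected tokens per target forward pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)
```

With a draft length of 4, an acceptance rate of 0.4 yields about 1.65 tokens per target pass, while 0.8 yields about 3.36. The first figure is unlikely to cover the cost of running the draft model.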

Mistake 7: insufficient concurrency. Running with batch size 1 because "we want low latency" wastes most of the GPU. Latency under low load is fine with batching; the latency hit from batching is small and the throughput gain is huge.

Mistake 8: not measuring TTFT separately from ITL. Aggregate latency hides important problems. A workload with great ITL but bad TTFT feels janky; a workload with great TTFT but bad ITL feels slow. Track both.

Mistake 9: streaming through buffered proxies. SSE streams pass through HTTP infrastructure; if any layer buffers, the user experience suffers regardless of how fast the inference is. Verify end-to-end streaming behavior, not just from the inference server.

Mistake 10: forgetting to set step limits on agent workloads. An agent that should take 10 steps but goes wrong and takes 200 produces a per-task cost spike. Hard step limits and per-task cost budgets prevent runaway costs.
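
A hard step limit and a cost budget can be enforced in the agent loop itself. In the sketch below, `step_fn` is a hypothetical callable that runs one agent step and reports whether the task is done and what the step cost; the default limits are placeholders.

```python
from typing import Callable, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits its step limit or cost budget."""


def run_agent(step_fn: Callable[[], Tuple[bool, float]],
              max_steps: int = 10,
              max_cost_usd: float = 1.00) -> Tuple[int, float]:
    """Run steps until done; abort on either limit. Returns (steps, cost)."""
    total_cost = 0.0
    for step in range(1, max_steps + 1):
        done, step_cost = step_fn()
        total_cost += step_cost
        if total_cost > max_cost_usd:
            raise BudgetExceeded(f"cost budget exceeded at step {step}")
        if done:
            return step, total_cost
    raise BudgetExceeded(f"step limit of {max_steps} reached")
```

Raising an explicit exception, rather than silently returning, makes runaway tasks visible in error-rate dashboards.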

Mistake 11: deploying the optimized version once then never re-tuning. Workload patterns drift; engine versions improve; hardware changes. Configuration that was optimal a year ago is probably not optimal now. Schedule periodic re-tuning.

Mistake 12: hardware mismatch. Buying H200 when H100 would suffice (wasted memory budget); buying H100 when H200 is needed (running into context limits). Right-size hardware to actual workload, with some buffer for growth.

Mistake 13: skipping observability investment. Without per-request traces, you can't debug performance issues; without per-tenant cost attribution, you can't price your product; without alerting on SLO violations, you discover incidents after users do.

Mistake 14: choosing engine based on marketing rather than benchmarks. The "fastest" engine in vendor benchmarks may not be fastest on your workload. Always benchmark in your environment with your data.

Mistake 15: ignoring tokenizer overhead. Tokenization at request time and detokenization at response time adds latency that can be material at high throughput. Use the engine's batched tokenization paths; avoid tokenizing on the application side and then re-encoding. Small detail; cumulative impact.

Chapter 16: FAQ

vLLM or TensorRT-LLM for production?

Default to vLLM for most production deployments. It's faster to set up, more flexible, broadly supported, and produces 80-90% of TensorRT-LLM's peak performance with much less engineering investment. Move to TensorRT-LLM when you have a stable model, are NVIDIA-locked, and need the last 10-20% performance for cost reasons.

How much can I improve inference performance through optimization?

From a naive deployment to a properly tuned one: typically 5-15x throughput improvement on the same hardware. From a properly tuned one to a highly tuned one: another 1.5-2x. Beyond that, the gains require specialized techniques (custom kernels, model architecture changes) that aren't worth the effort for most teams.

What's the minimum hardware to run a 70B model?

FP16: 2x H100 (160 GB total memory). FP8: 1x H100 (80 GB). INT4 AWQ: 1x H100 with significant headroom. For comfortable production deployment with longer contexts and reasonable batch sizes: 2x H100 with tensor parallelism, FP8 quantization, prefix caching enabled.

Is it worth self-hosting if I'm using less than X tokens per month?

Below ~100M tokens per month, API providers are almost always cheaper than self-hosting once you account for engineering time, operational overhead, and hardware idle time. Above ~500M tokens per month, self-hosting becomes competitive. Between those is a gray zone: it depends on specific economics, regulatory requirements, and team expertise.

How do I benchmark properly?

Use a realistic request distribution (not just one prompt repeated). Match production concurrency. Measure TTFT, ITL, throughput, error rate. Run for at least an hour to capture warmup and steady state. Tools: genai-perf, vllm-benchmark, locust, custom load generators. Don't trust the throughput numbers shown in marketing materials — they're from idealized workloads that don't match production.

How does prompt caching at the API level work?

For OpenAI: prompts over 1024 tokens are eligible; cached prefixes get discounted input pricing (typically 50%). For Anthropic: explicit cache_control markers in the prompt; cached content gets 90% discount on input pricing after initial cache write. For Google: cache feature is available with specific configuration. All have minimum prompt sizes and cache TTLs you should learn for your provider.

What's disaggregated serving and should I use it?

Disaggregated serving runs prefill (heavy, compute-bound) on different servers from decode (light, memory-bound). Each can be optimized for its workload. The benefit: better hardware utilization and lower P99 latency. The cost: significantly more operational complexity. Worth it for very large deployments with strict latency SLOs; not worth it for most teams in 2026 — single-engine continuous batching is good enough.

How do I handle a model that doesn't fit on one GPU?

Tensor parallelism within a node (across 2, 4, or 8 GPUs connected by NVLink). Pipeline parallelism across nodes (when single-node memory isn't enough). Expert parallelism for MoE models. Modern engines (vLLM, TensorRT-LLM) handle the parallelism configuration; you specify the topology and they do the rest. For very large models (405B+), expect significant ops work to get right.

What's the deal with FP8 vs INT8 vs INT4?

FP8 (on H100+) is near-free in quality and gives meaningful speed gains; the default for new deployments. INT8 weight-only works on older hardware (A100, V100) with small quality impact. INT4 (AWQ, GPTQ) dramatically reduces memory at 1-3% quality cost; right choice when you can't fit FP16. Don't quantize without benchmarking your specific workload.

Should I use streaming or batch responses?

Use streaming for interactive applications where the user is waiting in real time. Use batch (non-streaming) responses for background processing, agent tool calls that need the full response before acting, and structured outputs that need to parse cleanly. Most production systems support both modes.

How do I handle traffic spikes without overprovisioning?

Queue-based admission control absorbs short spikes without rejecting requests. Pre-warmed standby pools cover medium spikes. Burst to on-demand cloud capacity covers large spikes (at higher cost). The right mix depends on your traffic pattern; instrument and tune.

What's a reasonable per-GPU monthly cost in 2026?

On-demand cloud H100: ~$1500-$3000/month per GPU. Reserved 1-year: ~$1000-$1800/month. Reserved 3-year or on-premises: ~$500-$900/month amortized. The exact numbers vary by cloud and region. The math: at $2000/month per H100 serving Llama-70B at 3000 tokens/sec aggregate output, a 30-day month yields about 7.8 billion tokens, so the per-token cost is roughly $0.26 per million tokens at full utilization, which is competitive with API pricing for many use cases.
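
The arithmetic behind a per-token figure like this is easy to check directly. The helper below is a sketch that assumes a 30-day month and exposes utilization as a parameter, because real fleets rarely run at 100%.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def cost_per_million_tokens(gpu_cost_per_month: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Dollars per million output tokens for one GPU at a given utilization."""
    tokens_per_month = tokens_per_second * SECONDS_PER_MONTH * utilization
    return gpu_cost_per_month / (tokens_per_month / 1_000_000)
```

At $2000/month and 3000 tokens/sec this gives about $0.26 per million tokens at full utilization and about $0.51 at 50% utilization, so idle time roughly doubles the effective price.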

What's the future of inference optimization?

Specialized hardware (Groq, Cerebras) for extreme latency or throughput. Continued software optimization (kernel improvements, scheduling refinements). Model architecture changes that are inherently more efficient (state-space models, mixture-of-depths). Easier multi-node deployment as tooling matures. Expect 2-3x cumulative cost reduction over the next 2-3 years from these combined improvements, on top of price drops from underlying hardware progress.

How do I migrate an existing deployment to an optimized stack?

Plan a parallel deployment. Stand up the optimized stack alongside the existing one. Route a small percentage of traffic to the new stack and measure latency, throughput, quality, and cost in parallel with the old. Increase traffic gradually as confidence builds. Migration without parallel deployment risks production incidents from configuration errors that didn't surface in pre-production testing. Budget 2-4 weeks for a serious migration, longer if you have many models or strict SLOs.

How does inference relate to training infrastructure?

They use the same hardware (mostly) but have different optimization targets. Training optimizes for batch throughput on long fixed sequences; inference optimizes for low latency on variable-length requests. Many production teams run separate infrastructure for training and inference, since the optimal configurations diverge. Some specialized hardware (Groq, certain TPU configurations) is inference-only.

What's the relationship between inference optimization and model architecture?

Tight. Models designed with inference in mind (GQA, sliding window attention, mixture-of-experts with efficient routing) have inherently better inference characteristics than older designs. State-space models (Mamba, Mamba-2) eliminate the quadratic attention bottleneck entirely, and hybrid architectures reduce it, at the cost of some quality. Through 2026-2027, expect architecture choices to be increasingly driven by inference economics rather than pure benchmark performance.

Closing thoughts

LLM inference optimization in 2026 is a mature discipline with well-understood techniques, established tooling, and clear best practices — but the gap between teams that apply it well and teams that don't is enormous. The fundamentals (prefill vs. decode, KV cache, continuous batching) are not complicated; the implementation work (choosing the right engine, configuring it for your workload, monitoring the right metrics, iterating based on real data) is what separates outcomes. Start with vLLM, benchmark systematically, instrument observability from day one, treat optimization as ongoing rather than one-shot, and remember that the goal isn't peak performance in a benchmark — it's reliable, cost-effective service to your users at the scale your business needs. The teams that internalize this — investing in the boring engineering of measurement, tuning, and operations rather than chasing every new optimization paper — ship infrastructure that scales economically through whatever the next year brings in models, hardware, and demand patterns.

Table of Contents