Writing

Engineering decisions behind production ML systems.

2025

Tensor Parallelism: How Large Models Fit Across GPUs

Data parallelism hits a wall when your model doesn't fit on one GPU. Tensor parallelism solves this by sharding the model itself.
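The core move can be sketched in a few lines of NumPy: split a weight matrix column-wise across devices, compute partial outputs independently, then concatenate. The two-way split and shapes below are illustrative, not any particular framework's API.

```python
# Column-parallel sharding of one linear layer across 2 "GPUs",
# simulated with NumPy arrays. A minimal sketch, not a real framework.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations (replicated on both devices)
W = rng.standard_normal((512, 1024))     # full weight, too big for one "GPU"

shards = np.split(W, 2, axis=1)          # each device holds half the columns
partials = [x @ w for w in shards]       # computed independently per device
y = np.concatenate(partials, axis=1)     # the all-gather along the column dim

assert np.allclose(y, x @ W)             # sharded result matches the full matmul
```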

Why Your GPU Utilization Is Lower Than You Think

A batch size of 1 on an A100 typically achieves 10-20% utilization. Understanding why—and how to fix it—is key to cost-effective inference.
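A roofline back-of-envelope makes the gap concrete. The peak numbers below are approximate A100 80GB datasheet figures, not measurements:

```python
# Roofline sketch for batch-1 decode on an A100-class GPU.
# Peak numbers are approximate datasheet values; real kernels vary.

peak_flops = 312e12     # FP16 tensor-core FLOP/s (A100 80GB SXM, approx.)
peak_bw = 2.0e12        # HBM bandwidth, bytes/s (approx.)

# Ridge point: arithmetic intensity needed to saturate compute.
ridge = peak_flops / peak_bw            # ~156 FLOPs per byte

# Batch-1 decode is matrix-vector: each FP16 weight (2 bytes) is read
# once and used for 2 FLOPs (one multiply-add).
intensity_batch1 = 2 / 2                # 1 FLOP per byte

print(f"ridge point: {ridge:.0f} FLOPs/byte, batch-1 GEMV: "
      f"{intensity_batch1:.0f} FLOP/byte -> far below it, bandwidth-bound")
```

Batching raises the intensity (each weight read serves more tokens), which is the lever behind most of the fixes.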

The Two Phases of LLM Inference: Prefill and Decode

Time-to-first-token is compute-bound. Token generation is memory-bound. Understanding this split is key to optimizing inference.
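The decode side has a hard bandwidth floor you can compute in two lines. Model size and bandwidth below are illustrative assumptions (a 7B-parameter model in FP16, A100-class HBM):

```python
# Sketch of the decode bandwidth floor: at batch 1, every generated
# token must stream the full weights from HBM once. Illustrative numbers.

params = 7e9
weight_bytes = params * 2          # FP16: 2 bytes per parameter
hbm_bw = 2.0e12                    # bytes/s, A100-class HBM (approx.)

t_token = weight_bytes / hbm_bw    # seconds per decoded token, lower bound
print(f"decode floor: {t_token * 1e3:.0f} ms/token, "
      f"~{1 / t_token:.0f} tok/s max at batch 1")

# Prefill amortizes that same weight read over every prompt token at
# once, which is why time-to-first-token is compute-bound instead.
```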

Flash Attention: Why Memory Access Patterns Matter More Than FLOPs

Flash Attention doesn't reduce computation—it reduces memory traffic. Understanding why that matters is key to optimizing transformers.
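A rough traffic count shows the size of the win. This is an idealized model for one attention head (it ignores Flash Attention's K/V tile re-reads), with assumed sequence length and head dimension:

```python
# Idealized HBM-traffic comparison for one attention head, FP16.
# Standard attention materializes the n x n score matrix in HBM;
# Flash Attention keeps score tiles in on-chip SRAM. Illustrative only.

n, d = 8192, 128               # sequence length, head dim (assumed)
b = 2                          # bytes per FP16 element

qkv = 3 * n * d * b            # read Q, K, V
out = n * d * b                # write the output

# Standard: write scores, read for softmax, write probs, read for PV.
scores_traffic = 4 * n * n * b
standard = qkv + out + scores_traffic

# Flash (idealized): only Q, K, V, O ever touch HBM.
flash = qkv + out

print(f"standard: {standard / 1e9:.2f} GB")
print(f"flash:    {flash / 1e9:.2f} GB  (~{standard / flash:.0f}x less traffic)")
```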

MHA vs GQA vs MLA: A Visual Guide to Attention Mechanisms

Multi-Head, Grouped-Query, and Multi-Head Latent Attention explained with memory calculations you can verify yourself.
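Here is the kind of calculation involved, for MHA vs GQA. The config loosely follows Llama-2-70B (80 layers, 64 query heads, head_dim 128) with an FP16 cache:

```python
# KV-cache arithmetic you can check by hand. Config loosely follows
# Llama-2-70B (80 layers, 64 query heads, head_dim 128); FP16 cache.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=64,
                   head_dim=128, dtype_bytes=2):
    # Factor of 2: both K and V are cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

ctx = 4096
mha = kv_cache_bytes(ctx, n_kv_heads=64)  # MHA: one KV head per query head
gqa = kv_cache_bytes(ctx, n_kv_heads=8)   # GQA: 8 query heads share a KV head

print(f"MHA cache @ {ctx} tokens: {mha / 2**30:.2f} GiB")
print(f"GQA cache @ {ctx} tokens: {gqa / 2**30:.2f} GiB ({mha // gqa}x smaller)")
```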

Mixture of Experts: The Scaling Strategy Behind Modern LLMs

DeepSeek-V3 routes each token across 256 experts; Llama 4 Maverick uses 128. Why expert count matters, and what the tradeoffs actually are.
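The basic arithmetic behind the tradeoff: total parameters grow with expert count, while per-token compute tracks only the activated experts. The per-expert and shared parameter counts below are hypothetical, loosely shaped like a DeepSeek-V3-class model (256 routed experts, top-8 routing):

```python
# Sparse MoE parameter arithmetic. Per-expert and shared sizes are
# hypothetical stand-ins, not any published model's exact numbers.

n_experts = 256
top_k = 8                    # experts activated per token
expert_params = 2.2e9        # hypothetical params per expert (MoE FFN)
shared_params = 10e9         # hypothetical attention + shared params

total = shared_params + n_experts * expert_params
active = shared_params + top_k * expert_params

print(f"total:  {total / 1e9:.0f}B params")
print(f"active: {active / 1e9:.0f}B per token "
      f"({active / total:.1%} of the model works on each token)")
```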

Pre-Norm vs Post-Norm: Why Normalization Placement Matters

GPT-2 popularized Pre-Norm. Now models like OLMo 2 and Gemma 3 are switching back to Post-Norm. What changed?
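The two placements differ in one line. The rmsnorm and the toy sublayer below stand in for real attention/FFN blocks; this is a sketch, not a model:

```python
# Pre-Norm vs Post-Norm residual blocks, side by side. A toy sketch:
# rmsnorm and a scalar sublayer stand in for the real blocks.
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sublayer(x):
    return 0.5 * x                     # stand-in for attention or the FFN

def pre_norm(x):                       # GPT-2 placement: normalize the input;
    return x + sublayer(rmsnorm(x))    # the residual path stays untouched

def post_norm(x):                      # original Transformer placement:
    return rmsnorm(x + sublayer(x))    # normalize after the residual add

x = np.array([[1.0, -2.0, 3.0]])
print("pre-norm: ", pre_norm(x))       # raw residual scale survives
print("post-norm:", post_norm(x))      # output re-normalized every block
```

The clean residual path in Pre-Norm is what makes deep stacks trainable without warmup tricks; the per-block re-normalization in Post-Norm is what the newer variants are revisiting.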

Linear Attention: Is O(n) Worth the Accuracy Tradeoff?

Standard attention is O(n²). Linear variants promise O(n). Qwen3-Next and Kimi Linear use hybrid approaches. Here's what you give up.
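The scaling gap per attention head, with rough constants (the point is the n/d ratio, not the absolute FLOP counts):

```python
# FLOP scaling per attention head: softmax forms n x n scores,
# linear attention maintains a d x d state. Constants are rough.

d = 128                                # head dimension (assumed)
for n in (1_024, 32_768, 1_048_576):   # sequence lengths
    quadratic = 4 * n * n * d          # QK^T and PV: two n x n matmuls
    linear = 4 * n * d * d             # per-token d x d state update + readout
    print(f"n={n:>9,}: softmax/linear FLOP ratio = {quadratic // linear:,}x")
```

The ratio is just n/d: negligible at short context, three to four orders of magnitude at million-token context, which is why the long-context models hybridize.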