Writing

Engineering decisions behind production ML systems.

2025

Tensor Parallelism: How Large Models Fit Across GPUs

Data parallelism hits a wall when your model doesn't fit on one GPU. Tensor parallelism solves this by sharding the model itself.
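The core move can be sketched in a few lines of NumPy: split a weight matrix column-wise across devices, compute partial outputs independently, then concatenate. The two-way split and shapes below are illustrative, not any particular framework's API.

```python
# Column-parallel sharding of one linear layer across 2 "GPUs",
# simulated with NumPy arrays. A minimal sketch, not a real framework.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))        # activations (replicated on both devices)
W = rng.standard_normal((512, 1024))     # full weight, too big for one "GPU"

shards = np.split(W, 2, axis=1)          # each device holds half the columns
partials = [x @ w for w in shards]       # computed independently per device
y = np.concatenate(partials, axis=1)     # the all-gather along the column dim

assert np.allclose(y, x @ W)             # sharded result matches the full matmul
```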

Why Your GPU Utilization Is Lower Than You Think

A batch size of 1 on an A100 typically achieves 10-20% utilization. Understanding why—and how to fix it—is key to cost-effective inference.
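A roofline back-of-envelope makes the gap concrete. The peak numbers below are approximate A100 80GB datasheet figures, not measurements:

```python
# Roofline sketch for batch-1 decode on an A100-class GPU.
# Peak numbers are approximate datasheet values; real kernels vary.

peak_flops = 312e12     # FP16 tensor-core FLOP/s (A100 80GB SXM, approx.)
peak_bw = 2.0e12        # HBM bandwidth, bytes/s (approx.)

# Ridge point: arithmetic intensity needed to saturate compute.
ridge = peak_flops / peak_bw            # ~156 FLOPs per byte

# Batch-1 decode is matrix-vector: each FP16 weight (2 bytes) is read
# once and used for 2 FLOPs (one multiply-add).
intensity_batch1 = 2 / 2                # 1 FLOP per byte

print(f"ridge point: {ridge:.0f} FLOPs/byte, batch-1 GEMV: "
      f"{intensity_batch1:.0f} FLOP/byte -> far below it, bandwidth-bound")
```

Batching raises the intensity (each weight read serves more tokens), which is the lever behind most of the fixes.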

The Two Phases of LLM Inference: Prefill and Decode

Time-to-first-token is compute-bound. Token generation is memory-bound. Understanding this split is key to optimizing inference.
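The decode side has a hard bandwidth floor you can compute in two lines. Model size and bandwidth below are illustrative assumptions (a 7B-parameter model in FP16, A100-class HBM):

```python
# Sketch of the decode bandwidth floor: at batch 1, every generated
# token must stream the full weights from HBM once. Illustrative numbers.

params = 7e9
weight_bytes = params * 2          # FP16: 2 bytes per parameter
hbm_bw = 2.0e12                    # bytes/s, A100-class HBM (approx.)

t_token = weight_bytes / hbm_bw    # seconds per decoded token, lower bound
print(f"decode floor: {t_token * 1e3:.0f} ms/token, "
      f"~{1 / t_token:.0f} tok/s max at batch 1")

# Prefill amortizes that same weight read over every prompt token at
# once, which is why time-to-first-token is compute-bound instead.
```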

Flash Attention: Why Memory Access Patterns Matter More Than FLOPs

Flash Attention doesn't reduce computation—it reduces memory traffic. Understanding why that matters is key to optimizing transformers.
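A rough traffic count shows the size of the win. This is an idealized model for one attention head (it ignores Flash Attention's K/V tile re-reads), with assumed sequence length and head dimension:

```python
# Idealized HBM-traffic comparison for one attention head, FP16.
# Standard attention materializes the n x n score matrix in HBM;
# Flash Attention keeps score tiles in on-chip SRAM. Illustrative only.

n, d = 8192, 128               # sequence length, head dim (assumed)
b = 2                          # bytes per FP16 element

qkv = 3 * n * d * b            # read Q, K, V
out = n * d * b                # write the output

# Standard: write scores, read for softmax, write probs, read for PV.
scores_traffic = 4 * n * n * b
standard = qkv + out + scores_traffic

# Flash (idealized): only Q, K, V, O ever touch HBM.
flash = qkv + out

print(f"standard: {standard / 1e9:.2f} GB")
print(f"flash:    {flash / 1e9:.2f} GB  (~{standard / flash:.0f}x less traffic)")
```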

MHA vs GQA vs MLA: A Visual Guide to Attention Mechanisms

Multi-Head, Grouped-Query, and Multi-Head Latent Attention explained with memory calculations you can verify yourself.
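Here is the kind of calculation involved, for MHA vs GQA. The config loosely follows Llama-2-70B (80 layers, 64 query heads, head_dim 128) with an FP16 cache:

```python
# KV-cache arithmetic you can check by hand. Config loosely follows
# Llama-2-70B (80 layers, 64 query heads, head_dim 128); FP16 cache.

def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=64,
                   head_dim=128, dtype_bytes=2):
    # Factor of 2: both K and V are cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

ctx = 4096
mha = kv_cache_bytes(ctx, n_kv_heads=64)  # MHA: one KV head per query head
gqa = kv_cache_bytes(ctx, n_kv_heads=8)   # GQA: 8 query heads share a KV head

print(f"MHA cache @ {ctx} tokens: {mha / 2**30:.2f} GiB")
print(f"GQA cache @ {ctx} tokens: {gqa / 2**30:.2f} GiB ({mha // gqa}x smaller)")
```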

Mixture of Experts: The Scaling Strategy Behind Modern LLMs

DeepSeek-V3 routes each token across 256 experts; Llama 4 Maverick uses 128. Why expert count matters, and what the tradeoffs actually are.
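The basic arithmetic behind the tradeoff: total parameters grow with expert count, while per-token compute tracks only the activated experts. The per-expert and shared parameter counts below are hypothetical, loosely shaped like a DeepSeek-V3-class model (256 routed experts, top-8 routing):

```python
# Sparse MoE parameter arithmetic. Per-expert and shared sizes are
# hypothetical stand-ins, not any published model's exact numbers.

n_experts = 256
top_k = 8                    # experts activated per token
expert_params = 2.2e9        # hypothetical params per expert (MoE FFN)
shared_params = 10e9         # hypothetical attention + shared params

total = shared_params + n_experts * expert_params
active = shared_params + top_k * expert_params

print(f"total:  {total / 1e9:.0f}B params")
print(f"active: {active / 1e9:.0f}B per token "
      f"({active / total:.1%} of the model works on each token)")
```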

Pre-Norm vs Post-Norm: Why Normalization Placement Matters

GPT-2 popularized Pre-Norm. Now models like OLMo 2 and Gemma 3 are switching back to Post-Norm. What changed?
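The two placements differ in one line. The rmsnorm and the toy sublayer below stand in for real attention/FFN blocks; this is a sketch, not a model:

```python
# Pre-Norm vs Post-Norm residual blocks, side by side. A toy sketch:
# rmsnorm and a scalar sublayer stand in for the real blocks.
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sublayer(x):
    return 0.5 * x                     # stand-in for attention or the FFN

def pre_norm(x):                       # GPT-2 placement: normalize the input;
    return x + sublayer(rmsnorm(x))    # the residual path stays untouched

def post_norm(x):                      # original Transformer placement:
    return rmsnorm(x + sublayer(x))    # normalize after the residual add

x = np.array([[1.0, -2.0, 3.0]])
print("pre-norm: ", pre_norm(x))       # raw residual scale survives
print("post-norm:", post_norm(x))      # output re-normalized every block
```

The clean residual path in Pre-Norm is what makes deep stacks trainable without warmup tricks; the per-block re-normalization in Post-Norm is what the newer variants are revisiting.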

Linear Attention: Is O(n) Worth the Accuracy Tradeoff?

Standard attention is O(n²). Linear variants promise O(n). Qwen3-Next and Kimi Linear use hybrid approaches. Here's what you give up.
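The scaling gap per attention head, with rough constants (the point is the n/d ratio, not the absolute FLOP counts):

```python
# FLOP scaling per attention head: softmax forms n x n scores,
# linear attention maintains a d x d state. Constants are rough.

d = 128                                # head dimension (assumed)
for n in (1_024, 32_768, 1_048_576):   # sequence lengths
    quadratic = 4 * n * n * d          # QK^T and PV: two n x n matmuls
    linear = 4 * n * d * d             # per-token d x d state update + readout
    print(f"n={n:>9,}: softmax/linear FLOP ratio = {quadratic // linear:,}x")
```

The ratio is just n/d: negligible at short context, three to four orders of magnitude at million-token context, which is why the long-context models hybridize.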