Running 1M-token context on a single GPU (the math)

Source: DEV Community
Most people dismiss million-token context windows as a hardware problem. It is not. It is a math problem — and the math has a solution.

## The Raw Numbers

A 70B model stores its KV cache at 2 bytes per element (fp16). With 96 layers, 64 heads, and a head dimension of 128, the KV cache per token is:

```
bytes_per_token = 2 * num_layers * 2 * num_heads * head_dim * bytes_per_element
                = 2 * 96 * 2 * 64 * 128 * 2
                = 6,291,456 bytes ≈ 6 MB/token
```

At 1M tokens: 6 TB. Two H100s hold 160 GB combined. You are 37× short.

## The Compression Table

| Model | Context     | No compression | 5×       | 10×    | 17×    | 33×    |
|-------|-------------|----------------|----------|--------|--------|--------|
| 7B    | 1M tokens   | 420 GB         | 84 GB    | 42 GB  | 25 GB  | 13 GB  |
| 13B   | 1M tokens   | 780 GB         | 156 GB   | 78 GB  | 46 GB  | 24 GB  |
| 70B   | 1M tokens   | 6,000 GB       | 1,200 GB | 600 GB | 353 GB | 182 GB |
| 70B   | 128K tokens | 768 GB         | 154 GB   | 77 GB  | 45 GB  | 23 GB  |

At 17× compression, 70B at 128K tokens (45 GB) already fits on a single H100, and so does 13B at 1M tokens (46 GB). The 1M-token case for 70B needs more: about 50× compression brings it down to 120 GB, which fits on 2× H100, and about 100× brings it to 60 GB, fitting on a single H100 (80 GB).

## The Python Formula

```python
def kv_cache_gb(
    model_params_b,        # e.g. 70 for 70B
    context_length,        # e.g. 1_000_000
    compression_ratio=1,   # Nexus
```
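The `kv_cache_gb` snippet above breaks off mid-signature, so here is a minimal end-to-end sketch of what such a helper can look like. It follows the per-token formula from the raw-numbers section; the 70B configuration (96 layers, 64 heads, head_dim 128) is the one stated there, while the 7B and 13B entries are illustrative assumptions, not numbers from the original.

```python
# Sketch of a kv_cache_gb helper. Only the 70B configuration is
# taken from the text; the 7B and 13B rows are assumptions.
MODEL_CONFIGS = {
    7:  dict(num_layers=32, num_heads=32, head_dim=128),   # assumed
    13: dict(num_layers=40, num_heads=40, head_dim=128),   # assumed
    70: dict(num_layers=96, num_heads=64, head_dim=128),   # from the text
}

def kv_cache_gb(
    model_params_b,        # e.g. 70 for 70B
    context_length,        # e.g. 1_000_000
    compression_ratio=1,   # e.g. 17 for 17x KV-cache compression
    bytes_per_element=2,   # fp16
):
    cfg = MODEL_CONFIGS[model_params_b]
    # Per-token KV-cache size, matching the formula above
    # (the first 2 covers K and V).
    bytes_per_token = (
        2 * cfg["num_layers"] * 2 * cfg["num_heads"]
        * cfg["head_dim"] * bytes_per_element
    )
    total_bytes = bytes_per_token * context_length / compression_ratio
    return total_bytes / 2**30  # binary GB, as the 37x figure implies

# 70B at 1M tokens, uncompressed: ~5,859 GB (rounded to 6 TB above),
# roughly 37x the 160 GB that two H100s hold combined.
print(round(kv_cache_gb(70, 1_000_000)))        # 5859
print(round(kv_cache_gb(70, 1_000_000) / 160))  # 37
```

Returning binary GB (dividing by 2³⁰) is what reproduces the 37× shortfall figure; dividing by 10⁹ instead would give about 6,291 GB and a ratio closer to 39×.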