Inference Networking Requirements
Answer
Frontier AI inference networking requirements are complex and highly dependent on model architecture, parallelism strategy, and latency constraints. The minimum domain size -- the number of GPUs that must be tightly coupled via NVLink-class interconnect (TB/s per GPU) to serve a single inference request -- ranges from 8 GPUs (optimistic) to 72 GPUs (conservative) for frontier models in 2026, and is growing toward 144 GPUs with NVIDIA's Rubin platform.
The critical finding for orbital feasibility: inference is not embarrassingly parallel at the request level for frontier models. A single inference request on a 671B MoE model like DeepSeek R1 requires at minimum 8 GPUs (with FP8 quantization) and optimally 64 GPUs in a single NVLink domain for production throughput. The all-to-all communication pattern required by expert parallelism demands NVLink-class bandwidth (1.8 TB/s per GPU, 130 TB/s aggregate) -- roughly 1,300x the aggregate bandwidth demonstrated by Google's Suncatcher optical inter-satellite links (800 Gbps = 0.1 TB/s per link pair). However, smaller models (8B-70B) with quantization can run on 1-8 GPUs within a single satellite, and pipeline parallelism across satellites at 800 Gbps is theoretically feasible for latency-tolerant workloads.
This creates a bifurcated feasibility picture: orbital compute is well-suited for small-to-medium model inference (where the domain fits within a single satellite) and for embarrassingly parallel scale-out (many independent small requests), but faces fundamental bandwidth constraints for frontier MoE model inference requiring wide expert parallelism across satellites.
Evidence
Scale-Up Domain Size for Current Frontier Models
[evidence:premai-parallelism-guide-2026.1] Llama 405B in FP8 requires a minimum of 8x H100 80GB GPUs (640 GB total VRAM). In BF16, it requires 16 GPUs across 2 nodes. With INT4 quantization, 4x A100 80GB (320 GB) suffices. [premai-parallelism-guide-2026.1]
[evidence:premai-parallelism-guide-2026.2] DeepSeek-V3/R1 (671B MoE) requires 8x H100 80GB minimum in FP8 (671 GB weights), or 8x H200 141GB for BF16 (1,342 GB weights). The model has 256 experts but only activates ~37B parameters per token. [premai-parallelism-guide-2026.2]
[evidence:nvidia-dynamo-moe-inference.1] For DeepSeek R1's wide-EP decode, the optimal configuration distributes 256 experts across 64 GPUs (4 experts per GPU), requiring all 64 GPUs within a single NVLink domain. This requires the GB200 NVL72 rack. [nvidia-dynamo-moe-inference.1]
[evidence:nvidia-gb200-specs.1] The GB200 NVL72 connects 72 Blackwell GPUs with 5th-generation NVLink at 1.8 TB/s per GPU, providing 130 TB/s aggregate all-to-all bandwidth within the rack. [nvidia-gb200-specs.1]
[evidence:nvidia-rubin-cpx-nvl144.1] NVIDIA's next-generation Vera Rubin NVL144 CPX (available late 2026) doubles the domain size to 144 GPUs with NVLink 6.0 at 3.6 TB/s per GPU. It delivers 100 TB of fast memory and 1.7 PB/s of memory bandwidth in a single rack. The platform is purpose-built for million-token context inference. [nvidia-rubin-cpx-nvl144.1]
[evidence:nvidia-rubin-cpx-nvl144.2] Rubin Ultra (2027) will further increase to NVLink 7.0 with a 144-GPU NVLink domain delivering 15 exaFLOPS of FP4 inference. [nvidia-rubin-cpx-nvl144.2]
Tensor Parallelism Bandwidth Requirements
[evidence:premai-parallelism-guide-2026.3] Tensor parallelism (TP) requires two all-reduce synchronization operations per transformer layer. Llama 70B has 80 layers = 160 sync points per forward pass. NVLink is "effectively mandatory" for TP beyond TP=2. On PCIe 5.0 (128 GB/s), communication consumes 40-50% of inference time at TP=4. [premai-parallelism-guide-2026.3]
[evidence:premai-parallelism-guide-2026.4] TP scaling efficiency: TP=2 achieves 85-95% efficiency; TP=4 achieves 70-85%; TP=8 achieves 56-75%. At TP=8, 25-44% of potential speedup is lost to communication overhead even on NVLink. [premai-parallelism-guide-2026.4]
[evidence:nvidia-nvlink-supercharge-inference.1] A single Llama 3.1 70B inference query requires up to 20 GB of TP synchronization data transferred from each GPU. At batch size 32, NVSwitch-equipped H100 systems achieved 168 tok/s/GPU vs 112 tok/s/GPU without NVSwitch -- a 1.5x improvement. [nvidia-nvlink-supercharge-inference.1]
[evidence:premai-parallelism-guide-2026.5] Bandwidth viability thresholds for tensor parallelism: NVLink 4.0 (900 GB/s) = "excellent, TP=8 works well"; NVLink 3.0 (600 GB/s) = "good, TP=8 acceptable"; PCIe 5.0 (128 GB/s) = "marginal, TP=2 max"; PCIe 4.0 (64 GB/s) = "poor, avoid TP". [premai-parallelism-guide-2026.5]
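The viability thresholds above can be sanity-checked with a back-of-envelope model. A minimal sketch, assuming a ring all-reduce (per-GPU traffic of 2(N-1)/N times the message size), Llama 70B's 8192 hidden dimension, FP16 activations, batch size 1, and bandwidth-only costs -- all of these are illustrative assumptions, not figures from the cited guide:

```python
# Back-of-envelope: pure communication time per decoded token under
# tensor parallelism. Assumptions (illustrative, not from the sources):
# ring all-reduce, hidden=8192 (Llama 70B), FP16, batch 1, 2 all-reduces
# per layer x 80 layers, per-hop latency ignored.

def allreduce_time_per_token(tp, bw_gbytes_per_s, hidden=8192, layers=80,
                             bytes_per_elem=2, allreduces_per_layer=2):
    """Seconds of TP all-reduce traffic per token, bandwidth-only."""
    msg = hidden * bytes_per_elem                  # one activation vector
    per_gpu_volume = 2 * (tp - 1) / tp * msg       # ring all-reduce traffic
    total = per_gpu_volume * allreduces_per_layer * layers
    return total / (bw_gbytes_per_s * 1e9)

nvlink4 = allreduce_time_per_token(tp=8, bw_gbytes_per_s=900)  # "excellent"
pcie5   = allreduce_time_per_token(tp=8, bw_gbytes_per_s=128)  # "marginal"
print(f"NVLink 4.0: {nvlink4*1e6:.1f} us/token, PCIe 5.0: {pcie5*1e6:.1f} us/token")
```

These bandwidth-only figures understate the PCIe penalty: each of the 160 sync points also pays a fixed latency cost, which is consistent with the guide's observation that communication consumes 40-50% of inference time at TP=4 on PCIe.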
Pipeline Parallelism and Lower-Bandwidth Tolerance
[evidence:premai-parallelism-guide-2026.6] Pipeline parallelism (PP) uses point-to-point transfers between adjacent stages (not all-to-all), requiring far less bandwidth than TP. PP works on PCIe systems. Multi-node standard pattern: TP within nodes (NVLink), PP across nodes (100 Gbps Ethernet minimum, InfiniBand preferred). [premai-parallelism-guide-2026.6]
[evidence:premai-parallelism-guide-2026.7] PP has a "bubble problem": with PP=4, each GPU sits idle 75% of the time for a single request. Continuous batching mitigates this with concurrent traffic, but single-request latency is always worse with PP than TP. [premai-parallelism-guide-2026.7]
[evidence:ai-dc-networking-gpu-clusters.1] PP generates "more predictable, structured traffic flows between consecutive pipeline stages" via point-to-point Send/Recv operations, compared to TP's all-to-all AllGather and ReduceScatter collectives. [ai-dc-networking-gpu-clusters.1]
Expert Parallelism (MoE) Networking Requirements
[evidence:nvidia-wide-ep-nvl72.1] Wide-EP on DeepSeek R1 with EP=32 achieves 1.8x more output tokens/sec/GPU than EP=8 at 100 tokens/sec per user. Wide-EP distributes fewer experts per GPU, freeing HBM for KV cache and increasing batch capacity. [nvidia-wide-ep-nvl72.1]
[evidence:nvidia-wide-ep-nvl72.2] "Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of [token-gather] communication pattern would make large-scale EP impractical." The all-to-all operations during the MoE phase "can quickly saturate an already memory-bound decode phase." [nvidia-wide-ep-nvl72.2]
[evidence:nvidia-moe-frontier-models.1] On NVL72, frontier MoE models (Kimi K2 Thinking, DeepSeek-R1, Mistral Large 3) achieve 10x performance improvement over HGX H200 systems. Prior to NVL72, the max NVLink domain was 8 GPUs (H200); EP beyond 8 GPUs required "higher-latency scale-out networking" which bottlenecked performance. [nvidia-moe-frontier-models.1]
[evidence:semianalysis-inferencex-v2.1] "All top tier labs are already using disaggregated inferencing and wide expert parallelism" -- including OpenAI, Anthropic, xAI, Google DeepMind, and DeepSeek. Single-node inference is insufficient for frontier production deployment. [semianalysis-inferencex-v2.1]
[evidence:semianalysis-inferencex-v2.2] DeepSeek R1 EP8 (single node) places 32 experts/layer/GPU; EP64 (8 nodes) places 4 experts/layer/GPU. Wider EP yields three compounding benefits: reduced expert weight footprint frees HBM for KV cache, higher tokens-per-expert improves arithmetic intensity, and aggregate HBM bandwidth scales linearly with GPU count. [semianalysis-inferencex-v2.2]
Prefill vs Decode Phase Differences
[evidence:nvidia-dynamo-moe-inference.2] NVIDIA Dynamo's disaggregated serving separates compute-bound prefill from memory-bound decode onto different GPUs. For DeepSeek R1, disaggregated serving achieved 6x throughput gain in the medium-latency regime vs co-located approaches. [nvidia-dynamo-moe-inference.2]
[evidence:nvidia-dynamo-moe-inference.3] Prefill benefits from low tensor parallelism to reduce communication overhead. Decode benefits from high tensor parallelism or wide EP to improve memory operations. Different parallelism configurations are optimal for each phase. [nvidia-dynamo-moe-inference.3]
[evidence:lmsys-gb200-deepseek-part2.1] SGLang on GB200 NVL72 achieved 26,156 input tokens/sec/GPU (prefill) and 13,386 output tokens/sec/GPU (decode) for DeepSeek R1 with FP8 attention and NVFP4 MoE -- a 3.8x and 4.8x speedup vs H100 settings. Configuration used 48 decode ranks and 2-4 prefill ranks per instance. [lmsys-gb200-deepseek-part2.1]
KV Cache and Memory Requirements
[evidence:premai-parallelism-guide-2026.8] Llama 70B KV cache formula: 2 x 80 layers x 8 KV heads x 128 dim x seq_len x batch_size x bytes_per_element. At 32K context, batch 8, FP16: ~42 GB KV cache on top of model weights. For a single request at 32K context: ~5 GB KV cache. [premai-parallelism-guide-2026.8]
[evidence:premai-parallelism-guide-2026.9] Rule of thumb: reserve 40-50% of VRAM beyond model weights for KV cache and runtime overhead. KV cache scales linearly with context length and batch size. [premai-parallelism-guide-2026.9]
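The quoted formula is easy to turn into a calculator. A minimal sketch using the Llama 70B shape cited above (80 layers, 8 KV heads, 128 head dimension); note that the ~5 GB single-request and ~42 GB batch-8 figures reproduce under this formula with 1-byte (FP8) cache elements, so the element size is exposed as a parameter rather than fixed:

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim
#                   x seq_len x batch x bytes_per_element
# Llama 70B shape per the cited guide: 80 layers, 8 KV heads, head dim 128.
# bytes_per_elem=1 (FP8) matches the document's ~5 GB / ~42 GB figures.

def kv_cache_bytes(seq_len, batch=1, layers=80, kv_heads=8,
                   head_dim=128, bytes_per_elem=1):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

one_req = kv_cache_bytes(seq_len=32_768)             # single request, 32K ctx
batch8  = kv_cache_bytes(seq_len=32_768, batch=8)
print(f"single request: {one_req/1e9:.1f} GB, batch 8: {batch8/1e9:.1f} GB")
# -> single request: 5.4 GB, batch 8: 42.9 GB
```

Because the size scales linearly in both sequence length and batch, the same function shows why KV cache can exceed model weights at long context: batch 8 at 128K context is four times the 42.9 GB figure.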
[evidence:nvidia-dynamo-moe-inference.4] NVIDIA Dynamo's KV Cache Manager distributes KV cache across multiple GPU nodes, supporting hierarchical caching at GPU, node, and cluster levels. KV cache transfer between prefill and decode workers requires "low-latency, high-throughput communication leveraging GPUDirect RDMA." [nvidia-dynamo-moe-inference.4]
Long-Context Inference Scaling
[evidence:nvidia-rubin-cpx-nvl144.3] The Vera Rubin NVL144 CPX with 100 TB of fast memory is specifically designed for 1M+ token context workloads. For Llama 3 405B processing 1M tokens: 128 H100 GPUs across 16 nodes required, achieving 77 seconds with 93% parallelization efficiency. [nvidia-rubin-cpx-nvl144.3]
[evidence:premai-parallelism-guide-2026.10] 1M tokens of KV cache requires ~15 GB (general benchmark). Llama 70B at 128K context: ~42 GB KV cache per user. KV cache can easily exceed model weights at long context lengths. [premai-parallelism-guide-2026.10]
Optical Inter-Satellite Link Capabilities
[evidence:google-suncatcher.1] Google's Project Suncatcher demonstrated 800 Gbps each-way (1.6 Tbps total) using a single transceiver pair in bench testing. With DWDM and spatial multiplexing, "tens of terabits per second" between satellites "should be possible." The 81-satellite cluster operates at ~650 km altitude with inter-satellite distances of 100-200m. [google-suncatcher.1]
[evidence:google-suncatcher.2] At 100-200m separation, speed-of-light propagation latency is ~0.3-0.7 microseconds -- negligible compared to processing latency. Free space propagation is ~50% faster than fiber. [google-suncatcher.2]
[opinion:google-suncatcher.3] The Suncatcher paper does not specify what parallelism strategy would be used across the satellite cluster, nor whether the demonstrated bandwidth is sufficient for TP, PP, or EP. The paper focuses on system design feasibility, not distributed ML architecture. [google-suncatcher.3]
Model Architecture Trends and Future Domain Size
[evidence:nvidia-moe-frontier-models.2] Since early 2025, over 60% of open-source frontier model releases use MoE architecture. The top 10 most intelligent open-source models all use MoE. MoE has "rapidly become the architecture of choice" for frontier models. [nvidia-moe-frontier-models.2]
[evidence:epoch-consumer-gpu-gap.1] Frontier AI capabilities become runnable on a single consumer GPU (RTX 4090, ~24 GB VRAM) within 6-12 months on average. Small open models improve faster (+125 ELO/year) than frontier models (+80 ELO/year), driven by distillation and quantization. [epoch-consumer-gpu-gap.1]
[evidence:semianalysis-inferencex-v2.3] GPT-4o estimated at ~200B parameters (dense). Claude 3.5 Sonnet estimated at ~175B parameters (dense). GPT-4 estimated at ~1.76T parameters total (MoE, 16 experts, ~280B active per token). These dense frontier models fit on 2-4 GPUs with FP8 quantization. [semianalysis-inferencex-v2.3]
[evidence:premai-parallelism-guide-2026.11] Models under 35B with any quantization do not require multiple GPUs. 70B at INT4 runs on a single H100. Specialized fine-tuned 8B models sometimes beat general 70B models on domain tasks. [premai-parallelism-guide-2026.11]
Scale-Out and Embarrassingly Parallel Inference
[evidence:ai-dc-networking-gpu-clusters.2] Data parallelism (DP) across independent model replicas is "embarrassingly parallel" -- multiple copies of the model run on separate GPU clusters, each serving independent requests with no cross-replica communication. This is the primary scale-out mechanism for inference. [ai-dc-networking-gpu-clusters.2]
[evidence:premai-parallelism-guide-2026.12] For models that fit on a single node: "deploy N independent instances across N nodes. Each instance serves requests independently with no cross-node communication. This scales throughput linearly." [premai-parallelism-guide-2026.12]
Bandwidth Gap: NVLink vs Optical ISL
[evidence:nvidia-gb200-specs.2] NVLink 5 (Blackwell): 1,800 GB/s (14.4 Tbps) per GPU bidirectional. NVLink 6 (Rubin): 3,600 GB/s (28.8 Tbps) per GPU bidirectional. [nvidia-gb200-specs.2]
[evidence:google-suncatcher.4] Suncatcher bench demo: 800 Gbps (0.1 TB/s) per transceiver pair. Target with DWDM: "tens of Tbps" (estimated here at a 10-40 Tbps upper bound). [google-suncatcher.4]
[evidence:premai-parallelism-guide-2026.13] For multi-node inference: InfiniBand strongly preferred. 100 Gbps Ethernet minimum viable. 10 Gbps will bottleneck pipeline parallelism transfers. [premai-parallelism-guide-2026.13]
Analysis
The Domain Size Question: What Does "Tightly Coupled" Mean for Inference?
The minimum domain size for inference depends critically on three factors: (1) model size and architecture (dense vs MoE), (2) precision/quantization level, and (3) the parallelism strategy chosen (which in turn depends on latency vs throughput priorities).
Dense models (GPT-4o ~200B, Claude ~175B): In FP8, these fit on a single 8-GPU node (1.1-1.6 TB total VRAM). With INT4 quantization, they can fit on 2-4 GPUs. These models use tensor parallelism within the node (requiring NVLink) but do not need multi-node connectivity for a single inference request. For production throughput, multiple independent replicas scale embarrassingly in parallel.
Large dense models (Llama 405B): Require 8 GPUs minimum in FP8 (405 GB weights + KV cache overhead), or 16 GPUs in BF16. A single NVLink-connected node suffices for latency-optimized serving; multi-node with PP across nodes serves throughput needs.
Frontier MoE models (DeepSeek R1 671B, Kimi K2): This is where domain size becomes contentious. The model fits on 8x H200 in FP8 (~671 GB weights, single node has 1,128 GB). But production deployment uses wide expert parallelism across 32-64 GPUs within a single NVLink domain for 1.8x better throughput. All top-tier labs deploy with disaggregated serving + wide EP + FP4, which requires multi-node NVLink-class connectivity [semianalysis-inferencex-v2.1].
Why Expert Parallelism Demands NVLink-Class Bandwidth
The expert routing mechanism in MoE models creates an all-to-all communication pattern fundamentally different from pipeline parallelism. During each MoE layer, every token must be dispatched to specific expert GPUs and results gathered back -- this is a many-to-many pattern that scales with the number of active experts and tokens per batch.
NVIDIA's own analysis is unambiguous: "Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical" [nvidia-wide-ep-nvl72.2]. Before NVL72, EP was limited to 8 GPUs per NVLink domain, and expanding beyond that required InfiniBand which introduced latency bottlenecks [nvidia-moe-frontier-models.1].
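The scale of this dispatch traffic can be estimated. A rough sketch using illustrative DeepSeek-V3-style parameters (hidden size 7168, 8 routed experts per token, 58 MoE layers, FP8 activations) -- these are assumptions for illustration, not figures taken from the cited posts, and the estimate covers activations only, ignoring routing metadata and load imbalance:

```python
# Rough all-to-all volume for MoE expert dispatch during decode.
# Illustrative, assumed parameters (DeepSeek-V3-style): hidden=7168,
# top_k=8 routed experts per token, 58 MoE layers, FP8 (1 byte) elements.
# Each token is dispatched to its experts and gathered back (factor of 2).

def moe_alltoall_bytes_per_token(hidden=7168, top_k=8, moe_layers=58,
                                 bytes_per_elem=1):
    return 2 * top_k * hidden * bytes_per_elem * moe_layers

per_token = moe_alltoall_bytes_per_token()
tok_per_sec_per_gpu = 13_386          # decode rate from the SGLang benchmark
bw_needed = per_token * tok_per_sec_per_gpu
print(f"{per_token/1e6:.2f} MB/token, ~{bw_needed/1e9:.0f} GB/s per GPU "
      f"of all-to-all traffic at benchmark decode rates")
```

Even this lower bound -- roughly 89 GB/s per GPU under these assumptions -- is close to the entire 100 GB/s capacity of a demonstrated 800 Gbps optical link, for a single GPU's share of the traffic, which illustrates why wide EP is confined to NVLink-class fabrics.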
The Bandwidth Gap Between NVLink and Optical ISL
This is the central question for orbital feasibility:
| Interconnect | Bandwidth per link | Relative to NVLink 5 |
|---|---|---|
| NVLink 5 (Blackwell) | 1,800 GB/s (14.4 Tbps) | 1.0x |
| NVLink 6 (Rubin) | 3,600 GB/s (28.8 Tbps) | 2.0x |
| Suncatcher demo (single pair) | 100 GB/s (800 Gbps) | 0.056x |
| Suncatcher target (DWDM) | ~1,250-5,000 GB/s (10-40 Tbps) | 0.7-2.8x |
| InfiniBand 400G | 50 GB/s (400 Gbps) | 0.028x |
| 100 Gbps Ethernet | 12.5 GB/s | 0.007x |
At the current demonstrated level (800 Gbps): Optical ISLs provide bandwidth comparable to a single InfiniBand link but ~18x less than NVLink 5. This is sufficient for pipeline parallelism (which uses point-to-point transfers and tolerates lower bandwidth) but insufficient for tensor parallelism or wide expert parallelism.
At the DWDM target (tens of Tbps): If Google's projection of "tens of Tbps" materializes, optical ISLs could approach or match NVLink bandwidth. At 10-40 Tbps per satellite pair, this would be in the range needed for TP and EP. However, this remains undemonstrated in space, and the aggregate bandwidth across an 81-satellite cluster would still be orders of magnitude below the 130 TB/s aggregate within a single NVL72 rack.
Latency is not the bottleneck: At 100-200m separation, light-speed propagation adds only ~0.3-0.7 microseconds -- negligible compared to GPU processing time (~microseconds per layer) and comparable to NVLink's copper cable propagation. The issue is purely bandwidth, not latency.
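The propagation numbers follow directly from the speed of light; a quick check (the fiber refractive index of ~1.47 is an assumed typical value for silica fiber, not a figure from the Suncatcher paper):

```python
# Speed-of-light propagation delay: free space vs optical fiber.
C = 299_792_458.0          # m/s, vacuum
N_FIBER = 1.47             # assumed typical refractive index of silica fiber

def prop_delay_us(distance_m, refractive_index=1.0):
    """One-way propagation delay in microseconds."""
    return distance_m * refractive_index / C * 1e6

for d in (100, 200):
    print(f"{d} m: free space {prop_delay_us(d):.2f} us, "
          f"fiber {prop_delay_us(d, N_FIBER):.2f} us")
```

This reproduces the ~0.3-0.7 microsecond range for 100-200 m separations, and the 1.47x fiber slowdown matches the paper's claim that free-space propagation is roughly 50% faster than fiber.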
What Parallelism Strategies Work Over Optical ISLs?
Feasible today (800 Gbps demonstrated):
- Data parallelism (scale-out): Multiple independent model replicas on separate satellites, each serving independent requests. No inter-satellite communication needed. This is embarrassingly parallel and works at any bandwidth. This is the most natural fit for orbital compute.
- Pipeline parallelism across satellites: PP uses point-to-point transfers between stages. At 800 Gbps (100 GB/s), transferring activation tensors between pipeline stages is feasible for moderate batch sizes. A Llama 70B activation tensor might be 2-10 GB depending on batch size and precision, transferable in 20-100ms. This adds latency but works for throughput-oriented workloads.
Feasible with DWDM (10+ Tbps projected):
- Tensor parallelism across small groups: If DWDM achieves 10+ Tbps between satellite pairs, TP=2 or TP=4 across satellites becomes viable (comparable to NVLink 3.0 at 600 GB/s per GPU). This would enable serving 70B-405B dense models across 2-4 satellites.
- Expert parallelism across small groups: With enough aggregate bandwidth, EP=8 across satellites might become feasible, though the all-to-all pattern remains challenging.
Likely infeasible regardless of bandwidth:
- Wide EP (EP=32-64) across satellites: The all-to-all communication pattern among 32-64 endpoints implies a number of pairwise flows that grows with the square of the endpoint count. Even at 40 Tbps per link, routing 64-way all-to-all across a satellite mesh would force many flows onto shared multi-hop paths -- a topological bandwidth constraint absent in the fully-connected NVLink switch fabric.
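Two quick calculations make the tiering concrete: the point-to-point transfer time for a pipeline-stage handoff at the demonstrated link rate, and the number of directed flows a wide all-to-all implies. The tensor sizes and link rate come from the figures above; the flow count is a simple combinatorial bound, not a full topology model:

```python
# (1) Pipeline-parallel handoff: one point-to-point transfer per stage
# boundary. Tensor sizes (2-10 GB) and link rate (800 Gbps = 100 GB/s)
# taken from the analysis above.
def transfer_ms(tensor_gb, link_gbytes_per_s):
    return tensor_gb / link_gbytes_per_s * 1e3

for size_gb in (2, 10):
    print(f"{size_gb} GB over 800 Gbps ISL: {transfer_ms(size_gb, 100):.0f} ms")

# (2) Wide EP all-to-all: every endpoint exchanges with every other,
# giving n*(n-1) directed flows -- a lower bound ignoring routing.
def pairwise_flows(n):
    return n * (n - 1)

print(f"EP=8: {pairwise_flows(8)} flows; EP=64: {pairwise_flows(64)} flows")
```

Pipeline parallelism pays one such 20-100 ms transfer per stage boundary per microbatch, which batching can hide; wide EP must sustain thousands of concurrent flows, which a switched NVLink fabric provides and a point-to-point laser mesh between satellites does not.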
Implications for Orbital Compute Architecture
The analysis suggests a tiered orbital compute architecture:
Tier 1 -- Single-satellite inference (highest feasibility): Models up to ~70B (quantized to INT4/FP8) run entirely within a single satellite on 1-8 GPUs. No inter-satellite networking required. This covers most practical inference workloads: fine-tuned 8B models, Llama 70B, GPT-4o-class dense models (with quantization). The rapid improvement of small models (closing frontier gap in 6-12 months [epoch-consumer-gpu-gap.1]) means this tier will handle increasingly capable models over time.
Tier 2 -- Small-cluster inference via pipeline parallelism (moderate feasibility): Models up to ~405B distributed across 2-4 satellites via PP, using 800 Gbps+ optical links. Adds latency (tens of ms per generation step) but can achieve good throughput with batching. Suitable for latency-tolerant, high-throughput workloads (batch inference, background processing).
Tier 3 -- Frontier MoE inference with wide EP (low feasibility): Models like DeepSeek R1 requiring 64-GPU NVLink domains. The bandwidth gap between optical ISLs and NVLink makes this impractical unless DWDM technology achieves multi-Tbps per link AND an optical switching fabric provides NVLink-equivalent all-to-all connectivity. This remains speculative.
The Direction of Model Architecture Evolution
Two countervailing trends shape future domain size requirements:
Trend toward larger domains: MoE architectures dominate frontier models, and MoE inference benefits enormously from wide EP requiring large NVLink domains. NVIDIA's roadmap (NVL72 -> NVL144) explicitly grows domain size. Future models with more experts will require even wider EP for optimal throughput.
Trend toward smaller effective models: Distillation, quantization (FP4, INT4), and architectural innovations (MLA, GQA) compress frontier capabilities into smaller models. A model that required 8 GPUs today may need 2 GPUs in 18 months. The consumer GPU gap analysis shows frontier capabilities becoming available on single consumer GPUs in 6-12 months [epoch-consumer-gpu-gap.1].
Net effect: For any given capability level, the required domain size is shrinking over time. But the frontier itself is constantly advancing -- the newest, most capable models consistently require the largest domains. Orbital compute may always be 1-2 generations behind the terrestrial frontier in terms of what models it can serve, but the models it can serve will be increasingly capable.
Value Derivation
Optimistic (8 GPUs): Assumes FP4/INT4 quantized MoE models fit on a single 8-GPU node within one satellite. All inter-satellite networking is embarrassingly parallel scale-out. This scenario applies to distilled/quantized versions of frontier models or the previous generation's best models. It is achievable with current technology.
Central (16 GPUs): Assumes frontier models require 2 nodes (16 GPUs) with pipeline parallelism or limited EP across satellites connected by 800 Gbps+ optical links. This covers Llama 405B-class models and FP8 MoE models where PP can bridge the inter-satellite gap. Feasible with demonstrated Suncatcher-class optical links.
Conservative (72 GPUs): Assumes wide EP across a full NVL72 rack is necessary for production-competitive throughput on frontier MoE models. This is what top-tier labs currently deploy [semianalysis-inferencex-v2.1] and represents the domain size needed to match terrestrial inference economics. This is infeasible across satellites with current or near-term inter-satellite link technology.
New Source Details
premai-parallelism-guide-2026
- URL: https://blog.premai.io/multi-gpu-llm-inference-tp-vs-pp-vs-ep-parallelism-guide-2026/
- Title: Multi-GPU LLM Inference: TP vs PP vs EP Parallelism Guide (2026)
- Description: Comprehensive practical guide to multi-GPU inference parallelism strategies with specific GPU counts, bandwidth thresholds, and efficiency data
nvidia-wide-ep-nvl72
- URL: https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/
- Title: Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems
- Description: NVIDIA technical blog detailing how wide EP works on NVL72, with performance data showing EP32 achieves 1.8x throughput vs EP8
nvidia-dynamo-moe-inference
- URL: https://developer.nvidia.com/blog/how-nvidia-gb200-nvl72-and-nvidia-dynamo-boost-inference-performance-for-moe-models/
- Title: How NVIDIA GB200 NVL72 and NVIDIA Dynamo Boost Inference Performance for MoE Models
- Description: Analysis of disaggregated serving for MoE models showing 6x throughput gains with wide EP on NVL72, with simulation across hundreds of thousands of configurations
nvidia-nvlink-supercharge-inference
- URL: https://developer.nvidia.com/blog/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference/
- Title: NVIDIA NVLink and NVSwitch Supercharge Large Language Model Inference
- Description: Benchmark data showing NVSwitch delivers 1.5x inference throughput for Llama 70B, with quantification of per-query data transfer requirements
nvidia-nvlink-fusion-inference
- URL: https://developer.nvidia.com/blog/scaling-ai-inference-performance-and-flexibility-with-nvidia-nvlink-and-nvlink-fusion/
- Title: Scaling AI Inference Performance and Flexibility with NVIDIA NVLink and NVLink Fusion
- Description: NVIDIA analysis showing 72-GPU NVLink domain maximizes revenue and performance for inference workloads
semianalysis-inferencex-v2
- URL: https://newsletter.semianalysis.com/p/inferencex-v2-nvidia-blackwell-vs
- Title: InferenceX v2: NVIDIA Blackwell Vs AMD vs Hopper
- Description: SemiAnalysis comprehensive inference benchmark showing all top-tier labs use disaggregated serving with wide EP; detailed DeepSeek R1 deployment configurations
nebius-gb200-interconnect
- URL: https://nebius.com/blog/posts/leveraging-nvidia-gb200-nvl72-gpu-interconnect
- Title: Leveraging high-speed, rack-scale GPU interconnect with NVIDIA GB200 NVL72
- Description: Technical deep-dive on how TP groups require fastest interconnect and are always contained within a single NVL72 rack
nvidia-moe-frontier-models
- URL: https://blogs.nvidia.com/blog/mixture-of-experts-frontier-models/
- Title: Mixture of Experts Powers the Most Intelligent Frontier AI Models
- Description: NVIDIA analysis showing 10x performance leap for MoE models on NVL72 vs H200, with data on MoE adoption in 60%+ of frontier models
nvidia-rubin-cpx-nvl144
- URL: https://nvidianews.nvidia.com/news/nvidia-unveils-rubin-cpx-a-new-class-of-gpu-designed-for-massive-context-inference
- Title: NVIDIA Unveils Rubin CPX: A New Class of GPU Designed for Massive-Context Inference
- Description: NVL144 with 100TB memory, 1.7 PB/s bandwidth, purpose-built for million-token context inference; domain size doubles from 72 to 144 GPUs
lmsys-gb200-deepseek-part1
- URL: https://lmsys.org/blog/2025-06-16-gb200-part-1/
- Title: Deploying DeepSeek on GB200 NVL72 (Part I)
- Description: LMSYS benchmark showing 2.7x decode throughput improvement, using 12 decode + 2 prefill nodes within NVL72
lmsys-gb200-deepseek-part2
- URL: https://lmsys.org/blog/2025-09-25-gb200-part-2/
- Title: Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part II)
- Description: 3.8x prefill and 4.8x decode speedup with NVFP4 MoE on 48 decode ranks
epoch-consumer-gpu-gap
- URL: https://epoch.ai/data-insights/consumer-gpu-model-gap
- Title: Frontier AI capabilities can be run at home within a year or less
- Description: Epoch AI analysis showing 6-12 month lag before frontier capabilities run on single consumer GPU; small models improving faster than frontier
ai-dc-networking-gpu-clusters
- URL: https://www.thenetworkdna.com/2026/03/ai-data-center-networking-how-gpu.html
- Title: AI Data Center Networking: How GPU Clusters Are Changing Network Design
- Description: Technical analysis of TP, PP, DP communication patterns and bandwidth requirements in AI clusters