Inference Networking Requirements
What are the scale-up and scale-out networking requirements for AI inference, how is domain size evolving, and what does this imply for orbital feasibility?
Answer
The minimum scale-up domain size — the number of GPUs that must be tightly coupled via NVLink-class interconnect to serve a single inference request — depends on model size, architecture (dense vs MoE), quantization level, batch size, context length, and throughput targets. For frontier models in 2026, it ranges from 8 GPUs (optimistic for orbital) to 72 GPUs (conservative), with NVIDIA's roadmap extending to 144 GPUs (Vera Rubin NVL144).
These factors stack systematically. Model weights set a memory floor (a 671B MoE model in FP8 fills an 8-GPU node). KV cache adds memory proportional to batch size and context length — for Llama 70B at 32K context, this ranges from ~5 GB (single request) to ~330 GB (batch of 64), pushing total memory well above the weight-only floor. Production throughput optimization drives domain size further still: wider expert parallelism (EP=32 vs EP=8) delivers 1.8× per-GPU throughput on MoE models but requires all-to-all communication across a 72-GPU NVLink domain nvidia-wide-ep-nvl72.1. The gap between "minimum GPUs to hold the model" and "production-competitive domain size" can be 4-8× in GPU count.
The parallelism strategy then determines the bandwidth requirement. Data parallelism (independent replicas) needs no cross-GPU communication. Pipeline parallelism uses point-to-point transfers feasible at 100 Gbps+. But tensor and expert parallelism require NVLink-class bandwidth within the domain: NVLink 5 provides 1.8 TB/s per GPU. The fastest demonstrated inter-satellite link (ISL) — from Google's Suncatcher project, a free-space optical transceiver designed for close-formation satellite clusters — provides 0.2 TB/s (800 Gbps each-way, 1.6 Tbps total bidirectional) per optical ISL pair between two satellites. This ~9× ratio compares what each GPU can access within an NVLink domain (1.8 TB/s bidirectional) to what each satellite-to-satellite link can carry (0.2 TB/s bidirectional); the effective per-GPU bandwidth for cross-satellite communication is lower still when multiple GPUs share a single ISL. At the aggregate level, a 72-GPU NVLink domain provides 130 TB/s of non-blocking all-to-all bandwidth through an internal switch fabric — a topology no ISL network can replicate.
The orbital feasibility of a given workload therefore depends on whether its required NVLink domain fits within a single satellite. Our satellite GPU capacity analysis finds that monolithic satellites housing a full NVL72 rack (~72 GPUs, ~130 kW, with internal NVLink) are physically feasible and represent the baseline design point for multiple industry proposals (SpaceX AI Sat Mini, Starcloud-3). A single such satellite can serve many current frontier inference workloads — including MoE models at EP=64 — with no inter-satellite networking needed beyond embarrassingly parallel scale-out, provided the workload fits within a single NVL72-style domain under current batching and context-length assumptions. However, domain size depends on architecture, quantization, batch size, context length, and throughput targets; long-context workloads push domain sizes upward; and NVIDIA's roadmap is already at NVL144 (shipping H2 2026), with NVL576 announced. The ISL bandwidth constraint applies when the required domain exceeds a single satellite's capacity, or in distributed architectures of smaller satellites where cross-satellite parallelism is needed for models exceeding one satellite's NVLink domain.
Analysis
The Production Inference Workload Landscape
Before mapping workloads to hardware requirements, it is useful to consider what kinds of deployments drive the bulk of hosted AI inference compute. Comprehensive public data on workload composition is scarce, but several patterns are clear:
Model sizes in production span a wide range. Models under 35B with quantization run on a single GPU premai-parallelism-guide-2026.11; specialized fine-tuned 8B models sometimes match general 70B models on domain tasks premai-parallelism-guide-2026.11. Frontier capabilities become runnable on a single consumer GPU within 6-12 months epoch-consumer-gpu-gap.1, so today's small models rival yesterday's frontier. At the other end, over 60% of open-source frontier model releases use MoE architecture nvidia-moe-frontier-models.2, and all top-tier labs deploy frontier MoE models (671B+) with disaggregated serving and wide expert parallelism semianalysis-inferencex-v2.1.
Production inference is throughput-optimized. API providers serve batched requests — many concurrent users sharing a GPU group — not single requests on dedicated hardware. This matters because batch size directly scales KV cache memory (see below), pushing hardware requirements well above what a single-request analysis would suggest. The distinction between "this model fits on 8 GPUs" (at batch=1) and "this model serves production traffic efficiently on 8 GPUs" (at batch=64) is substantial.
Context lengths are growing. While most API requests currently use moderate context, the trend is toward longer context windows: Vera Rubin NVL144 is explicitly designed for million-token inference nvidia-rubin-cpx-nvl144.1, and KV cache at long context can exceed model weight memory premai-parallelism-guide-2026.10. Long-context workloads amplify memory requirements and thus domain size.
From Workload to Memory Requirements
GPU memory required for inference is the sum of three components, each driven by different workload parameters.
1. Model weight memory depends on parameter count and quantization precision:
| Model | Parameters | FP16 | FP8 | INT4 |
|---|---|---|---|---|
| Small (e.g., Llama 8B) | 8B | 16 GB | 8 GB | 4 GB |
| Medium (e.g., Llama 70B) | 70B | 140 GB | 70 GB | 35 GB |
| Dense frontier | ~200B | ~400 GB | ~200 GB | ~100 GB |
| Large dense (Llama 3.1 405B) | 405B | 810 GB | 405 GB | ~200 GB |
| Frontier MoE (DeepSeek R1) | 671B | 1,342 GB | 671 GB | ~336 GB |
For MoE models, only a subset of parameters activate per token (DeepSeek R1 activates ~37B of 671B premai-parallelism-guide-2026.2), but all expert weights must reside in GPU memory for fast routing.
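The weight-memory rows above follow directly from parameter count times bytes per parameter. A minimal sketch (approximate: real checkpoints add embedding and output-head parameters and mixed-precision layers that shift totals slightly):

```python
# Weight memory = parameter count x bytes per parameter, reproducing the
# table above. Figures are approximate; real checkpoints vary by a few GB.

BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Model weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]

for name, params in [("Llama 70B", 70), ("Llama 3.1 405B", 405), ("DeepSeek R1", 671)]:
    print(name, {q: weight_memory_gb(params, q) for q in BYTES_PER_PARAM})
```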
2. KV cache memory scales linearly with context length, batch size, and precision. Using Llama 70B (80 layers, 8 KV heads, 128 head dim) as a worked example premai-parallelism-guide-2026.8:
| Context length | Batch 1 | Batch 8 | Batch 64 |
|---|---|---|---|
| 4K | ~0.6 GB | ~5 GB | ~41 GB |
| 32K | ~5 GB | ~42 GB | ~330 GB |
| 128K | ~21 GB | ~168 GB | ~1,310 GB |
The rule of thumb: reserve 40-50% of VRAM beyond model weights for KV cache and runtime overhead premai-parallelism-guide-2026.9. At high batch size or long context, KV cache can exceed model weights premai-parallelism-guide-2026.10.
Note: these values use the Llama 70B multi-head KV cache formula. Models using Multi-head Latent Attention (MLA, as in DeepSeek R1) or aggressive Grouped Query Attention compress KV cache significantly. The table illustrates the mechanism — memory scales with batch × context — rather than providing exact figures for every architecture.
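The mechanism can be made explicit with the standard multi-head KV cache formula; the table's figures are consistent with a 1-byte (FP8) KV cache, and small differences from its rounded values are expected:

```python
# KV cache for the Llama 70B worked example above:
#   bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes/element.
# elem_bytes=1 assumes an FP8 KV cache; use 2 for FP16.

def kv_cache_gb(context_len: int, batch: int, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128, elem_bytes: int = 1) -> float:
    per_token = 2 * layers * kv_heads * head_dim * elem_bytes
    return context_len * batch * per_token / 1e9

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>6} ctx:", [round(kv_cache_gb(ctx, b), 1) for b in (1, 8, 64)])
```

The linear scaling in batch x context is the point: a 16x increase in context or batch is a 16x increase in KV memory.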
3. Activation and overhead memory — temporary tensors, framework buffers, CUDA contexts — adds ~10-20% beyond weights and KV cache.
From Memory to Minimum Domain Size
The minimum number of GPUs for a deployment is the total VRAM requirement divided by per-GPU memory capacity (H100: 80 GB, H200: 141 GB, B200: 192 GB). Representative minimum configurations:
| Deployment scenario | Weight mem | KV cache | ~Total | Min H100 (80 GB) | Min H200 (141 GB) |
|---|---|---|---|---|---|
| 8B @ INT4, 8K ctx, batch 16 | 4 GB | ~2 GB | ~8 GB | 1 | 1 |
| 70B @ FP8, 32K ctx, batch 1 | 70 GB | ~5 GB | ~82 GB | 2 | 1 |
| 70B @ FP8, 32K ctx, batch 8 | 70 GB | ~42 GB | ~120 GB | 2 | 1 |
| 70B @ FP8, 32K ctx, batch 64 | 70 GB | ~330 GB | ~440 GB | 6 | 4 |
| 200B @ FP8, 32K ctx, batch 8 | 200 GB | ~42 GB | ~270 GB | 4 | 2 |
| 405B @ FP8, 32K ctx, batch 8 | 405 GB | ~42 GB | ~490 GB | 7 | 4 |
| 405B @ BF16, 32K ctx, batch 8 | 810 GB | ~42 GB | ~930 GB | 12 | 7 |
| 671B MoE @ FP8, 32K ctx, batch 1 | 671 GB | ~5 GB | ~740 GB | 10 | 6 |
| 671B MoE @ FP8, 32K ctx, batch 64 | 671 GB | ~330 GB | ~1,100 GB | 14 | 8 |
These are memory-limited minimums — the smallest configuration that holds the deployment in GPU memory. As the next section discusses, practical domain sizes for production deployment are substantially larger.
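The table's minimum counts can be reproduced with a ceiling division; the ~10% activation/runtime overhead factor is our assumption, chosen to be consistent with the table's totals for the larger rows:

```python
import math

# Memory-limited minimum GPU count: total VRAM requirement divided by
# per-GPU capacity, rounded up. The 10% overhead factor for activations
# and runtime buffers is an assumption, not a universal constant.

def min_gpus(weight_gb: float, kv_gb: float, gpu_gb: float,
             overhead: float = 1.10) -> int:
    total_gb = (weight_gb + kv_gb) * overhead
    return math.ceil(total_gb / gpu_gb)

# 671B MoE @ FP8, 32K ctx, batch 64 (weights 671 GB, KV ~330 GB):
print(min_gpus(671, 330, 80), "H100s;", min_gpus(671, 330, 141), "H200s")
```

With these inputs the function returns 14 H100s and 8 H200s, matching the last row of the table.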
Throughput Optimization Drives Domain Size Beyond the Memory Minimum
Production inference economics optimize for cost-per-token ($/tok), not minimum GPU count. Several mechanisms push domain size above the memory floor:
Decode bandwidth scaling. Autoregressive decode is memory-bandwidth-bound: each generated token requires reading the full model weights from HBM. Distributing the model across more GPUs via tensor parallelism increases aggregate HBM bandwidth, directly improving tokens/second. NVSwitch-equipped H100 systems achieve 1.5× decode throughput versus non-NVSwitch configurations for Llama 70B at batch 32 nvidia-nvlink-supercharge-inference.1. TP=8 achieves 56-75% of ideal scaling even on NVLink premai-parallelism-guide-2026.4.
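The bandwidth-bound decode regime admits a simple roofline sketch. The 3.35 TB/s H100 (SXM) HBM bandwidth figure is our assumption, not from the cited sources; real throughput falls short of this bound due to KV-cache reads, communication, and scheduling overhead:

```python
# Roofline upper bound on decode throughput: each decode step streams the
# full weights from HBM, so steps/s <= aggregate HBM bandwidth / weight bytes.
# All requests in a batch share one weight read, so token throughput scales
# with batch size until KV-cache reads dominate.

def decode_tokens_per_sec(weight_gb: float, n_gpus: int, batch: int,
                          hbm_tb_s: float = 3.35) -> float:
    steps_per_sec = n_gpus * hbm_tb_s * 1e12 / (weight_gb * 1e9)
    return steps_per_sec * batch

# 70B model in FP8 (70 GB of weights) on 8 H100s at batch 32:
print(f"{decode_tokens_per_sec(70, 8, 32):,.0f} tok/s upper bound")
```

This is why spreading the same weights over more GPUs raises decode throughput: aggregate HBM bandwidth grows while weight bytes stay fixed.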
Wide expert parallelism for MoE. Distributing fewer experts per GPU frees HBM for KV cache, enables larger batches, and improves arithmetic intensity — three compounding benefits semianalysis-inferencex-v2.2. EP=32 on NVL72 achieves 1.8× per-GPU throughput versus EP=8 nvidia-wide-ep-nvl72.1. The NVL72 enabled a 10× performance leap for MoE models over prior HGX H200 systems (max 8 GPUs per NVLink domain) nvidia-moe-frontier-models.1.
Disaggregated prefill and decode. Separating compute-bound prefill from memory-bound decode onto different GPU groups allows each to be configured optimally. NVIDIA Dynamo achieves 6× throughput gain versus co-located serving for DeepSeek R1 nvidia-dynamo-moe-inference.2. On NVL72, SGLang achieves 26,156 input tok/s/GPU (prefill) and 13,386 output tok/s/GPU (decode) with 48 decode ranks and 2-4 prefill ranks lmsys-gb200-deepseek-part2.1.
The net effect: while a 671B MoE model fits on 8 H200 GPUs (FP8, single request), production deployment uses 64 GPUs across a full NVL72 for wide EP and disaggregated serving. All top-tier labs operate this way semianalysis-inferencex-v2.1. The gap between "minimum GPUs to hold the model" and "production-competitive domain size" can be 4-8× in GPU count.
Communication Patterns and Bandwidth Requirements by Parallelism Strategy
The parallelism strategy determines the communication pattern — and thus the bandwidth requirement — between GPUs in the domain:
Data parallelism (DP): Independent model replicas serve independent requests with no cross-replica communication [ai-dc-networking-gpu-clusters.2, premai-parallelism-guide-2026.12]. Embarrassingly parallel — scales linearly across any interconnect. This is the primary mechanism for increasing total inference throughput.
Pipeline parallelism (PP): Model layers distributed across GPUs in a pipeline. Communication is point-to-point between adjacent stages ai-dc-networking-gpu-clusters.1 — structured, predictable transfers of activation tensors. Minimum 100 Gbps Ethernet; InfiniBand preferred premai-parallelism-guide-2026.6. The standard multi-node pattern is TP within NVLink nodes, PP across nodes. Tradeoff: with PP=4, each GPU sits idle 75% of the time for a single request; continuous batching mitigates this but single-request latency is always worse than TP premai-parallelism-guide-2026.7.
Tensor parallelism (TP): A single layer's computation split across GPUs, requiring two all-reduce synchronizations per transformer layer — 160 syncs per forward pass for Llama 70B premai-parallelism-guide-2026.3. NVLink is "effectively mandatory" beyond TP=2: on PCIe 5.0 (128 GB/s), communication consumes 40-50% of inference time at TP=4 premai-parallelism-guide-2026.3. Bandwidth viability thresholds: NVLink 4.0 (900 GB/s) = "excellent, TP=8 works well"; PCIe 5.0 (128 GB/s) = "marginal, TP=2 max" premai-parallelism-guide-2026.5.
Expert parallelism (EP): Each MoE layer dispatches every token to its selected expert GPUs and gathers results — a many-to-many all-to-all pattern. "Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of this communication pattern would make large-scale EP impractical" nvidia-wide-ep-nvl72.2. Before NVL72, EP beyond 8 GPUs required InfiniBand, which bottlenecked performance nvidia-moe-frontier-models.1.
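The TP cost structure above can be modeled roughly. Llama 70B's hidden size of 8192 is a known architecture value; the ring all-reduce traffic formula is standard; the per-sync latency parameter is our assumption and is what dominates on slow interconnects:

```python
# Rough TP synchronization cost per decode step for Llama 70B (hidden 8192,
# 80 layers, two all-reduces per layer = 160 sync points). A ring all-reduce
# moves 2*(tp-1)/tp * message bytes per GPU; latency_us per sync is an
# illustrative assumption, not a measured figure.

def tp_sync_time_ms(batch: int, tp: int, bw_gb_s: float, latency_us: float,
                    hidden: int = 8192, layers: int = 80, elem_bytes: int = 2) -> float:
    msg = batch * hidden * elem_bytes            # activation bytes per all-reduce
    per_gpu = 2 * (tp - 1) / tp * msg            # ring all-reduce bytes moved per GPU
    syncs = 2 * layers                           # 160 sync points for Llama 70B
    per_sync_ms = latency_us * 1e-3 + per_gpu / (bw_gb_s * 1e9) * 1e3
    return syncs * per_sync_ms

# NVLink 4 (900 GB/s, ~5 us/sync) vs PCIe 5.0 (128 GB/s, ~30 us/sync), TP=4, batch 32:
print(tp_sync_time_ms(32, 4, 900, 5), "ms vs", tp_sync_time_ms(32, 4, 128, 30), "ms")
```

At a decode rate of ~100 tok/s (roughly 10 ms per step), the PCIe figure is a multi-millisecond overhead per step, consistent with the 40-50% communication share cited above.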
Summary of bandwidth requirements by parallelism strategy:
| Parallelism | Communication pattern | Minimum viable bandwidth | Optimal bandwidth |
|---|---|---|---|
| DP | None | Any | Any |
| PP | Point-to-point between adjacent stages | 100 Gbps | 400 Gbps (InfiniBand) |
| TP | All-reduce every layer | ~600 GB/s (NVLink 3.0) | 1,800+ GB/s (NVLink 5+) |
| EP | All-to-all every MoE layer | ~50 GB/s inter-node (RDMA) | 130 TB/s aggregate (NVL72) |
The key insight: DP and PP tolerate low-bandwidth interconnects (including optical ISLs), while TP and EP require NVLink-class bandwidth within the domain. The domain size question reduces to: how many GPUs must be in the NVLink-connected domain?
The NVLink-to-ISL Bandwidth Gap
The table below compares bandwidth across interconnect technologies. For NVLink, the figure is the per-GPU bidirectional bandwidth to other GPUs in the domain; for ISLs and network fabrics, it is the per-link bandwidth between two nodes:
| Interconnect | Bandwidth (per GPU or per link) | Relative to NVLink 5 | Relative to NVLink 6 |
|---|---|---|---|
| NVLink 5 (Blackwell NVL72) | 1,800 GB/s (14.4 Tbps) | 1.0× | 0.5× |
| NVLink 6 (Vera Rubin NVL144) | 3,600 GB/s (28.8 Tbps) | 2.0× | 1.0× |
| Suncatcher demo (single pair, bidirectional) | 200 GB/s (1.6 Tbps) | 0.11× | 0.056× |
| Suncatcher DWDM target | ~1,200-1,600 GB/s (9.6-12.8 Tbps) | 0.67-0.89× | 0.33-0.44× |
| Our upper-bound estimate (DWDM + spatial mux) | ~1,250-5,000 GB/s (10-40 Tbps) | 0.7-2.8× | 0.35-1.4× |
| InfiniBand 400G | 50 GB/s (400 Gbps) | 0.028× | 0.014× |
| 100 Gbps Ethernet | 12.5 GB/s | 0.007× | 0.003× |
At the demonstrated level (1.6 Tbps bidirectional): Optical ISLs provide ~4× InfiniBand 400G bandwidth — sufficient for PP but insufficient for TP or wide EP. The per-endpoint gap versus NVLink 5 is ~9× (bidirectional-to-bidirectional) and versus NVLink 6 is ~18×.
At Google's DWDM target (9.6-12.8 Tbps per aperture): ISLs would approach NVLink 5 per-link bandwidth google-suncatcher.4. Google states the required bandwidth is "on the order of 10 Tbps," achievable via DWDM. Higher scaling via spatial multiplexing requires very short inter-satellite separations. Our 10-40 Tbps upper bound is our own extrapolation, not Google's projection.
The aggregate gap is more severe. A 72-GPU NVLink domain provides 130 TB/s (Blackwell) or 260 TB/s (Vera Rubin nvidia-nvlink6-specs.1) all-to-all bandwidth through a non-blocking switch fabric connecting all endpoints simultaneously. An ISL network provides point-to-point links between satellite pairs — a fundamentally different topology. Replicating all-to-all connectivity across 72 satellites would require each satellite to maintain simultaneous high-bandwidth links to all 71 others, which is physically infeasible.
The terrestrial target is moving. Vera Rubin NVL144 (H2 2026) doubles the per-GPU NVLink baseline to 3.6 TB/s nvidia-nvlink6-specs.1, widening the gap.
Propagation latency is negligible at close formation: 100-200m separation adds ~0.3-0.7 microseconds google-suncatcher.2, comparable to NVLink copper propagation. However, total end-to-end ISL latency also includes transceiver encoding/decoding, FEC, serialization, and protocol overhead — not characterized in the Suncatcher paper and potentially adding microseconds to tens of microseconds per hop. The primary constraint is bandwidth, not latency, but end-to-end latency may also matter for latency-sensitive TP and EP patterns.
No DWDM ISL has been tested in orbit. The progression from 800 Gbps demonstrated to 10+ Tbps in space requires wavelength multiplexing in a vacuum-to-vacuum optical path with pointing jitter, thermal distortion, and vibration that bench tests do not capture.
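One further point from the Answer section is worth quantifying: when several GPUs on a satellite share a single ISL, the effective per-GPU cross-satellite bandwidth falls well below the per-link figure. A sketch, where 8 GPUs per satellite is an illustrative assumption:

```python
# Effective cross-satellite bandwidth per GPU when all GPUs on a satellite
# share one optical ISL, compared with intra-domain NVLink 5 (1.8 TB/s per
# GPU). ISL figures (0.2 TB/s demonstrated, 1.6 TB/s DWDM upper target)
# are taken from the table above.

def per_gpu_isl_tb_s(isl_tb_s: float, gpus_per_satellite: int) -> float:
    return isl_tb_s / gpus_per_satellite

NVLINK5 = 1.8  # TB/s per GPU, bidirectional
for isl in (0.2, 1.6):
    eff = per_gpu_isl_tb_s(isl, 8)
    print(f"ISL {isl} TB/s -> {eff:.3f} TB/s per GPU ({NVLINK5 / eff:.0f}x below NVLink 5)")
```

Even at the DWDM target, the shared-link penalty keeps per-GPU cross-satellite bandwidth roughly an order of magnitude below NVLink 5.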
Communication-Efficient MoE Serving Techniques
Recent research demonstrates that the NVLink bandwidth requirement for MoE inference is not as rigid as the above suggests. Several techniques reduce the effective inter-node bandwidth needed for expert parallelism, though none eliminate the NVLink advantage entirely:
Disaggregated inference separates attention computation from expert/MoE computation onto different GPU groups. ByteDance's MegaScale-Infer (SIGCOMM 2025) replaces all-to-all collectives with M2N point-to-point communication, achieving 1.9× per-GPU throughput and 1.5-2× cost reduction in production megascale-infer-sigcomm.1. Critically, MegaScale-Infer supports heterogeneous clusters where some nodes use PCIe rather than NVLink megascale-infer-sigcomm.2, demonstrating that disaggregated MoE does not require NVLink on all nodes.
Hierarchical all-to-all performs two-stage communication: fast intra-node aggregation via NVLink, then aggregated inter-node transfers over slower fabric. DeepSeek's DeepEP library achieves 153 GB/s intra-node (NVLink) and 43-58 GB/s inter-node (RDMA) deepep-communication-lib.1. This maps naturally to satellite clusters where each satellite has an internal NVLink domain.
Locality-aware expert placement uses optimization to co-locate frequently co-activated experts, reducing inter-node traffic by 20-36% (MoETuner achieves 17.5% end-to-end speedup on 16 H200 GPUs across 2 InfiniBand nodes moetuner-expert-placement.1).
Production demonstrations over InfiniBand confirm that wide EP works without NVLink between nodes: LMSYS deployed DeepSeek-V3 with EP72 across 12 InfiniBand-connected H100 nodes lmsys-large-scale-ep.1; vLLM achieved 2,200 output tok/s per H200 in multi-node InfiniBand deployments vllm-large-scale-ep.1. These achieve practical throughput but remain significantly below NVL72 performance (NVIDIA cites 10× for MoE models nvidia-moe-frontier-models.1).
Net effect on orbital feasibility: For monolithic satellites with an internal NVL72 domain, these techniques are directly applicable — the satellite IS the NVLink domain, and intra-satellite performance matches terrestrial. For distributed architectures with smaller satellites, these techniques improve the picture: disaggregated serving + hierarchical all-to-all could make MoE inference across 4-8 satellites (each with 8 GPUs) functional at reduced throughput, perhaps 3-10× below terrestrial NVL72. The binary "feasible/infeasible" framing overstates the constraint for distributed architectures — a degraded-but-functional mode exists. However, for cost-competitive operation, the all-to-all within the expert group still benefits strongly from NVLink-class bandwidth.
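The inter-node traffic these techniques must carry can be estimated from first principles. The parameters below (hidden size 7168, top-8 routing, 58 MoE layers, FP8 activations) are approximate DeepSeek-V3-class values, used only for scale:

```python
# Order-of-magnitude MoE all-to-all traffic: each token's hidden state is
# dispatched to its top-k routed experts and the results gathered back, per
# MoE layer. Parameter values are approximate assumptions for scale only.

def moe_a2a_gb_per_1k_tokens(hidden: int = 7168, top_k: int = 8,
                             moe_layers: int = 58, elem_bytes: int = 1) -> float:
    per_token = 2 * top_k * hidden * elem_bytes * moe_layers  # dispatch + combine
    return 1000 * per_token / 1e9

print(f"~{moe_a2a_gb_per_1k_tokens():.1f} GB of all-to-all traffic per 1,000 decoded tokens")
```

At DeepEP's ~50 GB/s inter-node RDMA rate, traffic of this order caps cross-node decode at a few thousand tokens per second per link, which is why hierarchical schemes keep most dispatch traffic on NVLink.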
Feasible Parallelism Strategies Over Optical ISLs
The following applies to parallelism across satellites — i.e., when the required domain exceeds a single satellite's internal NVLink domain. For monolithic 72-GPU satellites, cross-satellite parallelism is needed only for NVL144+ workloads or for data-parallel scale-out (which is trivial). For distributed architectures with smaller satellites, these constraints apply to any workload exceeding one satellite's GPU count.
Feasible today (800 Gbps demonstrated):
- Data parallelism (scale-out): Independent model replicas on separate satellites, each serving independent requests. No inter-satellite communication needed. Embarrassingly parallel, works at any bandwidth. The most natural fit for orbital compute.
- Pipeline parallelism across satellites: PP uses point-to-point transfers between stages. At 800 Gbps (100 GB/s), transferring activation tensors between pipeline stages is feasible for moderate batch sizes. A Llama 70B activation tensor might be 2-10 GB depending on batch size and precision, transferable in 20-100ms. This adds latency but works for throughput-oriented workloads.
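The 20-100 ms transfer estimate in the pipeline-parallelism bullet above follows from tensor size over link bandwidth. A sketch with illustrative dimensions (the batch and chunk sizes below are assumptions, not from the cited sources):

```python
# Pipeline-stage transfer time over an ISL: activation tensor bytes divided
# by link bandwidth. 100 GB/s corresponds to the demonstrated 800 Gbps
# Suncatcher link.

def pp_transfer_ms(batch: int, seq_len: int, hidden: int,
                   elem_bytes: int, link_gb_s: float = 100.0) -> float:
    tensor_gb = batch * seq_len * hidden * elem_bytes / 1e9
    return tensor_gb / link_gb_s * 1e3

# Llama 70B-class stage boundary (hidden 8192, FP16), batch 8, 16K-token prefill chunk:
print(f"{pp_transfer_ms(8, 16384, 8192, 2):.0f} ms per stage transfer")
```

Decode-phase transfers (seq_len = 1) are negligible by comparison; prefill chunks dominate the inter-stage traffic.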
Feasible with DWDM (10+ Tbps projected):
- Tensor parallelism across small groups: If DWDM achieves 10+ Tbps between satellite pairs, TP=2 or TP=4 across satellites becomes viable (comparable to NVLink 3.0 at 600 GB/s per GPU). This would enable serving 70B-405B dense models across 2-4 satellites.
- Expert parallelism across small groups: With enough aggregate bandwidth, EP=8 across satellites might become feasible, though the all-to-all pattern remains challenging.
Likely infeasible regardless of bandwidth:
- Wide EP (EP=32-64) across satellites: The all-to-all communication pattern with 32-64 endpoints requires aggregate bandwidth that scales with the number of endpoint pairs. Even at 40 Tbps per link, routing 64-way all-to-all across satellites would face topological bandwidth constraints absent in the fully-connected NVLink switch fabric.
Implications for Orbital Compute Architecture
The key question for orbital feasibility is not "can ISLs replace NVLink?" (they cannot) but "can the workload's required NVLink domain fit within a single satellite?"
Our satellite GPU capacity analysis finds that monolithic satellites housing a full NVL72 rack (~72 GPUs, ~130 kW, with internal NVLink) are physically feasible and represent the baseline design point for multiple industry proposals (SpaceX AI Sat Mini, Starcloud-3). This significantly changes the feasibility picture compared to an architecture of small satellites:
Monolithic satellites (72+ GPUs, internal NVLink):
- Workloads fitting within a 72-GPU domain — including many frontier MoE configurations at EP=64 and production throughput under current batching/context assumptions — run entirely within a single satellite. Inter-satellite links carry only data-parallel scale-out traffic (embarrassingly parallel, any bandwidth suffices). Note that actual domain size depends on architecture, quantization, batch size, context length, and throughput targets; long-context or very high-batch workloads may exceed a 72-GPU domain even today.
- NVL144-class workloads (~144 GPUs, ~260 kW): Approach but do not exceed the practical single-satellite power ceiling of ~300-500 kW. A single satellite housing an NVL144 domain is physically feasible but pushes thermal and structural limits.
- NVL576+ workloads (~1 MW, multi-rack): Exceed single-satellite capacity. These require either cross-satellite NVLink-class bandwidth (infeasible with ISLs) or pipeline parallelism across satellites (feasible but at reduced throughput).
Distributed satellites (8-16 GPUs each, ISL-connected):
- Workloads fitting within one satellite (up to ~70B quantized, 1-8 GPUs): Work well — no inter-satellite networking needed for tightly-coupled computation. This covers Tier 1 workloads.
- Workloads exceeding one satellite's domain: Pipeline parallelism across 2-4 satellites at 800 Gbps+ ISL bandwidth is feasible for models up to ~405B, adding latency but achieving good throughput with batching. Communication-efficient techniques (disaggregated serving, hierarchical all-to-all) may extend this to MoE models at reduced throughput (3-10× below terrestrial NVL72).
- Wide EP (64+ GPUs) across satellites: Infeasible due to the 9-18× per-link bandwidth gap and absent all-to-all switch-fabric topology.
For reference, our conventions define three workload tiers based on domain size requirements (independent of satellite architecture):
- Tier 1 (1-8 GPUs): Models up to ~70B quantized. Feasible on any satellite architecture. The rapid improvement of small models (closing the frontier gap in 6-12 months epoch-consumer-gpu-gap.1) means this tier handles increasingly capable models over time.
- Tier 2 (8-72 GPUs): Large dense models and frontier MoE with wide EP. Fit within a monolithic 72-GPU satellite but exceed a small satellite's domain — for distributed architectures, these require cross-satellite parallelism with the ISL bandwidth constraints described above.
- Tier 3 (72+ GPUs): NVL144+ workloads. Approach or exceed even monolithic satellite capacity. The terrestrial roadmap (NVL144 → NVL576 → NVL1152) is pushing domains beyond what near-term satellites can house.
The monolithic satellite path resolves the near-term inter-satellite bandwidth problem for current frontier workloads. However, it introduces its own engineering challenges — thermal management at 130+ kW, complex deployable structures, higher per-event loss risk — discussed in the satellite GPU capacity analysis. And the terrestrial competitive bar continues to advance: as domains grow from NVL72 to NVL144 and NVL576, the question of whether a single satellite can keep pace will recur at each generation.
The Direction of Model Architecture Evolution
Two countervailing trends shape future domain size requirements:
Trend toward larger domains: MoE architectures dominate frontier models (60%+ of recent releases nvidia-moe-frontier-models.2), and MoE inference benefits enormously from wide EP requiring large NVLink domains. NVIDIA's roadmap (NVL72 → NVL144 → NVL576) explicitly grows domain size. Future models with more experts will require even wider EP for optimal throughput.
Trend toward smaller effective models: Distillation, quantization (FP4, INT4), and architectural innovations (MLA, GQA) compress frontier capabilities into smaller models. A model that required 8 GPUs today may need 2 GPUs in 18 months. The consumer GPU gap analysis shows frontier capabilities becoming available on single consumer GPUs in 6-12 months epoch-consumer-gpu-gap.1.
Net effect: For any given capability level, the required domain size is shrinking over time. But the frontier itself is constantly advancing — the newest, most capable models consistently require the largest domains. Orbital compute may always be 1-2 generations behind the terrestrial frontier in terms of what models it can serve, but the models it can serve will be increasingly capable.
The Accelerating Competitive Bar for AI Data Center Networking
Two March 2026 sources — SemiAnalysis's GTC 2026 recap [semianalysis-gtc-2026.1, semianalysis-gtc-2026.2] and Jensen Huang's Lex Fridman interview [jensen-huang-lex-2026.1, jensen-huang-lex-2026.2] — underscore that the networking bar for competitive AI inference is not just high but accelerating:
The terrestrial roadmap extends well beyond NVL144. NVIDIA's GTC 2026 announcements: NVL576 (8 Oberon racks, CPO inter-rack all-to-all) and NVL1152 (8 Kyber racks, Feynman generation) semianalysis-gtc-2026.1. The NVL144 Kyber rack alone requires 72 NVLink 7 switches at 28.8 Tbps each semianalysis-gtc-2026.2. A hypothetical NVL288 backplane would need 20,736 differential pairs — approaching practical copper limits. NVIDIA cannot double electrical lane speed beyond 224 Gbps, meaning copper-only bandwidth scaling is bounded semianalysis-gtc-2026.2.
Co-packaged optics (CPO) bridges to multi-rack domains. NVIDIA's approach: "use copper where they can, and optics where they must" semianalysis-gtc-2026.2. CPO enters for inter-rack connectivity starting with NVL576. Even if orbital ISLs achieve multi-Tbps bandwidth, they would need CPO-equivalent functionality between satellites — a more constrained problem than terrestrial inter-rack CPO due to pointing, vibration, and thermal distortion.
Inference architecture is disaggregating into more tightly coupled components. The Groq LPU integration demonstrates attention-FFN disaggregation across heterogeneous hardware (GPUs + LPUs + storage accelerators) connected via all-to-all networks semianalysis-gtc-2026.1. The LPX rack's 640 TB/s internal bandwidth and inter-rack Spectrum-X connections show competitive MoE inference now requires coordinating multiple hardware types across rack types within a pod. A single Vera Rubin pod achieves 10 PB/s of internal scale bandwidth jensen-huang-lex-2026.1.
Jensen Huang explicitly rejects the notion that inference can be served on loosely-coupled hardware. Inference "is thinking, and I think thinking is hard. Thinking is way harder than reading" jensen-huang-lex-2026.2. NVLink-72 was built so a 4-10 trillion parameter model runs "as if on one GPU" jensen-huang-lex-2026.1. Test-time compute scaling and agentic systems drive inference compute requirements upward.
Implications for orbital compute: A monolithic satellite with an internal NVL72 can match the current single-rack competitive bar. But the terrestrial frontier is moving to multi-rack pods with heterogeneous hardware, CPO inter-rack fabric, and 10+ PB/s internal bandwidth — a level of integration with no orbital analogue. Even if individual satellites house NVL72 or NVL144 domains, they cannot replicate the inter-rack all-to-all topology, heterogeneous hardware coordination, and sub-microsecond switching of an NVL576 pod. Orbital compute remains competitive for workloads fitting within a single satellite's domain (Tier 1-2), but the Tier 3 frontier — where terrestrial infrastructure coordinates hundreds of GPUs across multiple rack types — is widening, not closing.
Value Derivation
These values represent the workload's required domain size (how many GPUs must be in a single NVLink domain). The orbital feasibility of each depends on satellite architecture:
Optimistic (8 GPUs): FP4/INT4 quantized MoE models fit on a single 8-GPU node. Feasible on any satellite architecture — even the smallest compute satellites. All inter-satellite networking is embarrassingly parallel scale-out.
Central (16 GPUs): Frontier models require 2 nodes (16 GPUs) with pipeline parallelism or limited EP. On a monolithic 72-GPU satellite, this fits trivially within the internal NVLink domain. On small (8-GPU) satellites, it requires cross-satellite PP at 800 Gbps+ — feasible with demonstrated Suncatcher-class ISLs.
Conservative (72 GPUs): Wide EP across a full NVL72 rack is necessary for production-competitive throughput on frontier MoE models — this is what top-tier labs currently deploy semianalysis-inferencex-v2.1. On a monolithic satellite with an internal NVL72, this fits within a single satellite. On distributed small satellites, it is infeasible — the ISL bandwidth gap and absent all-to-all topology prevent cross-satellite EP at this scale.
Evidence
Scale-Up Domain Size for Current Frontier Models
Llama 405B in FP8 requires a minimum of 8x H100 80GB GPUs (640 GB total VRAM). In BF16, it requires 16 GPUs across 2 nodes. With INT4 quantization, 4x A100 80GB (320 GB) suffices. [premai-parallelism-guide-2026.1]
DeepSeek-V3/R1 (671B MoE) requires 8x H100 80GB minimum in FP8 (671 GB weights), or 8x H200 141GB for BF16 (1,342 GB weights). The model has 256 experts but only activates ~37B parameters per token. [premai-parallelism-guide-2026.2]
For DeepSeek R1's wide-EP decode, the optimal configuration distributes 256 experts across 64 GPUs (4 experts per GPU), requiring all 64 GPUs within a single NVLink domain. This requires the GB200 NVL72 rack. [nvidia-dynamo-moe-inference.1]
The GB200 NVL72 connects 72 Blackwell GPUs with 5th-generation NVLink providing 1.8 TB/s per-GPU NVLink bandwidth and 130 TB/s aggregate all-to-all bandwidth. Note: these bandwidth figures come from NVIDIA's NVL72 wide-EP technical documentation [nvidia-wide-ep-nvl72.2], not the GB200 specs page. [nvidia-wide-ep-nvl72.2]
NVIDIA's next-generation Vera Rubin NVL144 CPX (available late 2026) doubles the domain size to 144 GPUs with NVLink 6.0 at 3.6 TB/s per GPU. It delivers 100TB of fast memory and 1.7 PB/s of memory bandwidth in a single rack. The platform is purpose-built for million-token context inference. [nvidia-rubin-cpx-nvl144.1]
Rubin Ultra (2027) will further increase to NVLink 7.0 with a 144-GPU NVLink domain delivering 15 exaFLOPS of FP4 inference. [nvidia-rubin-cpx-nvl144.2]
Tensor Parallelism Bandwidth Requirements
Tensor parallelism (TP) requires two all-reduce synchronization operations per transformer layer. Llama 70B has 80 layers = 160 sync points per forward pass. NVLink is "effectively mandatory" for TP beyond TP=2. On PCIe 5.0 (128 GB/s), communication consumes 40-50% of inference time at TP=4. [premai-parallelism-guide-2026.3]
TP scaling efficiency: TP=2 achieves 85-95% efficiency; TP=4 achieves 70-85%; TP=8 achieves 56-75%. At TP=8, 25-44% of potential speedup is lost to communication overhead even on NVLink. [premai-parallelism-guide-2026.4]
A single Llama 3.1 70B inference query requires up to 20 GB of TP synchronization data transferred from each GPU. At batch size 32, NVSwitch-equipped H100 systems achieved 168 tok/s/GPU vs 112 tok/s/GPU without NVSwitch -- a 1.5x improvement. [nvidia-nvlink-supercharge-inference.1]
Bandwidth viability thresholds for tensor parallelism: NVLink 4.0 (900 GB/s) = "excellent, TP=8 works well"; NVLink 3.0 (600 GB/s) = "good, TP=8 acceptable"; PCIe 5.0 (128 GB/s) = "marginal, TP=2 max"; PCIe 4.0 (64 GB/s) = "poor, avoid TP". [premai-parallelism-guide-2026.5]
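[analysis] A rough model of why these thresholds fall where they do: a ring all-reduce moves ~2(tp-1)/tp of the activation tensor per GPU, and TP does two all-reduces per layer. A sketch under simplifying assumptions (bandwidth-only cost, no per-message latency — which in practice dominates small decode messages — and Llama-70B shapes, hidden=8192, 80 layers):

```python
def tp_allreduce_ms(batch: int, seq: int, tp: int, bw_gb_s: float,
                    hidden: int = 8192, layers: int = 80,
                    bytes_per_elem: int = 2) -> float:
    """Per-forward-pass all-reduce time (ms): ring all-reduce moves
    2*(tp-1)/tp of the activation per GPU, 2 all-reduces per layer."""
    activation = batch * seq * hidden * bytes_per_elem          # bytes
    per_gpu_bytes = 2 * (tp - 1) / tp * activation * 2 * layers
    return per_gpu_bytes / (bw_gb_s * 1e9) * 1e3

# One decode step (seq=1), batch 32, TP=4:
nvlink4 = tp_allreduce_ms(32, 1, tp=4, bw_gb_s=900)   # NVLink 4.0
pcie5   = tp_allreduce_ms(32, 1, tp=4, bw_gb_s=128)   # PCIe 5.0
print(f"{nvlink4:.3f} ms vs {pcie5:.3f} ms -> {pcie5 / nvlink4:.1f}x slower")
```

The ~7x time ratio tracks the raw bandwidth ratio; adding per-collective latency widens the gap further, consistent with PCIe being "marginal, TP=2 max."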
Pipeline Parallelism and Lower-Bandwidth Tolerance
Pipeline parallelism (PP) uses point-to-point transfers between adjacent stages (not all-to-all), requiring far less bandwidth than TP. PP works on PCIe systems. Multi-node standard pattern: TP within nodes (NVLink), PP across nodes (100 Gbps Ethernet minimum, InfiniBand preferred). [premai-parallelism-guide-2026.6]
PP has a "bubble problem": with PP=4, each GPU sits idle 75% of the time for a single request. Continuous batching mitigates this with concurrent traffic, but single-request latency is always worse with PP than TP. [premai-parallelism-guide-2026.7]
PP generates "more predictable, structured traffic flows between consecutive pipeline stages" via point-to-point Send/Recv operations, compared to TP's all-to-all AllGather and ReduceScatter collectives. [ai-dc-networking-gpu-clusters.1]
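[analysis] The 75% idle figure follows from the standard synchronous-pipeline bubble formula, (p-1)/(m+p-1) for p stages and m in-flight microbatches; a sketch:

```python
def pp_bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle ("bubble") fraction of a synchronous pipeline: (p-1)/(m+p-1)."""
    return (stages - 1) / (microbatches + stages - 1)

print(pp_bubble_fraction(4, 1))    # 0.75 -> single request: each GPU idle 75%
print(pp_bubble_fraction(4, 16))   # ~0.16 -> continuous batching shrinks the bubble
```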
Expert Parallelism (MoE) Networking Requirements
Wide-EP on DeepSeek R1 with EP=32 achieves 1.8x more output tokens/sec/GPU than EP=8 at 100 tokens/sec per user. Wide-EP distributes fewer experts per GPU, freeing HBM for KV cache and increasing batch capacity. [nvidia-wide-ep-nvl72.1]
"Without the 130 TB/s of aggregate bandwidth provided by the NVL72, the complexity and overhead of [token-gather] communication pattern would make large-scale EP impractical." The all-to-all operations during the MoE phase "can quickly saturate an already memory-bound decode phase." [nvidia-wide-ep-nvl72.2]
On NVL72, frontier MoE models (Kimi K2 Thinking, DeepSeek-R1, Mistral Large 3) achieve 10x performance improvement over HGX H200 systems. Prior to NVL72, the max NVLink domain was 8 GPUs (H200); EP beyond 8 GPUs required "higher-latency scale-out networking" which bottlenecked performance. [nvidia-moe-frontier-models.1]
"All top tier labs are already using disaggregated inferencing and wide expert parallelism" -- including OpenAI, Anthropic, xAI, Google DeepMind, and DeepSeek. Single-node inference is insufficient for frontier production deployment. [semianalysis-inferencex-v2.1]
DeepSeek R1 EP8 (single node) places 32 experts/layer/GPU; EP64 (8 nodes) places 4 experts/layer/GPU. Wider EP yields three compounding benefits: reduced expert weight footprint frees HBM for KV cache, higher tokens-per-expert improves arithmetic intensity, and aggregate HBM bandwidth scales linearly with GPU count. [semianalysis-inferencex-v2.2]
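[analysis] The placement arithmetic in the EP8-vs-EP64 comparison, sketched below; expert count and EP degrees come from the DeepSeek R1 figures above, while the relative-footprint framing is our own:

```python
def experts_per_gpu(total_experts: int, ep: int) -> int:
    assert total_experts % ep == 0, "experts must divide evenly across EP ranks"
    return total_experts // ep

TOTAL = 256  # DeepSeek R1 routed experts per MoE layer
for ep in (8, 16, 32, 64):
    per_gpu = experts_per_gpu(TOTAL, ep)
    # Expert-weight footprint per GPU scales with experts held locally,
    # so EP64 frees ~7/8 of EP8's expert HBM for KV cache.
    print(f"EP={ep:2d}: {per_gpu:2d} experts/GPU "
          f"({per_gpu / experts_per_gpu(TOTAL, 8):.0%} of EP8 footprint)")
```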
Prefill vs Decode Phase Differences
NVIDIA Dynamo's disaggregated serving separates compute-bound prefill from memory-bound decode onto different GPUs. For DeepSeek R1, disaggregated serving achieved 6x throughput gain in the medium-latency regime vs co-located approaches. [nvidia-dynamo-moe-inference.2]
The prefill and decode phases have different compute and memory resource requirements, making different parallelism configurations optimal for each phase. The source describes prefill as compute-bound and decode as memory-bound, but does not specify a preferred TP degree for prefill. [nvidia-dynamo-moe-inference.3]
SGLang on GB200 NVL72 achieved 26,156 input tokens/sec/GPU (prefill) and 13,386 output tokens/sec/GPU (decode) for DeepSeek R1 with FP8 attention and NVFP4 MoE -- a 3.8x and 4.8x speedup vs H100 settings. Configuration used 48 decode ranks and 2-4 prefill ranks per instance. [lmsys-gb200-deepseek-part2.1]
KV Cache and Memory Requirements
Llama 70B KV cache formula: 2 x 80 layers x 8 KV heads x 128 dim x seq_len x batch_size x bytes_per_element. At 32K context, batch 8, FP16: ~42 GB KV cache on top of model weights. For a single request at 32K context: ~5 GB KV cache. [premai-parallelism-guide-2026.8]
Rule of thumb: reserve 40-50% of VRAM beyond model weights for KV cache and runtime overhead. KV cache scales linearly with context length and batch size. [premai-parallelism-guide-2026.9]
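[analysis] A calculator for the KV cache formula above. Note that the quoted ~5 GB (single request) and ~42 GB (batch 8) figures are reproduced with 1 byte per element (FP8-style cache), so `bytes_per_elem` is left as a parameter; at 2 bytes (FP16) each number doubles:

```python
def kv_cache_gb(seq_len: int, batch: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """KV cache size in GB: 2 (K and V) x layers x kv_heads x head_dim
    x seq_len x batch x bytes_per_element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

print(kv_cache_gb(32_768, 1))    # ~5.4 GB, single request at 32K context
print(kv_cache_gb(32_768, 8))    # ~42.9 GB, batch 8
print(kv_cache_gb(32_768, 64))   # ~344 GB, batch 64: far above the weight floor
```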
NVIDIA Dynamo supports distributing KV cache across prefill and decode workers. Note: the specific terms "KV Cache Manager," "hierarchical caching," and "GPUDirect RDMA" appear in Dynamo's GitHub documentation rather than the cited MoE blog post. The blog post describes disaggregated prefill/decode serving but does not detail these mechanisms. [nvidia-dynamo-moe-inference.4]
Long-Context Inference Scaling
The Vera Rubin NVL144 CPX with 100 TB of fast memory is specifically designed for 1M+ token context workloads. For Llama 3 405B processing 1M tokens: 128 H100 GPUs across 16 nodes required, achieving 77 seconds with 93% parallelization efficiency. [nvidia-rubin-cpx-nvl144.3]
Llama 70B KV cache at 32K context with batch size 8 in FP16: ~42 GB (per the KV cache formula above). KV cache can easily exceed model weights at long context lengths. [premai-parallelism-guide-2026.10]
Optical Inter-Satellite Link Capabilities
Google's Project Suncatcher demonstrated 800 Gbps each-way (1.6 Tbps total) using a single transceiver pair in bench testing. With DWDM and spatial multiplexing, "tens of terabits per second" between satellites "should be possible." The 81-satellite cluster operates at ~650 km altitude with inter-satellite distances of 100-200m. [google-suncatcher.1]
At 100-200m separation, speed-of-light propagation latency is ~0.3-0.7 microseconds -- negligible compared to processing latency. Free space propagation is ~50% faster than fiber. [google-suncatcher.2]
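[analysis] The latency figures follow directly from distance over c; a sketch (the 1.47 refractive index used for the fiber comparison is the usual silica value, an assumption here):

```python
C = 299_792_458.0   # speed of light in vacuum, m/s

def one_way_delay_us(distance_m: float, index: float = 1.0) -> float:
    """One-way propagation delay in microseconds; index=1.0 for free
    space, ~1.47 for silica fiber (light travels at c/index in the medium)."""
    return distance_m * index / C * 1e6

print(one_way_delay_us(100))        # ~0.33 us at 100 m separation
print(one_way_delay_us(200))        # ~0.67 us at 200 m
print(one_way_delay_us(200, 1.47))  # ~0.98 us over an equivalent fiber run
```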
The Suncatcher paper does not specify what parallelism strategy would be used across the satellite cluster, nor whether the demonstrated bandwidth is sufficient for TP, PP, or EP. The paper focuses on system design feasibility, not distributed ML architecture. [google-suncatcher.3]
Model Architecture Trends and Future Domain Size
Since early 2025, over 60% of open-source frontier model releases use MoE architecture. The top 10 most intelligent open-source models all use MoE. MoE has "rapidly become the architecture of choice" for frontier models. [nvidia-moe-frontier-models.2]
Frontier AI capabilities become runnable on a single consumer GPU (RTX 4090, ~24 GB VRAM) within 6-12 months on average. Small open models improve faster (+125 ELO/year) than frontier models (+80 ELO/year), driven by distillation and quantization. [epoch-consumer-gpu-gap.1]
Frontier dense models in the 100B-200B parameter range fit on 2-4 GPUs with FP8 quantization. (Note: specific model parameter estimates such as GPT-4o ~200B or Claude 3.5 Sonnet ~175B are widely discussed in industry but originate from leaks and speculation, not official disclosures. The general claim about frontier dense models fitting in 100B-200B range is supported by public model releases like Llama 405B and the observation that production dense models are designed to fit on a single 8-GPU node.) [semianalysis-inferencex-v2.3]
Models under 35B with any quantization do not require multiple GPUs. 70B at INT4 runs on a single H100. Specialized fine-tuned 8B models sometimes beat general 70B models on domain tasks. [premai-parallelism-guide-2026.11]
Scale-Out and Embarrassingly Parallel Inference
Data parallelism (DP) across independent model replicas is "embarrassingly parallel" -- multiple copies of the model run on separate GPU clusters, each serving independent requests with no cross-replica communication. This is the primary scale-out mechanism for inference. [ai-dc-networking-gpu-clusters.2]
For models that fit on a single node, data parallelism deploys independent instances across nodes, each serving requests with no cross-node communication, scaling throughput linearly. [premai-parallelism-guide-2026.12]
Bandwidth Gap: NVLink vs Optical ISL
- NVLink 5 (Blackwell): 1,800 GB/s (14.4 Tbps) per GPU bidirectional. Note: this bandwidth figure is sourced from NVIDIA's NVL72 wide-EP documentation [nvidia-wide-ep-nvl72.2] rather than the GB200 specs page. [nvidia-wide-ep-nvl72.2]
- NVLink 6 (Vera Rubin NVL72): 3,600 GB/s (28.8 Tbps) per GPU bidirectional, 260 TB/s aggregate all-to-all bandwidth — exactly 2x Blackwell NVL72. The NVL72 uses 36 NVLink Switch chips (vs 18 in Blackwell) with bidirectional SerDes at double the lane rate. Shipping H2 2026. [nvidia-nvlink6-specs.1]
- Suncatcher bench demo: 800 Gbps each-way (1.6 Tbps total, 0.2 TB/s bidirectional) per optical ISL pair. Google states the required bandwidth is "on the order of 10 Tbps," achievable via DWDM with 9.6-12.8 Tbps per aperture; higher scaling requires spatial multiplexing at very short separations. [google-suncatcher.4]
- [analysis] Extrapolating from Google's Suncatcher DWDM figures (9.6-12.8 Tbps per aperture), spatial multiplexing with multiple apertures could plausibly reach a 10-40 Tbps upper bound per ISL — this is our own extrapolation, not Google's projection.
- For multi-node inference: InfiniBand strongly preferred. 100 Gbps Ethernet minimum viable. 10 Gbps will bottleneck pipeline parallelism transfers. [premai-parallelism-guide-2026.13]
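[analysis] The headline ~9x gap compares raw link rates; the effective per-GPU gap depends on how many GPUs share each ISL. A sketch (the 4-GPUs-per-satellite and single-ISL cases are illustrative assumptions, not figures from the sources):

```python
NVLINK5_TB_S = 1.8   # per-GPU bidirectional, Blackwell NVL72
ISL_TB_S     = 0.2   # per optical ISL pair, Suncatcher bench demo

def effective_isl_per_gpu(gpus_per_sat: int, isls_per_sat: int = 1,
                          isl_tb_s: float = ISL_TB_S) -> float:
    """Cross-satellite bandwidth per GPU when a satellite's GPUs share
    its ISLs -- the number to weigh against per-GPU NVLink bandwidth."""
    return isl_tb_s * isls_per_sat / gpus_per_sat

print(NVLINK5_TB_S / ISL_TB_S)                    # ~9x raw link-rate gap
print(effective_isl_per_gpu(4))                   # 0.05 TB/s per GPU
print(NVLINK5_TB_S / effective_isl_per_gpu(4))    # ~36x effective per-GPU gap
```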
Communication-Efficient MoE Serving
MegaScale-Infer (ByteDance, SIGCOMM 2025) disaggregates attention and FFN modules onto separate GPU nodes, replacing all-to-all collectives with M2N point-to-point communication. Custom M2N library achieves 4.2x higher throughput and 68.2% lower latency than NCCL at 256KB message sizes. End-to-end: up to 1.9x higher per-GPU throughput and 1.5-2x cost reduction in production. — megascale-infer-sigcomm
MegaScale-Infer supports heterogeneous clusters where some nodes use PCIe (L40S) rather than NVLink for intra-node communication, demonstrating that disaggregated MoE inference does not require NVLink on all nodes. — megascale-infer-sigcomm
DeepEP (DeepSeek) provides two communication modes for MoE all-to-all: a high-throughput mode achieving 153 GB/s on NVLink intra-node and 43-58 GB/s on RDMA inter-node, and a low-latency mode achieving 77-194 microsecond dispatch latency using pure RDMA. — deepep-communication-lib
MoETuner uses ILP to optimize expert placement by minimizing maximum communication cost across GPU pairs. On 16 H200 GPUs across 2 InfiniBand-connected nodes: 30.5% reduction in tail latency, 24.7% reduction in average latency, 17.5% end-to-end speedup vs naive contiguous placement. — moetuner-expert-placement
LMSYS deployed DeepSeek-V3 with EP72 across 12 nodes (96 H100 GPUs) using InfiniBand inter-node connectivity. Achieved 52.3k input tokens/sec and 22.3k output tokens/sec per node, within 5.6% of DeepSeek's official profile. Uses two-batch overlap to mask inter-node communication latency. — lmsys-large-scale-ep
vLLM achieved 2,200 output tokens/sec per H200 GPU in multi-node InfiniBand deployments using wide EP with DeepEP kernels and dual-batch overlap. — vllm-large-scale-ep
Terrestrial Networking Roadmap (GTC 2026)
At GTC 2026, Nvidia announced NVL576 (8 Oberon racks with CPO inter-rack all-to-all), NVL1152 (8 Kyber racks, Feynman generation), and the Groq LPU integration. The LPX rack provides 640 TB/s of internal scale-up bandwidth from 256 LPUs. Attention-FFN disaggregation (AFD) routes tokens between GPU racks (attention) and LPU racks (expert FFN) via all-to-all collectives over Spectrum-X Ethernet. — semianalysis-gtc-2026
NVL144 Kyber rack requires 72 NVLink 7 switches at 28.8 Tbps each and an "extremely high spec PCB" for all-to-all routing density. Nvidia's approach: "use copper where they can, and optics where they must." All intra-rack scale-up remains copper through Rubin Ultra; CPO enters for inter-rack connections starting with NVL576. A hypothetical NVL288 backplane would need 20,736 differential pairs, approaching practical copper limits. Nvidia is unable to achieve another doubling of electrical lane speed from 224 Gbps to 448 Gbps, meaning copper-only bandwidth scaling is limited. — semianalysis-gtc-2026
Jensen Huang (March 2026): NVLink-72 was built specifically so "an entire 4 trillion, 10 trillion parameter model" can run "in one computing domain as if it's running on one GPU." The Vera Rubin pod achieves 10 PB/s of internal scale bandwidth with ~1,100 GPUs and 60 exaflops. NVIDIA plans to produce ~200 pods per week. Each NVL72 rack weighs 2-3 tons with 1.3 million components and must be factory-assembled. — jensen-huang-lex-2026
Jensen Huang explicitly rejects the notion that inference can be commoditized on simple hardware: "inference is thinking, and I think thinking is hard. Thinking is way harder than reading." Test-time compute scaling is "intensely compute intensive." On space: acknowledges cooling challenges ("no conduction, no convection... we're gonna put big, giant radiators out there") and describes space compute as practical today only for edge imaging, not large-scale AI inference. — jensen-huang-lex-2026