Terrestrial GPU Annual Failure Rate
What is the annual rate at which datacenter GPUs (H100/A100-class) permanently fail — requiring physical replacement — in terrestrial operation?
Answer
The annual permanent GPU failure rate for H100-class accelerators in terrestrial datacenters ranges from 2.5% (optimistic) to 6% (conservative), with a central estimate of 4%. These rates cover only permanent failures requiring physical GPU replacement — the rate relevant for orbital deployment where failed GPUs cannot be hot-swapped.
- Optimistic (2.5%): Assumes the majority of Meta's "Faulty GPU" job interruptions are recoverable via automated restart/drain, consistent with the H100's improved critical hardware resilience in the Delta study (zero PMU SPI, GPU Fallen Off Bus, and NVLink errors; only 3 GSP errors observed cui-two-gpus-2025.4) and with the lemon node analysis showing ~1.2% of the fleet chronically defective revisiting-ml-cluster-reliability.3.
- Central (4%): Assumes approximately half of observed GPU-category interruptions require physical replacement. Consistent with the 5% overprovisioning recommendation cui-two-gpus-2025.6 and 3–5 day physical replacement cycles nonuniform-tensor-parallelism.1.
- Conservative (6%): Treats all "Faulty GPU" interruptions from Meta's Llama 3 data as permanent, yielding the raw annualized rate of 148 failures / 16,384 GPUs / 54 days × 365 = 6.1%.
Separately:
- HBM uncorrectable error rate: a raw rate of ~10%/year per GPU for H100, but 92% of uncorrectable errors are mitigated by row remapping cui-two-gpus-2025.3, yielding an unmitigated HBM failure rate of ~0.8%/year. HBM failures that exhaust spare row capacity become permanent.
- Transient failure rate (all causes): ~17%/year per GPU — includes software, network, and recoverable hardware issues that interrupt jobs but do not permanently reduce capacity.
- Overprovisioning: Terrestrial operators maintain ~5% spare GPUs and replace failed units within 3–5 days, keeping effective capacity at ~100%. For orbital, where replacement is impossible, the permanent failure rate drives irreversible capacity decay.
Analysis
Distinguishing Failure Categories
The most critical contribution of this page is separating failure types that have fundamentally different operational implications:
Job interruptions (~17%/year): Any event that stops a training job. Meta's Llama 3 data reports 419 such events in 54 days on 16,384 GPUs meta-llama3-paper.1, meta-llama3-paper.2. This is the broadest and least useful metric — it includes software bugs, network issues, and transient hardware events.
GPU hardware faults (~11%/year): Failures attributed to GPU die, HBM, SRAM, system processor, or thermal subsystem. The Llama 3 data attributes 262 of 419 unexpected interruptions to GPU hardware categories (excluding SDC) meta-llama3-paper.2. Many of these are resolved automatically — only 3 manual interventions were needed during the 54-day period meta-llama3-paper.4.
Permanent GPU failures (~4%/year, central): Hardware faults requiring physical replacement. This is the subset that cannot be resolved by automated restart, driver reset, or node reboot. No source directly reports this rate; it is bounded from multiple directions (see below).
For the orbital model, only permanent failures matter. Transient failures cause downtime (reducing availability) but do not permanently reduce the GPU fleet.
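The three tiers above follow from the same conversion; as a sketch, using the counts cited from meta-llama3-paper, with a constant failure rate over the 54-day window as the only assumption:

```python
# Annualize Meta's 54-day Llama 3 interruption counts (meta-llama3-paper)
# into per-GPU yearly rates. The counts are from the text above; the
# constant-rate extrapolation is the only modeling assumption here.

GPUS = 16_384
DAYS = 54.0

def annualized_rate(events: int) -> float:
    """Events per GPU per year, assuming a constant rate over the window."""
    return events / GPUS / DAYS * 365.0

all_interruptions = annualized_rate(419)  # every unexpected job stop
gpu_hw_faults     = annualized_rate(262)  # GPU die/HBM/SRAM/processor/thermal
faulty_gpu_only   = annualized_rate(148)  # "Faulty GPU" category alone

print(f"job interruptions:   {all_interruptions:.1%}/yr")  # ~17.3%/yr
print(f"GPU hardware faults: {gpu_hw_faults:.1%}/yr")      # ~10.8%/yr
print(f"'Faulty GPU' bound:  {faulty_gpu_only:.1%}/yr")    # ~6.1%/yr
```

The permanent rate (~4%) cannot be computed this way, since no source reports a count of physical replacements; it must be bounded, as the next section does.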
Bounding the Permanent Failure Rate
Upper bound (~6%): all "Faulty GPU" events are permanent. Meta's 148 "Faulty GPU" interruptions over 54 days on 16,384 GPUs annualize to 6.1% meta-llama3-paper.2. This is an upper bound because Meta's automation handled nearly all failures with only 3 manual interventions meta-llama3-paper.4, implying many "Faulty GPU" events were resolved without physical replacement.
Middle estimate (~4%): overprovisioning convergence. The Delta study recommends 5% overprovisioning to maintain 99.9% job availability given a 2.2-hour recovery time cui-two-gpus-2025.6. If physical replacement takes 3–5 days nonuniform-tensor-parallelism.1, a 5% spare pool is sized to absorb an annual permanent failure rate of roughly 3–5% while leaving margin for replacement latency. The NTP paper's finding that clusters spend 81% of time with >0.1% of GPUs failed nonuniform-tensor-parallelism.2 corroborates this range.
Lower bound (~2.5%): lemon node analysis + H100 hardware improvements. Meta's lemon node analysis identifies 1.2% of the fleet as chronically defective, with GPUs as the root cause in 28.2% of cases revisiting-ml-cluster-reliability.3. The Delta study found dramatic H100 hardware improvements: zero PMU, GPU Fallen Off Bus, and NVLink errors across 146 days on 608 GPUs cui-two-gpus-2025.4. This suggests the permanent non-memory GPU failure rate is substantially lower than the "Faulty GPU" interrupt rate.
Cross-check: row remapping failures. The Delta study observed 8 row remapping failures on 608 H100 GPUs in 146 days cui-two-gpus-2025.7, annualizing to 3.3%/year for this specific permanent failure mode. This provides a floor on permanent failures from the HBM degradation pathway alone, consistent with the central estimate.
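The row remapping cross-check is the same annualization arithmetic applied to the Delta study's figures:

```python
# Annualize the Delta study's 8 row remapping failures on 608 H100s
# over 146 days (cui-two-gpus-2025.7): a per-GPU yearly rate for this
# one permanent failure mode, assuming a constant rate.
rrf_rate = 8 / 608 / 146 * 365
print(f"row remapping failures: {rrf_rate:.1%}/yr")  # ~3.3%/yr
```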
Cross-check: depreciation evidence. Hyperscalers depreciate GPUs over 5–6 years gpu-useful-life-2025.1, and Google's TPUs operate at 100% after 7–8 years gpu-useful-life-2025.2. A 4% permanent annual failure rate means ~81% of GPUs survive to year 5 without replacement — consistent with managed fleets where replacements maintain capacity.
The Google 1–3 year claim trendforce-gpu-lifespan-2024.1 likely refers to economic useful life (when GPUs become obsolescent) rather than physical failure. This is consistent with NVIDIA's 1-year release cadence making GPUs economically obsolete in 2–3 years even while physically functional.
HBM as a Distinct Failure Mode
HBM deserves separate treatment because it follows a distinct degradation pattern:
- High raw error rate: H100's per-GPU MTBE for uncorrectable ECC errors is 88,768 hours (~10%/year annualized), 3.2x worse than A100 cui-two-gpus-2025.2.
- Effective but finite mitigation: Row remapping succeeds 92% of the time cui-two-gpus-2025.3, reducing the unmitigated rate to ~0.8%/year. But spare rows (capped at 512) are consumed progressively.
- Eventual permanent failure: Once spare rows are exhausted, the next uncorrectable error becomes permanent. Eight RRFs in 146 days on 608 GPUs cui-two-gpus-2025.7 show this transition already occurring.
- Progressive degradation marker: Microsoft found that >10 correctable errors on a row increases regression probability by 77.8% microsoft-superbench.3, confirming that memory errors accelerate toward permanent failure.
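The ~10%/year raw rate and ~0.8%/year unmitigated rate quoted above follow directly from the MTBE and the 92% remapping success rate; a minimal sketch:

```python
# Convert the H100 per-GPU MTBE for uncorrectable ECC errors
# (cui-two-gpus-2025.2) into an annual rate, then apply the 92%
# row-remapping success rate (cui-two-gpus-2025.3).
HOURS_PER_YEAR = 365 * 24    # 8,760
MTBE_HOURS = 88_768          # H100 per-GPU uncorrectable-error MTBE

raw_rate = HOURS_PER_YEAR / MTBE_HOURS   # ~9.9%/yr, quoted as ~10%
unmitigated = raw_rate * (1 - 0.92)      # ~0.8%/yr escapes remapping
print(f"raw {raw_rate:.1%}/yr, unmitigated {unmitigated:.2%}/yr")
```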
For orbital deployment, HBM degradation is compounded by radiation effects (additional uncorrectable errors consuming spare rows faster), making the HBM failure trajectory a key driver of GPU attrition in orbit.
Overprovisioning vs. Capacity Decay
Terrestrial model: overprovisioning with active replacement. Operators maintain ~5% spare GPUs cui-two-gpus-2025.6, replace failed units within 3–5 days nonuniform-tensor-parallelism.1, and maintain dedicated spare buffers per customer nebius-fault-tolerant-2025.2. Automation detects and drains failed nodes with minimal human intervention meta-llama3-paper.4, and proactive lemon detection removes degrading hardware before it causes repeated failures revisiting-ml-cluster-reliability.3. Effective capacity stays at ~100%.
Orbital model: irreversible capacity decay. Failed GPUs in orbit cannot be hot-swapped. Cold spares add mass and cost, degrade from radiation and thermal cycling while idle, and once exhausted, every subsequent GPU failure permanently reduces fleet capacity. The permanent failure rate (~4% central) directly determines the capacity decay rate used in the effective lifetime integral.
The models are complementary, not competing. Terrestrial operators face the same underlying permanent failure rate but can compensate via replacement. The distinction matters because the orbital model should use the permanent failure rate (capacity that cannot be recovered), not the total interruption rate (which includes transient events that terrestrial operators recover from within hours).
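The orbital capacity-decay claim can be made concrete with a toy survival curve: under a constant 4% permanent failure rate and no replacement, remaining capacity compounds down year over year. This is an illustrative model, not a figure from the sources; real attrition would add radiation effects and spare-pool dynamics.

```python
# Toy model: fraction of an unreplenished GPU fleet still functional
# after t years at a constant permanent failure rate (central estimate 4%).
# Terrestrial fleets avoid this decay entirely via 3-5 day replacement.
RATE = 0.04  # permanent failures per GPU-year (central estimate)

def surviving_fraction(years: float, rate: float = RATE) -> float:
    """Compounded survival of an unreplenished fleet."""
    return (1 - rate) ** years

for t in (1, 3, 5, 8):
    print(f"year {t}: {surviving_fraction(t):.1%}")
```

At year 5 the curve gives ~81.5% surviving, matching the depreciation cross-check earlier in the analysis.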
Combination Method
The central estimate of 4% was derived by triangulating five independent data sources:
- Meta Llama 3 primary data meta-llama3-paper.2: 148 "Faulty GPU" events in 54 days on 16,384 H100s → 6.1% annualized upper bound on permanent GPU failures, discounted by the automation evidence meta-llama3-paper.4.
- NCSA Delta longitudinal study cui-two-gpus-2025.6: 5% overprovisioning recommendation, anchoring the steady-state failure-to-replacement ratio.
- Meta ML cluster reliability revisiting-ml-cluster-reliability.3: 1.2% fleet chronically defective, 28.2% GPU-caused, with explicit transient vs. permanent taxonomy revisiting-ml-cluster-reliability.2.
- NTP paper nonuniform-tensor-parallelism.1: 3–5 day physical replacement timeline constraining the replacement throughput of a 5% spare pool.
- Microsoft SuperBench microsoft-superbench.1: 10.36% node defect rate during A100 cluster build-out (includes performance degradation, not just hard failures).
The central value of 4% sits at the intersection of these constraints: above the 2.5% lower bound from lemon node analysis and H100 hardware improvements, below the 6% upper bound from treating all "Faulty GPU" events as permanent, and consistent with the 5% overprovisioning that the Delta study recommends for operational fleets.
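As a sanity check, the bounds named above can be recomputed and the central estimate verified to sit between them; all figures are ones already cited in this section:

```python
# Triangulation check: the central estimate must sit between the
# lemon-node/H100-improvement lower bound (quoted as 2.5%) and the
# all-"Faulty GPU"-permanent upper bound (derived from meta-llama3-paper.2).
upper = 148 / 16_384 / 54 * 365  # ~6.1%/yr if every event is permanent
lower = 0.025                    # quoted lower bound, not independently derived
central = 0.04

assert lower < central < upper
print(f"bounds: {lower:.1%} .. {upper:.1%}, central {central:.1%}")
```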
Key Uncertainties
No source directly reports the permanent GPU replacement rate. All estimates are derived indirectly from total interruption rates, overprovisioning recommendations, and lemon node analysis. The fraction of "Faulty GPU" interruptions that require physical replacement vs. automated recovery is the single largest uncertainty.
H100 observation period is short. The Delta study's H100 data spans only 146 days cui-two-gpus-2025.1 — too short to observe wear-out failure modes or HBM spare row exhaustion at scale. The A100 data (895 days) is more reliable for long-term trends.
Workload dependence. Meta's RSC-1 GPU swap rate is ~3x RSC-2 revisiting-ml-cluster-reliability.4, suggesting the failure rate depends heavily on workload intensity. The values here reflect training-class workloads at near-maximum utilization; inference workloads may be less severe.
HBM degradation trajectory. The fixed spare row count (512) with increasing memory capacity means newer GPUs with more HBM (H100 96GB vs A100 40GB) may exhaust spares faster, causing an increasing permanent failure rate over time. This is not yet captured in the constant annual rate assumption.
Evidence
Long-Duration Field Studies
The Delta HPC system at NCSA comprises 448 A100 GPUs (40 GB HBM2e) observed over 895 days (9.6 million GPU-hours) and 608 H100 GPUs (96 GB HBM3, GH200 Superchip) observed over 146 days (2.1 million GPU-hours), for a combined 11.7 million GPU-hours of operational data. — cui-two-gpus-2025
H100 GPUs exhibit 3.2x lower per-GPU MTBE for uncorrectable ECC memory errors compared to A100: 88,768 hours per GPU for H100 vs 283,271 hours per GPU for A100. The per-GB MTBE is 8.5 million hours (H100 HBM3) vs 11.3 million hours (A100 HBM2e), a 24% reduction. — cui-two-gpus-2025
H100 GPU memory error-recovery mechanisms (row remapping and error containment) successfully mitigate uncorrectable memory errors with a probability of 0.92 (92% success rate). Available spare rows for row remapping are capped at 512 rows, unchanged from A100 despite a 2.4x increase in memory capacity. — cui-two-gpus-2025
H100 GPUs demonstrate significant improvements in GPU hardware resilience over A100: zero PMU SPI errors, zero GPU Fallen Off Bus errors, and zero NVLink errors were observed during the H100 operational period. GSP errors were reduced from 3,857 (A100, 895 days) to 3 (H100, 146 days), a >99.9% reduction. — cui-two-gpus-2025
Overall node availability is approximately 99.4% for A100 GPUs and 99.3% for H100 GPUs, corresponding to 9–10 minutes of downtime per day. Mean recovery time after failure is 0.88 hours for A100 and 2.2 hours for H100. — cui-two-gpus-2025
For a training job using 608 H100 GPUs with 2.2-hour recovery time, maintaining 99.9% job-level availability requires 5% overprovisioning (31 additional GPUs). If recovery time is reduced to 5 minutes, overprovisioning drops to 2%. — cui-two-gpus-2025
Eight row remapping failures (RRFs) were observed on H100 GPUs during their 146-day operational period, indicating memory recovery failure due to exhaustion of spare memory rows. Zero RRFs were observed on A100 GPUs during their 895-day period. — cui-two-gpus-2025
Meta Llama 3 Primary Data
During a 54-day snapshot of Llama 3 405B pre-training on 16,384 H100 GPUs (each at 700W TDP, 80GB HBM3), 466 total job interruptions occurred: 47 planned and 419 unexpected. — meta-llama3-paper
Of the 419 unexpected interruptions, failure categories were: Faulty GPU 148 (30.1%), GPU HBM3 Memory 72 (17.2%), Software Bug 54 (12.9%), Network Switch/Cable 35 (8.4%), Unplanned Host Maintenance 32 (7.6%), GPU SRAM Memory 19 (4.5%), GPU System Processor 17 (4.1%), NIC 7 (1.7%), NCCL Watchdog Timeouts 7 (1.7%), Silent Data Corruption 6 (1.4%), GPU Thermal Interface/Sensor 6 (1.4%), SSD 3 (0.7%), Power Supply 3 (0.7%), Server Chassis 2 (0.5%). — meta-llama3-paper
Approximately 78% of unexpected interruptions were attributed to confirmed or suspected hardware issues. GPU issues (all GPU-related categories combined) accounted for 58.7% of all unexpected issues. Despite 466 total interruptions, the system achieved >90% effective training time. — meta-llama3-paper
Despite the large number of failures, significant manual intervention was required only three times during the 54-day period, with the rest handled by automation. — meta-llama3-paper
Meta Large-Cluster Reliability Study
Meta's RSC-1 (16K GPUs) and RSC-2 (8K GPUs) A100 clusters, totaling 24K GPUs, experienced 4 million jobs over 11 months with 150+ million GPU-hours. Failure rates varied substantially over time, ranging from ~2.5 to ~17.5 failures per 1,000 node-days on RSC-1. — revisiting-ml-cluster-reliability
Hardware errors are binned as transient (e.g., ECC error, link flap) or permanent (e.g., degraded hardware requiring repair or replacement by a vendor). — revisiting-ml-cluster-reliability
"Lemon nodes" — servers causing repeating job failures not identifiable by standard health checks — represent 1.2% of RSC-1's footprint but are involved in 13% of daily jobs. Of lemon node root causes, 28.2% are GPUs, 20.5% DIMMs, 15.4% PCIe, 10.3% EUD, 7.7% BIOS, 7.7% NICs, 5.1% PSUs, 2.6% optics, 2.6% CPUs. — revisiting-ml-cluster-reliability
GPUs are physically swapped in the cluster. RSC-1 GPUs are swapped at approximately 3x the rate of RSC-2, likely due to differing workloads. — revisiting-ml-cluster-reliability
Replacement Timelines and Overprovisioning
Hardware failure recovery time is modeled as 3–5 days for physical GPU replacement, described as "perhaps on the low-side for replacing high-demand hardware." Software failure recovery takes approximately 3 hours. — nonuniform-tensor-parallelism
Using Llama 3 failure rates, a GPU cluster spends 81% of its time with >0.1% of GPUs in a failed state. — nonuniform-tensor-parallelism
The traditional DP-DROP fault tolerance method requires 90 spare racks to maintain full mini-batch size in a 32K GPU cluster, while Nonuniform Tensor Parallelism (NTP) reduces this to 2 spare DP replicas (16 racks). — nonuniform-tensor-parallelism
Industry Validation
During a 90-day cluster build-out evaluation of 24k+ A100 GPUs (3k+ VMs) in Azure, SuperBench identified 10.36% of nodes as defective (exhibiting failure or performance regression). SuperBench has been deployed in Azure production for 2+ years, validating hundreds of thousands of GPUs. — microsoft-superbench
Without proactive validation, baseline MTBI in Azure A100 clusters was 17.5 hours. In simulation, SuperBench's Selector increased MTBI to 22.61x baseline (262.05 hours). 38.1% of incidents previously required more than one day to resolve. — microsoft-superbench
On A100 GPUs, row remapping with more than 10 correctable errors shows a 77.8% higher chance of end-to-end workload regression compared to 1-10 correctable errors, indicating correctable errors are a precursor to permanent degradation. — microsoft-superbench
Nebius reports peak MTBF of 56.6 hours (169,800 GPU-hours) on a 3,000-GPU (375-node) production cluster and average MTBF of 33.0 hours. Average MTTR is 12 minutes across most installations. — nebius-fault-tolerant-2025
Nebius maintains a "dedicated spare buffer of GPU capacity for each customer" to ensure quick provisioning of replacement nodes; faulty nodes are automatically drained and replaced with healthy spares. — nebius-fault-tolerant-2025
GPU Longevity
Hyperscaler GPU depreciation schedules have been extended: Microsoft and Google from 4 to 6 years, Meta from 4.0 to 5.5 years, Oracle from 4 to 6 years. Amazon reversed from 6 back to 5 years in February 2025, incurring a $700 million operating income hit. — gpu-useful-life-2025
Google's custom TPUs have been maintained at 100% utilization after 7–8 years of operation. Azure ran K80-class GPUs in production for 9 years (2014–2023) and P100s for 7 years (2016–2023). — gpu-useful-life-2025
A "general architect at Alphabet" (unnamed) stated that datacenter GPUs last 1–3 years at 60–70% utilization, with 1–2 years typical under heavy AI workloads. TrendForce notes this "cannot be considered 100% accurate and requires further confirmation." — trendforce-gpu-lifespan-2024
Existing Sources (refined interpretation)
Epoch AI calculates MTBF of ~50,000 GPU-hours for H100 GPUs based on Meta's Llama 3 data, treating all 419 unexpected interruptions (hardware + software + network) as "failures." This conflates permanent hardware failures with transient software/network issues. — epoch-gpu-failures
Tom's Hardware reports 148 GPU hardware failures and 72 HBM3 failures across 16,384 H100 GPUs in 54 days. The 148 "GPU failures" annualize to ~6.1%. This is a journalistic interpretation of the primary data in the Meta Llama 3 paper meta-llama3-paper.2. — meta-llama3-failures