- What it does: Reduces memory and compute requirements by representing numbers with 8-bit precision instead of 16-bit (FP16) or 32-bit (FP32).
- Impact: FP8 mixed precision training nearly doubles throughput on H800 GPUs for tensor operations like matrix multiplications, which dominate transformer workloads. Hopper's Tensor Cores natively support FP8, making the format especially effective for large-scale training that demands both computational efficiency and high throughput.
- Estimated Gain: ~1.8x performance boost, critical for achieving high token throughput.
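As a rough illustration of the mechanics (not DeepSeek's actual recipe, which uses fine-grained tile-wise scaling), here is a minimal per-tensor FP8 quantization sketch in PyTorch. It assumes a build with the `torch.float8_e4m3fn` dtype (PyTorch 2.1+); the `quantize_fp8` helper is hypothetical:

```python
import torch

def quantize_fp8(x: torch.Tensor):
    """Cast a tensor to FP8 (e4m3) with a per-tensor scale factor.

    Scaling keeps values inside FP8's narrow dynamic range
    (max ~448 for e4m3) before the cast discards precision.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().max().clamp(min=1e-12) / fp8_max
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Simulate an FP8 matmul: quantize inputs, compute, rescale the output.
a, b = torch.randn(256, 512), torch.randn(512, 128)
a_fp8, sa = quantize_fp8(a)
b_fp8, sb = quantize_fp8(b)
# Hopper Tensor Cores would multiply the FP8 operands directly;
# here we dequantize to float32 for a hardware-agnostic check.
out = (a_fp8.float() * sa) @ (b_fp8.float() * sb)
print((out - a @ b).abs().mean())  # quantization error, typically small
```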
- What it does: Activates only 37B parameters per token out of 671B, significantly reducing the compute cost for forward and backward passes.
- Impact: Sparse activation cuts per-token compute dramatically while retaining the representational capacity of the full parameter set.
- Estimated Gain: 5–10x improvement in compute efficiency compared to dense architectures.
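To make the sparse-activation idea concrete, here is a toy top-k routed MoE layer in PyTorch. The dimensions, expert count, and `TopKMoE` class are illustrative assumptions; DeepSeek-V3's actual design uses many fine-grained experts plus shared experts:

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal mixture-of-experts layer with top-k sparse routing.

    Only k of n_experts run per token, so compute scales with k/n
    while total parameter count scales with n.
    """
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)  # routing probabilities
        weights, idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):               # run only the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```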
- What it does: Eliminates the auxiliary-loss typically used to balance expert activation in MoE, reducing inefficiencies and avoiding performance degradation.
- Impact: Improves token processing efficiency without wasting GPU cycles on balancing overhead.
- Estimated Gain: ~5–10% boost in efficiency, depending on the prior impact of auxiliary losses.
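The V3 technical report describes a bias-based strategy: a per-expert bias is added to the routing scores for top-k selection only, then nudged up or down based on observed load rather than through a gradient. A minimal sketch of that idea, with an illustrative `step` size and update rule:

```python
import torch

# Per-expert bias used only for routing decisions (values assumed).
n_experts, k, step = 8, 2, 0.01
bias = torch.zeros(n_experts)

def route(scores: torch.Tensor):
    """Pick top-k experts using biased scores; the unbiased
    scores would still serve as gating weights for the output."""
    _, idx = (scores + bias).topk(k, dim=-1)
    return idx

def update_bias(idx: torch.Tensor):
    """After each batch, nudge the bias down for overloaded
    experts and up for underloaded ones (no gradient involved)."""
    global bias
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    target = idx.numel() / n_experts
    bias -= step * torch.sign(load - target)

scores = torch.randn(32, n_experts).softmax(-1)  # routing probs for 32 tokens
idx = route(scores)
update_bias(idx)
print(bias)  # biases drift to equalize expert load over time
```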
- What it does: Predicts two tokens per forward pass instead of one, reducing the number of forward passes required for training and decoding. The speculative decoding framework validates the second token, with an acceptance rate of 85–90%.
- Impact:
  - Fewer forward passes accelerate training, improving throughput.
  - The high acceptance rate keeps the overhead from corrections minimal.
- Estimated Gain: ~1.8x improvement in token processing efficiency, depending on model configuration and workload.
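A hedged sketch of the verification idea, using simple greedy acceptance (the function names and accept rule here are illustrative, not DeepSeek's exact scheme):

```python
import torch

def mtp_decode_step(main_logits, spec_logits):
    """One greedy decoding step with a single speculative token.

    main_logits: distribution for position t+1 (trusted head)
    spec_logits: distribution for position t+2 (MTP head)
    """
    tok1 = main_logits.argmax(-1)
    tok2 = spec_logits.argmax(-1)   # speculative; accepted ~85-90% of the time
    return tok1, tok2

def verify(spec_token, next_main_logits):
    """Accept the speculative token iff the main model would
    have produced it; otherwise fall back to the main token."""
    main_token = next_main_logits.argmax(-1)
    accepted = bool(spec_token == main_token)
    return (spec_token if accepted else main_token), accepted

vocab = 100
tok1, tok2 = mtp_decode_step(torch.randn(vocab), torch.randn(vocab))
token, ok = verify(tok2, torch.randn(vocab))
print(tok1.item(), token.item(), "accepted" if ok else "rejected")
```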
- What it does: Optimizes distributed training by overlapping inter-GPU communication with computation, addressing a common bottleneck in large-scale MoE systems.
- Impact: Removes inefficiencies that typically reduce utilization in cross-node setups.
- Estimated Gain: Allows near-100% utilization of GPU capacity during training.
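DeepSeek's DualPipe schedule is bespoke, but the underlying primitive is generic: launch communication asynchronously and do useful compute while it is in flight. A minimal sketch with `torch.distributed` (assumes an initialized process group; the function and its arguments are hypothetical):

```python
import torch
import torch.distributed as dist

def overlapped_step(grad_chunk, next_layer_input, layer):
    """Overlap gradient communication with forward compute."""
    # Kick off the all-reduce without blocking (NCCL runs it on
    # a separate CUDA stream under the hood).
    work = dist.all_reduce(grad_chunk, op=dist.ReduceOp.SUM, async_op=True)

    # Overlap: compute for the next layer proceeds while the
    # gradients of the previous layer travel across GPUs.
    out = layer(next_layer_input)

    work.wait()  # synchronize only when the reduced grads are needed
    return out, grad_chunk
```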
DeepSeek trained its model on a cluster of 2,048 H800 GPUs, leveraging Nvidia's Hopper architecture. These GPUs are designed to excel at tasks like matrix multiplications and sparse attention, particularly when using FP8 mixed precision. While the H800 has lower interconnect bandwidth compared to the H100 due to export regulations, its computational efficiency remains strong for the kinds of workloads needed in large-scale AI training.
Using their stated figures, let’s verify whether the throughput aligns with the claimed GPU-hour budget.
- Throughput per GPU-Hour:
  - Tokens Processed: 14.8 trillion tokens.
  - GPU-Hours Used: 2.664 million.
  - Tokens per GPU-Hour:

    $$\frac{14.8\ \text{trillion tokens}}{2.664\ \text{million GPU-hours}} \approx 5.56\ \text{million tokens per GPU-hour}$$

- Cluster Throughput:
  - GPUs Used: 2,048.
  - Tokens Processed per Hour:

    $$2{,}048\ \text{GPUs} \times 5.56\ \text{million tokens per GPU-hour} \approx 11.38\ \text{billion tokens per hour}$$

- Time to Process 14.8T Tokens:

    $$\frac{14.8\ \text{trillion tokens}}{11.38\ \text{billion tokens per hour}} \approx 1{,}300\ \text{hours (or about 54 days)}$$
This aligns with their claim of completing pretraining in less than two months.
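The arithmetic is easy to reproduce in a few lines of Python:

```python
tokens = 14.8e12          # tokens processed
gpu_hours = 2.664e6       # reported GPU-hour budget
gpus = 2048               # cluster size

per_gpu_hour = tokens / gpu_hours   # ~5.56e6 tokens per GPU-hour
per_hour = gpus * per_gpu_hour      # ~1.138e10 tokens per hour
hours = tokens / per_hour           # ~1,300 hours
print(f"{per_gpu_hour:.3g} tok/GPU-h, {hours:.0f} h, {hours/24:.0f} days")
# -> 5.56e+06 tok/GPU-h, 1301 h, 54 days
```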
DeepSeek-V3’s claim of processing 14.8T tokens in 2.664M GPU-hours is plausible. The numbers are internally consistent, and the described techniques align with established principles of efficient large-scale training. While reproduction by other labs will provide final confirmation, the absence of red flags suggests that DeepSeek's reported achievements are feasible.
For more details on DeepSeek-V3 and its training methodology, refer to the technical report on arXiv.
Qualifications: Through my work on the LIT platform, I’ve developed tools that enable data scientists to efficiently design, train, and deploy deep learning models, including advanced workflows for LLMs. Prior to that, I spent 8 years providing professional services in deep learning, building custom AI solutions across diverse industries. In support of that work I’ve read and analyzed hundreds of technical reports and academic papers. My expertise lies in building tooling, pipelines, and integrations for both predictive and generative AI, supported by a strong foundation in deep learning and software engineering.