
Researchers at NVIDIA have developed a novel approach for training large language models (LLMs) in a 4-bit quantized format while maintaining their stability and accuracy at the level of high-precision models. Their technique, NVFP4, makes it possible to train models that not only outperform other known 4-bit formats but match the performance of the larger 8-bit FP8 format, while using half the memory and a fraction of the compute.
The success of NVFP4 shows that enterprises can continue to cut inference costs by running leaner models that match the performance of larger ones at scale. It also points to a future where the cost of LLM training drops to the point that many organizations can train their own bespoke models from scratch rather than just fine-tuning existing ones.
The quantization challenge
Model quantization is a technique used to reduce the computational and memory costs of running and training AI models. It works by converting model parameters, or weights, from high-precision formats such as 16- and 32-bit floating point (BF16 and FP32) to lower-precision formats. The key challenge of quantization is to reduce the size of the model while preserving as much of its knowledge and capabilities as possible.
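To make the idea concrete, here is a minimal, illustrative sketch of single-scale quantization (not NVIDIA's implementation): it maps an FP32 weight tensor onto a coarse grid of values plus one scale factor, which is the basic trade-off every lower-precision format makes.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_levels: int = 16):
    """Map FP32 weights onto a coarse symmetric integer grid with one scale factor."""
    max_abs = np.abs(weights).max()
    scale = max_abs / (num_levels // 2 - 1)                  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -(num_levels // 2), num_levels // 2 - 1)
    return q.astype(np.int8), scale                          # compact storage: small ints plus a scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                      # approximate reconstruction

w = np.random.randn(8).astype(np.float32)
q, s = quantize_symmetric(w)
print(w)
print(dequantize(q, s))                                      # close to w, but on a coarser grid
```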
In recent years, 8-bit floating-point formats (FP8) have become a popular industry standard, offering a good balance between efficiency and performance. They significantly reduce the computational cost and memory demand for LLM training without a large reduction in accuracy.
The next logical step is 4-bit floating point (FP4), which promises to halve memory usage again and further boost performance on modern hardware. However, this transition has been challenging. Existing 4-bit formats, such as MXFP4, often struggle to maintain the same level of accuracy as their 8-bit counterparts, forcing a difficult trade-off between cost and performance.
How NVFP4 works
NVFP4 overcomes the stability and accuracy challenges of other FP4 techniques through an improved design and a targeted training methodology. A major problem with 4-bit precision is its extremely limited range: it can only represent 16 distinct values. When converting from a high-precision format, outlier values can distort the entire dataset, damaging the model's accuracy. NVFP4 uses a more sophisticated, multi-level scaling approach that better handles these outliers, allowing "a more precise and accurate representation of tensor values during training," according to NVIDIA.
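Below is a rough sketch of that two-level scaling idea, assuming a block size of 16 and the 16-value FP4 (E2M1) grid; the exact block size, scale encoding and rounding rules of NVFP4 are NVIDIA's and are not reproduced here. Scaling each small block independently keeps a single outlier from crushing every other value in the tensor onto the same few grid points.

```python
import numpy as np

# The values a 4-bit (E2M1) element can represent: +/- {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float32)
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_blockwise(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Simulate FP4 quantization with one scale per block plus one per-tensor scale."""
    tensor_scale = max(np.abs(x).max() / 6.0, 1e-12)          # second (coarse) scaling level
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i:i + block] / tensor_scale
        block_scale = max(np.abs(chunk).max() / 6.0, 1e-12)   # first (fine-grained) scaling level
        scaled = chunk / block_scale
        idx = np.abs(scaled[:, None] - FP4_GRID[None, :]).argmin(axis=1)  # snap to nearest FP4 value
        out[i:i + block] = FP4_GRID[idx] * block_scale * tensor_scale
    return out

x = np.random.randn(64).astype(np.float32)
x[3] = 25.0                                                   # inject a single outlier
err_single_scale = np.abs(x - quantize_blockwise(x, block=x.size)).mean()
err_block_scale = np.abs(x - quantize_blockwise(x, block=16)).mean()
print(err_single_scale, err_block_scale)                      # the outlier's damage stays inside one block
```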
Beyond the format itself, the researchers introduce a 4-bit training recipe that achieves accuracy comparable to FP8. A central component is their "mixed-precision strategy." Instead of converting the entire model to NVFP4, the majority of layers are quantized while a small fraction of numerically sensitive layers are kept in a high-precision format such as BF16. This preserves stability where it matters most. The method also adjusts how gradients are calculated during backpropagation (the learning phase of the model) to reduce bias that can accumulate from low-precision arithmetic.
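A hedged sketch of that layer-selection idea follows; the fake_quant_fp4 helper and the choice of which layers to protect are hypothetical stand-ins, since the paper's actual recipe (which layers stay in BF16, and how gradients are handled) is NVIDIA's own.

```python
import torch
import torch.nn as nn

def fake_quant_fp4(w: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Simulate block-scaled 4-bit rounding of a weight tensor (illustrative, not NVFP4 itself)."""
    grid = torch.tensor([0, 0.5, 1, 1.5, 2, 3, 4, 6], dtype=w.dtype, device=w.device)
    grid = torch.cat([-grid.flip(0), grid])
    flat = w.reshape(-1, block)
    scale = flat.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 6.0
    idx = (flat / scale).unsqueeze(-1).sub(grid).abs().argmin(dim=-1)     # snap to the FP4 grid
    return (grid[idx] * scale).reshape(w.shape)

def apply_mixed_precision(model: nn.Module, keep_high_precision: set) -> None:
    """Quantize every Linear layer's weights except the named, numerically sensitive ones."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear) and name not in keep_high_precision:
            with torch.no_grad():
                module.weight.copy_(fake_quant_fp4(module.weight))        # simulated 4-bit weights

# Toy model: quantize the first two layers, keep the output head in high precision.
model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64), nn.Linear(64, 8))
apply_mixed_precision(model, keep_high_precision={"2"})
```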
NVFP4 in practice
To test their approach, the NVIDIA team trained a powerful 12-billion-parameter hybrid Mamba-Transformer model on a massive 10 trillion tokens. They then directly compared its performance against a baseline model trained in the widely popular FP8 format. The results showed that the NVFP4 model's training loss and downstream task accuracy closely tracked the FP8 version throughout the process.
Performance held up across a wide range of domains, including reasoning, math, and commonsense tasks, with only a slight drop-off in coding benchmarks late in training.
"This is, to our knowledge, the first successful demonstration of training billion-parameter language models with 4-bit precision on a multi-trillion token horizon, laying the groundwork for faster and more efficient training of future frontier models, the researchers write.
In practice, NVFP4’s 4-bit precision format enables developers and enterprises to train and deploy AI models with the same accuracy as traditional 8-bit formats, said Shar Narasimhan, NVIDIA's director of product for AI and data center GPUs.
“By storing model weights directly in a 4-bit format while preserving accuracy, it empowers developers to experiment with new architectures, iterate faster and uncover insights without being hampered by resource constraints,” he told VentureBeat.
In contrast, FP8 (while already well ahead of FP16) still imposes limitations on model size and inference performance due to high memory and bandwidth demands. “NVFP4 breaks that ceiling, offering equal quality with dramatically more headroom for development and experimentation,” said Narasimhan.
When compared to the alternative 4-bit format, MXFP4, the advantages of NVFP4 become even more apparent. In an experiment with an 8 billion parameter model, NVFP4 converged to a better loss score than MXFP4. To reach the same level of performance as the NVFP4 model, the MXFP4 model must be trained on 36% more data, which is a substantial increase in training time and cost.
In addition to making pretraining more efficient, NVFP4 also redefines what’s possible. “Demonstrating that model quality can be preserved at 4-bit precision opens the door to a future where highly capable models can be trained from scratch by mid-sized enterprises or startups, not just hyperscalers,” said Narasimhan, adding that, over time, we can expect a broader range of organizations to develop advanced LLMs.
Beyond pre-training
Although the paper focuses on the benefits of NVFP4 during pretraining, its impact also extends to inference.
Narasimhan said, “Models trained on NVFP4 can not only deliver faster speeds and higher throughput, but also shorten the time required to achieve ROI for AI factories. Because these models are smaller and more efficient, they unlock new possibilities for serving complex, high-quality responses in real time, even in token-intensive, agentic applications, without increasing energy and compute costs.”
Narasimhan said he sees the future of model efficiency as not just about pushing precision lower, but about building better systems.
“There are many opportunities to expand low-precision research as well as to modify architectures to address the components that increasingly dominate compute in large-scale models,” he said. “These areas are rich with opportunity, especially as we move toward agentic systems that demand high throughput, low latency and adaptive reasoning. NVFP4 has proven that precision can be reduced without compromising quality, and it sets the stage for a new era of intelligent, efficient AI design.”