Huawei’s new open source technique shrinks LLMs to run on less powerful, less expensive hardware

by SkillAiNest


Huawei’s Computing Systems Lab in Zurich has introduced a new open source quantization method for large language models (LLMs) that aims to reduce memory requirements without sacrificing output quality.

The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has made the code available on GitHub and Hugging Face under the permissive, enterprise-friendly Apache 2.0 license, allowing organizations to take it, modify it, and deploy it commercially, all for free.

Across models of different sizes, SINQ reduces memory use by 60-70%, depending on architecture and bit width.

This enables models that previously required more than 60 GB of memory to run on roughly 20 GB setups, a key enabler for running large models on a single high-end GPU or even on multi-GPU consumer-grade hardware.
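As a rough sanity check on those figures, the arithmetic below estimates weight-only memory for a hypothetical model; the 32-billion-parameter count and the bit widths are illustrative assumptions, not numbers from the SINQ paper.

```python
# Rough weight-only memory arithmetic for a hypothetical ~32B-parameter model,
# ignoring activations and runtime overhead. Parameter count and bit widths
# are illustrative assumptions, not figures from the SINQ paper.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

n_params = 32e9
print(f"fp16 : {weight_memory_gb(n_params, 16):.0f} GB")  # ~64 GB
print(f"4-bit: {weight_memory_gb(n_params, 4):.0f} GB")   # ~16 GB, plus scale factors and overhead
```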

This makes it possible to run models that previously needed high-end enterprise GPUs, such as Nvidia’s A100 or H100, on significantly more affordable hardware, such as a single Nvidia GeForce RTX 4090 (approximately $1,600), instead of enterprise hardware like the A100 80GB (around $19,000) or H100 units that can exceed $30,000.

For teams using cloud infrastructure, the savings are similar. A100-based instances often cost $3-4.50 per hour, while 24 GB GPUs such as the RTX 4090 are available on many platforms for roughly $1-1.50 per hour.

Over time, especially for extended inference workloads, that difference can add up to thousands of dollars in cost reduction, while also unlocking LLM deployment on small clusters, local workstations, or consumer-grade setups that were previously constrained by memory.
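For a sense of scale, here is an illustrative calculation using representative rates within the ranges quoted above; the 1,000-hour workload is an assumed figure.

```python
# Illustrative cost comparison over an extended workload. The hourly rates fall
# within the ranges quoted above; the 1,000-hour figure is an assumption.
hours = 1_000
a100_rate, rtx4090_rate = 3.50, 1.25  # $/hour, representative values
savings = hours * (a100_rate - rtx4090_rate)
print(f"Savings over {hours} GPU-hours: ${savings:,.0f}")  # ~$2,250
```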

Tackling the memory challenge of LLMs

Running a large model often requires a compromise between performance and size.

In practice, neural networks use floating point numbers to represent both weights and activations. A floating point number can express a wide range of values (very small, very large, with fractional parts).

This flexibility is helpful because weights and activations can vary dramatically during training and inference, and floating point lets the model represent them precisely. (For example, a weight can be 0.0023 or 123.45, and floating point can capture both with decent precision.)

Quantization, a method that stores a model’s weights with fewer bits, reduces memory use but usually comes with trade-offs in model quality, especially at 4-bit precision and below.

When you convert these floating point values into lower-precision formats (such as 8-bit integers), you approximate them.

This means you store and compute with fewer bits, which is faster and more memory-efficient, but you risk losing fidelity (i.e., introducing small errors).

The trick is to do the conversion carefully so that the model’s behavior remains almost the same, even though internally it is working with approximations of those weights and activations.
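A minimal sketch of the plain round-to-nearest idea described above, quantizing one weight matrix to 8-bit integers with a single per-matrix scale; this illustrates generic quantization, not SINQ itself.

```python
# Minimal round-to-nearest (RTN) quantization of one weight matrix to int8,
# using a single per-matrix scale factor. Illustrative only, not SINQ.
import numpy as np

def quantize_rtn_int8(w: np.ndarray):
    """Map float weights to int8 so the largest value lands on +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)
q, scale = quantize_rtn_int8(w)
w_hat = dequantize(q, scale)

# The int8 copy uses 4x less memory than float32, at the cost of a small error.
print("max abs error:", np.abs(w - w_hat).max())
```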

SINQ addresses these pain points with a plug-and-play solution that delivers strong performance in low-precision settings without requiring calibration data or inter-layer dependencies.

How SINQ works

The SINQ approach introduces two key innovations:

  1. Dual-axis scaling: Instead of using a single scale factor to quantize a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effect of outliers and allows quantization error to be distributed across the matrix more flexibly.

  2. Sinkhorn-Knopp-style normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of a matrix’s rows and columns. This helps minimize what the authors call “matrix imbalance,” a new proxy metric they show to be more effective than alternatives such as kurtosis for improving quantization quality.

The combination of these two features allows SINQ to outperform calibration-free techniques such as round-to-nearest (RTN), HQQ, and Hadamard-based quantization on multiple benchmarks.
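To make the two ideas concrete, the sketch below combines per-row and per-column scale vectors with a Sinkhorn-style alternating normalization of row and column standard deviations. It is a simplified illustration under assumed choices (bit width, iteration count, standard deviation as the balancing statistic), not the authors’ implementation.

```python
# A minimal sketch (not the official implementation) of the two ideas:
# (1) separate per-row and per-column scale vectors, obtained here via
# (2) a Sinkhorn-style alternating normalization of row/column std devs.
# Bit width, iteration count, and the use of std are illustrative choices.
import numpy as np

def sinkhorn_like_balance(w: np.ndarray, n_iter: int = 10):
    """Alternately rescale rows and columns so their std devs approach 1."""
    row_scale = np.ones((w.shape[0], 1), dtype=np.float32)
    col_scale = np.ones((1, w.shape[1]), dtype=np.float32)
    balanced = w.copy()
    for _ in range(n_iter):
        r = balanced.std(axis=1, keepdims=True) + 1e-8
        balanced /= r
        row_scale *= r
        c = balanced.std(axis=0, keepdims=True) + 1e-8
        balanced /= c
        col_scale *= c
    # Invariant: balanced * row_scale * col_scale == w (elementwise broadcast).
    return balanced, row_scale, col_scale

def quantize_dual_axis(w: np.ndarray, bits: int = 4):
    balanced, row_scale, col_scale = sinkhorn_like_balance(w)
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(balanced).max() / qmax
    q = np.clip(np.round(balanced / step), -qmax, qmax).astype(np.int8)
    return q, step, row_scale, col_scale

def dequantize_dual_axis(q, step, row_scale, col_scale):
    # Undo the quantization step, then re-apply the row and column scales.
    return q.astype(np.float32) * step * row_scale * col_scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(128, 128)).astype(np.float32)
w[3, 7] = 1.5  # an outlier that would dominate a single per-matrix scale
q, step, rs, cs = quantize_dual_axis(w, bits=4)
print("max abs error:", np.abs(w - dequantize_dual_axis(q, step, rs, cs)).max())
```

In the paper’s formulation the scales are chosen to minimize the matrix-imbalance proxy; the std-based alternation above is only meant to convey the shape of the computation and why balancing rows and columns blunts the influence of outliers.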

Performance and compatibility

SINQ has been evaluated across a range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.

On benchmarks such as WikiText2 and C4, SINQ consistently reduces perplexity and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.

It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods such as AWQ, yielding a variant called A-SINQ. In calibrated settings, A-SINQ further narrows the gap with full-precision models.

In terms of runtime, SINQ quantizes models roughly twice as fast as HQQ and up to 30 times faster than AWQ. This makes it well suited to both research and production environments where quantization time is a practical constraint.

Open source and easy to use

Huawei has released SINQ as an open source project under the permissive, enterprise-friendly Apache 2.0 license, with implementation instructions and reproducibility tools available on GitHub.

The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools to save and reload quantized weights. Default settings offer a balance between memory savings and accuracy, and users can customize parameters such as bit width, tiling strategy, and group size to suit their needs.
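As a purely hypothetical illustration of what such a workflow could look like, the snippet below loads a Hugging Face model and marks the quantization calls as placeholders. The sinq module, function names, and parameters shown are assumptions; the actual API in the repository may differ.

```python
# Hypothetical sketch of "a few lines of code". The sinq module and the
# quantize_model / save_quantized names are illustrative placeholders;
# consult the SINQ repository for the real API.
from transformers import AutoModelForCausalLM, AutoTokenizer
# import sinq  # placeholder for the library published in the SINQ repo

model_id = "Qwen/Qwen3-8B"  # any Hugging Face causal LM
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical call: quantize in place with a chosen bit width and group size.
# model = sinq.quantize_model(model, bits=4, group_size=64)

# Hypothetical call: persist the quantized weights for later reloading.
# sinq.save_quantized(model, "qwen3-8b-sinq-4bit")
```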

The authors also provide evaluation integration with the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.

Looking ahead

With growing demand for running large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the barrier to entry for LLM deployment, enabling developers and researchers to shrink models effectively without major trade-offs in quality or compatibility.

More updates, including Hugging Face Transformers integration and pre-quantized models, are planned.
