Chinese e-commerce giant Alibaba has made waves in the global tech and business communities with its family of "Qwen" generative AI large language models (LLMs), beginning in April 2023 with the launch of the Tongyi Qianwen chatbot.
Why?
Well, not only are its models powerful, scoring highly on third-party benchmarks for math, science, reasoning, and writing tasks; for the most part, they have also been released under permissive open source licensing terms, allowing organizations and enterprises to use them for a wide range of purposes, including commercial ones. Think of them as an alternative to DeepSeek.
This week, Alibaba's Qwen team released the latest updates to its Qwen family, and they are already drawing attention from AI power users in the West for their high performance. In one case, the new models even edge out the Kimi K2 model from rival Chinese AI startup Moonshot AI, released in mid-July 2025.
The new Qwen3-235B-A22B-Instruct-2507 model, released on the AI code-sharing community Hugging Face along with a "floating point 8" (FP8) version, which we will cover in more depth below, improves on the original Qwen 3 in reasoning tasks, factual accuracy, and multilingual understanding. It also outperforms the "non-thinking" version of Claude Opus 4.
The new Qwen3 model also delivers better coding results, closer alignment with user preferences, and improved long-context handling. But that's not all…
Read on for what else the model offers enterprise users and technical decision-makers.
The FP8 version lets enterprises run Qwen 3 with far less memory and compute
The "FP8" version uses 8-bit floating point numbers, a format that compresses the model's numerical operations to use less memory and processing power, without noticeably affecting its performance.
In practice, this means organizations can run a model with Qwen3's capabilities on smaller, less expensive hardware, or more efficiently in the cloud. The result is faster response times, lower energy costs, and the ability to scale deployments without needing massive infrastructure.
This makes the FP8 model especially attractive for production environments with tight latency or cost constraints. Teams can scale Qwen3's capabilities to single-node GPU instances or local development machines, avoiding the need for massive multi-GPU clusters. It also lowers the barrier to private fine-tuning and on-premises deployments, where infrastructure resources are finite and total cost of ownership matters.
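As a rough illustration of what that deployment looks like, here is a minimal sketch using vLLM's offline API; the checkpoint name, tensor-parallel size, and context cap are assumptions to adapt to your own hardware, not an official recipe:

```python
# Minimal sketch, assuming the FP8 checkpoint on Hugging Face and a
# single node with four 80 GB GPUs; adjust to your own hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",  # assumed repo id
    tensor_parallel_size=4,   # TP-4, matching the table below
    max_model_len=32768,      # cap context to leave headroom for KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the key FP8 deployment tradeoffs."], params)
print(outputs[0].outputs[0].text)
```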
Although the Qwen team did not release official figures, comparisons with similar FP8 quantized deployments suggest the efficiency savings are substantial. Here is a practical illustration (updated and corrected on 07/23/2025 at 4:04 pm ET: this piece originally included a chart based on an inaccurate calculation. I apologize for the errors, and thank the readers who contacted me about them):
Metric | BF16 / BF16-equivalent build | FP8 quantized build
---|---|---
GPU memory use | ≈640 GB total (8 × H100-80 GB, TP-8) | ≈320 GB total (4 × H100-80 GB, TP-4), the smallest footprint that fits; community runs have gone lower, e.g. Ollama with offloading on 2 × H100 at ~143 GB
Single-query inference speed | ~74 tokens/s (batch = 1, context = 2K, 8 × H20-96 GB, TP-8) | ~72 tokens/s (same settings, 4 × H20-96 GB, TP-4)
Power / energy | ~4–4.5 kW at full node load on eight H100s (550–600 W per card, plus host) | FP8 needs half the cards and moves half the data; NVIDIA Hopper FP8 case studies report ≈35–40% lower TCO and energy use at comparable throughput
GPUs required (practical) | 8 × H100-80 GB (TP-8), or 8 × A100-80 GB for parity | 4 × H100-80 GB (TP-4); 2 × H100 works with aggressive offloading, at the cost of latency
· Disk footprint for the checkpoints: BF16 weights are ~500 GB; the FP8 checkpoint is just over 200 GB, so the absolute GPU-memory savings come mostly from needing fewer cards, not just from smaller weights.
· Speed figures come from Qwen3's official SGLang benchmarks (batch 1). Throughput scales almost linearly with batch size: Baseten measured ~45 tokens/s per user at batch 32, and ~1.4K tokens/s total, on the same four-GPU FP8 setup.
· No vendor publishes precise wall-power figures for Qwen, so we estimate from H100 board specs and NVIDIA's Hopper FP8 energy-savings data.
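The headline memory numbers follow from simple byte counting. A back-of-envelope sketch, counting weights only (KV cache, activations, and framework overhead come on top of this):

```python
# Rough weights-only arithmetic behind the table above; real deployments
# add KV cache, activations, and runtime overhead on top of these figures.
total_params = 235e9               # Qwen3-235B-A22B total parameter count

bf16_gb = total_params * 2 / 1e9   # BF16 stores 2 bytes per parameter
fp8_gb = total_params * 1 / 1e9    # FP8 stores 1 byte (plus small scale tables)

print(f"BF16 weights: ~{bf16_gb:.0f} GB")  # ~470 GB -> 8 x 80 GB cards in practice
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")   # ~235 GB -> 4 x 80 GB becomes plausible
```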
No more "hybrid reasoning" … instead, Qwen will release separate reasoning and instruct models
Perhaps most interestingly, the Qwen team announced it will no longer pursue the "hybrid" reasoning approach it introduced with Qwen 3 back in April, an approach seemingly inspired by the sovereign AI collective Nous Research.
Hybrid reasoning let users toggle on a "reasoning" mode, prompting the AI model to check its own work and produce "chains of thought" (CoT) before responding.
In a way, it was designed to mimic the reasoning capabilities of powerful proprietary models such as OpenAI's "o" series (o1, o3, o4-mini, o4-mini-high), which also produce chains of thought.
However, unlike those rival models, which always engage in such "reasoning" for every prompt, Qwen 3's reasoning mode could be manually toggled on or off by the user, via a "Thinking Mode" button on the Qwen Chat website, or by typing "/think" before a prompt when running the model locally or through an API.
The idea was to give users control: engage the slower, more token-intensive thinking mode for harder prompts and tasks, and use the non-thinking mode for simpler ones. But it put the onus on the user to decide. While flexible, the design also introduced complexity and, in some cases, inconsistent behavior.
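For readers who ran the original hybrid checkpoints, the toggle surfaced in code roughly like this; a sketch assuming the Qwen/Qwen3-235B-A22B repo and the `enable_thinking` flag its chat template documented:

```python
# Sketch of the hybrid-era toggle on the original Qwen3 checkpoints;
# the new 2507 Instruct model drops this and is non-thinking only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")
messages = [{"role": "user", "content": "How many primes are below 100?"}]

prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # False skips the chain-of-thought block entirely
)
print(prompt)
```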
Now, as the Qwen team wrote on X:
"After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we'll train Instruct and Thinking models separately so we can get the best quality possible."
With the 2507 update (an instruct, or non-reasoning, model only, for now), Alibaba is no longer straddling both approaches in a single model. Instead, separate model variants will be trained for instruction and reasoning tasks, respectively.
The result is a model that adheres more closely to user instructions, generates more predictable responses, and, as the benchmark data shows, improves significantly across multiple evaluation domains.
Performance benchmarks and use cases
Compared to its predecessor, the Qwen3-235B-A22B-Instruct-2507 model delivers measurable improvements:
- MMLU-Pro scores rise from 75.2 to 83.0, a notable gain in general knowledge performance.
- GPQA and SuperGPQA benchmarks improve by 15–20 percentage points, reflecting stronger factual accuracy.
- Reasoning tasks such as AIME25 and ARC-AGI show more than double the previous performance.
- Code generation improves, with LiveCodeBench scores rising from 32.9 to 51.8.
- Multilingual support expands, with better coverage of long-tail languages and better alignment across dialects.

The model retains a mixture-of-experts (MoE) architecture, activating 8 of its 128 experts during inference; of its 235 billion total parameters, roughly 22 billion are active at any one time.
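To make the "active parameters" idea concrete, here is a toy top-k router in plain NumPy; the hidden size is shrunk for readability, and none of this is Qwen's actual routing code, just a sketch of the general technique:

```python
# Toy MoE routing: a router scores all 128 experts per token and only
# the top 8 run, which is why ~22B of 235B parameters are live at once.
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 16   # hidden size shrunk for the demo
rng = np.random.default_rng(0)
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS))

def route(token_vec: np.ndarray):
    scores = token_vec @ router_weights   # one logit per expert
    top = np.argsort(scores)[-TOP_K:]     # indices of the 8 winners
    gates = np.exp(scores[top])
    return top, gates / gates.sum()       # softmax over the winners only

experts, gates = route(rng.standard_normal(HIDDEN))
print("active experts:", sorted(experts.tolist()))
print("gate weights:  ", np.round(gates, 3))
```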
As mentioned, the FP8 version introduces fine-grained quantization for better inference speed and reduced memory usage.
Enterprise-ready by design
Unlike many open source LLMs, which are often released under research-only licenses or require API access for commercial use, Qwen3 is aimed squarely at enterprise deployment.
It ships under a permissive Apache 2.0 license, meaning businesses can use it freely for commercial applications. They can also:
- Deploy models locally or via OpenAI-compatible APIs using vLLM and SGLang (see the sketch after this list);
- Fine-tune models privately using LoRA or QLoRA without exposing proprietary data;
- Log and inspect all prompts and outputs on-premises for compliance and auditing;
- Scale from prototype to production using dense variants (from 0.6B to 32B) or MoE checkpoints.
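As a sketch of the first bullet above: once vLLM or SGLang is serving the model, the stock OpenAI client can talk to it. The endpoint URL, API key, and model name here are placeholders for your own deployment:

```python
# Minimal sketch: point the standard OpenAI client at a local
# vLLM/SGLang endpoint (URL, key, and model id are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[{"role": "user", "content": "Draft a data-retention policy outline."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```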
Alibaba's team also introduced Qwen-Agent, a lightweight framework that abstracts away tool invocation logic for users building agentic systems.
Benchmarks such as TAU-Retail and BFCL-v3 suggest the instruct model can competently execute multi-step decision-making tasks, typically the domain of purpose-built agents.
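A rough sketch of what building on Qwen-Agent looks like, based on the patterns in its public examples; treat the config keys and tool names as assumptions to verify against the project's documentation:

```python
# Hypothetical Qwen-Agent usage; config keys and tool names should be
# checked against the Qwen-Agent docs before relying on them.
from qwen_agent.agents import Assistant

llm_cfg = {
    "model": "Qwen3-235B-A22B-Instruct-2507",
    "model_server": "http://localhost:8000/v1",  # OpenAI-compatible endpoint
    "api_key": "EMPTY",
}
bot = Assistant(llm=llm_cfg, function_list=["code_interpreter"])

messages = [{"role": "user", "content": "Plot monthly revenue from revenue.csv"}]
final = None
for chunk in bot.run(messages=messages):  # streams tool calls and text
    final = chunk
print(final)
```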
Community and industry reactions
The release has already been well received by AI power users.
Paul Couvert, AI educator and founder of private LLM chatbot host Blue Shell AI, posted a comparison chart on X showing Qwen3-235B-A22B-Instruct-2507 outperforming Claude Opus 4 and Kimi K2 on benchmarks such as GPQA, AIME25, and Arena-Hard v2, calling it "even more powerful than Kimi K2… and even better than Claude Opus 4."
AI influencer NIK (@ns123abc) commented on its rapid impact: "Qwen-3-235B made Kimi K2 irrelevant after just a week, despite being a quarter of the size, and you're laughing."
Meanwhile, Jeff Boudier, head of product at Hugging Face, highlighted the deployment benefits: "Qwen silently released a massive improvement to Qwen3… it edges out the best open (Kimi K2, a 4x larger model) and closed (Claude Opus 4) LLMs on benchmarks."
He praised the availability of an FP8 checkpoint for faster inference, 1-click deployment on Azure ML, and support for local use via MLX on Mac or INT4 builds from Intel.
The overall tone from developers has been enthusiastic, as the model's combination of performance, licensing, and deployability appeals to hobbyists and professionals alike.
What's next for the Qwen team?
Alibaba is already laying the groundwork for future updates. A separate reasoning model is in the pipeline, and the Qwen roadmap points toward increasingly agentic systems capable of long-horizon task planning.
Multimodal support, as seen in the Qwen2.5-Omni and Qwen-VL models, is also expected to expand.
And already, rumors and rumblings have begun as Qwen team members tease yet another update to their model family, with URL strings on their web properties revealing a new Qwen3-Coder-480B-A35B-Instruct model, likely a 480-billion-parameter MoE with a 1 million token context window.
Ultimately, Qwen3-235B-A22B-Instruct-2507 signals not just another leap in benchmark performance, but a maturation of open models as viable alternatives to proprietary systems.
Its deployment flexibility, strong general performance, and enterprise-friendly licensing give the model a unique edge in a crowded field.
For teams looking to integrate a state-of-the-art instruction-following model into their AI stack without vendor lock-in or usage-based fees, Qwen3 is a serious contender.