Cohere's new vision model runs on two GPUs, beats top-tier VLMs on visual tasks

The rise of Deep Research features and other AI-powered analysis has spawned more models and services that aim to simplify the process and read more of the documents businesses actually use.

Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that Deep Research features should also be optimized for enterprise use cases.

The company has released Command A Vision, a visual model that specifically targets enterprise use cases, built on the back of its Command A model. The company says the 112 billion parameter model can “unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis.”

“Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.

This means Command A Vision can read and analyze the most common types of images enterprises need: graphs, charts, diagrams, scanned documents and PDFs.

@cohere just dropped Command A Vision on @huggingface
Designed for enterprise multimodal use cases: interpreting product manuals, analyzing photos, asking about charts…
A 112B dense vision-language model with SOTA performance. Check out the benchmark metrics in… pic.twitter.com/ormfm5F8CF
– Jeff Boudier (@jeffboudier) July 31, 2025

Since it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities to read words on images and understands at least 23 languages. Unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for retrieval use cases for businesses, Cohere said.
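
The two-GPU claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight memory for a 112 billion parameter model at several numeric precisions. The precisions and the 80 GB per-GPU capacity are illustrative assumptions, not figures from Cohere, and the estimate ignores activations and the KV cache.

```python
# Rough weight-memory estimate for a 112B-parameter model.
# Assumptions (not from the article): the precisions listed and an
# 80 GB accelerator; activations and KV cache are ignored.

PARAMS = 112_000_000_000  # parameter count reported for Command A Vision

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}  # assumed precisions
GPU_MEMORY_GB = 80  # hypothetical per-GPU capacity


def weight_memory_gb(params: int, bytes_per_param: float) -> float:
    """Return the memory needed for weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def fits_on(num_gpus: int, precision: str) -> bool:
    """True if the weights alone fit in the combined memory of num_gpus."""
    needed = weight_memory_gb(PARAMS, BYTES_PER_PARAM[precision])
    return needed <= num_gpus * GPU_MEMORY_GB


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(PARAMS, BYTES_PER_PARAM[precision])
    print(f"{precision}: {gb:.0f} GB of weights, fits on two GPUs: {fits_on(2, precision)}")
```

Under these assumptions, the weights alone fit on two such GPUs only at 8-bit precision or lower. The article does not say which precision or hardware Cohere has in mind.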

How Cohere is architecting Command A

Cohere said it followed a Llava architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.

These tiles are passed into the Command A text tower, “a dense, 111B parameters textual LLM,” the company said. “In this manner, a single image consumes up to 3,328 tokens.”
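
To see what the 3,328-token ceiling means in practice, the sketch below budgets images against a context window. The 13 × 256 factorisation of 3,328 is simply one breakdown consistent with the quoted figure, and the context length and reserved text budget are hypothetical values; none of these come from the article.

```python
# Image token budgeting sketch.
MAX_TOKENS_PER_IMAGE = 3_328  # upper bound quoted by Cohere

# Assumed breakdown (not stated in the article): 13 tiles x 256 tokens.
ASSUMED_TOKENS_PER_TILE = 256
ASSUMED_MAX_TILES = 13


def image_tokens(num_tiles: int, tokens_per_tile: int = ASSUMED_TOKENS_PER_TILE) -> int:
    """Tokens consumed by an image split into num_tiles, capped at the quoted maximum."""
    if num_tiles < 1:
        raise ValueError("an image needs at least one tile")
    return min(num_tiles * tokens_per_tile, MAX_TOKENS_PER_IMAGE)


def max_images(context_tokens: int, reserved_text_tokens: int) -> int:
    """Worst-case number of images that fit beside reserved_text_tokens of text."""
    available = context_tokens - reserved_text_tokens
    return max(available, 0) // MAX_TOKENS_PER_IMAGE


# Hypothetical 128,000-token window with 8,000 tokens reserved for text.
print(image_tokens(ASSUMED_MAX_TILES))
print(max_images(128_000, 8_000))
```

Budgeting against the worst case keeps a multi-image request from overflowing the window even when every image uses its full token allowance.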

Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).

“This approach enables the mapping of image encoder features to the language model embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of multimodal tasks.”
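
The three stages can be summarised as a schedule of which components receive gradient updates. This is a minimal sketch: the SFT row follows Cohere's statement, while freezing the vision encoder and language model during alignment is an assumption inferred from the “in contrast” wording, and the RLHF row is left unspecified because the article gives no detail. The names `Stage` and `SCHEDULE` are illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

COMPONENTS = frozenset({"vision_encoder", "vision_adapter", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    # Components updated in this stage; None means the article does not say.
    trainable: Optional[FrozenSet[str]]

    def frozen(self) -> Optional[FrozenSet[str]]:
        """Components held fixed, or None when the trainable set is unknown."""
        if self.trainable is None:
            return None
        return COMPONENTS - self.trainable


SCHEDULE = (
    # Assumption: only the adapter learns while image features are mapped
    # into the language model's embedding space.
    Stage("vision-language alignment", frozenset({"vision_adapter"})),
    # Stated by Cohere: all three components train simultaneously.
    Stage("supervised fine-tuning", COMPONENTS),
    # Not described in the article.
    Stage("RLHF", None),
)

for stage in SCHEDULE:
    if stage.trainable is None:
        print(f"{stage.name}: trainable components unspecified")
    else:
        print(f"{stage.name}: trains {sorted(stage.trainable)}, freezes {sorted(stage.frozen())}")
```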

Visualizing enterprise AI

Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.

Cohere pitted Command A Vision against OpenAI's GPT 4.1, Meta's Llama 4 Maverick, and Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it had tested the model against Mistral's OCR-focused API, Mistral OCR.

This enables agents to securely see inside your organization's visual data, unlocking the automation of tedious tasks involving slides, diagrams, PDFs and photos. pic.twitter.com/ihznuwekrk
– Cohere (@cohere) July 31, 2025

Command A Vision outscored the other models on tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision had an average score of 83.1%, compared with 80.5% for Llama 4 Maverick and 78.3% for Mistral Medium 3; it also finished ahead of GPT 4.1.
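
The reported averages translate into margins in percentage points. The snippet below computes them using only the averages quoted above; GPT 4.1 is left out because its average is not given here. `Decimal` is used to avoid floating-point rounding noise.

```python
from decimal import Decimal

# Average scores as reported in the article (percent).
COMMAND_A_VISION = Decimal("83.1")
OTHERS = {
    "Llama 4 Maverick": Decimal("80.5"),
    "Mistral Medium 3": Decimal("78.3"),
}


def margins(reference: Decimal, others: dict) -> dict:
    """Lead of the reference score over each other score, in percentage points."""
    return {name: reference - score for name, score in others.items()}


for name, lead in margins(COMMAND_A_VISION, OTHERS).items():
    print(f"Command A Vision leads {name} by {lead} points")
```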

These days, most large language models (LLMs) are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally rely on more graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.

With Deep Research on the rise, the importance of models that can read, analyze and even download unstructured data has grown.

Cohere also said it is offering Command A Vision as an open-weights model, hoping that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.

Its accuracy at extracting handwritten notes from an image impressed me!
– Adam Sardo (@Sardo_Dam) July 31, 2025

Finally, an AI that won't judge my terrible doodles.
– Martha Wazir (@Mart Wesner) August 1, 2025
