The growth of Deep Research features and other AI-powered analysis has given rise to more models and services that aim to simplify that process and read more of the documents businesses actually use.
Canadian AI company Cohere is banking on its models, including a newly released visual model, to make the case that Deep Research features should also be optimized for enterprise use cases.
The company has released Command A Vision, a visual model that specifically targets enterprise use cases, built on the back of its Command A model. The company says the 112 billion parameter model can “unlock valuable insights from visual data, and make highly accurate, data-driven decisions through document optical character recognition (OCR) and image analysis.”
“Whether it's interpreting product manuals with complex diagrams or analyzing photographs of real-world scenes, Command A Vision excels at tackling the most demanding enterprise vision challenges,” the company said in a blog post.
This means Command A Vision can read and analyze the types of images enterprises most commonly need: graphs, charts, diagrams, scanned documents and PDFs.
Since it is built on Command A's architecture, Command A Vision requires two or fewer GPUs, just like the text model. The vision model also retains Command A's text capabilities to read words on images and understands at least 23 languages. Cohere said that, unlike other models, Command A Vision reduces the total cost of ownership for enterprises and is fully optimized for retrieval use cases for businesses.
How Cohere is architecting Command A
Cohere said it followed a Llava architecture to build its Command A models, including the visual model. This architecture turns visual features into soft vision tokens, which can be divided into different tiles.
These tiles are passed into the Command A text tower, “a dense, 111B parameters textual LLM,” the company said. “In this manner, a single image consumes up to 3,328 tokens.”
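To make the 3,328-token ceiling concrete, here is a minimal Python sketch of how a tiled image might be costed in tokens. The article gives only the ceiling. The tile size, tokens per tile, tile cap and extra thumbnail tile below are all assumptions, chosen because 13 tiles of 256 tokens each come to exactly 3,328. The function name `estimate_image_tokens` is hypothetical and is not part of any Cohere API.

```python
import math

# All four constants are assumptions for illustration, not figures from the article.
TILE_SIZE = 512        # hypothetical tile edge length, in pixels
TOKENS_PER_TILE = 256  # hypothetical number of soft vision tokens per tile
MAX_TILES = 12         # hypothetical cap on content tiles per image
THUMBNAIL_TILES = 1    # hypothetical extra low-resolution overview tile


def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate the token cost of one image under the assumed tiling scheme."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    cols = math.ceil(width / TILE_SIZE)
    rows = math.ceil(height / TILE_SIZE)
    # Large images are capped at MAX_TILES content tiles.
    tiles = min(cols * rows, MAX_TILES)
    return (tiles + THUMBNAIL_TILES) * TOKENS_PER_TILE


if __name__ == "__main__":
    for size in [(512, 512), (1024, 768), (4096, 4096)]:
        print(size, estimate_image_tokens(*size))
```

Under these assumptions, any sufficiently large scan or chart hits the cap, so the per-image cost in the context window is bounded regardless of resolution.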
Cohere said it trained the visual model in three stages: vision-language alignment, supervised fine-tuning (SFT) and post-training reinforcement learning with human feedback (RLHF).
“This approach enables the mapping of image encoder features to the language model embedding space,” the company said. “In contrast, during the SFT stage, we simultaneously trained the vision encoder, the vision adapter and the language model on a diverse set of multimodal tasks.”
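The staged recipe can be summarized as a small schedule of which components are updated at each stage. The sketch below is illustrative only. The SFT entry follows the quote above. The alignment entry, which trains only the vision adapter, is an assumption inferred from the "in contrast" phrasing. The article does not say which components are updated during RLHF, so that stage is left unspecified. All names here are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# The three components named in the company's quote.
COMPONENTS = frozenset({"vision_encoder", "vision_adapter", "language_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    # Components updated in this stage; None means the article does not say.
    trainable: Optional[FrozenSet[str]]


STAGES = [
    # Assumption: only the adapter is trained while mapping image features
    # into the language model's embedding space.
    Stage("vision_language_alignment", frozenset({"vision_adapter"})),
    # From the quote: all three components are trained simultaneously.
    Stage("supervised_fine_tuning", COMPONENTS),
    # Not specified in the article.
    Stage("rlhf", None),
]


def frozen_components(stage: Stage) -> Optional[FrozenSet[str]]:
    """Return the components held fixed in a stage, or None if unspecified."""
    if stage.trainable is None:
        return None
    return COMPONENTS - stage.trainable


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, frozen_components(stage))
```

Writing the schedule as data rather than prose makes the contrast between the first two stages explicit: under the stated assumption, alignment leaves two components frozen, while SFT leaves none.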
Visualizing enterprise AI
Benchmark tests showed Command A Vision outperforming other models with similar visual capabilities.
Cohere pitted Command A Vision against OpenAI's GPT 4.1, Meta's Llama 4 Maverick, Mistral's Pixtral Large and Mistral Medium 3 in nine benchmark tests. The company did not mention whether it had tested the model against Mistral's OCR-focused API, Mistral OCR.
Command A Vision outscored the other models on tests such as ChartQA, OCRBench, AI2D and TextVQA. Overall, Command A Vision averaged 83.1%, ahead of GPT 4.1, Llama 4 Maverick at 80.5% and Mistral Medium 3 at 78.3%.
Most large language models (LLMs) these days are multimodal, meaning they can generate or understand visual media such as photos or videos. However, enterprises generally use more graphical documents such as charts and PDFs, so extracting information from these unstructured data sources often proves difficult.
With Deep Research on the rise, the importance of bringing in models capable of reading, analyzing and even downloading unstructured data has grown.
Cohere also said it is offering Command A Vision as an open-weights model, in hopes that enterprises looking to move away from closed or proprietary models will start using its products. So far, there is some interest from developers.