
MiroMind AI and researchers from several Chinese universities have released OpenMMReasoner, a new training framework that improves the multimodal reasoning capabilities of language models.
The framework uses a two-step process. It first refines the base model with the curated dataset in a supervised fine-tuning (SFT) step. Then, a reinforcement learning (RL) stage guides the model to reason more efficiently in tasks that involve both text and visual data.
Experiments show that models trained with OpenMMReasoner outperform other well-known visual reasoning models, despite being trained on a smaller, high-quality dataset. The framework and all its assets, including the trained 7B model, are fully open source, providing a reliable foundation for building applications that require traceability and robustness.
According to Qichen Zhang, co-author of the research paper outlining the new methodology, OpenMMReasoner offers significant advantages for businesses looking beyond large, closed systems. "A small open-source reasoning model has practical advantages: enterprises can deploy it locally, reduce latency, lower the token costs associated with long chains of thought, retain full control over their data and (it) can be fine-tuned to adapt to their specific downstream tasks," he told VentureBeat.
The challenge of transparent multimodal reasoning
Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the reasoning capabilities of large language models (LLMs). RLVR trains LLMs to produce chain-of-thought (CoT) tokens (which mimic the step-by-step reasoning process humans use) before generating the final answer. This improves the model's ability to solve complex reasoning tasks such as math and coding.
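To make the idea of a verifiable reward concrete, here is a minimal sketch: for tasks like math, the final answer can be checked programmatically against a reference, so no human or learned judge is needed. The function name and exact-match check are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of a verifiable reward (illustrative, not the paper's code):
# for math-style tasks, correctness can be checked with a simple programmatic
# comparison against a known reference answer.

def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if the model's final answer matches the reference."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

# A trajectory that ends in the right answer gets reinforced during RL.
print(verifiable_reward("408", "408"))  # 1.0
print(verifiable_reward("410", "408"))  # 0.0
```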
Encouraged by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the benefits can extend beyond text to visual comprehension and problem solving across modalities.
However, lack of transparency in the training pipeline has been a major hurdle. Many studies on multimodal reasoning do not provide detailed information about their data curation and training processes, making it difficult to reproduce their results or understand what makes these models work.
“This lack of openness limits reproducibility and obscures a deeper understanding of how inferential LMMs are actually constructed and how their training dynamics evolve,” the researchers note.
The OpenMMReasoner recipe
OpenMMReasoner addresses this gap with a fully transparent and scalable training recipe built on open-source LMMs. The researchers found that validating data quality matters as much as scaling data diversity. While drawing on diverse data sources is important, increasing the diversity of correct answers to the same question proved to be the essential lever for improvement.
The first step of the recipe is a three-stage supervised fine-tuning (SFT) pipeline. It starts with data sourcing, where the team collected about 103,000 raw question-answer pairs from public datasets covering common visual question answering and reasoning tasks. Next came a data annotation phase, using a strong model (Qwen3-VL-235B-Instruct) to generate new, high-quality reasoning traces for the selected questions. (This data is then used to train the smaller model.)
To increase answer diversity, the team created multiple verified reasoning traces for each question, growing the dataset to 583,000 samples. Finally, they implemented a "domain mixing" step, adding data from mathematical reasoning domains to further generalize the model's capabilities, resulting in a final SFT dataset of 874,000 instances.
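A hypothetical sketch of this curation flow is below. Names such as `generate_reasoning` and the traces-per-question count are illustrative assumptions; the actual pipeline ships with the project's open-source release.

```python
# Hypothetical sketch of the three-stage SFT data curation described above:
# source raw QA pairs, annotate with a strong teacher model, keep only
# verified reasoning traces, then mix in math-domain data. All names and
# parameters here are illustrative, not the paper's exact setup.

def build_sft_dataset(raw_qa_pairs, teacher_model, math_corpus,
                      traces_per_question=4):
    dataset = []
    for question, gold_answer in raw_qa_pairs:        # ~103K sourced pairs
        for _ in range(traces_per_question):          # diversify answers
            trace = teacher_model.generate_reasoning(question)
            if trace.final_answer == gold_answer:     # keep verified traces only
                dataset.append({"question": question,
                                "reasoning": trace.text,
                                "answer": gold_answer})
    # Domain mixing: blend in mathematical reasoning data for generalization.
    dataset.extend(math_corpus)
    return dataset            # ~874K instances after mixing, per the paper
```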
The second step is an RL recipe that uses a smaller, 74,000-sample dataset drawn from domains such as science, math and puzzles. The model is trained with a composite reward function that considers both the accuracy of the final answer and the consistency of the output format. To improve efficiency, the process includes an "overthinking" penalty that discourages the model from generating excessively long answers (a common problem with RL-trained reasoning models, which can mistakenly learn to produce overly long reasoning sequences, driving up cost and latency).
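A minimal sketch of such a composite reward is below, assuming the final answer is wrapped in a `\boxed{...}` marker. The weights, the token budget and the crude whitespace tokenization are assumptions for illustration, not the paper's exact values.

```python
import re

# Minimal sketch of a composite RL reward of the kind described above:
# answer accuracy plus a format bonus, minus an "overthinking" penalty.
# Weights, budget and the \boxed{} convention are illustrative assumptions.

def composite_reward(response: str, gold_answer: str,
                     token_budget: int = 2048) -> float:
    match = re.search(r"\\boxed\{(.+?)\}", response)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == gold_answer

    reward = (1.0 if correct else 0.0) + (0.1 if format_ok else 0.0)

    # Overthinking penalty: dock reward for responses that blow past the
    # reasoning budget (crude whitespace token count for illustration).
    n_tokens = len(response.split())
    if n_tokens > token_budget:
        reward -= 0.1 * (n_tokens - token_budget) / token_budget
    return reward
```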
This recipe can provide a blueprint for enterprises training their own models. "For companies with limited domain-specific data, one possible strategy is to first increase the answer diversity of their existing dataset, then use domain mixing to integrate this domain data into a general reasoning recipe like ours," Zhang explained. "This allows the model to adapt to industry-specific tasks while retaining strong general-purpose reasoning skills, without the need for millions of samples."
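As a rough illustration of that mixing strategy, the sketch below blends a small proprietary domain set into a larger general reasoning corpus; the 20% domain fraction is an assumption for illustration, not a recommendation from the paper.

```python
import random

# Rough sketch of the domain-mixing strategy Zhang describes: blend a small
# domain-specific dataset into a general reasoning corpus. The 20% domain
# fraction and sampling scheme are illustrative assumptions.

def mix_domains(domain_data: list, general_data: list,
                domain_fraction: float = 0.2, seed: int = 0) -> list:
    rng = random.Random(seed)
    # How many general samples are needed to hit the target mixing ratio.
    n_general = int(len(domain_data) * (1 - domain_fraction) / domain_fraction)
    mixed = domain_data + rng.sample(general_data,
                                     min(n_general, len(general_data)))
    rng.shuffle(mixed)
    return mixed
```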
A more efficient reasoning model
According to Zhang, this reasoning-first process fundamentally changes the reliability of the model's outputs. "Traditional models often 'jump' directly to an answer, meaning they search only a narrow part of the reasoning space," he said. "In contrast, a logic-first approach forces the model to implicitly evaluate multiple intermediate steps … (allowing it) to traverse more profound paths and arrive at answers with far greater internal consistency."
The researchers applied the OpenMMReasoner recipe to the open-source Qwen2.5-VL-7B-Instruct vision-language model. The result is a highly capable LMM that consistently outperforms state-of-the-art methods such as Open Vision Reasoner (OVR) across a wide range of multimodal reasoning benchmarks. The SFT stage alone creates a robust baseline that achieves higher performance and data efficiency than other SFT approaches, despite using a significantly smaller training dataset.
The subsequent RL phase further refines and stabilizes these capabilities, leading to more consistent and improved performance. After RL, the final model achieves state-of-the-art results on several benchmarks, including WeMath, MathVista and MathVision.
One of the key findings was that, as the model improved at multimodal reasoning, it also showed "the gradual emergence of textual reasoning behaviors, suggesting a transfer of reasoning ability from multimodal to purely linguistic domains," the researchers note. This suggests that skills learned in one modality can reinforce performance in another.
"Our results show that strengthening multimodal reasoning can also improve math skills alone." Zhang said. "Looking ahead, we expect these methods to be extended to video and audio."
The researchers also found that token efficiency is crucial. Although allowing a model to generate longer reasoning traces can improve performance, excessive tokens hurt it. Their results show that setting a small "reasoning budget" can achieve comparable or even better accuracy, an important consideration for deploying cost-effective enterprise applications.
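In practice, such a budget can be enforced with a simple cap on generated tokens. The sketch below uses Hugging Face transformers' standard `max_new_tokens` argument; the model choice and budget values are assumptions for illustration, not the paper's settings.

```python
# Illustrative sketch: enforcing a "reasoning budget" with a hard cap on
# generated tokens via transformers' standard `max_new_tokens` argument.
# The model and budget values are assumptions, not the paper's settings.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any chat-tuned model works here
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Reason step by step, then give the final answer: what is 17 * 24?"
inputs = tok(prompt, return_tensors="pt").to(model.device)

for budget in (256, 1024, 4096):
    # A smaller budget can match the accuracy of a larger one at lower cost.
    out = model.generate(**inputs, max_new_tokens=budget)
    answer = tok.decode(out[0][inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True)
    print(f"budget={budget}: {answer[:80]}...")
```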
By open-sourcing all components of their workflow, the researchers provide a reproducible view of the entire process. For enterprise teams, this transparency is invaluable. "For business leaders concerned with vendor lock-in, hidden bias or ambiguous data sources, this level of transparency is essential," Zhang explained. "It empowers teams to validate data, customize the pipeline for new domains, and maintain long-term independence from any provider."