Bilingual ASR for dialects, code switching, and songs – MiMo-V2.5 Voice

by SkillAiNest April 25, 2026

Whisper changed what people expected from open source ASR. Three years later, the leaderboard looks very different.

What is it: MiMo-V2.5-ASR is an 8B open-source speech recognition model from Xiaomi MiMo, MIT-licensed and available on HuggingFace, designed for bilingual Chinese-English transcription across dialects, noisy audio, code-switched speech, and song lyrics.

The problem: Most ASR models are benchmarked on clean studio data but deployed in the real world, where audio is noisy, speakers overlap, and people switch languages mid-sentence. The gap between benchmark accuracy and production accuracy is where voice products silently fail.

Solution: Staged training that combines large-scale intermediate training, supervised fine-tuning, and reinforcement learning, specifically targeting the scenarios where traditional models break down. Native punctuation inferred from prosody means transcripts are ready to use as-is.

What makes it different?: On the Open ASR leaderboard, MiMo-V2.5-ASR posts a 5.73% average WER in English, below Whisper large-v3's 7.44%. On Wu dialect, it scores 19.55% against FunASR-1.5's 29.08%. On songs, it scores 3.95% on M4Singer against Gemini 2.5 Pro's 4.25%. These aren't cherry-picked scenarios; they are the hard ones.
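For readers who want to sanity-check numbers like these on their own audio, here is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative, not part of any MiMo tooling, and leaderboards typically apply text normalization before scoring, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fraction by which the candidate lowers the baseline error rate."""
    return (baseline - candidate) / baseline


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Error rates quoted in the article (percent).
    print(round(relative_reduction(7.44, 5.73), 3))
```

Applied to the English figures quoted above, `relative_reduction(7.44, 5.73)` works out to roughly a 23% relative reduction in errors, which is often a more useful way to compare models than the absolute difference.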

Key Features:

Eight Chinese dialects are natively supported, including Wu, Cantonese, Hokkien, and Sichuanese
Chinese-English code switching without language tags
Transcription of song lyrics under accompaniment and pitch variation
Robustness to multi-speaker and noisy environments
Native punctuation, no post-processing required
MIT license, Python API, Gradio demo, self-hosted

Benefits:

Production-grade accuracy under real-world audio conditions
A single model replaces multiple regional or domain-specific ASR solutions
Self-hosting eliminates per-call API costs and keeps data on your infrastructure
Ready-to-use punctuated output removes one step from every downstream pipeline

Who is it for?: ML engineers and audio product teams building bilingual or Chinese transcription pipelines that need accuracy that holds up outside the lab.

Open source ASR has been catching up to closed models for years. MiMo-V2.5-ASR is a data point showing that the gap is now very small, and in some scenarios has disappeared.
