Beginner's Guide to VibeVoice – KDnuggets

by SkillAiNest

Photo by Author | Canva

. Introduction

Open source AI is experiencing an important moment. With progress in large language models, general machine learning, and now speech technologies, open source models are rapidly closing the gap with proprietary systems. One of the most interesting entries in this space is Microsoft's open source voice stack, VibeVoice. This model family is designed for natural, expressive, and interactive conversation, matching the quality of advanced commercial offerings.

In this article, we will explore VibeVoice, download the model, and run inference in Google Colab on a GPU runtime. We will also address common problems that arise while running the model.

. Introduction to VibeVoice

VibeVoice is a next-generation text-to-speech framework for generating expressive, long-form, multi-speaker audio such as podcasts and dialogues. Unlike traditional TTS, it excels at scaling, speaker consistency, and natural turn-taking.

Its core innovation lies in continuous acoustic and semantic tokenizers operating at 7.5 Hz, paired with a large language model (Qwen2.5-1.5B) and a diffusion head to produce high-fidelity audio. This design enables up to 90 minutes of speech with 4 distinct speakers, surpassing previous systems.
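To see why such a low frame rate matters, here is a rough back-of-the-envelope estimate (my own illustration, assuming one token per 7.5 Hz frame per audio stream, which is a simplification) of the token budget a 90-minute session needs:

```python
# Rough token-budget estimate for VibeVoice's 7.5 Hz acoustic tokenizer.
# Assumption for illustration: one token per frame per audio stream.
frame_rate_hz = 7.5      # acoustic/semantic token rate reported for VibeVoice
session_minutes = 90     # maximum supported session length

tokens = frame_rate_hz * session_minutes * 60
print(f"~{tokens:.0f} tokens for a {session_minutes}-minute session")
# A typical 50 Hz neural codec would need 50 * 90 * 60 = 270,000 tokens
# for the same session, which would not fit a 1.5B LLM's context window.
```

The low frame rate is what lets the Qwen2.5-1.5B backbone keep an entire long conversation in context.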

VibeVoice is available as an open source model on Hugging Face, with community-maintained code for easy experimentation and use.

Image from VibeVoice

. Getting Started with VibeVoice-1.5B

In this guide, we will clone the VibeVoice repository and learn how to run a demo that turns a text file into natural speech. Setup takes only about 5 minutes before you can start generating audio.

!! 1. Clone the Community Repository and Install

First, clone the community version of the VibeVoice repository (vibevoice-community/VibeVoice), install it, and install the huggingface_hub library so we can download the model through its API.

Note: Before starting the Colab session, make sure your runtime is set to a T4 GPU.

!git clone -q --depth 1 https://github.com/vibevoice-community/VibeVoice /content/VibeVoice
%pip install -q -e /content/VibeVoice
%pip install -q -U huggingface_hub
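Before downloading anything, it helps to confirm the Colab runtime actually has a GPU. A minimal check (my own addition, assuming PyTorch is available, as it is by default in Colab):

```python
def gpu_status() -> str:
    """Return a short description of the available accelerator."""
    try:
        import torch
        if torch.cuda.is_available():
            # On a correctly configured Colab runtime this reports "Tesla T4".
            return f"GPU: {torch.cuda.get_device_name(0)}"
        return "No GPU detected - switch the Colab runtime to a GPU accelerator."
    except ImportError:
        return "PyTorch is not installed."

print(gpu_status())
```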

!! 2. Download the Model Snapshot from Hugging Face

Download the model weights using the Hugging Face snapshot API. This will download all files from the microsoft/VibeVoice-1.5B repository.

from huggingface_hub import snapshot_download
snapshot_download(
    "microsoft/VibeVoice-1.5B",
    local_dir="/content/models/VibeVoice-1.5B",
    local_dir_use_symlinks=False
)
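After the snapshot finishes, a quick sanity check confirms the weights actually landed in the local directory (a small helper of my own; the path matches the snippet above):

```python
from pathlib import Path

def list_files(root: str) -> list[str]:
    """Return the relative paths of all files under root, sorted."""
    base = Path(root)
    return sorted(str(p.relative_to(base)) for p in base.rglob("*") if p.is_file())

# You should see config files plus one or more large safetensors shards.
for name in list_files("/content/models/VibeVoice-1.5B"):
    print(name)
```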

!! 3. Create a Transcript with Speaker Labels

We will create a text file inside Google Colab, using the %%writefile magic function to provide its contents. The following is a sample conversation between two speakers about KDnuggets.

%%writefile /content/my_transcript.txt
Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI.
Speaker 1: I like how KDnuggets always keeps up with the latest trends.
Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community.
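The demo script splits this file on the `Speaker N:` labels. A simplified sketch of that parsing logic (my own illustration of the format, not the repository's actual code):

```python
import re

def parse_transcript(text: str) -> list[tuple[str, str]]:
    """Split a transcript into (speaker, utterance) pairs."""
    segments = []
    for line in text.splitlines():
        # Each dialogue line must start with "Speaker <number>:".
        m = re.match(r"\s*(Speaker \d+)\s*:\s*(.+)", line)
        if m:
            segments.append((m.group(1), m.group(2)))
    return segments

sample = """Speaker 1: Have you read the latest article on KDnuggets?
Speaker 2: Yes, it's one of the best resources for data science and AI."""
print(parse_transcript(sample))
```

Lines that do not match the label pattern are ignored, so keep every utterance on a single labeled line.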

!! 4. Run Inference (Multi-Speaker)

Now, we will run the demo inference script from the VibeVoice repository. The script requires the model path, the text file path, and the speaker names.

#1 Run: Map Speaker 1 → Alice, Speaker 2 → Frank

!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Alice Frank

As a result, you will see the following output. The model will use CUDA to generate the audio, with both Frank and Alice as speakers. It also prints a generation summary you can use for analysis.

Using device: cuda
Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....

Speaker mapping:
  Speaker 2 -> Frank
  Speaker 1 -> Alice
Speaker 1 ('Alice') -> Voice: en-Alice_woman.wav
Speaker 2 ('Frank') -> Voice: en-Frank_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B
==================================================
GENERATION SUMMARY
==================================================
Input file: /content/my_transcript.txt
Output file: ./outputs/my_transcript_generated.wav
Speaker names: ('Alice', 'Frank')
Number of unique speakers: 2
Number of segments: 4
Prefilling tokens: 368
Generated tokens: 118
Total tokens: 486
Generation time: 28.27 seconds
Audio duration: 15.47 seconds
RTF (Real Time Factor): 1.83x
==================================================
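The RTF line in the summary is simply generation time divided by audio duration; values above 1.0x mean generation runs slower than real time. Reproducing the number from the summary above:

```python
# Figures taken from the generation summary printed by the demo script.
generation_time_s = 28.27
audio_duration_s = 15.47

# Real Time Factor: seconds of compute per second of audio produced.
rtf = generation_time_s / audio_duration_s
print(f"RTF: {rtf:.2f}x")
```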

Play the audio in the notebook:

We will now use the IPython display function to listen to the audio generated inside Colab.

from IPython.display import Audio, display
out_path = "/content/outputs/my_transcript_generated.wav"
display(Audio(out_path))
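To double-check the generated file without listening to it, the standard-library wave module can report its duration (a small helper of my own; the output path is the one printed in the summary):

```python
import os
import wave

def wav_duration(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

out_path = "/content/outputs/my_transcript_generated.wav"
if os.path.exists(out_path):
    print(f"{wav_duration(out_path):.2f} s")
```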


It took 28 seconds to generate the audio, and it sounds clean, natural, and smooth. I like it!

Try again with different voice actors.

#2 Run: Try different voices (Mary for Speaker 1, Carter for Speaker 2)

!python /content/VibeVoice/demo/inference_from_file.py \
  --model_path /content/models/VibeVoice-1.5B \
  --txt_path /content/my_transcript.txt \
  --speaker_names Mary Carter

This time the generated audio was even better, with a smooth transition between the background music and the speaker.

Found 9 voice files in /content/VibeVoice/demo/voices
Available voices: en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman
Reading script from: /content/my_transcript.txt
Found 4 speaker segments:
  1. Speaker 1
     Text preview: Speaker 1: Have you read the latest article on KDnuggets?...
  2. Speaker 2
     Text preview: Speaker 2: Yes, it's one of the best resources for data science and AI....
  3. Speaker 1
     Text preview: Speaker 1: I like how KDnuggets always keeps up with the latest trends....
  4. Speaker 2
     Text preview: Speaker 2: Absolutely, it's a go-to platform for anyone in the AI community....

Speaker mapping:
  Speaker 2 -> Carter
  Speaker 1 -> Mary
Speaker 1 ('Mary') -> Voice: en-Mary_woman_bgm.wav
Speaker 2 ('Carter') -> Voice: en-Carter_man.wav
Loading processor & model from /content/models/VibeVoice-1.5B

Tip: If you are not sure which names are available, the script prints "Available voices:" at startup.

Common options include:

en-Alice_woman, en-Carter_man, en-Frank_man, en-Mary_woman_bgm, en-Maya_woman, in-Samuel_man, zh-Anchen_man_bgm, zh-Bowen_man, zh-Xinran_woman

. Troubleshooting

!! 1. No Demo Scripts in the Repo?

The official Microsoft VibeVoice repository has been pulled and reset. Community reports indicate that some code and demos were removed or are no longer accessible at their original location. If you find that the official repository is missing this example, check a community mirror or archived documentation that has preserved the original demo and instructions.

!! 2. Slow Generation or CUDA Errors in Colab

Confirm that you are on a GPU runtime: Runtime → Change runtime type → Hardware accelerator: GPU (T4 or any available GPU).

!! 3. CUDA OOM (out of memory)

You can take several steps to reduce memory usage. Start by shortening the input text and reducing the generation length. Consider lowering the audio sample rate and/or adjusting the chunk size if the script allows it. Set the batch size to 1, and choose a smaller model variant if one is available.
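As a rough guide to whether the model fits at all, the weights alone of a 1.5B-parameter model in 16-bit precision need about 3 GB of VRAM, before activations and the diffusion head's working memory. This is my own back-of-the-envelope estimate, not an official figure:

```python
# Back-of-the-envelope VRAM estimate for the model weights alone.
params = 1.5e9        # parameter count of VibeVoice-1.5B
bytes_per_param = 2   # bfloat16 / float16

weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB for weights alone")
# A 16 GB T4 leaves headroom, but long transcripts grow the KV cache and
# activation memory, which is what usually triggers out-of-memory errors.
```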

!! 4. No audio or missing output folder

The script usually prints the final output path in the console. Scroll up to find the exact location, or search for it:

find /content -name "*generated.wav"
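The same search can be done from a notebook cell in Python, if you prefer not to shell out (a pathlib sketch of my own):

```python
from pathlib import Path

# Recursively search /content for any generated WAV files.
for hit in Path("/content").rglob("*generated.wav"):
    print(hit)
```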

!! 5. Voice Names Not Recognized?

Copy the exact names listed under "Available voices", or use the alias names shown in the demo (Alice, Frank, Mary, Carter); they map to the corresponding .wav assets.

. Final Thoughts

For many projects, I would choose an open source stack like VibeVoice over paid APIs for several compelling reasons. First and foremost, it is easy to integrate and offers flexible customization, making it suitable for a wide range of applications. In addition, its GPU requirements are surprisingly modest, which can be an important advantage in resource-constrained environments.

VibeVoice is open source, which means that in the future you can expect improved versions of the framework that enable faster generation, even on CPU.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a master's degree in technology management and a bachelor's degree in telecommunications engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.
