Run real-time speech-to-speech AI models natively.

by SkillAiNest

Photo by author

# Introduction

Before we get started, I want you to watch this video:


Isn’t it amazing? You can now run a fully native model on your machine that you can talk to, and it works out of the box. It feels like talking to a real person, because the system can listen and speak at the same time, just like a natural conversation.

This is not the “you speak, then wait, then it answers” routine. PersonaPlex is real-time speech-to-speech conversational AI that handles interruptions, overlaps, and natural conversational gestures like “uh-huh” or “right” as you speak.

PersonaPlex is designed to be full duplex so it can listen and create speech simultaneously without forcing the user to pause first. It makes conversations feel much more fluid and human than traditional voice assistants.

In this tutorial, we’ll learn how to set up a Linux environment, install PersonaPlex locally, and then start the PersonaPlex web server so you can interact with the AI in real time from your browser.

# Using PersonaPlex Natively: A Step-by-Step Guide

In this section, we’ll install PersonaPlex on Linux, launch the real-time WebUI, and start talking to a full-duplex speech-to-speech AI model running natively on our machine.

// Step 1: Accept the terms of the model and generate a token

Before you can download and run PersonaPlex, you must accept the terms of use for the model on Hugging Face. The PersonaPlex-7B-v1 speech-to-speech model from NVIDIA is gated, which means you can’t access the weights unless you agree to the license terms on the model page.

Go to the PersonaPlex model page on Hugging Face and log in. You will see a notice saying that you need to share your contact information and accept the license terms to access the files. Review the NVIDIA Open Model License and accept the terms to unlock the repository.

Once access is granted, create a Hugging Face access token:

  1. Go to Settings → Access Tokens.
  2. Create a new token with Read permission.
  3. Copy the generated token.

Then export it to your terminal:

export HF_TOKEN="YOUR_HF_TOKEN"

This token allows your local machine to authenticate and download the PersonaPlex model.
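As a quick sanity check before downloading anything, here is a minimal Python sketch (standard library only; the helper name is my own, not part of PersonaPlex) that fails fast when `HF_TOKEN` is missing:

```python
import os

def get_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment, or fail fast.

    Raising a clear error here is friendlier than letting the model
    download fail later with a confusing 401/403 from the gated repo.
    """
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError('HF_TOKEN is not set; run export HF_TOKEN="..." first')
    return token
```

The download tooling reads the same environment variable, so if this check passes, authentication should work.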

// Step 2: Installing Linux Dependencies

Before installing PersonaPlex, you must install the Opus audio codec development library. PersonaPlex relies on Opus for real-time audio encoding and decoding, so this dependency must be available on your system.

On Ubuntu or Debian-based systems, run:

sudo apt update
sudo apt install -y libopus-dev
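To confirm the shared library is actually visible afterwards, a small standard-library Python check can help (the helper name is mine, for illustration):

```python
from ctypes.util import find_library

def has_libopus():
    """Return True if a libopus shared library can be located on this system."""
    return find_library("opus") is not None

if __name__ == "__main__":
    print("libopus found" if has_libopus() else "libopus missing: install libopus-dev")
```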

// Step 3: Building PersonaPlex from Source

Now we will clone the PersonaPlex repository and install the required moshi package from source.

Clone the official NVIDIA repository:

git clone 
cd personaplex

Once inside the project directory, install moshi:

This will compile and install PersonaPlex components with all required dependencies, including PyTorch, CUDA libraries, NCCL, and audio tooling.

You should see packages like torch, nvidia-cublas-cu12, nvidia-cudnn-cu12, sentencepiece, and moshi-personaplex installed successfully.

Tip: Do this in a virtual environment if you’re on your own machine.

// Step 4: Starting the WebUI Server

Before launching the server, install the faster Hugging Face downloader:

Now start the PersonaPlex Realtime Server:

python -m moshi.server --host 0.0.0.0 --port 8998

The first run will download the full PersonaPlex model, which is about 16.7 GB. This may take some time depending on your internet speed.
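Since the checkpoint is roughly 16.7 GB, it is worth confirming you have enough free disk space first. A standard-library sketch (the 20 GB threshold is my own margin, not a documented requirement):

```python
import shutil

def enough_disk(path=".", required_gb=20):
    """Return True if at least required_gb gigabytes are free at path."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

if __name__ == "__main__":
    print("enough space" if enough_disk() else "free up disk space before downloading")
```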


After the download is complete, the model will be loaded into memory and the server will start.
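If startup fails instead, a common cause is that port 8998 is already taken by another process. A quick way to check from Python (hypothetical helper, standard library only):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when something is already listening there
        return s.connect_ex((host, port)) != 0

if __name__ == "__main__":
    print("port 8998 free" if port_is_free(8998) else "port 8998 in use")
```

If the port is taken, either stop the other process or pass a different `--port` to the server.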


// Step 5: Talking to PersonaPlex in a Browser

Now that the server is up and running, it’s time to actually talk to PersonaPlex.

If you are running it on your local machine, open this link in your browser: http://localhost:8998. If the server is on a remote machine, replace localhost with that machine’s address, since the server listens on 0.0.0.0.

This will load the WebUI interface in your browser.

After the page opens:

  1. Select a voice.
  2. Click Connect.
  3. Allow microphone access.
  4. Start speaking.

The interface includes conversation templates. For this demo, we chose the astronaut (entertainment) template to make the interaction more playful. You can also create your own template by editing the initial system prompt text, which lets you fully customize the AI’s personality and behavior.

For voice selection, we switched from the default and chose Natural F3, just to try something different.


And honestly, it feels surprisingly natural.

You can pause it while it’s speaking.

You can ask follow-up questions.

You can change topics mid-sentence.

It easily handles the flow of conversations and responds intelligently in real time. I even tested it by simulating a bank customer service call, and the experience felt realistic.


PersonaPlex includes several voice presets:

  • Natural (female): NATF0, NATF1, NATF2, NATF3
  • Natural (male): NATM0, NATM1, NATM2, NATM3
  • Variety (female): VARF0, VARF1, VARF2, VARF3, VARF4
  • Variety (male): VARM0, VARM1, VARM2, VARM3, VARM4

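If you script around the server, the preset names above can be kept in a small lookup table. This is just the list restated as Python; PersonaPlex itself does not ship this helper:

```python
# Voice preset names as listed in the PersonaPlex WebUI.
VOICE_PRESETS = {
    ("natural", "female"): ["NATF0", "NATF1", "NATF2", "NATF3"],
    ("natural", "male"):   ["NATM0", "NATM1", "NATM2", "NATM3"],
    ("variety", "female"): ["VARF0", "VARF1", "VARF2", "VARF3", "VARF4"],
    ("variety", "male"):   ["VARM0", "VARM1", "VARM2", "VARM3", "VARM4"],
}

def voices(style, gender):
    """Return preset names for a style/gender pair, e.g. voices("natural", "female")."""
    return VOICE_PRESETS[(style.lower(), gender.lower())]
```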
You can experiment with different voices to match the personality you like. Some feel more conversational, others more expressive.

# Concluding Remarks

After going through this whole setup and talking to PersonaPlex in real time, one thing becomes very clear.

It feels different.

We are used to chat-based AI. You type. It answers. You wait your turn. It feels transactional.

Speech-to-speech changes that completely.

With PersonaPlex running natively, you’re no longer waiting your turn. You can interrupt it. You can change direction in the middle of a sentence. You can naturally ask follow-up questions. The conversation flows. It feels closer to how humans actually talk.

And this is why I truly believe that the future of AI is speech-to-speech.

But even this is only half the story.

The real change will happen when these real-time conversation systems are deeply connected to agents and tools. Imagine talking to your AI and saying: “Book me a ticket for Friday morning.” “Check that stock price and place the trade.” “Write that email and send it.” “Schedule a meeting.” “Pull the report.”

No switching tabs. No copy-paste. No typing commands.

Just talking.

PersonaPlex already solves one of the most difficult problems, which is natural, full-duplex conversation. The next layer is execution. Once speech-to-speech systems connect to APIs, automation tools, browsers, trading platforms, and productivity apps, they stop being assistants and start becoming operators.

In short, it becomes something like OpenClaw on steroids.

A system that doesn’t just talk like humans but works on your behalf in real time.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in Technology Management and a Bachelor’s degree in Telecommunication Engineering. His vision is to create an AI product using graph neural networks for students struggling with mental illness.
