Nowadays, being able to present an idea, project, or achievement is an essential skill. Your ability to show and talk about your work may determine whether you get that degree, funding, or accreditation.
But while effective communication is important, it’s not a skill that everyone has. This is something you build through constant practice.
Therein lies the challenge: When we practice on our own, it’s easy to overlook coordination, posture, and delivery, which are just as important as speech.
That is where a coach comes in. They are another pair of eyes and ears, noting important details and relaying them to you as you present. Thanks to recent advances in visual AI, you can now get continuous and objective feedback at any time: frameworks such as Vision Agents let you seamlessly integrate powerful visual models and build the AI-powered applications you want.
In this article, we’ll build a real-time public speaking and presentation coach powered by Vision Agents, which you can run on your PC or Mac to improve your delivery.
What are we making?
In this guide, we’ll walk you through how to create a coaching agent that acts as your personal practice partner. The agent will provide real-time feedback, highlight areas for improvement and offer helpful suggestions via audio and text.
The agent will track several aspects of your delivery, looking for:
Filler words: To help you reduce the use of words such as “um”, “uh”, “like”, and “you know”.
Speaking speed: To identify if you are speaking too fast or too slow.
Vocal variety: To point out if you sound monotonous.
Clarity: To hear if your words are clear enough.
Posture: To check whether you maintain good posture, looking at your shoulders, back, and chin.
Hand gestures: To monitor the use of your hands.
Eye contact: To determine if your eyes are looking directly at your audience.
Now you have a mental picture of what we are setting out to build.
Better still, here’s a visual look at how this agent looks and works.
You can find all the code for this tutorial in the accompanying repo.
Prerequisites
Before we begin, make sure you have:
Key technologies
First, let’s introduce the key players and their respective roles in our Presentation Coach implementation.
Stream Video
Stream Video is a complete video infrastructure built on WebRTC that enables browsers and apps to send live audio and video. It comes supercharged with a worldwide edge network that routes your video to the nearest server in less than 30 milliseconds. For our presentation coach, this means the AI can join your practice session like a real participant, watching and listening to you in real time without interruption and providing feedback.
Vision Agents
Vision Agents is an open-source framework from Stream that allows you to integrate video streams, AI models, and chat interfaces. It ships with Stream Video as its default transport layer.
This framework simplifies the development of multimodal AI agent applications by providing a unified agent class that orchestrates everything. With Vision Agents, you can connect models and get them to work seamlessly as an integrated system.
OpenAI Realtime API
The OpenAI Realtime API allows you to stream live, low-latency interactions with OpenAI models. Its strength lies in handling speech-to-speech in one pass: your words go in, the AI thinks about them, and you get audio and text feedback almost instantly. Your app and the model communicate like a live conversation. This will be the brain of the presentation coach.
YOLO11
YOLO11 is an advanced and powerful computer vision model developed by Ultralytics. It supports a wide range of tasks, including object detection, instance segmentation, image classification, pose estimation/keypoint detection, and oriented bounding box detection.
It tracks 17 different keypoints on your body, covering your shoulder, head, and hand positions, and uses them to estimate your posture at any given moment. Our presentation coach will focus on its pose estimation and keypoint detection capabilities.
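If you want to see what this pose data looks like before wiring it into the agent, here is a minimal sketch using the Ultralytics Python API. The filename and presenter.jpg are just placeholders; the model weights download automatically on first use.
# pose_check.py (illustrative, not part of the coach project)
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")      # pose-estimation variant of YOLO11
results = model("presenter.jpg")     # run inference on a sample frame

keypoints = results[0].keypoints     # 17 keypoints per detected person
print(keypoints.xy)                  # pixel x/y coordinates, shape (people, 17, 2)
print(keypoints.conf)                # per-keypoint confidence scores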

Project setup
Now, let’s get straight into building this presentation coach with all the technologies we’ve highlighted.
We’ll start by installing uv, the recommended package manager for Vision Agents. Create a project folder, and if you already have pip, run this command in your terminal:
pip install uv
For Linux/macOS, you can instead run the official install script:
curl -LsSf https://astral.sh/uv/install.sh | sh
For Windows, run:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Next, initialize uv in your project:
uv init
Then create a virtual environment:
uv venv
and activate it. On Windows, run:
.venv\Scripts\activate
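On Linux/macOS, run:
source .venv/bin/activate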
Now install Vision Agents with the required plugins and dependencies:
uv add "vision-agents[getstream,openai,ultralytics]" python-dotenv
In the root directory, create a .env file and provide the required credentials:
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-secret
OPENAI_API_KEY=your-openai-api-key
CALL_ID="practice-room"
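If you want to confirm these credentials load correctly, a short throwaway script like the one below will do; check_env.py is just an illustrative name, and the agent itself only needs the .env file.
# check_env.py (optional sanity check)
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

for key in ("STREAM_API_KEY", "STREAM_API_SECRET", "OPENAI_API_KEY", "CALL_ID"):
    print(key, "is set" if os.getenv(key) else "is MISSING")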
In the root directory, create an instructions folder and, inside it, a Markdown file named coach.md.
Also in the root directory, create a file named download_yolo_pose.py.
Your current project folder structure should look like this:
└── 📁Presentation Coach
    ├── 📁.venv
    ├── 📁instructions
    │   └── coach.md
    ├── .env
    ├── .gitignore
    ├── download_yolo_pose.py
    ├── main.py
    ├── pyproject.toml
    ├── README.md
    └── uv.lock
Set up YOLO
Ultralytics’ YOLO11 framework uses the yolo11n-pose.pt model file to track your posture during your presentation. This pre-trained deep learning model estimates poses by detecting keypoints. In your download_yolo_pose.py file, enter this:
# download_yolo_pose.py
from ultralytics import YOLO
import shutil
from pathlib import Path

model = YOLO("yolo11n-pose.pt")

project_root = Path(__file__).parent
target = project_root / "yolo11n-pose.pt"

if not target.exists():
    print("Copying model to project root...")
    shutil.copy2(model.model.path, target)
else:
    print("Model already in project root.")

print(f"Ready: {target.resolve()}")
This script automatically downloads the yolo11n-pose.pt file if it is not already present and copies it to the project root.
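Run the script once before launching the agent:
python download_yolo_pose.py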
Coaching Instructions
The AI plays the role of the coach in this implementation. The coach.md file gives it its entire personality, skill set, and coaching philosophy. In it you specify the tone, response length, speaking speed, feedback timing, and other metrics you want your AI to follow. Without these guidelines, you’ll get generic answers, vague suggestions, overly long responses, and interruptions.
Paste the following into your coach.md file for best results:
//instructions/coach.md
These instructions describe how the coaching system should behave when someone is practicing a presentation. Give quick, specific tips and try not to interrupt their flow. Only provide feedback after detecting at least 3-5 seconds of silence.
For format, feedback should appear as short on-screen texts, kept to 1 or 2 sentences maximum.
You want people to be relaxed during the presentation, so start with something positive and always add one actionable tip.
You’ll have access to video feeds, transcripts and pose data. That’s enough to get a good idea of pace, body language and how engaged they look.
A big part of your evaluation is to understand their speech. You should look out for:
Pace: Shouldn’t be too fast or too slow. Send a message to address it when noticed.
Filler words: Listen for “um”, “uh”, “emm”, “you know”. If they keep popping up, send a reminder to pause.
Tone and variety: Watch out for their pitch and suggest adjustments accordingly.
Clarity: Make sure that their words are clear enough.
Also, keep an eye on body posture. Encourage a confident presentation style that uses hand gestures, straight shoulders, and steady eye contact.
Presentation agent
Now that we have all our pieces in place, it’s time to look at the central processing unit of our Presentation Coach. The main.py file is where the Vision Agents magic happens: it ties live video streaming, the OpenAI Realtime API, YOLO pose detection, and your coaching instructions into one multimodal agent.
Here is what our main.py looks like:
# main.py
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, User, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import getstream, openai, ultralytics

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent_user = User(
        name="Public Speaking & Presentation Coach",
        id="coach_agent",
        image="",  # add your avatar image URL here
    )
    return Agent(
        edge=getstream.Edge(),
        agent_user=agent_user,
        instructions="@instructions/coach.md",
        llm=openai.Realtime(
            fps=6,
            voice="alloy",
        ),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt"),
        ],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    print("Presentation Coach Agent starting...")
    print(f"Joining call: {call_type}:{call_id}")
    call = await agent.create_call(call_type, call_id)
    session = await agent.join(call)
    print("Agent connected and ready!")
    print("Real-time coaching enabled")
    try:
        await agent.llm.simple_response(
            text="Greet the user warmly and say you're ready to help them practice. "
            "Watch their body language and speech — give encouraging, real-time feedback."
        )
        await agent.finish()
    finally:
        await session.close()


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
Let’s see what’s happening in this code:
The load_dotenv() function loads your API keys from the .env file.
The create_agent function creates the coach’s identity using the User object, assigning it a name, ID, and avatar. The instantiated Agent object then takes several arguments that configure how the agent behaves and interacts with video, models, and the user: edge, agent_user, instructions, llm, and processors.
edge=getstream.Edge() connects everything to Stream’s global, low-latency video infrastructure.
agent_user specifies the identity of the coach created earlier.
instructions loads the coaching philosophy in coach.md directly into the agent’s mind.
llm defines the AI language model and its parameters. For this agent, it is openai.Realtime, which opens a WebSocket to OpenAI’s Realtime API. With fps=6, the agent receives six video frames per second, and voice="alloy" enables real-time speech generation.
processors perform specific AI/ML computations on the incoming streams. In this case, YOLO11 analyzes the video frames for pose data.
In the join_call function, the agent joins the call and opens with a short, welcoming greeting that appears immediately in the chat. The await agent.finish() call keeps the agent’s real-time loop running, so it continuously listens, watches, thinks, and responds automatically. No manual prompts are required.
To run the agent, type this into your terminal:
python main.py
The result
We have successfully developed a public speaking and presentation AI agent that provides timely feedback with valuable suggestions to help you improve your presentation in real time.
This is made possible by a trio of Vision Agents, YOLO11, and the OpenAI Realtime API. In less than 50 lines of code, we created an agent that costs almost nothing to run (just a few API tokens) compared to paying $99 for a SaaS platform or hiring an in-person coach. Very cool!
With Vision Agents, you have a developer-friendly framework that gives you countless opportunities to efficiently build engaging AI apps.
Happy building!