Nowadays, being able to present an idea, project, or achievement is an essential skill. Your ability to show and talk about your work may determine whether you get that degree, funding, or accreditation.
But while effective communication is important, it’s not a skill that everyone has. This is something you build through constant practice.
Therein lies the challenge: When we practice on our own, it’s easy to overlook coordination, posture, and delivery, which are just as important as speech.
That is where a coach comes in. They are another pair of eyes and ears, noting important details and relaying them to you as you present. Thanks to recent advances in visual AI, you can now get continuous and objective feedback at any time: frameworks such as Vision Agents let you seamlessly integrate powerful visual models and build the AI-powered applications you want.
In this article, we’ll build a real-time public speaking and presentation coach powered by Vision Agents, which you can run on your PC or Mac to improve your delivery.
What are we making?
In this guide, we’ll walk you through how to create a coaching agent that acts as your personal practice partner. The agent will provide real-time feedback, highlight areas for improvement and offer helpful suggestions via audio and text.
The agent will track several aspects of your delivery, looking for:
Filler words: To help you reduce the use of words such as “um”, “uh”, “like”, and “you know”.
Speaking speed: To identify if you are speaking too fast or too slow.
Vocal variety: To point out if you sound monotonous.
Clarity: To hear if your words are clear enough.
Posture: To check whether you maintain good posture, looking at your shoulders, back, and chin.
Hand gestures: To monitor the use of your hands.
Eye contact: To determine if your eyes are looking directly at your audience.
Now you have a mental picture of what we are setting out to build.
Better still, here’s a visual look at how this agent looks and works.
You can find all the code for this tutorial in the accompanying repo.
Prerequisites
Before we begin, make sure you have:
Key technologies
First, let’s introduce the key players and their respective roles in our Presentation Coach implementation.
Stream Video
Stream Video is a complete video infrastructure built on WebRTC that enables browsers and apps to send live audio and video. It comes supercharged with a worldwide edge network that routes your video to the nearest server in less than 30 milliseconds. For our presentation coach, this means the AI can join your practice session like a real participant, watching and listening to you in real time without interruption and providing feedback.
Vision Agents
Vision Agents is an open-source framework from Stream that allows you to integrate video streams, AI models, and chat interfaces. It ships with Stream Video as its default transport layer.
This framework simplifies the development of multimodal AI agent applications by providing a unified agent class that orchestrates everything. With Vision Agents, you can connect models and get them to work seamlessly as an integrated system.
OpenAI Realtime API
The OpenAI Realtime API allows you to stream live, low-latency interactions with OpenAI models. Its strength lies in handling speech-to-speech in one pass: your words go in, the AI thinks about them, and you get audio and text feedback almost instantly. Your app and the model communicate like a live conversation. This will be the brain of the presentation coach.
YOLO11
YOLO11 is an advanced and powerful computer vision model developed by Ultralytics. It supports a wide range of tasks, including object detection, instance segmentation, image classification, pose estimation/keypoint detection, and oriented bounding box detection.
It tracks 17 different keypoints on your body, covering your shoulder, head, and hand positions, and uses them to estimate your posture at any given moment. Our presentation coach will focus on its pose estimation and keypoint detection capabilities.
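If you want to see what this pose data looks like before wiring it into the agent, here is a minimal sketch using the Ultralytics Python API. The filename and presenter.jpg are just placeholders; the model weights download automatically on first use.
# pose_check.py (illustrative, not part of the coach project)
from ultralytics import YOLO

model = YOLO("yolo11n-pose.pt")      # pose-estimation variant of YOLO11
results = model("presenter.jpg")     # run inference on a sample frame

keypoints = results[0].keypoints     # 17 keypoints per detected person
print(keypoints.xy)                  # pixel x/y coordinates, shape (people, 17, 2)
print(keypoints.conf)                # per-keypoint confidence scores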

Project setup
Now, let’s get straight into building this presentation coach with all the technologies we’ve highlighted.
We’ll start by installing uv, the recommended package manager for Vision Agents. Create a project folder, and if you already have pip, run this command in your terminal:
pip install uv
For Linux/macOS, you can instead run the official install script:
curl -LsSf https://astral.sh/uv/install.sh | sh
For Windows, run:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
Next, initialize uv in your project:
uv init
Then create a virtual environment:
uv venv
and activate it. On Windows, run:
.venv\Scripts\activate
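On Linux/macOS, run:
source .venv/bin/activate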
Now install Vision Agents with the required plugins and dependencies:
uv add "vision-agents[getstream,openai,ultralytics]" python-dotenv
In the root directory, create a .env file and provide the required credentials:
STREAM_API_KEY=your-stream-api-key
STREAM_API_SECRET=your-stream-secret
OPENAI_API_KEY=your-openai-api-key
CALL_ID="practice-room"
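If you want to confirm these credentials load correctly, a short throwaway script like the one below will do; check_env.py is just an illustrative name, and the agent itself only needs the .env file.
# check_env.py (optional sanity check)
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

for key in ("STREAM_API_KEY", "STREAM_API_SECRET", "OPENAI_API_KEY", "CALL_ID"):
    print(key, "is set" if os.getenv(key) else "is MISSING")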
In the root directory, create an instructions folder and, inside it, a Markdown file named coach.md.
Also in the root directory, create a file named download_yolo_pose.py.
Your current project folder structure should look like this:
└── 📁Presentation Coach
    ├── 📁.venv
    ├── 📁instructions
    │   └── coach.md
    ├── .env
    ├── .gitignore
    ├── download_yolo_pose.py
    ├── main.py
    ├── pyproject.toml
    ├── README.md
    └── uv.lock
Set up YOLO
Ultralytics’ YOLO11 framework uses the yolo11n-pose.pt model file to track your posture during your presentation. This pre-trained deep learning model estimates poses by detecting keypoints. In your download_yolo_pose.py file, enter this:
# download_yolo_pose.py
from ultralytics import YOLO
import shutil
from pathlib import Path

model = YOLO("yolo11n-pose.pt")

project_root = Path(__file__).parent
target = project_root / "yolo11n-pose.pt"

if not target.exists():
    print("Copying model to project root...")
    shutil.copy2(model.model.path, target)
else:
    print("Model already in project root.")

print(f"Ready: {target.resolve()}")
This script automatically downloads the yolo11n-pose.pt file if it is not already present and copies it to the project root.
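Run the script once before launching the agent:
python download_yolo_pose.py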
Coaching Instructions
The AI plays the role of the coach in this implementation. The coach.md file gives it its entire personality, skill set, and coaching philosophy. In it you specify the tone, response length, speaking speed, feedback timing, and other metrics you want your AI to follow. Without these guidelines, you’ll get generic answers, vague suggestions, overly long responses, and interruptions.
Paste the following into your coach.md file for best results:
//instructions/coach.md
These instructions describe how the coaching system should behave when someone is practicing a presentation. Give quick, specific tips and try not to interrupt their flow. Only provide feedback after detecting at least 3-5 seconds of silence.
For format, feedback should appear as short on-screen texts, kept to 1 or 2 sentences maximum.
You want people to be relaxed during the presentation, so start with something positive and always add one actionable tip.
You’ll have access to video feeds, transcripts and pose data. That’s enough to get a good idea of pace, body language and how engaged they look.
A big part of your evaluation is to understand their speech. You should look out for:
Pace: Shouldn’t be too fast or too slow. Send a message to address it when noticed.
Filler words: Listen for “um”, “uh”, “emm”, “you know”. If they keep popping up, send a reminder to pause.
Tone and variety: Watch out for their pitch and suggest adjustments accordingly.
Clarity: Make sure that their words are clear enough.
Also, keep an eye on body posture. Encourage a confident presentation style that uses hand gestures, straight shoulders, and steady eye contact.
Presentation agent
Now that we have all our pieces in place, it’s time to look at the central processing unit of our Presentation Coach. The main.py file is where the Vision Agents magic happens: it ties live video streaming, the OpenAI Realtime API, YOLO pose detection, and your coaching instructions into one multimodal agent.
Here is what our main.py looks like:
# main.py
import logging

from dotenv import load_dotenv

from vision_agents.core import Agent, User, cli
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import getstream, openai, ultralytics

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent_user = User(
        name="Public Speaking & Presentation Coach",
        id="coach_agent",
        image="",  # add your avatar image URL here
    )
    return Agent(
        edge=getstream.Edge(),
        agent_user=agent_user,
        instructions="@instructions/coach.md",
        llm=openai.Realtime(
            fps=6,
            voice="alloy",
        ),
        processors=[
            ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt"),
        ],
    )


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    print("Presentation Coach Agent starting...")
    print(f"Joining call: {call_type}:{call_id}")
    call = await agent.create_call(call_type, call_id)
    session = await agent.join(call)
    print("Agent connected and ready!")
    print("Real-time coaching enabled")
    try:
        await agent.llm.simple_response(
            text="Greet the user warmly and say you're ready to help them practice. "
            "Watch their body language and speech — give encouraging, real-time feedback."
        )
        await agent.finish()
    finally:
        await session.close()


if __name__ == "__main__":
    cli(AgentLauncher(create_agent=create_agent, join_call=join_call))
Let’s see what’s happening in this code:
The load_dotenv() function loads your API keys from the .env file.
The create_agent function creates the coach’s identity using the User object, assigning it a name, ID, and avatar. The instantiated Agent object then takes several arguments that configure how the agent behaves and interacts with video, models, and the user: edge, agent_user, instructions, llm, and processors.
edge=getstream.Edge() connects everything to Stream’s global, low-latency video infrastructure.
agent_user specifies the identity of the coach created earlier.
instructions loads the coaching philosophy in coach.md directly into the agent’s mind.
llm defines the AI language model and its parameters. For this agent, it is openai.Realtime, which opens a WebSocket to OpenAI’s Realtime API. With fps=6, the agent receives six video frames per second, and voice="alloy" enables real-time speech generation.
processors perform specific AI/ML computations on the incoming streams. In this case, YOLO11 analyzes the video frames for pose data.
In the join_call function, the agent joins the call and opens with a short, welcoming greeting that appears immediately in the chat. The await agent.finish() call keeps the agent’s real-time loop running, so it continuously listens, watches, thinks, and responds automatically. No manual prompts are required.
To run the agent, type this into your terminal:
python main.py
The result
We have successfully developed a public speaking and presentation AI agent that provides timely feedback with valuable suggestions to help you improve your presentation in real time.
This is made possible by a trio of Vision Agents, YOLO11, and the OpenAI Realtime API. In less than 50 lines of code, we created an agent that costs almost nothing to run (just a few API tokens) compared to paying $99 for a SaaS platform or hiring an in-person coach. Very cool!
With Vision Agents, you have a developer-friendly framework that gives you countless opportunities to efficiently build engaging AI apps.
Happy building!