Under the baton of AI agents: A technical guide to the next frontier of general AI

by SkillAiNest

Agents are the hottest topic in AI today, and with good reason. AI agents act on behalf of their users, autonomously handling tasks such as shopping online, creating software, researching business trends, or booking travel. Taking generative AI out of the safe sandbox of the chat interface and allowing it to act directly on the world represents a leap in the power and utility of AI.

Agentic AI is moving fast: for example, one of the fundamental building blocks of today's agents, the Model Context Protocol (MCP), is only about a year old. As in any fast-moving field, there are plenty of competing definitions, hot takes, and misleading opinions.

To cut through the noise, I want to explain the basic components of an agentic AI system and how they fit together. It's really not as complicated as it sounds. Hopefully, by the time you finish reading this post, agents won't seem so mysterious.

Agentic ecosystems

Definitions of the word "agent" abound, but I like this minimalist variation from British programmer Simon Willison:

An LLM agent runs tools in a loop to achieve a goal.

A user prompts a large language model (LLM) with a goal: say, booking a table at a restaurant near a certain theater. Along with the goal, the model receives a list of the tools at its disposal, such as a database of restaurant locations or a record of the user's food preferences. The model then plans how to achieve the goal and calls one of the tools, which returns a result. Based on that result, the model calls the next tool. Through iteration, the agent moves toward accomplishing the goal. In some cases, the model's orchestration and planning choices are complemented or constrained by hard-coded logic.
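The "tools in a loop" pattern can be sketched in a few lines of Python. Everything here is a stand-in: a real agent would call an actual LLM and real tool APIs, and the hard-coded responses below exist only to make the control flow concrete.

```python
# Minimal sketch of an agent loop: the LLM either requests a tool
# call or produces a final answer; tool results feed the next step.

def find_restaurants(near: str) -> list[str]:
    # Stand-in for a real map/restaurant API.
    return ["Pizza Roma", "Taj Mahal"]

TOOLS = {"find_restaurants": find_restaurants}

def call_llm(goal: str, history: list[dict]) -> dict:
    # Stand-in for a real LLM call. A real model would decide, from
    # the goal and history, whether to call a tool or answer.
    if not history:
        return {"tool": "find_restaurants", "args": {"near": "the theater"}}
    return {"answer": f"Book {history[-1]['result'][0]}"}

def run_agent(goal: str) -> str:
    history: list[dict] = []
    while True:  # the agent loop
        step = call_llm(goal, history)
        if "answer" in step:  # goal reached
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # tool call
        history.append({"tool": step["tool"], "result": result})

print(run_agent("Book a table near the theater"))
```

The essential point is the loop structure itself: plan, call a tool, observe the result, repeat until the goal is met.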

But what kind of infrastructure does it take to realize this vision? An agent system needs some basic components:

  • A way to build agents. When you deploy an agent, you don't want to code it from scratch. Several agent development frameworks exist for this purpose.

  • Somewhere to run the AI model. An experienced AI developer can download and deploy an open-weight LLM, but doing it right takes expertise, and it requires expensive hardware that would be poorly utilized by the average user.

  • Somewhere to run the agent code. With a framework in place, the developer creates an agent object containing a defined set of functions. Most of those functions involve sending prompts to the AI model, but the code still needs to run somewhere. In practice, most agents will run in the cloud, because we want them to keep working when our laptops are off and to scale out when a job demands it.

  • A mechanism for translating between the LLM's text-based outputs and tool calls.

  • Short-term memory for tracking the content of agentic interactions.

  • Long-term memory for tracking user preferences and affinities across sessions.

  • A way to trace the system's execution and evaluate the agent's performance.

Let’s dive into each of these components in more detail.

Building an agent

Prompting an LLM to explain how it intends to approach a particular task improves its performance on that task. This kind of "chain-of-thought" reasoning is now ubiquitous.

The analogue in agentic systems is the ReAct (reasoning + action) model, in which the agent thinks ("I'll use the map function to find nearby restaurants"), takes an action (issuing an API call to the map function), and then observes the result ("There are two pizza places and an Indian restaurant within two blocks of the movie theater").

ReAct is not the only way to build agents, but it is the foundation of most successful agentic systems. Today's agents typically loop continuously through this thought-action-observation sequence.

The tools available to an agent can include both local tools and remote tools such as databases, microservices, and software-as-a-service offerings. A tool's specification includes a natural-language description of how and when to use it, together with the syntax of its API calls.
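As an illustration, here is what such a specification often looks like in practice. The JSON-Schema-style layout below mirrors common tool-calling APIs, but the tool name, fields, and descriptions are all hypothetical.

```python
# A hypothetical tool specification: a natural-language description
# tells the model when to use the tool; the schema defines the
# syntax of its arguments.
tool_spec = {
    "name": "find_restaurants",
    "description": "Find restaurants near a location. Use when the "
                   "user asks about dining options in an area.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string",
                         "description": "Address or landmark"},
            "radius_m": {"type": "integer",
                         "description": "Search radius in meters"},
        },
        "required": ["location"],
    },
}
```

The model never sees the tool's implementation, only this description, which is why clear natural-language descriptions matter so much for tool selection.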

The developer can also tell the agent, essentially, to build its own tools on the fly. Say a tool retrieves a table stored as comma-separated text, and, to accomplish its goal, the agent needs to sort the table.

Repeatedly passing the table through the LLM and evaluating the results would be a huge waste of resources, and it is not even guaranteed to give the correct result. Instead, the developer can simply instruct the agent to generate its own Python code whenever it encounters a simple but repetitive task. These snippets of code can run alongside the agent or locally in a dedicated, secure code interpreter tool.
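For instance, here is the sort of small, disposable script an agent might generate for the table-sorting task above. The column names and data are invented for illustration.

```python
# Agent-generated helper: sort a comma-separated table by distance
# instead of re-reading the whole table through the LLM.
import csv
import io

raw = """name,distance_m,rating
Pizza Roma,120,4.2
Taj Mahal,300,4.6
Slice House,90,3.9"""

rows = list(csv.DictReader(io.StringIO(raw)))
rows.sort(key=lambda r: int(r["distance_m"]))  # nearest first
print([r["name"] for r in rows])
```

A deterministic script like this is cheap, repeatable, and guaranteed correct, which is exactly what repeated LLM calls are not.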

Tool use can divide responsibility between the LLM and the developer in different ways. Once the tools available to the agent are defined, the developer can simply instruct the agent to select the appropriate tools as needed. Or the developer can specify which tool to use for which data types, and even which data items to pass as arguments in function calls.

Similarly, the developer can simply tell the agent to write code when necessary to automate repetitive tasks or, alternatively, specify which algorithms to use for which data types. The right approach varies from agent to agent.

Run time

Historically, there were two main ways to isolate code running on shared servers: containerization, which was efficient but offered less security; and virtual machines, which were secure but came with a lot of computational overhead.

In 2018, Amazon Web Services' (AWS's) Lambda serverless computing service deployed Firecracker, a new paradigm in server isolation. Firecracker creates "microVMs," complete with hardware isolation and their own Linux kernels, but with low overhead (less than a few megabytes) and fast startup times (as little as a few milliseconds). The low overhead means that each function executing on a Lambda server can have its own microVM.

However, since instantiating an agent requires deploying an LLM, along with the memory resources needed to track the LLM's inputs and outputs, a per-function isolation model is impractical for agents. Instead, with session-based isolation, each session is assigned its own microVM. When the session ends, the LLM's state information is copied to long-term memory, and the microVM is destroyed. This enables the secure and efficient deployment of hosts of agents.

Tool calls

Just as there are several existing development frameworks for agent creation, there are several existing standards for communication between agents and tools, the most popular – currently – being the Model Context Protocol (MCP).

MCP establishes a one-to-one connection between the agent and a dedicated MCP server that handles tool calls, and it defines a standard format for passing various types of data back and forth between the agent and the server.
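To make that concrete, here is a sketch (in Python, for readability) of what a tool-call request looks like on the wire. MCP is built on JSON-RPC 2.0 and, to the best of my knowledge, invokes tools via a `tools/call` method; the tool name and arguments below are hypothetical.

```python
import json

# Shape of an MCP tool-call request: a JSON-RPC 2.0 message whose
# params name the tool and supply its arguments. Tool name and
# argument values are illustrative, not from any real server.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "find_restaurants",
        "arguments": {"location": "Majestic Theater", "radius_m": 500},
    },
}
print(json.dumps(request))
```

The standardized envelope is the point: any MCP-speaking agent can call any MCP server's tools without custom glue code for each pairing.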

Many platforms use MCP by default, but they are also configurable, so they can support a growing set of protocols over time.

Sometimes, however, the needed tool has no available API. In such cases, the only way to retrieve data or perform an action is through cursor movements and clicks on a website. A number of services now perform this kind of computer use. Computer use makes any website a potential tool for agents, unlocking decades' worth of content and valuable services not yet directly available through APIs.

Authorization

With agents, authorization works in two directions. First, of course, users need to be authorized to run their agents. But since the agent acts on behalf of the user, it will usually also need the user's authorization to access networked resources.

There are a few different ways to approach the permissions issue. One is with an access delegation protocol such as OAuth, which essentially takes the agentic system out of the authorization process. The user logs in through OAuth, and the agentic system uses OAuth to access protected resources, but it never has direct access to the user's passwords.

In the other approach, the user logs into a secure session with the agentic system's host, and the host stores its own login credentials for the secure resources. Available systems let the developer choose from a variety of permission strategies and the algorithms that implement them.

Memory and traces

Short term memory

LLMs are next-word prediction engines. What makes them so incredibly versatile is that their predictions are based on the long sequences of words they've already seen, known as the context. The context is itself a kind of memory. But it is not the only kind an agentic system needs.

Suppose, again, that an agent is trying to book a restaurant near a movie theater, and that from the map tool it retrieves a couple of dozen restaurants within a mile radius. It doesn't want to put all that restaurant information into the LLM's context: all that extraneous data could wreak havoc with the next-word probabilities.

Instead, it can store the whole list in short-term memory and retrieve one or two records at a time, based on the user's price and food preferences and on proximity to the theater. If none of those restaurants pans out, the agent can dip back into short-term memory rather than executing another tool call.
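A minimal sketch of that pattern, with invented data and preference fields: the full tool result lives outside the LLM's context, and only a few matching records are recalled at a time.

```python
# Short-term memory sketch: the agent keeps the full tool result
# here and feeds the LLM only the records matching the user's
# preferences. Records and fields are illustrative.

short_term_memory = [
    {"name": "Pizza Roma", "cuisine": "italian", "price": "$$"},
    {"name": "Taj Mahal", "cuisine": "indian", "price": "$$$"},
    {"name": "Slice House", "cuisine": "italian", "price": "$"},
]

def recall(cuisine: str, max_price: str, k: int = 2) -> list[str]:
    # Return at most k matching names to place in the LLM context.
    matches = [r["name"] for r in short_term_memory
               if r["cuisine"] == cuisine
               and len(r["price"]) <= len(max_price)]
    return matches[:k]

print(recall("italian", "$$"))
```

If the first candidates fall through, the agent calls `recall` again with relaxed criteria instead of re-querying the map tool.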

Long term memory

Agents also need to remember their previous interactions with their clients. If last week I told the restaurant booking agent what kind of food I like, I don't want to have to tell it again this week. The same goes for my price tolerance, the kind of atmosphere I'm looking for, and so on.

Long-term memory allows the agent to look up what it needs to know about its previous conversations with the user. Agents do not usually create long-term memories themselves, however. Instead, after a session is complete, the whole conversation passes to a separate AI model, which creates new long-term memories or updates existing ones.

Memory creation may involve LLM summarization and "chunking," in which documents are split into sections for ease of retrieval during subsequent sessions. Available systems let the developer choose strategies and algorithms for summarization, chunking, and other information extraction techniques.
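As a toy illustration of chunking, the function below splits a transcript into overlapping word windows. Real systems use more sophisticated, often semantic, chunking strategies; the window and overlap sizes here are arbitrary.

```python
# Toy chunker: overlapping word windows over a session transcript,
# so each chunk can later be retrieved independently. Sizes are
# illustrative, not tuned.

def chunk(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

transcript = ("The user prefers Italian food and a quiet atmosphere "
              "and asked for tables near the Majestic Theater")
pieces = chunk(transcript)
print(len(pieces), pieces[0])
```

The overlap keeps a fact that straddles a window boundary from being split across two chunks and lost to retrieval.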

Observability

Agents are a new type of software system, and they require new ways of thinking about observing, monitoring, and auditing their behavior. Some of the questions we ask will seem familiar: Are agents completing their tasks quickly enough? How much are they costing? How many tool calls are they making, and are customers happy? But new questions will also arise, and we can't necessarily anticipate what data we'll need to answer them.

Observability and tracing tools can provide an end-to-end view of the execution of a session with an agent, breaking down step by step which actions were taken and why. For agent builders, these traces are key to understanding how well their agents are performing, and they provide the data needed to help the agents do better.

I hope this explanation has demystified agents enough that you're ready to try building your own!
