
# Introduction
If you’re trying to understand how large language model (LLM) systems work today, it helps to stop thinking only in terms of prompts. Most real-world LLM applications aren’t just a single question-and-answer exchange. They are systems that manage context, connect to tools, retrieve data, and handle multiple steps behind the scenes. This is where the majority of the real work happens. Rather than focusing on isolated prompting tricks, it is more useful to understand the building blocks behind these systems. Once you understand these concepts, it becomes clear why some LLM applications feel reliable and others don’t. Here are 10 key LLM engineering concepts that explain how modern systems are actually built.
# 1. Understanding Context Engineering
Context engineering involves deciding what the model should see at any given moment. This goes beyond writing a good prompt: it includes managing system instructions, conversation history, retrieved documents, tool definitions, memory, intermediate steps, and execution traces. Essentially, it is the process of choosing what information to present, in what order, and in what format. This is often more important than prompt wording alone, leading many to suggest that context engineering is the new prompt engineering. Many LLM failures happen not because the prompt is poor, but because the context is missing, outdated, redundant, poorly organized, or saturated with noise. For more details, I have written a separate article on this topic: A Gentle Introduction to Context Engineering in LLMs.
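To make this concrete, here is a minimal sketch of context assembly. The function name, block ordering, and character budget are all illustrative assumptions, not from any specific framework:

```python
# Toy context assembler: place stable content first, then retrieved
# documents, then conversation history, trimming the oldest turns
# when the budget is exceeded. All names and limits are invented.

def assemble_context(system: str, history: list[str], retrieved: list[str],
                     max_chars: int = 2000) -> str:
    """Build the prompt in priority order, dropping oldest history first."""
    blocks = [system] + retrieved + history
    while sum(len(b) for b in blocks) > max_chars and history:
        history.pop(0)  # oldest turn is the least valuable
        blocks = [system] + retrieved + history
    return "\n\n".join(blocks)

context = assemble_context(
    system="You are a helpful assistant.",
    history=["User: hi", "Assistant: hello"],
    retrieved=["Doc: refund policy is 30 days."],
)
```

The key design choice is that trimming is deliberate rather than accidental: you decide what gets dropped first, instead of letting the model's context window truncate arbitrarily.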
# 2. Implementing tool calling
Tool calling allows the model to call an external function instead of just trying to generate an answer from its training data. In practice, this is how an LLM searches the web, queries a database, runs code, sends an application programming interface (API) request, or retrieves information from a knowledge base. In this paradigm, the model is no longer just creating text—it is choosing between thinking, speaking, and acting. This is why tool calling is the focus of most production-grade LLM applications. Many practitioners refer to this as the characteristic that turns the LLM into an “agent” because it acquires the ability to take action.
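The core loop is simple: the model emits either text or a structured tool call, and your code dispatches accordingly. The message format below is a simplified stand-in for what real LLM APIs return, and the weather tool is a made-up example:

```python
import json

# Hypothetical tool registry; real systems would call live APIs here.
TOOLS = {
    "get_weather": lambda city: f"22°C and sunny in {city}",
}

def handle_model_output(output: dict) -> str:
    """Dispatch a tool call if the model requested one, else return its text."""
    if output.get("tool_call"):
        call = output["tool_call"]
        args = json.loads(call["arguments"])  # arguments arrive as a JSON string
        return TOOLS[call["name"]](**args)
    return output["content"]

# Simulated model response requesting a tool call:
result = handle_model_output(
    {"tool_call": {"name": "get_weather", "arguments": '{"city": "Paris"}'}}
)
```

In a real application, the tool result is appended back into the conversation so the model can incorporate it into its next response.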
# 3. Adopting a model context protocol
Whereas tool calling allows a model to invoke a specific function, the Model Context Protocol (MCP) is a standard that allows tools, data, and workflows to be shared and reused across different artificial intelligence (AI) systems, acting like a universal connector. Before MCP, integrating N models with M tools might require N×M custom integrations, each with its own potential for errors. MCP solves this by providing a consistent way to expose tools and data so that any AI client can use them. It is fast becoming an industry-wide standard and serves as a key component for building reliable, large-scale systems.
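The following toy registry illustrates the N×M to N+M idea. Real MCP servers speak JSON-RPC over stdio or HTTP with a richer schema; this sketch only mimics the shape of a uniform discover-and-call interface:

```python
# Illustrative only: every server exposes the same two methods, so any
# client can use any server without custom glue code. Names are invented.

class ToolServer:
    def __init__(self, name: str, tools: dict):
        self.name = name
        self.tools = tools  # tool name -> callable

    def list_tools(self) -> list[str]:
        """Uniform discovery: clients ask what a server offers."""
        return sorted(self.tools)

    def call_tool(self, tool: str, **kwargs):
        """Uniform invocation: same calling convention for every server."""
        return self.tools[tool](**kwargs)

search = ToolServer("search", {"web_search": lambda q: f"results for {q}"})
db = ToolServer("db", {"query": lambda sql: f"rows for {sql}"})

# One client loop can discover tools across every server the same way:
catalog = {s.name: s.list_tools() for s in (search, db)}
```

Because discovery and invocation are standardized, adding a new server or a new client is one integration, not one per pair.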
# 4. Enabling agent-to-agent communication
Unlike MCP, which focuses on exposing tools and data in a reusable way, agent-to-agent (A2A) communication focuses on how multiple agents coordinate their actions. This is a clear sign that LLM engineering is moving beyond single-agent applications. Google introduced A2A as a protocol for agents to securely communicate, share information, and coordinate operations across enterprise systems. The basic idea is that many complex workflows no longer fit within a single assistant. Instead, a research agent, a planning agent, and an execution agent may need to collaborate. A2A gives these interactions a standardized structure, preventing teams from inventing ad hoc messaging systems. For more details, refer to: Building AI agents? A2A vs. MCP simply explained.
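A minimal sketch of the hand-off pattern: one agent produces a structured message and another consumes it. The message fields here are invented for illustration; the real A2A protocol defines much richer task and artifact objects:

```python
# Toy research -> planning hand-off. Structured messages (rather than
# free-form text) are what make multi-agent coordination reliable.

def research_agent(task: dict) -> dict:
    """Produce findings addressed to the planner."""
    return {
        "from": "research",
        "to": "planner",
        "findings": f"notes on {task['topic']}",
    }

def planning_agent(msg: dict) -> dict:
    """Turn the researcher's findings into a plan."""
    return {"from": "planner", "plan": f"plan based on {msg['findings']}"}

msg = research_agent({"topic": "pricing"})
plan = planning_agent(msg)
```

The value of a shared protocol is that the planner doesn't need to know how the researcher works internally, only the shape of the messages it emits.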
# 5. Leveraging Prompt and Semantic Caching
If parts of your prompt—such as system instructions, tool definitions, or static documentation—don’t change, you can reuse them instead of sending them to the model again. This is known as prompt caching, which helps reduce both latency and costs. The strategy involves placing static content first and dynamic content later, treating prompts as modular, reusable blocks. Semantic caching goes a step further by allowing the system to reuse previous answers for semantically similar queries. For example, if a user asks the same question in a slightly different way, you don’t necessarily need to generate a new answer from scratch. The main challenge is finding a balance: if the similarity check is too loose, you can serve the wrong answer; if it’s too tight, you lose the performance gains. I wrote a tutorial on this which you can see here: Create an inference cache to save costs in high-traffic LLM apps.
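Here is a small sketch of a semantic cache. A bag-of-words cosine similarity stands in for a real embedding model, and the 0.8 threshold is an arbitrary illustration of the loose-versus-tight trade-off described above:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words vector. A real system would
    use a trained embedding model here."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str):
        """Return a cached answer if a stored query is similar enough."""
        q = embed(query)
        for vec, answer in self.entries:
            if cosine(q, vec) >= self.threshold:
                return answer
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

cache = SemanticCache()
cache.put("what is the refund policy", "Refunds within 30 days.")
hit = cache.get("what is the refund policy please")   # near-duplicate phrasing
miss = cache.get("how do I reset my password")        # unrelated query
```

Raising the threshold makes `miss` cases more common (fewer savings, fewer wrong hits); lowering it does the opposite.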
# 6. Using context compression
Sometimes a retriever successfully finds relevant documents but returns too much text. Although a document may be relevant, the model often only needs the specific passage that answers the user’s question. If you have a 20-page report, the answer might be covered in just two paragraphs. Without context compression, the model must process the entire report, adding noise and cost. With compression, the system extracts only the useful parts, making the response faster and more accurate. This is an important survey paper for those who want to study it in depth: Context compression in retrieval-augmented generation for large language models: A survey.
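A crude extractive version of this idea: keep only the sentences that share words with the query. Real systems use trained compressors or an LLM summarization pass, but this toy filter shows the mechanics:

```python
import re

def compress(document: str, query: str, min_overlap: int = 1) -> str:
    """Keep sentences sharing at least min_overlap words with the query."""
    q_words = set(query.lower().split())
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences
            if len(q_words & set(s.lower().split())) >= min_overlap]
    return " ".join(kept)

doc = ("The report covers many topics. Refunds are issued within 30 days. "
       "Shipping takes one week. Contact support for refunds over $100.")
short = compress(doc, "refunds policy")
```

The off-topic sentences are dropped, so the model receives only the passages likely to answer the question, at a fraction of the token cost.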
# 7. Applying Reranking
Reranking is a secondary evaluation that occurs after the initial retrieval. First, a retriever pulls a group of candidate documents. Next, a reranker examines these results and places the most relevant ones at the top of the context window. This concept is important because many retrieval-augmented generation (RAG) systems fail not because retrieval found nothing, but because the best evidence was buried at the bottom while less relevant parts dominated the prompt. Reranking solves this ordering problem, which often significantly improves response quality. You can choose a reranking model using a benchmark such as the Massive Text Embedding Benchmark (MTEB), which evaluates models across a variety of retrieval and reranking tasks.
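The two-stage shape looks like this. Word overlap stands in for a real cross-encoder reranking model, and the candidate documents are made up:

```python
def rerank(query: str, candidates: list[str], top_k: int = 2) -> list[str]:
    """Score each first-stage candidate against the query and keep the best.
    A real reranker would use a cross-encoder model for the scoring step."""
    q = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return scored[:top_k]

# First-stage retrieval returned these in arbitrary order:
candidates = [
    "Our office hours are 9 to 5.",
    "Refund requests are processed in 30 days.",
    "Refund and return policies apply to all orders.",
]
top = rerank("refund policy", candidates)
```

The point is the division of labor: the first stage is cheap and broad, the second stage is expensive and precise, and only the reranked winners are placed in the prompt.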
# 8. Implementing Hybrid Retrieval
Hybrid retrieval combines different search methods to make retrieval more reliable. Instead of relying solely on semantic search, which infers meaning through embeddings, you combine it with keyword-based methods like Best Matching 25 (BM25). BM25 excels at finding exact words, names, or rare identifiers that semantic search might miss. By using both, you capture the strengths of each. I have explored related problems in my own work: Query Feature Modeling: Improving Search Relevance with Semantic Search and Metadata Filtering. The goal is to improve retrieval by combining different signals rather than relying on a single vector-based method.
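One common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The document IDs and rankings below are made up, and `k = 60` is the conventional smoothing constant from the RRF literature:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list contributes 1/(k + rank) per
    document, so items ranked highly by several lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_c", "doc_b"]    # exact keyword matches
vector_ranking = ["doc_b", "doc_a", "doc_d"]  # semantic neighbours
fused = rrf([bm25_ranking, vector_ranking])
```

Note that `doc_a` wins because both methods ranked it well, even though neither ranked it as confidently as the other list's top pick; that agreement signal is exactly what hybrid retrieval exploits.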
# 9. Designing Agent Memory Architectures
Much of the confusion surrounding “memory” comes from treating it as a monolithic concept. In modern agent systems, it is best to separate short-term working state from long-term memory. Short-term memory represents what the agent is currently using to complete a specific task. Long-term memory functions like a database of stored information, organized by keys or namespaces, and brought into the context window only when relevant. Memory in AI is primarily a problem of retrieval and state management. You must decide what to store, how to organize it, and when to recall it to ensure agents remain efficient without being overwhelmed by irrelevant data.
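A sketch of that separation, with the class and method names invented for illustration:

```python
class AgentMemory:
    def __init__(self, working_limit: int = 3):
        self.working: list[str] = []          # short-term, task-scoped
        self.long_term: dict[str, dict] = {}  # namespace -> key -> value
        self.working_limit = working_limit

    def observe(self, event: str):
        """Short-term memory keeps only the most recent events."""
        self.working.append(event)
        self.working = self.working[-self.working_limit:]

    def remember(self, namespace: str, key: str, value: str):
        """Long-term memory persists facts under explicit namespaces."""
        self.long_term.setdefault(namespace, {})[key] = value

    def recall(self, namespace: str, key: str):
        """Recall is an explicit lookup, not an automatic context dump."""
        return self.long_term.get(namespace, {}).get(key)

mem = AgentMemory()
for event in ["step 1", "step 2", "step 3", "step 4"]:
    mem.observe(event)
mem.remember("user_prefs", "language", "English")
```

The deliberate asymmetry is the point: working memory forgets automatically, while long-term memory persists but is only surfaced through an explicit `recall`.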
# 10. Managing Inference Gateways and Intelligent Routing
Inference routing involves treating each model request as a traffic management problem. Instead of sending every query through the same path, the system decides where it should go based on user needs, task complexity, and cost constraints. Simple queries can go to a smaller, faster model, while complex reasoning tasks are routed to a more powerful one. This is essential for LLM applications at scale, where speed and efficiency are as important as quality. Effective routing ensures better response times for customers and optimal resource allocation for the provider.
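In its simplest form, a router is just a decision function in front of the models. The model names and the word-count heuristic below are placeholders; production routers typically use a trained classifier or a cost model:

```python
def route(query: str) -> str:
    """Send short, simple queries to a small model and long or
    reasoning-heavy queries to a large one. Thresholds are illustrative."""
    reasoning_markers = ("why", "explain", "compare", "plan")
    q = query.lower()
    if len(query.split()) > 20 or any(m in q for m in reasoning_markers):
        return "large-model"
    return "small-model"

simple = route("What time is it?")
complex_ = route("Explain the trade-offs between caching strategies")
```

Even a heuristic this crude can cut costs substantially when most traffic is simple, which is why routing is usually one of the first optimizations added to an inference gateway.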
# Wrapping Up
Importantly, advanced LLM applications work best when you think in systems rather than just prompts.
- Prioritize context engineering first.
- Add tools only when the model needs to perform an action.
- Use MCP and A2A to ensure your system scales and connects cleanly.
- Use caching, compression, and reranking to improve retrieval performance.
- Consider memory and routing as fundamental design issues.
When you look at LLM applications through this lens, navigating the field becomes much easier. Real progress is found not only in the development of large models, but in the sophisticated systems built around them. By mastering these building blocks, you’re already thinking like a specialized LLM engineer.
Kanwal Mehreen is a machine learning engineer and a technical writer with a deep passion for AI along with data science and medicine. She co-authored the e-book “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she is a champion of diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.