

Image by Author
Llama.cpp is the original, high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can minimize overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more efficient.
In this tutorial, I will guide you through building AI applications with llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, connecting it with LangChain, and building a ReAct agent capable of using tools such as web search and a Python REPL.
1. Setting Up the Llama.cpp Server
This section covers installing llama.cpp and its dependencies, configuring it with CUDA support, building the necessary binaries, and running the server.
Note: We are using an NVIDIA RTX 4090 graphics card running on a Linux operating system with the CUDA toolkit pre-configured. If you do not have access to similar local hardware, you can rent GPU instances from Vast.ai at an affordable price.


Screenshot from the Vast.ai console
- Update your system package list and install the necessary tools such as build-essential, cmake, curl, and git. pciutils is added for hardware information, and llama.cpp requires libcurl4-openssl-dev to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
- Clone the official llama.cpp repository from GitHub and configure the build with CMake.
# Clone llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp
# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DLLAMA_CURL=ON
- Build llama.cpp and all of its tools, including the server. For convenience, copy all compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.
# Build all necessary binaries including server
cmake --build llama.cpp/build --config Release -j --clean-first
# Copy all binaries to main directory
cp llama.cpp/build/bin/* llama.cpp/
- Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model.
./llama.cpp/llama-server \
-hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
--host 0.0.0.0 \
--port 8000 \
--n-gpu-layers 999 \
--ctx-size 8192 \
--threads $(nproc) \
--temp 0.6 \
--cache-type-k q4_0 \
--jinja
- You can test whether the server is running properly by sending a POST request with curl.
(main) root@C.20841134:/workspace$ curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Hello! How are you today?"}
        ],
        "max_tokens": 150,
        "temperature": 0.7
      }'
Output:
{"choices":({"finish_reason":"length","index":0,"message":{"role":"assistant","content":"\nOkay, user greeted me with a simple "Hello! How are you today?" \n\nHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes \"being\" but in a friendly way. \n\nI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic. \n\nSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI. \n\nI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}),"created":1749319250,"model":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","usage":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
2. Building an AI Agent with LangGraph and Llama.cpp
Now, we will use LangGraph and LangChain to communicate with the llama.cpp server and build a multi-tool AI agent.
- Set the Tavily API key to enable the agent's web search capabilities.
- To use LangChain with the local llama.cpp server (which mimics the OpenAI API), you can set OPENAI_API_KEY to "local" or any placeholder string, since base_url will direct requests to the local server.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
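- If you are working inside a notebook instead of a shell, you can set the same variables from Python. A minimal sketch; replace the placeholder Tavily key with your own.
import os

# Tavily key enables web search; the OpenAI key is only a placeholder for the local server
os.environ["TAVILY_API_KEY"] = "your_api_key_here"
os.environ["OPENAI_API_KEY"] = "local"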
- Install the essential libraries: LangGraph for agents, tavily-python for the Tavily search tool, and the various LangChain packages for LLM interactions and tools.
%%capture
!pip install -U \
langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai
- Create a ChatOpenAI client from LangChain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",
    temperature=0.6,
    base_url="http://localhost:8000/v1",
)
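- As a quick sanity check, you can send a single prompt through the client before wiring up the agent. A minimal sketch, assuming the server from the previous section is still running:
# Ask the local model a simple question and print the reply
reply = llm.invoke("In one sentence, what is llama.cpp?")
print(reply.content)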
- Define the tools that your agent will use.
- TavilySearchResults: allows the agent to search the web.
- PythonREPLTool: provides a Python read-eval-print loop (REPL) so the agent can write and execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool
search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool = PythonREPLTool()
tools = [search_tool, code_tool]
- To create the agent with the LLM and the specified tools, use LangGraph's prebuilt create_react_agent function, which implements the reason-and-act (ReAct) framework.
from langgraph.prebuilt import create_react_agent
agent = create_react_agent(
    model=llm,
    tools=tools,
)
3. Testing the AI Agent with Example Questions
Now, we will test the AI agent and also show which tools the agent used.
- This helper function extracts the names of the tools used from the conversation history. It is useful for understanding the agent's decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        # Tool calls may live on the message object or inside its dict representation
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        # Collect the tool name from either the flat or the OpenAI-style nested format
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
- Define a function that runs the agent with a given question and returns the tools used along with the final answer.
def run_agent(question: str):
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer
- Let’s ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:
1. **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday, and the retrieval of a Thai hostage's body.
2. **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3. **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4. **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5. **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.
- Let’s ask the agent to write and execute code for the Fibonacci series. It should use the Python_REPL tool.
tools, answer = run_agent(
"Write a code for the Fibonacci series and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['Python_REPL']
The Fibonacci series up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
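The exact code the agent writes varies from run to run, but the snippet it executes in the Python REPL typically looks something like this sketch:
# A typical Fibonacci snippet the agent might generate and run in the REPL
def fibonacci(n: int) -> list[int]:
    series = [0, 1]
    while len(series) < n:
        series.append(series[-1] + series[-2])
    return series[:n]

print(fibonacci(10))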
Final Thoughts
In this guide, I used a small quantized LLM, which sometimes struggles with accuracy, especially when it comes to selecting tools. If your goal is to build production-ready AI agents, I strongly recommend running the latest, full-sized models with llama.cpp. Larger and more recent models usually provide better results and more reliable output.
It is important to note that llama.cpp can be more difficult to work with than user-friendly tools like Ollama. However, if you are willing to take the time to debug, optimize, and tailor llama.cpp to your specific hardware, the performance benefits and flexibility are worth it.
One of the biggest advantages of llama.cpp is its efficiency: you don’t need high-end hardware to get started. It runs well on a regular CPU and a laptop without a dedicated GPU, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU from a cloud provider.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in technology management and a bachelor’s degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.