How to Protect Sensitive Data by Running LLMs Locally with Ollama

by SkillAiNest

When engineers build AI-powered applications, handling sensitive data is always a top priority. You don’t want to send user data to an external API that you don’t control.

For me, it came up while I was building FinanceGPT, my open source personal finance project. The application lets you upload your bank statements, tax forms like 1099s, and so on, and then ask questions in plain English like, “How much did I spend on groceries this month?” or “What was my effective tax rate last year?”

The problem is that answering these questions meant sending the entire sensitive transaction history, W-2s, and income data to OpenAI, Anthropic, or Google, which I wasn’t comfortable with. Even after redacting the PII from those documents, I wasn’t happy with the trade-off.

This is where Ollama comes in. Ollama lets you run large language models entirely on your laptop. You don’t need API keys or cloud infrastructure, and no data leaves your machine.

In this tutorial, I’ll explain what Ollama is, how to get started with it, and how to use it in a real Python application so that users of the application can choose to keep their data completely local.

Prerequisites

You will need at least the following:

  * A machine with around 8 GB of RAM (enough for the 7B-parameter models used below)

  * Python 3 installed

  * Basic familiarity with the command line

What is Ollama?

Ollama is an open source tool that makes it very easy to run LLMs locally. You can think of it as Docker, but for AI models. You can pull models with a single command, and Ollama handles everything else: downloading weights, managing memory, and serving the model through a local REST API.

The local REST API is compatible with OpenAI’s API format, which means any application that can talk to OpenAI can switch to Ollama with almost no code changes.

Installation

The first thing you’ll need to do is download the installer from ollama.com. Once installed, you can verify that it is running:

ollama --version

The above command checks that Ollama was installed correctly and prints the current version.

Pull and Run Your First Model

Ollama hosts a variety of models at ollama.com/library. To pull a model and immediately start chatting with it, just run:

ollama run llama3.2

This command downloads the model from Ollama’s registry and starts an interactive chat session with it. Note: the download will be a few GB, depending on which model you choose. Alternatively, if you only want to download a specific model:

ollama pull mistral

This downloads a model to your machine without starting a chat session, which is useful when you want to pre-download models.

You can run the following command to list your installed models.

ollama list

It shows all the models you have downloaded locally along with their sizes.

I have used the following models, and they have worked very well for specific tasks:

Model          | Size  | Good for
llama3.2       | ~2 GB | Fast, general purpose
mistral        | ~4 GB | Following strict instructions
qwen2.5:7b     | ~4 GB | Multilingual tasks, reasoning
deepseek-r1:7b | ~4 GB | Complex reasoning tasks

How Ollama’s API Works

Once Ollama is running, it serves an API at localhost:11434. You can call it directly using curl:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "What is compound interest?" }],
  "stream": false
}'

This sends a chat message to Ollama’s REST API from the command line, with streaming disabled so you get the full response at once. The /api/chat endpoint above is Ollama’s native chat API. Ollama also exposes an OpenAI-compatible endpoint at /v1, which is the key feature that makes it easy to drop into existing apps built on OpenAI or other LLMs.
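When `"stream"` is set to true instead, Ollama returns newline-delimited JSON chunks, each carrying a fragment of the reply in `message.content`. Here is a minimal sketch of accumulating those chunks into a full reply; the sample chunks below are illustrative, not real model output:

```python
import json

def accumulate_stream(ndjson_lines):
    """Concatenate the content fragments from Ollama's streamed NDJSON chunks."""
    reply = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        # Each chunk carries a piece of the assistant message;
        # the final chunk is marked with "done": true.
        reply.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(reply)

# Illustrative chunks shaped like Ollama's streaming output
sample = [
    '{"message": {"role": "assistant", "content": "Compound interest is "}, "done": false}',
    '{"message": {"role": "assistant", "content": "interest on interest."}, "done": true}',
]
print(accumulate_stream(sample))
```

In a real application you would feed this function the lines of a streamed HTTP response instead of the hard-coded samples.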

How to Call Ollama from Python

How to use the Ollama Python library

Ollama has its own Python library, which is quite intuitive to use. Install it first:

pip install ollama

Then call it:

from ollama import chat

response = chat(
    model="llama3.2",
    messages=[
        {'role': 'user', 'content': 'Explain what a Roth IRA is in simple terms.'}
    ]
)

print(response.message.content)

The above code uses Ollama’s native Python SDK to send a message and print the model’s response, which is the simplest way to call Ollama from Python.

How to Use the OpenAI SDK with Ollama as a Backend

As mentioned earlier, Ollama exposes an endpoint that is compatible with OpenAI, so you can also use the OpenAI Python SDK and point it at your local server:

pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key='ollama',  # Required by the SDK, but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {'role': 'user', 'content': 'Explain what a Roth IRA is in simple terms.'}
    ]
)

print(response.choices[0].message.content)

This uses the standard OpenAI Python SDK but points it at your local Ollama server. The api_key field is required by the SDK but ignored by Ollama. This pattern lets existing applications adopt Ollama seamlessly: the code is almost identical to what you would write for OpenAI.

How to Integrate Ollama into a LangChain App

Most production applications are built with orchestration frameworks such as LangChain, which has native Ollama support. This means switching providers is just a one-line change.

Install the integration:

pip install langchain-ollama

How to create a chat model

from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.2")

response = llm.invoke("What is the difference between a W-2 and a 1099?")
print(response.content)

This creates a LangChain-compatible chat model backed by a local Ollama model, making it a one-line swap for ChatOpenAI.

Compare it to the OpenAI version and you’ll see that the interface is almost identical:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

How to Create an LLM-Provider Agnostic App

The real power comes from abstracting the LLM provider away. Applications like Perplexity let users choose the LLM they want to use for their tasks. Here’s a simple factory pattern that returns the correct LLM based on the configuration:

from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama
from langchain_anthropic import ChatAnthropic

def get_llm(provider: str, model: str):
    """
    Return the appropriate LangChain LLM based on the provider.
    
    Args:
        provider: One of "openai", "ollama", "anthropic"
        model: The model name (e.g. "gpt-4o", "llama3.2", "claude-3-5-sonnet")
    
    Returns:
        A LangChain chat model ready to use
    """
    if provider == "openai":
        return ChatOpenAI(model=model)
    elif provider == "ollama":
        return ChatOllama(model=model)
    elif provider == "anthropic":
        return ChatAnthropic(model=model)
    else:
        raise ValueError(f"Unknown provider: {provider}")

The snippet above shows a helper that returns the correct LangChain model based on the provider string, so the rest of your app doesn’t need to know which LLM is running underneath.

Now the rest of your code doesn’t need to know which provider the LLM is running on. This includes your chains, your agents, and your tools: you pass llm around and it just works.
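One common way to drive the factory is from environment variables, so switching providers is purely a configuration change. Here is a minimal sketch; the variable names LLM_PROVIDER and LLM_MODEL and the helper name are illustrative assumptions:

```python
import os

def load_llm_config(default_provider="ollama", default_model="llama3.2"):
    """Read the LLM provider and model from the environment.

    Falls back to fully local defaults when nothing is configured,
    so the privacy-preserving path is the default path.
    """
    provider = os.environ.get("LLM_PROVIDER", default_provider)
    model = os.environ.get("LLM_MODEL", default_model)
    return provider, model

provider, model = load_llm_config()
# llm = get_llm(provider, model)  # hand off to the factory above
```

With nothing set in the environment, the app stays entirely local; setting two variables flips it to a cloud provider without touching application code.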

How to use Olama with LangGraph

If you’re using LangGraph to build agents (as I covered in my previous article on AI agents), plugging in Ollama is just as smooth:

from langgraph.prebuilt import create_react_agent
from langchain_ollama import ChatOllama
from langchain_core.tools import tool

@tool
def get_spending_summary(category: str) -> str:
    """Get total spending for a given category this month."""
    # In a real app, this would query your database
    return f"You spent $342.50 on {category} this month."

llm = ChatOllama(model="llama3.2")

agent = create_react_agent(
    model=llm,
    tools=[get_spending_summary]
)

response = agent.invoke({
    "messages": [{"role": "user", "content": "How much did I spend on groceries?"}]
})

print(response["messages"][-1].content)

This snippet creates a ReAct agent that uses a locally running model to decide when to call tools, keeping all data on the device throughout the agent’s workflow.

When needed, the agent decides to call get_spending_summary and gets results using a model that runs locally, instead of sending your data over the internet to OpenAI.

How FinanceGPT Uses It in Practice

FinanceGPT is built to support OpenAI, Anthropic, Google, and Ollama as LLM providers. The user sets their preference in the UI or in a configuration file, and the application instantiates the correct model using exactly the factory pattern above.

When a user selects Ollama, here’s what happens:

  1. Their bank statements and other sensitive documents are parsed locally.

  2. Sensitive fields such as SSNs are masked before any LLM call.

  3. The masked data and query go to the local Ollama server running on their own machine.

  4. The response is returned locally and nothing ever leaves their network.
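Step 2 above can be illustrated with a small masking helper. This is a minimal sketch using a regex for US Social Security numbers, not FinanceGPT’s actual implementation:

```python
import re

# Matches SSN-shaped values like 123-45-6789
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssns(text: str) -> str:
    """Replace SSN-shaped values before the text reaches any LLM."""
    return SSN_PATTERN.sub("***-**-****", text)

doc = "Employee SSN: 123-45-6789, wages: $55,000"
print(mask_ssns(doc))  # Employee SSN: ***-**-****, wages: $55,000
```

A production implementation would cover more PII types (account numbers, addresses, names), but the shape is the same: mask locally, then query the model.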

To run FinanceGPT locally with Ollama, the setup looks like this:

# 1. Pull a capable model
ollama pull llama3.2

# 2. Clone and configure FinanceGPT
git clone 
cd FinanceGPT
cp .env.example .env

# 3. In .env, set your LLM provider to Ollama
# LLM_PROVIDER=ollama
# LLM_MODEL=llama3.2

# 4. Start the full stack
docker compose -f docker-compose.quickstart.yml up -d

With this setup, the entire application, including the front end, back end, and LLM, runs on your own hardware.

Be aware of trade-offs

Ollama is a great local alternative to cloud LLMs, but it comes with its own set of trade-offs.

Response quality

Models you run locally with Ollama are typically 7B-parameter models, so by design they will not match GPT-4o on complex reasoning tasks. For simple question-answering and summarization, the results are comparable, but for multi-step reasoning or critical judgment calls, the difference is significant.

Speed

Inference speed depends on the hardware running the model. Without a GPU, local models can take several seconds to respond. On Apple Silicon (M1/M2/M3), performance is surprisingly good even without a dedicated GPU.

Hardware requirements

Smaller models (7B parameters) need around 8 GB of RAM, while larger models (13B+) need 16 GB or more. If you are building your application for end users, you cannot guarantee they have the hardware.

Tool usage and function calling

Not all local models support reliable function calling. If your agent relies heavily on tool usage, evaluate your chosen model carefully. Models like qwen2.5 and mistral generally handle it better than others.

The right mental model: use cloud models when you need maximum capability, and use local models when privacy or cost constraints make the cloud impractical.

Conclusion

In this tutorial, you learned what Ollama is, how to install it and pull models, and three different ways to call it from Python: the native Ollama library, the OpenAI-compatible SDK, and LangChain. You also saw how to create a provider-agnostic factory pattern so that your app can switch between cloud and local models with a single configuration change.

Ollama makes local LLMs truly practical for production apps. The OpenAI-compatible API means integration is almost zero-friction, and native LangChain support means you can build provider-agnostic apps from the start.

The finance domain is an obvious fit, but the same principle applies anywhere sensitive data is involved: healthcare, legal tech, HR, personal productivity. If your app processes data that users wouldn’t want stored on someone else’s server, giving them a local option isn’t just a nice-to-have. It’s a trust feature.

Check out FinanceGPT.

All code examples in this article come from FinanceGPT. If you want to see these samples in a full app, poke around the repo. It has document processing, portfolio tracking, and tax optimization, all built with LangGraph.

If you found this useful, give the project a star on GitHub. It helps other developers discover it.
