Vibe coding a private AI financial analyst with Python and local LLMs

by SkillAiNest

Photo by author

# Introduction

Last month, I found myself staring at my bank statement, trying to figure out where my money was actually going. Spreadsheets felt cumbersome, existing apps are black boxes, and worst of all, they demand that I upload my sensitive financial data to a cloud server. I wanted something different: an AI data analyst that could analyze my expenses, identify unusual transactions, and give me clear insights, all while keeping my data 100% local. So I built one.

What started as a weekend project turned into a deep dive into real-world data preprocessing, practical machine learning, and the power of local large language models (LLMs). In this article, I'll walk you through how I built an AI-powered financial analysis app in Python with "vibe coding". Along the way, you'll learn practical concepts that apply to any data science project, whether you're analyzing sales logs, sensor data, or customer feedback.

By the end, you will understand:

  • How to build a robust data preprocessing pipeline that handles messy, real-world CSV files
  • How to choose and implement machine learning models when you have limited training data
  • How to design interactive visualizations that actually answer user questions
  • How to integrate a local LLM to generate natural-language insights without sacrificing privacy

The full source code is available at GitHub. Feel free to fork it, extend it, or use it as a starting point for your own AI data analyzer.

Figure 1: App dashboard showing breakdown of spend and AI insights | Photo by author

# The problem: why I built it

Most personal finance apps share one fundamental flaw: your data leaves your control. You upload bank statements to services that store, process, and potentially monetize your information. I wanted a tool that:

  1. Lets me upload and analyze data instantly
  2. Processes everything locally, with no cloud and no data leaks
  3. Delivers AI-powered insights, not just static charts

This project became my vehicle for learning several concepts that every data scientist should know, such as handling heterogeneous data formats, choosing algorithms that work with small data sets, and building privacy-preserving AI features.

# Project Architecture

Before diving into the code, here’s the project structure showing how the pieces fit together:


project/   
  ├── app.py              # Main Streamlit app
  ├── config.py           # Settings (categories, Ollama config)
  ├── preprocessing.py    # Auto-detect CSV formats, normalize data
  ├── ml_models.py        # Transaction classifier + Isolation Forest anomaly detector
  ├── visualizations.py   # Plotly charts (pie, bar, timeline, heatmap)
  ├── llm_integration.py  # Ollama streaming integration
  ├── requirements.txt    # Dependencies
  ├── README.md           # Documentation with "deep dive" lessons
  └── sample_data/
    ├── sample_bank_statement.csv
    └── sample_bank_format_2.csv

We'll build each layer step by step.

# Step 1: Building a robust data preprocessing pipeline

The first lesson I learned was that real-world data is messy. Different banks export CSVs in completely different formats. Chase Bank uses “transaction date” and “amount”. Bank of America uses “Date”, “Payee” and separate “Debit” columns. Moniepoint and OPay each have their own styles.

A preprocessing pipeline should handle these differences automatically.

// Automatic detection of column mappings

I created a pattern-matching system that identifies columns regardless of naming convention. Using regular expressions, we can map varied column names to standard fields.

import re

COLUMN_PATTERNS = {
    "date": (r"date", r"trans.*date", r"posting.*date"),
    "description": (r"description", r"memo", r"payee", r"merchant"),
    "amount": (r"^amount$", r"transaction.*amount"),
    "debit": (r"debit", r"withdrawal", r"expense"),
    "credit": (r"credit", r"deposit", r"income"),
}

def detect_column_mapping(df):
    mapping = {}
    for field, patterns in COLUMN_PATTERNS.items():
        for col in df.columns:
            if any(re.search(p, col.lower()) for p in patterns):
                mapping[field] = col
                break  # first matching column wins
    return mapping

Key insight: Design for differences, not specific formats. This method works for any CSV that uses common financial terms.
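As a quick sanity check, here is the detector applied to two invented export styles (the column names below are made up for illustration, not real bank formats):

```python
import re

import pandas as pd

COLUMN_PATTERNS = {
    "date": (r"date", r"trans.*date", r"posting.*date"),
    "description": (r"description", r"memo", r"payee", r"merchant"),
    "amount": (r"^amount$", r"transaction.*amount"),
    "debit": (r"debit", r"withdrawal", r"expense"),
    "credit": (r"credit", r"deposit", r"income"),
}

def detect_column_mapping(df):
    mapping = {}
    for field, patterns in COLUMN_PATTERNS.items():
        for col in df.columns:
            if any(re.search(p, col.lower()) for p in patterns):
                mapping[field] = col
                break  # first matching column wins
    return mapping

# Two fictional export styles
chase_like = pd.DataFrame(columns=["Transaction Date", "Description", "Amount"])
boa_like = pd.DataFrame(columns=["Date", "Payee", "Debit", "Credit"])

print(detect_column_mapping(chase_like))
# {'date': 'Transaction Date', 'description': 'Description', 'amount': 'Amount'}
print(detect_column_mapping(boa_like))
# {'date': 'Date', 'description': 'Payee', 'debit': 'Debit', 'credit': 'Credit'}
```

Note that the second format maps no `amount` field at all; the normalization step below is what turns its separate debit/credit columns into one.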

// Standard schema normalization

After detecting the columns, we normalize everything into one consistent structure. For example, banks that split debits and credits need them merged into a single amount column (negative for expenses, positive for income):

if "debit" in mapping and "credit" in mapping:
    debit = df[mapping["debit"]].apply(parse_amount).abs() * -1
    credit = df[mapping["credit"]].apply(parse_amount).abs()
    normalized["amount"] = credit + debit

Key takeaway: Normalize your data as early as possible. It simplifies every downstream operation: feature engineering, machine learning modeling, and visualization.
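The normalization snippet above calls a `parse_amount` helper that isn't shown. A minimal version (my own sketch, not necessarily the project's implementation) might handle currency symbols, thousands separators, empty cells, and accounting-style parenthesized negatives:

```python
import re

def parse_amount(value):
    """Parse a raw amount cell into a float.

    Handles '$1,234.56', '(45.00)' (accounting-style negative),
    and empty/missing cells (treated as 0.0).
    """
    if value is None:
        return 0.0
    text = str(value).strip()
    if not text:
        return 0.0
    # Accounting convention: parentheses mean a negative amount
    negative = text.startswith("(") and text.endswith(")")
    # Strip everything except digits, sign, and decimal point
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    if not cleaned:
        return 0.0
    amount = float(cleaned)
    return -amount if negative else amount

print(parse_amount("$1,234.56"))  # 1234.56
print(parse_amount("(45.00)"))    # -45.0
```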

Figure 2: The preprocessing report shows what the pipeline detected, giving users transparency | Photo by author

# Step 2: Choosing machine learning models for limited data

Another major challenge is limited training data. Users upload their own statements, and there is no large labeled dataset for training a deep learning model. We need algorithms that work well with small samples and can be extended with simple rules.

// Categorizing transactions: a hybrid approach

Instead of pure machine learning, I built a hybrid system:

  1. Rule-based matching for confident cases (e.g. keywords like "WALMART" → groceries)
  2. Pattern-based fallback for ambiguous transactions

SPENDING_CATEGORIES = {
    "groceries": ("walmart", "costco", "whole foods", "kroger"),
    "dining": ("restaurant", "starbucks", "mcdonald", "doordash"),
    "transportation": ("uber", "lyft", "shell", "chevron", "gas"),
    # ... more categories
}

def classify_transaction(description, amount):
    for category, keywords in SPENDING_CATEGORIES.items():
        if any(kw in description.lower() for kw in keywords):
            return category
    return "income" if amount > 0 else "other"

This approach works instantly without any training data, and it's easy for users to understand and customize.
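Since the classifier is just a keyword lookup, a few example calls show the behavior (the merchant strings are invented for illustration):

```python
SPENDING_CATEGORIES = {
    "groceries": ("walmart", "costco", "whole foods", "kroger"),
    "dining": ("restaurant", "starbucks", "mcdonald", "doordash"),
    "transportation": ("uber", "lyft", "shell", "chevron", "gas"),
}

def classify_transaction(description, amount):
    # First matching category wins; positive amounts default to income
    for category, keywords in SPENDING_CATEGORIES.items():
        if any(kw in description.lower() for kw in keywords):
            return category
    return "income" if amount > 0 else "other"

print(classify_transaction("WALMART SUPERCENTER #123", -54.20))  # groceries
print(classify_transaction("UBER TRIP 8XKQ", -13.75))            # transportation
print(classify_transaction("PAYROLL DEPOSIT", 2500.00))          # income
print(classify_transaction("RANDOM SHOP", -9.99))                # other
```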

// Anomaly detection: why Isolation Forest?

To detect unusual spending, I needed an algorithm that:

  1. Works with small datasets (unlike deep learning)
  2. Makes no assumptions about the data's distribution (unlike purely statistical methods such as the z-score)
  3. Provides fast predictions for an interactive UI

Isolation Forest ticked all the boxes. It isolates anomalies by randomly partitioning the data: anomalies are few and different, so fewer splits are needed to isolate them.

from sklearn.ensemble import IsolationForest

detector = IsolationForest(
    contamination=0.05,  # Expect ~5% anomalies
    random_state=42
)
detector.fit(features)
predictions = detector.predict(features)  # -1 = anomaly
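The snippet above assumes a `features` matrix that isn't shown. One plausible construction, purely my assumption about what a transaction feature set could look like, uses the amount plus simple date-derived features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

def build_features(df):
    """Build a numeric feature matrix from a normalized transaction frame.

    Assumes 'amount' (float) and 'date' (datetime-like) columns exist.
    """
    dates = pd.to_datetime(df["date"])
    return np.column_stack([
        df["amount"].abs(),   # size of the transaction
        dates.dt.dayofweek,   # weekday spending patterns
        dates.dt.day,         # position within the month
    ])

# Hypothetical data: mostly small expenses plus one huge charge
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=50, freq="D"),
    "amount": [-20.0] * 49 + [-5000.0],
})
features = build_features(df)

detector = IsolationForest(contamination=0.05, random_state=42)
predictions = detector.fit_predict(features)  # -1 = anomaly
print(predictions[-1])  # the $5,000 charge is flagged as -1
```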

I also combined this with a simple z-score check to catch obvious outliers. A z-score describes the position of a raw score in terms of its distance from the mean, measured in standard deviations:
\(
z = \frac{x - \mu}{\sigma}
\)
The combined approach captures more anomalies than either method alone.
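A minimal sketch of the z-score check (the 3-sigma threshold here is my own choice, not necessarily the app's):

```python
import numpy as np

def zscore_outliers(amounts, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    amounts = np.asarray(amounts, dtype=float)
    sigma = amounts.std()
    if sigma == 0:
        return np.zeros(len(amounts), dtype=bool)
    z = (amounts - amounts.mean()) / sigma
    return np.abs(z) > threshold

# 30 ordinary expenses plus one extreme charge
amounts = [-20.0] * 30 + [-5000.0]
flags = zscore_outliers(amounts)
print(flags[-1])  # True: only the -5000 charge is flagged
```

A natural way to combine the two detectors is a logical OR: a transaction is anomalous if either the Isolation Forest or the z-score check flags it.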

Key takeaway: Sometimes simple, well-chosen algorithms outperform complex ones, especially when you have limited data.

Figure 3: Anomaly detector flags unusual transactions, which are highlighted in the timeline | Photo by author

# Step 3: Designing visualizations that answer questions

Visualizations should answer questions, not just show data. I used Plotly for interactive charts because it lets users explore the data themselves. Here are the design principles I followed:

  1. Consistent color coding: red for expenses, green for income
  2. Context through comparison: show income versus expenses side by side
  3. Progressive disclosure: show a summary first, then let users drill down

For example, a spending breakdown uses a donut chart with a hole in the middle for a neater look:

import plotly.express as px

fig = px.pie(
    category_totals,
    values="Amount",
    names="Category",
    hole=0.4,
    color_discrete_map=CATEGORY_COLORS
)
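The `category_totals` frame passed to `px.pie` isn't shown above; it can be derived from the categorized transactions with a pandas groupby (the column names here are assumed to match the chart code):

```python
import pandas as pd

# Hypothetical categorized transactions (expenses are negative)
transactions = pd.DataFrame({
    "Category": ["groceries", "dining", "groceries", "transportation"],
    "Amount": [-54.20, -13.75, -30.00, -22.50],
})

# Aggregate absolute spend per category; this is what the donut chart plots
category_totals = (
    transactions.assign(Amount=transactions["Amount"].abs())
    .groupby("Category", as_index=False)["Amount"]
    .sum()
)
print(category_totals)
```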

Streamlit makes it easy to embed these charts with st.plotly_chart() and build a responsive dashboard.

Figure 4: Multiple chart types give users different perspectives on the same data. | Photo by author

# Step 4: Integrating a local large language model for natural-language insights

The final piece was generating human-readable insights. I chose to integrate Ollama, a tool for running LLMs locally. Why local instead of calling OpenAI or Claude?

  1. Privacy: Bank data never leaves the machine.
  2. Cost: Unlimited queries, zero API fees
  3. Speed: No network latency (although generation still takes a few seconds)

// Streaming for better user experience

LLMs can take several seconds to generate a response. Displaying tokens as they arrive makes the wait feel shorter. Here is a simple implementation using requests with streaming:

import requests
import json

def generate(base_url, prompt):
    """Yield response tokens from a local Ollama server as they stream in."""
    response = requests.post(
        f"{base_url}/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": True},
        stream=True,
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")

In Streamlit, you can display the stream with st.write_stream():

st.write_stream(llm.get_overall_insights(df))

// Prompt engineering for financial data

The key to useful LLM output is a structured prompt that includes real data. For example:

prompt = f"""Analyze this financial summary:
- Total Income: ${income:,.2f}
- Total Expenses: ${expenses:,.2f}
- Top Category: {top_category}
- Largest Anomaly: {anomaly_desc}

Provide 2-3 actionable recommendations based on this data."""

This gives the model concrete numbers to work with, leading to more relevant insights.
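For completeness, here is a sketch of how those summary values might be computed from the normalized frame before formatting the prompt (the column names and sample data are my own assumptions):

```python
import pandas as pd

# Hypothetical normalized transactions
df = pd.DataFrame({
    "amount": [2500.00, -54.20, -13.75, -300.00],
    "category": ["income", "groceries", "dining", "rent"],
    "description": ["PAYROLL", "WALMART", "STARBUCKS", "APARTMENT LLC"],
})

income = df.loc[df["amount"] > 0, "amount"].sum()
expenses = df.loc[df["amount"] < 0, "amount"].abs().sum()
# Expense category with the largest total absolute spend
top_category = df[df["amount"] < 0].groupby("category")["amount"].sum().abs().idxmax()
anomaly_desc = "APARTMENT LLC ($300.00)"  # in the app, this comes from the anomaly detector

prompt = f"""Analyze this financial summary:
- Total Income: ${income:,.2f}
- Total Expenses: ${expenses:,.2f}
- Top Category: {top_category}
- Largest Anomaly: {anomaly_desc}

Provide 2-3 actionable recommendations based on this data."""
print(prompt)
```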

Figure 5: The upload interface is simple: choose a CSV and let the AI do the rest. | Photo by author

// Run the application

Getting started is straightforward. With Python installed, run:

pip install -r requirements.txt

# Optional, for AI insights
ollama pull llama3.2

streamlit run app.py

Upload any bank CSV (the app detects the format automatically), and within seconds, you’ll see a dashboard with categorized transactions, anomalies, and AI-generated insights.

# The result

This project taught me that creating something functional is just the beginning. The real learning happened when I asked why each piece worked:

  • Why auto-detect columns? Because real-world data doesn't follow your schema. A flexible pipeline saves hours of manual cleaning.
  • Why Isolation Forest? Because small datasets require algorithms designed for them. You don't always need deep learning.
  • Why a local LLM? Because privacy and cost matter in production. Running models locally is now practical and powerful.

These lessons apply far beyond personal finance, whether you’re analyzing sales data, server logs, or scientific measurements. The same principles of robust preprocessing, functional modeling, and privacy-aware AI will serve you well in any data project.

The full source code is available on GitHub. Fork it, extend it, and make it your own. If you make something cool with it, I’d love to hear about it.

// About the author

Shittu Olamide is a software engineer and technical writer with a knack for simplifying complex concepts and a keen eye for detail, passionate about leveraging modern technology to craft compelling narratives. You can find Shittu on Twitter.
