How to ship a production-ready RAG app with FAISS (Guardrails, Evals, and Fallbacks)

by SkillAiNest

Most LLM applications look great in high-fidelity demos. Then they hit the hands of real users and start failing in very predictable but harmful ways.

They answer questions they shouldn’t, they break when document retrieval is weak, they time out due to network latency, and no one can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval-Augmented Generation (RAG) application designed to survive real production conditions. It’s not just a script that calls an API. It is a system that includes a FastAPI backend, a persistent FAISS vector store, and the necessary guardrails (including retrieval gates and fallbacks).

Table of Contents

  1. Why RAG alone is not production-ready

  2. The architecture you are building

  3. Project setup and structure

  4. How to create a RAG layer with FAISS

  5. How to Add an LLM Call with Structured Output

  6. How to Add Guardrails: Retrieval Gates and Fallbacks

  7. FastAPI App: Creating the /answer Endpoint

  8. How to Add Beginner-Friendly Evals

  9. What to improve next: Realistic upgrades

Why RAG alone is not production-ready

Retrieval-Augmented Generation (RAG) is often referred to as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to answer correctly. But simply wiring a vector database to an LLM is not enough for a production environment.

Production problems usually arise from silent failures in the system around the model:

  • Weak retrieval: If the app retrieves irrelevant pieces of text, the model tries to close the gap by inventing an answer anyway. Without a designated “I don’t know” path, the model is essentially forced to hallucinate.

  • Lack of visibility: Without structured outputs and basic logging, you can’t tell if a bad fetch, a confusing prompt, or a model update caused an incorrect response.

  • Fragility: If you don’t implement fallbacks, a simple API timeout or incorrect provider response becomes a user-facing outage.

  • No regression testing: In traditional software, we have unit tests. In AI, we need evals. Without them, one small prompt tweak might solve one problem but break ten others without you even realizing it.

We will address each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner friendly, but assumes you have some basics down so you can focus on building a robust RAG system instead of getting bogged down by setup issues.

Knowledge

You should be comfortable with:

  • Python basics (functions, modules, virtual environments)

  • Basic HTTP + JSON (requests, response payloads)

  • APIs with FastAPI (what an endpoint is and how to run a server)

  • Basic LLM concepts (prompting, temperature, structured output)

Tools + Accounts

You will need:

  • Python 3.10+

  • An OpenAI-compatible API key (OpenAI or any provider that supports the same request/response format)

  • A local environment where you can run the FastAPI app (Mac/Linux/Windows)

What this tutorial covers (and doesn’t cover)

We will create a production-minded baseline:

  • FAISS-backed retrieval with a persistent index + metadata

  • A retrieval gate to prevent “forced hallucinations”

  • Structured JSON output so your backend stays stable

  • Fallback behavior for timeouts and provider errors

  • A small eval harness to prevent regressions

We will not implement advanced upgrades such as rerankers, semantic chunking, authentication, or background jobs beyond the final roadmap.

The architecture you are building

Our application flow follows a disciplined path so every answer is evidence-based:

  1. User query: The user sends a query through the FastAPI endpoint.

  2. Retrieve: The system embeds the query and retrieves the top-k matching document chunks.

  3. Retrieval gate: We evaluate the top similarity score. If the context isn’t relevant enough, we immediately stop and refuse the query.

  4. Augment and generate: If the gate passes, we send the context-augmented prompt to the LLM.

  5. Structured response: The model returns a JSON object that contains the answer, the sources used, and a confidence level.
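
The five steps above can be sketched end to end in a few lines. This is an illustrative sketch only: `answer`, `fake_retrieve`, and `fake_llm` are stand-ins invented for this example, not part of the tutorial's actual code.

```python
# A minimal sketch of the request flow, with stubbed retrieval and LLM
# calls so the gating logic is visible end to end.
THRESHOLD = 0.30

def answer(question, retrieve, call_llm):
    results = retrieve(question)                       # 2) retrieve top-k chunks
    top_score = results[0]["score"] if results else 0.0
    if top_score < THRESHOLD:                          # 3) retrieval gate
        return {"answer": "I do not have documents to answer that question.",
                "refusal": True, "sources": []}
    context = "\n\n".join(r["text"] for r in results)  # 4) augment the prompt
    return call_llm(context, question)                 # 5) generate

def fake_retrieve(q):
    # Pretend the index found one highly relevant chunk
    return [{"score": 0.91, "text": "FAISS stores embedding vectors in an index."}]

def fake_llm(context, question):
    # Pretend the model answered from the provided context
    return {"answer": "FAISS is a vector similarity search library.",
            "refusal": False, "sources": ["doc.txt"]}

print(answer("What is FAISS?", fake_retrieve, fake_llm))
print(answer("Unrelated question?", lambda q: [{"score": 0.05, "text": "x"}], fake_llm))
```

The key design point: the gate runs before any LLM call, so irrelevant queries are refused cheaply and deterministically.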

Project setup and structure

To keep things organized and maintainable, we’ll use a modular structure. This allows you to change your LLM provider or your vector database without rewriting your entire core application.

Project structure

.
├── app.py              # FastAPI entry point and API logic
├── rag.py              # FAISS index, persistence, and document retrieval
├── llm.py              # LLM API interface and JSON parsing
├── prompts.py          # Centralized prompt templates
├── data/               # Source .txt documents
├── index/              # Persisted FAISS index and metadata
└── evals/              # Evaluation dataset and runner script
    ├── eval_set.json
    └── run_evals.py

Install dependencies.

First, create a virtual environment to isolate your project:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the environment.

Make a .env file in the root directory. We are targeting OpenAI-compatible providers (the base URL below is OpenAI’s default; change it for other providers):

OPENAI_API_KEY=your_actual_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-mini

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you may need to change the URL, the headers (e.g. X-API-Key), and the way you extract the embeddings and the final message content in embed_texts() and call_llm().

How to create a RAG layer with FAISS

In rag.py, we handle the “retriever” portion of RAG. This involves converting raw text into mathematical vectors that a computer can compare.

What is FAISS (and what does it do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In the RAG system, each piece of text becomes an embedding vector (a list of floats). FAISS stores these vectors in an index so you can quickly query:

“Which document chunks are closest to this question’s embedding?”

In this tutorial, we use IndexFlatIP (inner product) and normalize vectors with faiss.normalize_L2(...). With normalized vectors, the inner product behaves like cosine similarity, giving us a stable score we can use for the retrieval gate.
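
To see why the inner product of normalized vectors equals cosine similarity, here is a tiny NumPy check (independent of FAISS; the vectors are arbitrary examples):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Standard cosine similarity: dot product over the product of norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0], dtype="float32")
b = np.array([2.0, 0.5, 1.0], dtype="float32")

# L2-normalize, which is what faiss.normalize_L2 does in place on a matrix
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

inner = float(np.dot(a_n, b_n))
assert abs(inner - cosine(a, b)) < 1e-6  # identical up to float precision
```

This is why the gate threshold later in the tutorial can be interpreted as a cosine similarity score in the range [-1, 1].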

Chunking strategy with overlap

We will use chunking with overlap. If we split a document into exactly 1,000 characters, we can cut a sentence in half, losing its meaning. By using an overlap of, for example, 200 characters, we ensure that the end of one chunk and the start of the next share context.
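
The effect is easiest to see with tiny sizes. The function below mirrors the chunk_text logic from rag.py, shrunk to a 10-character chunk with a 4-character overlap:

```python
def chunk_text(text, size=10, overlap=4):
    # Each chunk starts `size - overlap` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    step = max(1, size - overlap)
    chunks = []
    for i in range(0, len(text), step):
        chunk = text[i:i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", size=10, overlap=4)
# step = 6, so e.g. 'abcdefghij' and 'ghijklmnop' share 'ghij';
# the final short tail 'yz' is kept as its own chunk.
print(chunks)
```

Notice the trailing short chunk: with this simple strategy the last window can be much smaller than `size`, which is usually harmless but worth knowing about.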

Implementation of rag.py

import os
import faiss
import numpy as np
import requests
import json
from typing import List, Dict
from dotenv import load_dotenv

load_dotenv()

INDEX_PATH = "index/faiss.index"
META_PATH = "index/meta.json"

def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    chunks = []
    step = max(1, size - overlap)
    for i in range(0, len(text), step):
        chunk = text[i : i + size].strip()
        if chunk:
            chunks.append(chunk)
    return chunks

def embed_texts(texts: List[str]) -> np.ndarray:
    # Note: If your provider is not OpenAI-compatible, change this URL and headers
    url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
    headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
    payload = {"input": texts, "model": "text-embedding-3-small"}

    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    # If your provider uses a different response format, change the line below
    vectors = np.array([item["embedding"] for item in resp.json()["data"]], dtype="float32")
    return vectors

def build_index() -> None:
    all_chunks: List[str] = []
    metadata: List[Dict] = []

    if not os.path.exists("data"):
        os.makedirs("data")
        return

    for file in os.listdir("data"):
        if not file.endswith(".txt"):
            continue

        with open(f"data/{file}", "r", encoding="utf-8") as f:
            text = f.read()

        chunks = chunk_text(text)
        all_chunks.extend(chunks)
        for c in chunks:
            metadata.append({"source": file, "text": c})

    if not all_chunks:
        return

    embeddings = embed_texts(all_chunks)
    faiss.normalize_L2(embeddings)

    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)

    os.makedirs("index", exist_ok=True)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False)

def load_index():
    if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
        raise FileNotFoundError(
            "FAISS index not found. Add .txt files to data/ and run build_index()."
        )

    index = faiss.read_index(INDEX_PATH)
    with open(META_PATH, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return index, metadata

def retrieve(query: str, k: int = 5) -> List[Dict]:
    index, metadata = load_index()

    q_emb = embed_texts([query])
    faiss.normalize_L2(q_emb)

    scores, ids = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], ids[0]):
        if idx == -1:
            continue
        m = metadata[idx]
        results.append(
            {"score": float(score), "source": m["source"], "text": m["text"], "id": int(idx)}
        )
    return results

How to Add an LLM Call with Structured Output

A major point of failure in AI apps is the “conversational” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash.

We deal with this using structured output: instruct the model to return a strict JSON object, then parse it safely.
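
The "parse it safely" half can be isolated into a small helper. This is an illustrative sketch (the name `parse_structured` is made up for this example); it applies the same defensive defaults that call_llm below uses, plus a fallback for non-JSON output:

```python
import json

def parse_structured(content: str) -> dict:
    # Parse model output defensively: if the model returns non-JSON text
    # or a JSON value that is not an object, fall back to a refusal shape.
    try:
        parsed = json.loads(content)
        if not isinstance(parsed, dict):
            raise ValueError("expected a JSON object")
    except (json.JSONDecodeError, ValueError):
        parsed = {"answer": "", "refusal": True}
    # Guarantee every key the backend relies on exists
    parsed.setdefault("answer", "")
    parsed.setdefault("refusal", False)
    parsed.setdefault("confidence", "low")
    parsed.setdefault("sources", [])
    return parsed

print(parse_structured('{"answer": "42", "refusal": false}'))
print(parse_structured("Sure! Here is your answer..."))  # non-JSON filler
```

Either way, the caller always gets a dict with the same four keys, so downstream code never needs to branch on "did parsing work".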

Implementation of llm.py

import json
import requests
import os
from typing import Dict, Any

def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
    # Note: Change URL/headers if using a non-OpenAI-compatible provider
    url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    }

    payload = {
        "model": os.getenv("OPENAI_MODEL"),
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,
    }

    try:
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        content = resp.json()["choices"][0]["message"]["content"]

        parsed = json.loads(content)
        parsed.setdefault("answer", "")
        parsed.setdefault("refusal", False)
        parsed.setdefault("confidence", "medium")
        parsed.setdefault("sources", [])
        return parsed

    except (requests.Timeout, requests.ConnectionError):
        return {
            "answer": "The system is temporarily unavailable (network issue). Please try again.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "network_error",
        }
    except Exception:
        return {
            "answer": "A system error occurred while generating the answer.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "unknown_error",
        }

How to Add Guardrails: Retrieval Gates and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent potential failures.

Retrieval Gate: How it works and how to add it

In the standard RAG pipeline, the system always calls the LLM. If the user asks an unrelated question, the retriever will still return the “closest” (but irrelevant) chunks.

The solution is the retrieval gate:

  1. Retrieve the top chunks and take the top similarity score

  2. If that score is below a threshold (e.g. 0.30), refuse immediately

  3. Call the LLM only when retrieval is strong enough to ground an answer

A threshold of 0.30 is a reasonable starting point when using normalized cosine matching, but you should tune it using evals (next section).
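
One beginner-friendly way to tune the threshold is to log (top_score, should_refuse) pairs from an eval run and sweep candidate thresholds offline. The sketch below is illustrative only: `best_threshold` and the sample scores are invented for the example, not part of the tutorial code.

```python
def best_threshold(samples, candidates=(0.2, 0.25, 0.3, 0.35, 0.4)):
    # samples: list of (top_score, should_refuse) pairs from an eval run.
    # The gate refuses when top_score < threshold; pick the threshold
    # that agrees with the expected refusal behavior most often.
    def accuracy(t):
        correct = sum(
            (score < t) == should_refuse
            for score, should_refuse in samples
        )
        return correct / len(samples)
    return max(candidates, key=accuracy)

# Made-up eval results: low scores should refuse, high scores should answer
samples = [(0.12, True), (0.22, True), (0.45, False), (0.61, False), (0.33, False)]
print(best_threshold(samples))
```

With real data you would use many more samples, but the principle holds: the threshold is a tunable parameter backed by measurements, not a magic constant.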

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep the format of your API responses consistent, which prevents front-end errors and makes logging meaningful.

In this tutorial, fallbacks are implemented inside call_llm(), so your FastAPI layer stays simple.
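
The same idea generalizes beyond a single function. Here is an illustrative sketch (`with_fallback` and `flaky` are made up for this example, not tutorial code) of a generic retry-then-fallback wrapper in the same spirit as the error handling inside call_llm:

```python
def with_fallback(fn, retries=1, fallback=None):
    # Try fn up to retries+1 times; if every attempt raises,
    # return the fallback value instead of propagating the error.
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                return fallback

calls = {"n": 0}

def flaky():
    # Simulates a provider that times out once, then recovers
    calls["n"] += 1
    if calls["n"] < 2:
        raise TimeoutError("simulated timeout")
    return {"answer": "ok", "refusal": False}

print(with_fallback(flaky, retries=1,
                    fallback={"answer": "unavailable", "refusal": True}))
```

The design choice worth noting: the fallback value has the same shape as a real response, so callers never see a special error type.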

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It combines retrieval, gating, prompting, and generation.

Implementation of app.py

from fastapi import FastAPI
from pydantic import BaseModel
from rag import retrieve
from llm import call_llm
import prompts
import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

app = FastAPI(title="Production-Ready RAG")

class QueryRequest(BaseModel):
    question: str

@app.post("/answer")
async def get_answer(req: QueryRequest):
    start_time = time.time()
    question = (req.question or "").strip()

    if not question:
        return {
            "answer": "Please provide a non-empty question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
        }

    # 1) Retrieval
    results = retrieve(question, k=5)
    top_score = results[0]["score"] if results else 0.0

    logger.info("query=%r top_score=%.3f num_results=%d", question, top_score, len(results))

    # 2) Retrieval gate (guardrail)
    if top_score < 0.30:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "latency_sec": round(time.time() - start_time, 2),
            "retrieval": {"top_score": top_score, "k": 5},
        }

    # 3) Augment
    context_text = "\n\n".join(f"Source {r['source']}: {r['text']}" for r in results)
    user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

    # 4) Generation with fallback
    response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

    # 5) Attach debug metadata
    response["latency_sec"] = round(time.time() - start_time, 2)
    response["retrieval"] = {"top_score": top_score, "k": 5}
    return response

Centralized Prompt Template: prompts.py

A small but important habit: keep prompts centralized so they’re repeatable and easy to test.

Example prompts.py

SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
If the context does not contain the answer, respond with refusal=true.

Return a valid JSON object with exactly these keys:
- answer: string
- refusal: boolean
- confidence: "low" | "medium" | "high"
- sources: array of strings (source filenames you used)

Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic. This makes testing more difficult than traditional software. Evals are a set of “golden questions” and “expected behaviors” that you run repeatedly to detect regressions.

Instead of “does it output exactly the same string,” you test:

  • Does the app refuse when retrieval is weak?

  • When it answers, does it include sources?

  • Is the behavior stable across prompt tweaks and model changes?

Step 1: Create evals/eval_set.json

It should contain both in-scope and out-of-scope cases.

[
  {
    "id": "in_scope_01",
    "question": "What is a retrieval gate and why is it important?",
    "expect_refusal": false,
    "notes": "Should explain gating and relate it to hallucination prevention."
  },
  {
    "id": "out_of_scope_01",
    "question": "What is the capital of France?",
    "expect_refusal": true,
    "notes": "If the knowledge base only includes our docs, the app should refuse."
  },
  {
    "id": "edge_01",
    "question": "",
    "expect_refusal": true,
    "notes": "Empty input should not call the LLM."
  }
]

Step 2: Create evals/run_evals.py

This runner calls your API endpoint (end-to-end) and tests the expected behavior.

import json
import requests

# Default local address when running `uvicorn app:app --reload`
API_URL = "http://127.0.0.1:8000/answer"

def run():
    with open("evals/eval_set.json", "r", encoding="utf-8") as f:
        cases = json.load(f)

    passed = 0
    failed = 0

    for case in cases:
        resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
        resp.raise_for_status()
        out = resp.json()

        got_refusal = bool(out.get("refusal", False))
        expect_refusal = bool(case["expect_refusal"])

        ok = (got_refusal == expect_refusal)

        # Beginner-friendly: if it answers, sources should exist and be a list
        if not got_refusal:
            ok = ok and isinstance(out.get("sources"), list)

        if ok:
            passed += 1
            print(f"PASS {case['id']}")
        else:
            failed += 1
            print(f"FAIL {case['id']} expected_refusal={expect_refusal} got_refusal={got_refusal}")
            print("Output:", json.dumps(out, indent=2))

    print(f"\nDone. Passed={passed} Failed={failed}")
    if failed:
        raise SystemExit(1)

if __name__ == "__main__":
    run()

How to Use Evals in Practice

Run your server:

uvicorn app:app --reload

In another terminal, run evals:

python evals/run_evals.py

If an eval fails, you have a solid signal that something has changed in the retrieval, gating, prompting, or provider behavior.

What to improve next: Realistic upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

  • Semantic Chunking: Break text based on meaning rather than character count.

  • Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher accuracy.

  • Metadata filtering: Filter results by category, date, or department to reduce false positives.

  • Better citations: Store chunk IDs and show exactly which chunk an answer came from.

  • Observability: Add request IDs, structured logs, and traces so that “what happened?” is answerable.

  • Async + background indexing: Move index building to a background job and keep the API responsive.

Final Thoughts: Production-readiness is a set of habits

Building an AI application that survives in the real world is about building a system that is predictable, scalable, and secure.

  • Retrieval quality is measurable: Use similarity scores to gate your LLM.

  • Refusal is a feature: Better to say “I don’t know” than to make something up.

  • Fallbacks are mandatory: Design for the moment that the API goes down.

  • Evals prevent regression: Never commit a change without running your own tests.

About me

I’m Chidozie Managwu, an award-winning AI product architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI delegate and lead the AI Titans network, a community for developers learning how to ship AI products.

My work has been recognized with the Global Tech Hero Award and has been featured on platforms like HackerNoon.
