7 Steps to Build a Simple RAG System from Scratch

Photo by author

# Introduction

These days, almost everyone uses ChatGPT, Gemini, or another large language model (LLM). They make life easier but can still get things wrong. For example, I remember asking a generative model who won the most recent US presidential election and getting the previous president's name back. It sounded confident, but the model was relying on training data from before the election took place. This is where retrieval-augmented generation (RAG) helps LLMs give more accurate and up-to-date responses. Instead of relying solely on the model's internal knowledge, it pulls information from external sources, such as PDFs, documents, or APIs, and uses it to craft a more contextual and reliable response. In this guide, I'll walk you through seven practical steps to build a simple RAG system from scratch.

# Understanding the retrieval-augmented generation workflow

Before we get to the code, here is the idea in plain terms. A RAG system has two main parts: the retriever and the generator. The retriever searches your knowledge base and pulls out the most relevant chunks of text. The generator is the language model that takes those chunks and turns them into a natural, useful answer. The process is straightforward:

1. A user asks a question.
2. The retriever searches your indexed documents or database and returns the best-matching passages.
3. These passages are passed to the LLM as context.
4. The LLM then generates a response grounded in that retrieved context.
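The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration, not the pipeline built in the rest of this guide: `retrieve` ranks passages by shared words instead of embeddings, and `fake_llm` is a hypothetical stand-in for a real model call.

```python
def retrieve(question, passages, top_k=2):
    """Toy retriever: rank passages by how many words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question, context_passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(context_passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def fake_llm(prompt):
    """Hypothetical stand-in for a real model; it only reports what it received."""
    return f"[model would answer using {len(prompt)} prompt characters]"


passages = [
    "Supervised learning trains a model on labeled data.",
    "Unsupervised learning finds patterns in unlabeled data.",
    "FAISS is a library for similarity search.",
]
question = "What data does supervised learning use?"

top = retrieve(question, passages, top_k=1)   # step 2: retrieve
prompt = build_prompt(question, top)          # step 3: pass as context
print(fake_llm(prompt))                       # step 4: generate
```

The real system swaps the word-overlap ranking for embedding similarity and the stand-in for an actual LLM, but the shape of the flow stays the same.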

Now we’ll break this flow down into seven easy steps and build it end-to-end.

# Step 1: Prepare the data beforehand

While large language models already know a lot from textbooks and web data, they don't have access to your private or newly created information, such as research notes, company documents, or project files. RAG lets you feed your own data to the model, reducing hallucinations and making responses more accurate and up-to-date. For the sake of this article, we'll keep things simple and use two short text files about machine learning concepts.

data/
 ├── supervised_learning.txt
 └── unsupervised_learning.txt

supervised_learning.txt:
In this type of machine learning (supervised), the model is trained on labeled data. 
In simple terms, every training example has an input and an associated output label. 
The objective is to build a model that generalizes well on unseen data. 
Common algorithms include:
- Linear Regression
- Decision Trees
- Random Forests
- Support Vector Machines

Classification and regression tasks are performed in supervised machine learning.
For example: spam detection (classification) and house price prediction (regression).
They can be evaluated using accuracy, F1-score, precision, recall, or mean squared error.

unsupervised_learning.txt:
In this type of machine learning (unsupervised), the model is trained on unlabeled data. 
Popular algorithms include:
- K-Means
- Principal Component Analysis (PCA)
- Autoencoders

There are no predefined output labels; the algorithm automatically detects 
underlying patterns or structures within the data.
Typical use cases include anomaly detection, customer clustering, 
and dimensionality reduction.
Performance can be measured qualitatively or with metrics such as silhouette score 
and reconstruction error.

The next task is to load this data. For this, we will create a Python file, load_data.py:

import os

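def load_documents(folder_path):
    """Read every .txt file in folder_path and return the contents as a list of strings."""
    docs = []  # must be a list: tuples have no append()
    # Sort the filenames so documents load in a stable order
    for file in sorted(os.listdir(folder_path)):
        if file.endswith(".txt"):
            with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
                docs.append(f.read())
    return docs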

Before we use the data, we will clean it. If the text is messy, the retriever can return irrelevant or incorrect passages, which makes hallucinations more likely. Now, let's create another Python file, clean_data.py:

import re

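def clean_text(text: str) -> str:
    # Replace runs of non-ASCII characters with a space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    # Collapse any run of whitespace (including newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()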

Finally, combine everything into a new file called prepare_data.py to load and clean your documents in one go:

from load_data import load_documents
from clean_data import clean_text

def prepare_docs(folder_path="data/"):
    """
    Loads and cleans all text documents from the given folder.
    """
    # Load Documents
    raw_docs = load_documents(folder_path)

    # Clean Documents (built as a list so len() works below)
    cleaned_docs = [clean_text(doc) for doc in raw_docs]

    print(f"Prepared {len(cleaned_docs)} documents.")
    return cleaned_docs

# Step 2: Splitting text into chunks

LLMs have a limited context window, meaning they can only process a limited amount of text at a time. We solve this by breaking long documents into short, overlapping chunks (a chunk is typically 300 to 500 words long). We will use LangChain's RecursiveCharacterTextSplitter, which splits text at natural boundaries such as paragraphs or sentences. Each chunk stays coherent, and the retriever can quickly find the relevant chunk when answering. Note that the splitter counts chunk_size in characters by default, so the value of 500 used below produces chunks well under that word range.

Let's put this in a file called split_text.py:

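# Older LangChain releases expose the splitter here; newer ones ship it in the
# separate langchain_text_splitters package, so adjust the import to your version.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=500, chunk_overlap=100):
    # Define the splitter (chunk_size and chunk_overlap are counted in characters)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    # create_documents takes a list of strings and returns LangChain Document objects
    chunks = splitter.create_documents(documents)
    print(f"Total chunks created: {len(chunks)}")

    return chunks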

Chunking lets the model work with the text piece by piece without losing its meaning. Without a little overlap between chunks, a sentence that straddles a boundary gets cut in half, and the retrieved context may no longer make sense.
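To see what overlap does at the boundaries, here is a minimal sliding-window sketch. It is deliberately simpler than RecursiveCharacterTextSplitter, which looks for separators such as paragraph breaks; `chunk_with_overlap` is a hypothetical helper that just cuts fixed-size character windows.

```python
def chunk_with_overlap(text, chunk_size=20, chunk_overlap=5):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk starts with the last three characters of the previous one, so anything that falls on a boundary appears whole in at least one chunk as long as it is shorter than the overlap.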

# Step 3: Creating and storing vector embeddings

A computer does not understand text. It only understands numbers. So we need to convert our text chunks into numbers. These numbers are called vector embeddings, and they help the computer capture the meaning behind the text. We can use tools like OpenAI, SentenceTransformers, or Hugging Face for this. Let's create a new file called create_embeddings.py and use SentenceTransformers to generate embeddings.

from sentence_transformers import SentenceTransformer
import numpy as np

def get_embeddings(text_chunks):
  
   # Load embedding model
   model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
  
   print(f"Creating embeddings for {len(text_chunks)} chunks:")
   embeddings = model.encode(text_chunks, show_progress_bar=True)
  
   print(f"Embeddings shape: {embeddings.shape}")
   return np.array(embeddings)

Each vector embedding captures the semantic meaning of its chunk. Similar chunks will have embeddings that are close to each other in vector space. We will now store the embeddings in a vector database such as FAISS (Facebook AI Similarity Search), Chroma, or Pinecone. This makes finding matches much faster. For this example, let's use FAISS (a lightweight, local option). You can install it with pip; the CPU-only build is published on PyPI as faiss-cpu.
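To make "close in vector space" concrete, the sketch below compares three hand-made vectors with two common measures. The three-dimensional values are invented for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings" for three short texts
spam_detection = [0.9, 0.1, 0.0]
email_classification = [0.8, 0.2, 0.1]
customer_clustering = [0.1, 0.2, 0.9]

print(l2_distance(spam_detection, email_classification))
print(l2_distance(spam_detection, customer_clustering))
print(cosine_similarity(spam_detection, email_classification))
print(cosine_similarity(spam_detection, customer_clustering))
```

The two classification-related vectors end up with a smaller distance and a higher cosine similarity than the unrelated pair, which is the property a vector index exploits.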

Next, let’s create a file called store_faiss.py. First, we make the necessary imports:

import faiss
import numpy as np
import pickle

Now we will create a FAISS index from our embeddings using the function build_faiss_index().

def build_faiss_index(embeddings, save_path="faiss_index"):
   """
   Builds FAISS index and saves it.
   """
   dim = embeddings.shape[1]  # shape is a tuple: (num_chunks, embedding_dim)
   print(f"Building FAISS index with dimension: {dim}")

   # Use a simple flat L2 index
   index = faiss.IndexFlatL2(dim)
   index.add(embeddings.astype('float32'))

   # Save FAISS index
   faiss.write_index(index, f"{save_path}.index")
   print(f"Saved FAISS index to {save_path}.index")

   return index

Each embedding represents a text chunk, and FAISS retrieves the closest ones later when the user asks a question. Finally, we need to save all the text chunks (our metadata) into a pickle file so that they can be easily reloaded later for retrieval.

def save_metadata(text_chunks, path="faiss_metadata.pkl"):
   """
   Saves the mapping of vector positions to text chunks.
   """
   with open(path, "wb") as f:
       pickle.dump(text_chunks, f)
   print(f"Saved text metadata to {path}")
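Conceptually, a flat L2 index compares the query against every stored vector and returns the positions of the nearest ones; the saved list of chunks then turns those positions back into text. The sketch below imitates that behavior in plain Python. `flat_l2_search` is a hypothetical helper with invented two-dimensional vectors, not the FAISS API, although like IndexFlatL2 it reports squared distances.

```python
def flat_l2_search(vectors, query, top_k=2):
    """Exhaustive search: squared L2 distance to every vector, then keep the top_k smallest."""
    scored = []
    for position, vec in enumerate(vectors):
        dist = sum((v - q) ** 2 for v, q in zip(vec, query))
        scored.append((dist, position))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    distances = [d for d, _ in top]
    positions = [p for _, p in top]
    return distances, positions


# Position i in `vectors` corresponds to position i in `text_chunks`
text_chunks = ["chunk about regression", "chunk about clustering", "chunk about PCA"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]

distances, positions = flat_l2_search(vectors, query=[0.0, 0.8], top_k=2)
matches = [text_chunks[p] for p in positions]
print(matches)
```

This is why the order of the pickled chunks must match the order in which embeddings were added to the index: the index only knows positions, never the text itself.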

# Step 4: Retrieving Relevant Information

In this step, the user's query is first converted into a vector, just as we did with the text chunks. The computer then compares the chunk vectors with the query vector to find the closest ones. This process is called similarity search.
Let's create a new file called retrieve_faiss.py and add the imports we need:

import faiss
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

Now, create a function to load the previously saved FAISS index from disk so that it can be searched.

def load_faiss_index(index_path="faiss_index.index"):
    """
    Loads the saved FAISS index from disk.
    """
    print("Loading FAISS index.")
    return faiss.read_index(index_path)

We’ll also need another function that loads the metadata, which contains the text parts we saved earlier.

def load_metadata(metadata_path="faiss_metadata.pkl"):
    """
    Loads text chunk metadata (the actual text pieces).
    """
    print("Loading text metadata.")
    with open(metadata_path, "rb") as f:
        return pickle.load(f)

The actual text chunks are stored in the metadata file (faiss_metadata.pkl) and are used to map the FAISS results back to readable text. Next, we'll create a function that takes the user's query, embeds it, and finds the top matching chunks in the FAISS index. This is where the semantic search happens.

def retrieve_similar_chunks(query, index, text_chunks, top_k=3):
    """
    Retrieves top_k most relevant chunks for a given query.
  
    Parameters:
        query (str): The user's input question.
        index (faiss.Index): FAISS index object.
        text_chunks (list): Original text chunks.
        top_k (int): Number of top results to return.
  
    Returns:
        list: Top matching text chunks.
    """
  
    # Embed the query
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    # Wrap the query in a list so encode() returns a 2D array of shape (1, dim),
    # and cast to float32 as required by FAISS
    query_vector = model.encode([query]).astype('float32')

    # Search FAISS for nearest vectors
    distances, indices = index.search(query_vector, top_k)

    print(f"Retrieved top {top_k} similar chunks.")
    # indices has shape (1, top_k); row 0 holds the positions for our single query
    return [text_chunks[i] for i in indices[0]]

This gives you the top three most relevant text sections to use as context.
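One detail worth guarding against: when an index holds fewer vectors than top_k, FAISS pads the result row with -1. In Python, indexing a list with -1 silently returns the last element, so a padded row would yield a duplicated chunk instead of an error. A minimal sketch of a guard, using a hypothetical helper name `indices_to_chunks` and a hand-written result row:

```python
def indices_to_chunks(index_row, text_chunks):
    """Map one row of result positions to chunk text, skipping -1 placeholders."""
    return [text_chunks[i] for i in index_row if 0 <= i < len(text_chunks)]


text_chunks = ["supervised learning chunk", "unsupervised learning chunk"]
index_row = [1, 0, -1]  # what a search with top_k=3 over two vectors could look like

unguarded = [text_chunks[i] for i in index_row]
guarded = indices_to_chunks(index_row, text_chunks)

print(unguarded)
print(guarded)
```

The unguarded version returns the second chunk twice, while the guarded version returns each chunk once.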

# Step 5: Combining the retrieved contexts

Once we have the most relevant parts, the next step is to combine them into a single context block. This context is then added to the user’s query before passing it to the LLM. This step ensures that the model has all the necessary information to produce an accurate and grounded response. You can combine pieces like this:

context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=3)
context = "\n\n".join(context_chunks)

This combined context will be used later when constructing the final prompt for the LLM.
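With overlapping chunks and a small model, it can help to drop duplicate chunks and cap the context length before building the prompt. The sketch below is one possible approach, not part of the original pipeline; `build_context` and its `max_chars` budget are hypothetical names.

```python
def build_context(chunks, max_chars=1500, separator="\n\n"):
    """Join unique chunks in ranked order, stopping before the character budget is exceeded."""
    seen = set()
    selected = []
    used = 0
    for chunk in chunks:
        if chunk in seen:
            continue  # skip exact duplicates
        cost = len(chunk) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break  # keep the highest-ranked chunks that fit
        seen.add(chunk)
        selected.append(chunk)
        used += cost
    return separator.join(selected)


ranked_chunks = ["aaaa", "aaaa", "bbbb", "cccccccccc"]
context = build_context(ranked_chunks, max_chars=10)
print(repr(context))
```

Because chunks arrive ranked by relevance, cutting from the end removes the least relevant material first.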

# Step 6: Using a large language model to generate the answer

Now, we combine the retrieved context with the user's query and feed it into the LLM to generate the final response. Here, we'll use a freely available open-source model from Hugging Face, but you can use any model you prefer.

Let’s create a new file called generate_answer.py and add the imports:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from retrieve_faiss import load_faiss_index, load_metadata, retrieve_similar_chunks

Now define a function generate_answer() which performs the full process:

def generate_answer(query, top_k=3):
    """
    Retrieves relevant chunks and generates a final answer.
    """
    # Load FAISS index and metadata
    index = load_faiss_index()
    text_chunks = load_metadata()

    # Retrieve top relevant chunks
    context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k=top_k)
    context = "\n\n".join(context_chunks)

    # Load open-source LLM
    print("Loading LLM...")
    model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    # Load tokenizer and model, using a device map for efficient loading
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

    # Build the prompt
    prompt = f"""
    Context:
    {context}
    Question:
    {query}
    Answer:
    """

    # Generate output
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Use the correct input for model generation
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)
    
    # Decode the full sequence (prompt plus generated tokens)
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Simple way to remove the prompt part from the output
    answer = full_text.split("Answer:")[1].strip() if "Answer:" in full_text else full_text.strip()

    print("\nFinal Answer:")
    print(answer)
    return answer
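Splitting on "Answer:" works for this prompt but is fragile: if a retrieved chunk happens to contain the same marker, the split returns part of the context instead of the answer. A common alternative with decoder-only models is to drop the prompt tokens before decoding, since the generated sequence begins with the prompt. The sketch below shows the idea with plain lists standing in for token ID tensors; the tiny `vocab` and the token IDs are invented for illustration.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Invented toy vocabulary and token IDs
vocab = {0: "Context:", 1: "Answer:", 2: "labels", 3: "are", 4: "absent", 5: "Question:"}
prompt_ids = [0, 1, 2, 5, 1]        # the context itself contains "Answer:"
output_ids = prompt_ids + [2, 3, 4]  # generate() output: prompt followed by new tokens

new_ids = strip_prompt_tokens(output_ids, len(prompt_ids))
answer = " ".join(vocab[i] for i in new_ids)

# The marker-based approach picks up context text instead
full_text = " ".join(vocab[i] for i in output_ids)
naive = full_text.split("Answer:")[1].strip()

print(answer)
print(naive)
```

In generate_answer() the equivalent would be slicing with the prompt length, for example outputs[0][inputs["input_ids"].shape[1]:], before calling tokenizer.decode.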

# Step 7: Running the full retrieval-augmented generation pipeline

This last step brings everything together. We will make one main.py file that automates the entire workflow from loading data to generating the final response.

# Data preparation
from prepare_data import prepare_docs
from split_text import split_docs

# Embedding and storage
from create_embeddings import get_embeddings
from store_faiss import build_faiss_index, save_metadata

# Retrieval and answer generation
from generate_answer import generate_answer

Now define the main function:

def run_pipeline():
    """
    Runs the full end-to-end RAG workflow.
    """
    print("\nLoad and Clean Data:")
    documents = prepare_docs("data/")
    print(f"Loaded {len(documents)} clean documents.\n")

    print("Split Text into Chunks:")
    # documents is a list of plain strings, which is what split_docs
    # (via create_documents) expects
    chunks = split_docs(documents, chunk_size=500, chunk_overlap=100)
    # chunks is a list of LangChain Document objects

    # Extract the text content from each Document
    texts = [c.page_content for c in chunks]
    print(f"Created {len(texts)} text chunks.\n")

    print("Generate Embeddings:")
    embeddings = get_embeddings(texts)
  
    print("Store Embeddings in FAISS:")
    index = build_faiss_index(embeddings)
    save_metadata(texts)
    print("Stored embeddings and metadata successfully.\n")

    print("Retrieve & Generate Answer:")
    query = "Does unsupervised ML cover regression tasks?"
    generate_answer(query)

Finally, run the pipeline:

if __name__ == "__main__":
    run_pipeline()

Output:

Screenshot of the output (photo by author)
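As written, every call to generate_answer() reloads the index, the embedding model, and the LLM, which is fine for one question but slow for several. One way to avoid that is to cache the loaded objects. The sketch below shows the pattern with functools.lru_cache; `get_resource` is a hypothetical loader that returns a placeholder dict instead of a real model.

```python
from functools import lru_cache

load_log = []  # records each real load so the effect of caching is visible


@lru_cache(maxsize=None)
def get_resource(name):
    """Hypothetical loader: pretend this is an expensive model or index load."""
    load_log.append(name)
    return {"name": name}


def answer_many(questions):
    """Answer several questions while loading the shared resource only once."""
    results = []
    for question in questions:
        model = get_resource("embedding-model")  # cached after the first call
        results.append((question, model["name"]))
    return results


results = answer_many(["What is PCA?", "What is K-Means?", "What is F1-score?"])
print(len(results), "questions answered with", len(load_log), "load")
```

In the real pipeline, the same decorator could wrap small functions that return the SentenceTransformer instance, the FAISS index, and the tokenizer and model pair.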

# Wrapping up

RAG closes the gap between what the LLM "already knows" and the ever-changing information in the world. I've implemented a very basic pipeline so you can understand how RAG works. At the enterprise level, many advanced concepts come into play, such as guardrails, hybrid search, streaming, and context optimization techniques. If you're interested in exploring more advanced concepts, here are some of my personal favorites:

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI with medicine. She co-authored the eBook "Maximizing Productivity with ChatGPT." As a 2022 Google Generation Scholar for APAC, she champions diversity and academic excellence. She has also been recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a passionate advocate for change, having founded FEMCodes to empower women in STEM fields.
