Let me be honest with you.
When I first heard "RAG system" I thought it was some complex research-level thing — the kind of thing only big AI labs and well-funded startups could build. I assumed it required a PhD to understand and a cloud budget to run.
It doesn't.
I built my first RAG pipeline in an afternoon. And by the end of this guide, you will too — using LangChain and FAISS, with real working code at every step.
No theory dumps. No hand-waving. Just the actual build, explained clearly.
What is a RAG System — and Why Does It Matter?
RAG stands for Retrieval-Augmented Generation.
Long name. The idea behind it is simple.
Normal LLMs like GPT have one big problem — they only know what they were trained on. You ask GPT about your company's internal policy document, your product manual, or a PDF you uploaded last week. It has no idea. It will either make something up or tell you it doesn't have access to that information.
RAG fixes this by giving the AI a way to search your documents first, then answer based on what it actually finds.
Think of it like an exam.
A normal LLM is a closed-book exam — the model can only use what it memorized during training.
A RAG system is an open-book exam — the model can look things up before answering.
The difference in answer quality is not small. It's the difference between an AI that guesses and an AI that knows.
This is why RAG is now one of the most widely used patterns in production AI systems. Not agentic orchestration. Not fine-tuning. RAG. Because most real-world problems are about connecting AI to private or specific data, and RAG does that cleanly.
How a RAG Pipeline Actually Works
Before we write a single line of code, here is the full picture. Seven steps. That's the whole system.
Step 1 — Load your documents.
You feed the system your PDFs, text files, web pages, or any data source. LangChain has loaders for almost everything.
Step 2 — Split into chunks.
Big documents get cut into smaller pieces. This is not optional — it matters a lot. The AI works better with small, focused chunks than with one giant wall of text. Smaller chunks mean more precise retrieval.
Step 3 — Turn chunks into embeddings.
Each chunk gets converted into a vector — a list of numbers that represents its meaning. This is how the AI understands language. Not as words, but as positions in a mathematical space where similar ideas sit close together.
Step 4 — Store vectors in FAISS.
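FAISS (Facebook AI Similarity Search) is an open-source vector search library built by Meta. Strictly speaking it is a library, not a full database, but for this guide it plays the role of the vector store. It is fast, free, and runs entirely on your machine. No cloud setup. No network calls to search it. Just a local index you can query in milliseconds.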
Step 5 — User asks a question.
The question also gets converted into a vector using the same embedding model.
Step 6 — Find the most relevant chunks.
FAISS compares the question vector against all stored chunk vectors and returns the closest matches. These are the parts of your documents most likely to contain the answer.
Step 7 — Send context + question to the LLM.
The retrieved chunks get passed to the LLM alongside the question. The LLM reads the context and generates a grounded, accurate answer based on your actual documents — not its training data.
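To make steps 3 to 6 less abstract, here is a toy sketch in plain Python. It is not how OpenAI embeddings or FAISS work internally: the `embed` function below is a hypothetical stand-in that just counts words, and the two chunks are made up. The point is only to show the idea of "turn text into vectors, then pick the chunk whose vector is closest to the question".

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (NOT a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up chunks standing in for pieces of a real document.
chunks = [
    "refunds are issued within 30 days of purchase",
    "our office is closed on public holidays",
]
question = "how many days do refunds take"

# Score every chunk against the question and keep the best match.
scores = [cosine_similarity(embed(question), embed(c)) for c in chunks]
best_chunk = chunks[scores.index(max(scores))]
print(best_chunk)
```

A real embedding model captures meaning rather than shared words, so it can match "money back" to "refund". The retrieval logic around it stays the same.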
That's the whole pipeline. Let's build it.
What You Need Before Starting
- Python 3.10 or higher
- An OpenAI API key
- A PDF or text file you want to query
Install everything in one command:
pip install langchain langchain-openai langchain-community langchain-text-splitters faiss-cpu pypdf python-dotenv
A quick note on faiss-cpu: this is the CPU version. It works perfectly for most projects. If you're working with millions of vectors and need GPU acceleration, use faiss-gpu instead. For this guide, faiss-cpu is all you need.
Step 1 — Set Up Your Environment
Create a .env file in your project folder:
OPENAI_API_KEY=your_api_key_here
Then create your main file, rag_pipeline.py, and load the environment at the top:
import os
from dotenv import load_dotenv
load_dotenv()
Never hardcode API keys directly in your script. Keeping them in .env, with .env listed in your .gitignore, keeps them out of version control and makes your project safe to push to GitHub.
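One small addition I find useful here: fail early if the key is missing, instead of getting a confusing error later from the first API call. This is a minimal sketch, and `require_env` is a hypothetical helper name of my own, not part of python-dotenv or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set. Add it to your .env file.")
    return value


# Typical use, right after load_dotenv():
# api_key = require_env("OPENAI_API_KEY")
```

You do not need to pass the key anywhere by hand afterwards. The check just confirms that the environment is set up before the pipeline starts spending time on loading and splitting.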
Step 2 — Load Your Document
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
LangChain handles all the PDF parsing for you. Each page comes back as a Document object with page_content (the text) and metadata (the source file and page number).
Depending on where your data lives, you can swap PyPDFLoader for other loaders:
- TextLoader: for plain .txt files
- WebBaseLoader: for scraping and loading web pages
- CSVLoader: for spreadsheet data
- UnstructuredMarkdownLoader: for Markdown files
The rest of the pipeline works exactly the same regardless of which loader you use. That's one of the things I genuinely like about LangChain — you swap the source, nothing else changes.
Step 3 — Split Into Chunks
# Older LangChain releases exposed this class as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")
Two parameters matter here and both are worth understanding properly.
chunk_size=500: each chunk contains up to roughly 500 characters of text. Not tokens, characters. This is intentionally small.
chunk_overlap=50 — consecutive chunks share 50 characters of overlap. This is critical. Important context that sits at the boundary between two chunks does not get lost. Sentences that end right at a split point still appear in the next chunk.
Here is a good starting point depending on your document type:
- Short paragraphs, FAQs, bullet-point content → chunk_size=300
- Standard business documents, reports → chunk_size=500
- Dense technical documentation, legal text → chunk_size=800
The overlap should always be at least 10% of your chunk size. So if you use chunk_size=800, set chunk_overlap=80 at minimum.
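If the overlap idea is still fuzzy, this simplified sketch may help. It is not what RecursiveCharacterTextSplitter does internally (that class prefers to break on paragraphs, newlines, and spaces, so real chunks vary in length). The hypothetical `split_with_overlap` function below just slides a fixed-size window over the text, which is enough to see how chunk_size and chunk_overlap interact.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Notice that "hij" closes the first chunk and opens the second. That shared tail is what keeps a sentence on the boundary from being lost.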
Step 4 — Create Embeddings and Store in FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")
print("Vector store created and saved.")
This is where the real work happens.
OpenAIEmbeddings calls OpenAI's API and converts each chunk into a dense vector. The model text-embedding-3-small is fast, cheap, and good enough for most RAG applications. If you need higher accuracy on complex technical content, use text-embedding-3-large — it is more expensive but noticeably better at understanding nuance.
FAISS.from_documents takes all those vectors and builds a local index you can query instantly.
save_local("faiss_index") writes the index to disk.
I cannot stress this enough — always call save_local. I made the mistake of skipping it once during testing. The script crashed halfway through, I lost the entire index, and had to pay for the embedding API calls all over again. Save the index. Every time.
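A pattern that would have saved me that money: check for a saved index first and only build when there is none. Here is a minimal sketch of the idea. `load_or_build` is a hypothetical helper of my own, and the demo uses stand-in functions so it runs without any API key. In the real pipeline the two callables would wrap FAISS.load_local and FAISS.from_documents plus save_local.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(path: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Reuse a saved index at `path` if present; otherwise build (and save) one."""
    if os.path.exists(path):
        return load(path)
    return build(path)


# Demo with stand-in functions (no embedding API involved).
calls = []


def fake_build(path: str) -> str:
    calls.append("build")
    with open(path, "w") as f:  # pretend this is save_local
        f.write("index")
    return "built"


def fake_load(path: str) -> str:
    calls.append("load")
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "faiss_index")
    first = load_or_build(index_path, fake_load, fake_build)   # nothing on disk yet
    second = load_or_build(index_path, fake_load, fake_build)  # found on disk

print(first, second, calls)
```

The second call never reaches the build function, which is the whole point: the paid embedding step runs once per document set, not once per script run.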
Step 5 — Load the Index and Build the Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
The allow_dangerous_deserialization=True flag is required when loading a saved FAISS index. LangChain stores part of the saved index using Python's pickle format, and unpickling a file from an untrusted source can execute arbitrary code, so LangChain makes you opt in explicitly. Since you created this index yourself, it is safe to allow.
k=4 means return the 4 most relevant chunks for each question. This is a tuning parameter:
- If answers feel incomplete or vague, increase to k=6 or k=8
- If answers feel noisy and unfocused, reduce to k=2 or k=3
- Start at k=4 and adjust from there
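To see what k actually controls, here is a rough sketch of a nearest-neighbor search in plain Python. Conceptually, an exact ("flat") FAISS index does the same comparison, only far faster. The vectors are made-up 2-D points and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
import heapq
import math


def top_k(query: list[float], vectors: list[list[float]], k: int) -> list[tuple[float, int]]:
    """Return (distance, index) pairs for the k vectors nearest to the query."""
    distances = [(math.dist(query, vec), i) for i, vec in enumerate(vectors)]
    return heapq.nsmallest(k, distances)


# Four made-up 2-D "chunk vectors"; real embeddings have hundreds of dimensions or more.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
query = [0.9, 0.1]

print([i for _, i in top_k(query, vectors, k=2)])  # [1, 0]
print([i for _, i in top_k(query, vectors, k=3)])  # [1, 0, 2]
```

Raising k never changes the first results. It only appends the next-closest chunks, which is why a larger k adds coverage and noise at the same time.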
Step 6 — Build the RAG Chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant.
Answer the question using ONLY the context below.
If the answer is not in the context, say:
"I don't have that information in the provided documents."
Context:
{context}
Question:
{question}
""")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
The prompt is the most important part of this entire pipeline. More important than the vector database choice. More important than the chunk size settings.
The line that matters most is:
"Answer the question using ONLY the context below."
Without this instruction, the LLM is free to blend your retrieved chunks with whatever it remembers from training, or to lean on its training data entirely when the context looks thin. Your retrieval pipeline risks becoming decoration, and you get a confident-sounding answer you cannot trace back to your documents.
With this instruction, the model is told to read the context you provided and answer only from that. It is not an absolute guarantee, but it is what turns this from a regular chatbot into an actual RAG system.
I also set temperature=0. For RAG applications you want consistent, repeatable answers. Temperature 0 makes the model strongly favor the most probable next token, so the output is close to deterministic, with very little creativity or variation. That's exactly what you want when accuracy matters.
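It helps to see the final text the LLM actually receives. The sketch below fills a trimmed copy of the prompt by hand with plain string formatting. `Doc` is a hypothetical stand-in for LangChain's Document class, and the two retrieved chunks are made up, so this runs without an API key.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document (only the field used here)."""
    page_content: str


TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say:
"I don't have that information in the provided documents."

Context:
{context}

Question:
{question}
"""


def format_docs(docs: list[Doc]) -> str:
    # Same joining logic as in the chain: chunks separated by a blank line.
    return "\n\n".join(doc.page_content for doc in docs)


# Made-up chunks, as if the retriever had returned them.
retrieved = [
    Doc("Refunds are issued within 30 days."),
    Doc("Shipping fees are non-refundable."),
]
final_prompt = TEMPLATE.format(
    context=format_docs(retrieved),
    question="What is the refund policy?",
)
print(final_prompt)
```

Printing the assembled prompt like this is also a handy debugging step. If the right answer is not visible in the context section, the problem is in retrieval, not in the LLM.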
Step 7 — Ask Your Questions
question = "What is the refund policy?"
answer = rag_chain.invoke(question)
print(answer)
That's the full pipeline working end to end. Seven steps from raw document to accurate AI answers.
Full Code — Copy and Run
Here is the complete pipeline in one place:
import os
from dotenv import load_dotenv
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
load_dotenv()
# Step 1 — Load document
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
# Step 2 — Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)
# Step 3 — Create embeddings and store in FAISS
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)
vector_store.save_local("faiss_index")
# Step 4 — Load index and build retriever
vector_store = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)
# Step 5 — Build the LLM and prompt
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template("""
You are a helpful assistant.
Answer using ONLY the context below.
If the answer is not in the context, say:
"I don't have that information in the provided documents."
Context: {context}
Question: {question}
""")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# Step 6 — Assemble the RAG chain
rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
# Step 7 — Ask your question
question = "What is the refund policy?"
answer = rag_chain.invoke(question)
print(answer)
FAISS vs Other Vector Databases: What Should You Actually Use?
FAISS is the right choice for getting started. But it has real limitations you should understand before you commit to it for a production system.
Use FAISS when:
- You are building locally or prototyping
- Your dataset fits comfortably in RAM
- You want zero infrastructure setup and zero cost
- You are the only person querying the system
Move to something else when:
- You need the index to persist reliably across server restarts
- Your data grows to millions of vectors
- Multiple users will be querying simultaneously
- You need filtering, metadata search, or hybrid retrieval
Alternatives worth knowing:
- ChromaDB — open source, persistent by default, easy drop-in for LangChain. Good middle ground between FAISS and managed solutions.
- Qdrant — open source, production-ready, self-hostable for free. Strong filtering and metadata support. My personal recommendation if you need to self-host at scale.
- Pinecone — fully managed, extremely fast, reliable. You pay for it but you never think about infrastructure. Good for client projects where reliability matters more than cost.
For most freelance projects and small business RAG systems — FAISS is more than enough. Do not overcomplicate the infrastructure before you have validated the product.
3 Mistakes I See Everyone Make Building Their First RAG System
Mistake 1 — Chunk size too large.
Chunks of 1500 or 2000+ characters seem like they should give the AI more context. In practice they make retrieval noisy. The relevant part of the answer gets buried inside a large chunk filled with unrelated text, and the LLM gets confused. Keep chunks between 300 and 800 characters for most documents.
Mistake 2 — Zero overlap between chunks.
Key sentences often sit right at the boundary between two chunks. Without overlap, those sentences get cut in half — one half goes into the first chunk, the other half into the next one. Neither chunk makes complete sense. Always set chunk_overlap to at least 10% of your chunk_size.
Mistake 3 — A prompt that does not enforce context-only answers.
This is the most common mistake and the one that wastes the most time. If your prompt does not explicitly tell the LLM to answer only from the retrieved context, it can fall back on its training data. Your retrieval pipeline runs, finds the right chunks, passes them to the LLM, and the LLM may lean on its own memory instead. One clear instruction in the prompt prevents most of this.
What to Build Next
This pipeline is a solid foundation. Once it works, three improvements are worth adding:
1. Conversation memory
Right now the system treats every question as a fresh start. It has no memory of previous questions in the same session. Adding conversation memory with ConversationBufferMemory lets users ask follow-up questions naturally — "what about the exceptions?" — without restating the full context each time.
2. Source citations
Every answer should tell the user which document it came from and on which page. This builds trust and lets users verify the information. LangChain makes this straightforward — the retrieved documents carry metadata you can attach to the response.
3. FastAPI wrapper
Wrapping this pipeline in a FastAPI endpoint turns it from a Python script into a real API. Any frontend — React, Next.js, a mobile app — can connect to it with a simple POST request.
I will cover all three in the next posts in this series.
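Until then, here is a rough sketch of the source-citation idea, since it needs nothing beyond the metadata the loader already attaches. `Doc` is a hypothetical stand-in for LangChain's Document, `format_sources` is my own helper name, and I am assuming the "source" and "page" metadata keys with 0-indexed page numbers, which is what PyPDFLoader produced in my runs. Check your own documents' metadata before relying on it.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document with loader-style metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(docs: list[Doc]) -> list[str]:
    """Build a de-duplicated, order-preserving list of 'file, page N' citations."""
    seen = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # Assumption: page numbers are stored 0-indexed, so add 1 for display.
        label = f"{source}, page {page + 1}" if page is not None else source
        if label not in seen:
            seen.append(label)
    return seen


# Made-up retrieval results; two chunks come from the same page.
retrieved = [
    Doc("Refunds are issued within 30 days.", {"source": "policy.pdf", "page": 2}),
    Doc("Refund requests need a receipt.", {"source": "policy.pdf", "page": 2}),
    Doc("Shipping fees are non-refundable.", {"source": "policy.pdf", "page": 4}),
]
print(format_sources(retrieved))
# ['policy.pdf, page 3', 'policy.pdf, page 5']
```

In the real pipeline you would call the retriever yourself, pass the same documents to both the chain and a helper like this, and print the citations under the answer.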
Key Takeaway
A RAG system is not complicated once you see the whole picture.
Load documents. Split them. Embed them. Store them. Retrieve the relevant ones. Pass them to an LLM. Get an accurate answer.
Seven steps.
The magic is not in the code. It is in the prompt — specifically the instruction that forces the LLM to use your documents instead of its own memory. Get that right and the rest falls into place.
Build this once and you will immediately understand why RAG is one of the most widely used patterns in production AI right now. It is not hype. It is the cleanest solution to the most common AI problem: connecting a language model to data it was never trained on.
The code is simple. The impact is real.
Written by Muhammad Yasir (devxyasir) — AI & Automation Engineer
I build AI agents, RAG systems, and Python backends for real-world problems.
If this helped you, share it with someone building their first AI project. It took me longer than one afternoon to figure all of this out — this post is the guide I wish I had.


