How to Build a RAG Application with LangChain and Pinecone (Step-by-Step)
A complete, working tutorial for building a Retrieval-Augmented Generation application using LangChain and Pinecone. Covers document loading, chunking, embeddings, indexing, retrieval, and generation with full Python code.
Most RAG tutorials show you a toy example with five sentences of text and call it done. That is not what building a real RAG application looks like, and it does not prepare you for the decisions that actually matter: how to chunk your documents, which embedding model to use, how to structure the retrieval chain, and how to avoid the model making things up when the context does not contain the answer.
This tutorial builds a complete, working RAG application using LangChain and Pinecone. By the end, you will have a system that loads real documents, indexes them in a production vector database, retrieves relevant context for any question, and generates grounded answers.
I am using LangChain because it standardizes the document loading and chaining logic across providers, and Pinecone because it removes every infrastructure decision so you can focus on the parts of RAG that actually affect answer quality.
What You Are Building
By the end of this tutorial, you will have a Python application that does the following: loads documents from disk, splits them into chunks, converts those chunks into embeddings, stores the embeddings in a Pinecone index, and answers natural language questions by retrieving relevant chunks and passing them to an LLM.
This is the same architecture used in production RAG systems. The difference between this tutorial and a production deployment is scale, not structure.
Prerequisites
You need Python 3.10 or higher, since LangChain dropped support for older versions. You also need two accounts: a Pinecone account with an API key, and an OpenAI account with an API key for both embeddings and the chat model.
If you would rather use a different LLM provider, the retrieval and indexing steps stay identical. Only the model initialization changes, which is covered near the end of this tutorial.
Setting Up the Environment
Install the required packages. LangChain has been restructured into separate packages since the 1.0 stable release, so you need the core package plus the provider-specific integrations.
pip install langchain langchain-openai langchain-pinecone langchain-community langchain-classic pypdf python-dotenv
Set your API keys as environment variables. Never hardcode them directly in your script.
export OPENAI_API_KEY="your-openai-api-key"
export PINECONE_API_KEY="your-pinecone-api-key"
Or, if you prefer a .env file:
from dotenv import load_dotenv
load_dotenv()
Loading Your Documents
LangChain provides document loaders for dozens of formats. This tutorial uses PDF files, but the pattern is the same for text files, Markdown, HTML, or Notion exports.
from langchain_community.document_loaders import PyPDFLoader
import os
def load_documents(directory_path):
    """Load every PDF in the directory; PyPDFLoader returns one Document per page."""
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            filepath = os.path.join(directory_path, filename)
            loader = PyPDFLoader(filepath)
            documents.extend(loader.load())
    return documents
raw_documents = load_documents("./data")
print(f"Loaded {len(raw_documents)} pages")
Each loaded item is a LangChain Document object containing the page content and metadata such as the source file and page number. That metadata matters later when you want to cite sources in your answers.
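To make the point about citations concrete, here is a minimal sketch of turning a Document's metadata into a display string. The helper name format_citation is hypothetical, and the code assumes the loader stores a zero-based page index under the "page" key, so check that against your own loaded documents before relying on it.

```python
import os


def format_citation(metadata: dict) -> str:
    """Turn a Document's metadata dict into a short human-readable citation."""
    source = os.path.basename(metadata.get("source", "unknown source"))
    page = metadata.get("page")
    if page is None:
        return source
    # Assumption: the loader stores a zero-based page index, so add 1 for display.
    return f"{source}, page {page + 1}"


print(format_citation({"source": "./data/handbook.pdf", "page": 0}))
```

With real documents you would call it as format_citation(doc.metadata) for each retrieved chunk.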
Splitting Text Into Chunks
You cannot feed an entire document into an LLM context window efficiently, and large chunks hurt retrieval precision because they mix multiple topics into a single embedding. The fix is splitting documents into smaller, semantically coherent chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(raw_documents)
print(f"Split into {len(chunks)} chunks")
chunk_size sets the maximum number of characters each chunk holds. chunk_overlap keeps a sliding window of shared text between consecutive chunks so a sentence that gets cut at a boundary still has context in the neighboring chunk. The separators list tells the splitter to prefer breaking on paragraph boundaries first, then line breaks, sentences, and words, falling back to mid-word splits only as a last resort.
800 characters with 120 of overlap is a reasonable default for technical documentation. If your content has long, dense paragraphs, increase chunk size. If it is mostly short, list-like content, decrease it.
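If the interaction between chunk size and overlap feels abstract, the following simplified sketch may help. It is a plain fixed-width sliding window, not the algorithm RecursiveCharacterTextSplitter uses (it ignores separators entirely), and the function name sliding_window_chunks is made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks where neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window has reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk begins with the last three characters of the previous one, which is the same idea as the 120-character overlap above at a smaller scale.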
Choosing an Embedding Model
The embedding model converts each text chunk into a vector that captures its semantic meaning. OpenAI's text-embedding-3-small is a strong default: it produces 1536-dimension vectors, costs roughly $0.02 per million tokens, and performs within a few percentage points of the larger text-embedding-3-large model for most retrieval tasks.
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
Only switch to the large model if you have measured a real retrieval quality gap on your own evaluation set. At list price the smaller model is roughly six times cheaper ($0.02 versus about $0.13 per million tokens), and the quality difference is marginal for most use cases.
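For a rough sense of what indexing costs, here is a back-of-the-envelope estimate. It assumes about four characters per token, which is only a common rule of thumb for English text, and uses the $0.02 per million tokens figure quoted above. The function name is hypothetical, and real token counts will differ.

```python
def estimate_embedding_cost(texts, price_per_million_tokens=0.02, chars_per_token=4):
    """Rough embedding cost in dollars, based on a characters-per-token heuristic."""
    total_chars = sum(len(text) for text in texts)
    estimated_tokens = total_chars / chars_per_token  # heuristic, not a real tokenizer
    return estimated_tokens / 1_000_000 * price_per_million_tokens


# 10,000 chunks of 800 characters each is roughly 2 million tokens
cost = estimate_embedding_cost(["x" * 800] * 10_000)
print(f"${cost:.4f}")  # $0.0400
```

At this scale the embedding bill is a few cents, which is why the choice of model matters more for quality than for cost until the corpus gets large.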
Creating the Pinecone Index
Before storing vectors, you need a Pinecone index configured with the right dimension to match your embedding model. text-embedding-3-small produces 1536-dimension vectors, so the index must be created with dimension=1536.
from pinecone import Pinecone, ServerlessSpec
import time
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "rag-tutorial"
if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait for the index to finish initializing
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
index = pc.Index(index_name)
This block checks whether the index already exists before creating it, which makes the script safe to rerun without errors. The metric="cosine" setting matches how OpenAI's embedding models are designed to be compared.
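To show what the cosine metric measures, and why the index dimension has to match the embedding model, here is a small pure-Python sketch. Pinecone computes this server-side, so the function below is only an illustration and is not part of the pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        # The same constraint the index enforces: vectors must share one dimension.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0 (length does not matter, only direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because only direction counts, two chunks of very different lengths can still score as near-identical if they are about the same thing.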
Storing Your Chunks in Pinecone
With the index ready, connect LangChain's Pinecone integration and push your chunks. The PineconeVectorStore.from_documents method handles the embedding and the upsert in one call.
from langchain_pinecone import PineconeVectorStore
vector_store = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
)
For large document sets, this single call batches the upserts automatically. If you are indexing millions of chunks, consider batching manually and adding a short delay between batches to avoid rate limits, but for most tutorials and small production datasets this one-liner is sufficient.
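If you do need manual batching, one possible shape is sketched below. The helper names, the batch size of 100, and the half-second delay are all assumptions for illustration, not recommended values. Tune them against the rate limits you actually hit.

```python
import time


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(items, add_batch, batch_size=100, delay_seconds=0.5):
    """Send items to add_batch one batch at a time, pausing between batches."""
    total = 0
    for batch in iter_batches(items, batch_size):
        add_batch(batch)
        total += len(batch)
        if total < len(items):
            time.sleep(delay_seconds)  # crude pause to stay under rate limits
    return total


# Demo with a stand-in for the real upsert call
sent = []
print(index_in_batches(list(range(5)), sent.append, batch_size=2, delay_seconds=0))  # 5
print(sent)  # [[0, 1], [2, 3], [4]]
```

Against a connected vector store, the call would look something like index_in_batches(chunks, vector_store.add_documents), since add_documents accepts a list of Document objects.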
If you already have an index populated and just want to connect to it without re-indexing, use this instead.
vector_store = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings,
)
Testing Retrieval Before Building the Full Chain
Before wiring up the LLM, verify that retrieval itself is working correctly. This step catches chunking and embedding problems early, before they get masked by the LLM generating a plausible-sounding wrong answer.
query = "What are the main steps in the onboarding process?"
results = vector_store.similarity_search(query, k=4)
for i, doc in enumerate(results):
    print(f"--- Result {i + 1} ---")
    print(doc.page_content[:200])
    print(f"Source: {doc.metadata.get('source')}, Page: {doc.metadata.get('page')}")
    print()
Read the actual retrieved chunks. If they are not relevant to the query, the problem is in your chunking strategy or embedding choice, not in the LLM. This is the single most useful debugging step for a RAG application that gives bad answers, and it is the one most tutorials skip.
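Reading results by eye works for a handful of queries. Once you have more, a small scripted check can track the same thing. The sketch below computes a hit rate: the fraction of labeled questions whose expected source file shows up in the top k results. The function name, the labeled pairs, and the fake search function are all invented for illustration.

```python
def hit_rate_at_k(labeled_queries, search, k=4):
    """Fraction of queries whose expected source appears among the top-k results."""
    if not labeled_queries:
        return 0.0
    hits = 0
    for query, expected_source in labeled_queries:
        retrieved_sources = search(query, k)
        if expected_source in retrieved_sources:
            hits += 1
    return hits / len(labeled_queries)


# Stand-in for a real retriever: maps a query to the sources it would return
fake_index = {
    "How do I reset my password?": ["it-guide.pdf", "faq.pdf"],
    "What is the leave policy?": ["faq.pdf"],
}


def fake_search(query, k):
    return fake_index.get(query, [])[:k]


labeled = [
    ("How do I reset my password?", "it-guide.pdf"),
    ("What is the leave policy?", "hr-policy.pdf"),
]
print(hit_rate_at_k(labeled, fake_search))  # 0.5
```

To run it against the real index, the search argument could be a small wrapper such as lambda q, k: [d.metadata.get("source") for d in vector_store.similarity_search(q, k=k)].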
Building the Retrieval Chain with LCEL
LangChain Expression Language (LCEL) is the standard way to compose chains in LangChain v1.x. It uses the pipe operator to connect components, which gives you automatic streaming, batching, and async support without extra code.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
prompt = ChatPromptTemplate.from_template("""
Answer the question using only the context below. If the context does not
contain enough information to answer, say "I don't have enough information
to answer this question" instead of guessing.
Context:
{context}
Question:
{question}
Answer:
""")
def format_docs(docs):
    # Join the retrieved chunks into one context string with a visible separator
    return "\n\n---\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
The prompt explicitly instructs the model to admit when it does not know the answer rather than fabricating one. This single instruction prevents a large share of hallucination problems in RAG applications and should be in every production RAG prompt you write.
Setting temperature=0 makes the model's output more deterministic, which matters for factual question-answering where you want consistent answers rather than creative variation.
Running a Query Through the Full Pipeline
With the chain built, answering a question is a single call.
question = "What are the main steps in the onboarding process?"
answer = rag_chain.invoke(question)
print(answer)
Behind this single line, LangChain runs the full pipeline: embed the question, search Pinecone for the four most similar chunks, format them into the prompt template alongside the question, send the completed prompt to the LLM, and parse the output into a plain string.
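The same sequence of steps can be written out as ordinary function calls, which is sometimes easier to reason about than the pipe syntax. This is a conceptual sketch only, not how LangChain implements the chain. The retrieve and generate callables are stand-ins for the retriever and the LLM, and the prompt is a shortened version of the one above.

```python
def run_rag_pipeline(question, retrieve, generate, k=4):
    """Plain-function version of the chain: retrieve, format, prompt, generate, parse."""
    chunks = retrieve(question, k)              # embed the question and search the index
    context = "\n\n---\n\n".join(chunks)        # same job as format_docs
    prompt = f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
    raw_output = generate(prompt)               # the LLM call
    return raw_output.strip()                   # roughly what the output parser hands back


# Stand-ins so the sketch runs without any API keys
def fake_retrieve(question, k):
    return ["Step one: sign the contract.", "Step two: set up accounts."][:k]


seen_prompts = []


def fake_generate(prompt):
    seen_prompts.append(prompt)
    return "  Sign the contract, then set up accounts.\n"


answer = run_rag_pipeline("What are the onboarding steps?", fake_retrieve, fake_generate)
print(answer)
```

Swapping the fakes for real components is what the LCEL chain does for you, with streaming and batching added on top.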
Returning Sources Alongside the Answer
A production RAG application almost always needs to show where an answer came from, both for user trust and for debugging. Use create_retrieval_chain instead of the raw LCEL pipe when you need the source documents returned alongside the answer.
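# On LangChain 1.x these helpers live in the langchain-classic package (pip install langchain-classic).
# On pre-1.0 installs, import them from langchain.chains instead.
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain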
qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below. "
               "If you cannot find the answer, say so clearly.\n\n{context}"),
    ("human", "{input}"),
])
document_chain = create_stuff_documents_chain(llm, qa_prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
result = retrieval_chain.invoke({"input": "What are the main steps in the onboarding process?"})
print("Answer:")
print(result["answer"])
print("\nSources:")
for doc in result["context"]:
    print(f"- {doc.metadata.get('source')}, page {doc.metadata.get('page')}")
This pattern, known as the "stuff" strategy, inserts every retrieved chunk into a single prompt and sends it to the LLM in one call. It is the simplest and most common approach, and it works well as long as your retrieved context fits comfortably inside the model's context window.
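One way to keep the stuffed context inside a budget is to trim the retrieved chunks before they reach the prompt. The sketch below is an assumption-heavy illustration: it estimates tokens at about four characters each rather than using a real tokenizer, and the function name and budget numbers are made up.

```python
def fit_to_budget(chunks, max_context_tokens, chars_per_token=4):
    """Keep chunks in ranked order until the estimated token budget is used up."""
    kept = []
    used = 0
    for chunk in chunks:
        # Heuristic estimate; round up so the budget is never underestimated.
        estimated_tokens = len(chunk) // chars_per_token + 1
        if used + estimated_tokens > max_context_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        used += estimated_tokens
    return kept


ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]  # about 11 estimated tokens each
print(len(fit_to_budget(ranked_chunks, max_context_tokens=25)))  # 2
```

Stopping at the first chunk that does not fit, rather than skipping ahead to smaller ones, keeps the most relevant results and avoids reordering the context.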
Filtering Retrieval by Metadata
Real applications rarely want to search across the entire index for every query. If your documents have metadata like a category, department, or date, filter retrieval to only the relevant subset.
filtered_retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"category": "hr-policy"},
    }
)
This both improves answer relevance and reduces the chance of the model pulling context from an unrelated document set. If you are building a multi-tenant application, filtering by tenant ID at the retrieval layer is a requirement, not an optimization.
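For the multi-tenant case, it can help to build the search arguments in one place so that no code path can forget the tenant filter. The sketch below assumes each chunk was indexed with a tenant_id metadata field. That field name and the helper are hypothetical, so use whatever key your own indexing step writes.

```python
def tenant_search_kwargs(tenant_id, k=4, extra_filter=None):
    """Build retriever search_kwargs that always include the tenant filter."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every retrieval call")
    metadata_filter = dict(extra_filter or {})
    # Set the tenant last so a caller-supplied filter can never override it.
    metadata_filter["tenant_id"] = tenant_id
    return {"k": k, "filter": metadata_filter}


print(tenant_search_kwargs("acme", extra_filter={"category": "hr-policy"}))
# {'k': 4, 'filter': {'category': 'hr-policy', 'tenant_id': 'acme'}}
```

The result plugs straight into the retriever, for example vector_store.as_retriever(search_kwargs=tenant_search_kwargs("acme")).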
Handling Empty Retrieval Results
When a question falls outside the scope of your indexed documents, the vector store still returns the closest matches even if none of them are actually relevant. Without a check, the LLM may generate a confident-sounding answer based on irrelevant context.
results = vector_store.similarity_search_with_score(question, k=4)
# With a cosine index, a higher score means a closer match
RELEVANCE_THRESHOLD = 0.75
relevant_results = [doc for doc, score in results if score >= RELEVANCE_THRESHOLD]
if not relevant_results:
    answer = "I don't have information about this in the knowledge base."
else:
    answer = rag_chain.invoke(question)
Treat 0.75 as a placeholder. The exact threshold depends on your embedding model and distance metric, and you should tune it against a small labeled set of in-scope and out-of-scope questions rather than guessing a number.
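As one way to do that tuning, the sketch below takes the top similarity score for each labeled question and picks the cutoff that classifies the most questions correctly. The function name and all the scores are invented for illustration. A real labeled set should be larger than eight questions.

```python
def pick_threshold(in_scope_scores, out_of_scope_scores):
    """Return (threshold, accuracy) for the cutoff that best separates the two groups."""
    total = len(in_scope_scores) + len(out_of_scope_scores)
    if total == 0:
        raise ValueError("need at least one labeled score")
    best_threshold, best_correct = None, -1
    # Only observed scores need to be tried as candidate cutoffs.
    for candidate in sorted(set(in_scope_scores) | set(out_of_scope_scores)):
        correct = sum(score >= candidate for score in in_scope_scores)
        correct += sum(score < candidate for score in out_of_scope_scores)
        if correct > best_correct:  # ties keep the lower, more permissive cutoff
            best_threshold, best_correct = candidate, correct
    return best_threshold, best_correct / total


# Top-1 similarity score per question (made-up numbers)
in_scope = [0.62, 0.58, 0.71, 0.49]
out_of_scope = [0.31, 0.40, 0.52, 0.28]
print(pick_threshold(in_scope, out_of_scope))  # (0.49, 0.875)
```

In this made-up data the best cutoff is well below 0.75, which is the kind of gap you only find by measuring.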
Streaming the Response
For a chat-style interface, streaming tokens as they are generated makes the application feel responsive instead of making the user wait for the full answer.
for chunk in rag_chain.stream(question):
    print(chunk, end="", flush=True)
LCEL chains support .stream() automatically because every component implements the same Runnable interface. No extra configuration is needed beyond what you already built.
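In practice you often want the full answer as well, for logging or for saving to chat history. A minimal sketch of doing both at once follows. The helper name is hypothetical, and it works with any iterator of string chunks.

```python
def stream_and_collect(token_stream, write=print):
    """Echo each streamed chunk as it arrives and return the assembled answer."""
    parts = []
    for chunk in token_stream:
        write(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


# Demo with a plain list standing in for rag_chain.stream(question)
full_answer = stream_and_collect(["The onboarding ", "process has ", "three steps."])
print()
print(len(full_answer))  # 39
```

With the real chain the call would be stream_and_collect(rag_chain.stream(question)).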
Swapping in a Different LLM Provider
If you want to use Anthropic's Claude instead of OpenAI for generation while keeping OpenAI for embeddings, only the model initialization changes.
# Requires the provider package: pip install langchain-anthropic (and an ANTHROPIC_API_KEY)
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0)
The rest of the chain, including the retriever, the prompt, and the output parser, stays exactly the same. This is the practical benefit of LangChain's standardized interfaces: your retrieval logic and your generation provider are fully decoupled.
Common Mistakes to Avoid
A few mistakes account for most of the bad RAG applications I have reviewed.
Chunking without testing retrieval first. Teams build the full chain, get a bad answer, and assume the LLM is broken. Almost always the actual problem is upstream in chunking or embedding choice. Always inspect raw retrieval results before debugging generation.
Using a generic prompt with no grounding instruction. Without an explicit instruction to stay within the provided context, the model falls back to its training data when retrieval is incomplete, which produces confident, wrong answers.
Skipping metadata at indexing time. Metadata is far easier to add when you first index documents than to retrofit later. Always store source, page number, and any business-relevant filters from the start.
Mismatched embedding models between indexing and querying. If you change your embedding model after indexing, you must re-embed and re-index everything. A query embedded with a different model than the one used for indexing will produce meaningless similarity scores.
No empty-result handling. Vector similarity search always returns something, even if nothing in the index is actually relevant. Without a relevance check, your application will confidently answer questions it has no business answering.
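The embedding mismatch in particular is cheap to guard against in code. One possible approach, sketched below, is to record which model built the index and compare it with the model used at query time before running any search. Where you store that record (a config file, a database row) is up to you. The class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    """The embedding settings an index was built with, or a query will use."""
    model: str
    dimension: int


def assert_same_embedding(indexed: EmbeddingConfig, querying: EmbeddingConfig) -> None:
    """Fail loudly instead of returning meaningless similarity scores."""
    if indexed != querying:
        raise ValueError(
            f"index was built with {indexed} but queries use {querying}; "
            "re-embed and re-index before querying"
        )


indexed_with = EmbeddingConfig(model="text-embedding-3-small", dimension=1536)
assert_same_embedding(indexed_with, EmbeddingConfig("text-embedding-3-small", 1536))
print("embedding config matches")
```

A check like this turns a silent quality problem into an immediate error at startup.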
Next Steps
This tutorial covers the core RAG pipeline end to end. From here, a few directions are worth exploring depending on what your application needs.
If your retrieval quality plateaus despite good chunking, look into hybrid search to combine semantic and keyword matching. If you are choosing between Pinecone and other vector databases for this kind of workload, the Pinecone vs Weaviate vs Milvus vs Qdrant comparison covers the trade-offs in depth. If you want to understand what is happening underneath similarity_search at the index level, read about how HNSW indexing works.
Krunal Kanojiya
Technical Content Writer
I am a technical content writer and former software developer from India. I write clear, in-depth articles on blockchain, AI and machine learning, data engineering, web development, and developer careers. I work at Lucent Innovation now. Before that I wrote about blockchain at Cromtek Solution and did freelance work.
