What is the easiest way to build a RAG application?

LangChain combined with a managed vector database like Pinecone is the fastest path to a working RAG application. LangChain handles document loading, chunking, and chaining the retrieval and generation steps, while Pinecone handles storing and searching the embeddings without any infrastructure to manage. A basic working version can be built in under an hour.

Do I need Pinecone for a RAG application or can I use something else?

You do not need Pinecone specifically. LangChain supports dozens of vector store integrations including Qdrant, Weaviate, Chroma, and pgvector. Pinecone is a strong default for tutorials and prototypes because it requires no local setup and the free tier covers small datasets comfortably.

How much does it cost to run a RAG application with LangChain and Pinecone?

For a small RAG application with under 100,000 vectors and low query volume, Pinecone's free Starter tier and OpenAI's smallest embedding model cost close to nothing per month. Production workloads with millions of vectors and high query volume can run from $50 to several hundred dollars per month on Pinecone, plus LLM API costs that scale with usage.

What is the difference between a RAG chain and a RAG agent in LangChain?

A RAG chain retrieves context once per query and generates an answer in a single LLM call, which is fast and predictable for simple question-answering. A RAG agent decides whether and when to call a retrieval tool, which is more flexible for complex queries that may need multiple searches or no retrieval at all, at the cost of extra latency and LLM calls.

Why is my RAG application giving wrong or hallucinated answers?

The most common causes are poor chunking that splits context awkwardly, a prompt that does not instruct the model to stay grounded in the retrieved context, retrieving too few or irrelevant chunks, or an embedding model mismatch between indexing and querying. Start by inspecting the retrieved chunks directly before assuming the LLM itself is the problem.

Can I use a different LLM instead of OpenAI in this tutorial?

Yes. LangChain supports Anthropic, Google, Cohere, and many other providers through dedicated integration packages. Swapping the LLM means changing the import and the model initialization line; the rest of the retrieval chain stays the same since LangChain standardizes the interface across providers.

Build a RAG App with LangChain and Pinecone (2026 Tutorial)

Most RAG tutorials show you a toy example with five sentences of text and call it done. That is not what building a real RAG application looks like, and it does not prepare you for the decisions that actually matter: how to chunk your documents, which embedding model to use, how to structure the retrieval chain, and how to avoid the model making things up when the context does not contain the answer.

This tutorial builds a complete, working RAG application using LangChain and Pinecone. By the end, you will have a system that loads real documents, indexes them in a production vector database, retrieves relevant context for any question, and generates grounded answers.

I am using LangChain because it standardizes the document loading and chaining logic across providers, and Pinecone because it removes every infrastructure decision so you can focus on the parts of RAG that actually affect answer quality.

What You Are Building

By the end of this tutorial, you will have a Python application that does the following: loads documents from disk, splits them into chunks, converts those chunks into embeddings, stores the embeddings in a Pinecone index, and answers natural language questions by retrieving relevant chunks and passing them to an LLM.

This is the same architecture used in production RAG systems. The difference between this tutorial and a production deployment is scale, not structure.

Prerequisites

You need Python 3.10 or higher, since LangChain dropped support for older versions. You also need two accounts: a Pinecone account with an API key, and an OpenAI account with an API key for both embeddings and the chat model.

If you would rather use a different LLM provider, the retrieval and indexing steps stay identical. Only the model initialization changes, which is covered near the end of this tutorial.

Setting Up the Environment

Install the required packages. LangChain has been restructured into separate packages since the 1.0 stable release, so you need the core package plus the provider-specific integrations.

bash

pip install langchain langchain-classic langchain-openai langchain-pinecone langchain-community langchain-text-splitters pypdf python-dotenv

Set your API keys as environment variables. Never hardcode them directly in your script.

bash

export OPENAI_API_KEY="your-openai-api-key"
export PINECONE_API_KEY="your-pinecone-api-key"

Or, if you prefer a .env file, load it with the python-dotenv package:

python

from dotenv import load_dotenv
load_dotenv()

Loading Your Documents

LangChain provides document loaders for dozens of formats. This tutorial uses PDF files, but the pattern is the same for text files, Markdown, HTML, or Notion exports.

python

from langchain_community.document_loaders import PyPDFLoader
import os

def load_documents(directory_path):
    documents = []
    for filename in os.listdir(directory_path):
        if filename.endswith(".pdf"):
            filepath = os.path.join(directory_path, filename)
            loader = PyPDFLoader(filepath)
            documents.extend(loader.load())
    return documents

raw_documents = load_documents("./data")
print(f"Loaded {len(raw_documents)} pages")

Each loaded item is a LangChain Document object containing the page content and metadata such as the source file and page number. That metadata matters later when you want to cite sources in your answers.

Splitting Text Into Chunks

You cannot feed an entire document into an LLM context window efficiently, and large chunks hurt retrieval precision because they mix multiple topics into a single embedding. The fix is splitting documents into smaller, semantically coherent chunks.

python

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=120,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(raw_documents)
print(f"Split into {len(chunks)} chunks")

chunk_size controls how many characters each chunk holds. chunk_overlap keeps a sliding window of shared text between consecutive chunks so a sentence that gets cut at a boundary still has context in the neighboring chunk. The separators list tells the splitter to prefer breaking on paragraph boundaries first, then sentences, only falling back to mid-word splits as a last resort.

800 characters with 120 of overlap is a reasonable default for technical documentation. If your content has long, dense paragraphs, increase chunk size. If it is mostly short, list-like content, decrease it.
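To see what chunk_size and chunk_overlap do in isolation, here is a minimal sketch of a plain sliding window over characters. It is deliberately simpler than RecursiveCharacterTextSplitter, which also tries the separators in order; the function name sliding_window_chunks and the tiny sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return windows


demo_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, which is the same idea as the 120-character overlap above, just at a scale you can read.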

Choosing an Embedding Model

The embedding model converts each text chunk into a vector that captures its semantic meaning. OpenAI's text-embedding-3-small is a strong default: it produces 1536-dimension vectors, costs roughly $0.02 per million tokens, and performs within a few percentage points of the larger text-embedding-3-large model for most retrieval tasks.

python

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Only switch to the large model if you have measured a real retrieval quality gap on your own evaluation set. The smaller model is roughly six times cheaper at list price, and the quality difference is marginal for most use cases.
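To put the price difference in concrete terms, here is a rough back-of-the-envelope estimate. It assumes about four characters per token (a common heuristic for English text, not a real tokenizer count) and list prices of $0.02 and $0.13 per million tokens, which may have changed by the time you read this. The function name and the 50,000-chunk corpus are hypothetical.

```python
# List prices in USD per million tokens (assumption: check the current pricing page).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}
CHARS_PER_TOKEN = 4  # rough heuristic for English text, not an exact count


def estimate_embedding_cost(num_chunks: int, avg_chunk_chars: int, model: str) -> float:
    """Estimate the one-time cost in USD of embedding a corpus."""
    tokens = num_chunks * avg_chunk_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


for model_name in PRICE_PER_MILLION_TOKENS:
    cost = estimate_embedding_cost(num_chunks=50_000, avg_chunk_chars=800, model=model_name)
    print(f"{model_name}: ~${cost:.2f}")
# text-embedding-3-small: ~$0.20
# text-embedding-3-large: ~$1.30
```

At this scale either model is cheap to index with; the gap starts to matter when you re-index often or embed a high volume of queries.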

Creating the Pinecone Index

Before storing vectors, you need a Pinecone index configured with the right dimension to match your embedding model. text-embedding-3-small produces 1536-dimension vectors, so the index must be created with dimension=1536.

python

import os
import time

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "rag-tutorial"

if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait for the index to finish initializing
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)

This block checks whether the index already exists before creating it, which makes the script safe to rerun without errors. The metric="cosine" setting matches how OpenAI's embedding models are designed to be compared.
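If the cosine metric feels abstract, this is all it computes. The sketch below uses plain Python lists in place of real 1536-dimension embeddings, and the function name is my own; Pinecone does the equivalent calculation server-side.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: cosine ignores magnitude.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors share nothing.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

Because the score depends only on direction, two chunks of very different lengths can still score as near-identical if they are about the same thing.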

Storing Your Chunks in Pinecone

With the index ready, connect LangChain's Pinecone integration and push your chunks. The PineconeVectorStore.from_documents method handles the embedding and the upsert in one call.

python

from langchain_pinecone import PineconeVectorStore

vector_store = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
)

For large document sets, this single call batches the upserts automatically. If you are indexing millions of chunks, consider batching manually and adding a short delay between batches to avoid rate limits, but for most tutorials and small production datasets this one-liner is sufficient.
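If you do end up batching manually, the loop can look something like the sketch below. The helper names, the batch size of 100, and the half-second delay are assumptions to tune against your own rate limits. The demo passes a plain list's append method as a stand-in so it runs without any API calls.

```python
import time
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(
    add_batch: Callable[[Sequence], object],
    items: Sequence,
    batch_size: int = 100,
    delay_seconds: float = 0.5,
) -> int:
    """Send items to add_batch one batch at a time, pausing between batches."""
    total = 0
    for batch in iter_batches(items, batch_size):
        add_batch(batch)
        total += len(batch)
        time.sleep(delay_seconds)
    return total


# Stand-in demo: collect batches in a list instead of calling a vector store.
received = []
indexed_count = index_in_batches(received.append, list(range(250)), batch_size=100, delay_seconds=0)
print(indexed_count, [len(batch) for batch in received])
# 250 [100, 100, 50]
```

With a real store, you would pass the store's add_documents method as add_batch and the chunk list as items.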

If you already have an index populated and just want to connect to it without re-indexing, use this instead.

python

vector_store = PineconeVectorStore.from_existing_index(
    index_name=index_name,
    embedding=embeddings,
)

Testing Retrieval Before Building the Full Chain

Before wiring up the LLM, verify that retrieval itself is working correctly. This step catches chunking and embedding problems early, before they get masked by the LLM generating a plausible-sounding wrong answer.

python

query = "What are the main steps in the onboarding process?"
results = vector_store.similarity_search(query, k=4)

for i, doc in enumerate(results):
    print(f"--- Result {i + 1} ---")
    print(doc.page_content[:200])
    print(f"Source: {doc.metadata.get('source')}, Page: {doc.metadata.get('page')}")
    print()

Read the actual retrieved chunks. If they are not relevant to the query, the problem is in your chunking strategy or embedding choice, not in the LLM. This is the single most useful debugging step for a RAG application that gives bad answers, and it is the one most tutorials skip.

Building the Retrieval Chain with LCEL

LangChain Expression Language (LCEL) is the standard way to compose chains in LangChain v1.x. It uses the pipe operator to connect components, which gives you automatic streaming, batching, and async support without extra code.

python

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

retriever = vector_store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template("""
Answer the question using only the context below. If the context does not
contain enough information to answer, say "I don't have enough information
to answer this question" instead of guessing.

Context:
{context}

Question:
{question}

Answer:
""")

def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

The prompt explicitly instructs the model to admit when it does not know the answer rather than fabricating one. This single instruction prevents a large share of hallucination problems in RAG applications and should be in every production RAG prompt you write.

Setting temperature=0 makes the model's output more deterministic, which matters for factual question-answering where you want consistent answers rather than creative variation.

Running a Query Through the Full Pipeline

With the chain built, answering a question is a single call.

python

question = "What are the main steps in the onboarding process?"
answer = rag_chain.invoke(question)
print(answer)

Behind this single line, LangChain runs the full pipeline: embed the question, search Pinecone for the four most similar chunks, format them into the prompt template alongside the question, send the completed prompt to the LLM, and parse the output into a plain string.

Returning Sources Alongside the Answer

A production RAG application almost always needs to show where an answer came from, both for user trust and for debugging. Use create_retrieval_chain instead of the raw LCEL pipe when you need the source documents returned alongside the answer.

python

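# In LangChain 1.x these helpers live in the langchain-classic package
# (pip install langchain-classic); on 0.x they were imported from langchain.chains.
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain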

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the question using only the context below. "
               "If you cannot find the answer, say so clearly.\n\n{context}"),
    ("human", "{input}"),
])

document_chain = create_stuff_documents_chain(llm, qa_prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)

result = retrieval_chain.invoke({"input": "What are the main steps in the onboarding process?"})

print("Answer:")
print(result["answer"])
print("\nSources:")
for doc in result["context"]:
    print(f"- {doc.metadata.get('source')}, page {doc.metadata.get('page')}")

This pattern, known as the "stuff" strategy, inserts every retrieved chunk into a single prompt and sends it to the LLM in one call. It is the simplest and most common approach, and it works well as long as your retrieved context fits comfortably inside the model's context window.
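One way to keep the stuffed context inside a budget is to trim the retrieved chunks before they reach the prompt. This is a rough sketch, not something LangChain does for you: it estimates tokens at about four characters each (an assumption, not a tokenizer count), and fit_to_budget and the 500-token budget are hypothetical.

```python
CHARS_PER_TOKEN = 4  # rough heuristic; swap in a real tokenizer for accurate counts


def fit_to_budget(chunk_texts: list[str], max_context_tokens: int) -> list[str]:
    """Keep chunks in ranked order until the estimated token budget is used up."""
    kept = []
    used = 0
    for text in chunk_texts:
        estimated_tokens = len(text) // CHARS_PER_TOKEN + 1
        if used + estimated_tokens > max_context_tokens:
            break  # later chunks are ranked lower, so stop rather than skip ahead
        kept.append(text)
        used += estimated_tokens
    return kept


# Four stand-in chunks of 800 characters each (about 201 estimated tokens apiece).
retrieved_texts = ["a" * 800, "b" * 800, "c" * 800, "d" * 800]
kept_texts = fit_to_budget(retrieved_texts, max_context_tokens=500)
print(len(kept_texts))
# 2
```

Dropping from the end preserves the highest-ranked chunks, which is usually what you want when the budget is tight.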

Filtering Retrieval by Metadata

Real applications rarely want to search across the entire index for every query. If your documents have metadata like a category, department, or date, filter retrieval to only the relevant subset.

python

filtered_retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"category": "hr-policy"},
    }
)

This both improves answer relevance and reduces the chance of the model pulling context from an unrelated document set. If you are building a multi-tenant application, filtering by tenant ID at the retrieval layer is a requirement, not an optimization.

Handling Empty Retrieval Results

When a question falls outside the scope of your indexed documents, the vector store still returns the closest matches even if none of them are actually relevant. Without a check, the LLM may generate a confident-sounding answer based on irrelevant context.

python

results = vector_store.similarity_search_with_score(question, k=4)

RELEVANCE_THRESHOLD = 0.75

relevant_results = [doc for doc, score in results if score >= RELEVANCE_THRESHOLD]

if not relevant_results:
    answer = "I don't have information about this in the knowledge base."
else:
    answer = rag_chain.invoke(question)

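The exact threshold depends on your embedding model and distance metric, so treat 0.75 as a placeholder: with some embedding models, scores for genuinely relevant chunks sit well below it. On a cosine index a higher score means a closer match, which is why the check uses >=. Tune the value against a small labeled set of in-scope and out-of-scope questions rather than guessing a number. Note also that rag_chain.invoke runs its own retrieval, so this check costs one extra vector search per question.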
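Tuning against a labeled set can be as simple as trying each observed score as a cutoff and keeping the one that classifies the most questions correctly. The sketch below assumes you have already recorded the top match score for each labeled question; the scores shown are made-up illustrative numbers, not measurements, and best_threshold is a hypothetical helper.

```python
def best_threshold(in_scope_scores: list[float], out_of_scope_scores: list[float]) -> tuple[float, float]:
    """Return the cutoff (and its accuracy) that best separates the two labeled groups."""
    candidates = sorted(set(in_scope_scores) | set(out_of_scope_scores))
    total = len(in_scope_scores) + len(out_of_scope_scores)
    best_cutoff, best_accuracy = candidates[0], -1.0
    for cutoff in candidates:
        # In-scope questions should pass the cutoff; out-of-scope ones should fail it.
        correct = sum(s >= cutoff for s in in_scope_scores)
        correct += sum(s < cutoff for s in out_of_scope_scores)
        accuracy = correct / total
        if accuracy > best_accuracy:
            best_cutoff, best_accuracy = cutoff, accuracy
    return best_cutoff, best_accuracy


# Illustrative top-match scores only; record your own from similarity_search_with_score.
in_scope = [0.62, 0.58, 0.71, 0.49]
out_of_scope = [0.31, 0.44, 0.38, 0.52]
tuned_threshold, tuned_accuracy = best_threshold(in_scope, out_of_scope)
print(tuned_threshold, tuned_accuracy)
# 0.49 0.875
```

A handful of questions per group is enough to show whether the two score ranges separate cleanly or overlap so much that a threshold alone will not work.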

Streaming the Response

For a chat-style interface, streaming tokens as they are generated makes the application feel responsive instead of making the user wait for the full answer.

python

for chunk in rag_chain.stream(question):
    print(chunk, end="", flush=True)

LCEL chains support .stream() automatically because every component implements the same Runnable interface. No extra configuration is needed beyond what you already built.
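In a real chat interface you usually need the complete answer after streaming finishes, for example to store it in the conversation history. A small helper can print and collect at the same time. The sketch below uses a hard-coded list as a stand-in for the stream so it runs offline; stream_and_collect is a name I made up.

```python
from typing import Iterable


def stream_and_collect(token_stream: Iterable[str]) -> str:
    """Print each piece as it arrives and return the full text at the end."""
    pieces = []
    for piece in token_stream:
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()  # finish the line once the stream ends
    return "".join(pieces)


# Stand-in for rag_chain.stream(question), which yields plain strings
# because the chain ends with StrOutputParser.
fake_stream = iter(["The onboarding ", "process has ", "three steps."])
full_answer = stream_and_collect(fake_stream)
```

With the chain from this tutorial, the call would be stream_and_collect(rag_chain.stream(question)).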

Swapping in a Different LLM Provider

If you want to use Anthropic's Claude instead of OpenAI for generation while keeping OpenAI for embeddings, only the model initialization changes.

python

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0)

The rest of the chain, including the retriever, the prompt, and the output parser, stays exactly the same. This is the practical benefit of LangChain's standardized interfaces: your retrieval logic and your generation provider are fully decoupled.

Common Mistakes to Avoid

A few mistakes account for most of the bad RAG applications I have reviewed.

Building the full chain without testing retrieval first. Teams wire everything together, get a bad answer, and assume the LLM is broken. Almost always the actual problem is upstream in chunking or embedding choice. Always inspect raw retrieval results before debugging generation.

Using a generic prompt with no grounding instruction. Without an explicit instruction to stay within the provided context, the model falls back to its training data when retrieval is incomplete, which produces confident, wrong answers.

Skipping metadata at indexing time. Metadata is far easier to add when you first index documents than to retrofit later. Always store source, page number, and any business-relevant filters from the start.

Mismatched embedding models between indexing and querying. If you change your embedding model after indexing, you must re-embed and re-index everything. A query embedded with a different model than the one used for indexing will produce meaningless similarity scores.

No empty-result handling. Vector similarity search always returns something, even if nothing in the index is actually relevant. Without a relevance check, your application will confidently answer questions it has no business answering.
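Of these, the embedding mismatch is the cheapest to guard against in code. One option is to record which model built the index and check it before every query session. This is a sketch under assumptions: the index_metadata record is something you would store yourself (a JSON file or a database row, for example), and neither it nor check_embedding_model is part of LangChain or Pinecone.

```python
def check_embedding_model(index_metadata: dict, query_model: str) -> None:
    """Raise if the query-time embedding model differs from the one used for indexing."""
    indexed_with = index_metadata.get("embedding_model")
    if indexed_with != query_model:
        raise ValueError(
            f"Index was built with {indexed_with!r} but queries use {query_model!r}; "
            "re-embed and re-index before querying."
        )


# Hypothetical record saved at indexing time.
index_metadata = {
    "index_name": "rag-tutorial",
    "embedding_model": "text-embedding-3-small",
    "dimension": 1536,
}

check_embedding_model(index_metadata, "text-embedding-3-small")  # passes silently

try:
    check_embedding_model(index_metadata, "text-embedding-3-large")
except ValueError as err:
    mismatch_error = str(err)
    print(mismatch_error)
```

A dimension mismatch would fail loudly on its own, but two models with the same dimension would not, which is the case this check exists for.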

Next Steps

This tutorial covers the core RAG pipeline end to end. From here, a few directions are worth exploring depending on what your application needs.

If your retrieval quality plateaus despite good chunking, look into hybrid search to combine semantic and keyword matching. If you are choosing between Pinecone and other vector databases for this kind of workload, the Pinecone vs Weaviate vs Milvus vs Qdrant comparison covers the trade-offs in depth. If you want to understand what is happening underneath similarity_search at the index level, read about how HNSW indexing works.


Krunal Kanojiya
