RAG·11 min read·2,194 words

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Knowing how to build a transformer is one thing. Knowing how to use one in production is another. This article covers prompt engineering, few-shot learning, chain-of-thought, retrieval-augmented generation, and why the model's behavior shifts so dramatically based on how you frame your request.

Krunal Kanojiya

April 20, 2026·Updated June 06, 2026

#prompting#rag#in-context-learning#few-shot#chain-of-thought#retrieval-augmented-generation#llm#prompt-engineering

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Articles 1 through 8 covered how models are built and trained. This one covers how to actually use them.

That shift matters. A lot of engineers who understand transformers at a technical level still write prompts that produce mediocre results. And a lot of people who write excellent prompts have no idea what is happening inside the model. Ideally you understand both, because the mechanics explain the techniques.

This is Article 9 in the series. Article 8 covered fine-tuning and RLHF, which is what makes prompting meaningful in the first place. The aligned model follows instructions because it was trained to. Article 10, the final article, covers evaluation, inference, and deployment, including how you measure whether your prompts and pipelines are working and how you ship them to production.

Why prompting works at all

When you send a message to an LLM, it computes a probability distribution over the next token conditioned on everything in the context. That "everything" includes the system prompt, the conversation history, and your current message. The model is trying to predict what token comes next given all of that input.

The aligned model was trained on examples where helpful, on-topic responses followed well-formed prompts. So well-formed prompts produce better completions because they more closely match the distribution of contexts in the training data where good responses appeared.

This is not manipulation. The model has learned to generalize from patterns. Prompt engineering is learning which patterns produce the outputs you want.

System prompts: the persistent instruction layer

The system prompt is processed before any user message and stays in the model's context throughout the conversation. It is the highest-leverage place to define model behavior.

python

# illustrative API call showing system prompt structure

import openai  # or any compatible client

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a technical documentation assistant for a developer API.
Your responses should:
- Use precise technical language
- Include code examples in Python when relevant
- Acknowledge uncertainty rather than guessing
- Keep answers focused and under 300 words unless the question requires more depth

Do not make up API endpoints or parameter names you are not certain about."""
        },
        {
            "role": "user",
            "content": "How do I paginate results from the list users endpoint?"
        }
    ]
)

A well-written system prompt does three things. It defines the model's role, which activates the patterns the model learned for that persona. It sets behavioral constraints, which limit the probability of certain response patterns. It specifies format expectations, which guide the structure of the output.

One practical rule: be specific about what you do not want. "Do not make up API endpoints" is more reliable than "be accurate," because it directly addresses the failure mode you are trying to prevent.

Few-shot prompting: teaching by example

Few-shot prompting adds examples of (input, output) pairs to the context before the actual question. The model infers the pattern from the examples and applies it to the new input.

python

few_shot_prompt = """Extract the key entities from each customer message.
Format: {"persons": [], "companies": [], "products": []}

Message: "I spoke with Sarah from Acme Corp about their new CRM software."
Entities: {"persons": ["Sarah"], "companies": ["Acme Corp"], "products": ["CRM software"]}

Message: "David at TechStart wants to upgrade their Pro plan."
Entities: {"persons": ["David"], "companies": ["TechStart"], "products": ["Pro plan"]}

Message: "Can you connect me with the team at GlobalBank about their API integration?"
Entities:"""

# The model completes the final entities based on the pattern

This works because the model learned during pre-training that patterns repeat in text. It sees two examples of the extraction format and infers that the third input should be processed the same way.

Three things determine whether few-shot prompting helps or hurts. The examples should cover the edge cases you care about. The format should be consistent across all examples. And the examples should be drawn from the same distribution as your actual inputs. Examples that are too easy or too different from real inputs do not transfer.

Zero-shot prompting works when the task is common enough that the model saw many examples of it during training. Few-shot prompting helps when the task is unusual, when the output format is non-standard, or when you need the model to handle a specific pattern that would not appear naturally.

Chain-of-thought: making reasoning visible

Chain-of-thought (CoT) prompting asks the model to reason step by step before producing an answer. This dramatically improves performance on problems that require multiple reasoning steps.

python

# without chain-of-thought
simple_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?
Answer:"""

# with chain-of-thought
cot_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?

Let's work through this step by step:"""

# the model now generates reasoning steps before the answer:
# "Products sold: 340
#  Products returned: 340 * 0.08 = 27.2 ≈ 27
#  Net products sold: 340 - 27 = 313
#  Revenue: 313 * $72 = $22,536
#  Cost of all purchased products: 340 * $45 = $15,300
#  Refund cost for returned items: 27 * $45 = $1,215 (we already paid for them)
#  Actually, refunded customers paid $72 each, so refund cost is: 27 * $72 = $1,944
#  Net revenue: 340 * $72 - 27 * $72 = 313 * $72 = $22,536
#  Cost basis: 340 * $45 = $15,300
#  Net profit: $22,536 - $15,300 = $7,236"

Why does generating the reasoning steps help the final answer? The intermediate tokens become part of the context for each subsequent token. When the model writes out a calculation explicitly, that correct calculation is in the context when it produces the final answer. The reasoning process conditions the answer.

The zero-shot version just says "Let's think through this step by step." This alone improves accuracy on many tasks. You can also show examples of step-by-step reasoning if you want more control over the reasoning style.

Retrieval-augmented generation

The limitation of prompting alone is that the model can only use knowledge from its pre-training. For questions about your specific codebase, recent news, proprietary documents, or anything that happened after the training cutoff, the model either guesses or refuses.

RAG solves this by retrieving relevant documents and including them in the context before generating an answer. For a full explanation of how RAG works from scratch, see what is RAG in AI.

The basic pipeline has three steps. First, documents are chunked and converted to embeddings (covered in Article 4) and stored in a vector database. Second, when a query arrives, it is embedded and the most similar document chunks are retrieved using cosine similarity. Third, the retrieved chunks are added to the prompt before the model generates its response.

python

import numpy as np
from typing import List

# simplified RAG pipeline

def embed(text: str) -> np.ndarray:
    """Call an embedding model to get a vector representation."""
    # in production: call OpenAI embeddings, Cohere, or a local model
    pass

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class VectorStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict = None):
        embedding = embed(text)
        self.documents.append({"text": text, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = embed(query)
        scores = [cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [self.documents[i] for i in top_indices]

def rag_generate(query: str, store: VectorStore, model_client) -> str:
    # retrieve relevant chunks
    relevant_docs = store.search(query, top_k=3)
    context = "\n\n".join(doc["text"] for doc in relevant_docs)

    # build prompt with retrieved context
    prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say so.

Context:
{context}

Question: {query}

Answer:"""

    response = model_client.generate(prompt)
    return response

The vector database is what makes RAG scalable. Popular production options include Pinecone, Weaviate, Qdrant, and pgvector for teams already on Postgres. For local prototyping, FAISS from Meta is fast and free.

RAG's core advantage is that it grounds the model in specific documents. This reduces hallucination on factual questions because the model can cite and follow the retrieved text rather than generating from parametric memory alone. It also keeps your product accurate as your knowledge base updates, because you update the vector database rather than retraining the model.

The failure modes are real and worth knowing. Retrieval can fail to find the right document if the query and document are phrased very differently. Chunk boundaries matter: a question about a topic that spans two chunks may retrieve incomplete context. Long retrieved contexts can confuse the model if the relevant information is buried.

Practical prompt patterns that work

A few patterns that show up repeatedly in production LLM applications, grounded in what the model has learned to do:

Role definition before content: Starting with "You are a [role]" activates the patterns the model learned in that role's context. This is not role-playing for fun. The model was trained on text where different roles produced different styles, and the role specification shifts the probability distribution toward those patterns.

Explicit output format specification: Telling the model to respond in JSON, with specific keys, or in a numbered list dramatically increases compliance. The model learned that certain phrases like "Respond in the following JSON format:" are followed by structured output in training data.

Negative constraints: "Do not include any information that is not in the provided documents" is more reliable than "Only use provided documents" because the former matches patterns where such constraints appeared in the training data more precisely.

Temperature and sampling parameters: Temperature controls the entropy of the output distribution. Lower temperature concentrates probability mass, producing more deterministic outputs. Higher temperature spreads it, producing more varied outputs. For factual tasks: temperature around 0.2. For creative tasks: 0.7 to 1.0. For most production tasks: 0.3 to 0.5.

python

# temperature effect on token selection
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.5, 0.8, 0.2, -0.5])

for temp in [0.1, 0.5, 1.0, 2.0]:
    probs = F.softmax(logits / temp, dim=0)
    print(f"temp={temp:.1f}: {probs.numpy().round(3)}")

# temp=0.1: [0.999 0.001 0.   0.   0.  ]  (almost deterministic)
# temp=0.5: [0.891 0.087 0.017 0.004 0.001] (concentrated)
# temp=1.0: [0.676 0.202 0.089 0.024 0.009] (moderate spread)
# temp=2.0: [0.462 0.262 0.173 0.072 0.031] (more uniform)

When to use RAG versus fine-tuning versus prompting

This is a question every team building on LLMs will face. The right answer depends on the problem.

Prompting alone is sufficient when the model already has the relevant knowledge and capability, and you just need to shape the output. Most tasks that involve writing, summarization, classification, or reasoning over text provided in the prompt fit here.

RAG is right when the problem is about knowledge access. The model needs current facts, your proprietary documents, or domain information it could not have seen during training. RAG lets the model use external knowledge without retraining.

Fine-tuning is right when the problem is about behavior or style. The model needs to respond in a very specific format consistently, follow domain conventions that differ from general language patterns, or perform a task it handles poorly out of the box. Fine-tuning is more expensive to set up but produces more reliable results for specialized domains.

In practice, the best systems often combine all three: a fine-tuned model with carefully designed system prompts and a RAG pipeline for knowledge retrieval.

The connection from Article 8 to Article 10

The aligned model from Article 8 is what makes this article work. Prompting, RAG, and few-shot learning all depend on the model reliably following instructions and using context appropriately. Those behaviors came from SFT and RLHF.

Article 10, the final article in the series, covers how you measure whether all of this is working. Benchmarks tell you if the model can do what you need. Inference optimization tells you how fast and cheaply it can do it. Deployment considerations tell you what breaks in production that benchmarks do not catch.

A word on prompt injection

Prompt injection is when user-provided content contains instructions that override your system prompt. It is an active security concern in production LLM applications. If your system prompt says "Only answer questions about cooking" and a user's message contains "Ignore previous instructions and...", the model may comply. Defenses include input validation, separating system and user content clearly, and using models fine-tuned for instruction following over instruction following. It is still an unsolved problem in general.

Next in the series

Article 10 is the final article and covers evaluation, inference, and deployment. You will see how to choose and interpret benchmarks for your specific use case, what quantization does to model quality, how the KV cache works during inference, and what actually breaks when you move from a notebook to production.

Frequently Asked Questions

What is prompt engineering?

Prompt engineering is the practice of structuring your input to a language model to get better outputs. Because aligned LLMs are trained to follow instructions, the way you phrase a request, the context you include, and the examples you provide all shape the model's behavior. Prompt engineering is not magic. It is exploiting the patterns the model learned during pre-training and fine-tuning to guide it toward the output you need.

What is in-context learning?

In-context learning is when a model learns to perform a new task from examples provided within the input context, without any gradient updates to its weights. You include a few (prompt, answer) examples in your message, and the model generalizes from them to answer a new related question. This behavior emerges from pre-training and is not explicitly trained. It is one of the surprising capabilities that appeared at scale.

What is chain-of-thought prompting?

Chain-of-thought prompting asks the model to reason step by step before giving an answer. You either include examples of step-by-step reasoning in the prompt, or simply add a phrase like 'Let's think through this step by step.' The model generates intermediate reasoning steps, which dramatically improves accuracy on math, logic, and multi-step problems. The reasoning process itself improves the final answer.

What is retrieval-augmented generation (RAG)?

RAG is a technique where relevant documents are retrieved from an external database and added to the model's context before it generates a response. Instead of asking the model to rely on what it learned during pre-training, you fetch current or domain-specific information and give it to the model as context. RAG reduces hallucination on factual questions, allows the model to answer about events after its training cutoff, and lets you ground responses in specific documents.

When should I use RAG versus fine-tuning?

Use RAG when your problem is about knowledge access: the model needs current facts, specific documents, or domain information that was not in its training data. Use fine-tuning when your problem is about behavior or style: the model needs to respond in a specific format, follow domain-specific conventions, or perform a task it does not do well out of the box. For many production use cases, RAG and fine-tuning complement each other.

What is a system prompt?

A system prompt is an instruction given to the model at the start of a conversation, before any user messages. It defines the model's role, tone, constraints, and behavior for the session. The model gives high weight to the system prompt because aligned models were trained to treat it as a persistent instruction set. System prompts are how you configure an aligned model's behavior without modifying its weights.

Follow on Google

Add as a preferred source in Search & Discover

Add as preferred source

Appears in Google Discover

Krunal Kanojiya

Technical Content Writer

I am a technical writer and former software developer from India. I publish practical tutorials and in-depth guides on AI engineering, data engineering, programming, algorithms, blockchain, and modern software development.

GitHub LinkedIn X

RAG vs LangChain: What They Are, How They Relate, and Which One You Actually Need

May 09, 2026 · 16 min read

RAG vs Traditional Search: What Changed, What Did Not, and Why BM25 Is Not Dead

May 08, 2026 · 15 min read

Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)

May 07, 2026 · 18 min read

RAG·11 min read·2,194 words

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Krunal Kanojiya

April 20, 2026·Updated June 06, 2026

#prompting#rag#in-context-learning#few-shot#chain-of-thought#retrieval-augmented-generation#llm#prompt-engineering

Articles 1 through 8 covered how models are built and trained. This one covers how to actually use them.

Why prompting works at all

This is not manipulation. The model has learned to generalize from patterns. Prompt engineering is learning which patterns produce the outputs you want.

System prompts: the persistent instruction layer

The system prompt is processed before any user message and stays in the model's context throughout the conversation. It is the highest-leverage place to define model behavior.

python

# illustrative API call showing system prompt structure

import openai  # or any compatible client

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a technical documentation assistant for a developer API.
Your responses should:
- Use precise technical language
- Include code examples in Python when relevant
- Acknowledge uncertainty rather than guessing
- Keep answers focused and under 300 words unless the question requires more depth

Do not make up API endpoints or parameter names you are not certain about."""
        },
        {
            "role": "user",
            "content": "How do I paginate results from the list users endpoint?"
        }
    ]
)

Few-shot prompting: teaching by example

Few-shot prompting adds examples of (input, output) pairs to the context before the actual question. The model infers the pattern from the examples and applies it to the new input.

python

few_shot_prompt = """Extract the key entities from each customer message.
Format: {"persons": [], "companies": [], "products": []}

Message: "I spoke with Sarah from Acme Corp about their new CRM software."
Entities: {"persons": ["Sarah"], "companies": ["Acme Corp"], "products": ["CRM software"]}

Message: "David at TechStart wants to upgrade their Pro plan."
Entities: {"persons": ["David"], "companies": ["TechStart"], "products": ["Pro plan"]}

Message: "Can you connect me with the team at GlobalBank about their API integration?"
Entities:"""

# The model completes the final entities based on the pattern

This works because the model learned during pre-training that patterns repeat in text. It sees two examples of the extraction format and infers that the third input should be processed the same way.

Chain-of-thought: making reasoning visible

Chain-of-thought (CoT) prompting asks the model to reason step by step before producing an answer. This dramatically improves performance on problems that require multiple reasoning steps.

python

# without chain-of-thought
simple_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?
Answer:"""

# with chain-of-thought
cot_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?

Let's work through this step by step:"""

# the model now generates reasoning steps before the answer:
# "Products sold: 340
#  Products returned: 340 * 0.08 = 27.2 ≈ 27
#  Net products sold: 340 - 27 = 313
#  Revenue: 313 * $72 = $22,536
#  Cost of all purchased products: 340 * $45 = $15,300
#  Refund cost for returned items: 27 * $45 = $1,215 (we already paid for them)
#  Actually, refunded customers paid $72 each, so refund cost is: 27 * $72 = $1,944
#  Net revenue: 340 * $72 - 27 * $72 = 313 * $72 = $22,536
#  Cost basis: 340 * $45 = $15,300
#  Net profit: $22,536 - $15,300 = $7,236"

Retrieval-augmented generation

RAG solves this by retrieving relevant documents and including them in the context before generating an answer. For a full explanation of how RAG works from scratch, see what is RAG in AI.

python

import numpy as np
from typing import List

# simplified RAG pipeline

def embed(text: str) -> np.ndarray:
    """Call an embedding model to get a vector representation."""
    # in production: call OpenAI embeddings, Cohere, or a local model
    pass

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class VectorStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict = None):
        embedding = embed(text)
        self.documents.append({"text": text, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = embed(query)
        scores = [cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [self.documents[i] for i in top_indices]

def rag_generate(query: str, store: VectorStore, model_client) -> str:
    # retrieve relevant chunks
    relevant_docs = store.search(query, top_k=3)
    context = "\n\n".join(doc["text"] for doc in relevant_docs)

    # build prompt with retrieved context
    prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say so.

Context:
{context}

Question: {query}

Answer:"""

    response = model_client.generate(prompt)
    return response

Practical prompt patterns that work

A few patterns that show up repeatedly in production LLM applications, grounded in what the model has learned to do:

python

# temperature effect on token selection
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.5, 0.8, 0.2, -0.5])

for temp in [0.1, 0.5, 1.0, 2.0]:
    probs = F.softmax(logits / temp, dim=0)
    print(f"temp={temp:.1f}: {probs.numpy().round(3)}")

# temp=0.1: [0.999 0.001 0.   0.   0.  ]  (almost deterministic)
# temp=0.5: [0.891 0.087 0.017 0.004 0.001] (concentrated)
# temp=1.0: [0.676 0.202 0.089 0.024 0.009] (moderate spread)
# temp=2.0: [0.462 0.262 0.173 0.072 0.031] (more uniform)

When to use RAG versus fine-tuning versus prompting

This is a question every team building on LLMs will face. The right answer depends on the problem.

In practice, the best systems often combine all three: a fine-tuned model with carefully designed system prompts and a RAG pipeline for knowledge retrieval.

The connection from Article 8 to Article 10

A word on prompt injection

Next in the series

Frequently Asked Questions

What is prompt engineering?

What is in-context learning?

What is chain-of-thought prompting?

What is retrieval-augmented generation (RAG)?

When should I use RAG versus fine-tuning?

What is a system prompt?

Follow on Google

Add as a preferred source in Search & Discover

Add as preferred source

Appears in Google Discover

Krunal Kanojiya

Technical Content Writer

GitHub LinkedIn X

RAG vs LangChain: What They Are, How They Relate, and Which One You Actually Need

May 09, 2026 · 16 min read

RAG vs Traditional Search: What Changed, What Did Not, and Why BM25 Is Not Dead

May 08, 2026 · 15 min read

Why RAG Fails: Every Failure Mode and How to Fix Each One (2026)

May 07, 2026 · 18 min read

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Why prompting works at all

System prompts: the persistent instruction layer

Few-shot prompting: teaching by example

Chain-of-thought: making reasoning visible

Retrieval-augmented generation

Practical prompt patterns that work

When to use RAG versus fine-tuning versus prompting

The connection from Article 8 to Article 10

Next in the series

Frequently Asked Questions

Krunal Kanojiya

Related Posts

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Why prompting works at all

System prompts: the persistent instruction layer

Few-shot prompting: teaching by example

Chain-of-thought: making reasoning visible

Retrieval-augmented generation

Practical prompt patterns that work

When to use RAG versus fine-tuning versus prompting

The connection from Article 8 to Article 10

Next in the series

Frequently Asked Questions

Krunal Kanojiya

Related Posts