Tech11 min read2,178 words

Prompting, RAG, and In-Context Learning: Using LLMs in Real Products

Knowing how to build a transformer is one thing. Knowing how to use one in production is another. This article covers prompt engineering, few-shot learning, chain-of-thought, retrieval-augmented generation, and why the model's behavior shifts so dramatically based on how you frame your request.

K

Krunal Kanojiya

Share:
#prompting#rag#in-context-learning#few-shot#chain-of-thought#retrieval-augmented-generation#llm#prompt-engineering

Articles 1 through 8 covered how models are built and trained. This one covers how to actually use them.

That shift matters. A lot of engineers who understand transformers at a technical level still write prompts that produce mediocre results. And a lot of people who write excellent prompts have no idea what is happening inside the model. Ideally you understand both, because the mechanics explain the techniques.

This is Article 9 in the series. Article 8 covered fine-tuning and RLHF, which is what makes prompting meaningful in the first place. The aligned model follows instructions because it was trained to. Article 10, the final article, covers evaluation, inference, and deployment — how you measure whether your prompts and pipelines are working and how you ship them to production.


Why prompting works at all

When you send a message to an LLM, it computes a probability distribution over the next token conditioned on everything in the context. That "everything" includes the system prompt, the conversation history, and your current message. The model is trying to predict what token comes next given all of that input.

The aligned model was trained on examples where helpful, on-topic responses followed well-formed prompts. So well-formed prompts produce better completions because they more closely match the distribution of contexts in the training data where good responses appeared.

This is not manipulation. The model has learned to generalize from patterns. Prompt engineering is learning which patterns produce the outputs you want.


System prompts: the persistent instruction layer

The system prompt is processed before any user message and stays in the model's context throughout the conversation. It is the highest-leverage place to define model behavior.

python
# illustrative API call showing system prompt structure

import openai  # or any compatible client

response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a technical documentation assistant for a developer API.
Your responses should:
- Use precise technical language
- Include code examples in Python when relevant
- Acknowledge uncertainty rather than guessing
- Keep answers focused and under 300 words unless the question requires more depth

Do not make up API endpoints or parameter names you are not certain about."""
        },
        {
            "role": "user",
            "content": "How do I paginate results from the list users endpoint?"
        }
    ]
)

A well-written system prompt does three things. It defines the model's role, which activates the patterns the model learned for that persona. It sets behavioral constraints, which limit the probability of certain response patterns. It specifies format expectations, which guide the structure of the output.

One practical rule: be specific about what you do not want. "Do not make up API endpoints" is more reliable than "be accurate," because it directly addresses the failure mode you are trying to prevent.


Few-shot prompting: teaching by example

Few-shot prompting adds examples of (input, output) pairs to the context before the actual question. The model infers the pattern from the examples and applies it to the new input.

python
few_shot_prompt = """Extract the key entities from each customer message.
Format: {"persons": [], "companies": [], "products": []}

Message: "I spoke with Sarah from Acme Corp about their new CRM software."
Entities: {"persons": ["Sarah"], "companies": ["Acme Corp"], "products": ["CRM software"]}

Message: "David at TechStart wants to upgrade their Pro plan."
Entities: {"persons": ["David"], "companies": ["TechStart"], "products": ["Pro plan"]}

Message: "Can you connect me with the team at GlobalBank about their API integration?"
Entities:"""

# The model completes the final entities based on the pattern

This works because the model learned during pre-training that patterns repeat in text. It sees two examples of the extraction format and infers that the third input should be processed the same way.

Three things determine whether few-shot prompting helps or hurts. The examples should cover the edge cases you care about. The format should be consistent across all examples. And the examples should be drawn from the same distribution as your actual inputs. Examples that are too easy or too different from real inputs do not transfer.

Zero-shot prompting works when the task is common enough that the model saw many examples of it during training. Few-shot prompting helps when the task is unusual, when the output format is non-standard, or when you need the model to handle a specific pattern that would not appear naturally.


Chain-of-thought: making reasoning visible

Chain-of-thought (CoT) prompting asks the model to reason step by step before producing an answer. This dramatically improves performance on problems that require multiple reasoning steps.

python
# without chain-of-thought
simple_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?
Answer:"""

# with chain-of-thought
cot_prompt = """A store buys products for $45 each and sells them for $72 each.
If the store sells 340 products in a month, but 8% are returned for full refunds,
what is the net profit?

Let's work through this step by step:"""

# the model now generates reasoning steps before the answer:
# "Products sold: 340
#  Products returned: 340 * 0.08 = 27.2 ≈ 27
#  Net products sold: 340 - 27 = 313
#  Revenue: 313 * $72 = $22,536
#  Cost of all purchased products: 340 * $45 = $15,300
#  Refund cost for returned items: 27 * $45 = $1,215 (we already paid for them)
#  Actually, refunded customers paid $72 each, so refund cost is: 27 * $72 = $1,944
#  Net revenue: 340 * $72 - 27 * $72 = 313 * $72 = $22,536
#  Cost basis: 340 * $45 = $15,300
#  Net profit: $22,536 - $15,300 = $7,236"

Why does generating the reasoning steps help the final answer? The intermediate tokens become part of the context for each subsequent token. When the model writes out a calculation explicitly, that correct calculation is in the context when it produces the final answer. The reasoning process conditions the answer.

The zero-shot version just says "Let's think through this step by step." This alone improves accuracy on many tasks. You can also show examples of step-by-step reasoning if you want more control over the reasoning style.


Retrieval-augmented generation

The limitation of prompting alone is that the model can only use knowledge from its pre-training. For questions about your specific codebase, recent news, proprietary documents, or anything that happened after the training cutoff, the model either guesses or refuses.

RAG solves this by retrieving relevant documents and including them in the context before generating an answer.

The basic pipeline has three steps. First, documents are chunked and converted to embeddings (covered in Article 4) and stored in a vector database. Second, when a query arrives, it is embedded and the most similar document chunks are retrieved using cosine similarity. Third, the retrieved chunks are added to the prompt before the model generates its response.

python
import numpy as np
from typing import List

# simplified RAG pipeline

def embed(text: str) -> np.ndarray:
    """Call an embedding model to get a vector representation."""
    # in production: call OpenAI embeddings, Cohere, or a local model
    pass

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class VectorStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: dict = None):
        embedding = embed(text)
        self.documents.append({"text": text, "metadata": metadata or {}})
        self.embeddings.append(embedding)

    def search(self, query: str, top_k: int = 3) -> List[dict]:
        query_embedding = embed(query)
        scores = [cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [self.documents[i] for i in top_indices]

def rag_generate(query: str, store: VectorStore, model_client) -> str:
    # retrieve relevant chunks
    relevant_docs = store.search(query, top_k=3)
    context = "\n\n".join(doc["text"] for doc in relevant_docs)

    # build prompt with retrieved context
    prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say so.

Context:
{context}

Question: {query}

Answer:"""

    response = model_client.generate(prompt)
    return response

The vector database is what makes RAG scalable. Popular production options include Pinecone, Weaviate, Qdrant, and pgvector for teams already on Postgres. For local prototyping, FAISS from Meta is fast and free.

RAG's core advantage is that it grounds the model in specific documents. This reduces hallucination on factual questions because the model can cite and follow the retrieved text rather than generating from parametric memory alone. It also keeps your product accurate as your knowledge base updates, because you update the vector database rather than retraining the model.

The failure modes are real and worth knowing. Retrieval can fail to find the right document if the query and document are phrased very differently. Chunk boundaries matter: a question about a topic that spans two chunks may retrieve incomplete context. Long retrieved contexts can confuse the model if the relevant information is buried.


Practical prompt patterns that work

A few patterns that show up repeatedly in production LLM applications, grounded in what the model has learned to do:

Role definition before content: Starting with "You are a [role]" activates the patterns the model learned in that role's context. This is not role-playing for fun. The model was trained on text where different roles produced different styles, and the role specification shifts the probability distribution toward those patterns.

Explicit output format specification: Telling the model to respond in JSON, with specific keys, or in a numbered list dramatically increases compliance. The model learned that certain phrases like "Respond in the following JSON format:" are followed by structured output in training data.

Negative constraints: "Do not include any information that is not in the provided documents" is more reliable than "Only use provided documents" because the former matches patterns where such constraints appeared in the training data more precisely.

Temperature and sampling parameters: Temperature controls the entropy of the output distribution. Lower temperature concentrates probability mass, producing more deterministic outputs. Higher temperature spreads it, producing more varied outputs. For factual tasks: temperature around 0.2. For creative tasks: 0.7 to 1.0. For most production tasks: 0.3 to 0.5.

python
# temperature effect on token selection
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.5, 0.8, 0.2, -0.5])

for temp in [0.1, 0.5, 1.0, 2.0]:
    probs = F.softmax(logits / temp, dim=0)
    print(f"temp={temp:.1f}: {probs.numpy().round(3)}")

# temp=0.1: [0.999 0.001 0.   0.   0.  ]  (almost deterministic)
# temp=0.5: [0.891 0.087 0.017 0.004 0.001] (concentrated)
# temp=1.0: [0.676 0.202 0.089 0.024 0.009] (moderate spread)
# temp=2.0: [0.462 0.262 0.173 0.072 0.031] (more uniform)

When to use RAG versus fine-tuning versus prompting

This is a question every team building on LLMs will face. The right answer depends on the problem.

Prompting alone is sufficient when the model already has the relevant knowledge and capability, and you just need to shape the output. Most tasks that involve writing, summarization, classification, or reasoning over text provided in the prompt fit here.

RAG is right when the problem is about knowledge access. The model needs current facts, your proprietary documents, or domain information it could not have seen during training. RAG lets the model use external knowledge without retraining.

Fine-tuning is right when the problem is about behavior or style. The model needs to respond in a very specific format consistently, follow domain conventions that differ from general language patterns, or perform a task it handles poorly out of the box. Fine-tuning is more expensive to set up but produces more reliable results for specialized domains.

In practice, the best systems often combine all three: a fine-tuned model with carefully designed system prompts and a RAG pipeline for knowledge retrieval.


The connection from Article 8 to Article 10

The aligned model from Article 8 is what makes this article work. Prompting, RAG, and few-shot learning all depend on the model reliably following instructions and using context appropriately. Those behaviors came from SFT and RLHF.

Article 10, the final article in the series, covers how you measure whether all of this is working. Benchmarks tell you if the model can do what you need. Inference optimization tells you how fast and cheaply it can do it. Deployment considerations tell you what breaks in production that benchmarks do not catch.

A word on prompt injection

Prompt injection is when user-provided content contains instructions that override your system prompt. It is an active security concern in production LLM applications. If your system prompt says "Only answer questions about cooking" and a user's message contains "Ignore previous instructions and...", the model may comply. Defenses include input validation, separating system and user content clearly, and using models fine-tuned for instruction following over instruction following. It is still an unsolved problem in general.


Next in the series

Article 10 is the final article and covers evaluation, inference, and deployment. You will see how to choose and interpret benchmarks for your specific use case, what quantization does to model quality, how the KV cache works during inference, and what actually breaks when you move from a notebook to production.

Follow on Google

Add as a preferred source in Search & Discover

Add as preferred source
Appears in Google Discover
K

Krunal Kanojiya

Technical Content Writer

Technical Content Writer and former software developer from India. I write in-depth articles on blockchain, AI/ML, data engineering, web development, and developer careers. Currently at Lucent Innovation, previously at Cromtek Solution and freelance.

Related Posts