Building Production-Ready RAG Applications with LangChain and Pinecone
Retrieval-Augmented Generation (RAG) has become the go-to architecture for building intelligent applications that need to answer questions based on your own data. In this guide, I'll walk you through building a production-ready RAG system that goes beyond the basic tutorials.
Why RAG?
Traditional LLMs have a knowledge cutoff date and can't access your proprietary data. RAG solves this by:
- Grounding responses in your data - fewer hallucinations, since answers are tied to retrieved context
- Cost efficiency - context is retrieved at query time instead of baked in through fine-tuning
- Real-time updates - Your knowledge base stays current
- Source attribution - Users can verify information
Architecture Overview
A production RAG system consists of several key components:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Initialize embeddings
embeddings = OpenAIEmbeddings()

# Vector store with Pinecone
vectorstore = Pinecone.from_existing_index(
    index_name="my-knowledge-base",
    embedding=embeddings,
)

# LLM for generation
llm = ChatOpenAI(model="gpt-4", temperature=0)

# RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)
Document Processing Pipeline
The quality of your RAG system depends heavily on how you process documents:
1. Chunking Strategy
Don't just split at a fixed character count. Use a splitter that respects document structure, falling back through progressively smaller separators:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # Important for context continuity
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)

chunks = text_splitter.split_documents(documents)
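To see what `chunk_overlap` buys you, here is a minimal character-window chunker — an illustrative stand-in, not LangChain's actual algorithm. Consecutive chunks share the overlap region, so a sentence cut at a chunk boundary still appears whole in at least one chunk:

```python
# Minimal sliding-window chunker: each chunk shares `overlap` characters
# with the previous one. Illustrative only -- RecursiveCharacterTextSplitter
# additionally prefers splitting at the separators listed above.

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("abcdefghijklmnopqrstuvwxy", chunk_size=10, overlap=4)
# The tail of each chunk repeats as the head of the next.
```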
2. Metadata Enrichment
Add rich metadata for better retrieval:
from datetime import datetime

for chunk in chunks:
    chunk.metadata.update({
        "source": document.source,  # the parent document being chunked
        "page": chunk.metadata.get("page", 0),
        "doc_type": classify_document(chunk),  # your own classifier function
        "created_at": datetime.now().isoformat(),
    })
3. Embedding Generation
Use the right embedding model for your domain:
- OpenAI text-embedding-ada-002: general purpose, great for most use cases
- Cohere multilingual: For multi-language support
- Custom fine-tuned: For specialized domains
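Whichever model you choose, retrieval ultimately ranks chunks by vector similarity between the query embedding and each chunk embedding — most commonly cosine similarity. A stdlib sketch of the metric itself:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal -> 0.0
```

In practice the vector database computes this at scale; the point is that chunk ranking is driven entirely by these scores, which is why embedding model choice matters.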
Vector Database Best Practices
Indexing Strategies
import pinecone

# Initialize with metadata filtering support
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")
index = pinecone.Index("my-knowledge-base")

# Upsert with metadata
index.upsert(vectors=[
    {
        "id": f"doc_{i}",
        "values": embedding,
        "metadata": metadata,
    }
    for i, (embedding, metadata) in enumerate(zip(embeddings, metadatas))
])
Hybrid Search
Combine semantic and keyword search for best results:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

# Keyword retriever
keyword_retriever = BM25Retriever.from_documents(documents)
keyword_retriever.k = 10

# Combine with weights
ensemble_retriever = EnsembleRetriever(
    retrievers=[semantic_retriever, keyword_retriever],
    weights=[0.7, 0.3],
)
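EnsembleRetriever merges the two ranked lists with weighted Reciprocal Rank Fusion: each document scores by its rank in each list, weighted per retriever. A simplified pure-Python stand-in (hypothetical doc IDs; `c=60` is the conventional RRF constant):

```python
# Simplified weighted Reciprocal Rank Fusion. A document ranked highly by
# either retriever gets a large score; weights bias toward one retriever.

def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic ranking
keyword = ["doc_b", "doc_d", "doc_a"]    # hypothetical keyword ranking
fused = weighted_rrf([semantic, keyword], weights=[0.7, 0.3])
```

Note how the fused list surfaces documents that either retriever found, which is exactly why hybrid search catches exact-keyword queries that pure semantic search misses.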
Prompt Engineering for RAG
The prompt template is crucial for quality outputs:
from langchain.prompts import PromptTemplate
template = """You are an AI assistant helping users find accurate information.
Use the following pieces of context to answer the question. If you don't know
the answer based on the context, say so - don't make up information.
Context:
{context}
Question: {question}
Instructions:
1. Answer based only on the provided context
2. Cite the source document when possible
3. If the context doesn't contain the answer, say "I don't have enough information"
4. Be specific and detailed in your response
Answer:"""
PROMPT = PromptTemplate(
    template=template,
    input_variables=["context", "question"],
)
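To wire this prompt into the chain, RetrievalQA accepts it via `chain_type_kwargs={"prompt": PROMPT}`. It's also worth checking what the model actually receives once the retriever fills in `{context}` — plain `str.format` on a trimmed template (both the template and the context below are illustrative) shows the substitution:

```python
# What the LLM sees after substitution. A shortened template and a
# hand-written context stand in for the real PROMPT and retrieved chunks.

demo_template = (
    "Use the following pieces of context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

filled = demo_template.format(
    context="Pinecone is a managed vector database.",
    question="What is Pinecone?",
)
print(filled)
```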
Performance Optimization
Caching Strategies
import langchain
import redis
from langchain.cache import RedisCache

# Cache LLM responses so repeated questions skip the API call
redis_client = redis.Redis(host='localhost', port=6379)
langchain.llm_cache = RedisCache(redis_client)
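Under the hood, an LLM cache keys responses by the prompt (plus model settings) and returns the stored response on a hit. A dict-backed sketch of the same idea, with a fake generation step standing in for the real API call:

```python
# Dict-backed sketch of an LLM response cache: identical (prompt, model)
# pairs skip the "API call". The f-string response is a stand-in for a
# real model call.

class SimpleLLMCache:
    def __init__(self):
        self._store: dict[tuple[str, str], str] = {}
        self.misses = 0  # counts actual "API calls" made

    def lookup(self, prompt: str, model: str):
        return self._store.get((prompt, model))

    def update(self, prompt: str, model: str, response: str):
        self._store[(prompt, model)] = response

cache = SimpleLLMCache()

def cached_generate(prompt: str, model: str = "gpt-4") -> str:
    hit = cache.lookup(prompt, model)
    if hit is not None:
        return hit  # cache hit: no API call
    cache.misses += 1
    response = f"response-to:{prompt}"  # stand-in for the real LLM call
    cache.update(prompt, model, response)
    return response
```

This is also why exact-match caches only help with repeated identical queries; semantically similar but differently worded questions still miss.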
Async Processing
Handle multiple queries efficiently:
import asyncio

async def process_queries(queries):
    # ainvoke is the chain's async entry point; gather runs the queries concurrently
    tasks = [qa_chain.ainvoke({"query": q}) for q in queries]
    results = await asyncio.gather(*tasks)
    return results
Monitoring and Evaluation
Track key metrics:
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
    response = qa_chain.invoke({"query": question})
    print(f"Total Tokens: {cb.total_tokens}")
    print(f"Prompt Tokens: {cb.prompt_tokens}")
    print(f"Completion Tokens: {cb.completion_tokens}")
    print(f"Total Cost (USD): ${cb.total_cost}")
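`cb.total_cost` is just token counts multiplied by per-token prices, so you can reproduce the arithmetic yourself for budgeting. The prices below are illustrative assumptions, not current OpenAI pricing — check the pricing page for real numbers:

```python
# Reproduce the cost calculation by hand. Prices are assumed for
# illustration (USD per 1K tokens), not actual OpenAI rates.

PRICE_PER_1K = {"prompt": 0.03, "completion": 0.06}

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] + \
           (completion_tokens / 1000) * PRICE_PER_1K["completion"]

cost = estimate_cost(prompt_tokens=1500, completion_tokens=500)
# 1.5 * 0.03 + 0.5 * 0.06 = 0.075
```

Note that in RAG, prompt tokens dominate (the retrieved context is large), which is why retrieving fewer, better chunks directly cuts cost.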
Real-World Challenges
Handling Multi-Document Questions
Some questions require information from multiple sources:
# Use map-reduce for complex queries
from langchain.chains.question_answering import load_qa_chain

chain = load_qa_chain(
    llm=llm,
    chain_type="map_reduce",
    return_intermediate_steps=True,
)
result = chain({"input_documents": docs, "question": question})
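The map_reduce shape is easy to see in plain Python: one "map" call per document extracts what's relevant to the question, then a single "reduce" call combines the partial answers. The two functions below are stubs standing in for LLM calls:

```python
# Shape of a map-reduce QA chain. map_step and reduce_step are stubs:
# a real chain would prompt the LLM in both places.

def map_step(doc: str, question: str) -> str:
    # Stub: keep the document only if it mentions the question's subject.
    return doc if "Pinecone" in doc else ""

def reduce_step(partials: list[str]) -> str:
    # Stub: concatenate the non-empty partial answers.
    return " ".join(p for p in partials if p)

docs = [
    "Pinecone is a vector database.",
    "The weather was sunny.",
    "Pinecone supports metadata filters.",
]
answer = reduce_step([map_step(d, "What is Pinecone?") for d in docs])
```

The trade-off: map_reduce makes one LLM call per document plus one to combine, so it answers cross-document questions at a multiple of the cost of a single "stuff" call.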
Dealing with Contradictions
When sources disagree:
template = """The following sources contain information about the question,
but they may contradict each other. Analyze each source and explain:
1. What each source says
2. Points of agreement
3. Points of disagreement
4. Your best assessment based on source credibility
Sources:
{context}
Question: {question}
Analysis:"""
Conclusion
Building production RAG systems requires attention to:
- Document processing quality - Chunking, metadata, and embeddings
- Retrieval strategy - Hybrid search, reranking, and filtering
- Prompt engineering - Clear instructions and context formatting
- Performance - Caching, async processing, and cost optimization
- Monitoring - Track accuracy, latency, and costs
The examples in this guide give you a solid foundation, but remember to iterate based on your specific use case and user feedback.
Happy building! 🚀