Enterprise RAG Chatbot with LangChain
Overview
Built a production-grade RAG (Retrieval-Augmented Generation) conversational AI system that enables natural-language queries across a large internal knowledge base. The system handles 1M+ queries monthly with sub-2s response times.
Problem Statement
The organization had 10,000+ internal documents (PDFs, docs, markdown) scattered across various systems. Employees spent hours searching for information, and traditional keyword search was inadequate for complex queries.
Requirements:
- Natural language querying across all documents
- Sub-2 second response time
- Source attribution for compliance
- Support for multi-turn conversations
- Cost-effective scaling
- High accuracy (>90% relevance)
Technical Architecture
System Components
```
┌─────────────┐
│   FastAPI   │ ← REST API layer
└──────┬──────┘
       │
┌──────▼──────┐
│  LangChain  │ ← Orchestration
└──────┬──────┘
       │
    ┌──┴───┐
    │      │
┌───▼──┐ ┌─▼────────┐
│ GPT  │ │ Pinecone │ ← LLM + Vector DB
└──────┘ └────┬─────┘
              │
         ┌────▼────┐
         │  Redis  │ ← Caching layer
         └─────────┘
```
Tech Stack
- Backend: Python 3.11, FastAPI, Celery
- AI/ML: LangChain, OpenAI GPT-4, text-embedding-ada-002
- Vector DB: Pinecone (1536 dimensions)
- Caching: Redis for LLM response caching
- Processing: PyPDF2, Unstructured for document parsing
- Infrastructure: Docker, AWS ECS, CloudWatch
Implementation Details
1. Document Processing Pipeline
```python
from datetime import datetime

from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings

class DocumentProcessor:
    def __init__(self):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            separators=["\n\n", "\n", ".", "!", "?", " "]
        )
        self.embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

    def process_documents(self, directory: str):
        # Load documents
        loader = DirectoryLoader(
            directory,
            glob="**/*.pdf",
            loader_cls=PyPDFLoader,
            show_progress=True
        )
        documents = loader.load()

        # Split into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Enrich metadata for traceability
        for chunk in chunks:
            chunk.metadata.update({
                "embedding_model": "ada-002",
                "processed_at": datetime.now().isoformat(),
                "chunk_size": len(chunk.page_content)
            })
        return chunks
```
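The effect of `chunk_size=1000` with `chunk_overlap=200` can be illustrated framework-free. This is a deliberately stripped-down sketch of the windowing behaviour, not the real splitter: `RecursiveCharacterTextSplitter` additionally prefers to cut on the separator list above rather than mid-word.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Fixed-size windows with overlap, so text near a boundary appears
    in both neighbouring chunks and no sentence is lost at a cut point."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap is what made boundary-spanning answers retrievable; without it, a fact split across two chunks matches neither.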
2. RAG Chain with Memory
```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

class RAGChatbot:
    def __init__(self, index_name: str):
        self.llm = ChatOpenAI(
            model="gpt-4",
            temperature=0,
            max_tokens=500
        )
        self.vectorstore = Pinecone.from_existing_index(
            index_name=index_name,
            embedding=OpenAIEmbeddings()
        )
        self.memory = ConversationBufferMemory(
            memory_key="chat_history",
            return_messages=True,
            output_key="answer"
        )
        self.chain = ConversationalRetrievalChain.from_llm(
            llm=self.llm,
            retriever=self.vectorstore.as_retriever(
                search_kwargs={"k": 5}
            ),
            memory=self.memory,
            return_source_documents=True,
            verbose=True
        )

    def query(self, question: str) -> dict:
        result = self.chain({"question": question})
        return {
            "answer": result["answer"],
            "sources": [
                {
                    "content": doc.page_content,
                    "metadata": doc.metadata
                }
                for doc in result["source_documents"]
            ]
        }
```
3. FastAPI REST API
```python
import hashlib

from fastapi import FastAPI, HTTPException, Depends
from pydantic import BaseModel
import redis.asyncio as redis

app = FastAPI()
redis_client = redis.from_url("redis://localhost")

class QueryRequest(BaseModel):
    question: str
    session_id: str

class QueryResponse(BaseModel):
    answer: str
    sources: list
    confidence_score: float

@app.post("/query", response_model=QueryResponse)
async def query_chatbot(request: QueryRequest):
    # Check cache (sha256 is stable across worker processes,
    # unlike the built-in hash(), which is randomized per process)
    cache_key = f"query:{hashlib.sha256(request.question.encode()).hexdigest()}"
    cached = await redis_client.get(cache_key)
    if cached:
        return QueryResponse.parse_raw(cached)

    # Process query
    chatbot = RAGChatbot.get_or_create(request.session_id)
    result = chatbot.query(request.question)

    # Calculate confidence
    confidence = calculate_confidence(result["sources"])
    response = QueryResponse(
        answer=result["answer"],
        sources=result["sources"],
        confidence_score=confidence
    )

    # Cache for 1 hour
    await redis_client.setex(cache_key, 3600, response.json())
    return response
```
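`calculate_confidence` is referenced above but not shown. A minimal sketch of one plausible heuristic follows; the `score` metadata field, the weights, and the corroboration bonus are illustrative assumptions, not the production implementation.

```python
def calculate_confidence(sources: list[dict]) -> float:
    """Hypothetical heuristic: higher retrieval scores and more
    corroborating source chunks yield higher confidence."""
    if not sources:
        return 0.0
    # Use the retriever's similarity score when present, else a neutral 0.5.
    scores = [s.get("metadata", {}).get("score", 0.5) for s in sources]
    avg_score = sum(scores) / len(scores)
    # Mild bonus when several chunks agree, capped so the result stays <= 1.0.
    corroboration = min(len(sources) / 5, 1.0)
    return round(min(avg_score * (0.8 + 0.2 * corroboration), 1.0), 3)
```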
4. Hybrid Search for Better Retrieval
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

class HybridRetriever:
    def __init__(self, documents, vectorstore):
        # Semantic retriever
        self.semantic_retriever = vectorstore.as_retriever(
            search_kwargs={"k": 10}
        )
        # Keyword retriever
        self.keyword_retriever = BM25Retriever.from_documents(documents)
        self.keyword_retriever.k = 10
        # Combine with weights: 70% semantic, 30% keyword
        self.ensemble_retriever = EnsembleRetriever(
            retrievers=[self.semantic_retriever, self.keyword_retriever],
            weights=[0.7, 0.3]
        )

    def get_relevant_documents(self, query: str):
        return self.ensemble_retriever.get_relevant_documents(query)
```
Performance Optimizations
1. Response Caching
Implemented multi-level caching:
- L1: In-memory LRU cache (1000 most recent queries)
- L2: Redis (1 hour TTL for all queries)
- L3: Pinecone query result cache
Result: 60% cache hit rate, 80% reduction in OpenAI API costs
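The L1 layer can be sketched as a small in-process LRU sitting in front of the Redis lookup in `/query`. This is a hypothetical standalone version; the 1000-entry limit matches the figure above, everything else is illustrative.

```python
from collections import OrderedDict

class LRUCache:
    """In-memory L1 cache for the most recent query results."""
    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._store: OrderedDict[str, dict] = OrderedDict()

    def get(self, key: str):
        if key not in self._store:
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return self._store[key]

    def put(self, key: str, value: dict) -> None:
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used
```

An L1 hit avoids both the Redis round trip and response deserialization, which matters for the hottest handful of queries.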
2. Async Processing
```python
import asyncio
from typing import List

async def process_documents_async(documents: List[str]):
    """Process multiple documents concurrently."""
    tasks = [process_single_document(doc) for doc in documents]
    results = await asyncio.gather(*tasks)
    return results

# Reduced document processing time from 2 hours to 15 minutes
```
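Unbounded `gather` can trip embedding-API rate limits on large batches. A semaphore-bounded variant is sketched below; the limit of 8 and the generic `worker` callable (standing in for the elided `process_single_document`) are illustrative assumptions.

```python
import asyncio

async def process_documents_bounded(documents, worker, max_concurrency: int = 8):
    """Fan out over documents while at most `max_concurrency`
    coroutines run at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(doc):
        async with sem:
            return await worker(doc)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(d) for d in documents))
```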
3. Prompt Optimization
Reduced token usage by 40% through prompt engineering:
```python
OPTIMIZED_PROMPT = """Answer the question based solely on the provided context.
Be concise but complete. If unsure, say "I don't have enough information."
Context: {context}
Question: {question}
Answer:"""

# vs. the original verbose prompt: 300 tokens → 180 tokens
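Filling the template at query time can be sketched without the framework. The character budget and the ~4-characters-per-token estimate below are rough assumptions for illustration, not measured values.

```python
OPTIMIZED_PROMPT = """Answer the question based solely on the provided context.
Be concise but complete. If unsure, say "I don't have enough information."
Context: {context}
Question: {question}
Answer:"""

def build_prompt(context: str, question: str, max_context_chars: int = 4000) -> str:
    """Fill the template, truncating context to a character budget so the
    final prompt stays within the token target."""
    return OPTIMIZED_PROMPT.format(
        context=context[:max_context_chars],
        question=question,
    )

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4
```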
Monitoring & Observability
Custom Metrics
```python
from prometheus_client import Counter, Histogram

query_counter = Counter('chatbot_queries_total', 'Total queries')
query_duration = Histogram('chatbot_query_duration_seconds', 'Query duration')
relevance_score = Histogram('chatbot_relevance_score', 'Relevance score')

@query_duration.time()
async def query_with_metrics(question: str):
    query_counter.inc()
    result = chatbot.query(question)

    # Track relevance
    score = calculate_relevance(result)
    relevance_score.observe(score)
    return result
```
Alerting
- Response time > 3s → Page on-call
- Relevance score < 0.7 → Create ticket
- Error rate > 1% → Alert engineering
Results & Impact
Performance Metrics:
- Response Time: Average 1.8s (95th percentile: 2.5s)
- Accuracy: 92% relevance score (human evaluation)
- Availability: 99.9% uptime
- Throughput: 50K queries/day peak
Business Impact:
- Time Saved: 15 hours/employee/month on average
- Cost Reduction: $200K annually vs traditional search
- User Satisfaction: 4.6/5 rating
- Adoption: 85% of employees using weekly
Cost Optimization:
- Caching reduced OpenAI costs by 80%
- Average cost per query: $0.008
- Monthly infrastructure: $1,200
Challenges & Solutions
Challenge 1: Hallucinations
Problem: LLM occasionally fabricated information not in documents
Solution:
- Strict prompt engineering with "only use context" instruction
- Confidence scoring with source attribution
- Fallback to "I don't know" for low confidence
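The steps above can be sketched as a simple guardrail. The 0.7 threshold mirrors the alerting rule in the monitoring section but is otherwise an assumption, as is the exact fallback wording.

```python
CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff

def answer_with_fallback(answer: str, confidence: float,
                         threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Below the confidence threshold, return the safe fallback
    instead of a possibly hallucinated answer."""
    if confidence < threshold:
        return "I don't have enough information to answer that."
    return answer
```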
Challenge 2: Long Document Handling
Problem: Important context spread across multiple chunks
Solution:
- Implemented parent document retrieval
- Context window optimization
- Map-reduce for multi-document queries
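Parent document retrieval can be sketched framework-free: search over small chunks for precision, then hand the LLM the larger parent documents they came from. The `parent_id` metadata field and `parent_store` mapping are illustrative assumptions about how chunks link back to their source documents.

```python
def expand_to_parents(chunks: list[dict], parent_store: dict[str, str],
                      max_parents: int = 3) -> list[str]:
    """Deduplicate retrieved chunks by parent document and return the
    full parent texts, preserving retrieval order."""
    seen, parents = set(), []
    for chunk in chunks:
        pid = chunk["metadata"]["parent_id"]
        if pid not in seen:
            seen.add(pid)
            parents.append(parent_store[pid])
        if len(parents) >= max_parents:
            break
    return parents
```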
Challenge 3: Cost at Scale
Problem: OpenAI costs scaling linearly with usage
Solution:
- Multi-tier caching strategy
- Prompt compression
- Batch processing for indexing
- Evaluating fine-tuned models as a future cost lever
Future Enhancements
- Multi-modal Support: Add image and table understanding
- Fine-tuned Models: Custom model for domain-specific queries
- Active Learning: User feedback loop for continuous improvement
- Multi-language: Support for Spanish and French documents
- Advanced Analytics: Query pattern analysis and auto-suggestions
Key Learnings
- Chunking Strategy Matters: Spent 2 weeks optimizing chunk size and overlap; it had a huge impact on relevance
- Hybrid Search Wins: Combining semantic + keyword search improved accuracy by 15%
- Caching is Critical: Early caching implementation saved thousands in API costs
- User Feedback Loop: Regular evaluation with actual users prevented drift
- Start Simple: MVP with basic RAG, then iterate based on real usage
Technical Highlights
- Scalable Architecture: Horizontally scalable with container orchestration
- Production-Ready: Comprehensive logging, monitoring, and alerting
- Cost-Optimized: Multi-level caching and efficient resource usage
- High Quality: 92% relevance through careful engineering
- Well-Tested: 85% code coverage with integration tests
This project demonstrates end-to-end AI/ML engineering: from data processing to production deployment at scale with real business impact.
Repository: GitHub (private enterprise code)
Tech Stack: Python, LangChain, OpenAI, Pinecone, FastAPI, Redis, Docker, AWS
Role: Lead AI/ML Engineer
Duration: 3 months (MVP), 6 months (production)