Agentic AI / LangGraph + LangChain LCEL Python · Flask · 3 Selectable LLMs Live on HuggingFace Spaces
DocMind — 5-Agent Document Q&A, 60% Faster Than Standard RAG
A clean, multi-agent document research system built with LangGraph 0.2 StateGraph for orchestration, LangChain LCEL chains in every LLM agent, Hybrid RAG (FAISS + BM25 + RRF), and just 3 LLM calls per query. Features a live animated pipeline with light/dark theme, a 3-model picker (Qwen 2.5-7B / Mistral Nemo 12B / Phi-3.5 Mini), and free deployment on HuggingFace Spaces.
DocMind is built around a LangGraph StateGraph with a clean, linear 5-node pipeline — no cyclic routing, no rewrite loops. Each LLM agent is a LangChain LCEL chain (ChatPromptTemplate | ChatOpenAI | StrOutputParser) with .with_retry(stop_after_attempt=2) for transient error resilience. Only 3 of the 5 agents make LLM calls, keeping cost and latency low. The frontend shows a live animated pipeline — nodes glow and arrows flow in real time as each agent runs.
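The linear five-node flow can be sketched framework-free. This is an illustrative stand-in, not DocMind's actual code: in the real system the wiring lives in a LangGraph StateGraph and the planner, generator, and critic make LLM calls, while here every node is a plain function over a shared state dict.

```python
# Framework-free sketch of the 5-node linear pipeline.
# Node bodies are illustrative stand-ins for the real agents.

def planner(state):    # LLM call in the real system (LCEL chain, temp 0.3)
    return {**state, "plan": f"plan for: {state['question']}"}

def retriever(state):  # FAISS + BM25 + RRF, runs locally
    return {**state, "chunks": ["chunk-a", "chunk-b"]}

def grader(state):     # score-based, no LLM call
    return {**state, "graded": [(c, 0.8) for c in state["chunks"]]}

def generator(state):  # LLM call (LCEL chain, temp 0.4)
    return {**state, "answer": "cited answer"}

def critic(state):     # LLM call (LCEL chain, temp 0.1)
    return {**state, "verdict": "APPROVED"}

PIPELINE = [planner, retriever, grader, generator, critic]

def run(question):
    state = {"question": question}
    for node in PIPELINE:  # strictly linear: no cyclic routing, no rewrite loops
        state = node(state)
    return state
```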
🎯
Planner
LCEL chain · LLM · temp 0.3
🔍
Retriever
FAISS + BM25 + RRF · local
⚖️
Grader
Score-based · no LLM · ~1ms
✍️
Generator
LCEL chain · LLM · temp 0.4
🔬
Critic
LCEL chain · LLM · temp 0.1
💡
Score-Based Grading — Fast, Deterministic, Zero API Cost
The Grader doesn't call the LLM at all. It combines the hybrid search score (FAISS + BM25 relevance) with a keyword overlap between the query and each chunk to produce a 0–1 relevance grade instantly. This reduces the pipeline from 8 LLM calls to 3, cutting average query time by ~60%.
Module Breakdown
Five Agents — Roles & Design
🎯 Planner Agent
Task Decomposition
Receives the user question and produces a brief research plan describing which aspects of the uploaded document are most relevant to answer it. Built as a LangChain LCEL chain — model is selectable at runtime.
🔍 Retriever Agent
Hybrid Retrieval
Runs parallel FAISS semantic search and BM25 keyword search over the indexed chunks, then fuses the results via Reciprocal Rank Fusion (k=60) into a single ranked hybrid list. No API calls — runs entirely locally.
Vector index: FAISS IndexFlatIP (cosine)
Keyword index: BM25Okapi
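Reciprocal Rank Fusion itself is a few lines of dependency-free code. A minimal sketch (the function name is illustrative; DocMind's tokenization and tie-breaking may differ):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list (best first) contributes
    1 / (k + rank) per document; the fused score is the sum across lists.
    `rankings` is a list of ranked doc-id lists, e.g. [faiss_ids, bm25_ids]."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists accumulate score from each, which is why hybrid retrieval surfaces chunks that are strong in either semantic or keyword space.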
⚖️ Grader Agent
Relevance Scoring
Scores each retrieved chunk 0.0–1.0 using the hybrid search score and keyword overlap between query and chunk. Entirely score-based — no LLM call — making it instant and deterministic.
Method: score × 0.7 + overlap × 0.3
LLM needed: None (instant)
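The blend above fits in one function. A minimal sketch, assuming whitespace tokenization and overlap normalized by query length (DocMind's exact normalization may differ):

```python
def grade_chunk(query, chunk, hybrid_score):
    """Score-based grading: blend the hybrid search score with
    query/chunk keyword overlap. No LLM call, so it is instant
    and deterministic. Weights follow the 0.7 / 0.3 formula above."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    overlap = len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0
    return hybrid_score * 0.7 + overlap * 0.3
```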
✍️ Generator Agent
Cited Answer Generation
Receives the top-graded chunks as context. Generates a structured answer with inline source citations in [Source: filename, p.N] format using only the provided context. Model is selectable from the UI picker.
🔬 Critic Agent
Answer Verification
Evaluates the generated answer against the source context for hallucinations and completeness. Outputs APPROVED (high confidence) or NEEDS_REVIEW (low confidence; the user sees a warning badge). Model is runtime-selectable.
Temperature / Max tokens: 0.1 · 150 (near-deterministic)
📄 Ingestor
PDF & URL Ingestion
Accepts PDF uploads (up to 10 MB, parsed with pypdf) or public URLs (fetched with requests + BeautifulSoup). Uses LangChain's RecursiveCharacterTextSplitter for chunking and HuggingFaceEmbeddings (LangChain-native) for embeddings.
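The chunking behavior can be approximated with a simple sliding window. This is a stdlib-only stand-in for RecursiveCharacterTextSplitter, which additionally prefers to break on paragraph and sentence boundaries before falling back to raw character offsets; the 1500 / 200 figures match the stack configuration described below:

```python
def split_text(text, chunk_size=1500, overlap=200):
    """Simplified sliding-window chunker: fixed-size windows with a
    fixed overlap so context spanning a boundary appears in both
    neighboring chunks. A stand-in for LangChain's recursive splitter."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step back by the overlap
    return chunks
```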
The entire stack uses LCEL pipe syntax (ChatPromptTemplate | ChatOpenAI | StrOutputParser) in every LLM agent — not legacy LLMChain. LangGraph handles the stateful pipeline orchestration while LangChain LCEL handles the individual agent chains, demonstrating both frameworks together the way production systems actually use them.
LangGraph 0.2 — StateGraph, linear 5-node pipeline
Qwen 2.5-7B (default · fast) · Mistral Nemo 12B (stronger reasoning) · Phi-3.5 Mini 3.8B (ultra-fast). Switch without reloading — set_model() updates the factory globally. All via langchain_openai.ChatOpenAI with .with_retry(stop_after_attempt=2).
BAAI/bge-small-en-v1.5 via langchain_huggingface.HuggingFaceEmbeddings — runs locally, no API calls. Chunking via LangChain RecursiveCharacterTextSplitter (1500 chars · 200 overlap).
RAG
Flask 3.1 + Gunicorn + Python threading
Async graph execution via daemon threads — query_id polling every 1.5 s lets the UI drive a live animated pipeline (CSS keyframe glows + flowing arrow dots) without SSE or WebSocket complexity.
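The daemon-thread + polling pattern can be sketched with the stdlib alone. Function names and the node list here are illustrative, not DocMind's actual code; the point is a `query_id` the frontend polls while a background thread advances the graph:

```python
import threading
import uuid

QUERIES = {}              # query_id -> state dict, read by the polling endpoint
LOCK = threading.Lock()   # guard shared state across request and worker threads

def submit_query(run_step, question):
    """Start the pipeline in a daemon thread and return a query_id the
    UI can poll (e.g. every 1.5 s) to animate the current node."""
    query_id = str(uuid.uuid4())
    with LOCK:
        QUERIES[query_id] = {"status": "running", "current_node": None}

    def worker():
        for node in ["planner", "retriever", "grader", "generator", "critic"]:
            with LOCK:
                QUERIES[query_id]["current_node"] = node
            run_step(node)  # execute one agent step
        with LOCK:
            QUERIES[query_id]["status"] = "done"

    threading.Thread(target=worker, daemon=True).start()
    return query_id

def poll(query_id):
    """What a Flask /status/<query_id> route would return as JSON."""
    with LOCK:
        return dict(QUERIES[query_id])
```

Because the worker only mutates a dict behind a lock, the Flask route stays a trivial read, and no SSE or WebSocket machinery is needed.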
Backend
⚙️
langchain_openai.ChatOpenAI pointed at HuggingFace Router
langchain_openai.ChatOpenAI(base_url="https://router.huggingface.co/v1") uses LangChain's OpenAI integration against the HF inference router — which speaks the OpenAI chat completions protocol. The entire LLM layer can be swapped to GPT-4o, Claude, Groq, or a local Ollama model by changing one URL and one API key, with zero agent code changes.
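A sketch of that factory, as a configuration fragment. The model id and the `HF_TOKEN` env-var name are assumptions for illustration; the `base_url` is the one described above:

```python
import os
from langchain_openai import ChatOpenAI

# Provider-agnostic LLM layer: changing base_url + api_key moves the
# whole stack to OpenAI, Groq, or a local Ollama server unchanged.
# Model id and env-var name here are illustrative assumptions.
llm = ChatOpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
    model="Qwen/Qwen2.5-7B-Instruct",
    temperature=0.4,
).with_retry(stop_after_attempt=2)
```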
Interactive Explorer
Representative Agent Trace Outputs
Each tab shows a representative trace from a real query run — the exact output format the live observability dashboard displays for each agent node.
Outputs shown are from real runs against a sample PDF research paper. Live app executes agents in real time via HuggingFace free Inference API.
Performance Snapshot
Benchmarks & Agent Metrics
Agent Latency (ms)
Retrieval Quality
Model Benchmarks
Average latency per agent measured over 30 test queries on the free HuggingFace Inference API. Retriever and Grader run locally (near-zero); Generator is the bottleneck due to long output generation. Grader latency is ~1ms — score-based, no LLM call.
Hybrid search (FAISS + BM25 + RRF) vs. pure semantic search. The hybrid approach improves top-5 recall by ~18% on technical documents whose domain-specific terminology embedding models struggle with.
Published benchmark scores for the three selectable models. Mistral Nemo 12B leads on reasoning-heavy tasks; Qwen 2.5-7B is the best all-round 7B free-tier model; Phi-3.5 Mini achieves strong results at just 3.8B parameters — ideal for fast, focused queries.
Design Decisions
Key Engineering Choices
⚡
3 LLM Calls, Not 8
Replacing the LLM-based grader with a score formula (hybrid score × 0.7 + keyword overlap × 0.3) eliminated 5 sequential LLM calls, cutting average query time by ~60%. The grader is now instant, deterministic, and costs nothing — while answer quality stays the same.
🏠
Local Embeddings = No Rate Limits
Running BAAI/bge-small-en-v1.5 locally via langchain_huggingface.HuggingFaceEmbeddings means the Retriever agent has zero API dependency and sub-millisecond embedding latency. Only the 3 LLM reasoning steps hit the free HF Router, keeping the system responsive even under multiple concurrent queries.
🔌
3-Model Picker + Provider-Agnostic Layer
A compact dropdown near the Ask button lets users choose Qwen 2.5-7B, Mistral Nemo 12B, or Phi-3.5 Mini without reloading. set_model() updates a global key; every get_llm() call reads it fresh — no stale cached chains. Swapping to OpenAI/Groq/Ollama needs one URL change.
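The "no stale cached chains" property comes from reading a module-level key on every call instead of caching a constructed client. A dependency-free sketch, with `set_model()` / `get_llm()` matching the names above and model ids as illustrative assumptions:

```python
# Runtime model picker: a module-level key that every get_llm() call
# reads fresh, so switching models never leaves a stale cached chain.
# Model ids below are illustrative assumptions.
MODELS = {
    "qwen": "Qwen/Qwen2.5-7B-Instruct",
    "mistral": "mistralai/Mistral-Nemo-Instruct-2407",
    "phi": "microsoft/Phi-3.5-mini-instruct",
}
_current = "qwen"  # default model key

def set_model(key):
    """Called by the UI dropdown handler; updates the global key."""
    global _current
    if key not in MODELS:
        raise ValueError(f"unknown model key: {key}")
    _current = key

def get_llm():
    """In DocMind this would construct a ChatOpenAI for MODELS[_current];
    the sketch returns the model id to stay dependency-free."""
    return MODELS[_current]
```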
🎨
Live Pipeline + Light / Dark Theme
The ingestion row (Source→Chunker→Index) animates with CSS @keyframes nodeGlow when a file is being processed. Agent nodes glow color-coded (blue=LLM, green=local, gold=score) and arrows show a flowing dot as each step runs. Default is light theme; a header toggle persists preference via localStorage.
At a Glance
Quick read
What it is: Multi-agent document research platform — LangGraph StateGraph orchestrates 5 agents; each LLM agent is a LangChain LCEL chain. Models: Qwen 2.5-7B / Mistral Nemo 12B / Phi-3.5 Mini — switchable via UI. UI: Live animated pipeline with light/dark theme. Deploy: Docker on HuggingFace Spaces (free tier). Scope: Upload a PDF or paste a URL, ask questions, get inline-cited answers with a real-time agent trace.