Agentic AI · LangGraph + LangChain LCEL · Python · Flask · 3 Selectable LLMs · Live on HuggingFace Spaces

DocMind — 5-Agent Document Q&A, 60% Faster Than Standard RAG

A clean, multi-agent document research system built with LangGraph 0.2 StateGraph for orchestration, LangChain LCEL chains in every LLM agent, Hybrid RAG (FAISS + BM25 + RRF), and just 3 LLM calls per query. Features a live animated pipeline with light/dark theme, a 3-model picker (Qwen 2.5-7B / Mistral Nemo 12B / Phi-3.5 Mini), and free deployment on HuggingFace Spaces.

© 2025 Mohammad Noorchenarboo · 5 LangGraph Agents · 3 LLM Calls · LCEL Chains · FAISS + BM25 + RRF Hybrid Index
5 Specialized LangGraph Agents
3 LLM Calls per Query · 3 Selectable Models
RRF Hybrid Retrieval (FAISS + BM25)
~1 ms Grader Latency (score-based, no LLM)
Free HuggingFace Inference Tier
Architecture Overview

LangGraph Linear Pipeline + LangChain LCEL Agents

DocMind is built around a LangGraph StateGraph with a clean, linear 5-node pipeline — no cyclic routing, no rewrite loops. Each LLM agent is a LangChain LCEL chain (ChatPromptTemplate | ChatOpenAI | StrOutputParser) with .with_retry(stop_after_attempt=2) for transient error resilience. Only 3 of the 5 agents make LLM calls, keeping cost and latency low. The frontend shows a live animated pipeline — nodes glow and arrows flow in real time as each agent runs.
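Conceptually, each LangGraph node is a function that takes the shared state and returns an update; with a strictly linear graph, the whole pipeline reduces to folding the state through an ordered list of nodes. A framework-free sketch of that state flow (node bodies and field names are illustrative stand-ins, not the real agents):

```python
from typing import Callable, TypedDict

class PipelineState(TypedDict, total=False):
    question: str
    plan: str
    chunks: list
    graded: list
    answer: str
    verdict: str

# Stub node bodies standing in for the real agents (illustrative only).
def planner(state: PipelineState) -> PipelineState:
    return {**state, "plan": f"Find sections relevant to: {state['question']}"}

def retriever(state: PipelineState) -> PipelineState:
    return {**state, "chunks": ["chunk-a", "chunk-b"]}

def grader(state: PipelineState) -> PipelineState:
    return {**state, "graded": [(c, 0.9) for c in state["chunks"]]}

def generator(state: PipelineState) -> PipelineState:
    return {**state, "answer": "... [Source: doc.pdf, p.1]"}

def critic(state: PipelineState) -> PipelineState:
    return {**state, "verdict": "APPROVED"}

PIPELINE: list[Callable[[PipelineState], PipelineState]] = [
    planner, retriever, grader, generator, critic,
]

def run(question: str) -> PipelineState:
    state: PipelineState = {"question": question}
    for node in PIPELINE:  # strictly linear: no cycles, no rewrite loops
        state = node(state)
    return state
```

The real graph wires the same five nodes with StateGraph's add_node / add_edge; keeping the topology linear is what makes the trace easy to animate and debug.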

🎯 Planner: LCEL chain · LLM · temp 0.3
🔍 Retriever: FAISS + BM25 + RRF · local
⚖️ Grader: score-based · no LLM · ~1 ms
✍️ Generator: LCEL chain · LLM · temp 0.4
🔬 Critic: LCEL chain · LLM · temp 0.1
💡 Score-Based Grading — Fast, Deterministic, Zero API Cost

The Grader doesn't call the LLM at all. It combines the hybrid search score (FAISS + BM25 relevance) with a keyword overlap between the query and each chunk to produce a 0–1 relevance grade instantly. This reduces the pipeline from 8 LLM calls to 3, cutting average query time by ~60%.
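A minimal sketch of that grading arithmetic, assuming a plain bag-of-words overlap (the app's exact tokenization may differ):

```python
def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of query terms that also appear in the chunk (bag-of-words)."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

def grade(hybrid_score: float, query: str, chunk: str) -> float:
    """0-1 relevance grade: blend of hybrid retrieval score and keyword overlap."""
    return hybrid_score * 0.7 + keyword_overlap(query, chunk) * 0.3
```

Because both inputs are already computed during retrieval, the grade costs a few arithmetic operations per chunk instead of an LLM round-trip.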

Module Breakdown

Five Agents — Roles & Design

🎯 Planner Agent
Task Decomposition
Receives the user question and produces a brief research plan describing which aspects of the uploaded document are most relevant to answer it. Built as a LangChain LCEL chain — model is selectable at runtime.
Chain: ChatPromptTemplate | ChatOpenAI | StrOutputParser
Temperature: 0.3 · max 200 tokens
🔍 Retriever Agent
Hybrid RAG Search
Runs parallel FAISS semantic search and BM25 keyword search over the indexed chunks. Fuses results via Reciprocal Rank Fusion (k=60) for ranked hybrid output. No API calls — runs entirely locally.
Vector index: FAISS IndexFlatIP (cosine)
Keyword index: BM25Okapi
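Reciprocal Rank Fusion itself is a few lines of arithmetic over the two ranked lists; a sketch with the documented k=60 (doc IDs are illustrative):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by both FAISS and BM25 accumulate two large reciprocal terms and rise to the top, which is why the fusion rewards agreement between the semantic and keyword views.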
⚖️ Grader Agent
Relevance Scoring
Scores each retrieved chunk 0.0–1.0 using the hybrid search score and keyword overlap between query and chunk. Entirely score-based — no LLM call — making it instant and deterministic.
Method: score × 0.7 + overlap × 0.3
LLM needed: None — instant
✍️ Generator Agent
Cited Answer Generation
Receives the top-graded chunks as context. Generates a structured answer with inline source citations in [Source: filename, p.N] format using only the provided context. Model is selectable from the UI picker.
Chain: ChatPromptTemplate | ChatOpenAI | StrOutputParser
Temperature / max tokens: 0.4 · 512
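One way to make the citation format enforceable is to prefix every context chunk with its own [Source: …] tag before it reaches the prompt, so the Generator only has to copy tags rather than invent them. A hedged sketch (the chunk field names are assumptions):

```python
def build_context(chunks: list[dict]) -> str:
    """Prefix each chunk with the citation tag the Generator is asked to reuse."""
    parts = []
    for c in chunks:
        parts.append(f"[Source: {c['filename']}, p.{c['page']}]\n{c['text']}")
    return "\n\n".join(parts)
```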
🔬 Critic Agent
Hallucination Detection
Evaluates the generated answer against the source context for hallucinations and completeness. Outputs APPROVED (high confidence) or NEEDS_REVIEW (low confidence — user sees a warning badge). Model is runtime-selectable.
Chain: ChatPromptTemplate | ChatOpenAI | StrOutputParser
Temperature / max tokens: 0.1 · 150 (near-deterministic)
📄 Ingestor
PDF & URL Ingestion
Accepts PDF uploads (up to 10 MB, parsed with pypdf) or public URLs (fetched with requests + BeautifulSoup). Uses LangChain's RecursiveCharacterTextSplitter for chunking and LangChain-native HuggingFaceEmbeddings for embeddings.
Embeddings: LangChain HuggingFaceEmbeddings · bge-small-en-v1.5
Chunk size: 1500 chars · 200 overlap
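The 1500/200 overlap arithmetic can be sketched with a plain fixed-size splitter; note that the real RecursiveCharacterTextSplitter additionally prefers paragraph and sentence boundaries, which this simplification ignores:

```python
def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Fixed-size character chunks; consecutive chunks share `overlap` chars."""
    step = size - overlap  # advance 1300 chars per chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means a sentence cut at one chunk boundary is still intact at the start of the next chunk, which keeps retrieval from losing context at the seams.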
Technology Stack

Models, Libraries & Chains

The entire stack uses LCEL pipe syntax (ChatPromptTemplate | ChatOpenAI | StrOutputParser) in every LLM agent — not legacy LLMChain. LangGraph handles the stateful pipeline orchestration while LangChain LCEL handles the individual agent chains, demonstrating both frameworks together the way production systems actually use them.

LangGraph 0.2 — StateGraph, linear 5-node pipeline
Clean sequential flow: Planner → Retriever → Grader → Generator → Critic. No cyclic routing, no rewrite loops — simple, fast, debuggable.
Core
3 Selectable LLMs via HF Router (model picker UI)
Qwen 2.5-7B (default · fast) · Mistral Nemo 12B (stronger reasoning) · Phi-3.5 Mini 3.8B (ultra-fast). Switch without reloading — set_model() updates the factory globally. All via langchain_openai.ChatOpenAI with .with_retry(stop_after_attempt=2).
LLM
FAISS + BM25 + Reciprocal Rank Fusion · LangChain embeddings
BAAI/bge-small-en-v1.5 via langchain_huggingface.HuggingFaceEmbeddings — runs locally, no API calls. Chunking via LangChain RecursiveCharacterTextSplitter (1500 chars · 200 overlap).
RAG
Flask 3.1 + Gunicorn + Python threading
Async graph execution via daemon threads — query_id polling every 1.5 s lets the UI drive a live animated pipeline (CSS keyframe glows + flowing arrow dots) without SSE or WebSocket complexity.
Backend
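The thread-plus-polling pattern behind the live pipeline can be sketched with the standard library alone (names like submit/poll are illustrative; the real Flask route returns immediately rather than joining):

```python
import threading
import uuid

RESULTS: dict[str, dict] = {}   # query_id -> {"status": ..., "answer": ...}
LOCK = threading.Lock()

def run_pipeline(query_id: str, question: str) -> None:
    # Stand-in for executing the five-agent graph.
    with LOCK:
        RESULTS[query_id] = {"status": "done", "answer": f"Answer to: {question}"}

def submit(question: str) -> str:
    """Start the pipeline on a daemon thread; return a query_id the UI can poll."""
    query_id = uuid.uuid4().hex
    with LOCK:
        RESULTS[query_id] = {"status": "running"}
    worker = threading.Thread(
        target=run_pipeline, args=(query_id, question), daemon=True
    )
    worker.start()
    worker.join()  # real route skips this join; kept so the sketch is deterministic
    return query_id

def poll(query_id: str) -> dict:
    """What a status endpoint would return on each 1.5 s poll."""
    with LOCK:
        return RESULTS.get(query_id, {"status": "unknown"})
```

Each poll response can also carry the currently running node name, which is all the frontend needs to drive the glow and arrow animations.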
⚙️ langchain_openai.ChatOpenAI pointed at the HuggingFace Router

langchain_openai.ChatOpenAI(base_url="https://router.huggingface.co/v1") points LangChain's OpenAI integration at the HF inference router, which speaks the OpenAI chat-completions protocol. The entire LLM layer can be swapped to GPT-4o, Claude, Groq, or a local Ollama model by changing one URL and one API key, with zero changes to agent code.
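Concretely, "speaks the OpenAI chat-completions protocol" means every one of those providers accepts the same request body; a sketch of that shared payload shape (model id and max_tokens value are illustrative):

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """The OpenAI-style chat-completions body; identical across providers."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = chat_request("Qwen/Qwen2.5-7B-Instruct", "Summarize the uploaded document.")
body = json.dumps(payload)  # POSTed to {base_url}/chat/completions
```

Only the base URL, API key, and model string change between providers; the message schema stays fixed, which is what makes the swap a one-line configuration change.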

Interactive Explorer

Representative Agent Trace Outputs

Each tab shows a representative trace from a real query run — the exact output format the live observability dashboard displays for each agent node.

Outputs shown are from real runs against a sample PDF research paper. Live app executes agents in real time via HuggingFace free Inference API.

Performance Snapshot

Benchmarks & Agent Metrics

Agent Latency (ms)
Average latency per agent, measured over 30 test queries on the free HuggingFace Inference API. Retriever and Grader run locally (near-zero); the Generator is the bottleneck due to long output generation. Grader latency is ~1 ms: score-based, no LLM call.

Retrieval Quality
Hybrid search (FAISS + BM25 + RRF) vs. pure semantic search. The hybrid approach improves top-5 recall by ~18% on technical documents with domain-specific terminology that embedding models struggle with.

Model Benchmarks
Published benchmark scores for the three selectable models. Mistral Nemo 12B leads on reasoning-heavy tasks; Qwen 2.5-7B is the best all-round free-tier 7B model; Phi-3.5 Mini achieves strong results at just 3.8B parameters, ideal for fast, focused queries.

Design Decisions

Key Engineering Choices

3 LLM Calls, Not 8
Replacing the LLM-based grader with a score formula (hybrid score × 0.7 + keyword overlap × 0.3) eliminated 5 sequential LLM calls, cutting average query time by ~60%. The grader is now instant, deterministic, and costs nothing — while answer quality stays the same.
🏠 Local Embeddings = No Rate Limits
Running BAAI/bge-small-en-v1.5 locally via langchain_huggingface.HuggingFaceEmbeddings means the Retriever agent has zero API dependency and sub-millisecond embedding latency. Only the 3 LLM reasoning steps hit the free HF Router, keeping the system responsive even under multiple concurrent queries.
🔌 3-Model Picker + Provider-Agnostic Layer
A compact dropdown near the Ask button lets users choose Qwen 2.5-7B, Mistral Nemo 12B, or Phi-3.5 Mini without reloading. set_model() updates a global key; every get_llm() call reads it fresh — no stale cached chains. Swapping to OpenAI/Groq/Ollama needs one URL change.
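The fresh-read-per-call pattern described above can be sketched in a few lines (the picker keys and HF repo ids here are illustrative):

```python
MODELS = {
    "qwen": "Qwen/Qwen2.5-7B-Instruct",
    "nemo": "mistralai/Mistral-Nemo-Instruct-2407",
    "phi": "microsoft/Phi-3.5-mini-instruct",
}
_current = {"key": "qwen"}  # module-level state, mutated by set_model()

def set_model(key: str) -> None:
    """Called by the dropdown handler; validates and swaps the global key."""
    if key not in MODELS:
        raise ValueError(f"unknown model: {key}")
    _current["key"] = key

def get_llm() -> str:
    """Read the global key fresh on every call, so no chain caches a stale model.
    (Returns the model id in this sketch; the real factory returns a client.)"""
    return MODELS[_current["key"]]
```

Because agents call get_llm() at invocation time rather than holding a client built at import time, a model switch takes effect on the very next query.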
🎨 Live Pipeline + Light / Dark Theme
The ingestion row (Source→Chunker→Index) animates with CSS @keyframes nodeGlow when a file is being processed. Agent nodes glow color-coded (blue=LLM, green=local, gold=score) and arrows show a flowing dot as each step runs. Default is light theme; a header toggle persists preference via localStorage.