Information Retrieval · Python · Flask · Dash · Live on HuggingFace

Beyond Keyword Search — Hybrid RAG Benchmarked on Live arXiv Papers

A full-stack benchmarking system that fetches real arXiv papers, indexes them with BM25 and dense vector retrieval, fuses results using Reciprocal Rank Fusion, and evaluates answer quality live. No API key required — runs entirely on open-source models.

2025 · Mohammad Noorchenarboo · Dynamic corpus via arXiv API · 2 ML models (MiniLM + Flan-T5)
3 retrieval methods (BM25 · Vector · Hybrid)
2 dashboard tabs (Load · Ask & Compare)
RRF fusion algorithm (k=60 canonical)
0.47 avg context relevance (hybrid, vs 0.41 BM25)
$0 API cost (fully open-source)
Architecture Overview

End-to-End RAG Pipeline

The system ingests real arXiv papers on-demand via the official API, splits abstracts into overlapping chunks using three configurable strategies, embeds each chunk with a sentence transformer, and builds both a FAISS inner-product index and a BM25 sparse index simultaneously. At query time, both indices retrieve candidates independently and Reciprocal Rank Fusion merges the ranked lists, consistently outperforming either method alone. Flan-T5 then receives a distilled selection of the six most relevant sentences as context.
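The fixed-window strategy mentioned above can be sketched in a few lines; the size and overlap values here are illustrative defaults, not the project's actual configuration:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Illustrative sketch: the project also offers sentence-window and
    semantic strategies, which are not shown here.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Overlap matters because a sentence cut at a chunk boundary would otherwise be unretrievable as a unit; the trailing 50 characters of each chunk reappear at the head of the next.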

📡 arXiv API: fetch papers by topic
✂️ Chunking: 3 strategies
🧠 Embeddings: MiniLM-L6-v2
📊 Indices: FAISS + BM25
🔀 RRF Fusion: hybrid rank
💬 Flan-T5: answer generation
Feature Set

Dashboard Modules

📥 Ingest
Paper Ingestion
Search arXiv by keyword and fetch papers on-demand. Configure chunking strategy (fixed, sentence-window, or semantic) and batch size.
API: /api/fetch-arxiv
Status: ✅ Live
🔍 Ask & Compare
Unified Query Page
Ask a question and get everything on one page: AI-generated answer, live metrics (relevance · faithfulness · diversity), radar + bar charts, and 3-column passage comparison across all three methods simultaneously.
Methods run: all 3 at once
Status: ✅ Live
💡 Domain Chips
Quick-Start Topics
Pre-built topic chips (📊 Data Science · 📣 Marketing · 🏥 Medical) fill the arXiv search box in one click. After loading, the Ask tab surfaces 2 contextual question chips tailored to the selected domain.
Domains: 3 built-in
Status: ✅ Live
🔌 REST API
Programmatic Access
Full JSON API: /api/fetch-arxiv, /api/query, /api/compare. All dashboard features accessible without the UI.
Endpoints: 3 routes
Status: ✅ Live
🧠 Fine-tune
Embedding Fine-Tuning
Domain-specific contrastive fine-tuning of the sentence-transformer on the ingested corpus using hard-negative mining to improve retrieval precision.
Method: contrastive / triplet loss
Status: 🗓️ Planned
ML Stack

Models & Methods

Every component runs locally inside the Docker container with no external API calls. The sentence transformer provides 384-dimensional L2-normalised embeddings, making cosine similarity equivalent to inner product — which is exactly what FAISS IndexFlatIP computes. The pairing is intentional: normalising once at index time makes every similarity query a simple dot product.
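The normalisation argument can be checked directly with NumPy; the vectors below are random stand-ins for MiniLM output (the real embeddings are also 384-dim), and the `normalize_embeddings=True` mentioned in the comment is the sentence-transformers `encode()` option that performs this step in practice:

```python
import numpy as np

# Toy stand-ins for MiniLM embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype("float32")

# L2-normalise once at index time (sentence-transformers does this when
# normalize_embeddings=True is passed to .encode()).
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

query = rng.normal(size=384).astype("float32")
query /= np.linalg.norm(query)

# On unit vectors, the inner product IS the cosine similarity — which is
# exactly what FAISS IndexFlatIP computes at query time.
inner = emb @ query
cosine = (emb @ query) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
assert np.allclose(inner, cosine)
```

Normalising once at index time amortises the cost: every subsequent query is a plain dot product with no per-query norm computation.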

Embedder: all-MiniLM-L6-v2 (384-dim · L2-normalised · ~90 MB · sentence-transformers)
Sparse retrieval: BM25Okapi (term frequency × inverse document frequency · rank-bm25)
Dense retrieval: FAISS IndexFlatIP (exact inner-product search · CPU · faiss-cpu 1.8)
Generator: google/flan-t5-large (780M params · seq2seq · CPU-friendly · ~900 MB · swappable)
⚙️ Why no LangChain?

Every component — RRF, chunker, context distillation — is implemented from scratch. This was a deliberate choice: understanding the internals of each retrieval stage is more resume-worthy than wiring together a framework that abstracts them away.

Interactive Explorer

Retrieval Method Comparison

Select a retrieval method to see representative metrics from a live run on arXiv papers about transformer architectures.

Metrics computed on arXiv CS.CL papers. Context Relevance = query–context cosine similarity. Diversity = 1 − mean pairwise similarity between retrieved chunks. Faithfulness = answer token overlap with context.
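A minimal reading of these metric definitions, assuming whitespace tokenisation and cosine over unit vectors; the project's exact tokenisation and averaging may differ:

```python
import numpy as np

def diversity(embs: np.ndarray) -> float:
    """1 minus the mean pairwise cosine similarity of retrieved chunks."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(embs)
    # Mean over off-diagonal pairs only (the diagonal is always 1).
    pairwise_mean = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - float(pairwise_mean)

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context."""
    ans = answer.lower().split()
    ctx = set(context.lower().split())
    return sum(t in ctx for t in ans) / max(len(ans), 1)
```

Identical chunks score 0 diversity and orthogonal chunks score 1, so higher diversity means the retriever is covering more distinct material rather than returning near-duplicates.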

Performance Snapshot

Evaluation Results

Method Comparison
Per-Query Relevance
Model Faithfulness

Grouped bar chart showing Context Relevance and Diversity for each retrieval method. Hybrid RRF leads on both axes — 0.47 relevance vs 0.41 for BM25 and 0.51 diversity vs 0.38 for BM25.

Per-query context relevance across 4 test queries. Hybrid (green) consistently stays above BM25 (amber) and Dense Vector (indigo), with the largest gap on abstract definitional questions.

Answer faithfulness by generator model and question type. Flan-T5 XL reaches 0.84 on factual questions but shows diminishing returns over Large on definitional queries.

Design Decisions

Key Engineering Choices

🔀
Hybrid beats both
RRF fuses BM25's keyword precision with the vector index's semantic recall. Each document scores 1/(k + rank) per ranked list, and the constant k=60 was chosen empirically in the original RRF paper (evaluated on TREC runs); it remains the canonical default, and the project implements it verbatim.
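The fusion step fits in a few lines; the document IDs below are illustrative:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank),
    with rank starting at 1 for the top result in each list."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["p3", "p1", "p2"]
dense_hits = ["p2", "p3", "p5"]
fused = rrf_fuse([bm25_hits, dense_hits])  # p3 wins: ranked high in both lists
```

Because RRF only consumes ranks, it needs no score normalisation between BM25's unbounded scores and the vector index's [-1, 1] cosine range, which is the usual reason it is preferred for fusing heterogeneous retrievers.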
✂️
Distil, don't truncate
Instead of truncating the context to fit the model's input, sentence-level embedding similarity selects the 6 most relevant sentences. This fixes the title-string contamination problem that caused Flan-T5 to output paper titles as answers.
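A sketch of that selection step in plain NumPy; that the kept sentences are restored to document order is an assumption for readability, not confirmed by the source:

```python
import numpy as np

def distil_context(sentences: list[str], sent_embs: np.ndarray,
                   query_emb: np.ndarray, top_k: int = 6) -> str:
    """Keep the top_k sentences most similar to the query, then restore
    their original order so the distilled context still reads coherently."""
    unit = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = unit @ q
    keep = np.sort(np.argsort(-sims)[:top_k])  # best k, back in doc order
    return " ".join(sentences[i] for i in keep)
```

Selecting whole sentences by similarity, rather than cutting at a token limit, is what keeps boilerplate such as title strings out of the prompt.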
🌙
CSS vars, not Python
Dark/light theming is handled entirely by CSS custom properties toggled via a clientside callback — no Python re-renders on theme switch. Plotly charts receive hex colors via theme-keyed dicts since they cannot read CSS variables.