Information Retrieval · Python · Flask · Dash · Live on HuggingFace

Beyond Keyword Search — Hybrid RAG Benchmarked on Live arXiv Papers

A full-stack benchmarking system that fetches real arXiv papers, indexes them with BM25 and dense vector retrieval, fuses results using Reciprocal Rank Fusion, and evaluates answer quality live. No API key required — runs entirely on open-source models.

2025 · Mohammad Noorchenarboo · Dynamic corpus via arXiv API · 2 ML models (MiniLM + Flan-T5)
3 retrieval methods (BM25 · Vector · Hybrid)
2 dashboard tabs (Load · Ask & Compare)
RRF fusion algorithm (k=60 canonical)
0.47 avg context relevance (hybrid, vs 0.41 BM25)
$0 API cost (fully open-source)
Architecture Overview

End-to-End RAG Pipeline

The system ingests real arXiv papers on-demand via the official API, splits abstracts into overlapping chunks using three configurable strategies, embeds each chunk with a sentence transformer, and builds both a FAISS inner-product index and a BM25 sparse index simultaneously. At query time, both indices retrieve candidates independently and Reciprocal Rank Fusion merges the ranked lists, consistently outperforming either method alone. Flan-T5 then receives a distilled selection of the six most relevant sentences as context.
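The fixed-window strategy mentioned above can be sketched in a few lines; the size and overlap values here are illustrative defaults, not the project's actual configuration:

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows with overlap.

    Illustrative sketch: the project also offers sentence-window and
    semantic strategies, which are not shown here.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Overlap matters because a sentence cut at a chunk boundary would otherwise be unretrievable as a unit; the trailing 50 characters of each chunk reappear at the head of the next.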

📡 arXiv API: fetch papers by topic
✂️ Chunking: 3 strategies
🧠 Embeddings: MiniLM-L6-v2
📊 Indices: FAISS + BM25
🔀 RRF Fusion: hybrid rank
💬 Flan-T5: answer generation
Feature Set

Dashboard Modules

📥 Ingest
Paper Ingestion
Search arXiv by keyword and fetch papers on-demand. Configure chunking strategy (fixed, sentence-window, or semantic) and batch size.
API: /api/fetch-arxiv
Status: ✅ Live
🔍 Ask & Compare
Unified Query Page
Ask a question and get everything on one page: AI-generated answer, live metrics (relevance · faithfulness · diversity), radar + bar charts, and 3-column passage comparison across all three methods simultaneously.
Methods run: all 3 at once
Status: ✅ Live
💡 Domain Chips
Quick-Start Topics
Pre-built topic chips (📊 Data Science · 📣 Marketing · 🏥 Medical) fill the arXiv search box in one click. After loading, the Ask tab surfaces 2 contextual question chips tailored to the selected domain.
Domains: 3 built-in
Status: ✅ Live
🔌 REST API
Programmatic Access
Full JSON API: /api/fetch-arxiv, /api/query, /api/compare. All dashboard features accessible without the UI.
Endpoints: 3 routes
Status: ✅ Live
🧠 Fine-tune
Embedding Fine-Tuning
Domain-specific contrastive fine-tuning of the sentence-transformer on the ingested corpus using hard-negative mining to improve retrieval precision.
Method: contrastive / triplet loss
Status: 🗓️ Planned
ML Stack

Models & Methods

Every component runs locally inside the Docker container with no external API calls. The sentence transformer provides 384-dimensional L2-normalised embeddings, making cosine similarity equivalent to inner product — which is exactly what FAISS IndexFlatIP computes. The pairing is intentional: normalising once at index time makes every similarity query a simple dot product.
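The normalisation argument can be checked directly with NumPy; the vectors below are random stand-ins for MiniLM output (the real embeddings are also 384-dim), and the `normalize_embeddings=True` mentioned in the comment is the sentence-transformers `encode()` option that performs this step in practice:

```python
import numpy as np

# Toy stand-ins for MiniLM embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 384)).astype("float32")

# L2-normalise once at index time (sentence-transformers does this when
# normalize_embeddings=True is passed to .encode()).
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

query = rng.normal(size=384).astype("float32")
query /= np.linalg.norm(query)

# On unit vectors, the inner product IS the cosine similarity — which is
# exactly what FAISS IndexFlatIP computes at query time.
inner = emb @ query
cosine = (emb @ query) / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
assert np.allclose(inner, cosine)
```

Normalising once at index time amortises the cost: every subsequent query is a plain dot product with no per-query norm computation.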

Embedder: all-MiniLM-L6-v2 (384-dim · L2-normalised · ~90 MB · sentence-transformers)
Sparse retrieval: BM25Okapi (term frequency × inverse document frequency · rank-bm25)
Dense retrieval: FAISS IndexFlatIP (exact inner-product search · CPU · faiss-cpu 1.8)
Generator: google/flan-t5-large (780M params · seq2seq · CPU-friendly · ~900 MB · swappable)
⚙️ Why no LangChain?

Every component — RRF, chunker, context distillation — is implemented from scratch. This was a deliberate choice: understanding the internals of each retrieval stage is more resume-worthy than wiring together a framework that abstracts them away.

Interactive Explorer

Retrieval Method Comparison

Select a retrieval method to see representative metrics from a live run on arXiv papers about transformer architectures.

Metrics computed on arXiv CS.CL papers. Context Relevance = query–context cosine similarity. Diversity = 1 − mean pairwise similarity between retrieved chunks. Faithfulness = answer token overlap with context.
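A minimal reading of these metric definitions, assuming whitespace tokenisation and cosine over unit vectors; the project's exact tokenisation and averaging may differ:

```python
import numpy as np

def diversity(embs: np.ndarray) -> float:
    """1 minus the mean pairwise cosine similarity of retrieved chunks."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(embs)
    # Mean over off-diagonal pairs only (the diagonal is always 1).
    pairwise_mean = (sims.sum() - n) / (n * (n - 1))
    return 1.0 - float(pairwise_mean)

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also appear in the context."""
    ans = answer.lower().split()
    ctx = set(context.lower().split())
    return sum(t in ctx for t in ans) / max(len(ans), 1)
```

Identical chunks score 0 diversity and orthogonal chunks score 1, so higher diversity means the retriever is covering more distinct material rather than returning near-duplicates.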

Performance Snapshot

Evaluation Results

Method Comparison
Per-Query Relevance
Model Faithfulness

Grouped bar chart showing Context Relevance and Diversity for each retrieval method. Hybrid RRF leads on both axes — 0.47 relevance vs 0.41 for BM25 and 0.51 diversity vs 0.38 for BM25.

Per-query context relevance across 4 test queries. Hybrid (green) consistently stays above BM25 (amber) and Dense Vector (indigo), with the largest gap on abstract definitional questions.

Answer faithfulness by generator model and question type. Flan-T5 XL reaches 0.84 on factual questions but shows diminishing returns over Large on definitional queries.

Design Decisions

Key Engineering Choices

🔀
Hybrid beats both
RRF fuses BM25's keyword precision with the vector index's semantic recall. Each document scores 1/(k + rank) per ranked list, and the constant k=60 was chosen empirically in the original RRF paper (evaluated on TREC runs); it remains the canonical default, and the project implements it verbatim.
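The fusion step fits in a few lines; the document IDs below are illustrative:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank),
    with rank starting at 1 for the top result in each list."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["p3", "p1", "p2"]
dense_hits = ["p2", "p3", "p5"]
fused = rrf_fuse([bm25_hits, dense_hits])  # p3 wins: ranked high in both lists
```

Because RRF only consumes ranks, it needs no score normalisation between BM25's unbounded scores and the vector index's [-1, 1] cosine range, which is the usual reason it is preferred for fusing heterogeneous retrievers.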
✂️
Distil, don't truncate
Instead of truncating the context to fit the model's input, sentence-level embedding similarity selects the 6 most relevant sentences. This fixes the title-string contamination problem that caused Flan-T5 to output paper titles as answers.
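A sketch of that selection step in plain NumPy; that the kept sentences are restored to document order is an assumption for readability, not confirmed by the source:

```python
import numpy as np

def distil_context(sentences: list[str], sent_embs: np.ndarray,
                   query_emb: np.ndarray, top_k: int = 6) -> str:
    """Keep the top_k sentences most similar to the query, then restore
    their original order so the distilled context still reads coherently."""
    unit = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = unit @ q
    keep = np.sort(np.argsort(-sims)[:top_k])  # best k, back in doc order
    return " ".join(sentences[i] for i in keep)
```

Selecting whole sentences by similarity, rather than cutting at a token limit, is what keeps boilerplate such as title strings out of the prompt.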
🌙
CSS vars, not Python
Dark/light theming is handled entirely by CSS custom properties toggled via a clientside callback — no Python re-renders on theme switch. Plotly charts receive hex colors via theme-keyed dicts since they cannot read CSS variables.