ReAct Agent · LangGraph Python · Flask · LangChain 4 Free LLMs · Real-time Streaming

Every Agent Decision Visible: Transparent AI Customer Support in Production

An intelligent customer support chatbot that answers questions about orders, products, policies, and more using LangGraph's ReAct architecture. Built with free HuggingFace models, it features 5 tools, 4 switchable LLMs, and real-time token streaming. Switch models mid-chat, inspect every tool call, and watch the execution flow live.

March 2025 · Mohammad Noorchenarboo · 15 FAQs · 8 Products · Multi-turn support · Mistral · Zephyr · Phi-3 · Llama 3
4 · Free HuggingFace LLMs selectable at runtime
5 · Live agent tools with real dispatch logic
15 · FAQ entries in keyword-scored knowledge base
4 · LangGraph nodes with per-node SSE timing
≤4 · Max tool iterations per ReAct turn
Architecture Overview

ReAct StateGraph Pipeline

The agent is orchestrated by a LangGraph StateGraph with four nodes: Router initialises the turn, Agent calls the HuggingFace LLM and parses ReAct output, Tool Executor dispatches one of five tools, and Responder finalises the reply. Conditional edges route Agent output back into Tool Executor for up to 4 iterations before forcing a Final Answer. Every node emits enter/exit SSE events with timing, making the graph execution fully transparent in the UI.

🔀 Router · Initialise state, reset counters
🧠 Agent (LLM) · HF InferenceClient streaming, ReAct parse
🔧 Tool Executor · Dispatch one of 5 tools, collect result
📤 Responder · Emit done event, send analytics
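Stripped of LangGraph itself, the conditional-edge logic reduces to a small loop. A dependency-free sketch (function names are illustrative, not the app's actual code) of how Agent output routes back through Tool Executor until a Final Answer appears or the 4-iteration cap is hit:

```python
MAX_ITERATIONS = 4  # mirrors the app's ≤4 tool-iteration cap

def run_turn(agent_step, execute_tool):
    """agent_step(observations) -> dict with either 'final' or 'action'."""
    state = {"observations": [], "iterations": 0}     # Router: initialise state
    while True:
        decision = agent_step(state["observations"])  # Agent node
        if "final" in decision or state["iterations"] >= MAX_ITERATIONS:
            # Responder: emit the answer, forcing one if the cap is reached
            return decision.get("final", "I could not resolve this request.")
        # Tool Executor: dispatch the requested tool and record the observation
        state["observations"].append(execute_tool(decision["action"]))
        state["iterations"] += 1

# Toy agent: call one tool, then answer from its observation.
def toy_agent(observations):
    if not observations:
        return {"action": ("search_faq", "refund policy")}
    return {"final": f"Per the FAQ: {observations[0]}"}

answer = run_turn(toy_agent, lambda call: f"result of {call[0]}")
```

LangGraph expresses the same control flow declaratively as conditional edges on a StateGraph; the loop above is only the shape of the decision logic.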
💡 Why a separate events.py queue?

LangGraph runs in a background thread while Flask streams SSE. A thread-safe per-session queue.Queue decouples graph execution from HTTP streaming, preventing blocking and allowing proper timeout handling on either side.
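A minimal sketch of that pattern (names are illustrative, not the app's actual module): the graph thread pushes events into a per-session queue.Queue, and the SSE generator drains it with a timeout so a stalled graph cannot hang the HTTP response.

```python
import queue
import threading

def run_graph(events: "queue.Queue[dict]") -> None:
    """Stand-in for the LangGraph thread: emit a few tokens, then a sentinel."""
    for tok in ("Hello", " ", "world"):
        events.put({"type": "token", "data": tok})
    events.put({"type": "done"})

def sse_stream(events: "queue.Queue[dict]", timeout: float = 5.0):
    """Drain the queue into SSE frames; the timeout guards against a stalled graph."""
    while True:
        try:
            event = events.get(timeout=timeout)
        except queue.Empty:
            yield "event: error\ndata: graph timed out\n\n"
            return
        if event["type"] == "done":
            yield "event: done\ndata: {}\n\n"
            return
        yield f"data: {event['data']}\n\n"

events: "queue.Queue[dict]" = queue.Queue()
threading.Thread(target=run_graph, args=(events,), daemon=True).start()
frames = list(sse_stream(events))
```

In the real app the generator would be wrapped in Flask's `Response(..., mimetype="text/event-stream")`; the queue is what keeps the two sides decoupled.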

Module Breakdown

Six Dashboard Modules

💬 Chat
Streaming Chat Interface
Multi-turn chat with SSE token streaming. Each token appended before the cursor, bubble finalised on done event. Supports markdown-safe rendering.
Transport: Server-Sent Events
History: In-memory, per session
🗺️ Trace
Live Graph Trace
Animated node visualizer with pending / running / completed states. Each node pulses on entry and shows measured duration on exit, updated in real time.
Nodes tracked: 4 (router, agent, tools, respond)
Timing precision: ±1 ms (time.time())
🛠️ Tools
Tool Call Log
Collapsible entries for every tool invocation showing name, structured input JSON, and raw output text with measured latency displayed after the result arrives.
Tools available: 5 (FAQ, order, ticket, product, escalate)
Latency source: time.time() in tool_executor_node
📈 Stats
Session Analytics
Two Chart.js charts update on each completed turn: a horizontal bar chart of tool usage frequency and a line chart of response latency over turns.
Metrics tracked: turns, tokens, latency, tool calls
Token count method: word count × 1.35 estimate
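The word-count heuristic is trivial to reproduce; a sketch of the estimate as described above (the 1.35 tokens-per-word multiplier comes from the card, the function name is illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: whitespace-split word count scaled by ~1.35 tokens/word."""
    return round(len(text.split()) * 1.35)
```

Cheap and tokenizer-free, at the cost of drifting on code-heavy or non-English text.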
📜 History
Conversation History
Full per-turn message log with role badge, truncated content preview, timestamp, and estimated token count appended after each completed assistant reply.
Max preview length: 280 characters
Scope: Session lifetime (in-memory)
🤖 Models
Multi-Model Selector
Switch between Mistral 7B, Zephyr 7B, Phi-3 Mini, and Llama 3 8B mid-session. All use the same free HuggingFace Inference API with the same ReAct prompt.
Models: 4 (all free HF Inference API)
Temperature: 0.25 (consistent, low variance)
Technical Stack

Libraries, Models & Methods

The entire stack uses free or open-source components. LangGraph 0.2 provides the StateGraph runtime while huggingface-hub's InferenceClient handles streaming chat completions. Flask with gunicorn gthread workers enables concurrent SSE streams.

LangGraph 0.2 + LangChain Core
StateGraph, TypedDict AgentState, conditional edges, HumanMessage / ToolMessage
Orchestration
HuggingFace Hub InferenceClient
chat_completion() with stream=True — Mistral 7B, Zephyr 7B, Phi-3 Mini, Llama 3 8B
Free LLM API
Flask 3 + gunicorn gthread
SSE via Response(generate(), mimetype="text/event-stream"), 1 worker × 4 threads
Backend
Chart.js 4.4 (CDN)
Horizontal bar (tool usage) + line (latency history), theme-aware colors, live update('none')
Frontend Charts
⚙️ ReAct parsing strategy

Since free-tier models don't support native function calling, the agent uses regex-based ReAct parsing: Action/Action Input blocks trigger tool dispatch, and Final Answer blocks terminate the loop — no structured output format required.
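A minimal sketch of that parsing approach (the regexes are illustrative, not the app's exact patterns): detect a Final Answer block first, otherwise extract an Action / Action Input pair for dispatch.

```python
import json
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\s*Action Input:\s*(\{.*?\})", re.DOTALL)
FINAL_RE = re.compile(r"Final Answer:\s*(.*)", re.DOTALL)

def parse_react(output: str):
    """Return ('final', text) or ('tool', name, args) from raw LLM output."""
    if m := FINAL_RE.search(output):
        return ("final", m.group(1).strip())
    if m := ACTION_RE.search(output):
        return ("tool", m.group(1), json.loads(m.group(2)))
    # Neither block found: treat the whole output as the answer.
    return ("final", output.strip())

result = parse_react(
    'Thought: look this up.\nAction: check_order_status\nAction Input: {"order_id": "A123"}'
)
```

Checking Final Answer first matters: once the model has committed to an answer, any stray Action text in the same output should not trigger another tool call.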

Interactive Explorer

Representative Agent Scenarios

Each tab shows a representative turn from the live agent — the tools it calls, the metrics produced, and the reasoning path taken.

Illustrative outputs based on the agent's actual tool functions; the live app generates its responses in real time via the HuggingFace Inference API.

Performance Snapshot

Benchmarks & Metrics

Model Latency (ms)
Tool Usage Distribution
Iteration Depth

Median first-token latency for each model over 20 test turns. Phi-3 Mini is fastest; Llama 3 8B has highest reasoning quality at the cost of ~2× latency.

Tool call frequency across 100 test conversations. search_faq and check_order_status account for over 60% of all tool invocations, reflecting real support workloads.

Proportion of turns resolved in 1, 2, 3, or 4 ReAct iterations. Most queries resolve in a single tool call; complex multi-step queries require 2–3 iterations.

Design Decisions

Engineering Decisions

🔀 Thread + Queue SSE
LangGraph runs in a daemon thread; a per-session queue.Queue bridges it to Flask's SSE generator. This keeps the HTTP response non-blocking while the graph runs synchronously in the background.
📝 Regex ReAct Parsing
Free HuggingFace models don't support structured function calling. A layered regex parser extracts Action/Action Input blocks and falls back to key-value parsing if the JSON is malformed, tolerating the wide output variance of open models.
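The fallback layer can be sketched as: try json.loads first, then scrape `key: value` pairs when the model emits malformed JSON (patterns illustrative, not the app's exact ones).

```python
import json
import re

def parse_action_input(raw: str) -> dict:
    """Parse a tool-input block; tolerate malformed JSON from open models."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fallback: scrape key: value (or key = value) pairs from the text.
        pairs = re.findall(r'["\']?(\w+)["\']?\s*[:=]\s*["\']?([^,"\'\n}]+)', raw)
        return {key: value.strip() for key, value in pairs}

parse_action_input('{"order_id": "A123"}')            # well-formed JSON path
parse_action_input("{order_id: A123, urgent: true}")  # malformed: fallback path
```

The fallback returns string values only, which is acceptable here since every tool input is ultimately text.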
🔒 Zero-persistence Design
All session state lives in a Python dict in memory. No database, no file writes, no external services beyond the HF Inference API. This eliminates infrastructure cost and keeps the Space deployable at the free tier indefinitely.