ML Risk Intelligence Platform: XGBoost, Scikit-learn, Credit, Fraud & Market Risk

Architecture Overview

From Synthetic Data to a Live Risk Dashboard

RiskSight Pro is structured as a single Flask application with a shared dark-theme shell injected into every page. On startup, three ML models are trained in memory on 5,200+ synthetic records covering credit borrowers, fraud transactions, insurance policies, and one year of daily market returns. Every chart is a server-side Plotly figure serialised to JSON and rendered client-side — no static images, no stale data.

🗄️ Synthetic Data: NumPy/Pandas (credit, fraud, insurance, market)

🧠 ML Training: RF · GBM · LogReg trained at startup

📊 Plotly Figures: server-side JSON, rendered in browser

🌐 Flask Routes: 7 pages + 3 REST API endpoints

🚀 HF Spaces: Docker container on port 7860
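The first pipeline stage can be pictured with a small NumPy sketch. This is only an illustration: the function name `make_credit_data`, the feature ranges, and the logistic rule used to assign defaults are assumptions, not the generator used in the app.

```python
import numpy as np

def make_credit_data(n=1200, seed=42):
    """Generate a synthetic borrower table as a dict of NumPy arrays."""
    rng = np.random.default_rng(seed)
    data = {
        "age": rng.integers(21, 70, size=n),
        "income": rng.lognormal(mean=11.0, sigma=0.5, size=n),
        "debt_ratio": rng.uniform(0.05, 0.8, size=n),
        "credit_score": rng.integers(300, 851, size=n),
    }
    # Hypothetical default mechanism: risk rises with debt ratio and
    # falls with credit score, passed through a logistic link.
    z = -1.0 + 3.0 * data["debt_ratio"] - 0.008 * (data["credit_score"] - 575)
    pd_true = 1.0 / (1.0 + np.exp(-z))
    data["default"] = (rng.uniform(size=n) < pd_true).astype(int)
    return data

records = make_credit_data()
print(len(records["default"]), round(float(records["default"].mean()), 3))
```

Seeding the generator means every boot produces the same table, which is what makes the downstream models repeatable.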

Module Breakdown

Seven Risk Modules in One App

💳 Credit Risk

Credit Risk Scorer

Random Forest trained on 1,200 borrower records predicts Probability of Default. An interactive form lets users score any applicant in real time.

Model: Random Forest (100 trees)

Output: PD · LGD proxy · Expected Loss
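The three outputs on this card are linked by the standard identity Expected Loss = PD × LGD × EAD. A minimal sketch follows, assuming a flat LGD proxy of 0.45 and the loan amount as exposure at default; both values and the function name `expected_loss` are illustrative and not taken from the app.

```python
def expected_loss(pd_, lgd, ead):
    """Expected Loss = PD x LGD x EAD for a single exposure."""
    if not 0.0 <= pd_ <= 1.0 or not 0.0 <= lgd <= 1.0:
        raise ValueError("PD and LGD must lie in [0, 1]")
    if ead < 0:
        raise ValueError("EAD must be non-negative")
    return pd_ * lgd * ead

ASSUMED_LGD = 0.45  # hypothetical flat LGD proxy, not the app's value

# A borrower scored at 8% PD on a 25,000 loan.
el = expected_loss(pd_=0.08, lgd=ASSUMED_LGD, ead=25_000)
print(f"Expected loss: ${el:,.2f}")
```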

🕵️ Fraud Detection

Fraud Transaction Monitor

Gradient Boosting classifier on 3,000 transactions flags fraud by amount, hour, merchant risk, foreign flag, and velocity. Live flagged transaction table included.

Model: Gradient Boosting (100 est.)

Threshold: Fraud probability > 0.25
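The flagging rule on this card reduces to a strict comparison against 0.25. The sketch below shows how a flagged-transaction table could be built from already-scored rows; the record layout, the IDs, and the function name `flag_transactions` are assumptions for illustration.

```python
FRAUD_THRESHOLD = 0.25  # from the module card: flag when probability > 0.25

def flag_transactions(scored, threshold=FRAUD_THRESHOLD):
    """Return transactions whose fraud probability exceeds the threshold,
    ordered from highest to lowest risk."""
    flagged = [t for t in scored if t["fraud_prob"] > threshold]
    return sorted(flagged, key=lambda t: t["fraud_prob"], reverse=True)

# Hypothetical model outputs for four transactions.
scored = [
    {"txn_id": "T001", "amount": 42.10, "fraud_prob": 0.03},
    {"txn_id": "T002", "amount": 1890.00, "fraud_prob": 0.61},
    {"txn_id": "T003", "amount": 310.55, "fraud_prob": 0.25},
    {"txn_id": "T004", "amount": 975.00, "fraud_prob": 0.34},
]

flagged = flag_transactions(scored)
print([t["txn_id"] for t in flagged])
```

A threshold well below 0.5 trades more false positives for fewer missed frauds, which is a common choice when fraud cases are rare.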

📈 Market Risk

VaR / CVaR Engine

252 days of simulated daily returns compute VaR at 95% & 99%, Expected Shortfall, Sharpe ratio, max drawdown, and a rolling 21-day VaR chart.

Method: Historical simulation

Portfolio size: $10M synthetic book
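Historical simulation reads VaR straight off the empirical return distribution. The sketch below is one way to compute the metrics named on this card with NumPy; the return parameters, the zero risk-free rate in the Sharpe ratio, and the function names are assumptions rather than the app's code.

```python
import numpy as np

def risk_metrics(returns, portfolio_value=10_000_000, level=0.95):
    """Historical-simulation VaR/CVaR plus Sharpe ratio and max drawdown."""
    returns = np.asarray(returns, dtype=float)
    # VaR: the loss at the (1 - level) percentile of daily returns.
    cutoff = np.percentile(returns, 100 * (1 - level))
    var = -cutoff * portfolio_value
    # Expected Shortfall (CVaR): average loss on days at or beyond the cutoff.
    cvar = -returns[returns <= cutoff].mean() * portfolio_value
    # Annualised Sharpe ratio, assuming a zero risk-free rate.
    sharpe = returns.mean() / returns.std(ddof=1) * np.sqrt(252)
    # Max drawdown of the cumulative wealth path, starting from 1.0.
    wealth = np.cumprod(np.concatenate(([1.0], 1.0 + returns)))
    drawdown = wealth / np.maximum.accumulate(wealth) - 1.0
    return {"var": var, "cvar": cvar, "sharpe": sharpe,
            "max_drawdown": drawdown.min()}

def rolling_var(returns, window=21, level=0.95):
    """VaR (as a positive return fraction) over each rolling window."""
    returns = np.asarray(returns, dtype=float)
    windows = np.lib.stride_tricks.sliding_window_view(returns, window)
    return -np.percentile(windows, 100 * (1 - level), axis=1)

rng = np.random.default_rng(42)
returns = rng.normal(0.0004, 0.012, 252)  # hypothetical one-year return series

metrics95 = risk_metrics(returns, level=0.95)
metrics99 = risk_metrics(returns, level=0.99)
print(f"VaR 95%: ${metrics95['var']:,.0f}  CVaR 95%: ${metrics95['cvar']:,.0f}")
print(f"VaR 99%: ${metrics99['var']:,.0f}  rolling points: {len(rolling_var(returns))}")
```

With 252 observations the 99% figure rests on only a handful of tail days, so it is far noisier than the 95% figure.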

🏦 Loan Portfolio

Portfolio Concentration Analyser

Explores loan volume by purpose, credit grade mix, income vs loan scatter, and a region × purpose default-rate heatmap for concentration risk identification.

Metrics: EL · Gross exposure · Grade mix

Breakdown: 4 purposes · 4 regions

🏥 Claims

Insurance Claims Analytics

Monthly claim volume trends, smoker vs non-smoker distributions, BMI vs claim scatter, and policy-type breakdowns across 1,000 synthetic policyholders.

Segments: Basic · Standard · Premium

Risk flags: Smoker · BMI>35 · Age>60

⚖️ Loss Ratio

Loss & Combined Ratio Monitor

Tracks loss ratio vs combined ratio (LR + 25% expense ratio) by month, with a region × policy-type heatmap and profitability threshold line at LR = 1.0.

Threshold: Break-even at LR = 1.0

Expense ratio: Fixed 25% for combined ratio
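The two ratios on this card differ only by the fixed expense load. A short sketch of the arithmetic, assuming claims and premiums are available as monthly totals; the figures and function names are hypothetical.

```python
EXPENSE_RATIO = 0.25  # fixed expense ratio stated on the module card

def loss_ratio(claims_paid, premiums_earned):
    """Loss ratio = claims paid / premiums earned."""
    if premiums_earned <= 0:
        raise ValueError("premiums_earned must be positive")
    return claims_paid / premiums_earned

def combined_ratio(claims_paid, premiums_earned, expense_ratio=EXPENSE_RATIO):
    """Combined ratio = loss ratio + expense ratio."""
    return loss_ratio(claims_paid, premiums_earned) + expense_ratio

# Hypothetical monthly totals: (claims paid, premiums earned).
monthly = {"Jan": (80_000, 100_000), "Feb": (70_000, 100_000)}

for month, (claims, premiums) in monthly.items():
    cr = combined_ratio(claims, premiums)
    status = "above break-even" if cr > 1.0 else "at or below break-even"
    print(f"{month}: LR={loss_ratio(claims, premiums):.2f} CR={cr:.2f} ({status})")
```

A month with a loss ratio of 0.80 looks profitable on claims alone but crosses 1.0 once the 25% expense load is added.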

Machine Learning Stack

Three Production-Style Models + Statistical Risk Engine

Each model is trained at application startup using Scikit-learn on in-memory synthetic data, with features standardised by a StandardScaler inside the model pipeline; nothing is loaded from a serialised artefact. The REST API endpoints receive JSON from browser forms and return scored results in milliseconds, the same request/response pattern used in real-world risk systems.
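The train-then-score pattern described above can be sketched with a Scikit-learn pipeline and a plain scoring function that accepts a JSON-style dict. Everything specific here is an assumption: the feature subset, the synthetic label rule, and the names `train_at_startup` and `score` are illustrative, and the Flask route wrapper is left out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FEATURES = ["age", "bmi", "smoker", "children"]  # illustrative subset

def train_at_startup(n=1000, seed=42):
    """Fit a scaler + logistic regression pipeline on synthetic policyholders."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.integers(18, 80, n),   # age
        rng.normal(27, 5, n),      # bmi
        rng.integers(0, 2, n),     # smoker flag
        rng.integers(0, 5, n),     # children
    ]).astype(float)
    # Hypothetical label rule: older, heavier, smoking policyholders are riskier.
    z = 0.03 * (X[:, 0] - 45) + 0.1 * (X[:, 1] - 27) + 1.2 * X[:, 2] - 1.0
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-z))).astype(int)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(X, y)

MODEL = train_at_startup()  # runs once, when the module is imported

def score(payload):
    """Score one JSON-style payload and return a JSON-serialisable result."""
    row = np.array([[float(payload[name]) for name in FEATURES]])
    prob = float(MODEL.predict_proba(row)[0, 1])
    return {"risk_probability": round(prob, 4)}

print(score({"age": 50, "bmi": 30, "smoker": 1, "children": 2}))
```

Keeping the scaler inside the pipeline means the same transformation is applied at training and at scoring time, so the endpoint never has to scale inputs by hand.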

Random Forest Classifier — Credit Risk

Features: Age, Income, Debt Ratio, Credit Score, Employment Years, Loan Amount

100 trees · sklearn

Gradient Boosting Classifier — Fraud Detection

Features: Amount, Hour, Foreign flag, Velocity, Merchant Risk (OHE)

100 estimators · sklearn

Logistic Regression — Underwriting Risk

Features: Age, BMI, Smoker, Children, Vehicle Age, Region (OHE)

Binary · L2 reg · sklearn

Historical Simulation — Market Risk (VaR/CVaR)

252 daily returns · Rolling 21-day window · Sharpe, Drawdown, Expected Shortfall

NumPy · percentile-based

Interactive Explorer

Simulated Risk Outputs by Module

Illustrative outputs based on synthetic data patterns in the app. Live app scores inputs in real time via trained ML models.

Performance Snapshot

Risk Metrics Across the Synthetic Portfolio

Default Rate by Credit Grade

Grade F borrowers show ~4× the default rate of Grade A. The Random Forest feature importance highlights credit score and debt ratio as the two dominant predictors.

Fraud by Transaction Channel

ATM and online channels carry the highest fraud rates, consistent with card-not-present and skimming patterns. Night-hour (0–5h) transactions account for a disproportionate share of fraud flags.

Loss Ratio by Policy Type

All three policy tiers sit above the break-even level of 1.0 once the 25% expense ratio is added to the loss ratio, illustrating the underwriting risk challenge the platform is designed to surface.

Design Decisions

Three Engineering Choices That Define the App

🧩

Single-file deploy

The entire app — data generation, ML training, 7 pages, 3 REST endpoints, and all CSS — lives in one app.py. HuggingFace Spaces needs only a Dockerfile and requirements.txt.

🔁

Train-on-boot

Models are trained fresh at startup on synthetic data with random_state=42. This keeps the deployment artefact-free while ensuring reproducible, deterministic outputs every time.
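The reproducibility claim can be checked directly: with fixed seeds for both the data and the estimator, two independent "boots" should yield the same predictions. A minimal sketch, assuming a toy dataset and a hypothetical `train_once` helper rather than the app's training code:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_once(seed=42):
    """Simulate one application boot: build data, fit, return sample scores."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(300, 4))
    noise = rng.normal(scale=0.5, size=300)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    return model.predict_proba(X[:5])[:, 1]

first_boot = train_once()
second_boot = train_once()
print(np.allclose(first_boot, second_boot))
```

The guarantee holds only while the library versions and the data generator stay fixed; upgrading Scikit-learn or NumPy can change the fitted trees even with the same seed.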

📡

JSON-first charts

Plotly figures are serialised server-side and injected into HTML as JSON literals. The browser calls Plotly.react() — ensuring charts are responsive, interactive, and theme-aware without extra round-trips.
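The injection step can be sketched with the standard library alone. The app presumably builds figures with Plotly's Python library, which has its own serialisers such as `fig.to_json()`; here a hand-written figure dict stands in for that, and the helper name `figure_to_script`, the div id, and the colours are assumptions.

```python
import json

def figure_to_script(div_id, figure):
    """Return a <div> plus a <script> that draws `figure` via Plotly.react()."""
    # Escape "</" so a string inside the JSON cannot close the <script> tag.
    payload = json.dumps(figure).replace("</", "<\\/")
    return (
        '<div id="' + div_id + '"></div>\n'
        "<script>\n"
        "(function () {\n"
        "  var fig = " + payload + ";\n"
        '  Plotly.react("' + div_id + '", fig.data, fig.layout);\n'
        "})();\n"
        "</script>"
    )

# Hypothetical figure spec in Plotly's {data, layout} JSON shape.
figure = {
    "data": [{"type": "bar", "x": ["A", "B", "C"], "y": [0.02, 0.05, 0.09]}],
    "layout": {
        "title": {"text": "Default Rate by Credit Grade"},
        "paper_bgcolor": "#111111",
        "font": {"color": "#eeeeee"},
    },
}

html = figure_to_script("grade-chart", figure)
print(html)
```

Wrapping each snippet in a function keeps the `fig` variable local, so several charts can share one page without name clashes.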
