How We Stress-Test Fraud ML: Synthetic, Real, and the Gaps We Refuse to Hide

Every fraud-ML vendor shows you a benchmark number. Almost none will tell you whether that number survives contact with a different dataset — let alone a real one. Banks know the pattern by heart: a model dazzles on a vendor's curated sample, then crumbles in model-risk review or quietly underperforms in production. So the most useful question a buyer can ask is not “how good is your number?” It is “what did you test it against, and where does it break?”

Our answer is a methodology, not a single benchmark. We took one fraud-ML approach and ran it against three deliberately different evidence slices, each chosen to stress a different risk — and we report what each one does and does not prove. Everything below is offline research on synthetic and public datasets. None of it is live-bank performance, and we say so at every step. Real bank data is the activation gate that comes after this work — not something this post claims to have cleared.

A single benchmark is not evidence

One impressive metric on one dataset tells you almost nothing about how a model behaves on yours. Fraud has many shapes — card-not-present, account-to-account transfer, money-mule networks — and a model tuned to one shape routinely fails to transfer to another. Worse, a curated sample is easy to overfit, so a headline accuracy or AUC can reflect memorization rather than detection. And a black box that cannot explain why it flagged a transaction will not survive a serious model-risk review, no matter how high the number on the slide.

That is the credibility problem in fraud ML: the part everyone shows you (the number) is the part that generalizes least, and the part that earns trust with a model-validation team (rigor, leakage discipline, explainability, candor about limits) is the part most vendors skip. We built our validation to invert that.

Validate across axes, not across a single dataset

Our principle is simple: test along two axes at once, so a good result cannot be an artifact of one data style. The first axis is real versus synthetic — synthetic data lets us shape realistic fraud typologies, but only real data carries the messiness of actual customer behavior. The second axis is feature shape — account-to-account transfers, card-not-present transactions, and anonymized statistical components each exercise a different part of the modeling pipeline.

Pick datasets that move you along both axes and a single strong result has to clear more than one kind of skepticism. That is why we did not look for one perfect benchmark dataset. We looked for three imperfect ones that fail in different directions.

The three datasets

Here is the evidence set, side by side. Two of the three are public and openly licensed so the setup is conceptually reproducible; the third is our own synthetic Southeast-Asia bank rehearsal data, which is internal and not published.

	Synthetic SEA-bank rehearsal (RTD-internal)	ULB credit-card (public)	IBM AML (public)
Real or synthetic	Synthetic	Real (European cardholders, 2013)	Synthetic
Source / license	RTD-internal synthetic rehearsal data	Worldline & ULB Machine Learning Group, via Kaggle — DbCL-1.0	IBM “Synthetic Transactions for AML” — CDLA-Sharing-1.0
Fraud shape	Account-to-account transfers; AML-style typologies	Card-not-present transactions	AML transfer networks / laundering typologies
Feature shape	Point-in-time counts (e.g. prior-transfer count, trailing-90-day transaction count)	PCA-anonymized components (V1–V28) + amount	Transaction-graph aggregates (fan-in / fan-out, velocity, counterparty patterns)
Class imbalance	Engineered to production-like fraud prevalence on transfer-shaped synthetic data	0.17% fraud (extreme, real-world)	Low positive-class rate (engineered, AML-realistic)
Eval metric	recall @ 2.5% false-positive rate	recall @ 2.5% false-positive rate	recall @ 1% false-positive rate
Result	83.8% (conservative, status-feature-ablated floor)	95.9% stratified / 86.7% temporal split	82.1% (leakage-free)

All results are research figures on these datasets — not live-bank performance. The false-positive levels are evaluation choices for these experiments, not a product setting.

The three datasets do not all test the same thing, and being precise about that is the honest move — it actually makes the combined evidence stronger, because each slice retires a different risk.

Dataset	What it validates	What it does NOT validate
Synthetic SEA-bank rehearsal	The full approach: gradient-boosted trees plus rules-as-features, scored on recall-at-fixed-false-positive, on our transfer-shaped benchmark with interpretable feature families	It is synthetic — not real-bank traffic
ULB credit-card (real)	The gradient-boosted core and the recall-at-fixed-false-positive evaluation method survive real, extreme-imbalance (0.17%) fraud	Cannot test rules-as-features or human-readable feature engineering — its features are PCA-anonymized
IBM AML (synthetic)	The transfer-graph feature contract (fan-in / fan-out, velocity, counterparty familiarity) computes and discriminates on transfer-shaped data	It is synthetic AML, not first-party or account-takeover transaction fraud; not a real-adversarial proof

The honest gap: none of these is a real Southeast-Asia bank's production data. Public and synthetic datasets have characteristics no single bank's live traffic will exactly share. These experiments validate the method; real bank data is the activation gate that comes after — not something we are claiming to have cleared here.

How we test offline

The method is deliberately boring, because boring is what holds up in review. Five disciplines do the work:

A gradient-boosted core, with rules-as-features where the data allows it. On transfer-shaped data — our synthetic SEA-bank set and the IBM AML graph features — we feed existing fraud rules into the model as input signals rather than running them as a separate, parallel system. That keeps domain knowledge in the loop and the model interpretable. On the anonymized ULB card data, rules-as-features is not possible, so ULB tests the gradient-boosted core alone.
Recall at a fixed, low false-positive rate (an evaluation choice). Accuracy and AUC hide the cost that actually matters to a fraud team: how many good customers a model would wrongly flag. So we fix a low false-positive rate for the experiment and ask the only question that counts: how much fraud do we catch under that constraint? (The 1% and 2.5% levels here are evaluation settings for these datasets, not a product operating point.)
Leakage-safe by construction. Temporal train/test splits; point-in-time features computed only from strictly-prior events, with no peeking at the future; graph aggregates built only from past transactions. Where it matters we report both stratified and temporal splits — and the temporal split is the honest one.
Explainability, scoped honestly. Where features are interpretable — the synthetic SEA-bank set and the IBM AML transfer-graph features — SHAP decomposes a score into human-readable feature contributions a reviewer can interrogate. On ULB, SHAP runs over PCA-anonymized components (V1–V28): that is faithful model attribution, but not human-readable banking reasoning, and we label it as such. Pretending an anonymized feature is interpretable would undermine exactly the trust we are trying to build.
A conservative-floor discipline. For the synthetic set we report a status-feature-ablated number (83.8%) — we deliberately remove the easiest, most separable signals and report the harder, lower figure, because the easy number would flatter us. A held-out gold-sample slice acts as an independent oracle to confirm we are measuring detection, not memorization. Reporting the floor is the point.
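
The second and third disciplines above can be sketched in a few lines. This is a minimal illustration, not the code behind the figures in this post: the helper names `temporal_split` and `recall_at_fpr`, the row layout with a `ts` key, and the conservative tie-handling rule are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]  # hypothetical row layout, e.g. {"ts": 17.0, "label": 0.0}


def temporal_split(rows: List[Row], train_fraction: float = 0.8) -> Tuple[List[Row], List[Row]]:
    """Train on the earliest rows and evaluate on the latest ones (no shuffling)."""
    ordered = sorted(rows, key=lambda r: r["ts"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def recall_at_fpr(labels: Sequence[int], scores: Sequence[float], max_fpr: float) -> float:
    """Share of fraud caught while flagging at most max_fpr of legitimate traffic.

    labels: 1 = fraud, 0 = legitimate. Higher score = more suspicious.
    """
    neg = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one fraud and one legitimate example")

    # How many legitimate transactions the budget allows us to flag.
    allowed_fp = math.floor(max_fpr * len(neg) + 1e-9)
    if allowed_fp >= len(neg):
        return 1.0  # budget covers every legitimate row, so everything is flagged

    # Flag only scores strictly above this cutoff, so ties never exceed the budget.
    cutoff = neg[allowed_fp]
    caught = sum(1 for s in pos if s > cutoff)
    return caught / len(pos)


# Toy data: 10 legitimate scores, 4 fraud scores.
toy_labels = [0] * 10 + [1] * 4
toy_scores = [i / 10 for i in range(10)] + [0.95, 0.85, 0.5, 0.05]
toy_recall = recall_at_fpr(toy_labels, toy_scores, max_fpr=0.1)  # 0.5
```

In this toy case a 10% false-positive budget allows one legitimate transaction to be flagged, and two of the four fraud cases score above the resulting cutoff. In a real evaluation the model would be fit on the earlier slice from `temporal_split` and scored on the later one before `recall_at_fpr` is applied.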

Results, and the “separability tax”

Synthetic data is cleaner and more separable than the messy reality of a real bank's traffic, so the honest prior is that a method validated on synthetic data should be discounted when you move to real data. Call that expected discount the “separability tax.” The interesting question is how large it actually is.

On real card fraud (ULB), the tax was negative for the gradient-boosted core. The same core and the same recall-at-fixed-false-positive method reached 95.9% recall on a stratified split and 86.7% on a temporal split, both at a 2.5% false-positive level on real data and both above the conservative 83.8% synthetic floor. Real, extreme-imbalance fraud, where only 0.17% of transactions are fraudulent, did not punish the core; if anything it rewarded it. ULB cannot test rules-as-features, so this is a statement about the core model plus evaluation method, not about the full setup.
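
As a worked example of the arithmetic, the tax can be read as the synthetic recall minus the real recall, in percentage points, at the same 2.5% false-positive level. The function name `separability_tax` and this exact definition are our shorthand for the example. The two inputs come from different datasets, so the output is indicative rather than a controlled measurement.

```python
def separability_tax(synthetic_recall_pct: float, real_recall_pct: float) -> float:
    """Percentage points lost when moving from synthetic to real data.

    A negative value means the real-data result was higher than the synthetic one.
    """
    return round(synthetic_recall_pct - real_recall_pct, 1)


SYNTHETIC_FLOOR = 83.8  # status-feature-ablated synthetic result, recall @ 2.5% FPR
ULB_TEMPORAL = 86.7     # ULB, temporal split, recall @ 2.5% FPR
ULB_STRATIFIED = 95.9   # ULB, stratified split, recall @ 2.5% FPR

tax_temporal = separability_tax(SYNTHETIC_FLOOR, ULB_TEMPORAL)      # -2.9
tax_stratified = separability_tax(SYNTHETIC_FLOOR, ULB_STRATIFIED)  # -12.1
```

Against the temporal split, which is the stricter of the two, the tax is about minus 2.9 points.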

On AML transfer data (IBM), the transfer-graph feature contract held under a stricter setting. Recall reached 82.1% at a tighter 1% false-positive level, leakage-free, with transaction-graph aggregates — fan-in / fan-out, velocity, counterparty familiarity — emerging as the dominant signals. That validates that the transfer and graph features compute and discriminate on transfer-shaped data. It is synthetic AML, though, not a real-adversarial proof.
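
To make "leakage-free graph aggregates" concrete, here is a minimal sketch of fan-in, fan-out, and counterparty familiarity computed only from strictly earlier transfers. The `(timestamp, sender, receiver)` tuple schema, the function name `prior_graph_features`, and the feature names are hypothetical; this is not the feature contract used in the experiments.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

Transfer = Tuple[int, str, str]  # hypothetical schema: (timestamp, sender, receiver)


def prior_graph_features(transfers: List[Transfer]) -> List[Dict[str, int]]:
    """Return one feature dict per transfer, in timestamp order.

    Each feature uses only transfers with a strictly earlier timestamp,
    so a transfer never sees itself or anything that happens later.
    """
    receivers_seen: DefaultDict[str, Set[str]] = defaultdict(set)  # sender -> receivers so far
    senders_seen: DefaultDict[str, Set[str]] = defaultdict(set)    # receiver -> senders so far
    ordered = sorted(transfers, key=lambda t: t[0])
    features: List[Dict[str, int]] = []

    i = 0
    while i < len(ordered):
        # Group transfers that share a timestamp so they cannot see each other.
        j = i
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            j += 1
        batch = ordered[i:j]

        # First read features from history, then add the batch to history.
        for _, src, dst in batch:
            features.append({
                "sender_fan_out": len(receivers_seen[src]),
                "receiver_fan_in": len(senders_seen[dst]),
                "known_counterparty": int(dst in receivers_seen[src]),
            })
        for _, src, dst in batch:
            receivers_seen[src].add(dst)
            senders_seen[dst].add(src)
        i = j
    return features


example = prior_graph_features([
    (1, "A", "B"),
    (2, "A", "C"),
    (3, "D", "B"),
    (4, "A", "B"),
])
# The last transfer (A to B at t=4) sees sender_fan_out=2,
# receiver_fan_in=2, known_counterparty=1.
```

Reading features before updating the history is the whole trick: the state at scoring time reflects only what had already happened, which is the same constraint the temporal split imposes at the dataset level.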

What this adds up to, stated narrowly: the evidence retires two separate risks rather than proving one sweeping claim. The gradient-boosted core held on one real, rare-fraud card dataset, and the transfer feature contract discriminated on one synthetic AML transfer dataset. That is genuinely encouraging and a meaningfully de-risked methodology. It is not a claim that the method generalizes universally from two external datasets, and it is not a claim about any bank's live data. Real bank data remains the activation gate before any of this scores a live decision.

Why this matters — for the product and the reader

For us, this de-risks the scorer methodology in bounded, defensible steps. The full gradient-boosted-plus-rules-as-features setup is validated on a synthetic transfer-shaped benchmark; the core model and evaluation method survive real, extreme-imbalance card fraud; and the transfer-graph feature contract computes and discriminates on transfer-shaped AML data. Each step is the kind of evidence a buyer's model-risk team actually interrogates — and we present them as separate, bounded results, not one blended headline.

For the reader, the takeaway is about how to evaluate any fraud-ML claim, ours included. Rigor across multiple datasets, explainability scoped honestly, and candor about the synthetic-versus-real gap are not nice-to-haves. In a market full of overclaiming, being the vendor that shows its work and states its limits is the whole differentiator. The path forward from here is a real-data-validated scorer that can stand up to model-risk review; the public and synthetic results are reported in the open, and real bank data is the gate that comes next.

So do not trust a fraud-ML number you cannot interrogate. Ask the vendor which datasets it was measured on, with which metric, how leakage was prevented, how the model's decisions are explained, and what the honest gap is to your own data. Those are the questions we built our validation to answer out loud, and the gaps we found are the ones we refuse to hide.

Run-True Decision is building a fraud decision engine purpose-built for Southeast Asian banks. Talk to us to learn more.
