Self-Improving Fraud Engines: How Autonomous Optimization Loops Sharpen Detection
How RTD applies autonomous AI research patterns to continuously optimize fraud detection rules, thresholds, and ML models — no manual tuning required.
RTD Team
Run-True Decision
The Manual Tuning Trap
Most fraud detection systems share an uncomfortable secret: their rules were tuned by a human, once, and haven't been meaningfully revisited since. The same is true for the machine learning models behind them — trained on a fixed dataset, deployed, and left to decay.
It's not hard to understand why. A typical fraud decision engine has dozens of rules with individual scores, threshold constants that determine when those rules fire, decision boundaries for each transaction flow, and one or more ML models with hyperparameters that significantly affect performance. Changing any one parameter can ripple through the entire scoring pipeline. Swap the ML model type and every downstream threshold shifts. Raise a rule score to catch more fraud, and precision drops elsewhere.
The result is tuning paralysis. Teams know their parameters aren't optimal, but the cost of experimentation — time, risk, expertise — keeps them frozen. Rule scores get set during initial deployment. ML models ship with default hyperparameters. The fraud landscape evolves; the engine doesn't.
This matters more in Southeast Asia than almost anywhere else. Fraud patterns across the region shift rapidly — from cross-border wire fraud in Singapore to social engineering scams in Indonesia to check deposit fraud in the Philippines. A static ruleset and a stale model tuned for last quarter's patterns are already behind.
Andrej Karpathy and the Autonomous Research Pattern
Andrej Karpathy — former Senior Director of AI at Tesla, founding member of OpenAI, and one of the most influential figures in modern deep learning — recently open-sourced a project called autoresearch that has rapidly captured the attention of the AI community. The idea is deceptively simple: let an AI agent run ML experiments autonomously, overnight, improving a model through hundreds of small iterations.
The architecture is elegant. Three files: a program prompt that tells the agent what to optimize, a training script it can modify, and an evaluation metric it can measure. The agent enters a loop — modify the code, train, evaluate, keep if improved, discard if not, repeat. One GPU, one file, one metric. The result: roughly 100 experiments per overnight run, each building on the last successful configuration.
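The modify-train-evaluate-keep loop described above can be sketched as a greedy hill climb. This is a minimal illustration, not Karpathy's actual implementation: `mutate` and `train_and_eval` are hypothetical callables standing in for the agent's code edit and the training and evaluation scripts.

```python
import random

def autonomous_loop(mutate, train_and_eval, baseline_config, max_iters=100):
    """Greedy hill-climbing sketch of the modify / train / evaluate loop."""
    best_config = baseline_config
    best_score = train_and_eval(best_config)
    for _ in range(max_iters):
        candidate = mutate(best_config)    # one small change per experiment
        score = train_and_eval(candidate)  # run training, measure the metric
        if score > best_score:             # keep improvements, discard the rest
            best_config, best_score = candidate, score
    return best_config, best_score

# Toy stand-in: "training" evaluates -(x - 3)^2, and mutation nudges x.
best_cfg, best = autonomous_loop(
    mutate=lambda c: {"x": c["x"] + random.uniform(-0.5, 0.5)},
    train_and_eval=lambda c: -(c["x"] - 3.0) ** 2,
    baseline_config={"x": 0.0},
)
```

Because losing candidates are discarded rather than accumulated, each overnight run produces a chain of strictly improving configurations, which is also what makes the final commit history easy to audit.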
The concept has quickly spread beyond ML research. Developers and entrepreneurs are now applying the same autonomous iteration pattern to business optimization — A/B testing email copy, iterating landing page designs, tuning ad creatives, even backtesting quantitative trading strategies. Any process with an objective metric, a fast feedback loop, and API access to the variables becomes a candidate for autonomous optimization.
At RTD, we saw this pattern and recognized an immediate application: fraud detection is a textbook fit. We have objective metrics (expected cost, precision, recall). We have fast feedback loops (backtesting completes in seconds for rules, minutes for ML). And we have full control over the parameters — rule scores, thresholds, decision boundaries, ML hyperparameters. Within days of autoresearch gaining mainstream traction, we had an implementation plan and a working prototype adapting the pattern to optimize our Fraud Decision Engine end-to-end.
How It Works: From Rules to ML Models
RTD's Fraud Decision Engine adapts Karpathy's three-file architecture into an autonomous optimization loop that spans two layers of the fraud detection stack.
Layer 1: Rules and Scoring Thresholds
The foundation layer optimizes four surfaces that determine how individual fraud signals are weighted and combined into decisions:
- Rule scores — the default weight each detection rule contributes when triggered (e.g., a suspicious beneficiary country rule might score 15 or 25 — the optimal value depends on how it interacts with other rules)
- Rule thresholds — the constants inside individual rules that determine when they fire (how many ACH returns in 90 days counts as suspicious?)
- Decision thresholds — the boundaries between pass, review, and reject for each transaction flow
- Score aggregation — how individual rule scores are combined into a final risk score
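To make the four surfaces concrete, here is a simplified sketch of how rule scores, decision thresholds, and aggregation fit together. The rule names, weights, and thresholds are illustrative, not RTD's actual configuration; each constant below is a knob the optimizer can turn.

```python
# Hypothetical rule weights (Layer 1 surface: rule scores)
RULE_SCORES = {
    "suspicious_beneficiary_country": 20,
    "ach_return_velocity": 15,
    "new_device_high_value": 25,
}

# Hypothetical pass/review/reject boundaries (surface: decision thresholds)
DECISION_THRESHOLDS = {"review": 30, "reject": 50}

def score_transaction(triggered_rules):
    """Surface: score aggregation -- here a simple sum of triggered rules."""
    return sum(RULE_SCORES.get(rule, 0) for rule in triggered_rules)

def decide(risk_score):
    """Map the aggregated risk score onto pass / review / reject."""
    if risk_score >= DECISION_THRESHOLDS["reject"]:
        return "reject"
    if risk_score >= DECISION_THRESHOLDS["review"]:
        return "review"
    return "pass"
```

For example, a transaction triggering the beneficiary-country and ACH-velocity rules aggregates to 35 and lands in review; nudging either rule score by a few points can flip that decision, which is exactly the sensitivity the optimizer exploits.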
These experiments run in seconds — a full backtest against synthetic data completes almost instantly. This means the optimizer can run 30+ experiments in minutes, rapidly converging on improved configurations.
Layer 2: ML Model Selection and Hyperparameter Tuning
This is where the autoresearch parallel is most direct. Karpathy's original project optimized ML training code — and RTD applies the same pattern to optimize the ML models that power fraud scoring.
The ML optimization runs a two-phase experiment loop:
Phase 1 — Model Selection. The optimizer trains and evaluates all five supported model types (logistic regression, histogram gradient boosting, XGBoost, LightGBM, random forest) with their default hyperparameters. Each model is scored by expected cost — the real economic impact of its fraud decisions. The best-performing model type is selected automatically, driven purely by data rather than intuition.
Phase 2 — Hyperparameter Tuning. With the model type locked in, the optimizer enters the classic autoresearch loop: modify one hyperparameter in the configuration file, retrain the model, evaluate against the incumbent best, keep if improved, discard if not. Learning rate, tree depth, number of estimators, regularization strength — each is tested individually, and the optimizer learns from its own experiment history to focus on the parameters that move the needle most.
The discipline is strict: one change per experiment. Only the YAML configuration file is modified — the trained model itself is a build artifact, regenerated from config. This means every improvement is reproducible and version-controlled.
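A single experiment therefore amounts to one line changing in a config file. The fragment below is a hypothetical illustration of what such a file might look like; the field names are invented for this example, not RTD's actual schema.

```yaml
# Hypothetical ML optimizer config -- one experiment, one edited value.
model:
  type: lightgbm            # locked in by Phase 1 model selection
  hyperparameters:
    learning_rate: 0.05     # <- this experiment's single change (was 0.10)
    max_depth: 6
    n_estimators: 400
    reg_lambda: 1.0
```

Because the trained model is regenerated from this file on every run, diffing two commits of the config is enough to explain exactly why one model outperforms another.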
Safety by Design
Autonomous optimization of a production fraud engine demands serious guardrails. Moving fast without controls isn't innovation — it's recklessness.
RTD's implementation enforces multiple layers of safety:
- Branch isolation — all experiments run on a dedicated git branch, never on the production configuration. The main branch is untouched until a human reviews the results.
- Metric constraints — every experiment must maintain minimum precision, recall, and F1 scores relative to the baseline. A change that reduces expected cost but drops precision below tolerance is automatically discarded.
- Iteration caps — the loop has a hard maximum to prevent runaway execution (30 for rules, 15 for ML).
- Early stopping — if five consecutive experiments are discarded, the loop halts, signaling that the easy optimizations are exhausted.
- Human approval gate — the final output is a pull request with a clean commit history (only kept improvements), a full experiment log, and a summary report. No changes reach production without human review.
The primary optimization metric — expected cost, the sum of the costs incurred by false positives and false negatives — captures the real economic impact of fraud decisions, not just statistical accuracy. This ensures the optimizer balances fraud losses against the operational cost of false alarms.
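The expected-cost metric and the metric constraints above can be combined into a single acceptance check. The cost constants and tolerance below are illustrative; real values would be calibrated per institution.

```python
FP_COST = 25.0   # illustrative: analyst review + customer friction per false alarm
FN_COST = 500.0  # illustrative: average loss per missed fraudulent transaction

def expected_cost(false_positives, false_negatives):
    """Economic cost of a configuration's errors over a backtest."""
    return false_positives * FP_COST + false_negatives * FN_COST

def passes_guardrails(candidate, baseline, tolerance=0.02):
    """Reject any candidate whose precision, recall, or F1 falls more than
    `tolerance` below the baseline, even if its expected cost improved."""
    return all(candidate[m] >= baseline[m] - tolerance
               for m in ("precision", "recall", "f1"))
```

An experiment is kept only when both tests pass: lower expected cost than the incumbent, and metrics within tolerance of the baseline. This prevents the optimizer from "winning" by quietly trading away precision.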
Why This Matters for Southeast Asian Banks
For mid-market banks and fintechs in Southeast Asia, this capability addresses a structural challenge: fraud patterns evolve faster than teams can retune their defenses.
A bank running a fraud decision engine with 50+ rules, multiple transaction flows, and ML models faces a combinatorial tuning problem. Even experienced fraud analysts can only practically test a handful of parameter changes per week. An autonomous optimization loop can run dozens of experiments in hours — systematically exploring the parameter space that humans simply don't have time to cover.
The practical benefits are concrete:
- Faster adaptation — when new fraud patterns emerge (and in SEA, they emerge constantly), both rule parameters and ML models can be re-optimized against updated datasets in hours rather than weeks
- Data-driven model selection — instead of defaulting to whatever model a data scientist chose at deployment, the optimizer evaluates all candidates and selects the best performer for the current fraud landscape
- Cost-optimized decisions — optimizing for expected cost rather than raw accuracy means the engine balances fraud losses against false positive costs, reducing both operational burden and customer friction
- No dedicated ML team required — the optimization runs autonomously within the existing fraud engine infrastructure, making it accessible to banks that don't have specialized data science teams
- Transparent, auditable changes — every parameter change is individually committed with its metrics delta, creating a complete audit trail that compliance teams can review
The underlying principle is simple: build systems that make your fraud engine better over time, not just systems that detect fraud today. Rules and thresholds form the foundation. ML model optimization extends the same pattern to the layer where the biggest performance gains live. And because the entire loop runs autonomously — with safety guardrails and human approval at the end — it delivers continuous improvement without requiring a team of ML engineers to operate.
Run-True Decision is building a fraud decision engine purpose-built for Southeast Asian banks — one that gets sharper with every iteration. Talk to us to learn more.
Explore the Platform
See how Run-True Decision handles real-time fraud scoring, on-premise deployment, and regional compliance for Southeast Asian banks.
View Platform Overview