
Your AI Coding Agent Is Only as Good as Your Tests

We had 1,583 unit tests and zero integration tests. AI made coding 10x faster, but our testing pyramid was upside down. Here's how we fixed it.


RTD Team

Run-True Decision


We had 1,583 unit tests, nearly zero integration tests, and 44 scattered end-to-end configs left over from ad-hoc debugging sessions. Our tests were passing. We were still shipping bugs to production. The problem was not test coverage — it was test architecture.

With AI coding agents like Claude Code making implementation nearly effortless, the verification layer has become the critical bottleneck. Teams that invest in test architecture will ship faster and safer than teams that invest in more AI coding tools. Here is what we learned rebuilding our test suite from a practitioner's perspective.

The Testing Pyramid Was Upside Down

Our fraud decision engine had the classic "ice cream cone" anti-pattern: heavy on unit tests, empty in the middle, chaotic at the top.

The 1,583 unit tests covered individual functions in isolation — they proved that calculate_risk_score() returned the right number given the right inputs. But they said nothing about whether the API actually accepted a transaction payload, applied the correct rules, and returned a decision. The integration layer — where components actually talk to each other — was completely missing.

Meanwhile, 44 end-to-end configurations had accumulated from debugging sessions. They were not a test suite; they were archaeology. No one knew which ones still worked, which ones tested overlapping paths, or which ones were safe to delete. Twelve had pre-existing failures that everyone had learned to ignore.

Why AI Makes This Worse

AI coding agents generate code at 10x the speed of manual development, but every line of generated code is surface area that needs verification.

When a developer writes a function by hand, they build a mental model of how it integrates with the rest of the system. When an AI agent writes it, that mental model does not exist. The agent optimises for the immediate task — make this function work, pass this test — without understanding the broader integration context. This is not a flaw in the agent; it is the nature of the tool. The correction mechanism is tests.

As Momentic.ai's research puts it: "Tests are executable specs for AI." They are not just verification — they are the language through which AI agents understand what the system should do. An agent that sees a comprehensive integration test suite will generate better code than one that sees only unit tests, because integration tests encode the contracts between components.

The Transformation: Real Numbers

We rebuilt the test suite in a single AI-assisted session. Here is what changed:

Metric                   Before        After
Unit tests               1,583         1,600+
Integration tests        0             32 (API + auth boundary)
SDK contract tests       0             57
E2E configs              44 ad-hoc     7 critical-path
Pre-existing failures    12            0
Order-dependent tests    11 (hidden)   0
CI baseline              Unreliable    3,324 passed, 0 failed

The key insight was not writing more tests — it was writing the right kind of tests in the right layer.

Filling the integration layer

FastAPI's TestClient gave us integration tests essentially for free. No running server, no Docker containers, no complex setup — just a Python client that exercises the full request/response cycle including middleware, authentication, and serialisation. Thirty-two integration tests now cover every API endpoint with realistic payloads, and they run in under 4 seconds.

Consolidating E2E ruthlessly

We deleted 37 of the 44 end-to-end configurations. The remaining 7 cover the critical user paths: login, dashboard load, event investigation, case management, rule configuration, monitoring, and link analysis. Each uses condition-based waits (waiting for specific DOM states) instead of arbitrary timeouts, which eliminated the flakiness that made the old suite unreliable.
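Browser frameworks like Playwright provide condition-based waiting natively, but the idea is simple enough to sketch in plain Python. The wait_until helper and dashboard_ready condition below are hypothetical, for illustration only:

```python
import time

def wait_until(condition, timeout=5.0, interval=0.05):
    """Poll a condition until it returns a truthy value, instead of
    sleeping a fixed amount and hoping the page is ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.1f}s")

# Example: simulate a dashboard that finishes loading after ~0.2 seconds.
_loaded_at = time.monotonic() + 0.2

def dashboard_ready():
    return time.monotonic() >= _loaded_at

wait_until(dashboard_ready)  # returns as soon as the state is reached
```

A fixed sleep either wastes time (too long) or flakes (too short); polling for the actual DOM state does neither, which is why the consolidated suite stopped flaking.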

Finding hidden bugs with randomisation

pytest-randomly shuffles test execution order on every run. When we first enabled it, 11 tests that had been "passing" for months immediately failed. They depended on state left behind by earlier tests — a database fixture not properly torn down, a global variable mutated by a previous test, a cache not cleared between runs. These were real bugs hiding behind deterministic execution order.

Five Rules for AI-Ready Testing

Based on what worked for our fraud decision engine, here are the five rules we now follow for every project where AI agents write code:

1. Tests are specs, not afterthoughts. Your AI agent reads tests to understand what the system should do. A well-written integration test is more valuable than a design document, because the test is executable and always up to date.

2. Fill the integration layer first. Unit tests are necessary but insufficient. Integration tests — especially API-level tests using tools like FastAPI TestClient — catch the bugs that AI agents introduce most often: serialisation mismatches, authentication edge cases, and middleware interaction failures.

3. Consolidate E2E ruthlessly. Seven focused, reliable end-to-end specs are worth more than 44 scattered, flaky configurations. Every E2E test should cover a critical decision path that, if broken, would block a user from completing their primary task.

4. Shift security left. Authentication boundary tests — verifying that unauthenticated requests are rejected, that role-based access controls work, that session management is correct — catch more real bugs than additional E2E scenarios. These are the tests that prevent the bugs you cannot afford to ship.

5. Randomise test order. If your tests pass in sequence but fail when shuffled, you have hidden dependencies. pytest-randomly costs nothing to adopt and will find bugs your team does not know exist.

What We Have Not Solved Yet

Honesty matters more than polish. Two significant gaps remain in our testing approach.

First, we do not yet use property-based testing (Hypothesis) for our risk scoring engine. Unit tests verify specific inputs and outputs; property-based tests would verify invariants across thousands of random inputs — for example, that a risk score always falls between 0 and 100, or that adding a high-risk signal never decreases the score. This is next on the roadmap.
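For a sense of what that would look like, here is a sketch of those two invariants with Hypothesis. The risk_score function is a toy stand-in for the real engine:

```python
from hypothesis import given, strategies as st

def risk_score(signals):
    """Toy scorer: each signal contributes its weight, clamped to [0, 100]."""
    return max(0, min(100, sum(signals)))

@given(st.lists(st.integers(min_value=0, max_value=100)))
def test_score_always_in_range(signals):
    # Invariant 1: the score is always within [0, 100], for any input.
    assert 0 <= risk_score(signals) <= 100

@given(st.lists(st.integers(min_value=0, max_value=100)),
       st.integers(min_value=1, max_value=100))
def test_adding_high_risk_signal_never_decreases_score(signals, extra):
    # Invariant 2: appending a positive-risk signal is monotone.
    assert risk_score(signals + [extra]) >= risk_score(signals)
```

Each property runs against hundreds of generated inputs per invocation, and Hypothesis shrinks any failure to a minimal counterexample, which is exactly the kind of spec an AI agent can be held to.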

Second, our E2E tests still require human orchestration. Fully autonomous testing — where an AI agent generates, runs, and triages test results without human intervention — is the direction the industry is heading, but we are not there yet. The 7 critical-path specs are stable enough that a CI pipeline runs them reliably, but expanding coverage still requires a human deciding what to test.

The teams that will ship the fastest in the AI era are not the ones with the most sophisticated coding agents. They are the ones with the most disciplined test architecture — because the agent is only as good as the tests that correct it.

Run-True Decision builds its fraud decision engine with AI-assisted development and rigorous test architecture — 3,324 tests across unit, integration, SDK contract, and E2E layers. Talk to us about how we build reliable fraud infrastructure.
