The hard part of changing a payments system was never writing the code. It was proving the change would not break something at scale.
AI-assisted coding sharpens that problem.
Tools that generate code faster are increasing the volume and pace of changes flowing into payment systems. The work of writing is getting cheaper. The work of proving correctness is not.
Vivek Yadav, an engineering manager at Stripe, set out the tension in a BankThink column. Faster code generation is useful, he argues, but it shifts the bottleneck from producing a change to showing it behaves at production scale.
The bottleneck moves downstream
When code is expensive to write, writing is the constraint. When AI makes it cheap, the constraint moves to everything that comes after.
Proving a change is safe becomes the slow step. And in payments, safe has a precise meaning: the change behaves as intended at production scale.
Why payments are so hard to test
Payment behaviour is the product of many moving parts. Network rules, merchant configuration, card product, transaction type, region, authentication method, rate tables, effective dates and feature flags all interact.
Together they produce outcomes at the system level that are hard to reason about and harder to cover with hand-written tests.
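A rough piece of arithmetic shows why hand-written tests struggle here. The sketch below multiplies out a handful of the dimensions listed above. Every count in it is invented purely for illustration and is not drawn from Yadav's column or from any real payment system.

```python
import math

# Hypothetical number of distinct values per dimension (illustrative only).
dimensions = {
    "card_product": 4,
    "transaction_type": 5,
    "region": 30,
    "authentication_method": 3,
    "feature_flags": 2 ** 6,  # six independent on/off flags
}

# Each dimension multiplies the number of distinct contexts a rule may see.
combinations = math.prod(dimensions.values())
print(f"{combinations:,} distinct contexts")
```

Even with these modest made-up numbers the product runs to six figures, before merchant configuration, rate tables and effective dates are counted.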
Unit and integration tests check individual rules and a few representative flows. They cannot tell you what a new implementation would have done across a real transaction history.
That gap matters more as AI raises throughput. Existing test suites were not built for this pace, and they can become the thing that slows safe delivery.
What replay testing does
Yadav's proposed answer is replay-based testing.
New logic runs against historical transaction data through an offline pipeline. The system reconstructs the context of each past transaction, then compares what the current code produced against what the candidate code would produce.
The output is concrete. It shows transaction-level differences, the segments affected, the aggregate impact and the specific rules behind each divergence.
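The comparison step can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not Yadav's or Stripe's implementation: the Txn and Outcome types, both fee functions, the rates and the sample history are all hypothetical, and a real pipeline would also have to reconstruct each transaction's context before this step.

```python
from collections import Counter
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable, Iterable

@dataclass(frozen=True)
class Txn:  # hypothetical, already-reconstructed transaction context
    txn_id: str
    amount: Decimal
    region: str
    card_product: str

@dataclass(frozen=True)
class Outcome:  # what the logic decided, and which rule decided it
    fee: Decimal
    rule: str

CENT = Decimal("0.01")

def current_fee(txn: Txn) -> Outcome:
    # Stand-in for the logic running in production today (made-up rate).
    return Outcome((txn.amount * Decimal("0.029")).quantize(CENT), "flat_2.9pct")

def candidate_fee(txn: Txn) -> Outcome:
    # Stand-in for the proposed change: a new rule for EU debit cards.
    if txn.region == "EU" and txn.card_product == "debit":
        return Outcome((txn.amount * Decimal("0.015")).quantize(CENT), "eu_debit_1.5pct")
    return current_fee(txn)

def replay(history: Iterable[Txn],
           current: Callable[[Txn], Outcome],
           candidate: Callable[[Txn], Outcome]) -> dict:
    """Run both versions over the same history and report the divergences."""
    diffs = []
    by_segment = Counter()
    total_delta = Decimal("0")
    for txn in history:
        old, new = current(txn), candidate(txn)
        if old != new:
            diffs.append({"txn_id": txn.txn_id, "old": old, "new": new})
            by_segment[(txn.region, txn.card_product)] += 1
            total_delta += new.fee - old.fee
    return {"diffs": diffs, "by_segment": dict(by_segment), "total_delta": total_delta}

HISTORY = [
    Txn("t1", Decimal("100.00"), "EU", "debit"),
    Txn("t2", Decimal("100.00"), "EU", "credit"),
    Txn("t3", Decimal("50.00"), "US", "debit"),
]

report = replay(HISTORY, current_fee, candidate_fee)
for diff in report["diffs"]:
    print(diff["txn_id"], diff["old"].rule, "->", diff["new"].rule)
print(report["by_segment"], report["total_delta"])
```

The report carries the four things described above: the transactions that differ, the segments they fall into, the aggregate change in fees and the rule responsible for each divergence.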
For pricing, billing, routing and large migrations, that is far more useful than a green tick on a curated test.
Where it earns its keep
The benefit is not limited to large projects. Routine configuration changes gain from replay as well.
A small tweak to a rate table or a feature flag can have effects that only show up across millions of transactions. Replaying the change against historical traffic surfaces those effects before customers encounter them.
Migration projects are among the biggest winners. Proving behavioural stability against historical traffic gives teams evidence, not hope, that a new system matches the old one where it should.
The catch
Replays are not a crystal ball.
They depend on reconstructing the right state, and they cannot predict every future condition a live system will meet.
They also carry a heavier duty of care. Because real customer transactions are involved, datasets derived from production demand strong privacy controls, data minimisation and clear governance.
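One way to picture data minimisation is a step that strips each record down before it enters the replay dataset. The sketch below is an assumption about how such a step might look, not a description of any firm's controls: the field names, the REPLAY_FIELDS allow-list and the keyed-hash pseudonym are all hypothetical, and key management and governance sit outside it.

```python
import hashlib
import hmac

# Hypothetical allow-list: only the fields the replayed logic needs.
REPLAY_FIELDS = ("amount", "currency", "region", "card_product", "txn_type")

def minimise(record: dict, key: bytes) -> dict:
    """Keep allow-listed fields and replace the merchant ID with a keyed pseudonym."""
    slim = {name: record[name] for name in REPLAY_FIELDS}
    digest = hmac.new(key, record["merchant_id"].encode(), hashlib.sha256)
    # Same merchant and key give the same reference, so records can still be grouped.
    slim["merchant_ref"] = digest.hexdigest()[:16]
    return slim

raw = {
    "amount": "100.00",
    "currency": "EUR",
    "region": "EU",
    "card_product": "debit",
    "txn_type": "purchase",
    "merchant_id": "m_123",              # made-up identifier
    "card_number": "0000000000000000",   # placeholder, never copied across
    "cardholder_name": "Example Person",
}

slim = minimise(raw, key=b"demo-key-not-for-production")
print(sorted(slim))
```

An allow-list fails closed: a field added to production records later stays out of the replay dataset until someone decides it is needed.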
A testing asset, not an afterthought
The deeper shift Yadav is pointing to is one of mindset.
As AI makes code cheap to produce, the scarce resource becomes confidence that a change is safe. Historical transaction data is where that confidence can come from.
For payment firms, the real work now is turning years of transaction history into a testing asset they can use again and again.