Xavier Puig / Product Management · User Experience · AI Systems

From Manual A/B Testing to Agentic Experimentation

Exploring how agent-based optimisation loops could scale product experimentation

What is this experiment about

This OpenClaw experiment explores whether product experimentation can be automated and run autonomously at scale.

Context, problem and constraints

Test-driven companies like Amazon, Booking, and Microsoft treat experimentation as a core competency. The best ones run 1,000+ A/B tests a year, continuously optimising conversion, engagement, and revenue. The logic is simple: more experiments means faster learning, and faster learning compounds into a durable competitive advantage.

Despite heavy investment in experimentation platforms, most tech companies struggle to launch more than 10 A/B tests per month. The bottleneck isn't ideas or tooling. It's coordination overhead. Running a single test requires a PM, a designer, an engineer, and an analyst to align on hypothesis definition, variant design, implementation, platform configuration, tracking validation, QA, and analysis. Each handoff introduces delay. The technology is ready. The process isn't.

Assumption and what I built

If agents can run experiments autonomously, then we can scale to hundreds of experiments per month, continuously optimising any KPI linked to business objectives.

To explore this assumption I built Optiloops, an OpenClaw agent that continuously runs tests to optimise the outputs generated by other AI agents. To keep the experiment small, I used a maze as the test environment.

How it works

A worker agent generates an initial maze, and outputs a sequence of moves to navigate from A to B:

RIGHT → RIGHT → DOWN → LEFT → RIGHT

Then the sequence is scored by a function (not an LLM) on four signals:

  • Distance to target: the primary signal
  • Goal reached: large reward if yes, nothing if no
  • Wall hits: penalised, discourages noisy paths
  • Steps taken: penalised, encourages shorter solutions

A poor run scores around -140. A decent one -60. A good one -18. The optimiser's job is to push that number toward zero.
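The four signals above can be combined into a single number. A minimal sketch of such a scorer, with weights chosen purely for illustration (the names and coefficients are my assumptions, not the actual Optiloops implementation):

```python
# Hypothetical deterministic scorer: combines the four signals described
# above into one number to push toward zero. Weights are illustrative.
def score_run(distance_to_target, goal_reached, wall_hits, steps_taken):
    score = -10 * distance_to_target      # primary signal: closer is better
    score += 100 if goal_reached else 0   # large reward for reaching the goal
    score -= 5 * wall_hits                # penalise noisy paths
    score -= 1 * steps_taken              # encourage shorter solutions
    return score
```

With these toy weights, a run far from the target with many wall hits lands around -140, while a near-miss with a clean path lands near -18, matching the shape of the scores quoted above.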

The optimiser never sees the maze and knows nothing about navigation. It only sees the current best sequence and its score, then proposes small targeted mutations each iteration:

Mutation: move[3] LEFT → RIGHT | move[7] DOWN → UP
Score: -60 → -42
Decision: ACCEPTED ✓

Mutation: move[2] RIGHT → DOWN | move[5] UP → LEFT
Score: -42 → -51
Decision: REJECTED ✗

Accepted changes become the new baseline. Rejected ones are discarded. Because the optimiser is LLM-powered, it reasons about which mutations are worth trying rather than mutating blindly, which is an important distinction from naive hill climbing.
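The accept/reject loop itself is plain hill climbing. A runnable sketch, with a random mutation function standing in for the LLM-powered proposer (everything here is a toy stand-in, not the real system):

```python
import random

def hill_climb(initial_moves, score_fn, propose_mutation, iterations=200):
    """Keep-if-better loop: accepted mutations become the new baseline."""
    best = list(initial_moves)
    best_score = score_fn(best)
    for _ in range(iterations):
        candidate = propose_mutation(best)     # an LLM call in Optiloops; random here
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:       # ACCEPTED ✓ — new baseline
            best, best_score = candidate, candidate_score
        # otherwise REJECTED ✗ — the candidate is simply discarded
    return best, best_score

# Toy stand-ins to make the loop runnable: the score is minus the number
# of moves that differ from a hidden target sequence.
MOVES = ["UP", "DOWN", "LEFT", "RIGHT"]
TARGET = ["RIGHT", "RIGHT", "DOWN", "DOWN", "RIGHT"]

def toy_score(moves):
    return -sum(a != b for a, b in zip(moves, TARGET))

def random_mutation(moves):
    candidate = list(moves)
    i = random.randrange(len(candidate))
    candidate[i] = random.choice(MOVES)
    return candidate

random.seed(0)
best, best_score = hill_climb(["UP"] * 5, toy_score, random_mutation)
```

The random proposer here mutates blindly; the point of the LLM-powered version is that `propose_mutation` can reason about which changes are worth trying, which is the distinction from naive hill climbing the paragraph above draws.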

Evaluation is deterministic rather than LLM-based, for consistency: if the scoring drifts between runs, the optimiser loses its compass. You need a stable, reliable signal to optimise against, which, as it turns out, is also the hardest part of applying this to real product problems.

The optimiser doesn't understand mazes. It only runs one loop: generate, evaluate, keep improvements, repeat. That same loop can optimise a checkout flow.

Impact and learnings

The system converges on an optimal path after enough iterations, not because the agent was taught anything about navigation, but because it kept what worked and discarded what didn't. Swap the maze for a product surface and the structure holds exactly:

  • Distance to target → conversion gap
  • Wall hits → user friction
  • Steps taken → time to complete action
  • Goal reached → purchase or retention event

Applied to product experimentation, a system like this could run hundreds of variant cycles per month, far beyond what any team could coordinate manually. The throughput ceiling shifts from human coordination speed to compute speed.

The honest constraint is evaluation design. Define the scoring function well and the optimiser improves the right thing. Define it poorly and it optimises the wrong behaviour at scale, very efficiently. Going into this I was thinking about experimentation velocity. Coming out, I was thinking about evaluation design: how do you define what "better" actually means before you let a system loose on it? That turns out to be a product problem, not an engineering one.