Xavier Puig / Product Management · User Experience · AI Systems

From Manual A/B Testing to Agentic Experimentation

Exploring how agent-based optimisation loops could scale product experimentation

What is this experiment about

This OpenClaw experiment explores whether product experimentation can be automated and run autonomously at scale.

Context, problem and constraints

Test-driven companies like Amazon, Booking, and Microsoft treat experimentation as a core competency. The best ones run 1,000+ A/B tests a year, continuously optimising conversion, engagement, and revenue. The logic is simple: more experiments means faster learning, and faster learning compounds into a durable competitive advantage.

Despite heavy investment in experimentation platforms, most tech companies struggle to launch more than 10 A/B tests per month. The bottleneck isn't ideas or tooling. It's coordination overhead. Running a single test requires a PM, a designer, an engineer, and an analyst to align on hypothesis definition, variant design, implementation, platform configuration, tracking validation, QA, and analysis. Each handoff introduces delay. The technology is ready. The process isn't.

Assumption and what I built

If agents can run experiments autonomously, then we can scale to hundreds of experiments per month, continuously optimising any KPI linked to business objectives.

To explore this assumption I built Optiloops, an OpenClaw agent that continuously runs tests to optimise the outputs generated by other AI agents. To keep the experiment small, I used a maze as the test environment.

How it works

A worker agent generates an initial maze, and outputs a sequence of moves to navigate from A to B:

RIGHT → RIGHT → DOWN → LEFT → RIGHT

Then the sequence is scored by a function (not an LLM) on four signals:

  • Distance to target: the primary signal
  • Goal reached: large reward if yes, nothing if no
  • Wall hits: penalised, discourages noisy paths
  • Steps taken: penalised, encourages shorter solutions

A poor run scores around -140. A decent one -60. A good one -18. The optimiser's job is to push that number toward zero.
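The four signals above can be combined into a single number. A minimal sketch of such a scorer, with weights chosen purely for illustration (the names and coefficients are my assumptions, not the actual Optiloops implementation):

```python
# Hypothetical deterministic scorer: combines the four signals described
# above into one number to push toward zero. Weights are illustrative.
def score_run(distance_to_target, goal_reached, wall_hits, steps_taken):
    score = -10 * distance_to_target      # primary signal: closer is better
    score += 100 if goal_reached else 0   # large reward for reaching the goal
    score -= 5 * wall_hits                # penalise noisy paths
    score -= 1 * steps_taken              # encourage shorter solutions
    return score
```

With these toy weights, a run far from the target with many wall hits lands around -140, while a near-miss with a clean path lands near -18, matching the shape of the scores quoted above.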

The optimiser never sees the maze and knows nothing about navigation. It only sees the current best sequence and its score, then proposes small targeted mutations each iteration:

Mutation: move[3] LEFT → RIGHT | move[7] DOWN → UP
Score: -60 → -42
Decision: ACCEPTED ✓

Mutation: move[2] RIGHT → DOWN | move[5] UP → LEFT
Score: -42 → -51
Decision: REJECTED ✗

Accepted changes become the new baseline. Rejected ones are discarded. Because the optimiser is LLM-powered, it reasons about which mutations are worth trying rather than mutating blindly, which is an important distinction from naive hill climbing.
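The accept/reject loop itself is plain hill climbing. A runnable sketch, with a random mutation function standing in for the LLM-powered proposer (everything here is a toy stand-in, not the real system):

```python
import random

def hill_climb(initial_moves, score_fn, propose_mutation, iterations=200):
    """Keep-if-better loop: accepted mutations become the new baseline."""
    best = list(initial_moves)
    best_score = score_fn(best)
    for _ in range(iterations):
        candidate = propose_mutation(best)     # an LLM call in Optiloops; random here
        candidate_score = score_fn(candidate)
        if candidate_score > best_score:       # ACCEPTED ✓ — new baseline
            best, best_score = candidate, candidate_score
        # otherwise REJECTED ✗ — the candidate is simply discarded
    return best, best_score

# Toy stand-ins to make the loop runnable: the score is minus the number
# of moves that differ from a hidden target sequence.
MOVES = ["UP", "DOWN", "LEFT", "RIGHT"]
TARGET = ["RIGHT", "RIGHT", "DOWN", "DOWN", "RIGHT"]

def toy_score(moves):
    return -sum(a != b for a, b in zip(moves, TARGET))

def random_mutation(moves):
    candidate = list(moves)
    i = random.randrange(len(candidate))
    candidate[i] = random.choice(MOVES)
    return candidate

random.seed(0)
best, best_score = hill_climb(["UP"] * 5, toy_score, random_mutation)
```

The random proposer here mutates blindly; the point of the LLM-powered version is that `propose_mutation` can reason about which changes are worth trying, which is the distinction from naive hill climbing the paragraph above draws.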

Evaluation is deterministic rather than LLM-based, for consistency: if the scoring drifts between runs, the optimiser loses its compass. You need a stable, reliable signal to optimise against, which, as it turns out, is also the hardest part of applying this to real product problems.

The optimiser doesn't understand mazes. It only runs one loop: generate, evaluate, keep improvements, repeat. That same loop can optimise a checkout flow.

Impact and learnings

The system converges on an optimal path after enough iterations, not because the agent was taught anything about navigation, but because it kept what worked and discarded what didn't. Swap the maze for a product surface and the structure holds exactly:

  • Distance to target → conversion gap
  • Wall hits → user friction
  • Steps taken → time to complete action
  • Goal reached → purchase or retention event

Applied to product experimentation, a system like this could run hundreds of variant cycles per month, far beyond what any team could coordinate manually. The throughput ceiling shifts from human coordination speed to compute speed.

The honest constraint is evaluation design. Define the scoring function well and the optimiser improves the right thing. Define it poorly and it optimises the wrong behaviour at scale, very efficiently. Going into this I was thinking about experimentation velocity. Coming out, I was thinking about evaluation design: how do you define what "better" actually means before you let a system loose on it? That turns out to be a product problem, not an engineering one.