Calibrated Observation Matching Benchmark

Name: COMB — Calibrated Observation Matching Benchmark
Creator: Honey Nudger

An open RSI benchmark for AI that learns from experience.

Most AI tests measure a single answer. COMB measures whether an agent actually gets better at its job over time, and it's what we use to score our own recursive self-improvement (RSI) system, Honey Nudger.

§ 01 · The benchmark itself

What COMB is, how it works, and why it matters.

Three things to know about the test itself before getting into how our system scores on it.

§ 01

What COMB is

A hidden answer key.

COMB defines 22 useful behaviors an AI should eventually figure out on its own — spread across nine everyday situations like writing ads, handling support tickets, and managing a personal calendar. We keep those answers hidden. A system being tested has to discover them by doing the work, not by being told.

§ 02

How it works

Any AI system can take the test.

We don't care what's inside — a chatbot, a bandit, a retrieval pipeline, a fine-tuned model. The test just compares what the system discovered against the hidden answer key. So any team can run their own AI through COMB and post a score everyone can read on the same scale.
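As a rough illustration of "compares what the system discovered against the hidden answer key", here is a minimal sketch that assumes exact set matching. This page does not give COMB's actual matching rule or score formulas, so `Behavior`, `match_score`, and the toy labels below are all hypothetical.

```python
from dataclasses import dataclass
from typing import NamedTuple


@dataclass(frozen=True)
class Behavior:
    """One hypothetical answer-key entry: a situation plus a behavior label."""
    situation: str
    label: str


class MatchResult(NamedTuple):
    hits: int         # discovered behaviors that appear in the key
    recall: float     # share of the key that was discovered
    precision: float  # share of discoveries that are in the key


def match_score(discovered: set[Behavior], answer_key: set[Behavior]) -> MatchResult:
    """Score by plain set overlap (an assumption, not COMB's published rule)."""
    hits = len(discovered & answer_key)
    recall = hits / len(answer_key) if answer_key else 0.0
    precision = hits / len(discovered) if discovered else 0.0
    return MatchResult(hits, recall, precision)


# Toy key with 4 made-up entries; the real key has 22 behaviors and stays hidden.
answer_key = {
    Behavior("writing ads", "toy-behavior-a"),
    Behavior("writing ads", "toy-behavior-b"),
    Behavior("support tickets", "toy-behavior-c"),
    Behavior("personal calendar", "toy-behavior-d"),
}

# What some system under test claims to have learned (also made up).
discovered = {
    Behavior("writing ads", "toy-behavior-a"),
    Behavior("support tickets", "toy-behavior-c"),
    Behavior("support tickets", "toy-behavior-x"),  # not in the key
}

result = match_score(discovered, answer_key)
print(result)
```

Because the scorer only sees the set of discovered behaviors, it is indifferent to what produced them, which is the property the paragraph above describes.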

§ 03

Why it matters

The first public scoreboard for learning AI.

Most AI tests grade a single answer. COMB grades whether a system actually gets better as it does more work — what researchers call recursive self-improvement (RSI). There's no public scoreboard for that yet. We're building one, and putting our own system on it first.

§ 03 · Take a next step

Read it, follow it, or build with us.

The full report goes deeper on COMB itself and the build-by-build story behind the score you see above.

Every score on this page defined in plain English, with formulas
The build-by-build journal — what we changed at each version, and what moved the score
Why current AI benchmarks don’t measure learning from experience
What an outside team would need to run COMB on their own AI

Licence · Apache-2.0
Repo · github.com/honeynudger/comb (coming soon)

Read the full report
Follow on X for updates
Get in touch / collaborate