An open RSI benchmark for AI that learns from experience.
Most AI tests measure a single answer. COMB measures whether an agent actually gets better at its job over time. We built it because no public test for that existed, and we needed a way to score our own recursive self-improvement (RSI) system.
Three things to know about the test itself before getting into how our system scores on it.
COMB defines 22 useful behaviors an AI should eventually figure out on its own — spread across nine everyday situations like writing ads, handling support tickets, and managing a personal calendar. We keep those answers hidden. A system being tested has to discover them by doing the work, not by being told.
We don't care what's inside the system being tested: a chatbot, a bandit, a retrieval pipeline, or a fine-tuned model all work. The test just compares what the system discovered against the hidden answer key. So any team can run their own AI through COMB and post a score everyone can read on the same scale.
Most AI tests grade a single answer. COMB grades whether a system actually gets better as it does more work — what researchers call recursive self-improvement (RSI). There's no public scoreboard for that yet. We're building one, and putting our own system on it first.
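For readers who think in code, here is a minimal sketch of the comparison against the hidden answer key described above. It is an illustration only, not COMB's actual scoring code: the function name, the behavior IDs, and the rule that a behavior counts only when its ID matches exactly are all assumptions made for this example, and the real test may match and weight behaviors differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    matched: frozenset   # behaviors the system found that are in the key
    missed: frozenset    # behaviors in the key the system did not find
    fraction: float      # matched / total behaviors in the key


def score_discoveries(discovered, answer_key):
    """Compare a system's discovered behaviors with a hidden answer key.

    Assumption for this sketch: each behavior is a plain string ID and
    counts only on an exact match. Nothing about the system's internals
    is inspected, only the list of behaviors it reports.
    """
    key = frozenset(answer_key)
    if not key:
        raise ValueError("answer key must not be empty")
    matched = key & frozenset(discovered)
    return Score(matched=matched, missed=key - matched,
                 fraction=len(matched) / len(key))


# Hypothetical IDs, invented for illustration (not real COMB behaviors).
answer_key = {
    "ads.short_headline",
    "support.ask_order_id",
    "support.apologize_first",
    "calendar.avoid_back_to_back",
}
discovered = ["ads.short_headline", "calendar.avoid_back_to_back", "ads.use_emoji"]

result = score_discoveries(discovered, answer_key)
print(f"{len(result.matched)}/{len(answer_key)} behaviors discovered "
      f"({result.fraction:.0%})")
```

In this sketch, a reported behavior that is not in the key ("ads.use_emoji") neither helps nor hurts the score. Whether a real benchmark should penalize such extras is a design choice this example does not try to settle.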
The full report goes deeper on COMB itself and the build-by-build story behind the score you see above.