Name: COMB — Calibrated Observation Matching Benchmark
Creator: Honey Nudger

An open recursive self-improvement (RSI) benchmark for AI that learns from experience.

Most AI tests measure a single answer. COMB measures whether a system actually gets better at its job over time. 22 known-good behaviors are hidden across 9 everyday scenario clusters; every build of the system under test is scored against the same fixed rule, and the results are published to a live ledger.

§ 01 · What COMB is

A hidden answer key.

22 useful behaviors an AI should eventually figure out on its own, hidden across 9 everyday situations. A system being tested has to discover them by doing the work, not by being told.

§ 02 · How it works

Any AI system can take the test.

We don't care what's inside — chatbot, bandit, retrieval pipeline, fine-tuned model. The test just compares what the system discovered against the hidden answer key. Same scale for every team.

§ 03 · Why it matters

The first public scoreboard for learning AI.

Most AI tests grade a single answer. COMB grades whether a system actually gets better as it does more work — what researchers call recursive self-improvement (RSI). No public scoreboard for that existed.

Honeynudger.ai/comb-rsi-benchmark · live ledger

Why this benchmark, why now.

The gap COMB fills

Most public benchmarks measure single-shot performance — does the model answer this question correctly today? Recursive self-improvement (RSI) systems live or die by what they discover and retain across many interactions, not what they output once. COMB is built to measure that learning loop specifically.

How COMB is different

A held-out set of 22 ground-truth behaviors is hidden from the system under test. The system runs thousands of simulated interactions, surfaces its own hypotheses, and is scored on whether those hypotheses match the held-out set at a frozen cosine-similarity threshold. The benchmark is strategy-agnostic: the scorer never inspects the system's internal mechanism.

What a moving COMB score implies

A higher Discovery score implies the system is mining its interaction stream for transferable structure rather than just answering each request locally. A high Routing-Aware Recall implies that structure also lands in the right scenario at serving time. The two together approximate what "production-ready RSI" looks like in measurable terms.

Why the system stays separate

The system being benchmarked (Honey Nudger) is intentionally walled off from the benchmark itself. COMB ships open-source; Honey Nudger does not. Anyone can build a competing system and score it against the same fixtures and threshold. That separation is what keeps the number honest.

What COMB actually measures.

COMB is a calibrated benchmark for experiential AI learning — systems whose job is to accumulate useful observations and reuse them. The benchmark holds the environment, the ground-truth set, and the scoring code fixed, and lets the system under test iterate.

Stage 01

Ground-truth set.

22 hypotheses across the nine scenario clusters, each with a published expected behavior. The set is frozen and held out from the system under test.

Stage 02

Simulated interactions.

Each version replays thousands of seeded synthetic interactions through the system under test. Run size is reported as 'epochs × interactions-per-epoch': for example, 10 × 500 means 10 epochs of 500 interactions each, or 5,000 interactions in total. Because the interactions are seeded, the corpus is reproducible.
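The sizing convention and the role of seeding can be sketched in a few lines of Python. This is only an illustration: `VersionSize`, `seeded_interaction_ids`, and the seed value are hypothetical names and numbers, not part of the COMB harness.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionSize:
    """Hypothetical container for the 'epochs x interactions-per-epoch' convention."""
    epochs: int
    interactions_per_epoch: int

    @property
    def total(self) -> int:
        return self.epochs * self.interactions_per_epoch

    def label(self) -> str:
        return f"{self.epochs} × {self.interactions_per_epoch}"


def seeded_interaction_ids(size: VersionSize, seed: int) -> list[list[int]]:
    """Draw one list of pseudo-random interaction ids per epoch.

    A dedicated random.Random(seed) makes the draw repeatable, which is
    the property a reproducible corpus depends on.
    """
    rng = random.Random(seed)
    return [
        [rng.randrange(1_000_000) for _ in range(size.interactions_per_epoch)]
        for _ in range(size.epochs)
    ]


size = VersionSize(epochs=10, interactions_per_epoch=500)
print(size.label(), "=", size.total)  # 10 × 500 = 5000

run_a = seeded_interaction_ids(size, seed=42)  # seed value is arbitrary
run_b = seeded_interaction_ids(size, seed=42)
print(run_a == run_b)  # True: same seed, same corpus
```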

Stage 03

System under test.

Whatever the system chooses to do with the interactions — observation, distillation, retrieval, internal experimentation, A/B testing — happens here. The benchmark doesn't constrain this stage. It only consumes what comes out the other side: a list of discovered hypotheses with a routing decision per scenario.
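One way to picture what "comes out the other side" is a ranked list of records, each pairing a hypothesis with the scenario it is routed to. The sketch below is an assumption, not the published schema (which is still pending): the class name, the field names, and the example hypotheses are all made up for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DiscoveredHypothesis:
    """One entry in the system's output (hypothetical field names)."""
    text: str      # the behavior the system believes it has discovered
    scenario: str  # routing decision: the scenario this hypothesis is served in


def to_submission(hypotheses: list[DiscoveredHypothesis]) -> str:
    """Serialize hypotheses in ranked order (list position becomes the rank)."""
    payload = [
        {"rank": rank, **asdict(h)} for rank, h in enumerate(hypotheses, start=1)
    ]
    return json.dumps(payload, indent=2)


# Made-up hypotheses, purely to show the shape of the output.
submission = to_submission([
    DiscoveredHypothesis("Acknowledge the issue before offering a fix.",
                         "ecomm_support_complaint"),
    DiscoveredHypothesis("Propose two concrete time slots.",
                         "assistant_calendar"),
])
print(submission)
```

Nothing about how the hypotheses were produced appears in the payload, which matches the idea that the benchmark only consumes outputs.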

Stage 04

Scorer.

The benchmark scorer matches the system's discovered hypotheses against the ground-truth set at a frozen cosine threshold (currently 0.65) using a greedy best-match assignment. Every metric on this page comes straight from that scorer's output.
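A minimal sketch of threshold-gated greedy matching follows. It assumes one common reading of "greedy best-match": sort all hypothesis/ground-truth pairs by cosine similarity and accept the best remaining pair until nothing clears the threshold. The function names are hypothetical, and the toy 2-D vectors stand in for real embeddings, since the embedding model is not yet pinned.

```python
import math

THRESHOLD = 0.65  # the frozen cosine threshold stated on this page


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_match(hyps, truths, threshold: float = THRESHOLD) -> dict[int, int]:
    """Return {truth_index: hypothesis_index} under one-to-one greedy assignment."""
    pairs = sorted(
        ((cosine(h, g), hi, gi)
         for hi, h in enumerate(hyps)
         for gi, g in enumerate(truths)),
        reverse=True,
    )
    used_hyps: set[int] = set()
    matched: dict[int, int] = {}
    for sim, hi, gi in pairs:
        if sim < threshold:
            break  # every remaining pair is weaker still
        if hi in used_hyps or gi in matched:
            continue  # each hypothesis and each ground truth is used at most once
        used_hyps.add(hi)
        matched[gi] = hi
    return matched


# Toy vectors: the third ground truth is close to both good hypotheses,
# but each hypothesis is already spent on a better match.
truths = [(1, 0), (0, 1), (1, 1)]
hyps = [(0.9, 0.1), (0.1, 0.9), (-1, 0)]
matched = greedy_match(hyps, truths)
print(matched)
print(f"{len(matched)}/{len(truths)} ground truths matched")  # 2/3
```

The example shows why the one-to-one constraint matters: a hypothesis that clears the threshold against several ground truths can only be credited once.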

Scenario clusters

Marketing.

marketing_search_ads
marketing_social_posts
marketing_campaign_opt

E-commerce support.

ecomm_support_chat
ecomm_support_email
ecomm_support_complaint

Assistant.

assistant_task_mgmt
assistant_calendar
assistant_email_draft

Each cluster carries 2–3 ground-truth hypotheses. A tenth corpus partition, baseline_general, holds no scoring targets and is excluded from Discovery.
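The partition layout can be written down as a small lookup, sketched here in Python. The constant and function names are hypothetical; the scenario ids are the ones listed on this page. The last helper works through the arithmetic: assuming every cluster carries exactly two or three hypotheses, nine clusters and 22 hypotheses leave only one possible split. Which clusters carry three is not stated here, since the per-scenario allocation is still pending.

```python
# The nine scored scenario ids, grouped as on this page.
SCENARIO_GROUPS = {
    "Marketing": ["marketing_search_ads", "marketing_social_posts",
                  "marketing_campaign_opt"],
    "E-commerce support": ["ecomm_support_chat", "ecomm_support_email",
                           "ecomm_support_complaint"],
    "Assistant": ["assistant_task_mgmt", "assistant_calendar",
                  "assistant_email_draft"],
}
UNSCORED_PARTITIONS = {"baseline_general"}  # no scoring targets

SCORED_SCENARIOS = [s for group in SCENARIO_GROUPS.values() for s in group]


def is_scored(scenario: str) -> bool:
    """True only for the nine clusters that count toward Discovery."""
    return scenario in SCORED_SCENARIOS and scenario not in UNSCORED_PARTITIONS


def split_of_twos_and_threes(clusters: int, total: int) -> tuple[int, int]:
    """Solve twos + threes = clusters and 2*twos + 3*threes = total."""
    threes = total - 2 * clusters
    twos = clusters - threes
    if twos < 0 or threes < 0:
        raise ValueError("no valid split of twos and threes")
    return twos, threes


print(len(SCORED_SCENARIOS))            # 9
print(is_scored("baseline_general"))    # False
print(split_of_twos_and_threes(9, 22))  # (5, 4): five clusters of two, four of three
```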

Principle 01

Frozen match threshold.

The cosine threshold (0.65) is pinned in code. It was tightened once — from 0.60 to 0.65 — before V34, and that change is annotated wherever it affects cross-version reading. No further changes will happen without bumping the benchmark spec.

Principle 02

System-agnostic.

COMB scores observation-matching, not the strategy used to get there. Some systems will use A/B testing internally, some won't; some will use bandits, distillation, RAG, or something else entirely. None of that is in the benchmark — the scorer only sees the discovered hypotheses and matches them against the held-out ground-truth set.

Principle 03

Routing is part of Discovery.

Plain Discovery has a known failure mode: a correct hint that arrives in the wrong scenario is not a correct hint. Routing-Aware Recall (RAR) makes this explicit, and the gap between Discovery and RAR is a first-class signal published for every version.
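The gap can be illustrated with a short Python sketch. The exact definition of Routing-Aware Recall is not given here, so the code assumes one plausible reading: a ground truth counts toward RAR only if its matched hypothesis is routed to that ground truth's own scenario. Function names and the example data are hypothetical.

```python
def discovery(matches: dict[int, int], n_truths: int) -> float:
    """Fraction of ground truths that were matched at all."""
    return len(matches) / n_truths


def routing_aware_recall(matches: dict[int, int],
                         truth_scenarios: list[str],
                         hyp_scenarios: list[str]) -> float:
    """Fraction of ground truths matched AND routed to their own scenario.

    Assumed definition; the published metric may differ.
    """
    routed = sum(
        1 for gi, hi in matches.items()
        if truth_scenarios[gi] == hyp_scenarios[hi]
    )
    return routed / len(truth_scenarios)


# Four ground truths, three discovered hypotheses (made-up example).
truth_scenarios = ["marketing_search_ads", "ecomm_support_chat",
                   "assistant_calendar", "assistant_email_draft"]
hyp_scenarios = ["marketing_search_ads", "ecomm_support_email",
                 "assistant_calendar"]
matches = {0: 0, 1: 1, 2: 2}  # ground-truth index -> hypothesis index

d = discovery(matches, len(truth_scenarios))
rar = routing_aware_recall(matches, truth_scenarios, hyp_scenarios)
print(d, rar, d - rar)  # 0.75 0.5 0.25
```

Here the second hypothesis matches its ground truth but is routed to the email scenario instead of chat, so it counts for Discovery and not for RAR.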

Reproducibility ledger

The full spec ships with the open-source release. Until then, this is the public list of what is pinned, what is pending, and what an outside team would need to run COMB themselves.

Item · Value · Status
Cosine threshold · 0.65 (tightened from 0.60 before V34) · Pinned
Ground-truth set size · 22 hypotheses across 9 scenario clusters · Pinned
Match algorithm · greedy best-match (one-to-one assignment) · Pinned
Version-sizing convention · epochs × interactions-per-epoch (e.g. 10 × 500) · Pinned
Embedding model · to be pinned in the open-source spec · Pending
Per-scenario allocation rule · to be published as a fixture in the open-source spec · Pending
GT authorship + validation process · to be documented in the open-source spec · Pending
Threshold-selection evidence · precision/recall sweep to be published alongside spec · Pending
Integration contract (input/output schemas) · interactions in, ranked hypotheses + routing out; full schema to be published alongside spec · Pending

Every metric, defined.

Each entry below maps directly to a real scorer output. Where a metric was added partway through the program, that's annotated. A handful of proposed additions appear at the bottom — metrics the benchmark could start producing to sharpen future signal.

Matched ground truths ÷ total ground truths

Shipped

Discovery.

Fraction of ground-truth hypotheses the system surfaced within the run's discovered hypothesis set. The headline COMB number.

Formula

\mathrm{Discovery} = \frac{\left|\{\, g \in G : \max_{h} \cos(h, g) \ge 0.65 \,\}\right|}{|G|}

Range

0% to 100%

Introduced

core metric — present in every version

Notes

Counted using a frozen match threshold (cosine ≥ 0.65). Threshold was tightened from 0.60 → 0.65 before V34; all scores shown on this page use the 0.65 rule.
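Read literally, the formula counts a ground truth as soon as any discovered hypothesis clears the threshold against it. The sketch below computes exactly that from a similarity matrix; the function name and the similarity values are made up for illustration.

```python
THRESHOLD = 0.65  # frozen match threshold from this page


def discovery_from_formula(sim: list[list[float]],
                           threshold: float = THRESHOLD) -> float:
    """sim[h][g] is the cosine similarity of hypothesis h and ground truth g.

    A ground truth counts when its best-scoring hypothesis clears the threshold.
    """
    n_truths = len(sim[0])
    covered = sum(
        1 for g in range(n_truths)
        if max(row[g] for row in sim) >= threshold
    )
    return covered / n_truths


# Two hypotheses (rows) against four ground truths (columns); values are made up.
sim = [
    [0.90, 0.80, 0.10, 0.05],
    [0.20, 0.30, 0.64, 0.10],
]
score = discovery_from_formula(sim)
print(f"{score:.0%}")  # 50%
```

In this example the first hypothesis clears the threshold for two ground truths, and 0.64 narrowly misses for the third. Under a strict one-to-one assignment the first hypothesis could be credited only once, which would give 25% instead; the formula as written does not encode that constraint, so the two readings can differ.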

What ships under Apache-2.0.

The COMB benchmark, simulation harness, ground-truth fixtures, and scoring code will be released under Apache-2.0. The system being evaluated stays proprietary. The intent is for the benchmark to outlive Honey Nudger's involvement once it's in the open.

Threshold changes require a versioned spec bump.

The cosine threshold (0.65) is pinned in code. Any future change requires a public changelog entry and a new spec version, handled with the same formality as a deprecation in a major library.

Ground-truth set is frozen per spec version.

Additions, edits, and removals in the held-out set trigger a benchmark spec bump. Comparisons between system versions scored under the same spec version are apples-to-apples; across spec versions, the changelog explains what changed.
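The "frozen per spec version" rule can be modeled as an immutable record, sketched below. Everything here is hypothetical: the `BenchmarkSpec` class, the version labels "1.0" and "1.1", and the 23-item set are illustrative only. Only the 0.65 threshold and the 22-item set come from this page.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class BenchmarkSpec:
    version: str             # hypothetical label; real spec versions are not published yet
    cosine_threshold: float
    ground_truth_count: int


def comparable(a: BenchmarkSpec, b: BenchmarkSpec) -> bool:
    """Two ledger entries are apples-to-apples only under an identical spec."""
    return a == b


current = BenchmarkSpec(version="1.0", cosine_threshold=0.65, ground_truth_count=22)

# Pinned values cannot be edited in place.
try:
    current.cosine_threshold = 0.60
except FrozenInstanceError:
    print("spec is immutable; publish a new version instead")

# A change to the held-out set is expressed as a new spec version.
proposed = replace(current, version="1.1", ground_truth_count=23)
print(comparable(current, current))   # True
print(comparable(current, proposed))  # False
```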

Per-version artifacts are published as-is.

Every benchmark JSON the scorer emits for a published version stays in the public ledger. No retroactive rescoring without a spec bump.

External submissions are welcome.

Anyone running COMB against their own system can submit results for inclusion on the ledger. The fixture set, scoring code, and threshold are the only things held constant; the strategy is theirs.

Licence · Apache-2.0 (planned) · Repo · github.com/honeynudger/comb (coming soon) · Last regenerated · 2026-06-06