benchmarks
brainctl v2.4.3 · backend: Brain.search · default settings · no tuning for benchmark data. results committed to tests/bench/baselines/ and gated in CI.
both benchmarks ran against Brain.search (the primary retrieval backend) with default settings. no benchmark-specific tuning, no cherry-picked queries, no post-hoc filtering. the test runner is tests/bench/run.py; baselines are committed as JSON and fail the build on any >2% regression in P@1 / P@5 / MRR / nDCG@5.
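the regression gate described above can be sketched roughly as follows. this is a minimal illustration, not the code in tests/bench/run.py: the function name `find_regressions`, the flat JSON layout, and the reading of ">2%" as a relative drop against the baseline value are all assumptions, and the numbers are made up for the example.

```python
import json

# Assumption: ">2% regression" means a relative drop versus the committed baseline.
MAX_RELATIVE_DROP = 0.02
HEADLINE_METRICS = ("P@1", "P@5", "MRR", "nDCG@5")


def find_regressions(baseline: dict, current: dict,
                     max_drop: float = MAX_RELATIVE_DROP) -> list:
    """Return one message per headline metric that dropped by more than max_drop."""
    failures = []
    for name in HEADLINE_METRICS:
        base, cur = baseline[name], current[name]
        if base <= 0:
            continue  # nothing to regress from
        drop = (base - cur) / base
        if drop > max_drop:
            failures.append(f"{name}: {base:.4f} -> {cur:.4f} ({drop:.2%} drop)")
    return failures


# Hypothetical baseline file contents (illustrative values, not real results).
baseline = json.loads('{"P@1": 0.80, "P@5": 0.90, "MRR": 0.85, "nDCG@5": 0.82}')
current = dict(baseline, **{"P@1": 0.78})  # a 2.5% relative drop in P@1

for message in find_regressions(baseline, current):
    print("REGRESSION", message)
```

a CI test would then assert that the returned list is empty, so any listed metric fails the build.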
this page shows both the committed Brain.search baseline and separate no-LLM LoCoMo operating points (turn/session/hybrid). on the latest sweep, hybrid is the best top-heavy operating point (Hit@1 0.6983, Hit@5 0.9132, MRR 0.7920), with a small single-hop Hit@5 giveback vs session and near-tied Hit@10.
LongMemEval
289 questions · 4 retrieval-friendly categories · subset of longmemeval_s filtered to questions whose gold answer is checkable via string / fuzzy match against the conversation content. Temporal-reasoning and knowledge-update categories, which need an LLM-as-judge to score, are excluded from this overall figure; their gold session IDs are still present, so they can be measured separately.
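one way such a "checkable via string / fuzzy match" filter could look is sketched below. the actual filter is not shown on this page, so the function name `answer_is_checkable`, the sliding-window approach, and the 0.85 threshold are assumptions chosen for illustration; only `difflib.SequenceMatcher` from the standard library is used.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def answer_is_checkable(gold: str, conversation: str, threshold: float = 0.85) -> bool:
    """True if the gold answer appears in the conversation, exactly or approximately.

    The threshold is a hypothetical value, not one taken from the benchmark.
    """
    gold_n, conv_n = normalize(gold), normalize(conversation)
    if not gold_n:
        return False
    if gold_n in conv_n:  # exact substring match
        return True
    # Fuzzy match: compare the gold answer with every window of the same word length.
    conv_words = conv_n.split()
    width = len(gold_n.split())
    for start in range(max(1, len(conv_words) - width + 1)):
        window = " ".join(conv_words[start:start + width])
        if SequenceMatcher(None, gold_n, window).ratio() >= threshold:
            return True
    return False


conversation = "I adopted a golden  retriever last May"
print(answer_is_checkable("Golden Retriever", conversation))
print(answer_is_checkable("siamese cat", conversation))
```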
| category | n | hit@1 | hit@5 | MRR | nDCG@10 |
|---|---|---|---|---|---|
| multi‑session | 133 | 0.910 | 0.985 | 0.944 | 0.888 |
| single‑session‑assistant | 56 | 1.000 | 1.000 | 1.000 | 1.000 |
| single‑session‑preference | 30 | 0.500 | 0.833 | 0.671 | 0.732 |
| single‑session‑user | 70 | 0.900 | 1.000 | 0.935 | 0.951 |
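for readers unfamiliar with the column names, the sketch below shows the standard binary-relevance definitions of hit@k, reciprocal rank (averaged over questions to give MRR), and nDCG@k. the harness's exact formulas are not reproduced on this page, so treat this as an assumed, textbook version; the function names and the session IDs in the example are hypothetical.

```python
import math


def hit_at_k(ranked: list, gold: set, k: int) -> float:
    """1.0 if any gold item appears in the top k results, else 0.0."""
    return 1.0 if any(item in gold for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: list, gold: set) -> float:
    """1 / rank of the first gold item; 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list, gold: set, k: int) -> float:
    """Binary-relevance nDCG@k. Assumes the ranked IDs are unique."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1) if item in gold)
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["s3", "s1", "s7"]   # hypothetical retrieved session IDs, best first
gold = {"s1"}                 # hypothetical gold session
print(hit_at_k(ranked, gold, 1), hit_at_k(ranked, gold, 5))
print(reciprocal_rank(ranked, gold), ndcg_at_k(ranked, gold, 10))
```

each per-category number is then the mean of the per-question value over the n questions in that category.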
| metric | old FTS-only | final locked | abs delta | rel delta |
|---|---|---|---|---|
| hit@1 | 0.8824 | 0.8685 | -0.0139 | -1.58% |
| hit@5 | 0.9758 | 0.9792 | +0.0034 | +0.35% |
| hit@10 | 0.9896 | 0.9896 | +0.0000 | +0.00% |
| hit@20 | 1.0000 | 1.0000 | +0.0000 | +0.00% |
| MRR | 0.9241 | 0.9147 | -0.0094 | -1.02% |
| nDCG@5 | 0.8910 | 0.8815 | -0.0095 | -1.07% |
| Recall@5 | 0.9217 | 0.9158 | -0.0059 | -0.64% |
coverage stayed near ceiling (hit@10/hit@20 unchanged). the only gain is a small one on hit@5 (+0.0034), with modest givebacks on hit@1 / MRR / nDCG@5 / Recall@5, all under 2% relative.
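the abs delta and rel delta columns follow from simple arithmetic on the two score columns. the short sketch below recomputes three rows of the table; the helper name `deltas` is hypothetical and the rounding (4 decimals, 2 decimals for the percentage) is assumed from how the table is printed.

```python
def deltas(old: float, new: float) -> tuple:
    """Return (absolute delta, relative delta in percent), rounded like the table."""
    abs_delta = new - old
    rel_delta = abs_delta / old * 100.0
    return round(abs_delta, 4), round(rel_delta, 2)


# (old FTS-only, final locked) pairs copied from the table above.
rows = {
    "hit@1": (0.8824, 0.8685),
    "hit@5": (0.9758, 0.9792),
    "MRR": (0.9241, 0.9147),
}

for name, (old, new) in rows.items():
    abs_delta, rel_delta = deltas(old, new)
    print(f"{name}: {abs_delta:+.4f} ({rel_delta:+.2f}%)")
```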
LoCoMo
1982 questions · 10 conversations · 5 categories · tests temporal, adversarial, and multi-hop recall
| metric | turn | session | hybrid | hybrid vs session |
|---|---|---|---|---|
| Hit@1 | 0.3734 | 0.6731 | 0.6983 | +0.0252 (+3.74%) |
| Hit@5 | 0.6120 | 0.9117 | 0.9132 | +0.0015 (+0.16%) |
| Hit@10 | 0.6892 | 0.9606 | 0.9601 | -0.0005 (-0.05%) |
| MRR | 0.4731 | 0.7749 | 0.7920 | +0.0171 (+2.21%) |
| single-hop Hit@5 | 0.4645 | 0.8688 | 0.8546 | -0.0142 (-1.63%) |
| multi-hop Hit@5 | 0.3696 | 0.6522 | 0.6739 | +0.0217 (+3.33%) |
| temporal Hit@5 | 0.6604 | 0.8972 | 0.8972 | +0.0000 (+0.00%) |
hybrid is the best overall LoCoMo operating point in this sweep: higher Hit@1, Hit@5, MRR, and multi-hop Hit@5 than session, equal temporal Hit@5, a near-tied Hit@10 (-0.0005), and a small single-hop Hit@5 giveback (-0.0142).
| category | n | hit@1 | hit@5 | MRR | nDCG@10 |
|---|---|---|---|---|---|
| adversarial | 446 | 0.377 | 0.603 | 0.479 | 0.521 |
| multi-hop* | 92 | 0.174 | 0.315 | 0.232 | 0.202 |
| open-domain | 841 | 0.373 | 0.602 | 0.479 | 0.517 |
| single-hop* | 282 | 0.167 | 0.429 | 0.282 | 0.220 |
| temporal | 321 | 0.405 | 0.648 | 0.510 | 0.538 |
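* the baseline Brain.search is still weak at hit@1 on the hop-heavy categories (single-hop 0.167, multi-hop 0.174). the latest hybrid operating points close most of that gap on top-heavy metrics (in the operating-point table above, hybrid reaches single-hop Hit@5 0.8546 and multi-hop Hit@5 0.6739) while keeping temporal performance similar.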
head-to-head: brainctl vs MemPalace
measured · same machine · same datasets · same scoring · run 2026-04-18
| field | value |
|---|---|
| hardware | Intel Core Ultra 7 258V · 33.9 GB RAM · Windows 10 Home |
| repro command | python benchmarks/compare_memory_engines.py --label full_compare |
| result bundle | benchmarks/results/full_compare_20260418_033425/ |

| system | R@5 | R@10 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|
| brainctl Brain.search | 0.9681 | 0.9894 | 0.9204 | 0.9253 |
| brainctl cmd_search | 0.9702 | 0.9894 | 0.9206 | 0.9253 |
| mempalace raw_session | 0.9660 | 0.9830 | 0.8930 | 0.8948 |
| system | avg recall |
|---|---|
| brainctl cmd_session | 0.9217 |
| mempalace raw_session | 0.6028 |
this table scores at session-level granularity (does the right session appear?), which is distinct from the turn-level Hit@K reported in the LoCoMo section above. same dataset, different scoring level.
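the difference between the two scoring levels can be illustrated with a small sketch. it assumes each retrieved item is a hypothetical `(session_id, turn_index)` pair; how the real harness maps turns to sessions is not shown on this page, and both function names are made up for the example.

```python
def turn_hit_at_k(ranked: list, gold_turns: set, k: int) -> bool:
    """Turn level: the exact gold turn must appear in the top k results."""
    return any(turn in gold_turns for turn in ranked[:k])


def session_recall_at_k(ranked: list, gold_turns: set, k: int) -> float:
    """Session level: fraction of gold sessions among the top k distinct sessions."""
    gold_sessions = {session for session, _ in gold_turns}
    seen = []
    for session, _ in ranked:
        if session not in seen:  # keep first occurrence, preserving rank order
            seen.append(session)
    top_sessions = set(seen[:k])
    if not gold_sessions:
        return 0.0
    return len(gold_sessions & top_sessions) / len(gold_sessions)


# Hypothetical ranking: the right session (s1) is found, but via the wrong turn.
ranked = [("s2", 4), ("s1", 9), ("s3", 1)]
gold_turns = {("s1", 2)}

print(turn_hit_at_k(ranked, gold_turns, 3))        # turn-level miss
print(session_recall_at_k(ranked, gold_turns, 3))  # session-level full recall
```

the same ranked list can therefore score as a miss at turn level and a full hit at session level, which is why the two sets of numbers should not be compared directly.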
| system | hit@5 |
|---|---|
| brainctl cmd_turn | 0.930 |
| mempalace raw_turn | 0.885 |
FirstAgent slice only — full MemBench sweep pending. ConvoMem run was blocked because the evidence payload fetch failed; no fair same-machine number yet.
- brainctl Brain.search = FTS-only lexical retrieval
- brainctl cmd_search = full brainctl retrieval pipeline
- mempalace raw_* = raw retrieval baseline
Honesty note: the vector-on/off flag for the cmd_search run was not persisted into the artifact bundle, so the cmd_search numbers should not be read as a clean vector-vs-FTS comparison until that exact variant is rerun with the flag captured.
The competitor harness in v2.4.0 (tests/bench/competitor_runs/) has adapters for Mem0, Letta, Zep, Cognee, MemPalace, and OpenAI Memory. Adapters share the same SearchFn(query, k) protocol and a skip-not-fabricate contract: missing SDK or API key raises CompetitorUnavailable instead of returning a fake 0. The MemPalace numbers above are from a separate same-machine run; sweeps for Mem0 / Zep are gated on hosted-API budget, and Cognee can run free locally.
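The skip-not-fabricate contract could be sketched along the following lines. Only the names `SearchFn` and `CompetitorUnavailable` come from the text above; the return type of `SearchFn`, the `make_adapter` and `run_sweep` helpers, and the environment-variable check are assumptions made for illustration and are not the harness's actual code.

```python
import importlib.util
import os
from typing import Callable, Dict, List, Optional

# Assumption: a search function returns a ranked list of result IDs.
SearchFn = Callable[[str, int], List[str]]


class CompetitorUnavailable(RuntimeError):
    """Raised when an adapter cannot run, so that no score is recorded for it."""


def make_adapter(module_name: str, api_key_env: Optional[str] = None) -> SearchFn:
    """Hypothetical adapter factory: verify prerequisites before returning a SearchFn."""
    if importlib.util.find_spec(module_name) is None:
        raise CompetitorUnavailable(f"SDK '{module_name}' is not installed")
    if api_key_env and not os.environ.get(api_key_env):
        raise CompetitorUnavailable(f"environment variable {api_key_env} is not set")

    def search(query: str, k: int) -> List[str]:
        raise NotImplementedError("a real adapter would call the SDK here")

    return search


def run_sweep(factories: Dict[str, Callable[[], SearchFn]]) -> Dict[str, str]:
    """Record 'ready' or 'skipped: reason' per system; never substitute a fake 0."""
    status = {}
    for name, factory in factories.items():
        try:
            factory()
            status[name] = "ready"
        except CompetitorUnavailable as exc:
            status[name] = f"skipped: {exc}"
    return status


print(run_sweep({
    "missing": lambda: make_adapter("surely_not_installed_sdk_xyz"),
    "present": lambda: make_adapter("json"),  # stdlib module stands in for an SDK
}))
```

Raising instead of returning a score keeps an unavailable system out of the results entirely, so a missing SDK can never show up as a measured 0.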
cells on /comparison still marked ? are measured-but-not-published-yet, or genuinely unknown from competitor docs. Nothing on this page is a measured loss displayed as “—”.
harness: tests/bench/run.py · baselines: tests/bench/baselines/ · CI gate: tests/test_search_quality_bench.py · regression threshold: >2% drop on any headline metric fails the build.