benchmarks

brainctl v2.4.3 · backend: Brain.search · default settings · no tuning for benchmark data. results committed to tests/bench/baselines/ and gated in CI.

methodology

both benchmarks ran against Brain.search (the primary retrieval backend) with default settings. no benchmark-specific tuning, no cherry-picked queries, no post-hoc filtering. the test runner is tests/bench/run.py; baselines are committed as JSON and fail the build on any >2% regression in P@1 / P@5 / MRR / nDCG@5.
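the gate described above can be sketched as follows. this is a minimal illustration, not the code in tests/bench/run.py: the JSON layout, the function name, and the reading of ">2%" as a relative (rather than absolute) drop are all assumptions, and the numbers are made up for the example rather than taken from the committed baselines.

```python
import json

# Metric names come from the methodology text; the flat JSON layout is assumed.
GATED_METRICS = ("P@1", "P@5", "MRR", "nDCG@5")
MAX_RELATIVE_DROP = 0.02  # assumption: ">2% regression" means a relative drop


def find_regressions(baseline, current, max_drop=MAX_RELATIVE_DROP):
    """Return {metric: relative_drop} for each gated metric that dropped too far."""
    regressions = {}
    for name in GATED_METRICS:
        base, now = baseline[name], current[name]
        if base <= 0:
            continue  # no meaningful relative drop from a zero baseline
        drop = (base - now) / base
        if drop > max_drop:
            regressions[name] = drop
    return regressions


# Illustrative values only (hypothetical, not the committed baseline).
baseline = json.loads('{"P@1": 0.88, "P@5": 0.97, "MRR": 0.92, "nDCG@5": 0.89}')
current = {"P@1": 0.85, "P@5": 0.97, "MRR": 0.91, "nDCG@5": 0.89}

failed = find_regressions(baseline, current)
for name, drop in failed.items():
    print(f"{name} regressed by {drop:.2%}")
```

in a CI test, a non-empty result would be turned into a failing assertion so the build stops on the regression.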

this page shows both the committed Brain.search baseline and separate no-LLM LoCoMo operating points (turn/session/hybrid). on the latest sweep, hybrid is the strongest operating point on top-of-ranking metrics (Hit@1 0.6983, Hit@5 0.9132, MRR 0.7920). versus session it gives back 0.0142 on single-hop Hit@5 and is effectively tied on Hit@10 (0.9601 vs 0.9606).

LongMemEval

289 questions · 4 retrieval-friendly categories · subset of longmemeval_s filtered to questions whose gold answer is checkable via string / fuzzy match against the conversation content. the temporal-reasoning and knowledge-update categories (which need an LLM-as-judge to score) are excluded from this overall; their gold session IDs are still present, so they can be measured separately.

overall: hit@1 88.2% · hit@5 97.6% · hit@10 99.0% · MRR 92.4%
by category
category                     n    hit@1  hit@5  MRR    nDCG@10
multi-session                133  0.910  0.985  0.944  0.888
single-session-assistant     56   1.000  1.000  1.000  1.000
single-session-preference    30   0.500  0.833  0.671  0.732
single-session-user          70   0.900  1.000  0.935  0.951
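for readers unfamiliar with the column headers, here is a minimal sketch of the per-question metrics under their standard definitions. it assumes binary relevance, unique IDs in each ranking, and nDCG normalised by the ideal ordering of the gold items; the function names and IDs are illustrative, and the harness may differ in details.

```python
import math


def hit_at_k(ranked_ids, gold_ids, k):
    """1.0 if any gold item appears in the top k results, else 0.0."""
    return 1.0 if any(r in gold_ids for r in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold item (ranks start at 1), 0.0 if none is retrieved."""
    for rank, r in enumerate(ranked_ids, start=1):
        if r in gold_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gold_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, r in enumerate(ranked_ids[:k], start=1)
        if r in gold_ids
    )
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One question whose single gold session is ranked second (hypothetical IDs).
ranked = ["s3", "s1", "s7"]
gold = {"s1"}
print(hit_at_k(ranked, gold, 1), hit_at_k(ranked, gold, 5))
print(reciprocal_rank(ranked, gold), round(ndcg_at_k(ranked, gold, 10), 4))
```

the table values are these per-question scores averaged over the questions in each category; MRR is the mean of the reciprocal ranks.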
lock snapshot (old FTS-only vs final locked)
metric     old FTS-only  final locked  abs delta  rel delta
hit@1      0.8824        0.8685        -0.0139    -1.58%
hit@5      0.9758        0.9792        +0.0034    +0.35%
hit@10     0.9896        0.9896        +0.0000    +0.00%
hit@20     1.0000        1.0000        +0.0000    +0.00%
MRR        0.9241        0.9147        -0.0094    -1.02%
nDCG@5     0.8910        0.8815        -0.0095    -1.07%
Recall@5   0.9217        0.9158        -0.0059    -0.64%

coverage stayed near ceiling (hit@10 / hit@20 unchanged). the only gain is a small one on hit@5 (+0.0034), against modest givebacks on hit@1, MRR, nDCG@5 and Recall@5.
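the two delta columns can be reproduced with a few lines. this sketch assumes rel delta is the absolute delta divided by the old value, which matches the hit@1 row of the lock snapshot; the function name is illustrative.

```python
def deltas(old, new):
    """Return (absolute delta, relative delta), with the old value as the reference."""
    abs_delta = new - old
    rel_delta = abs_delta / old
    return abs_delta, rel_delta


# hit@1 row of the lock snapshot: old FTS-only 0.8824, final locked 0.8685
abs_d, rel_d = deltas(0.8824, 0.8685)
print(f"{abs_d:+.4f} {rel_d:+.2%}")  # -0.0139 -1.58%
```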

LoCoMo

1982 questions · 10 conversations · 5 categories · tests temporal, adversarial, and multi-hop recall

overall: hit@1 34.1% · hit@5 57.2% · hit@10 65.8% · MRR 44.5%
latest operating points (no-LLM retrieval)
metric             turn    session  hybrid  hybrid vs session
Hit@1              0.3734  0.6731   0.6983  +0.0252 (+3.74%)
Hit@5              0.6120  0.9117   0.9132  +0.0015 (+0.16%)
Hit@10             0.6892  0.9606   0.9601  -0.0005 (-0.05%)
MRR                0.4731  0.7749   0.7920  +0.0171 (+2.21%)
single-hop Hit@5   0.4645  0.8688   0.8546  -0.0142 (-1.63%)
multi-hop Hit@5    0.3696  0.6522   0.6739  +0.0217 (+3.33%)
temporal Hit@5     0.6604  0.8972   0.8972  +0.0000 (+0.00%)

hybrid is the best overall LoCoMo operating point in this sweep: stronger hit@1/hit@5/MRR and multi-hop hit@5 than session, equal temporal hit@5, and a small single-hop hit@5 giveback.

by category
category       n    hit@1  hit@5  MRR    nDCG@10
adversarial    446  0.377  0.603  0.479  0.521
multi-hop*     92   0.174  0.315  0.232  0.202
open-domain    841  0.373  0.602  0.479  0.517
single-hop*    282  0.167  0.429  0.282  0.220
temporal       321  0.405  0.648  0.510  0.538

* baseline Brain.search still has weak hop-heavy hit@1 (single-hop 0.167, multi-hop 0.174). latest hybrid operating points close most of that gap on top-heavy metrics while keeping similar temporal performance.
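as a consistency check, the per-category rows can be combined into the overall LoCoMo figures by weighting each category by its question count. the sketch below assumes the overall numbers are micro-averages over all questions; the category rows are copied from the by-category table, and the helper name is illustrative.

```python
# (n, hit@1, hit@5, MRR) per category, copied from the by-category table
locomo = {
    "adversarial": (446, 0.377, 0.603, 0.479),
    "multi-hop": (92, 0.174, 0.315, 0.232),
    "open-domain": (841, 0.373, 0.602, 0.479),
    "single-hop": (282, 0.167, 0.429, 0.282),
    "temporal": (321, 0.405, 0.648, 0.510),
}


def micro_average(rows, column):
    """Question-count-weighted mean of one metric column."""
    total = sum(row[0] for row in rows.values())
    return sum(row[0] * row[column] for row in rows.values()) / total


total_n = sum(row[0] for row in locomo.values())
overall = {
    name: micro_average(locomo, column)
    for column, name in ((1, "hit@1"), (2, "hit@5"), (3, "MRR"))
}
print(total_n, {name: f"{value:.1%}" for name, value in overall.items()})
```

the counts sum to 1982, and the weighted means land within rounding of the overall hit@1, hit@5 and MRR shown for this benchmark. the small categories (multi-hop, single-hop) pull the overall down less than an unweighted mean would suggest.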

head-to-head: brainctl vs MemPalace

measured · same machine · same datasets · same scoring · run 2026-04-18

provenance
hardware: Intel Core Ultra 7 258V · 33.9 GB RAM · Windows 10 Home
repro command: python benchmarks/compare_memory_engines.py --label full_compare
result bundle: benchmarks/results/full_compare_20260418_033425/
LongMemEval · 470 questions · longmemeval_s_cleaned.json
system                  R@5     R@10    NDCG@5  NDCG@10
brainctl Brain.search   0.9681  0.9894  0.9204  0.9253
brainctl cmd_search     0.9702  0.9894  0.9206  0.9253
mempalace raw_session   0.9660  0.9830  0.8930  0.8948
LoCoMo · 1,986 QA · locomo10.json · session-level recall
system                  avg recall
brainctl cmd_session    0.9217
mempalace raw_session   0.6028

session-level granularity (does the right session appear?), which is distinct from the turn-level Hit@K reported in the LoCoMo section above. Same dataset, different scoring level.
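A small sketch of why the two scoring levels diverge. It assumes turn IDs of the form "session:turn" and defines session recall as the fraction of gold sessions that appear among the retrieved turns' sessions; both the ID format and the function names are hypothetical, not the comparison script's actual code.

```python
def session_of(turn_id):
    """Extract the session part of a hypothetical 'session:turn' identifier."""
    return turn_id.split(":", 1)[0]


def turn_hit_at_k(ranked_turns, gold_turns, k):
    """Turn-level: the exact gold turn must be in the top k."""
    return any(t in gold_turns for t in ranked_turns[:k])


def session_recall_at_k(ranked_turns, gold_turns, k):
    """Session-level: fraction of gold sessions covered by the top k results."""
    retrieved = {session_of(t) for t in ranked_turns[:k]}
    gold = {session_of(t) for t in gold_turns}
    return len(gold & retrieved) / len(gold) if gold else 0.0


# The retriever finds the right session but not the exact gold turn.
ranked = ["s1:t1", "s2:t3"]
gold = {"s1:t2"}
print(turn_hit_at_k(ranked, gold, 2))        # False
print(session_recall_at_k(ranked, gold, 2))  # 1.0
```

A near miss within the right session counts as a failure at turn level and a success at session level, so session-level numbers are expected to be higher on the same data.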

MemBench · 200 q (FirstAgent slice, partial)
system               hit@5
brainctl cmd_turn    0.930
mempalace raw_turn   0.885

FirstAgent slice only — full MemBench sweep pending. ConvoMem run was blocked because the evidence payload fetch failed; no fair same-machine number yet.

retrieval settings + honesty note
  • brainctl Brain.search = FTS-only lexical retrieval
  • brainctl cmd_search = full brainctl retrieval pipeline
  • mempalace raw_* = raw retrieval baseline

Honesty note: the vector-on/off flag for the cmd_search run was not persisted into the artifact bundle. We will not overclaim the cmd_search numbers as a clean vector-vs-FTS statement without rerunning that exact variant with the flag captured.
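One way to avoid this gap in future runs is to write the retrieval settings next to the results. The sketch below is an assumption about how that could look, not the existing harness: the file name run_config.json, the function name, and the vector_enabled key are all hypothetical, and the value shown is a placeholder rather than the setting of the actual run.

```python
import json
import tempfile
from pathlib import Path


def write_run_config(bundle_dir, system, settings):
    """Persist the settings of one run inside its result bundle and return the path."""
    bundle = Path(bundle_dir)
    bundle.mkdir(parents=True, exist_ok=True)
    path = bundle / "run_config.json"  # hypothetical file name
    payload = {"system": system, "settings": settings}
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    return path


# A temporary directory stands in for a real result bundle.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_run_config(tmp, "brainctl cmd_search", {"vector_enabled": True})
    saved = json.loads(config_path.read_text(encoding="utf-8"))

print(saved["settings"])
```

With the flag stored alongside the metrics, a later reader can tell which variant produced a given row without rerunning it.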

other competitor adapters

The competitor harness in v2.4.0 (tests/bench/competitor_runs/) has adapters for Mem0, Letta, Zep, Cognee, MemPalace, and OpenAI Memory. Adapters share the same SearchFn(query, k) protocol and a skip-not-fabricate contract: missing SDK or API key raises CompetitorUnavailable instead of returning a fake 0. The MemPalace numbers above are from a separate same-machine run; sweeps for Mem0 / Zep are gated on hosted-API budget, and Cognee can run free locally.
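The skip-not-fabricate contract can be illustrated with a minimal sketch. SearchFn and CompetitorUnavailable are the names used above; everything else (the return type, the helper functions, the toy adapter, and the SDK module name) is assumed for illustration and is not the harness's actual code.

```python
import importlib.util
from typing import Callable, List

# Assumed shape: a query string and k in, a ranked list of document IDs out.
SearchFn = Callable[[str, int], List[str]]


class CompetitorUnavailable(RuntimeError):
    """Raised when an adapter cannot run (missing SDK or API key)."""


def make_sdk_adapter(sdk_module: str) -> SearchFn:
    """Build an adapter for a hypothetical SDK, refusing to run if it is absent."""
    if importlib.util.find_spec(sdk_module) is None:
        raise CompetitorUnavailable(f"{sdk_module} is not installed")
    raise NotImplementedError("real SDK calls are out of scope for this sketch")


def build_toy_adapter() -> SearchFn:
    """A stand-in adapter: ranks two in-memory documents by word overlap."""
    docs = {"d1": "alice moved to berlin", "d2": "bob likes tea"}

    def search(query: str, k: int) -> List[str]:
        words = set(query.lower().split())
        ranked = sorted(docs, key=lambda d: -len(words & set(docs[d].split())))
        return ranked[:k]

    return search


def run_or_skip(name, build, query, k):
    """Record a skip instead of a score when the competitor is unavailable."""
    try:
        search = build()
    except CompetitorUnavailable as exc:
        return {"system": name, "status": "skipped", "reason": str(exc)}
    return {"system": name, "status": "ok", "results": search(query, k)}


query = "where did alice move"
skipped = run_or_skip(
    "missing", lambda: make_sdk_adapter("hypothetical_memory_sdk_not_installed"), query, 1
)
ok = run_or_skip("toy", build_toy_adapter, query, 1)
print(skipped["status"], ok["results"])
```

The point of the design is that an unavailable system produces an explicit skip record with a reason, never a row of zeros that could be mistaken for a measured result.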

cells on /comparison still marked ? are measured-but-not-published-yet, or genuinely unknown from competitor docs. Nothing on this page is a measured loss displayed as “—”.

harness: tests/bench/run.py · baselines: tests/bench/baselines/ · CI gate: tests/test_search_quality_bench.py · regression threshold: >2% drop on any headline metric fails the build.
