benchmarks

brainctl v2.4.3 · backend: Brain.search · default settings · no tuning for benchmark data. results committed to tests/bench/baselines/ and gated in CI.

methodology

both benchmarks ran against Brain.search (the primary retrieval backend) with default settings. no benchmark-specific tuning, no cherry-picked queries, no post-hoc filtering. the test runner is tests/bench/run.py; baselines are committed as JSON and fail the build on any >2% regression in P@1 / P@5 / MRR / nDCG@5.
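The gate described above can be sketched as follows. This is a minimal illustration, not the contents of tests/bench/run.py: the function name `find_regressions` is hypothetical, and it assumes the 2% threshold is a relative drop against the committed baseline (the page does not say whether the threshold is relative or absolute). The first input uses the MRR and nDCG@5 values from the LongMemEval lock snapshot on this page; the second input is made up to show a failing case.

```python
from typing import Dict

THRESHOLD = 0.02  # assumption: a relative drop of more than 2% fails the gate


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     threshold: float = THRESHOLD) -> Dict[str, float]:
    """Return {metric: relative_change} for every metric that dropped too far."""
    regressions = {}
    for name, base in baseline.items():
        if base == 0:
            continue  # relative change is undefined for a zero baseline
        change = (current[name] - base) / base
        if change < -threshold:
            regressions[name] = change
    return regressions


# Values taken from the lock snapshot table (old FTS-only vs final locked).
baseline = {"MRR": 0.9241, "nDCG@5": 0.8910}
locked = {"MRR": 0.9147, "nDCG@5": 0.8815}
print(find_regressions(baseline, locked))  # both drops are about 1%, so nothing is flagged

# Hypothetical run with a larger MRR drop, for illustration only.
hypothetical = {"MRR": 0.9000, "nDCG@5": 0.8815}
print(find_regressions(baseline, hypothetical))
```

A CI test would typically assert that the returned dict is empty and print the offending metrics otherwise.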

this page shows both the committed Brain.search baseline and separate no-LLM LoCoMo operating points (turn/session/hybrid). on the latest sweep, hybrid is the best top-heavy operating point (Hit@1 0.6983, Hit@5 0.9132, MRR 0.7920), with a small single-hop Hit@5 giveback vs session and near-tied Hit@10.

LongMemEval

289 questions · 4 retrieval-friendly categories · subset of longmemeval_s filtered to questions whose gold answer is checkable via string / fuzzy match against the conversation content. Temporal-reasoning and knowledge-update categories (which need an LLM-as-judge to score) are excluded from this overall score; their gold session IDs are still present, so they can be measured separately.
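For readers unfamiliar with the headline metrics, here is a small sketch of hit@k and MRR under their conventional definitions: hit@k is 1 when any gold item appears in the top k results, and the reciprocal rank is 1 / position of the first gold item. The page does not spell out the harness's exact scoring code, so treat these definitions as an assumption; the session IDs below are toy values, not benchmark data.

```python
from typing import Iterable, Sequence


def hit_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """1.0 if any gold item is among the first k results, else 0.0."""
    gold_set = set(gold)
    return 1.0 if any(item in gold_set for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], gold: Iterable[str]) -> float:
    """1 / rank of the first gold item, or 0.0 if none was retrieved."""
    gold_set = set(gold)
    for rank, item in enumerate(ranked, start=1):
        if item in gold_set:
            return 1.0 / rank
    return 0.0


def mean(values: Iterable[float]) -> float:
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Toy queries: (ranked session IDs, gold session IDs).
runs = [
    (["s3", "s1", "s7"], {"s1"}),  # gold at rank 2
    (["s2", "s9", "s4"], {"s2"}),  # gold at rank 1
    (["s5", "s6", "s8"], {"s0"}),  # gold not retrieved
]

hit1 = mean(hit_at_k(r, g, 1) for r, g in runs)
hit5 = mean(hit_at_k(r, g, 5) for r, g in runs)
mrr = mean(reciprocal_rank(r, g) for r, g in runs)
print(f"hit@1={hit1:.3f} hit@5={hit5:.3f} MRR={mrr:.3f}")
```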

overall

metric	value
hit@1	88.2%
hit@5	97.6%
hit@10	99.0%
MRR	92.4%

by category

category	n	hit@1	hit@5	MRR	nDCG@10
multi‑session	133	0.910	0.985	0.944	0.888
single‑session‑assistant	56	1.000	1.000	1.000	1.000
single‑session‑preference	30	0.500	0.833	0.671	0.732
single‑session‑user	70	0.900	1.000	0.935	0.951
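As a consistency check, the overall LongMemEval numbers can be recovered from the per-category rows as a question-weighted mean. The sketch below assumes the overall figure is a micro-average over all 289 questions; that is not stated on this page, but the arithmetic agrees with the reported 88.2% hit@1 and 97.6% hit@5.

```python
# (category, n, hit@1, hit@5) copied from the by-category table.
rows = [
    ("multi-session", 133, 0.910, 0.985),
    ("single-session-assistant", 56, 1.000, 1.000),
    ("single-session-preference", 30, 0.500, 0.833),
    ("single-session-user", 70, 0.900, 1.000),
]

total_n = sum(n for _, n, _, _ in rows)

# Weight each category score by its question count.
overall_hit1 = sum(n * h1 for _, n, h1, _ in rows) / total_n
overall_hit5 = sum(n * h5 for _, n, _, h5 in rows) / total_n

print(total_n)
print(f"hit@1 = {overall_hit1:.3f}, hit@5 = {overall_hit5:.3f}")
```

The small single-session-preference category (n = 30) has the weakest scores but carries only about a tenth of the weight, which is why the overall stays high.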

lock snapshot (old FTS-only vs final locked)

metric	old FTS-only	final locked	abs delta	rel delta
hit@1	0.8824	0.8685	-0.0139	-1.58%
hit@5	0.9758	0.9792	+0.0034	+0.35%
hit@10	0.9896	0.9896	+0.0000	+0.00%
hit@20	1.0000	1.0000	+0.0000	+0.00%
MRR	0.9241	0.9147	-0.0094	-1.02%
nDCG@5	0.8910	0.8815	-0.0095	-1.07%
Recall@5	0.9217	0.9158	-0.0059	-0.64%

coverage stayed near ceiling (hit@10 and hit@20 unchanged). the only gain is hit@5 (+0.35%); hit@1, MRR, nDCG@5, and Recall@5 each gave back between 0.6% and 1.6%.
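The abs delta and rel delta columns follow directly from the two score columns: abs delta is new minus old, and rel delta is that difference divided by the old value. A short sketch, using rows from the table above (the helper name `format_delta` is illustrative):

```python
from typing import Tuple


def format_delta(old: float, new: float) -> Tuple[str, str]:
    """Return (abs delta, rel delta) formatted like the lock snapshot table."""
    abs_delta = new - old
    rel_delta = abs_delta / old * 100 if old else 0.0
    return f"{abs_delta:+.4f}", f"{rel_delta:+.2f}%"


# (metric, old FTS-only, final locked) from the lock snapshot table.
snapshot = [
    ("hit@1", 0.8824, 0.8685),
    ("hit@5", 0.9758, 0.9792),
    ("hit@10", 0.9896, 0.9896),
    ("MRR", 0.9241, 0.9147),
]

for metric, old, new in snapshot:
    abs_str, rel_str = format_delta(old, new)
    print(f"{metric}\t{old:.4f}\t{new:.4f}\t{abs_str}\t{rel_str}")
```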

LoCoMo

1982 questions · 10 conversations · 5 categories · tests temporal, adversarial, and multi-hop recall

overall

metric	value
hit@1	34.1%
hit@5	57.2%
hit@10	65.8%
MRR	44.5%

latest operating points (no-LLM retrieval)

metric	turn	session	hybrid	hybrid vs session
Hit@1	0.3734	0.6731	0.6983	+0.0252 (+3.74%)
Hit@5	0.6120	0.9117	0.9132	+0.0015 (+0.16%)
Hit@10	0.6892	0.9606	0.9601	-0.0005 (-0.05%)
MRR	0.4731	0.7749	0.7920	+0.0171 (+2.21%)
single-hop Hit@5	0.4645	0.8688	0.8546	-0.0142 (-1.63%)
multi-hop Hit@5	0.3696	0.6522	0.6739	+0.0217 (+3.33%)
temporal Hit@5	0.6604	0.8972	0.8972	+0.0000 (+0.00%)

hybrid is the best overall LoCoMo operating point in this sweep: stronger hit@1/hit@5/MRR and multi-hop hit@5 than session, equal temporal hit@5, an effectively tied hit@10 (-0.0005), and a small single-hop hit@5 giveback (-0.0142).

by category

category	n	hit@1	hit@5	MRR	nDCG@10
adversarial	446	0.377	0.603	0.479	0.521
multi-hop*	92	0.174	0.315	0.232	0.202
open-domain	841	0.373	0.602	0.479	0.517
single-hop*	282	0.167	0.429	0.282	0.220
temporal	321	0.405	0.648	0.510	0.538

* baseline Brain.search still has weak hop-heavy hit@1 (single-hop 0.167, multi-hop 0.174). latest hybrid operating points close most of that gap on top-heavy metrics while keeping similar temporal performance.

head-to-head: brainctl vs MemPalace

measured · same machine · same datasets · same scoring · run 2026-04-18

provenance

hardware	Intel Core Ultra 7 258V · 33.9 GB RAM · Windows 10 Home
repro command	`python benchmarks/compare_memory_engines.py --label full_compare`
result bundle	`benchmarks/results/full_compare_20260418_033425/`

LongMemEval · 470 questions · longmemeval_s_cleaned.json

system	R@5	R@10	NDCG@5	NDCG@10
brainctl Brain.search	0.9681	0.9894	0.9204	0.9253
brainctl cmd_search	0.9702	0.9894	0.9206	0.9253
mempalace raw_session	0.9660	0.9830	0.8930	0.8948

LoCoMo · 1,986 QA · locomo10.json · session-level recall

system	avg recall
brainctl cmd_session	0.9217
mempalace raw_session	0.6028

session-level granularity (does the right session appear?), which is distinct from the turn-level Hit@K reported in the LoCoMo section above. Same dataset, different scoring level.
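To make the turn-level versus session-level distinction concrete, here is a hedged sketch of how a ranked list of turns could be collapsed into a session ranking before scoring. The helper names, the turn-to-session mapping, and the IDs are all hypothetical; the comparison script's actual scoring code is not shown on this page.

```python
from typing import Dict, Iterable, List, Sequence


def collapse_to_sessions(ranked_turns: Sequence[str],
                         turn_to_session: Dict[str, str]) -> List[str]:
    """Map each retrieved turn to its session, keeping first-seen order."""
    seen = set()
    sessions = []
    for turn in ranked_turns:
        session = turn_to_session[turn]
        if session not in seen:
            seen.add(session)
            sessions.append(session)
    return sessions


def recall_at_k(ranked: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Fraction of gold items that appear in the top k results."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(gold_set & set(ranked[:k])) / len(gold_set)


# Hypothetical conversation: turns t1-t2 belong to session A, t3-t4 to session B.
turn_to_session = {"t1": "A", "t2": "A", "t3": "B", "t4": "B"}
ranked_turns = ["t1", "t3", "t2"]  # the gold turn t4 itself is never retrieved

turn_level = recall_at_k(ranked_turns, {"t4"}, k=3)
session_level = recall_at_k(
    collapse_to_sessions(ranked_turns, turn_to_session), {"B"}, k=3
)
print(turn_level, session_level)
```

In this toy case the exact gold turn is missed but a neighboring turn from the same session is retrieved, so the query counts as a miss at turn level and a hit at session level. That is one reason the two kinds of numbers are not directly comparable.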

MemBench · 200 q (FirstAgent slice, partial)

system	hit@5
brainctl cmd_turn	0.930
mempalace raw_turn	0.885

FirstAgent slice only — full MemBench sweep pending. ConvoMem run was blocked because the evidence payload fetch failed; no fair same-machine number yet.

retrieval settings + honesty note

brainctl Brain.search = FTS-only lexical retrieval
brainctl cmd_search = full brainctl retrieval pipeline
mempalace raw_* = raw retrieval baseline

Honesty note: the vector-on/off flag for the cmd_search run was not persisted into the artifact bundle. We will not overclaim the cmd_search numbers as a clean vector-vs-FTS statement without rerunning that exact variant with the flag captured.

other competitor adapters

The competitor harness in v2.4.0 (tests/bench/competitor_runs/) has adapters for Mem0, Letta, Zep, Cognee, MemPalace, and OpenAI Memory. Adapters share the same SearchFn(query, k) protocol and a skip-not-fabricate contract: missing SDK or API key raises CompetitorUnavailable instead of returning a fake 0. The MemPalace numbers above are from a separate same-machine run; sweeps for Mem0 / Zep are gated on hosted-API budget, and Cognee can run free locally.
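The skip-not-fabricate contract can be sketched roughly as below. Only the names `SearchFn` and `CompetitorUnavailable` come from this page; the return type of `SearchFn`, the factory functions, the `EXAMPLE_API_KEY` variable, and the adapter names are assumptions made for illustration, not the harness's real code.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional, Tuple

# Assumed shape: (query, k) -> ranked list of document/session IDs.
SearchFn = Callable[[str, int], List[str]]


class CompetitorUnavailable(RuntimeError):
    """Raised when an adapter cannot run (missing SDK or API key)."""


def make_hosted_adapter(name: str, key_var: str,
                        env: Optional[Mapping[str, str]] = None) -> SearchFn:
    """Hypothetical factory: refuse to build an adapter without credentials."""
    env = os.environ if env is None else env
    if not env.get(key_var):
        raise CompetitorUnavailable(f"{name}: {key_var} is not set")

    def search(query: str, k: int) -> List[str]:
        raise NotImplementedError("a real adapter would call the vendor SDK here")

    return search


def collect_adapters(
    factories: Dict[str, Callable[[], SearchFn]]
) -> Tuple[Dict[str, SearchFn], Dict[str, str]]:
    """Build every adapter; record skips instead of scoring them as 0."""
    available: Dict[str, SearchFn] = {}
    skipped: Dict[str, str] = {}
    for name, factory in factories.items():
        try:
            available[name] = factory()
        except CompetitorUnavailable as exc:
            skipped[name] = str(exc)
    return available, skipped


available, skipped = collect_adapters({
    # Empty env simulates a missing API key.
    "example-hosted": lambda: make_hosted_adapter(
        "example-hosted", "EXAMPLE_API_KEY", env={}
    ),
    # A local adapter needs no key; this stub returns no results.
    "example-local": lambda: (lambda query, k: []),
})
print(sorted(available), skipped)
```

The point of the design is that a skipped system shows up as "not measured" with a reason, and never enters a results table as a zero score.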

cells on /comparison still marked ? are measured-but-not-published-yet, or genuinely unknown from competitor docs. Nothing on this page is a measured loss displayed as “—”.

harness: tests/bench/run.py · baselines: tests/bench/baselines/ · CI gate: tests/test_search_quality_bench.py · regression threshold: >2% drop on any headline metric fails the build.