brainctl: a persistent memory layer for autonomous agents
Autonomous agents built on large language models have no durable memory. Every session begins from zero, every context window is a goldfish bowl, and every handoff loses most of what the previous agent learned. brainctl is an open-source memory layer that gives agents a persistent brain: a single SQLite database with full-text and vector search, a typed knowledge graph, three first-class memory types (episodic, semantic, procedural) backed by their own tables and gates, provenance-tracked decisions, and sleep-inspired consolidation. It ships as a Python library, a CLI, a 209-tool Model Context Protocol server (stdio and streamable HTTP, with a tool-allowlist for clients that cap the surface), and nineteen first-party plugins spanning agent frameworks (Claude Code, Codex CLI, Cursor, Eliza, Gemini CLI, Goose, Hermes, OpenClaw, OpenCode, Pi, Rig, Virtuals Game, Zerebro) and trading bots (Freqtrade, Jesse, Hummingbot, NautilusTrader, OctoBot, Coinbase AgentKit). On top of the memory substrate brainctl ships an on-chain economy: Ed25519-signed memory bundles, optional Solana memo pinning, Light Protocol compressed-token minting, and a chain-canonical agent-to-agent marketplace at brainctl.org with full negotiation primitives (offer / counter / accept / reject / withdraw), just-in-time cNFT minting at settlement, and a transparent fee schedule (2.5% protocol fee at settlement, flat per-op fees on every chain interaction). Provider import adapters bring memories in from mem0 and arbitrary JSON exports; wallet export to Phantom / Backpack / Solflare / Glow is one CLI call. 
This paper describes the motivation, the substrate, the seven-store memory typology, the worthiness gate with adaptive five-factor admission control and Bayesian confidence model, AGM-style belief revision, an eight-phase NREM/REM consolidation pipeline grounded in synaptic homeostasis theory (Tononi & Cirelli 2003), self-improving retrieval via Thompson Sampling explore/exploit, context-dependent encoding, Q-value utility scoring, and phase-aware quantum amplitude scoring over a zero-LLM entity-linked knowledge graph, hybrid search with reciprocal rank fusion, multi-agent handoff, the security posture, the on-chain primitives (signed exports → mint → marketplace) and protocol fee schedule, comparison with existing approaches, and the economics of funding open-source infrastructure with a token.
motivation
the forgetting agent
Consider a coding agent asked to implement a multi-day feature. On day one it decides to use Retry-After headers for backoff because the server controls the rate-limit window. On day two, a new session, the agent is told to "improve backoff" and — with no memory of yesterday's rationale — reinvents an exponential backoff scheme that silently fights the server's headers. On day three a third session reconciles the two by introducing yet another layer. The code grows; coherence decays.
This is the canonical failure mode of stateless agents. It is not a failure of reasoning; each session is locally correct. It is a failure of memory — specifically, the inability to preserve the epistemic status of a decision across a session boundary.
why context windows don't fix it
The naive fix, feeding the full transcript into the next session's context window, breaks down in practice along three dimensions.
- Cost. A 200k-token window at frontier-model pricing is roughly one to two dollars per query. A multi-day agent running tens of thousands of queries pays for an office building.
- Coherence. Instruction-following degrades with position. A 200k-token window shrinks to an effective ~80k under retrieval-from-context benchmarks, and the degradation is uneven — middle of context is the weakest position.
- Summarization drift. Every summarization pass destroys distinguishing detail. The phrase "we decided to use Retry-After" survives one pass; "because the server controls the rate-limit window" rarely survives three.
The empirical case against simply scaling context windows is well-established in the literature. Liu et al. 2024 ("Lost in the Middle") showed that language models systematically attend more strongly to information at the beginning and end of their context windows than to information in the middle, with a characteristic U-shaped accuracy curve that gets worse as context length increases. The finding holds across model families and across multi-document QA, key-value retrieval, and reading-comprehension tasks. The naive fix degrades retrieval quality before it runs out of length.
why retrieval alone doesn't fix it either
The next line of defence — a vector database bolted onto the agent — helps, but not enough. Vector retrieval treats memory as a bag of facts indexed by surface similarity. This misses at least three structural properties of real memory:
1. Provenance. A note written by the agent itself is not the same as a note from the user. A decision ratified twice is not the same as a decision mentioned once. Retrieval that ignores provenance promotes the popular over the reliable.
2. Temporal structure. "Alice is CTO" written on Monday is superseded by "Alice left the company" written on Tuesday. A retrieval system that returns both with similar cosine scores is broken at the level of meaning.
3. Epistemic status. Some facts are confident; some are tentative hypotheses; some are contradictions pending resolution. Flattening them into the same embedding space loses the gradient that matters most.
That vector retrieval alone is insufficient for agent memory is now broadly held in the literature. Park et al. 2023 ("Generative Agents: Interactive Simulacra of Human Behavior") introduced a memory-stream architecture with importance scoring and a reflection step that periodically synthesizes higher-level beliefs from raw observations — already past flat similarity retrieval. Packer et al. 2023 (MemGPT) reframed agent memory as an OS-style hierarchy of context, recall, and archival tiers with explicit paging between them. Zhang et al. 2024 surveyed the space and concluded that effective agent memory needs explicit structure, lifecycle management, and consolidation. brainctl is positioned in that line of work.
design goals
brainctl is an attempt to build a memory layer that is:
- local — no servers, no API calls on the hot path, no cloud lock-in
- durable — survives crashes, power loss, and reformatting
- structured — memories carry type, provenance, confidence, and temporal scope
- self-pruning — stale, low-value memories decay without explicit delete
- introspectable — the entire brain is a file an operator can read, audit, and edit with standard tools
- framework-agnostic — works with any agent, any language, any orchestrator
non-goals and scope
It is equally important to be clear about what brainctl is not trying to be. The agent-memory problem is adjacent to several other hard problems, and conflating them leads to over-scoped projects that do none of them well. The non-goals below are not things we will get to later; they are things we are actively choosing to leave to other tools.
- Not a planner. brainctl does not decide what the agent should do next. Planning, task decomposition, tree-of-thought search, and ReAct-style tool-use loops are the host agent's responsibility. brainctl gives the planner better inputs by making prior decisions, observations, and contradictions legible; it does not replace the planner.
- Not a reasoning engine. brainctl does not perform symbolic inference over stored memories. It surfaces relevant facts and flags contradictions via AGM revision, but it does not derive new theorems, execute Prolog-style rule chains, or run SAT solvers against the knowledge graph. Reasoning is delegated to the LLM consuming the retrieved context.
- Not a model. brainctl is a substrate. It has no weights to train, no gradient to optimize, no loss function. It does not produce text, generate embeddings in the model-architecture sense, or make predictions. The only neural component it touches is an embedding model (Ollama nomic-embed-text) invoked as a black box at write and query time.
- Not a vector database. brainctl uses vectors — via sqlite-vec — but a vector database is a tool; brainctl is a memory architecture that happens to use one. If your use case is storing and retrieving embeddings at scale without the typology, provenance, AGM revision, or consolidation machinery, use Qdrant or LanceDB or pgvector directly. brainctl is what you get when you need all of the structure and you're willing to trade peak retrieval throughput for it.
- Not a chat transcript store. Storing raw conversation history is trivial and largely useless for agent memory — the signal is in the decisions and observations that emerged from the conversation, not the conversation itself. brainctl deliberately does not offer a "save-every-turn-verbatim" primitive; it offers structured primitives (remember, decide, entity, event_add) and expects the agent to distill raw turns into those shapes before writing.
- Not a replacement for version control. brainctl tracks memories, not source code. Git is still the system of record for code; the file_anchored_memories table (migration 027) lets memories reference code locations, but it does not track code itself.
- Not a multi-tenant hosted service. The single-file architecture is optimized for single-operator or small-team use. A team that needs thousands of isolated tenants with row-level ACLs, audit logs pushed to SIEM, and managed upgrades is better served by a hosted alternative. Federation (§6.3) is the direction that might change this, but it is explicitly not the current posture.
These non-goals are chosen, not forced. They exist because scope discipline is the difference between a memory layer that can be understood in one sitting and a platform that accretes features until no one can reason about it as a whole.
a concrete multi-day agent trace
To make the motivation less abstract, here is the kind of trace brainctl is trying to preserve. Consider a coding agent implementing an API-v2 migration over three sessions spanning four days. Without brainctl, each session starts from zero and the trace looks like:
day 1 session A
orient -> [empty]
work -> decide to use Retry-After headers for backoff
(rationale: server controls the rate-limit window)
implement initial fetcher against /api/v2/orders
notice rate limit kicks in at ~100 req / 15s
wrap_up -> [lost]
day 2 session B
orient -> [empty]
user -> "please improve the backoff"
work -> invent exponential backoff from scratch
(rationale: handle rate limits gracefully)
silently conflicts with day-1 Retry-After logic
wrap_up -> [lost]
day 4 session C
orient -> [empty]
user -> "the rate limiter is broken, there's two layers"
work -> introduce a reconciliation layer on top of both
(rationale: can't figure out which to remove)
wrap_up -> [lost]
outcome: 3 sessions, 3 uncoordinated decisions, 1 bug,
coherence lost at the first handoff.
With brainctl, the same three sessions look like:
day 1 session A
orient -> brain.orient(project="api-v2")
-> context package: [empty — new project]
work -> brain.decide("use Retry-After for backoff",
"server controls rate-limit window")
brain.entity("RateLimitAPI", "service",
observations=["100 req/15s"])
brain.remember("rate-limit: 100/15s",
category="integration")
wrap_up -> handoff_packet {
goal: "implement api-v2 order fetcher",
current_state: "fetcher against /orders working,
Retry-After backoff in place",
open_loops: ["pagination not yet implemented"],
next_step: "add cursor-based pagination"
}
day 2 session B
orient -> brain.orient(project="api-v2")
-> context package: {
handoff: (from session A, verified),
decisions: ["use Retry-After for backoff"],
entities: [RateLimitAPI],
memories: ["rate-limit: 100/15s"]
}
user -> "please improve the backoff"
work -> brain.search("backoff")
-> hit: decision(Retry-After) w/ rationale
agent proposes: "tighten Retry-After parsing,
add jitter, keep existing model"
brain.decide("add jitter to Retry-After delay",
"avoid thundering herd on recovery")
wrap_up -> handoff_packet { ... }
day 4 session C
orient -> same flow
-> context now carries both day-1 and day-2
decisions with full provenance
user -> "there might be an issue with rate limits"
work -> brain.search("rate limit")
agent finds both decisions, understands the
layered design, investigates the real bug.
outcome: 3 sessions, 1 coherent design, no contradiction
loop, each handoff preserves the epistemic state.
The difference is not that brainctl is smarter than the agent. It is that brainctl spares the agent from re-deriving its own prior reasoning at the start of every session. Without it, the cost of forgetting is paid at every session boundary; brainctl moves that cost to a one-time write.
architecture
substrate choice: sqlite, one file
SQLite was chosen over Postgres, DuckDB, LMDB, and every dedicated vector database. The reasons are practical rather than ideological.
- zero-install — ships in every OS, every language runtime, every container
- one file — the entire brain is a file you can copy, version-control, rsync, or commit to git
- durable — WAL mode survives power loss; SQLite is one of the most-tested software artifacts in history
- fast at the scale of a single agent — reads are faster than network-hosted alternatives because there is no socket in the hot path
- extensible — sqlite-vec brings vector ops into the same engine as relational and FTS5 ops, eliminating cross-store joins
- public domain — no licensing risk, ever
The database runs with foreign keys enforced, WAL journaling, and a 64 MB page cache. A lazy shared connection per Brain instance amortizes connection setup across calls.
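As a rough illustration of those three settings, the sketch below opens a SQLite file with Python's standard sqlite3 module and applies the corresponding PRAGMAs. The helper name open_brain is hypothetical and is not taken from the brainctl source; the PRAGMA names and the negative cache_size convention (a size in KiB) are standard SQLite.

```python
import os
import sqlite3
import tempfile


def open_brain(path: str) -> sqlite3.Connection:
    """Open a SQLite file with the settings described above (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")    # enforce foreign-key constraints
    conn.execute("PRAGMA journal_mode = WAL")   # write-ahead logging
    conn.execute("PRAGMA cache_size = -65536")  # negative value = KiB, so 64 MiB
    return conn


# Demonstration against a throwaway file (WAL needs a real file, not :memory:).
with tempfile.TemporaryDirectory() as tmp:
    conn = open_brain(os.path.join(tmp, "brain.db"))
    foreign_keys = conn.execute("PRAGMA foreign_keys").fetchone()[0]
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
    conn.close()
```

Note that foreign_keys and cache_size are per-connection settings, which is one reason a shared connection per instance is convenient: the PRAGMAs are issued once.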
Schema evolution is a first-class concern. As of v1.5.0 the package ships a safe migration runner (brainctl migrate) that detects drift between what the database records as applied and what the migration files actually require, classifies each pending migration as likely-applied, partial, or needs-apply via column- and table-level heuristics, and only then replays what is missing. v1.5.1 hardened the heuristic for GENERATED ALWAYS AS VIRTUAL columns and ADD COLUMN IF NOT EXISTS patterns. The practical effect: users who pip-upgrade to a new brainctl release can run one command and have their existing brain.db file brought up to the new schema without losing any data — the single-file invariant survives upgrades, not just clean installs.
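The classification step can be pictured with a small sketch. This is not the brainctl migration runner; it is a simplified stand-in that assumes each migration can be summarized as a mapping from table name to the columns it is expected to create, and it uses SQLite's pragma_table_info table-valued function to see what the database actually contains. The function name and the input shape are hypothetical.

```python
import sqlite3


def classify_migration(conn: sqlite3.Connection, expected: dict) -> str:
    """Classify a migration by how many of its expected columns already exist.

    expected maps table name -> set of column names (hypothetical input shape).
    """
    wanted = 0
    found = 0
    for table, columns in expected.items():
        # pragma_table_info returns zero rows for a table that does not exist.
        rows = conn.execute("SELECT name FROM pragma_table_info(?)", (table,))
        existing = {row[0] for row in rows}
        wanted += len(columns)
        found += len(columns & existing)
    if found == wanted:
        return "likely-applied"
    if found == 0:
        return "needs-apply"
    return "partial"
```

A real runner would also need to handle indexes, triggers, and generated columns, which this column-count heuristic ignores.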
memory typology
Following Tulving (1972), brainctl represents memory as several first-class stores with different write and retrieval patterns. These are not just different tables — they have different invariants and different consolidation paths. The three Tulvingian types — episodic, semantic, and procedural — are present as separate first-class data models, joined by four supporting stores (decisional, associative, affective, prospective) that map onto specific cognitive functions the typology alone does not cover. Procedural memory was added as a first-class store in v2.7.0 — prior versions mapped that label onto the decisional store, a flattening this typology corrects.
| store | type | table(s) | invariants |
|---|---|---|---|
| episodic | time-stamped events, causally linked | events | append-only, cheap, high volume |
| semantic | abstract facts, conventions, generalizations | memories | W(m) gated, FTS5-indexed, confidence-tracked |
| procedural | reusable workflows + tool sequences with fitness | procedures, procedure_steps, procedure_sources | step-ordered, feedback-tracked, FTS5-indexed (migration 052) |
| decisional | decisions and their rationale | decisions | provenance required, immutable rationale |
| associative | typed knowledge graph | entities, knowledge_edges | nodes carry JSON observations; edges are typed and directional |
| affective | emotional salience signal | affect_log | VAD coordinates (valence, arousal, dominance) |
| prospective | intentions that fire on future queries | memory_triggers | keyword-matched, lifecycle-tracked |
Semantic memories are categorized into nine types — convention, decision, environment, identity, integration, lesson, preference, project, and user — each with its own default confidence prior and decay constants. Procedures, separately, carry a status (draft, active, retired), an execution-feedback counter, and a fitness score updated by the procedure_feedback tool when a procedure's steps are run and the outcome is recorded. Event types are drawn from a similarly closed vocabulary: artifact, decision, error, handoff, result, session_start, session_end, task_update, warning, observation. Entity types cover agent, concept, document, event, location, organization, person, project, service, tool.
The live schema at v2.7.0 contains ~62 user-facing tables across 51 migrations. The primitives above sit alongside auxiliary tables for RBAC (memory_trust_scores, access_log), quarantine (decoherent_memories, recovery_candidates), the global workspace (workspace_broadcasts, workspace_phi), neuromodulation state, theory-of-mind models, EWC importance weights, and Bayesian uncertainty logs.
the Brain interface
The Python API exposes a single class whose public surface is deliberately small. Six methods (orient, remember, entity, decide, search, wrap_up) cover the entire session lifecycle.
from agentmemory import Brain

brain = Brain(agent_id="my-agent")

# session start — pull a context package
ctx = brain.orient(project="api-v2")

# during work — write
brain.remember("rate-limit: 100/15s", category="integration")
brain.entity("RateLimitAPI", "service", observations=["100 req/15s"])
brain.decide("use Retry-After for backoff", "server controls timing")

# during work — read
hits = brain.search("rate limits", k=5)

# session end — produce the handoff
packet = brain.wrap_up("auth module complete", project="api-v2")
A handoff packet contains four fields: goal, current_state, open_loops, and next_step. These are the minimum sufficient statistics for the next agent to resume work cold. Packets are stored in handoff_packets with a signature over their contents so an auditor can verify the receiving agent saw what the sender wrote.
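To make the "signature over contents" idea concrete, here is a minimal sketch of a four-field packet signed over a canonical JSON encoding. The text above does not say which signature scheme handoff_packets uses, so HMAC-SHA256 from the standard library stands in purely for illustration; the class and function names are likewise hypothetical.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HandoffPacket:
    goal: str
    current_state: str
    open_loops: tuple
    next_step: str


def sign_packet(packet: HandoffPacket, key: bytes) -> str:
    """Sign a canonical (sorted-key, compact) JSON encoding of the packet."""
    payload = json.dumps(asdict(packet), sort_keys=True, separators=(",", ":"))
    return hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_packet(packet: HandoffPacket, key: bytes, signature: str) -> bool:
    """Constant-time check that the packet matches the recorded signature."""
    return hmac.compare_digest(sign_packet(packet, key), signature)


packet = HandoffPacket(
    goal="implement api-v2 order fetcher",
    current_state="fetcher against /orders working, Retry-After backoff in place",
    open_loops=("pagination not yet implemented",),
    next_step="add cursor-based pagination",
)
signature = sign_packet(packet, b"demo-key")
```

Canonical encoding matters here: without sorted keys and fixed separators, two semantically identical packets could serialize differently and fail verification.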
the mcp surface
The same operations are exposed over the Model Context Protocol as 209 tools, grouped by capability: memory, events, entities, decisions, procedures, consolidation, belief management, affect, workspace, federation, neuromodulation, theory-of-mind, expertise tracking, reflexion, and uncertainty. The MCP server is a thin adapter — the underlying logic lives in the Python package — so Python and MCP callers see identical semantics. Two transports ship in-tree: brainctl-mcp over stdio (the default; spawned by Claude Desktop, Codex, Cursor, etc.) and brainctl-mcp-http (added in v2.5.0), a streamable-HTTP server with bearer-token auth and an allowlist for remote callers (xAI Grok, Strand, hosted-agent platforms).
For clients that cap the total MCP tool count — Google's Antigravity IDE enforces a hard limit of 100 — v2.6.4 added a BRAINCTL_ALLOWED_TOOLS env var. When set to a comma-separated list of tool names, the stdio server returns only the listed tools from tools/list and rejects everything else from tools/call. Unknown names are a hard error at startup, with a difflib.get_close_matches hint so a typo like memory-add reports "did you mean memory_add?" rather than silently dropping the tool. Unset is the default and exposes the full 209-tool surface for backward compatibility.
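The allowlist behaviour described above can be sketched in a few lines. This is an illustration rather than the brainctl-mcp implementation: the function name and the ValueError are assumptions, while difflib.get_close_matches is the standard-library call named in the text.

```python
import difflib
from typing import Optional, Set


def resolve_allowed_tools(raw: Optional[str], known_tools: Set[str]) -> Set[str]:
    """Return the tool names to expose, given the raw env var value.

    raw is None when the variable is unset, which exposes every known tool.
    """
    if raw is None:
        return set(known_tools)
    requested = [name.strip() for name in raw.split(",") if name.strip()]
    for name in requested:
        if name not in known_tools:
            # Suggest the closest known name so typos fail loudly, not silently.
            hint = difflib.get_close_matches(name, sorted(known_tools), n=1)
            suffix = f' (did you mean "{hint[0]}"?)' if hint else ""
            raise ValueError(f"unknown tool {name!r}{suffix}")
    return set(requested)
```

In a server this would be called once at startup with os.environ.get("BRAINCTL_ALLOWED_TOOLS"), and the returned set would then filter both tools/list and tools/call.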
At v2.7.0, nineteen first-party plugins ship in-tree across two families. Agent frameworks: Claude Code, Codex CLI, Cursor, Eliza, Gemini CLI, Goose, Hermes, OpenClaw, OpenCode, Pi, Rig, Virtuals Game, and Zerebro. Trading bots: Freqtrade, Jesse, Hummingbot, NautilusTrader, OctoBot, and Coinbase AgentKit. Each plugin is an idempotent installer that wires the brainctl MCP server into its host framework's native configuration surface and, where the framework has a distinct memory abstraction, adapts it to the Brain interface. Trading-bot plugins additionally ship a strategy-mixin pattern that gives a live strategy persistent recall over its own past trades, regimes, and parameter sweeps without forcing the operator to write the persistence layer themselves.
The plugin tree spans three structurally distinct shapes — pure-MCP registration via the host's native config (Claude Code, Codex CLI, Cursor, Gemini CLI, Goose, Hermes, the trading bots), MCP plus first-party hook scripts in the host's native runtime (OpenClaw skills, OpenCode TypeScript hooks), and proxied-MCP via a community adapter for hosts that deliberately ship without built-in MCP support (Pi via the pi-mcp-adapter proxy convention). The shape is dictated by what each host framework actually exposes, not by preference; the surface area an integration can occupy is exactly the surface its host gives it.
The Codex CLI plugin, added in v1.4.0, is representative. Codex discovers MCP servers from ~/.codex/config.toml, so the plugin is a sentinel-wrapped installer that merges a [mcp_servers.brainctl] block into the user's config without disturbing other servers, with automatic backup, dry-run preview, and clean uninstall. It ships with an AGENTS.md.template that teaches Codex the orient / wrap_up lifecycle on every session start.
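A sentinel-wrapped merge can be sketched as plain text manipulation: the installer owns only the lines between two marker comments, so re-running it replaces that block and leaves everything else alone. The marker strings, the function names, and the exact contents of the server block below are illustrative assumptions, not copied from the plugin; backup and dry-run handling are omitted.

```python
BEGIN = "# >>> brainctl managed block >>>"   # hypothetical sentinel
END = "# <<< brainctl managed block <<<"     # hypothetical sentinel
SERVER_BLOCK = '[mcp_servers.brainctl]\ncommand = "brainctl-mcp"\n'


def merge_managed_block(config_text: str) -> str:
    """Insert or refresh the managed block without touching other content."""
    managed = f"{BEGIN}\n{SERVER_BLOCK}{END}"
    start = config_text.find(BEGIN)
    end = config_text.find(END)
    if start != -1 and end > start:
        # Block already present: replace it in place (idempotent).
        return config_text[:start] + managed + config_text[end + len(END):]
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + managed + "\n"


def remove_managed_block(config_text: str) -> str:
    """Uninstall: drop the managed block and keep everything else."""
    start = config_text.find(BEGIN)
    end = config_text.find(END)
    if start == -1 or end < start:
        return config_text
    return config_text[:start] + config_text[end + len(END):].lstrip("\n")
```

Working at the text level rather than round-tripping through a TOML parser keeps the user's comments and formatting intact, at the cost of trusting the sentinels.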
The OpenClaw plugin (added in v1.6.0) takes a different shape from the config-file-merge installers: it ships as a skill with an AGENTS.md snippet injection, because OpenClaw's multi-agent topology discovers brainctl through its skill registry rather than a static config file. Any remaining stdio-speaking MCP client — VS Code, Claude Desktop, Zed — works out of the box without a dedicated plugin.
concurrency and durability
SQLite's concurrency model is both simpler and stronger than that of most networked databases for the access pattern brainctl has. Writes are serialized, so at any moment there is exactly one writer, and reads run concurrently with other reads and with the active writer. In Write-Ahead Logging (WAL) mode, which brainctl uses unconditionally, readers never block writers and writers never block readers; the only lock contention is writer-on-writer, resolved by a short retry with exponential backoff.
For a memory layer this is the right model. Agents write sparsely (a few to a few dozen rows per session), read frequently (every orient call is a small burst of reads), and almost never need two concurrent writers on the same brain. In the rare case where two agents share a brain and race on a write, the second writer retries; brainctl's connection pool hides this from the caller. The lazy shared connection per Brain instance (added in v1.2.0) amortizes SQLite connection setup across a session, so a typical orient + dozen-write + wrap_up flow pays the connection cost once.
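The writer-on-writer retry can be sketched as a small wrapper around any write operation. This is a generic pattern under stated assumptions, not brainctl's internal code: the function name, the attempt count, and the delay constants are hypothetical, and busy detection here relies on the "locked" or "busy" wording of sqlite3.OperationalError messages.

```python
import random
import sqlite3
import time


def retry_on_busy(operation, attempts=5, base_delay=0.01, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff if SQLite is busy."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            message = str(exc)
            busy = "locked" in message or "busy" in message
            if not busy or attempt == attempts - 1:
                raise
            # Delay doubles each attempt; jitter keeps two racing writers
            # from retrying in lockstep.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

A caller would wrap the write in a closure, for example retry_on_busy(lambda: conn.execute(sql, params)). SQLite's own busy_timeout PRAGMA is an alternative that pushes the waiting into the engine.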
Durability comes from three layers. First, WAL journaling with synchronous=NORMAL writes every commit to the WAL file, which survives process crashes; because the WAL is only synced to disk at checkpoints in this mode, a power loss can roll back the most recent commits but leaves the database consistent. Second, the W(m) gate runs inside a transaction that either admits or rejects the memory atomically; there is no such thing as a partially-admitted memory. Third, the migration runner (v1.5.0, §2.1) wraps each migration in savepoints so an upgrade that fails halfway rolls back to the pre-migration state with no manual intervention. Operators who want stronger durability guarantees can switch to synchronous=FULL with a one-line PRAGMA change; the trade-off is roughly 2× write latency in exchange for fsync-per-commit.
Backup is trivial by design: copy the file. SQLite's online backup API (exposed via brainctl backup) does this safely against a running database by walking pages rather than blocking the writer. For point-in-time recovery the WAL file can be retained alongside the main database and replayed. No backup daemon, no snapshot coordinator, no distributed consensus.
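Python's sqlite3 module exposes the same online backup API through Connection.backup, so the page-walking copy can be sketched directly. How the brainctl backup command is implemented is not shown here; the helper name and the pages-per-step value are assumptions for illustration.

```python
import os
import sqlite3
import tempfile


def backup_brain(source: sqlite3.Connection, dest_path: str, pages_per_step: int = 256) -> None:
    """Copy a live database to dest_path a batch of pages at a time."""
    dest = sqlite3.connect(dest_path)
    try:
        # Copying in steps lets other connections make progress between batches.
        source.backup(dest, pages=pages_per_step)
    finally:
        dest.close()


# Demonstration: back up a small in-memory database and read it back.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE memories (content TEXT)")
src.execute("INSERT INTO memories VALUES ('rate-limit: 100/15s')")
src.commit()
with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "brain-backup.db")
    backup_brain(src, backup_path)
    check = sqlite3.connect(backup_path)
    restored = check.execute("SELECT content FROM memories").fetchall()
    check.close()
src.close()
```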
privacy and data isolation
brainctl is designed for the case where the brain holds data the operator wants kept close: private code, user identifiers, internal decisions, draft reasoning. Several architectural choices flow from that posture.
No network on the hot path. The Brain interface performs zero outbound network calls during orient, remember, decide, entity, search, or wrap_up. Embeddings are produced by a local Ollama instance; the FTS5 index is in-process; the vector index is in-process. An operator can physically disconnect the network and a brainctl agent will still function for every read and write path. The only code that touches the network is opt-in ingestion pipelines the operator wires up themselves.
Scope-based isolation. Every entity, memory, and decision carries a scope field with values like global, project:api-v2, or agent:reviewer-bot. Queries filter by scope at the storage layer, so an agent invoked with scope project:api-v2 cannot see memories written under project:billing unless they are explicitly marked global. Enforcement happens in SQL via parameterized query rewriting, not in application code that could be bypassed.
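A minimal sketch of scope filtering at the SQL layer follows. The table layout and function name are simplified assumptions rather than the brainctl schema; the point is only that the scope predicate is part of the parameterized query itself, so a caller cannot forget to apply it.

```python
import sqlite3


def memories_visible_to(conn: sqlite3.Connection, scope: str) -> list:
    """Return memories in the caller's scope plus anything marked global."""
    rows = conn.execute(
        "SELECT content FROM memories "
        "WHERE scope = ? OR scope = 'global' ORDER BY id",
        (scope,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memories ("
    "id INTEGER PRIMARY KEY, scope TEXT NOT NULL, content TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO memories (scope, content) VALUES (?, ?)",
    [
        ("project:api-v2", "rate-limit: 100/15s"),
        ("project:billing", "invoice rounding rule"),
        ("global", "operator prefers terse commit messages"),
    ],
)
visible = memories_visible_to(conn, "project:api-v2")
```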
Per-agent identity. Every write is attributed to an agent_id set at Brain(agent_id=...) construction time. The agent ID is the unit of source attribution for the source-monitoring layer (§7.3). Two agents sharing the same brain file see each other's memories only to the extent that scopes and trust levels allow — the trust-scoped RBAC layer (§7.4) enforces this at query time.
PII as a first-class signal. When a memory contains personally identifiable information, detection runs at write time (not audit time), and the result is stored alongside the memory in pii_audit. A read-time filter can redact PII before the memory reaches the working context. Operators can answer data-retention and right-to-erasure requests with a single SQL statement against the PII audit log, which matters for anyone shipping brainctl into a regulated environment.
What brainctl does not do: it does not encrypt brain.db at rest. SQLite has a proprietary encryption extension (SEE) and a free alternative (SQLCipher), and brainctl is compatible with both but does not ship encryption on by default. Operators who need at-rest encryption should either use SQLCipher directly or rely on filesystem-level encryption (FileVault, LUKS, BitLocker) — the same posture most operators use for git repositories holding sensitive code.
the memory model
Before describing the model, two epistemic notes. First, brainctl is inspired by the cognitive-science research it cites; it is not a model of any of it. We do not simulate hippocampal cell assemblies, we do not replay neural firing patterns, and we do not implement biologically plausible learning rules. What we do is borrow the architectural patterns these systems have evolved — episodic/semantic separation, decay-with-reinforcement, schema-driven consolidation, source attribution, belief revision under contradiction — and implement them as a SQLite schema with worker processes. Some of the mappings are tight (the Bayesian α/β confidence model is literal Bayesian inference; AGM belief revision follows the formal postulates), some are loose (the dream cycle is structurally analogous to sharp-wave ripple replay, not a simulation of it). Where the analogy is loose, we say so explicitly in the relevant section.
The goal is the smallest set of mechanisms that prevent the failure modes of stateless agents — forgetting, confabulation, contradiction loops, catastrophic supersedes, attention starvation. Cognitive science is the prior art that already worked through these failure modes in a different substrate; we borrow it the way distributed systems engineers borrow from biology, as a source of solved problems.
episodic, semantic, procedural
The three Tulvingian memory types map onto distinct write patterns. brainctl writes episodic memory freely (the append-only event stream), gates semantic memory aggressively (the W(m) worthiness check at §3.2), and treats procedural memory as a separate first-class store (the procedures + steps + sources tables added in v2.7.0, with their own status lifecycle and execution-feedback loop). The asymmetry is intentional: episodic storage is the sharp-wave ripple buffer in the mammalian analogy, so it stays cheap and high-volume; semantic storage shapes every future retrieval and must stay dense; procedural storage records how an agent actually gets a class of tasks done, so it accumulates fitness evidence rather than fading under decay. The consolidation cycle (§4) is the pipeline that promotes stable episodic patterns into semantic entries during quiet hours; the procedure-feedback loop (§3.7) is the analogous strengthening signal for procedural memory.
the worthiness gate W(m)
Every candidate semantic memory m is scored by a five-factor worthiness function before admission to long-term storage. The design follows Zhang et al.'s Adaptive Memory Admission Control (A-MAC, ICLR 2026 Workshop), which demonstrated F1=0.583 on the LoCoMo benchmark with 31% latency reduction. The five factors decompose the admission decision into interpretable, independently tunable dimensions:
$$W(m) = 0.15\,U(m) + 0.15\,C(m) + 0.20\,N(m) + 0.10\,T(m) + 0.40\,P(m)$$
where U is future utility, C is factual confidence, N is semantic novelty, T is temporal recency, and P is the content type prior. The gate admits m only when the weighted sum clears a category-specific threshold θ. Each factor pulls in a specific direction.
Future utility (weight 0.15). Estimates how likely the memory is to be retrieved and useful in a future session. A memory about a one-off debugging session has lower utility than a memory about a recurring integration pattern. In practice, brainctl approximates utility by combining the Q-value from temporal-difference learning on past retrieval outcomes (§5) with the category's historical access frequency.
Factual confidence (weight 0.15). The Bayesian posterior from §3.3 — a candidate with a high E[p] from its source trust score is more likely to be admitted than a low-confidence observation from an unverified tool output.
Semantic novelty (weight 0.20). How different m is from what the agent already knows. brainctl approximates novelty by the inverse of the maximum embedding similarity to any existing memory in the same scope: a candidate that closely matches something written an hour ago carries almost no novelty, while a candidate with no near neighbor above a similarity floor carries high novelty. This subsumes both the surprise signal (within category) and the cross-category deduplication signal — even a high-novelty write against its own category will be rejected if a near-identical memory exists elsewhere, preventing the "we wrote this as a lesson, then wrote the same thing as a convention three days later" pathology.
Temporal recency (weight 0.10). An exponential decay favoring recent candidates over stale ones, reflecting the observation that memories written closer to the current context are more likely to be contextually relevant.
Content type prior (weight 0.40). The single most influential factor. This is the historical accept rate per memory category — decision memories start with a higher admission bias than observation memories because the cost of losing a decision is higher than the cost of losing an observation. The prior also incorporates the dynamic ewc_importance score from §4.3: a candidate that would merge with or supersede an existing high-importance memory gets a negative prior shift, which is how brainctl protects load-bearing memories from being overwritten by slightly-different-but-wrong new writes.
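The five factors and their weights can be combined in a short sketch. The weights come from the text above; everything else is an assumption made for illustration: factor scores are taken to lie in [0, 1], novelty is computed as one minus the highest cosine similarity to any stored embedding, and the function names are hypothetical.

```python
import math

# Factor weights as stated in the text; they sum to 1.0.
WEIGHTS = {
    "future_utility": 0.15,
    "factual_confidence": 0.15,
    "semantic_novelty": 0.20,
    "temporal_recency": 0.10,
    "content_type_prior": 0.40,
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_novelty(candidate, existing):
    """1 - max similarity to any stored embedding; 1.0 when the store is empty."""
    if not existing:
        return 1.0
    nearest = max(cosine(candidate, other) for other in existing)
    return 1.0 - min(1.0, max(0.0, nearest))


def worthiness(factors):
    """Weighted sum W(m) over the five factor scores."""
    return sum(weight * factors[name] for name, weight in WEIGHTS.items())


def admits(factors, theta):
    """The gate admits the candidate when W(m) reaches the threshold."""
    return worthiness(factors) >= theta
```

Because the content type prior carries 0.40 of the weight, a candidate in a low-prior category needs strong scores on most other factors to clear a mid-range threshold.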
The outcome of rejection. If W(m) < θ, the candidate's evidence is not thrown away. Instead it is folded into its nearest match: the nearest match's recalled_count increments, its confidence posterior shifts slightly toward the new evidence, and its last-touched timestamp is updated so it resists decay. Only then is the rejected candidate itself discarded. This is how brainctl prevents the "we already know this fifty times" pathology that kills every naive journal-based memory: redundant writes strengthen the thing they would have restated, instead of creating noise.
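The reject-and-reinforce path can be sketched as follows. The record type, the function name, and the returned status strings are hypothetical; the confidence shift toward the new evidence is deliberately left out because the text does not specify its size.

```python
import time
from dataclasses import dataclass


@dataclass
class StoredMemory:
    content: str
    recalled_count: int = 0
    last_touched: float = 0.0


def admit_or_reinforce(content, score, theta, nearest, store, now=None):
    """Admit the candidate, or fold a rejected candidate into its nearest match."""
    now = time.time() if now is None else now
    if score >= theta:
        store.append(StoredMemory(content, last_touched=now))
        return "admitted"
    if nearest is None:
        return "rejected"            # nothing to reinforce
    nearest.recalled_count += 1      # redundant write counts as a recall
    nearest.last_touched = now       # refresh so the match resists decay
    return "merged"
```

The store only grows on admission, so fifty restatements of a known fact leave one row with a high recall count rather than fifty near-duplicates.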
Modification resistance. Older, high-importance memories develop increasing resistance to reconsolidation-induced change (O'Neill & Winters 2026). When a new write attempts to supersede a well-established memory, the gate's effective threshold rises with the target's age and importance, preventing catastrophic overwriting of knowledge that has been stable across many sessions.
Gate calibration feedback loop. The gate continuously tracks whether its admission decisions correlate with downstream utility: do high-scoring candidates actually get retrieved more often, and do retrieved memories actually contribute to task success? Calibration error is fed back into the factor weights via the memory_outcome_calibration table (Dunlosky & Metcalfe 2009), closing the loop between admission and outcome.
confidence: bayesian α/β
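Each memory carries a Bayesian posterior on reliability, parameterized as a Beta(α, β) distribution in which α accumulates successful recalls and β accumulates refutations; the expected confidence is the posterior mean α / (α + β).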
Memories begin at Beta(1, 1) (uniform prior) unless the caller supplies a trust override — user-written memories, for example, start at Beta(3, 1). Successful recalls are those that contributed to a confirmed outcome; refutations are updates where a subsequent decision invalidated the prior claim. High-stakes retrievals (the planning layer) filter by expected confidence so tentative hypotheses do not leak into load-bearing reasoning.
A per-category calibration log (memory_outcome_calibration) tracks whether the confidence estimates are well-calibrated over time, using Brier scores against observed outcomes.
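The confidence model can be illustrated with a standard Beta-Bernoulli update. The sketch below assumes each successful recall adds one to α and each refutation adds one to β (the unit step size is an assumption), and the `Confidence` class and `brier_score` helper are hypothetical names for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Confidence:
    alpha: float = 1.0   # Beta(1, 1): uniform prior
    beta: float = 1.0

    def record(self, success: bool) -> None:
        """Update the posterior with one confirmed recall or one refutation."""
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def expected(self) -> float:
        """Posterior mean of the Beta distribution."""
        return self.alpha / (self.alpha + self.beta)

def brier_score(predictions, outcomes):
    """Mean squared error between predicted confidence and 0/1 outcomes."""
    pairs = list(zip(predictions, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)
```

A user-written memory starting at Beta(3, 1) has expected confidence 0.75 before any evidence arrives, which is what lets a planning layer filter on expected confidence from the first retrieval.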
decay and forgetting
Unsupported memories age on an Ebbinghaus-style retention curve:
$$R(t) = e^{-t/S}$$
where t is time since last access and S is a strength parameter that grows with every successful recall — the testing effect (Ebbinghaus 1885). When retention drops below a category-specific floor, the memory is marked for consolidation review rather than deleted outright. Decay is a soft pressure, not a hard delete, because a seemingly-stale memory may become load-bearing under a future topic shift.
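As a worked illustration, the sketch below assumes the standard exponential form of the forgetting curve, R = exp(-t/S). The multiplicative growth factor of 1.5 per successful recall and all function names are assumptions made for the example.

```python
import math

def retention(t_hours, strength):
    """Exponential retention: 1.0 at t = 0, 1/e when t equals the strength S."""
    return math.exp(-t_hours / strength)

def on_successful_recall(strength, growth=1.5):
    """Testing effect: each successful recall makes the memory decay more slowly."""
    return strength * growth

def needs_review(t_hours, strength, floor):
    """Below the category floor the memory is flagged for review, not deleted."""
    return retention(t_hours, strength) < floor
```

With S = 24 hours, a memory untouched for 48 hours has retention of about 0.135 and would be flagged against a floor of 0.2; after one successful recall raises S to 36, the same gap leaves it at about 0.264 and it survives.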
The dual concern — forgetting too little vs forgetting too much — is the central tension in continual learning. McCloskey and Cohen 1989 first characterized catastrophic interference in connectionist networks: training a neural net sequentially on task A then task B causes near-total loss of task A performance. Parisi et al. 2019 reviewed the modern continual-learning literature and grouped responses into three families: regularization-based (penalize weight changes that hurt prior tasks, exemplified by EWC), structural (allocate new capacity for new tasks), and replay-based (interleave old samples with new ones during training). brainctl uses regularization and replay in combination — EWC-style importance weights protect load-bearing memories, the dream cycle replays episodic traces during quiet hours — and treats forgetting as a controlled pressure rather than something to eliminate. Total recall is its own pathology.
belief revision (AGM)
Contradictions are data. When a new fact m' contradicts an existing memory m, brainctl runs an Alchourrón–Gärdenfors–Makinson style revision (Alchourrón, Gärdenfors, Makinson 1985). The full AGM framework defines a larger set of rationality postulates; three of them shape brainctl's revision:
1. Closure: the belief set remains closed under logical consequence after revision.
2. Success: the new fact is admitted to the revised set.
3. Inclusion (minimality): the revision makes the smallest change consistent with 1 and 2.
The losing belief is not deleted. It is written to belief_collapse_events with the collapse reason, the winner's citation, credibility rankings, and a full provenance chain. This lets operators reverse a revision if they later discover the winner was itself unreliable — the superseded belief can be recovered with its history intact. Open conflicts are surfaced via belief_conflicts and can be resolved interactively through the resolve_conflict tool.
prospective memory
A memory system built only around retrospective storage — facts the agent has already learned — is structurally missing half of human memory. Prospective memory is the cognitive capacity to remember to do something in the future: to notice, when a specific context recurs, that there is a pending intention attached to it. Einstein and McDaniel 1990 characterized prospective memory as distinct from retrospective memory, with its own failure modes (the agent forgets the intention itself) and its own success signals (the intention fires at the right moment without explicit query).
brainctl implements prospective memory as a first-class primitive via the memory_triggers table (migration 014, extended in subsequent releases). A trigger is a small record with three parts: a precondition (a set of keywords, an entity reference, or a task context), an action hint (what the agent should remember to consider when the precondition matches), and a lifecycle state (pending / fired / retired).
    # seed a prospective memory at decision time
    brain.trigger_create(
        precondition="billing schema",
        action="remember: invoices table was renamed to charges last week",
        scope="project:billing",
    )

    # later, during a future session, a query about billing
    # automatically surfaces the trigger before the agent acts
    ctx = brain.orient(project="billing")
    # ctx includes any triggers whose preconditions match
    # the orient context window.
The trigger_check tool is called automatically on every orient call and on every search that crosses a relevance threshold. Matched triggers are surfaced into the working context with a visual marker so the agent can distinguish a prospective-memory hit from a regular retrieval. Fired triggers transition to the fired state and no longer match, preventing the same intention from firing indefinitely; triggers that become stale without firing transition to retired on a configurable timeout.
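The lifecycle described above (pending, then fired or retired) can be sketched as a small state machine. The `Trigger` dataclass, the `check_triggers` function, the keyword-subset matching rule, and the 30-day timeout are hypothetical; the actual trigger_check tool may match preconditions differently.

```python
from dataclasses import dataclass

@dataclass
class Trigger:
    precondition: str          # space-separated keywords (assumed format)
    action: str
    created_at: float
    state: str = "pending"     # pending / fired / retired

def check_triggers(triggers, context, now, ttl_seconds=30 * 24 * 3600):
    """Return triggers that fire for `context`, updating lifecycle states in place."""
    words = set(context.lower().split())
    fired = []
    for t in triggers:
        if t.state != "pending":
            continue                       # fired and retired triggers never match
        if now - t.created_at > ttl_seconds:
            t.state = "retired"            # went stale without firing
            continue
        if set(t.precondition.lower().split()) <= words:
            t.state = "fired"              # fires once, then stops matching
            fired.append(t)
    return fired
```

Because the state transition happens inside the check, a second call with the same context returns nothing, which is the "does not fire indefinitely" property.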
Prospective-memory failure is one of the most common failure modes of naive agent memory. A stateful agent can remember that "we renamed this table" perfectly well as a retrospective fact, but if the agent is not prompted to recall it, the fact is functionally unreachable. Prospective triggers close that gap by making the act of recall context-addressed rather than query-addressed: the memory finds the agent, not the other way around.
procedural memory
Episodic memory captures what happened. Semantic memory captures what is true. Neither captures how to do something — the reusable sequences of tool calls, decisions, and recovery steps an agent assembles through repeated practice. Tulving 1972 separated procedural from declarative memory for precisely this reason: skill-like knowledge accrues through repetition rather than through one-shot encoding, and it lives in a different cognitive subsystem than facts and episodes.
brainctl shipped procedural memory as a first-class store in v2.7.0, after Velamj contributed it as PR #94 on 2026-04-24. Migration 052 adds three canonical tables: procedures (the procedure itself with a name, description, status, and aggregate fitness counter), procedure_steps (ordered tool invocations or sub-procedures with optional preconditions), and procedure_sources (provenance pointers back to the episodic events or semantic memories the procedure was derived from). An FTS5 virtual table sits over the procedures+steps content so queries can search by tool name, step text, or description simultaneously.
Status lifecycle. Procedures move through three states: draft (just authored, not yet validated against outcomes), active (validated by at least one successful procedure_feedback call), and retired (either superseded by a better procedure or explicitly retired by the agent after repeated failure). Status is consulted by procedure_search so a query for "how do I do X?" by default surfaces active procedures first.
Fitness tracking. Every execution outcome flows back through procedure_feedback, which increments per-procedure success and failure counters, updates a last-execution timestamp, and adjusts an internal fitness score. The fitness score is consulted at retrieval time as a small reranking signal: among procedures that match the query, those with higher demonstrated success rate surface earlier. This is the procedural analogue of the Bayesian α/β confidence model on semantic memories in §3.3 — concrete outcome evidence updating a posterior, with no LLM call required.
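A minimal sketch of outcome-driven fitness, assuming a Laplace-smoothed success rate as the fitness score and a small additive reranking weight. Both choices, along with the `Procedure` class and `rerank` helper, are illustrative assumptions and not the formula brainctl uses.

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    name: str
    status: str = "draft"
    successes: int = 0
    failures: int = 0

    def feedback(self, success: bool) -> None:
        """Record one execution outcome; the first success promotes a draft."""
        if success:
            self.successes += 1
            if self.status == "draft":
                self.status = "active"
        else:
            self.failures += 1

    @property
    def fitness(self) -> float:
        """Smoothed success rate: 0.5 with no evidence, moving with each outcome."""
        return (self.successes + 1) / (self.successes + self.failures + 2)

def rerank(matches, fitness_weight=0.1):
    """matches: list of (procedure, text_score); fitness is a small tiebreaker."""
    return sorted(matches, key=lambda m: m[1] + fitness_weight * m[0].fitness,
                  reverse=True)
```

Keeping the fitness weight small means text relevance still dominates, and demonstrated success only separates procedures that match the query about equally well.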
Bridge synopsis. Each procedure also writes a small bridge row into memories with category convention or lesson, summarizing the procedure's name and purpose. Legacy memory_search calls therefore still find procedures via their text content, without callers needing to switch APIs. The bridge row is recomputed when the underlying procedure is updated, so the synopsis stays in sync with the canonical content.
CLI + MCP surface. brainctl procedure {add|get|list|search|update|feedback|backfill|stats} on the CLI side; eight MCP tools (procedure_add, procedure_get, procedure_list, procedure_search, procedure_update, procedure_feedback, procedure_backfill, procedure_stats) on the agent side. The backfill command exists specifically for the case where an agent has been operating without procedural memory and wants to derive procedures retroactively from what it has already written: it scans existing memories and decisions for procedure-like content and turns the matches into procedure candidates.
Acquisition discipline. procedure_backfill with dry_run=False walks the memories + decisions tables, applies a regex classifier (looks_procedural: how-to phrasing, if-then conditionals, rollback language, step markers, ordering hints), and writes accepted candidates straight into procedures — no operator approval in the loop. That covers acquisition from already-written semantic memories. Acquisition directly from the raw event stream (learned clustering of successful tool sequences without a hand-designed classifier) is the next layer, listed in §11.1.
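To make the classifier idea concrete, here is a hypothetical stand-in for looks_procedural covering the five signal families named above. The specific regular expressions and the two-hit threshold are assumptions for illustration; the real patterns are not given in the text.

```python
import re

# One pattern per signal family listed in the text (patterns are illustrative).
_PATTERNS = [
    re.compile(r"\bhow to\b", re.I),                            # how-to phrasing
    re.compile(r"\bif\b.+\bthen\b", re.I),                      # if-then conditionals
    re.compile(r"\b(roll ?back|revert)\b", re.I),               # rollback language
    re.compile(r"^\s*(step\s+\d+|\d+[.)])\s", re.I | re.M),     # step markers
    re.compile(r"\b(first|then|next|finally|before|after)\b", re.I),  # ordering hints
]

def looks_procedural(text, min_hits=2):
    """True when at least `min_hits` distinct signal families appear in the text."""
    hits = sum(1 for p in _PATTERNS if p.search(text))
    return hits >= min_hits
```

Requiring two independent signals is one way to keep a plain declarative fact that happens to contain the word "then" from being promoted to a procedure.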
consolidation
the consolidation pipeline
Consolidation runs as an eight-phase NREM/REM pipeline, structurally analogous to the mammalian sharp-wave ripple replay observed in the hippocampus during slow-wave sleep and quiet wakefulness. It is demand-driven rather than scheduled: a homeostatic pressure metric (total confidence mass divided by active memory count) accumulates during normal operation, and when pressure exceeds a configurable setpoint, consolidation fires without waiting for a fixed interval — matching Tononi & Cirelli's (2003) synaptic homeostasis hypothesis, in which sleep pressure builds during waking and is discharged during sleep. Consolidation can also be invoked manually via brainctl-consolidate dream-cycle. The quiet-hours cron is a separate housekeeping pipeline that runs decay passes and bookkeeping; it is not the consolidation pipeline.
The neuroscience worth grounding here. Sharp-wave ripples (SWRs) are short, high-frequency (≈150–250 Hz) oscillations in the CA1 region of the hippocampus, characterized in detail by Buzsáki and colleagues over four decades and reviewed in Buzsáki 2015. Wilson and McNaughton 1994 showed that place-cell sequences active during a maze-traversal task reactivate during subsequent sleep, at compressed timescales, in the same temporal order as the original experience. Diba and Buzsáki 2007 extended this finding to reverse replay during awake immobility, and Karlsson and Frank 2009 demonstrated that even remote experiences — places the animal had not visited recently — reactivate during awake SWRs. The consistent finding across these studies is that hippocampal replay during quiet states is causally implicated in memory consolidation and the gradual integration of episodic detail into cortical semantic memory. McClelland, McNaughton, and O'Reilly 1995 formalized this in their Complementary Learning Systems theory: a fast, sparse hippocampal store interleaves new experiences during quiet hours into a slow, distributed cortical store, balancing rapid acquisition against catastrophic interference.
Replay is one of the few mechanisms that survived the move from biological neural systems into artificial ones essentially unchanged. Lin 1992 introduced experience replay in reinforcement learning: store past transitions in a buffer, revisit them during training. Mnih et al. 2015 — the Deep Q-Network paper that first achieved human-level play on Atari — attributed much of DQN's sample efficiency and stability to its experience replay buffer, sampling mini-batches uniformly from past transitions to break the harmful correlation between sequential samples. Schaul et al. 2016 extended this with prioritized experience replay, sampling buffer entries proportionally to their TD-error so that the most surprising transitions are revisited more often. brainctl's replay phase sits in this lineage: rather than uniform sampling (Lin / Mnih) or TD-error-proportional sampling (Schaul), it replays entity-clustered, salience-weighted candidates — closer to content-aware prioritized replay than to stochastic sampling, but in the same family.
Once triggered, the pipeline proceeds through eight ordered phases:
- N2 – synaptic tagging. Memories within the labile window (recently written, or flagged by high-importance events within a ±2-hour window) are tagged to protect them from the global downscaling that follows (Frey & Morris 1997). This mirrors the biological synaptic tag-and-capture mechanism, where strong encoding events set a molecular tag that protects synapses from homeostatic downscaling.
- N3 – proportional downscaling. A single multiplicative factor (setpoint / pressure) is applied to all non-permanent, non-tagged memories. High-importance memories resist downscaling through an importance-dependent exponent, factor^(1 − importance), so a memory with importance 1 is left untouched; this implements an analog of Elastic Weight Consolidation (Kirkpatrick et al. 2017). Memories that fall below a retirement threshold are soft-deleted (Tononi & Cirelli 2014).
- Entity-clustered replay. Replay candidates are grouped by shared entity references (Niediek et al. 2026) and ranked by salience magnitude (Robinson et al. 2026). Replay is decoupled from Hebbian strengthening — the system replays broadly, then tags selectively (Widloski & Foster 2025). Memories created near high-importance events (importance ≥ 0.7) receive elevated replay weight, mirroring Yang & Buzsáki's (2024) finding that awake sharp-wave ripples tag episodic memories for subsequent sleep consolidation.
- Coupling gate. Only memories with at least one knowledge-graph edge pass for promotion; isolated, unconnected memories are held back until the entity linker (§5.8) provides connectivity (Schwimmbeck et al. 2026).
- Schema-accelerated promotion. Memories with ≥ 3 entity links bypass the normal episodic holding period and are immediately promoted to semantic tier, implementing Tse et al.'s (2007) finding that schema-consistent information consolidates an order of magnitude faster than schema-inconsistent information.
- De-overlap. Similar-but-distinct memories are detected and their boundaries sharpened, mirroring the brain's active separation of overlapping representations during sleep (Aquino Argueta et al. 2026).
- REM – dream synthesis. Bisociation synthesis generates cross-domain connections between memories that share latent structure but occupy different knowledge-graph clusters. Affect dampening preserves the factual content of emotionally tagged memories while reducing affective intensity (Walker & van der Helm 2009). Isolated-memory bridge discovery finds memories with zero edges, embeds them, and connects each to its nearest semantic neighbor if the similarity clears a threshold — where the pipeline does its actual creative work.
- Housekeeping. Hebbian strengthening of tagged co-accessed pairs, metric updates, label-propagation community detection on the knowledge graph, high-betweenness bridge node identification, and tag cycle decrement.
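The first two phases (tagging, then proportional downscaling) can be sketched as below. This reads the importance term as an exponent on the downscaling factor, which is an interpretation of the text; the dict layout, the 0.05 retirement threshold, and the cap that prevents upscaling are assumptions.

```python
def downscale(memories, setpoint, retire_below=0.05):
    """Apply N3-style proportional downscaling in place and return the factor.

    memories: list of dicts with 'confidence' and 'importance' in [0, 1], plus
    optional 'tagged', 'permanent', and 'retired' flags.
    """
    active = [m for m in memories if not m.get("retired")]
    if not active:
        return 1.0
    # Pressure as defined in the text: total confidence mass / active memory count.
    pressure = sum(m["confidence"] for m in active) / len(active)
    if pressure <= 0:
        return 1.0
    factor = min(1.0, setpoint / pressure)   # assumed: never scale upward
    for m in active:
        if m.get("permanent") or m.get("tagged"):
            continue                         # N2 tags protect labile memories
        # importance 1.0 -> exponent 0 -> untouched; importance 0.0 -> full factor
        m["confidence"] *= factor ** (1.0 - m["importance"])
        if m["confidence"] < retire_below:
            m["retired"] = True              # soft delete
    return factor
```

With a setpoint of 0.4 and three memories at confidence 0.8, the factor is 0.5: an unimportant memory drops to 0.4, while a tagged one and a maximally important one keep their 0.8.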
Spacing-effect decay. Between consolidation cycles, memory stability increases when a memory is recalled at well-spaced intervals (inter-study interval ≥ 15% of retention interval per category), based on Cepeda et al.'s (2006) meta-analysis of 839 assessments from 317 experiments on distributed practice. Each memory carries a next_review_at timestamp computed from its temporal class and stability; the consolidation cycle checks for due reviews and replays them at expanding intervals, constituting an integrated spaced-repetition system (Cepeda et al. 2006; Murre & Dros 2015).
The load-bearing functional analogy: selective re-processing of episodic experience during a quiet period improves the structure of long-term memory. The eight phases implement this in sequence — tag what matters, downscale what does not, replay broadly, gate on connectivity, accelerate schema-consistent memories, separate overlapping representations, synthesize cross-domain connections, clean up. The replay_priority and ripple_tags columns in the schema borrow SWR terminology as a naming convention for the offline re-processing pass; the consolidation pipeline and the W(m) write-time admission gate are separate pipelines that operate at different points in the memory lifecycle.
proactive interference gate
A Proactive Interference Index — the PII gate — blocks supersedes that would erase too-recent context. Without it, a long-running agent can overwrite load-bearing memories with a noisy summarization pass and enter a catastrophic forgetting state. The gate computes a recency-weighted dependency score over the memory graph and refuses supersede operations that would drop the score of any descendant below a threshold.
The gate is enforced at write time, logged to pii_audit, and can be reviewed. Rejected supersedes are not dropped silently — they are written to a pending queue and either reconciled by the agent in a later session or escalated to the operator.
ewc-style importance weighting
Inspired by Elastic Weight Consolidation (Kirkpatrick et al. 2017), brainctl maintains an importance score for each memory based on how often it is touched during planning and how many downstream decisions cite it. The analogue to EWC's Fisher information is a simpler access-frequency × citation-depth product, stored in ewc_importance. The score is consulted by the W(m) write-time gate: when a new memory would merge into or supersede an existing one, a candidate that wants to displace a high-importance memory has to clear a higher worthiness bar than one that would displace an obscure note. Importance is a write-time prior, not a consolidation operation.
schema integration
A complementary frame for what consolidation is doing comes from schema theory. Bartlett 1932, in Remembering, showed that human memory is reconstructive rather than reproductive: people do not retrieve experiences verbatim, they reconstruct them by fitting fragments into pre-existing schemas — organized knowledge structures that capture what kinds of things tend to go together. His "War of the Ghosts" experiment demonstrated that participants reshaped an unfamiliar Native American folk tale toward their own cultural schemas with each retelling. Rumelhart 1980 later formalized schemas as the building blocks of cognition: data structures for representing the generic concepts stored in memory, into which new experiences are slotted and against which they are interpreted.
brainctl's semantic memory is, in effect, a small set of explicit schemas. The nine memory categories — convention, decision, environment, identity, integration, lesson, preference, project, user — are not arbitrary tags; they are the schemas that semantic memories must fit into to be admitted, each with its own confidence prior and decay constants. Viewed through the schema-theoretic lens, the W(m) worthiness gate is a schema-fit check: if a candidate memory does not fit any existing schema and is not novel enough to warrant a new instance, it gets merged into its nearest schematic match. Consolidation, in turn, is the process of pulling stable patterns out of episodic detail and fitting them into these schemas — the same compression Bartlett observed in human reconstruction.
cost, scheduling, and failure modes
Consolidation is expensive relative to the hot path — a single cycle touches hundreds or thousands of rows, runs embedding comparisons, performs graph traversal, and invokes a local LLM for the REM-phase bisociation step. Running it on the critical path would blow up orient and wrap_up latency. The whole point of the homeostatic trigger (§4.1) is to take this work offline and batch it behind the back of the agent.
Scheduling semantics. The homeostatic trigger fires on demand when consolidation pressure exceeds the configured setpoint. Two supplementary conditions serve as fallbacks: inactivity (no events written in the last IDLE_SECONDS, default 300) or volume pressure (the number of new episodic writes since the last cycle exceeds PRESSURE_EVENTS, default 50). The volume condition prevents a continuously-active agent from never consolidating. A manual invocation via brainctl-consolidate dream-cycle ignores all conditions and runs immediately. The quiet-hours cron is a separate pipeline that runs housekeeping (decay passes, calibration updates, stale-trigger sweeps), not the eight-phase consolidation pipeline itself.
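The trigger logic can be summarized in a few lines. The constant names and defaults (IDLE_SECONDS = 300, PRESSURE_EVENTS = 50) and the pressure definition come from the text; the function names, the order in which conditions are checked, and the returned reason strings are assumptions for illustration.

```python
IDLE_SECONDS = 300      # default from the text
PRESSURE_EVENTS = 50    # default from the text

def consolidation_pressure(confidences):
    """Total confidence mass divided by active memory count."""
    return sum(confidences) / len(confidences) if confidences else 0.0

def should_consolidate(confidences, setpoint, seconds_since_last_event,
                       events_since_last_cycle, manual=False):
    """Return the reason a cycle should start, or None if it should not."""
    if manual:
        return "manual"                     # manual invocation ignores all conditions
    if consolidation_pressure(confidences) > setpoint:
        return "homeostatic"                # primary, demand-driven trigger
    if seconds_since_last_event >= IDLE_SECONDS:
        return "idle"                       # fallback: the agent has gone quiet
    if events_since_last_cycle > PRESSURE_EVENTS:
        return "volume"                     # fallback: busy agent, many new writes
    return None
```

Returning the reason instead of a bare boolean makes it easy to log why a given cycle ran, which helps when diagnosing contention with the separate quiet-hours housekeeping job.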
Cost envelope. Rough figures on a single-agent brain with ~5k semantic memories and ~20k events, measured on an M2 MacBook Pro: the combined NREM phases (tagging, downscaling, replay, coupling, schema acceleration, de-overlap) run in ~2–3 seconds, REM bisociation takes 2–4 seconds depending on how many hypothesis candidates the LLM is asked to score, and housekeeping runs in ~500 ms. Total wall-clock for a full cycle on a brain of that size therefore sits at roughly 4.5 to 7.5 seconds. REM is the most expensive single phase because it is the only one that invokes the LLM; the rest are pure SQL and Python compute.
Failure modes and idempotency. Consolidation is designed to be crash-safe. Each phase writes to the database in its own transaction and records its progress in consolidation_events. If the process crashes mid-cycle:
- any database writes from completed phases are durable (committed at phase boundaries, not at cycle end)
- the next invocation of the cycle detects the incomplete run, replays the unfinished phase, and proceeds normally
- the W(m) gate still applies on any semantic writes that occur during REM bisociation — so a partially-consolidated brain cannot end up with duplicate or malformed memories
The worst outcome of an interrupted consolidation is a small amount of duplicated work on the next run, not corruption. Consolidation is also safe to skip entirely — a brain that never consolidates is still fully functional, just denser and with less hypothesis-generated structure.
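The per-phase commit pattern can be sketched with Python's standard sqlite3 module. The table name consolidation_events comes from the text, but its two-column schema, the phase identifiers, and the `run_cycle` function are hypothetical and chosen only to demonstrate resumability.

```python
import sqlite3

# Illustrative identifiers for the eight phases, in pipeline order.
PHASES = ["n2_tagging", "n3_downscaling", "replay", "coupling_gate",
          "schema_promotion", "de_overlap", "rem_synthesis", "housekeeping"]

def run_cycle(conn, cycle_id, handlers):
    """Run every phase not yet recorded for `cycle_id`, one transaction per phase."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS consolidation_events "
        "(cycle_id TEXT, phase TEXT, PRIMARY KEY (cycle_id, phase))"
    )
    done = {row[0] for row in conn.execute(
        "SELECT phase FROM consolidation_events WHERE cycle_id = ?", (cycle_id,))}
    for phase in PHASES:
        if phase in done:
            continue                         # finished before the crash, skip it
        # `with conn` commits on success and rolls back if the handler raises,
        # so the phase's writes and its progress marker land atomically.
        with conn:
            handlers[phase](conn)
            conn.execute("INSERT INTO consolidation_events VALUES (?, ?)",
                         (cycle_id, phase))
```

If a handler raises mid-cycle, the phases already recorded stay committed, and calling `run_cycle` again with the same cycle id resumes at the phase that failed.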
What can go wrong. A misconfigured quiet-hours cron can trigger housekeeping at the same moment the homeostatic trigger fires a consolidation cycle, causing writer contention (resolved by SQLite's retry but measurable as latency). A runaway REM-phase LLM call can hang the cycle if the upstream model is unreachable — a configurable timeout aborts the phase and marks it for retry on the next cycle. The consolidation_stats tool surfaces per-phase timing and error counts for operators who want to monitor the pipeline.
causal reasoning and temporal structure
An agent memory system that stores and retrieves facts without causal structure is fundamentally limited: it cannot explain why a fact matters, what would change if the fact were different, or how facts at different time scales relate. brainctl addresses this with four mechanisms that layer causal and temporal structure on top of the consolidation pipeline.
Typed causal edges. The knowledge graph supports three causal relation types: causes, enables, and prevents. A causal chain tracer follows these edges forward from any memory or event up to a configurable hop limit, giving the agent (or an auditor) a concrete answer to "what led to this outcome?"
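A forward causal trace with a hop limit amounts to a bounded breadth-first search over the causal edge types. The edge-triple representation and the `trace_forward` function below are hypothetical; only the three relation names come from the text.

```python
from collections import deque

CAUSAL_TYPES = {"causes", "enables", "prevents"}

def trace_forward(edges, start, max_hops=3):
    """Follow causal edges outward from `start`, at most `max_hops` deep.

    edges: iterable of (source, relation, target) triples.
    Returns a list of (source, relation, target, depth) in breadth-first order.
    """
    out = {}
    for src, rel, dst in edges:
        if rel in CAUSAL_TYPES:              # ignore non-causal relations
            out.setdefault(src, []).append((rel, dst))
    chain, seen, queue = [], {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue                         # hop limit reached on this branch
        for rel, dst in out.get(node, []):
            chain.append((node, rel, dst, depth + 1))
            if dst not in seen:              # avoid revisiting nodes in cycles
                seen.add(dst)
                queue.append((dst, depth + 1))
    return chain
```

Running the same search over reversed edges gives the backward direction, from an outcome to the memories that led to it.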
Counterfactual attribution. Working backward from a task outcome, counterfactual attribution (Kang et al. 2025) traces the causal graph in reverse and boosts the Q-values of contributing memories proportional to their edge weight — answering "which memories caused this success?" This closes the loop between retrieval (Q-value reranking from §5) and outcome (causal attribution), making the system genuinely self-improving: memories that demonstrably contributed to good outcomes surface more often in future retrievals.
Temporal abstraction hierarchy. Memories are assigned to a six-level temporal hierarchy based on age: moment (< 12h), session (12h–1d), day (1–7d), week (7–30d), month (30–90d), quarter (> 90d). Hierarchical summarization compresses moment-level memories into day-level summaries, day-level into week-level, and so on — achieving significant memory length reduction while maintaining retrieval quality across time scales (Shu et al. 2025). The hierarchy gives the consolidation pipeline a natural multi-resolution structure: recent memories are preserved in full detail, while older memories are compressed into summaries that retain their load-bearing content.
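The age-to-level mapping is simple enough to state directly. The boundaries below follow the six levels listed in the text; placing an age of exactly 90 days in quarter and the function name are assumptions, since the text leaves the boundary cases open.

```python
# (exclusive upper bound in hours, level name), in ascending order
LEVELS = [
    (12, "moment"),
    (24, "session"),
    (7 * 24, "day"),
    (30 * 24, "week"),
    (90 * 24, "month"),
]

def temporal_level(age_hours):
    """Map a memory's age in hours to its level in the temporal hierarchy."""
    for upper, name in LEVELS:
        if age_hours < upper:
            return name
    return "quarter"
```

A summarization pass can then group memories by level and compress each group into the next level up once it ages out.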
Belief collapse. Superposed beliefs — states where the agent holds multiple mutually exclusive interpretations simultaneously — are resolved to definite states via four collapse triggers: task checkout (the agent acts on a belief), direct query (an external observer asks), evidence threshold (accumulated evidence exceeds a confidence bound), and time decoherence (the belief has been superposed longer than a configurable window). Each collapse event is logged with the measured amplitude and collapse context, preserving the pre-collapse state for auditability. The quantum Zeno effect applies: frequent measurement slows collapse, so agents that query a belief repeatedly keep it superposed longer than agents that ignore it.
retrieval
Retrieval-augmented language models are now an established design pattern. Lewis et al. 2020 introduced retrieval-augmented generation (RAG), pairing a dense retriever with a generator that conditions on retrieved passages. Karpukhin et al. 2020 (Dense Passage Retrieval) showed that a learned dense retriever could outperform traditional sparse methods like BM25 on open-domain QA. Khandelwal et al. 2020 (kNN-LM) demonstrated that augmenting a language model with nearest-neighbor lookups against a datastore at inference time improves perplexity without retraining the model. Borgeaud et al. 2022 (RETRO) scaled the retrieval database to 2 trillion tokens and reported that a 7.5B-parameter model with retrieval performs comparably to GPT-3 and Jurassic-1 on the Pile despite using roughly 25× fewer parameters. The pattern is mature.
brainctl's retrieval layer is a small variant of this pattern with two differences. First, the retrieval target is the agent's own structured memory (a typed graph plus episodic and semantic stores) rather than an external corpus of passages. Second, the merge between lexical and semantic ranking is reciprocal rank fusion rather than a learned re-ranker, because the corpus is small enough and the heterogeneity high enough that RRF beats learning-to-rank on the kind of corpora a single agent's brain produces. The next four subsections describe the layer concretely.
hybrid search
All queries go through a single search interface that fans out to two indexes. FTS5 handles lexical search with BM25 ranking, stemming, and phrase queries. sqlite-vec handles semantic search with cosine similarity over locally-computed embeddings. Both return ranked result lists.
reciprocal rank fusion
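The two result lists are merged with reciprocal rank fusion (Cormack, Clarke, Buettcher 2009):
$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + r(d)}$$
where R is the set of rankers (here, the lexical and the semantic list), r(d) is the 1-based rank of document d in ranker r, and k is a smoothing constant (60 in the original paper).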
RRF is robust to score scale differences between lexical and semantic rankers, does not require calibration, and is parameter-light. Empirically it beats both linear combination and learning-to-rank approaches on the kind of small, heterogeneous corpora a single agent's brain produces.
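Reciprocal rank fusion is short enough to show in full. This is a generic sketch of the published technique, not brainctl's code: the function name is hypothetical, and k = 60 is the constant from the original paper, not a confirmed brainctl setting.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    rankings: iterable of lists, each ordered from best to worst.
    An id missing from a list simply contributes nothing for that ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

lexical = ["m1", "m2", "m3"]     # e.g. an FTS5 / BM25 ordering
semantic = ["m3", "m1", "m4"]    # e.g. a vector-similarity ordering
fused = rrf([lexical, semantic])
```

Only ranks enter the formula, never raw scores, which is why BM25 values and cosine similarities can be combined without any calibration step.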
local embeddings
Embeddings are produced by a local Ollama instance running nomic-embed-text. The choice is deliberate — it means the hot path has zero external API calls, retrieval is free at inference time, and an operator can run the entire stack offline. The trade-off is a small quality gap versus frontier embedding models; in practice this is dominated by the hybrid retrieval step and by the worthiness gate keeping the corpus clean.
Embeddings live in embeddings alongside shadow tables generated by the sqlite-vec extension. A lazy recompute strategy means re-embedding only touches rows whose content hash has changed.
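The lazy recompute strategy can be sketched with a content hash per row. The in-memory `index` dict, the SHA-256 choice, and the `reembed_changed` function are assumptions for illustration; the embedding function is passed in, so no model call is implied.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a memory's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reembed_changed(rows, index, embed):
    """Re-embed only rows whose content hash differs from the cached one.

    rows:  {memory_id: text}
    index: {memory_id: (hash, vector)}, updated in place
    embed: callable mapping text to a vector
    Returns the ids that were actually re-embedded.
    """
    touched = []
    for mem_id, text in rows.items():
        h = content_hash(text)
        cached = index.get(mem_id)
        if cached is None or cached[0] != h:
            index[mem_id] = (h, embed(text))
            touched.append(mem_id)
    return touched
```

A second pass over unchanged rows calls `embed` zero times, so embedding cost scales with the number of edits and not with corpus size.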
v2.4.0 widened the embedding surface without changing the retrieval substrate. A five-model registry — nomic-embed-text (default), bge-m3, mxbai-embed-large, snowflake-arctic-embed2, and qwen3-embedding:8b — is selectable via BRAINCTL_EMBED_MODEL and re-indexable in place via brainctl reindex --model <name> with dim-mismatch validation. An optional cross-encoder reranker (bge-reranker-v2-m3) ships behind the brainctl[rerank] extra and lazy-imports sentence-transformers; it is default-off, falls back to a no-op when the dep is missing, and slots after retrieval rather than inside it.
Neither change alters the FTS+vec RRF backbone. The registry widens the set of reachable embedding shapes (multilingual, longer context, larger dim) so an operator can pick an embedding that matches the corpus instead of being locked to one default; the reranker adds an opt-in reordering pass for cases where top-K precision matters more than retrieval latency. Both are instrumentation, not a new ranking model.
spreading activation
Hybrid search returns the memories that match a query. Spreading activation, in the sense of Collins and Loftus 1975, returns the memories that are connected to the matches — the neighbors in semantic space whose activation matters because the matches activated them. Their original model was a semantic network with weighted edges and parallel decay: when a node is activated, activation spreads to connected nodes proportionally to edge weight, with the activation decaying over distance and time. The model was originally proposed to explain priming effects in semantic memory experiments — why hearing "doctor" speeds up recognition of "nurse" even when the two are not co-presented.
brainctl's knowledge graph — entities as nodes, knowledge_edges as typed directional relations — is the substrate for an approximation of this. When the search interface surfaces an entity, the retrieval pass also walks the graph one or two hops out and gathers the connected entities, weighted by recency, edge type, and access history. The effect is that a query for RateLimitAPI does not just return the entity; it returns the related decisions, the upstream services that depend on it, the recent observations attached to it, and any contradictions in flight.
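A bounded spreading-activation pass over such a graph might look like the sketch below. The adjacency-list format, the per-hop decay of 0.5, and the single edge weight standing in for the combined recency, edge-type, and access-history weighting are all simplifying assumptions.

```python
def spread(graph, seeds, hops=2, decay=0.5):
    """Propagate activation from seed nodes to their graph neighbors.

    graph: {node: [(neighbor, weight), ...]} with directed, weighted edges
    seeds: {node: initial_activation}, e.g. the direct search hits
    Returns {node: total_activation} for every node reached.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(hops):
        nxt = {}
        for node, act in frontier.items():
            for nbr, w in graph.get(node, []):
                # Activation shrinks with edge weight and with each hop.
                nxt[nbr] = nxt.get(nbr, 0.0) + act * w * decay
        for nbr, act in nxt.items():
            activation[nbr] = activation.get(nbr, 0.0) + act
        frontier = nxt
    return activation

graph = {
    "RateLimitAPI": [("decision:42", 1.0), ("svc:gateway", 0.5)],
    "svc:gateway": [("obs:7", 1.0)],
}
act = spread(graph, {"RateLimitAPI": 1.0})
```

The node ids in the example are made up; the point is that a two-hop neighbor such as `obs:7` still receives some activation, only less than the one-hop neighbors.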
salience routing
Retrieval is only the first step. A global-workspace-inspired attention budget (Dehaene & Changeux 2011, Baars 1988) decides which of the matched memories actually surface into the agent's working context. The budget weighs:
- recency: how recently the memory was created or touched
- emotional salience: the VAD coordinates from affect_log
- task relevance: cosine similarity to the current orient query
- confidence: expected p from the Bayesian posterior
- EWC importance: how load-bearing the memory is
Only winners make it into the prompt. The workspace broadcast is logged to workspace_broadcasts so the agent (or an auditor) can later inspect what was attended to and why. This is the difference between "what does retrieval return" and "what does the agent actually see" — a distinction that matters when the agent makes a mistake and you need to trace it.
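One way to picture the attention budget is as a weighted score followed by greedy selection under a token limit. The weights, the signal names, the token accounting, and the function names below are assumptions; the text names the five signals but not how they are combined.

```python
# Hypothetical weights over the five signals listed above (they sum to 1.0).
WEIGHTS = {"recency": 0.2, "affect": 0.1, "relevance": 0.4,
           "confidence": 0.2, "importance": 0.1}

def salience(signals):
    """Weighted sum of the signals; missing signals count as 0."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

def select_for_context(candidates, token_budget):
    """Greedily admit the most salient candidates that still fit the budget.

    candidates: list of dicts with 'id', 'tokens', and 'signals'.
    Returns the ids that win a place in the working context.
    """
    winners, used = [], 0
    ranked = sorted(candidates, key=lambda c: salience(c["signals"]), reverse=True)
    for c in ranked:
        if used + c["tokens"] <= token_budget:
            winners.append(c["id"])
            used += c["tokens"]
    return winners
```

Logging the per-signal breakdown alongside each winner is what would make a later audit of "why did the agent see this?" possible.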
retrieval in the session lifecycle
Retrieval in brainctl is not a single operation invoked whenever the agent is curious. It is three distinct operations tied to three distinct phases of the session lifecycle, and the difference matters for both performance and correctness.
Orient retrieval (bulk, session-start). Brain.orient(project=...) is called once at the beginning of a session and returns a composed context package rather than a results list. The package includes: the most recent handoff packet for the project, the top-N semantic memories by access-frequency within scope, the relevant entities and their immediate neighbors in the knowledge graph, any open belief conflicts, any prospective memory triggers whose preconditions match the project keywords, and the decay-protected EWC-important memories. The package is assembled by running each of those sub-queries in parallel and merging the results under a token budget. The agent sees it once at the start of the session and does not need to re-query the basics.
Ad-hoc search (mid-session, query-addressed). Brain.search(query, k=...) is the operation the agent calls when it actually doesn't know something. It runs the full hybrid retrieval pipeline from §5.1–§5.4 and returns a ranked list. The caller is expected to pass a specific question, not a vague topic — the hybrid retriever works best when the query has enough lexical signal for FTS5 to find candidates that the vector index then reranks. A good rule of thumb: if the agent can phrase the query in one sentence, it's a good search; if it would phrase it as a whole paragraph, it should use orient with a narrower scope instead.
Wrap-up retrieval (session-end, self-referential). Brain.wrap_up(summary, project=...) performs a small internal retrieval against the session's own writes — memories, decisions, and events created during this session — and composes the handoff packet. This pass deliberately does not reach across scope; it only sees what the current agent did. The result is a packet that describes the session's own state, not the broader brain state, so the next agent orienting into the project can layer the handoff on top of its own orient context without double-counting.
The three operations form a closed loop. Orient reads broadly, search reads narrowly and on demand, wrap_up writes a structured summary that future orient calls read back. An agent that uses only remember and search gets much less than an agent that uses the full lifecycle, because without orient the agent starts cold every time and without wrap_up every session's work has to be re-discovered by the next. The lifecycle is why agent_orient and agent_wrap_up are native MCP tools (§2.4) rather than Python-only conveniences.
self-improving retrieval
The retrieval mechanisms above (§5.1–§5.6) describe a static pipeline: a query arrives, indexes are consulted, results are ranked and surfaced. brainctl extends this with six mechanisms that make retrieval a learning process that improves with every query.
Thompson Sampling exploration. Search reranking draws from Beta(α, β) distributions rather than using confidence point-estimates. Memories with uncertain confidence are explored more frequently; memories with high certainty are exploited. This converts static retrieval into a self-improving explore/exploit learner with zero additional infrastructure — it uses the existing Bayesian alpha/beta columns that track retrieval outcomes (Thompson 1933).
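The explore/exploit behavior can be seen in a small standard-library sketch. The candidate tuples and the idea of ranking purely by the sampled value are illustrative assumptions; in brainctl the sampled signal is one input among several.

```python
import random

def thompson_rerank(candidates, rng):
    """Rank memories by a draw from Beta(alpha, beta), not by the posterior mean.

    candidates: list of (memory_id, alpha, beta) tuples.
    """
    sampled = [(rng.betavariate(alpha, beta), mem_id)
               for mem_id, alpha, beta in candidates]
    return [mem_id for _, mem_id in sorted(sampled, reverse=True)]

rng = random.Random(0)  # seeded only so the illustration is repeatable
candidates = [
    ("well-tested", 90.0, 10.0),  # high mean, narrow posterior
    ("uncertain", 2.0, 2.0),      # middling mean, wide posterior
    ("poor", 5.0, 45.0),          # low mean, narrow posterior
]
counts = {mem_id: 0 for mem_id, _, _ in candidates}
for _ in range(2000):
    counts[thompson_rerank(candidates, rng)[0]] += 1
```

The well-tested memory wins most rounds, but the uncertain one occasionally ranks first because its wide posterior sometimes produces a high draw. Those occasional wins are the exploration that lets its alpha and beta counts be updated.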
Retrieval-practice strengthening. Each successful recall boosts memory confidence proportional to retrieval prediction error. Hard retrievals — those with high prediction error — strengthen more than easy ones, implementing the "desirable difficulties" effect from cognitive psychology: memories that are used become stronger, memories that are not fade naturally (Roediger & Karpicke 2006; Bjork 1994).
Q-value utility scoring. Each memory carries a Q-value updated via temporal-difference learning after retrieval outcomes. Memories that contribute to task success receive higher Q-values, creating a reinforcement signal that links retrieval rank to downstream utility (Zhang et al. 2026 / MemRL).
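A generic one-step temporal-difference update looks like the sketch below. The learning rate, the discount factor, and the reward encoding (1.0 for a retrieval that helped, 0.0 otherwise) are assumptions; this is the textbook TD rule, not necessarily the exact update brainctl or MemRL uses.

```python
def td_update(q, reward, next_q=0.0, alpha=0.1, gamma=0.0):
    """Move q toward the TD target by a fraction alpha of the TD error."""
    td_error = reward + gamma * next_q - q
    return q + alpha * td_error

q = 0.5  # assumed neutral starting utility
for reward in (1.0, 1.0, 0.0):  # two helpful retrievals, then an unhelpful one
    q = td_update(q, reward)
```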
Context-matching reranker. Every memory captures a JSON snapshot of the agent's operational context at write time — project, agent ID, session ID — plus a SHA-256 hash for fast matching, implementing Tulving & Thomson's (1973) encoding specificity principle. Search results receive a score boost (up to 20%) when their encoding context matches the current retrieval context: a full hash match contributes a 0.3 bonus; partial key-value overlap gives proportional credit. This signal is fused into the reciprocal rank fusion pipeline alongside FTS5, vector, Thompson Sampling, and PageRank signals, validated by Smith & Vela's (2001) meta-analysis of 93 context-dependent memory experiments.
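The hash-then-overlap logic can be sketched as follows. The canonical JSON serialization and the way partial credit is normalized (matching key-value pairs divided by the union of keys) are assumptions; the source specifies only the SHA-256 hash, the 0.3 bonus for a full match, and proportional credit for partial overlap.

```python
import hashlib
import json

def context_hash(ctx):
    """SHA-256 over a canonical JSON rendering of the context snapshot."""
    canonical = json.dumps(ctx, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def context_bonus(encoded, current, full_bonus=0.3):
    """Full bonus on a hash match, otherwise credit proportional to overlap."""
    if context_hash(encoded) == context_hash(current):
        return full_bonus
    keys = set(encoded) | set(current)
    shared = sum(
        1 for k in keys
        if k in encoded and k in current and encoded[k] == current[k]
    )
    return full_bonus * shared / len(keys)

at_write = {"project": "api-v2", "agent_id": "worker-1", "session_id": "s-41"}
at_read = {"project": "api-v2", "agent_id": "worker-1", "session_id": "s-42"}

full = context_bonus(at_write, dict(at_write))
partial = context_bonus(at_write, at_read)
```

Here a memory written by the same agent in the same project but a different session receives two thirds of the full bonus.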
Temporal contiguity. Memories created near each other in time receive a co-retrieval bonus (Dong et al. 2026), reflecting the well-established finding that temporal proximity at encoding predicts co-activation at retrieval. Complementing this, an encoding affect linkage (Eich & Metcalfe 1989) connects memories to the agent's emotional state at encoding time.
Per-project retrieval presets. The orient() call returns project-specific retrieval weight presets stored in the agent state table, enabling progressive specialization to project-specific retrieval patterns (Finn et al. 2017). A new project inherits global defaults; over time, project-level weights diverge to reflect the distinct statistical structure of each codebase or domain.
The net effect is that retrieval accuracy improves consistently with use. Early queries rely on keyword matching and recency; later queries benefit from accumulated Thompson posteriors, Q-value rankings, calibrated confidence, and project-tuned weight profiles.
knowledge graph activation and quantum scoring
The knowledge graph is not decorative — it is the retrieval substrate. Without entity connectivity, the consolidation coupling gate (§4.1) rejects memories from promotion, quantum interference computes at parity with classical retrieval, and PageRank has no edges to traverse. On early production data, 92% of episodic memories had zero knowledge-graph connections. The root cause was that memory_add did not automatically link new memories to known entities.
Zero-LLM entity linking. brainctl solves this with a three-layer pipeline that requires no LLM calls. Layer 1 scans all active memories for FTS5 substring matches against the known entity names (case-insensitive, names < 3 characters excluded). On production data this single pass created 746 new mentions edges, dropping isolation from 92% to 16%. Layer 2 (optional) runs GLiNER, a 205M-parameter bidirectional transformer for zero-shot named entity recognition (Zaratiana et al. 2024, NAACL), extracting person, project, tool, service, concept, and organization entities from remaining unlinked memories. Layer 3 creates entity-to-entity co-occurrence edges for memories mentioning two or more entities — producing 2,270 new co_occurs edges that densify the graph for both PageRank traversal and quantum interference. The combined pipeline grew the graph from 8 connected clusters to 81.
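Layers 1 and 3 can be approximated in a few lines. This sketch uses plain Python substring matching as a stand-in for the FTS5 pass, and the memory texts and entity names are invented for illustration.

```python
from itertools import combinations

def link_entities(memories, entity_names, min_len=3):
    """Return (mentions, co_occurs) edge sets without any LLM call.

    memories: dict of memory_id -> text.
    Names shorter than min_len characters are excluded, as in Layer 1.
    """
    names = [n for n in entity_names if len(n) >= min_len]
    mentions, co_occurs = set(), set()
    for mem_id, text in memories.items():
        lowered = text.lower()
        # Layer 1: case-insensitive substring match against known names.
        found = sorted(n for n in names if n.lower() in lowered)
        mentions.update((mem_id, n) for n in found)
        # Layer 3: every pair of entities in one memory gets a co-occurrence edge.
        co_occurs.update(combinations(found, 2))
    return mentions, co_occurs

memories = {
    1: "RateLimitAPI now fronts the Gateway service",
    2: "Switched gateway timeouts to 30s",
    3: "Lunch order confirmed",
}
mentions, co_occurs = link_entities(memories, ["RateLimitAPI", "Gateway", "DB"])
```

Memory 3 stays isolated because it names no known entity, and the two-character name is skipped by the length filter.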
Phase-aware quantum amplitude scoring. With a connected graph in place, brainctl applies phase-aware amplitude scoring inspired by quantum cognition models. Each memory's amplitude is computed as $\sqrt{\text{confidence}} \times e^{i \cdot \text{phase}}$, where phase encodes the memory's position in the knowledge-graph interference pattern. Neighbors connected via knowledge_edges contribute constructive interference (similar phases boost retrieval score) or destructive interference (opposing phases reduce it). The quantum signal is blended with the classical score and gated on confidence_phase being populated, enabling progressive rollout.
With 3,000+ edges from the entity linking pipeline, quantum interference has the substrate it needs to differentiate retrieval scores based on graph topology rather than keyword overlap alone. Memories that are well-connected to the current query's entity neighborhood receive constructive boosts; memories that are topologically distant receive destructive interference, effectively suppressing false positives that would score well on text similarity alone.
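The interference effect can be demonstrated with complex arithmetic from the standard library. Summing a memory's amplitude with its neighbors' and taking the squared magnitude is the standard way interference is modeled; how brainctl actually combines neighbor amplitudes and blends the result with the classical score is not specified here, so treat this as an assumption.

```python
import cmath
import math

def amplitude(confidence, phase):
    """sqrt(confidence) * e^(i * phase), as a complex number."""
    return math.sqrt(confidence) * cmath.exp(1j * phase)

def interference_score(memory, neighbors):
    """Squared magnitude of the summed amplitudes (assumed combination rule).

    memory and each neighbor are (confidence, phase) pairs.
    """
    total = amplitude(*memory)
    for neighbor in neighbors:
        total += amplitude(*neighbor)
    return abs(total) ** 2

memory = (0.64, 0.0)
alone = interference_score(memory, [])
constructive = interference_score(memory, [(0.36, 0.0)])      # same phase
destructive = interference_score(memory, [(0.36, math.pi)])   # opposite phase
```

With these values the in-phase neighbor lifts the score from 0.64 to about 1.96, while the opposite-phase neighbor drops it to about 0.04.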
multi-agent and handoff
handoff packets
A handoff packet is a compact four-field structure emitted by brain.wrap_up: goal (what was the agent trying to achieve), current_state (where things stand), open_loops (what is unresolved), and next_step (what should happen first in the next session). These are the minimum sufficient statistics for continuation and map directly onto the continuation state a human would give a colleague.
Packets are signed (HMAC over the packet contents keyed by agent identity) so the receiving agent can verify that what it is orienting from is in fact what the previous agent wrote. This matters once multiple agents share a brain.
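Signing and verification can be sketched with the standard hmac module. The SHA-256 digest, the canonical JSON serialization, and the per-agent key shown here are assumptions; the source states only that the signature is an HMAC over the packet contents keyed by agent identity.

```python
import hashlib
import hmac
import json

def sign_packet(packet, key):
    """HMAC over a canonical JSON rendering of the four-field packet."""
    body = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()

def verify_packet(packet, signature, key):
    """Constant-time comparison against a freshly computed signature."""
    return hmac.compare_digest(sign_packet(packet, key), signature)

agent_key = b"hypothetical-per-agent-secret"  # illustrative only
packet = {
    "goal": "migrate rate limiter to token bucket",
    "current_state": "bucket implemented, tests failing on burst case",
    "open_loops": ["burst test", "gateway config"],
    "next_step": "fix burst refill arithmetic",
}
signature = sign_packet(packet, agent_key)

tampered = dict(packet, next_step="delete the test suite")
ok = verify_packet(packet, signature, agent_key)
rejected = not verify_packet(tampered, signature, agent_key)
```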
theory of mind
Each agent maintains not only its own belief state but a model of the beliefs of agents it hands off to — stored in agent_perspective_models. This allows asymmetric handoffs: the sending agent can tailor the packet to what the receiving agent already knows, and the receiving agent can reason about discrepancies between its own view and the sender's.
In the simple case this reduces to "skip context the receiver already has." In the interesting case it enables correction loops where a receiver can detect that the sender was operating under a stale assumption.
federation
Federation between brains — shared context across multiple physical databases with access control — is a direction rather than a scheduled milestone. The design sketch is a minimal sync protocol over signed append-only logs, allowing one brain to pull memories from another under per-scope permissions. The goal is to keep the single-file invariant at rest while enabling collaboration between operators in motion. Interest and concrete use cases from operators are what will drive whether and when this gets built.
trust model between cooperating agents
When a single brain file is shared by multiple agents — an orchestrator and several workers, a human and a reviewer bot, a research agent and a coding agent — the question of trust becomes load-bearing. Who can read whose memories? Who can overwrite whose decisions? What happens when a low-trust agent writes something that contradicts a high-trust agent's prior belief?
brainctl's trust model is built from four orthogonal dimensions that compose at query time:
- 1. Identity (agent_id). Every write is attributed to the agent that produced it. There is no anonymous write path. Two agents sharing a brain always know who wrote what, as a literal database join.
- 2. Scope. Memories are written into a scope (global, project:api-v2, or agent:reviewer-bot) and reads filter by scope. A worker agent scoped to project:api-v2 cannot see memories written under project:billing. Scopes are not hierarchical by default, but operators can configure inclusion chains (e.g., every project:* scope also sees global).
- 3. Trust level (memory_trust_scores). Each source is assigned a trust score that determines what it can overwrite, not just what it can read. A high-trust writer (the operator, a verified human reviewer) can supersede memories written by lower-trust writers. A low-trust writer (an ingest pipeline, a web scraper, a third-party tool) cannot supersede anything outside its own writes. The trust score is consulted by the W(m) gate: a low-trust candidate that would merge with a high-trust existing memory fails the gate with a trust downgrade flag, not a silent merge.
- 4. Provenance chain. The source-monitoring layer (§7.3) ensures that even a trusted agent's derived memories carry the attribution of the facts they were derived from. If a high-trust agent writes a conclusion that was derived from a low-trust observation, the conclusion inherits a parent link to the low-trust source. A later retrieval can apply a trust floor and filter both out.
Write conflicts between peers. When two agents write contradictory claims into the same scope at the same trust level, the contradiction flows into belief_conflicts rather than being silently resolved. Neither write is discarded. An operator (or a higher-trust arbiter agent) can then call resolve_conflict to rank the competing claims and collapse the loser via AGM (§3.5). This prevents the pathological case where two agents thrash on the same memory, each overwriting the other.
Read asymmetry. Cross-agent reads are unrestricted by default within the same scope — sharing a brain is the entire point of running multiple agents on one. But an operator can mark memories private to an agent, which restricts reads to that agent's identity even within the shared scope. This is how brainctl supports a reviewer pattern where a reviewer agent can read everything a worker agent writes but the worker cannot see the reviewer's private notes.
What brainctl does not yet do: cryptographic attribution. Agent IDs are database-level identities, not signed identities. A compromised process with write access to the brain file can write under any agent_id it wants; the RBAC layer protects against curious mistakes and tool-output contamination, not against an adversary that has already achieved filesystem-level access. Cryptographic per-write signatures are a candidate for a future release, but they would only matter in a multi-operator setting which is itself not yet the primary deployment model.
lineage
The memory-as-explicit-state line of work in ML agents has three load-bearing recent papers. Park et al. 2023 ("Generative Agents: Interactive Simulacra of Human Behavior"), the Smallville simulation, introduced a memory-stream architecture with importance, recency, and relevance scoring plus a periodic reflection step that synthesizes higher-level beliefs from raw observations. Packer et al. 2023 (MemGPT) reframed agent memory as an operating-system-style hierarchy with paging between context, recall, and archival tiers. Shinn et al. 2023 (Reflexion) showed that letting an agent verbally reflect on its own failures and write those reflections back to a persistent buffer improves task success across reasoning and coding benchmarks.
brainctl draws from all three: handoff packets generalize the Generative Agents reflection step into a session-bridging signed signature; the typed memory stores generalize MemGPT's tier hierarchy from three levels to six; and the reflexion_lessons table (migration 008, plus the reflexion_failure_recurrence tracker) is a direct implementation of Shinn et al.'s persistent reflection buffer. The lineage is not implicit. brainctl is what you get when you take those three papers seriously, keep the substrate local, and add the cog-sci pieces (AGM revision, schema integration, source monitoring) that the ML literature mostly leaves unaddressed.
security posture
quarantine
Untrusted input — memories written by an agent that ingested a web page, a user message, or a tool output — lands in the memory_quarantine table before it reaches memories. A human operator or a trusted reviewer agent marks each quarantined item as safe, malicious, or uncertain. Malicious items are purged with all derived knowledge edges retracted; safe items are promoted; uncertain items stay quarantined.
This is the primary defence against memory poisoning attacks, where an adversary tries to inject false premises into the agent's long-term memory via a tool response or a retrieved document.
pii audit trail
Personally identifiable information detection runs on every semantic write. Hits are logged to pii_audit with the source memory ID, the detected category (email, phone, name, account number, ...), and the action taken (redact, drop, escalate). This gives operators a single log to query when answering data-retention requests.
provenance chains and source monitoring
The cognitive-science frame for what this section describes is the source monitoring framework of Johnson, Hashtroudi, and Lindsay 1993. Source monitoring is the cognitive process by which a person attributes a memory to its origin — was this fact something I read, was it told to me, did I infer it, did I imagine it? Source monitoring failures cause confabulation: the propositional content of the memory may be correct, but the source attribution is wrong, and any decision grounded in that memory is grounded in a phantom premise. The DRM paradigm (Roediger & McDermott 1995) showed how easily even healthy human memory generates false memories under associative pressure, and how robust the confidence in those false memories can be.
brainctl's provenance posture is a literal implementation of source monitoring at the data layer, with the explicit goal of preventing the agent equivalent of confabulation. Every memory carries an agent ID, a source type (user-written, agent-written, tool-output, ingested-document, derived, consolidation-promoted), a creation timestamp, and — for memories derived from other memories — a list of parent IDs forming a directed acyclic provenance graph. The access_log table records every read with the reader, timestamp, and surrounding query context. Together these let an auditor trace any fact in the brain back to its origin and answer the question "why does the agent think this?" — not as a metaphor, but as a literal join.
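The "literal join" can be illustrated against a toy schema with a recursive query. The table and column names below (memories, memory_parents, and their columns) are hypothetical stand-ins, not brainctl's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE memories (
    id INTEGER PRIMARY KEY, agent_id TEXT, source_type TEXT, content TEXT
);
CREATE TABLE memory_parents (child_id INTEGER, parent_id INTEGER);
INSERT INTO memories VALUES
  (1, 'scraper', 'tool-output',   'vendor page says limit is 100 rps'),
  (2, 'worker',  'agent-written', 'observed 429s above 100 rps'),
  (3, 'worker',  'derived',       'RateLimitAPI cap is 100 rps');
INSERT INTO memory_parents VALUES (3, 1), (3, 2);
""")

# Walk parent links transitively, then join back to read each ancestor's origin.
ANCESTORS = """
WITH RECURSIVE lineage(id) AS (
    SELECT parent_id FROM memory_parents WHERE child_id = ?
    UNION
    SELECT p.parent_id
    FROM memory_parents AS p JOIN lineage AS l ON p.child_id = l.id
)
SELECT m.id, m.agent_id, m.source_type
FROM memories AS m JOIN lineage AS l ON l.id = m.id
ORDER BY m.id
"""
origins = conn.execute(ANCESTORS, (3,)).fetchall()
```

The result answers "why does the agent think this?" for memory 3: it rests on one tool output from the scraper and one observation written by the worker.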
rbac
Memory RBAC (migration 017) attaches a trust level to each memory source and a scope to each reader. High-trust writers (the operator, the user) can write to any scope; lower-trust writers (ingest pipelines, external tools) are confined to sandboxed scopes. Readers filter by scope at query time via memory_trust_scores.
encryption, supply chain, and incident response
Three concerns sit alongside the threat model above and deserve explicit treatment. None of them are novel problems — they are the standard operational-security concerns any system holding sensitive data faces — but pretending they do not apply to brainctl would leave the reader with gaps.
Encryption at rest. brainctl does not encrypt brain.db by default. SQLite has a closed-source encryption extension (SEE) and a free, widely-used alternative (SQLCipher), and the brainctl code path is compatible with both — the database is opened through a single function that respects a pragma-level encryption configuration. Operators who need at-rest encryption should either link against SQLCipher (one-time library swap, no code changes) or rely on filesystem-level encryption (FileVault, LUKS, BitLocker, ZFS native encryption). The default posture is to inherit from the filesystem because the majority of operators already have disk encryption enabled for everything else on the machine; adding another layer on top offers diminishing returns. For regulated environments where application-level encryption is contractually required, SQLCipher is the recommended path and a short operator guide exists in the repository.
Supply chain. brainctl's runtime dependencies are deliberately short. The core requirements are Python 3.11+, SQLite with WAL mode (shipped with Python), the sqlite-vec extension (a single shared library with no transitive dependencies), and — for semantic retrieval — a local Ollama instance running nomic-embed-text. The MCP server adds the mcp Python package. Nothing in this list is network-native at runtime: Ollama runs on localhost, sqlite-vec is compiled into the process, and the mcp stdio transport is file-descriptor-based. There is no browser surface, no long-lived network connection, and no auto-updater reaching out to a remote server. This is an intentional reduction in supply-chain surface: the fewer packages in the runtime, the fewer places a malicious upstream can land a compromised release.
Development dependencies are larger (test frameworks, linters, benchmark harnesses), but they are isolated to requirements-dev.txt / the dev extras and are not loaded by the runtime. The repository publishes a lockfile and the PyPI release is built from a deterministic CI pipeline; operators who want reproducible builds can pin against the lockfile and verify the built artifact against the release hash.
Incident response. If an operator discovers that a memory, a tool output, or a source was compromised, brainctl provides the primitives for a clean response without data loss:
- 1. Identify the compromised source. Query access_log for the offending agent ID or source type. The log is append-only and timestamped, so the window of compromise is recoverable.
- 2. Quarantine derived memories. The memory_quarantine table accepts writes from an operator that mark a set of memories as pending review. The trust-propagation pipeline then walks the provenance graph from the quarantined memories outward, flagging every downstream memory whose parent chain touches the compromised source. None of these are deleted; they are held pending review.
- 3. Replay consolidation with the compromised sources excluded. A dream cycle can be invoked with an exclusion set, so the Hebbian pass and the REM bisociation step do not reinforce the poisoned edges while the operator decides what to restore.
- 4. Restore or purge. Memories marked safe after review flow back into the active store; memories marked malicious are purged with the quarantine_purge tool, which retracts all derived knowledge edges and records the retraction in the audit log. The purge is a soft tombstone rather than a hard delete, so a subsequent investigation can recover the original content and its provenance.
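The outward walk in step 2 amounts to a breadth-first traversal over inverted parent links. The sketch below uses an in-memory mapping with invented IDs; the function name and data layout are assumptions, not the trust-propagation pipeline's real interface.

```python
from collections import deque

def downstream(compromised, parents):
    """Return every memory whose parent chain touches a compromised memory.

    parents: dict of child_id -> list of parent_ids (the provenance DAG).
    """
    # Invert the parent links so the graph can be walked outward.
    children = {}
    for child, parent_ids in parents.items():
        for parent in parent_ids:
            children.setdefault(parent, []).append(child)

    flagged, queue = set(), deque(compromised)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in flagged:
                flagged.add(child)
                queue.append(child)
    return flagged

# Memory 3 derives from 1 and 2; memory 4 derives from 3; memory 5 only from 2.
parents = {3: [1, 2], 4: [3], 5: [2]}
flagged = downstream({1}, parents)
```

Memory 5 is left untouched because nothing in its parent chain reaches the compromised source, so containment stays scoped to what was actually affected.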
The incident-response primitives above are designed to assume bad input is inevitable and to give the operator tools to contain rather than tools to prevent. Prevention is always incomplete; containment and auditability are what determine how bad a compromise becomes.
threat model: memory poisoning
The threat model brainctl's quarantine, source monitoring, and RBAC layers defend against is grounded in a specific recent security literature. Greshake et al. 2023 ("Not What You've Signed Up For") characterized indirect prompt injection: attacks where untrusted content arrives via a tool output, a retrieved document, or a web page and contains instructions intended to manipulate the agent's downstream behavior. The original paper demonstrated end-to-end attacks against LLM-integrated applications including Bing Chat and email assistants, with payloads as simple as an HTML comment hidden in a web page.
For stateful agents the threat is amplified, because injected content can persist across sessions if the agent writes it to memory. An attacker who controls a single tool response or document can craft input designed to land in the agent's long-term store and bias every future decision that retrieves from that region of memory. This is the agent equivalent of a persisted XSS attack, and it survives every restart and context reset.
brainctl's three structural defenses against memory poisoning are: (1) the memory_quarantine table, which holds untrusted writes pending review before they reach memories; (2) the trust-scoped RBAC of §7.4, which prevents low-trust writers from contaminating high-trust scopes even if quarantine is bypassed; (3) the source-monitoring provenance chain of §7.3, which makes every retrieved fact traceable to its origin so a compromised tool can be retroactively quarantined and all downstream memories it spawned can be retracted. None of these eliminate the risk — no defense does — but they make the difference between an agent that can be permanently poisoned by a single malicious tool response and an agent whose poisoning attempts are isolated, attributable, and reversible.
implementation and benchmarks
At v2.7.0, brainctl is implemented in approximately 64k lines of Python inside src/agentmemory/ with a SQLite schema defined by 51 migration files rebuilding to ~62 user-facing tables. The MCP server exposes 209 tools across two transports (stdio + streamable HTTP). Nineteen first-party plugins ship in-tree — agent frameworks (Claude Code, Codex CLI, Cursor, Eliza, Gemini CLI, Goose, Hermes, OpenClaw, OpenCode, Pi, Rig, Virtuals Game, Zerebro) and trading bots (Freqtrade, Jesse, Hummingbot, NautilusTrader, OctoBot, Coinbase AgentKit). The release sequence below is the migration-by-migration story of how the surface got here.
The v1.5.0 migration runner brings existing brain.db files up to the current schema safely. v1.6.0 added a deterministic single-system search-quality harness with a pytest regression gate that fails the build on any >2% drop in P@1 / P@5 / Recall@5 / MRR / nDCG@5. v2.4.0 added a five-model embedding registry, an optional cross-encoder reranker, and a same-fixture competitor harness under tests/bench/competitor_runs/ with adapters for Mem0, Letta, Zep, Cognee, MemPalace, and OpenAI Memory under a skip-not-fabricate contract. v2.4.1 added per-row provenance (retrieval_mode, vector_enabled, embedding_model, rerankers_active, search_args) plus a vec write-path connection pool that takes 30–100 ms off the Brain.remember hot path. v2.4.2 added brainctl status, a single-screen brain-health overview that combines DB stats, doctor-style issue detection, and service-availability checks and exits non-zero on any actionable issue so it can gate CI. v2.4.3 added the OpenCode (TypeScript hook) and Pi (proxy-via-adapter) plugin shapes alongside Goose (pure-MCP YAML). v2.4.5 added brainctl ingest code, a tree-sitter-based source-tree walker that writes file / function / class entities + contains and imports knowledge_edges into brain.db with SHA-256 idempotency caching, shipping three grammars (Python, TypeScript, Go) to keep the wheel footprint at ~4 MB.
v2.5.0 added the streamable-HTTP MCP transport (brainctl-mcp-http) alongside the existing stdio server, exposing the full tool surface over HTTP with bearer-token auth and an allowlist for remote clients (xAI Grok, Strand). v2.5.1 closed an MCP dispatcher bug + FTS5 cold-start consistency gap surfaced by the beta-audit.
v2.6.0 shipped the marketplace negotiation CLI (brainctl marketplace api ...) with full offer / counter / accept / reject / withdraw semantics, the protocol-prefix rename (brainctl-marketplace/v1:) decoupling the on-chain protocol identifier from the community-token ticker, the protocol fee schedule (flat per-op fees plus 2.5% on marketplace settlement, lowered from an earlier 3.5% before any production volume hit chain), and an independent treasury wallet (preserving the dev wallet’s anti-sniping hold ahead of token launch). v2.6.1 added provider import adapters (brainctl import mem0 / json) for onboarding from other memory providers into a quarantine scope, and brainctl bundle decrypt for local AES decryption of bundles you minted yourself. v2.6.2 fixed an MCP-server idle-timeout bug (#108) that was killing stdio servers under Claude Desktop after one hour of no requests. v2.6.3 added brainctl wallet export-key, which renders the managed wallet secret in the base58 format that Phantom, Backpack, Solflare, and Glow accept under their import-private-key flows, so a brainctl wallet can be paired with any standard Solana wallet UI. v2.6.4 added BRAINCTL_ALLOWED_TOOLS for stdio MCP clients that cap tool count (Google's Antigravity IDE caps at 100; brainctl exposes 201+ now and would refuse to load without the allowlist) — unknown names hard-error at startup with difflib "did you mean?" suggestions (closes #114).
v2.7.0 shipped procedural memory as a first-class store (§3.7) — migration 052, three canonical tables (procedures, procedure_steps, procedure_sources), an FTS5 virtual table for procedure search, and eight new MCP tools (procedure_add / get / list / search / update / feedback / backfill / stats) bringing the MCP surface to 209. The release credits Velamj as the author of PR #94, opened 2026-04-24, which predated comparable procedural-memory work in the public agent-memory space by ~19 days. Procedural memory is the third of Tulving's 1972 tripartite typology to become a first-class store in brainctl; earlier releases used the "procedural" label for what the new typology (§2.2) more accurately calls the decisional store.
Retrieval quality on standard benchmarks (measured 2026-04-18). Intel Core Ultra 7 258V, 33.9 GB RAM, default brainctl settings, no benchmark-specific tuning. Repro: python -m tests.bench.run --check-strict.
| benchmark | scoring | brainctl |
|---|---|---|
| LoCoMo (n=1,986) | session-level avg recall | 0.9217 |
| LongMemEval (n=470) | R@5 | 0.9702 |
| LongMemEval (n=470) | R@10 | 0.9894 |
| MemBench FirstAgent (n=200) | hit@5 | 0.930 |
The numbers above are committed as a regression baseline under tests/bench/baselines/ and gated in CI on every push — a >2% drop in P@1 / P@5 / Recall@5 / MRR / nDCG@5 fails the build. A same-fixture competitor harness lives under tests/bench/competitor_runs/ for operators who want to compare against Mem0, Letta, Zep, Cognee, MemPalace, or OpenAI Memory on identical inputs; adapters and reproduction instructions are in the repo. Per-row provenance fields added in v2.4.1 (retrieval_mode, vector_enabled, embedding_model, rerankers_active) make any future result bundle self-describing.
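The shape of such a gate is simple enough to sketch. Whether the 2% threshold is relative or absolute is not stated, so this sketch assumes a relative drop; the metric values are invented for illustration and are not measured results.

```python
def regressions(baseline, current, max_drop=0.02):
    """Return metrics that fell more than max_drop (relative) below baseline."""
    failed = {}
    for name, base_value in baseline.items():
        value = current.get(name, 0.0)  # a missing metric counts as a failure
        if value < base_value * (1.0 - max_drop):
            failed[name] = (base_value, value)
    return failed

# Illustrative numbers only.
baseline = {"P@1": 0.80, "MRR": 0.85}
current = {"P@1": 0.79, "MRR": 0.82}
failed = regressions(baseline, current)
```

A CI step would fail the build whenever the returned mapping is non-empty; here the P@1 dip stays inside the tolerance while the MRR drop does not.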
The local-first architecture is a deliberate performance and privacy choice. Every read is a direct SQLite hit — sub-millisecond on the hot path, no network round trip, no remote API to rate-limit or bill by the operation. Writes commit to WAL and return immediately; consolidation runs asynchronously during quiet hours. The brain.db file is yours: copy it, back it up, move it between machines. No token gates what your agent can remember. No third-party storage layer holds data you cannot inspect or export.
Representative figures on a single-agent brain after one month of continuous use on an M2 MacBook Pro:
| operation | p50 | p99 | notes |
|---|---|---|---|
| brain.remember | 3.1 ms | 9.8 ms | includes W(m) gate + embedding |
| brain.search (k=10) | 6.4 ms | 18 ms | FTS5 + vec + RRF + salience |
| brain.orient | 22 ms | 55 ms | full context package assembly |
| brain.wrap_up | 8 ms | 25 ms | packet + signature + decisions flush |
| consolidation pass | 1.4 s | 4.1 s | per 1000 episodic rows, offline |
Numbers are indicative, not normative; a formal benchmark suite is in bench/ and runs in CI. The point is that the hot path is small — dominant cost is local embedding computation, which is still single-digit milliseconds.
dependencies and platform support
brainctl's runtime dependency list is intentionally short. The core requirements:
- Python 3.11+: the codebase uses tomllib and modern typing features that require 3.11
- SQLite 3.38+ with WAL mode: usually bundled with Python, otherwise from the system package manager
- sqlite-vec: a single-file shared library loaded as a SQLite extension; zero transitive dependencies
- Ollama running nomic-embed-text: optional but required for semantic retrieval; without it the system falls back to FTS5-only search
- mcp (the Python MCP package): required only if running the MCP server; pure Python library
Platform support tracks SQLite and sqlite-vec. brainctl runs on macOS (Intel and Apple Silicon), Linux (x86_64 and aarch64), and Windows. CI exercises the full suite on macOS and Ubuntu on every push. Windows is supported but less battle-tested; the repository accepts Windows-specific bug reports and fixes them. The one platform that is explicitly unsupported is anything without a writable local filesystem — the single-file architecture is fundamentally incompatible with serverless runtimes that expose only ephemeral disk.
testing and release cadence
The test suite exceeds 1,700 tests covering the MCP tool surface, the W(m) gate, the migration runner, the consolidation pipeline, the belief revision logic, and the integration plugins. CI runs the full suite on every push and every pull request, on both macOS and Ubuntu, with the runs pinned to a fresh SQLite build to catch extension-loading regressions. The brainctl-mcp --doctor command runs a local subset of the CI checks on the operator's own installation — useful for diagnosing a broken install without shipping the environment to a maintainer.
Releases are SemVer: major for incompatible API changes, minor for backward-compatible features (plugin additions, new MCP tools, new tables via migration), patch for fixes. Release cadence is irregular but approximately every two to four weeks for minor releases and as-needed for patches. Every release includes a CHANGELOG entry with the migration list and any operator-visible behavioral changes. The PyPI release is built from a deterministic CI pipeline and published with an attestation so operators can verify the built artifact against the release hash.
contribution
brainctl is open source under MIT and accepts pull requests. The contribution model is intentionally low-ceremony: fork, branch, open a PR, get review. The one hard rule is that every new mechanism must have a research note in the research/ directory explaining the cognitive-science, ML, or systems grounding behind the design — this is how the paper trail in §10.1 stays current and how reviewers understand why the code looks the way it does. Tests are required for any change that touches the hot path or the W(m) gate; style conforms to ruff and black with defaults.
Good first issues typically fall into five categories: (1) new MCP tools wrapping existing Python API calls, (2) new plugins for agent frameworks that don't yet have first-party support, (3) research notes implementing a specific paper's mechanism as an optional feature, (4) bench harnesses extending the benchmark suite, (5) documentation and reproducibility bundles for the existing research notes. The project is not looking for large speculative refactors.
comparison with existing approaches
The agent-memory landscape is crowded and moving fast. Several mature projects already address subsets of the problem brainctl targets, and any honest comparison has to acknowledge that. The table below is descriptive, not evaluative: it records what each project does at the architectural level. The paragraphs that follow it are the actual positioning.
| project | substrate | memory typology | belief revision | consolidation |
|---|---|---|---|---|
| LangGraph checkpointer | per-graph state, pluggable backend | flat dict or schema-defined state | last-write-wins on update | none |
| Letta (formerly MemGPT) | hosted, postgres or sqlite | core / recall / archival | none | recall ↔ archival swap |
| Mem0 | hosted, postgres-backed | flat memories with metadata | none | LLM-driven memory updates |
| Zep | hosted service, postgres + vector | session messages + extracted facts | none | none |
| Cognee | postgres + vector + graph | knowledge-graph-first | none | offline graph build |
| brainctl | sqlite, single file | 7 typed stores + knowledge graph (incl. first-class procedural in v2.7) | AGM with collapse audit | 8-phase consolidation pipeline |
Letta (formerly MemGPT) pioneered the recall/archival memory swap and remains the strongest reference point for hosted multi-agent memory. Mem0 targets memory-as-a-service for production agents and has the cleanest REST surface in the category. Zep offers enterprise-grade session storage layered on Postgres and is probably the right call for any team that needs row-level ACLs, audit logs, and SOC2-style compliance from day one. Cognee leads on knowledge-graph-first memory and is the closest cousin to brainctl in spirit. LangGraph's checkpointer is the right answer for teams already invested in LangChain's runtime. CrewAI and AutoGen both ship first-party memory layers and are the path of least resistance inside their respective frameworks. The built-in memory features in ChatGPT and Claude shape end-user expectations and quietly set the floor for what users assume an agent should remember.
None of these are wrong. brainctl does not try to be a better Letta or a better Mem0; it tries to be the answer for a specific design corner that the rest of the field does not occupy: local-first SQLite as the only required infrastructure, seven typed memory stores with all three of Tulving's episodic / semantic / procedural types as first-class data models, AGM belief revision with a full collapse audit instead of last-write-wins, an eight-phase consolidation pipeline that actively generates hypotheses during quiet hours, and a chain-canonical agent-to-agent marketplace built on top of the memory primitives. If you need any of those five things and you also need MIT licensing (Apache 2.0 on the marketplace components) and zero hosted dependencies, brainctl is the answer. If you need anything the others ship that brainctl does not — managed multi-tenant hosting, enterprise audit tooling, framework-native ergonomics — use them. The design space brainctl occupies is underpopulated, not contested.
economics: why a token
license posture
brainctl is MIT-licensed and will remain so. There is no enterprise edition, no paid tier, no gated features. Every mechanism in this paper is implemented in the public repository, with tests and a research note in research/. The ~40 research notes in that directory document the cognitive-science grounding of each mechanism with citations and reproduction instructions.
why a token rather than grants or VC
Open-source infrastructure is chronically underfunded. The two mainstream funding mechanisms both fail in characteristic ways.
- Grants are slow, competitive, and carry opinionated reporting requirements that drag a project away from its users.
- Venture capital demands a closed core or a hosted product within ~18 months — both of which contradict the MIT posture above.
A token is a third option. It aligns funding with the people who benefit from the memory layer — builders running agents, operators paying for inference, the agents themselves eventually — without putting the software behind a paywall. If it fails, the software is still free. If it succeeds, the research accelerates.
distribution
The brainctl community token has not launched yet. There is no contract address, no pump.fun listing, and no circulating supply. The ticker symbol is intentionally being withheld until launch — anything trading today under a brainctl-style ticker is not us. When the token does launch, the intent is a fair launch on pump.fun — no team allocation, no pre-sale, no vesting — with the launch itself serving as the distribution.
The development wallet is already public, however, and already being tracked live on /transparency. Every inbound and outbound transfer is rendered on the page — fetched server-side from the Solana chain via the Helius enhanced transactions API, cached for 60 seconds, and cross-checkable against any independent RPC. The point is that the ledger is public before any money moves, not after.
Nothing in this section is financial advice or an offer to sell securities. The brainctl community token is unlaunched and its ticker is withheld until deploy. The brainctl software is free, open source, and MIT-licensed independent of any token.
commitments
Two concrete commitments at launch. Nothing else is promised about how future fees, treasury balances, or proceeds will be deployed — those decisions will be made and disclosed as they happen, on the public transparency page below. Making no commitment is preferred to making one we'd need to walk back.
- 1. Buy + burn ~10% of total supply. The team will use launch proceeds to buy approximately ten percent of total token supply on the open market and burn it. This is a one-time deflationary action executed shortly after launch, with the burn transaction published on the transparency page when it occurs.
- 2. Lock ~10% of total supply. An additional ~10% of total supply will be acquired and locked, with the lock address and unlock terms published on the transparency page at lock time.
Transparency. The development wallet is public from the moment the token deploys (pre-launch it is held privately to prevent sniping, §10.3). Every inbound and outbound transfer is rendered live on /transparency via a server-side Helius fetch, cached 60 seconds, cross-checkable on Solscan. There is no separate "internal" wallet, no treasury sub-account hidden off the transparency page, no off-chain spending channel. Operators, contributors, token holders, and adversaries all see the same ledger at the same time.
License invariant. The brainctl software is MIT-licensed (the marketplace components Apache 2.0) and version-controlled on GitHub. That invariant does not depend on the token, the treasury, or any market outcome.
the on-chain primitives (signed exports → mint → marketplace)
brainctl ships three composable on-chain primitives that together let agents trade memories with no custodial layer in the middle. Each is usable in isolation; the marketplace is the third, built on top of the first two. All three are shipped and on PyPI as of v2.7.0 — this section is a specification of the running system, not a forward-looking promise.
Layer 0: signed exports (brainctl 2.3). brainctl export --sign produces an Ed25519-signed JSON bundle of memories that anyone can verify offline without brainctl itself. --pin-onchain additionally writes the bundle’s SHA-256 hash, and only the hash, to Solana via the SPL Memo program. The signature proves the bundle came from a specific wallet; the memo proves the bundle existed at a specific slot. As of v2.6.0 the primitive carries a flat protocol fee (a tiny SOL transfer atomic with the memo); see the fee schedule below.
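To make the hash-only pinning concrete, here is a sketch of computing a SHA-256 digest over a bundle with the standard library. The canonical form (sorted keys, compact separators) and the example bundle fields are assumptions for illustration; the actual brainctl bundle layout and canonicalization may differ, and the Ed25519 signing step is not shown.

```python
import hashlib
import json


def bundle_digest(bundle: dict) -> str:
    """Hex SHA-256 of a deterministic JSON encoding of the bundle."""
    # Assumption: sorted keys and compact separators serve as the canonical form.
    canonical = json.dumps(
        bundle, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical bundle contents, for illustration only.
bundle = {"version": 1, "memories": [{"id": 1, "content": "example"}]}
digest = bundle_digest(bundle)
```

Because the encoding is deterministic, two parties holding the same bundle derive the same digest regardless of key order, which is what lets a verifier compare a local bundle against a pinned hash.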
Layer 1: mint (brainctl 2.5). brainctl export --sign --mint takes the same signed bundle and:
- Generates a fresh 32-byte AES-256-GCM key, encrypts the bundle client-side, persists the key at ~/.brainctl/keys/<mint>.key (mode 0600).
- Uploads the ciphertext to Arweave via Irys. Hard cap: 80 KB per bundle (Irys free-tier ceiling). Above that, the command exits and suggests --ids / --category / --created-after filters.
- Mints a Light Protocol compressed token on Solana with metadata pointing at the Arweave URI. Rent is sponsored by Light Protocol, so the only out-of-pocket cost is the Solana tx fee, approximately $0.0001 at current SOL price.
The compressed token is a standard SPL asset: indexable by Helius, transferable from any wallet UI, listable on Tensor or Magic Eden. The chain sees the bundle hash and the ciphertext URI; it does not see plaintext.
Layer 2: marketplace (brainctl 2.6+). Live at brainctl.org/marketplace and via brainctl marketplace api ... in the Python CLI. Sellers list a signed bundle’s hash rather than a pre-minted token — the cNFT is minted just-in-time to the buyer at settlement, so a single bundle can be sold to many buyers, each receiving their own freshly-minted token. Listings are USD-pegged with a $10,000 cap; settlement is in SOL pre-launch and switches to the community token via a single env-flip post-launch. The marketplace components are Apache 2.0 (the rest of brainctl stays MIT) — patent grant included, encouraging other agent platforms to adopt the memo format and the indexer pattern without friction.
The marketplace runs as a chain-as-database: state lives in Solana memos and Arweave manifests, not in any server-side database. Every action — list, offer, counter, accept, reject, withdraw, buy, release, cancel — is a signed memo with the deterministic prefix brainctl-marketplace/v1:<action>:.... The brainctl.org API is an indexer and a transaction builder; if it disappears, the same state is reconstructible by any other party from the chain. The API does not hold keys, does not custody funds, does not escrow. Authentication is wallet-signature challenge-response, persisted only in ephemeral KV (5-minute nonce TTL, 24-hour session TTL).
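A sketch of how an independent indexer might recognise marketplace memos by their deterministic prefix. Only the prefix and the action names come from the text; the names parse_memo and MarketplaceMemo are hypothetical, and the per-action field layout is deliberately left uninterpreted because it is not specified here.

```python
from typing import NamedTuple, Optional

PREFIX = "brainctl-marketplace/v1"
ACTIONS = {
    "list", "offer", "counter", "accept", "reject",
    "withdraw", "buy", "release", "cancel",
}


class MarketplaceMemo(NamedTuple):
    action: str
    payload: str  # raw remainder; its per-action layout is not assumed


def parse_memo(memo: str) -> Optional[MarketplaceMemo]:
    """Return the action and raw payload, or None for unrelated memos."""
    parts = memo.split(":", 2)  # prefix, action, everything else
    if len(parts) < 2 or parts[0] != PREFIX:
        return None
    action = parts[1]
    if action not in ACTIONS:
        return None
    payload = parts[2] if len(parts) == 3 else ""
    return MarketplaceMemo(action, payload)
```

Splitting at most twice keeps any colons inside the payload intact, so the sketch does not have to guess how later fields are delimited.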
Negotiation. A listing picks a visibility mode at creation: auction (offers visible to all browsers) or private (offers visible only to the seller and the offerer; the chain memos themselves are public, but the indexer hides them from third parties). Each offer carries a USD-pegged price (capped at $10,000, converted to the payment token at settle time via Jupiter spot) and a TTL capped at 24 hours. Either side can counter an offer; the chain preserves the full lineage so a later reputation index can be derived from it.
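The two caps on an offer can be expressed as a small validation step. This is a sketch under stated assumptions: the Offer dataclass and validate_offer are hypothetical names, and only the $10,000 price cap and the 24-hour TTL cap are taken from the text.

```python
from dataclasses import dataclass
from decimal import Decimal

MAX_PRICE_USD = Decimal("10000")  # price cap stated in the text
MAX_TTL_SECONDS = 24 * 60 * 60    # 24-hour TTL cap stated in the text


@dataclass(frozen=True)
class Offer:
    price_usd: Decimal
    ttl_seconds: int


def validate_offer(offer: Offer) -> list[str]:
    """Return a list of problems; an empty list means the offer is acceptable."""
    problems = []
    if offer.price_usd <= 0:
        problems.append("price must be positive")
    elif offer.price_usd > MAX_PRICE_USD:
        problems.append("price exceeds the $10,000 cap")
    if not 0 < offer.ttl_seconds <= MAX_TTL_SECONDS:
        problems.append("TTL must be between 1 second and 24 hours")
    return problems
```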
Settlement. The buyer’s settle transaction does, in a single signature:
- Transfer (price − fee) of the payment token to the seller’s payment address. SOL pre-launch; the community token post-launch via a configuration flip in the API.
- Transfer the 2.5% protocol fee to the marketplace treasury wallet. The fee is fixed in the API, not configurable per-listing.
- Post a buy memo carrying the buyer’s X25519 pubkey (derived from their Solana ed25519 key via the libsodium curve25519 transform — the same conversion the buyer’s daemon uses to decrypt the released envelope).
On detection of the buy memo, the seller’s daemon (brainctl marketplace api listen) mints a fresh compressed token to the buyer, SealedBox-encrypts the bundle’s AES key to the buyer’s X25519 pubkey, uploads the envelope to Arweave, and posts a release memo brainctl-marketplace/v1:release:<listing>:<envelope>:<minted_cnft>. The buyer polls for the release memo, decrypts the envelope locally to recover the AES key, fetches the encrypted bundle from the listing’s Arweave URI, decrypts, and optionally ingests into its own brain.db under scope=imported:<listing_id> — a quarantine scope that does not blend into the agent’s primary memory until explicit promotion.
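The payment split in the settle transaction is simple arithmetic, sketched below in lamports (1 SOL = 1,000,000,000 lamports). The function name settlement_split and the round-down rule are assumptions; the text specifies only the 2.5% rate and that the seller receives price minus fee. The spot price is passed in as a parameter rather than fetched.

```python
from decimal import Decimal, ROUND_DOWN

LAMPORTS_PER_SOL = 1_000_000_000
PROTOCOL_FEE_RATE = Decimal("0.025")  # 2.5% protocol fee from the text


def settlement_split(price_usd: Decimal, sol_usd: Decimal) -> tuple[int, int]:
    """Return (seller_lamports, fee_lamports) for a USD-pegged price."""
    # Assumption: convert at the given spot price and round down to whole lamports.
    total = int((price_usd / sol_usd * LAMPORTS_PER_SOL).to_integral_value(ROUND_DOWN))
    fee = int((Decimal(total) * PROTOCOL_FEE_RATE).to_integral_value(ROUND_DOWN))
    # Seller gets the remainder, so the two legs always sum to the total.
    return total - fee, fee


seller, fee = settlement_split(Decimal("50"), Decimal("200"))
```

For a $50 listing with SOL at $200, the total is 0.25 SOL: 6,250,000 lamports go to the treasury and 243,750,000 to the seller.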
Protocol fee schedule. Two layers of fee, both atomic with the operation they accompany so the buyer / seller never lands in a partially-settled state. The flat per-op fees (calibrated to SOL ≈ $200, overridable per-deployment via env vars) are:
- sign --pin-onchain: 0.0005 SOL (~$0.10); bundle hash pinned on chain
- marketplace api list / offer / counter / accept / reject / withdraw / cancel: 0.0005 SOL each (~$0.10); every negotiation step
- export --sign --mint: 0.0025 SOL (~$0.50); JIT cNFT mint
- marketplace api settle: 2.5% of the trade value (no flat fee on top); the settlement protocol fee
- marketplace JIT-mint-at-settle (seller’s daemon): zero fee; the seller has already paid 2.5% at settle
- pure offline export --sign: zero fee; never touches the chain
- devnet: zero fee on every op; free dev / test
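The flat fees listed above can be checked against their dollar figures with a few lines of arithmetic. The dictionary keys and function names below are shorthand invented for this sketch, not brainctl identifiers; the SOL amounts and the $200 calibration price come from the text.

```python
from decimal import Decimal

# Flat per-op fees in SOL as listed above; keys are illustrative shorthand.
FLAT_FEES_SOL = {
    "pin": Decimal("0.0005"),               # sign --pin-onchain
    "negotiation_step": Decimal("0.0005"),  # list / offer / counter / ...
    "mint": Decimal("0.0025"),              # export --sign --mint
    "offline_sign": Decimal("0"),           # export --sign, never on chain
}


def flat_fee_usd(op: str, sol_usd: Decimal, devnet: bool = False) -> Decimal:
    """USD value of one flat fee at the given SOL price; devnet is free."""
    if devnet:
        return Decimal("0")
    return FLAT_FEES_SOL[op] * sol_usd


def negotiation_cost_sol(steps: int) -> Decimal:
    """Total flat fees in SOL for a negotiation with the given number of steps."""
    return FLAT_FEES_SOL["negotiation_step"] * steps
```

At the calibration price, an offer, a counter, and an accept together cost 0.0015 SOL, or about $0.30, before the 2.5% settlement fee.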
The treasury wallet that receives these fees is deliberately separate from the dev wallet that holds the community-token allocation and the quarterly-burn pool. The split preserves the dev wallet’s anti-sniping hold (the address is held privately ahead of token launch, so snipers cannot watch it for the createToken transaction and front-run launch participants) while still letting the marketplace fee infrastructure run on a fully public, queryable address from day one.
Trust model. The remaining trust requirement is on the seller actually releasing the bundle key after payment lands. The shipped enforcement is the chain record itself: every release (or non-release) is signed by the seller’s wallet and visible to every future buyer, so a seller who collects payment without releasing keys ends up with a wallet history that disqualifies them from future sales. Stake-to-list, dispute window, and slashing are referenced in the constants (LISTING_STAKE_USD = 1.0 in agentmemory.marketplace) but are not yet enforced on chain. The next major iteration replaces the trusted-seller release entirely with Lit Protocol threshold encryption conditional on payment-confirmation finality, which removes the trust assumption. That work is roadmap, not shipped.
direction
brainctl does not maintain a fixed quarterly roadmap. The project is issue-driven: priorities come from the public GitHub issue tracker, from operator feedback, and from whichever research threads are returning interesting signal. A frozen eighteen-month plan would be fiction, and the alternative — publishing one anyway — is exactly the kind of thing a serious open-source project should refuse to do.
What the authors are currently interested in, without committing to timelines, includes:
- Lit Protocol threshold encryption for marketplace releases. The trust requirement on sellers releasing the bundle key after payment (§10.5) goes away if the key release is conditional on payment-confirmation finality, brokered by a threshold-encryption network. This is the single largest remaining trust assumption in the marketplace and the most-asked roadmap question from operators.
- On-chain Solana program holding listings + escrow. The chain-canonical posture currently runs through memos + Arweave manifests + a server-side indexer; the canonical state would be more robust if listings lived in a dedicated Solana program with bonding curves / dispute windows / slashing built in. This is what would turn LISTING_STAKE_USD from a config constant into something actually enforced.
- x402 middleware for hosted brainctl-mcp-http billing. The streamable-HTTP transport (v2.5.0) is built to be remoted. Pairing it with an x402 (HTTP 402 Payment Required) middleware lets agents pay per-tool-call in SOL / community token without subscriptions or API keys. This is the natural revenue layer for "brainctl as a hosted service" if anyone wants to run that product.
- Capability groups on the MCP allowlist. v2.6.4 added a flat-allowlist for tool-capped clients (§2.4); a named-group layer (BRAINCTL_TOOL_GROUPS=core,knowledge_graph,procedural) on top of that primitive would let operators pick curated bundles rather than maintaining hand-written tool lists.
- Autonomous procedural-memory acquisition. v2.7.0 ships procedural memory with operator-approved backfill (§3.7). The next step is the loop that watches the event stream, clusters successful trajectories, and lifts the clusters into procedures without operator intervention.
- Cross-brain federation. §6.3 sketches the design. The hard parts are not transport (signed append-only logs are well-understood) but the semantics of cross-brain AGM conflict resolution and the importance-score scaling across remote pulls.
- Deeper head-to-head benchmarks. §8 has measured numbers against MemPalace on the same machine. Extending the same harness to Letta, Mem0, Zep, Cognee, and OpenAI Memory on a common multi-day agent-trajectory benchmark is the single most valuable contribution the field could make right now; the bench harness in tests/bench/competitor_runs/ is designed to be pluggable but the adapters are not all written yet.
- A reference brain-browser UI. Operators currently inspect brain.db with the CLI or by opening the file in any SQLite browser. A purpose-built reader that understands the typology, the consolidation lineage, and the provenance chains would make audit work much faster.
- More research notes in research/ with reproducibility bundles. The existing ~40 are the documentation trail for why the substrate looks the way it does, and the trail should keep growing as new mechanisms ship.
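For the capability-groups item above, a sketch of how a named-group layer could expand into a flat tool list. This feature is not shipped: the group contents and most tool names are placeholders invented for illustration, and only the variable name BRAINCTL_TOOL_GROUPS and its comma-separated example value come from the text.

```python
# Hypothetical group contents; tool names are placeholders for illustration.
TOOL_GROUPS = {
    "core": ["memory_add", "memory_search"],
    "knowledge_graph": ["entity_link", "spreading_activation"],
    "procedural": ["procedure_search", "procedure_backfill"],
}


def resolve_allowlist(env: dict[str, str]) -> list[str]:
    """Expand BRAINCTL_TOOL_GROUPS into an ordered, de-duplicated tool list."""
    raw = env.get("BRAINCTL_TOOL_GROUPS", "")
    groups = [g.strip() for g in raw.split(",") if g.strip()]
    unknown = [g for g in groups if g not in TOOL_GROUPS]
    if unknown:
        raise ValueError(f"unknown tool groups: {unknown}")
    tools: list[str] = []
    for group in groups:
        for tool in TOOL_GROUPS[group]:
            if tool not in tools:  # keep first occurrence, preserve order
                tools.append(tool)
    return tools
```

Passing dict(os.environ) as env would read the real environment; failing loudly on an unknown group name is a design choice of this sketch, on the reasoning that a typo should not silently shrink the tool surface.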
Anything here may change. If a direction matters to you, the fastest way to affect priorities is to open an issue or send a pull request — every v2.6.x release this month closed a community-filed issue (#108 idle timeout, #113 Windows SIGHUP, #114 tool allowlist) and v2.7.0 itself shipped an external contributor's PR (#94 procedural memory, by Velamj).
open questions and research frontier
The second half of this section is an honest inventory of hard problems that are not solved in brainctl today. The list matters because a whitepaper that only describes what works gives the reader a dishonest model of what the project actually is. These are the places where the cognitive science or the ML literature suggests a better answer than what brainctl currently implements, and where work we would be happy to see (or do ourselves) is wide open.
The distinction between this section and the roadmap above: §11 lists work we intend to ship, with engineering paths visible. §11.1 lists research questions we don't yet know how to answer well. Some items in §11 will collapse into this list when the engineering reveals an open design question; some items here will graduate to §11 when the path becomes clear.
Learned schemas. The nine memory categories in §2.2 and the schema-integration story in §4.4 use hand-designed schemas. Humans acquire new schemas throughout life; a schema-acquiring agent would be able to grow its own category system in response to new domains. Concretely: given a corpus of observation memories the agent keeps writing, can brainctl identify that a new category has emerged and propose its addition? This is a clustering-plus-validation problem with the extra constraint that the new category should be stable across sessions.
Procedural fitness vs novelty. §3.7 ships a fitness score that updates from execution outcomes — high-fitness procedures naturally surface earlier in procedure_search. That's exploitation. The counterpart problem is when to stop reaching for the well-trodden procedure and try something new — exploration. Retrieval has Thompson Sampling for this; procedural memory currently does not. A naïve copy of the same explore/exploit mechanism into procedural retrieval is the obvious starting point, but procedures execute side-effects (real tool calls, durable changes), so the exploration cost is higher than for read-only retrieval. Calibrating the explore/exploit tradeoff to that asymmetry is an open design question.
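The "naïve copy" starting point mentioned above can be sketched as Beta-Bernoulli Thompson Sampling over per-procedure success counts. This is an illustration of the general technique, not brainctl's retrieval code: ProcedureStats and thompson_pick are hypothetical names, and the uniform Beta(1, 1) prior is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class ProcedureStats:
    successes: int = 0
    failures: int = 0


def thompson_pick(stats: dict[str, ProcedureStats], rng: random.Random) -> str:
    """Draw a plausible success rate per procedure and pick the highest draw."""
    draws = {
        # Beta(successes + 1, failures + 1): posterior under a uniform prior.
        name: rng.betavariate(s.successes + 1, s.failures + 1)
        for name, s in stats.items()
    }
    return max(draws, key=draws.get)
```

Procedures with little history have wide posteriors and so occasionally win a draw, which is the exploration. The sketch does nothing to account for the side-effect cost discussed above; that calibration is exactly the open part.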
Learned procedural acquisition. procedure_backfill already auto-creates procedures from candidate memories using a hand-designed regex heuristic (looks_procedural: how-to phrasing, if-then conditionals, rollback language, step markers). That covers acquisition from already-written memories. What is genuinely open is learned acquisition directly from the event stream — clustering successful trajectories at the tool-sequence level and promoting stable clusters into procedures without a hand-designed classifier. The clustering is easy; the validation — "is this cluster actually capturing the same task, or is it three superficially-similar but distinct workflows?" — is the hard part, and it's an open ML question.
Intent-matched prospective memory. §3.6 is honest that prospective memory triggers match on keywords, not intent. A trigger with precondition "billing schema" fires for every query mentioning those words, including unrelated contexts. An intent-matched variant would use the embedding of the precondition and a threshold against the query embedding, or a small trained classifier, to decide whether the trigger is actually relevant to the agent's current task. The data structures exist; the matching logic is the gap.
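The embedding-threshold variant described above reduces to a cosine-similarity check. A minimal sketch, assuming the precondition and query embeddings are already available as equal-length vectors; the function names and the 0.75 threshold are placeholders, and a real threshold would need tuning against labelled triggers.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def trigger_fires(
    precondition_vec: Sequence[float],
    query_vec: Sequence[float],
    threshold: float = 0.75,  # placeholder value, not a tuned constant
) -> bool:
    """Fire only when the query is semantically close to the precondition."""
    return cosine(precondition_vec, query_vec) >= threshold
```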
Spreading activation: equilibrium dynamics + inhibition. The current spreading_activation implementation already does more than bounded BFS — it propagates with per-edge-type weights (semantic_similar: 1.0, causes: 0.9, causal_chain_member: 0.8, etc.) and exponential decay per hop (decay ** (hop + 1)), citing Collins & Loftus 1975 directly. What is not there: iterative relaxation until equilibrium (the implementation does a fixed two-hop sweep), sibling inhibition (Collins & Loftus had inhibitory edges between siblings of an activated parent), and path accumulation (the current code takes the max contribution per target, not the sum across paths). Each of those adjustments changes how multi-hop reasoning surfaces related concepts; whether the fidelity gain justifies the complexity cost is the open empirical question.
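The behaviour described above (per-edge-type weights, decay ** (hop + 1), a fixed two-hop sweep, max rather than sum per target) can be sketched as follows. This is not the brainctl implementation: the graph shape, the default decay of 0.5, and the choice to multiply edge weights along a path are assumptions; only the three edge weights and the decay exponent come from the text.

```python
EDGE_WEIGHTS = {"semantic_similar": 1.0, "causes": 0.9, "causal_chain_member": 0.8}


def spread(graph, seeds, decay=0.5, max_hops=2):
    """graph maps node -> list of (neighbor, edge_type); returns node -> activation."""
    activation = {}
    frontier = {seed: 1.0 for seed in seeds}  # node -> best path weight so far
    for hop in range(max_hops):
        next_frontier = {}
        for node, path_weight in frontier.items():
            for neighbor, edge_type in graph.get(node, []):
                if neighbor in seeds:
                    continue
                weight = path_weight * EDGE_WEIGHTS.get(edge_type, 0.0)
                score = weight * decay ** (hop + 1)
                # Max contribution per target, not the sum across paths.
                if score > activation.get(neighbor, 0.0):
                    activation[neighbor] = score
                if weight > next_frontier.get(neighbor, 0.0):
                    next_frontier[neighbor] = weight
        frontier = next_frontier
    return activation
```

Switching the max to a sum, or looping until activations stop changing instead of stopping at two hops, would turn this into the path-accumulating and equilibrium variants whose value is the open empirical question.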
Neural replay at short timescales. §4.1 is honest that brainctl's replay has nothing to do with theta rhythms or sequence preservation at biologically meaningful timescales. There is a research question whether any of that matters for agents: mammalian sharp-wave ripples compress 1–10 seconds of experience into ~100 ms, and the compression ratio itself may be a load-bearing part of why replay works. Whether an agent memory system benefits from time-compressed replay of sequences, or whether the current unordered top-K replay captures everything that actually matters, is not known.
Closed-loop W(m) calibration. The memory_outcome_calibration table is written to (every retrieval outcome flows in via outcome_eval), but nothing currently reads that table to adjust the W(m) coefficients (α, β, γ from §3.2). The loop is half-closed: outcome data is captured but coefficient updates are still manual. A worthiness gate that tuned its own coefficients from accumulated calibration data — same Bayesian update pattern as §3.3, applied one level up — would close the loop. The data and the schema exist; the adjustment policy is the gap.
Whole-pipeline retrieval calibration. The RRF k parameter has been tuned empirically for brainctl: the shipped value is k=30, deliberately moved off Cormack 2009's canonical k=60 after observation that brainctl's corpus (tens to low hundreds of memories per scope, not the millions in traditional IR benchmarks) shifted the optimum. What remains open is whether every retrieval-time constant (the salience-routing weights in §5.5, the per-edge-type weights in spreading_activation, the Q-value temperature for Thompson exploration, the temporal-recency half-life) has been similarly calibrated. Most have not been ablated. A full sensitivity analysis over a common agent-trajectory benchmark — alongside the competitor benchmarks in §11 — would close that loop too.
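For reference, reciprocal rank fusion scores each item as the sum over input rankings of 1 / (k + rank), with rank starting at 1. The sketch below implements that formula with the k=30 default mentioned above; the function name, the tie-break by id, and the example memory ids are illustrative rather than taken from brainctl.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 30) -> list[tuple[str, float]]:
    """Fuse ranked id lists; returns (id, score) pairs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score, then by id for a deterministic tie-break.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical rankings from a full-text pass and a vector pass.
fts_ranking = ["m1", "m2", "m3"]
vec_ranking = ["m2", "m3", "m1"]
fused = rrf([fts_ranking, vec_ranking])
```

A smaller k makes top ranks count for more relative to lower ones, which is consistent with the argument above that small per-scope corpora shift the optimum away from k=60.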
Cryptographic agent attribution. §6.4 notes that agent IDs are database-level identities, not signed identities, and that a compromised process with filesystem access can write under any agent ID. Cryptographic per-write signatures would harden multi-operator deployments but require a key distribution story that brainctl currently does not have. This becomes urgent once federation ships, and the answer probably ties into the same wallet-identity model the marketplace already uses for authentication.
Cross-brain conflict semantics. Federation is in the roadmap, but the hard parts are not transport — signed append-only logs are well-understood — they are the semantics of cross-brain conflict resolution. When two brains hold contradictory AGM beliefs and attempt to merge, whose provenance chain wins? What happens to the importance scores of memories pulled from a remote brain — do they inherit, decay, or normalize? Is the receiving brain's W(m) gate the right admission authority, or does the remote write's home gate matter? Open design questions all the way down.
Marketplace reputation that resists gaming. §10.5's current trust enforcement is just the chain record itself: every release (or non-release) is signed and visible. That's sufficient if the community can read provenance fluently, but it scales poorly — agents shopping for memories need a single signal, not a forensic exercise. A reputation index is the natural next layer, but every naïve formulation (release rate, sale velocity, age-of-wallet) is gameable. Sybil-resistant reputation that derives meaningful signal from on-chain history without being trivially manipulable is an open marketplace-economics question, and the answer likely cross-pollinates with the broader on-chain identity literature.
Trustless key release at threshold. Lit Protocol threshold encryption is in the roadmap (§11) as the mechanism that removes the trusted-seller assumption from marketplace settlement. The design space is wider than "just use Lit": what threshold size resists seller-validator collusion, how does the payment-confirmation oracle integrate with Solana finality (single-slot, sub-slot, or n-block-back), what happens when the threshold network has stale state vs the chain, and how is key rotation handled for long-lived listings. These are cryptographic-systems-design open questions, not just integration work.
Hosted-MCP per-call billing economics. The x402 middleware in §11 is shaped like a Stripe-without-Stripe primitive: agent calls a tool, server returns 402, agent's wallet pays, tool runs. The mechanism is well-defined; the pricing is not. Should hosted brainctl-mcp-http charge per-tool-call (every retrieval is the same price), per-result (only successful retrievals charged), per-memory-volume (writes scale with bundle size), per-token (input + output of the LLM the tool was feeding)? Each has different incentive structures and different spam-resistance properties. Empirical answers require running the experiment.
MCP surface vs cognitive load. brainctl exposes 209 tools and is still growing. Anthropic, OpenAI, and Google all publish recommended ceilings (typically 50–100 tools per server) below which agent tool-selection accuracy holds, but the empirical basis is thin. v2.6.4's BRAINCTL_ALLOWED_TOOLS is a coping mechanism — it lets tool-capped clients work — but the underlying question is whether the full surface degrades any agent's reasoning, even on uncapped clients. Capability groups (§11) help if the answer is "yes, curate"; they're wasted scope if the answer is "no, just sort intelligently". A clean ablation would settle it.
Formal verification of the W(m) gate. The gate's correctness properties — monotonicity under trust level, idempotency under rewrite, convergence under repeated redundant writes — are stated informally in §3.2 but not proven. A formal model of the gate in a proof assistant would be a strong signal of seriousness and a genuinely useful artifact for anyone extending the system.
This list is not exhaustive. It is what we can currently articulate as open problems from inside the project; the most interesting research questions are usually the ones that are not yet legible as questions. We expect this list to grow, not shrink, as the project matures — a healthy research program is one that accumulates hard problems faster than it solves them.
references
- Alchourrón, Gärdenfors, Makinson (1985). On the logic of theory change: partial meet contraction and revision functions. Journal of Symbolic Logic 50(2): 510–530.
- Aquino Argueta et al. (2026). Reactivation during sleep segregates the neural representations of episodic memories. bioRxiv.
- Baars (1988). A cognitive theory of consciousness. Cambridge University Press.
- Bartlett (1932). Remembering: A study in experimental and social psychology. Cambridge University Press.
- Bjork (1994). Memory and metamemory considerations in the training of human beings. In Metcalfe & Shimamura (Eds.), Metacognition. MIT Press.
- Borgeaud et al. (2022). Improving language models by retrieving from trillions of tokens. ICML.
- Buzsáki (2015). Hippocampal sharp wave-ripple: A cognitive biomarker for episodic memory and planning. Hippocampus 25(10): 1073–1188.
- Cepeda, Pashler, Vul, Wixted, Rohrer (2006). Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological Bulletin 132(3): 354–380.
- Collins, Loftus (1975). A spreading-activation theory of semantic processing. Psychological Review 82(6): 407–428.
- Cormack, Clarke, Buettcher (2009). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. SIGIR.
- Dehaene, Changeux (2011). Experimental and theoretical approaches to conscious processing. Neuron 70(2): 200–227.
- Diba, Buzsáki (2007). Forward and reverse hippocampal place-cell sequences during ripples. Nature Neuroscience 10(10): 1241–1242.
- Diekelmann, Born (2010). The memory function of sleep. Nature Reviews Neuroscience 11(2): 114–126.
- Dong, Lu, Norman, Michelmann (2026). Towards large language models with human-like episodic memory. Trends in Cognitive Sciences 30(2).
- Dunlosky, Metcalfe (2009). Metacognition. SAGE Publications.
- Ebbinghaus (1885). Über das Gedächtnis. Duncker & Humblot.
- Eich, Metcalfe (1989). Mood dependent memory for internal versus external events. Journal of Experimental Psychology: Learning, Memory, and Cognition 15(3): 443–455.
- Einstein, McDaniel (1990). Normal aging and prospective memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 16(4): 717–726.
- Finn, Abbeel, Levine (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.
- Fountas, Oomerjee, Bou-Ammar, Wang, Burgess (2026). Why the brain consolidates: Predictive forgetting for optimal generalisation. arXiv:2603.04688.
- Frey, Morris (1997). Synaptic tagging and long-term potentiation. Nature 385(6616): 533–536.
- Greshake, Abdelnabi, Mishra, Endres, Holz, Fritz (2023). Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. AISec workshop, ACM CCS.
- Gutierrez, Shu, Gu, Yasunaga, Su (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. NeurIPS.
- Johnson, Hashtroudi, Lindsay (1993). Source monitoring. Psychological Bulletin 114(1): 3–28.
- Kang et al. (2025). Hindsight: Causal attribution for improved retrieval. arXiv:2512.12818.
- Karlsson, Frank (2009). Awake replay of remote experiences in the hippocampus. Nature Neuroscience 12(7): 913–918.
- Karpukhin et al. (2020). Dense passage retrieval for open-domain question answering. EMNLP.
- Khandelwal, Levy, Jurafsky, Zettlemoyer, Lewis (2020). Generalization through memorization: Nearest neighbor language models. ICLR.
- Kirkpatrick et al. (2017). Overcoming catastrophic forgetting in neural networks. PNAS 114(13): 3521–3526.
- Klinzing, Niethard, Born (2019). Mechanisms of systems memory consolidation during sleep. Nature Neuroscience 22(10): 1598–1610.
- Lewis et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
- Lin (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3–4): 293–321.
- Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12: 157–173.
- McClelland, McNaughton, O'Reilly (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review 102(3): 419–457.
- McCloskey, Cohen (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation 24: 109–165.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.
- Morici et al. (2026). Dorsoventral hippocampus neural assemblies reactivate during sleep following an aversive experience. Nature Neuroscience.
- Murre, Dros (2015). Replication and analysis of Ebbinghaus' forgetting curve. PLOS ONE 10(7): e0120644.
- Niediek et al. (2026). Episodic memory consolidation by reactivation of human concept neurons during sleep reflects contents, not sequence. bioRxiv.
- O'Neill, Winters (2026). Breaking boundaries: Dopamine's role in prediction error, salient novelty, and memory reconsolidation. Neuroscience 594: 31–41.
- Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez (2023). MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
- Parisi, Kemker, Part, Kanan, Wermter (2019). Continual lifelong learning with neural networks: A review. Neural Networks 113: 54–71.
- Park, O'Brien, Cai, Morris, Liang, Bernstein (2023). Generative agents: Interactive simulacra of human behavior. UIST.
- Robinson et al. (2026). Large sharp-wave ripples promote hippocampo-cortical memory reactivation and consolidation during sleep. Neuron 114(2): 226–236.
- Roediger, Karpicke (2006). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science 17(3): 249–255.
- Roediger, McDermott (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, and Cognition 21(4): 803–814.
- Rumelhart (1980). Schemata: The building blocks of cognition. In Spiro, Bruce, Brewer (Eds.), Theoretical Issues in Reading Comprehension. Erlbaum.
- Schaul, Quan, Antonoglou, Silver (2016). Prioritized experience replay. ICLR.
- Schwimmbeck et al. (2026). Sequential coupling of sleep oscillations enables concept-neuron reactivation. bioRxiv.
- Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS.
- Shu et al. (2025). TiMem: Temporal integration for memory-augmented LLM agents. arXiv:2601.02845.
- Smith, Vela (2001). Environmental context-dependent memory: A review and meta-analysis. Psychonomic Bulletin & Review 8(2): 203–220.
- Thompson (1933). On the likelihood that one unknown probability exceeds another. Biometrika 25(3–4): 285–294.
- Tononi, Cirelli (2003). Sleep and synaptic homeostasis: a hypothesis. Brain Research Bulletin 62(2): 143–150.
- Tononi, Cirelli (2014). Sleep and the price of plasticity. Neuron 81(1): 12–34.
- Tse et al. (2007). Schemas and memory consolidation. Science 316(5821): 76–82.
- Tulving (1972). Episodic and semantic memory. In Tulving & Donaldson (Eds.), Organization of Memory. Academic Press.
- Tulving, Thomson (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review 80(5): 352–373.
- Walker, van der Helm (2009). Overnight therapy? The role of sleep in emotional brain processing. Psychological Bulletin 135(5): 731–748.
- Wang (2025). Democratizing GraphRAG: Linear, CPU-only graph retrieval for multi-hop QA. arXiv:2602.23372.
- Widloski, Foster (2025). Replay without sharp wave ripples in a spatial memory task. Nature Communications 16: 10287.
- Wilson, McNaughton (1994). Reactivation of hippocampal ensemble memories during sleep. Science 265(5172): 676–679.
- Yang, Buzsáki (2024). Awake sharp-wave ripples tag episodic memories for consolidation. Nature Neuroscience.
- Zaratiana, Tomeh, Holat, Charnois (2024). GLiNER: Generalist model for named entity recognition using bidirectional transformer. NAACL, 5364–5376.
- Zhang, Bo, Lan, Hu, Song, Lan, Wang, Wen (2024). A survey on the memory mechanism of large language model based agents. arXiv:2404.13501.
- Zhang et al. (2026). Adaptive memory admission control for LLM agents. arXiv:2603.04549. ICLR 2026 Workshop MemAgents.