How we measure bi-temporal RESOLVER latency.
Last updated: 2026-05-05
Benchmark methodology
Latency is the question every AI engineer on a buying committee asks. Mycelium publishes the methodology first, the harness second, and the numbers third. This page carries the bi-temporal RESOLVER methodology. The shipped baseline section below carries the first measured numbers, which come from the precursor benchmark of the Bucket 2 hybrid retrieval substrate that the RESOLVER builds on. The bi-temporal harness ships in ai-brain-starter v0.5; the first bi-temporal RESOLVER run lands when memory-runtime-pro v1.0 ships.
What we measure
Wall-clock latency of a bi-temporal query against the typed-memory graph. Bi-temporal means the query carries two time axes: when the fact was true (valid time) and when the system learned about it (transaction time). Sub-200ms p99 against an enterprise-scale graph is the public target, set to match Zep's published claim. Latency is measured at the resolver boundary, after authentication and before agent-runtime serialization, so the number is the substrate's, not the surrounding stack's.
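To make the two time axes concrete, here is a minimal sketch of a bi-temporal lookup over an in-memory list. It is an illustration only: the `FactVersion` fields, the `as_of` function, and the sample data are hypothetical names chosen for this example, not the RESOLVER's actual schema or API, which this page does not describe.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class FactVersion:
    """One recorded version of a fact. All field names are hypothetical."""
    entity: str
    predicate: str
    value: str
    valid_from: datetime               # valid time: when the fact became true
    valid_to: Optional[datetime]       # None means the fact is still true
    recorded_at: datetime              # transaction time: when the system learned it
    superseded_at: Optional[datetime]  # None means no later record has replaced it


def as_of(versions: Iterable[FactVersion], entity: str, predicate: str,
          valid_at: datetime, known_at: datetime) -> Optional[FactVersion]:
    """Return the version true at `valid_at`, as the system knew it at `known_at`."""
    matches = [
        v for v in versions
        if v.entity == entity and v.predicate == predicate
        and v.valid_from <= valid_at
        and (v.valid_to is None or valid_at < v.valid_to)
        and v.recorded_at <= known_at
        and (v.superseded_at is None or known_at < v.superseded_at)
    ]
    # If several versions qualify, prefer the most recently recorded one.
    return max(matches, key=lambda v: v.recorded_at, default=None)


history = [
    # Recorded on 1 March: the price has been 10 since 1 January.
    FactVersion("plan-pro", "price", "10", datetime(2026, 1, 1), None,
                datetime(2026, 3, 1), datetime(2026, 6, 1)),
    # Correction recorded on 1 June: the price was actually 12 all along.
    FactVersion("plan-pro", "price", "12", datetime(2026, 1, 1), None,
                datetime(2026, 6, 1), None),
]
```

Asking for the price valid on 1 February as known on 1 April returns the "10" version, while the same valid time as known on 1 July returns the corrected "12" version. Both queries share a valid time and differ only in transaction time, which is the distinction a bi-temporal query carries.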
Test corpus
The public benchmark runs against the ai-brain-starter test fixtures: a synthetic enterprise graph with 50,000 typed memory records, 5,000 entities, 12,000 decisions, and 8,000 events spanning 24 months of valid time. The fixtures ship with the repository so the methodology is reproducible by anyone with a git clone and a Postgres install.
How the harness works
- Step 1. Load the 50,000-record fixture into a clean Postgres database.
- Step 2. Warm the cache with one read pass over every record (eliminates cold-start variance).
- Step 3. Run 1,000 bi-temporal queries against the warmed cache, randomly sampled across the entity, time, and predicate dimensions.
- Step 4. Record p50, p95, p99, and max latency at the resolver boundary.
- Step 5. Repeat the run on three different machine sizes (4-core / 16-core / 64-core) and publish all three series.
- Step 6. Re-run on every tagged release of memory-runtime-pro and append to the public benchmark history.
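Steps 2 through 4 can be sketched as a small timing loop. This is not the shipped harness, which is still in development. The `resolve` callable, the `sample_queries` helper, the query tuple shape, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
import random
import time
from typing import Callable, Dict, List, Sequence, Tuple

Query = Tuple[str, str, int]  # (entity, predicate, month offset); a hypothetical shape


def sample_queries(entities: Sequence[str], predicates: Sequence[str],
                   months: int, n: int, seed: int) -> List[Query]:
    """Sample n queries uniformly across entity, predicate, and time."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [(rng.choice(entities), rng.choice(predicates), rng.randrange(months))
            for _ in range(n)]


def percentile(sorted_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile over an ascending sequence (an assumed definition)."""
    rank = max(1, (pct * len(sorted_ms) + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_ms[rank - 1]


def run_benchmark(resolve: Callable[[Query], object],
                  warmup: Sequence[Query],
                  queries: Sequence[Query]) -> Dict[str, float]:
    if not queries:
        raise ValueError("at least one timed query is required")
    for item in warmup:            # Step 2: warm-up pass, not timed
        resolve(item)
    samples: List[float] = []
    for query in queries:          # Step 3: timed queries
        start = time.perf_counter()
        resolve(query)
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    samples.sort()
    return {                       # Step 4: the four published statistics
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": samples[-1],
    }
```

The timer wraps only the `resolve` call, mirroring the rule that latency is measured at the resolver boundary. With 1,000 samples, nearest-rank p99 is the 990th-fastest sample, so the ten slowest queries sit above it and the maximum is reported separately.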
How you reproduce it
Clone github.com/adelaidasofia/ai-brain-starter, install the resolver harness by following benchmarks/resolver/README.md, and run the harness on your own machine. Your numbers are yours. Send anomalies to contact@mycelium-ai.co; we publish reproducible third-party runs alongside our own.
Current status
| Item | Status |
|---|---|
| Methodology version | v1, published 2026-05-05 |
| RESOLVER (bi-temporal) harness | In development; ships in ai-brain-starter v0.5 |
| RESOLVER (bi-temporal) first run | Scheduled with memory-runtime-pro v1.0 release |
| Bucket 2 (hybrid retrieval) harness | Shipped: memory-runtime-pro/benchmarks/bucket-2-latency.py |
| Bucket 2 first measured run | 2026-05-09; cleared sub-200ms p99 (see shipped baseline below) |
| Target (both surfaces) | Sub-200ms p99 across the published configurations |
| Verification standard | Reproducible third-party runs from the public harness against the public fixtures |
Shipped baseline (Bucket 2 hybrid retrieval, 2026-05-09)
The Bucket 2 hybrid retrieval substrate is what the RESOLVER will sit on top of for bi-temporal queries. Its first measured baseline shipped 2026-05-09 against a 5,000-note synthetic corpus (30,029 chunks) using the hash-v1 embedder on commodity hardware (MacBook Air, Apple Silicon). All ten queries in the production-shaped query battery had median latencies between 41 and 58 ms, including the no-lexical-match fallback case, which cannot use FTS5 (57 ms median, 93 ms p95).
| Metric | Pre-FTS5 baseline | Post-FTS5 baseline | Speedup |
|---|---|---|---|
| p50 | 690 ms | 50.6 ms | 13.6x |
| p95 | 760 ms | 59.0 ms | 12.9x |
| p99 | 802 ms | 93.5 ms | 8.6x |
| mean | 699 ms | 50.6 ms | 13.8x |
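The speedup column is the pre-FTS5 figure divided by the post-FTS5 figure, rounded to one decimal place. A short check of that arithmetic, using only the numbers from the table (the variable names are illustrative):

```python
# (pre-FTS5 ms, post-FTS5 ms) per metric, copied from the table above.
baseline_ms = {
    "p50": (690.0, 50.6),
    "p95": (760.0, 59.0),
    "p99": (802.0, 93.5),
    "mean": (699.0, 50.6),
}

# Speedup is the ratio of the old latency to the new latency.
speedup = {metric: round(pre / post, 1) for metric, (pre, post) in baseline_ms.items()}

# Headroom of the post-FTS5 p99 under the 200 ms public target.
TARGET_P99_MS = 200.0
p99_headroom_ms = TARGET_P99_MS - baseline_ms["p99"][1]
```

The ratios reproduce the published column (13.6x, 12.9x, 8.6x, 13.8x), and the post-FTS5 p99 of 93.5 ms leaves 106.5 ms of headroom under the 200 ms target on this corpus and hardware.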
Architecture. Two-phase retrieval. Phase A asks SQLite's FTS5 sidecar (a virtual table over chunk bodies plus headings, kept in lockstep with the main chunks table by INSERT/UPDATE/DELETE triggers) for a candidate set capped at 100 to 500 rows, ordered by bm25() rank. Phase B runs a numpy-vectorized cosine reranking pass on those candidates' already-hydrated vectors. For pure-semantic queries with zero lexical overlap, a vector-only fallback decodes the entire vector matrix in a single np.frombuffer call on the joined blob bytes, runs cosine, and partial-sorts the top-K via numpy.argpartition. The Python-side BM25 fan-out that was previously on the hot path remains on the class for backward compatibility but is no longer invoked.
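The two phases and the fallback can be sketched in a few dozen lines. This is a toy reconstruction under stated assumptions, not the shipped code: the table names `chunks` and `chunks_fts`, the `embed` stand-in (a hashed bag of words, not the hash-v1 embedder), the vector width, and the candidate cap of 100 are all chosen for illustration. Only the INSERT trigger is shown, and for simplicity the vectors are fetched after Phase A rather than held pre-hydrated in memory. It needs numpy and a SQLite build with FTS5 enabled.

```python
import re
import sqlite3

import numpy as np

DIM = 16  # toy vector width, chosen for this example


def embed(text):
    """Hypothetical stand-in embedder: a hashed bag of words."""
    vec = np.zeros(DIM, dtype=np.float32)
    for token in re.findall(r"\w+", text.lower()):
        vec[sum(token.encode()) % DIM] += 1.0
    return vec


def build_index(bodies):
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE chunks (id INTEGER PRIMARY KEY, body TEXT, vec BLOB);
        CREATE VIRTUAL TABLE chunks_fts USING fts5(body);
        -- Keeps the sidecar in step on INSERT; UPDATE and DELETE are omitted here.
        CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
            INSERT INTO chunks_fts(rowid, body) VALUES (new.id, new.body);
        END;
    """)
    db.executemany("INSERT INTO chunks (body, vec) VALUES (?, ?)",
                   [(b, embed(b).tobytes()) for b in bodies])
    return db


def decode(blobs):
    """Decode every vector with one np.frombuffer call on the joined bytes."""
    return np.frombuffer(b"".join(blobs), dtype=np.float32).reshape(-1, DIM)


def cosine(matrix, query):
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return (matrix @ query) / np.where(denom == 0.0, 1.0, denom)


def search(db, query, k=3, max_candidates=100):
    tokens = re.findall(r"\w+", query.lower())
    ids = []
    if tokens:  # Phase A: lexical candidates ordered by bm25() rank
        match = " OR ".join(f'"{t}"' for t in tokens)
        ids = [row[0] for row in db.execute(
            "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH ? "
            "ORDER BY bm25(chunks_fts) LIMIT ?", (match, max_candidates))]
    if ids:     # Phase B: rerank only the candidates
        marks = ",".join("?" * len(ids))
        rows = db.execute(
            f"SELECT id, vec FROM chunks WHERE id IN ({marks}) ORDER BY id", ids).fetchall()
        path = "fts5+rerank"
    else:       # Fallback: no lexical match, so score every vector
        rows = db.execute("SELECT id, vec FROM chunks ORDER BY id").fetchall()
        path = "vector-only"
    if not rows:
        return path, []
    row_ids = [row[0] for row in rows]
    scores = cosine(decode([row[1] for row in rows]), embed(query))
    k = min(k, len(row_ids))
    top = np.argpartition(-scores, k - 1)[:k]  # partial sort: top-k in any order
    top = top[np.argsort(-scores[top])]        # order just those k by score
    return path, [(row_ids[i], float(scores[i])) for i in top]
```

Calling `search(db, "pricing decision")` on a small index takes the lexical path and reranks only the FTS5 hits, while a query whose tokens appear in no chunk falls through to the vector-only path. The partial sort matters mainly on that fallback path, where every vector in the corpus is scored and a full sort would do more work than top-K selection needs.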
Methodology. Ten production-shaped queries (the same battery the buyer's AI engineer would run: 'what did we decide about pricing in Q2', 'audit log cross-tenant isolation tests', 'consolidation daemon cadence and scope', and so on) each ran five times after a single warm-up pass per query. 32 of 32 retrieval tests passed locally; the full non-retrieval suite was 434/434 green. Reports are written to memory-runtime-pro/benchmarks/bucket-2/report-hash-5000n-*.md so any reviewer can see the full per-query distribution, not just the headline numbers.
Case-study artifacts
The pre/post case-study chart is the renewal-conversation artifact. It compares month 1 against month 6 on three rows the buyer's CFO will recognize: decisions re-litigated, hours per week answering questions, and manual context pastes. The same generator that produced the digest sample produced these. Re-running scripts/generate-sample-artifacts.py against the public synthetic tenant reproduces them byte-for-byte (modulo the SVG generated-at footer).
- Sample pre/post chart (SVG) — 21 to 15 decisions re-litigated (29% drop), 1.23 to 0.88 hours/week answering (28% drop), 18 to 4 manual pastes (78% drop).
- Sample pre/post chart (CSV) — Underlying numbers a CFO would re-export to a spreadsheet.
- Sample pre/post chart (JSON) — Machine-readable snapshot for downstream pipelines.
Methodology details for the chart's three rows live on /digest-methodology, alongside the weekly digest's three numbers and its earned-its-keep moment.
What we will not do
Mycelium will not publish latency numbers from a private internal harness against private fixtures. Numbers without a public harness and public corpus are unverifiable claims, and the procurement reader is right to ignore them. The methodology lands first; the numbers follow when both the harness and the corpus are public.
Mycelium · founded 2026