Friday Roundup Week 1: AI Memory, Rust, Holiday Hacking
🗞️ Roundup - Week 1
Welcome to 2026. The first week of January is always interesting—still in that liminal space where the holidays feel recent but the year’s momentum hasn’t fully kicked in. This year I spent that space diving deep into AI memory systems, writing my first serious Rust project, and wrapping up what’s become an annual tradition of holiday research and development.
🧠 Improving How AI Remembers: subcog vs git-notes-memory
zircote/subcog is a multi-domain, long-horizon memory system that supersedes my earlier zircote/git-notes-memory experiment. Where git-notes-memory stored context in git notes (clever, but limited), subcog provides proper semantic memory with progressive hydration, a prompt store, and CLI integration.
Think of it as giving your AI assistant actual working memory that persists across sessions and projects. It tracks context across multiple domains—codebases, documentation, past decisions—and loads relevant context progressively rather than dumping everything at once.
The architecture includes:
- Multi-domain tracking: Different memory spaces for different contexts
- Progressive hydration: Intelligent context loading based on relevance
- Prompt store: Reusable templates that evolve with your workflow
- CLI integration: Works with any MCP-enabled assistant (Claude currently, designed for broader adoption)
- Active recollection: Searches transcripts and past interactions
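As a rough mental model, here is a minimal Python sketch of multi-domain storage with progressive hydration. The `MemoryStore` API and the entries are invented for illustration; this is not subcog's actual interface:

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    domain: str       # e.g. "codebase", "docs", "decisions"
    text: str
    relevance: float  # precomputed score against the current query


@dataclass
class MemoryStore:
    entries: list[Entry] = field(default_factory=list)

    def remember(self, domain: str, text: str, relevance: float) -> None:
        self.entries.append(Entry(domain, text, relevance))

    def hydrate(self, domain: str, budget: int) -> list[str]:
        """Progressively load only the most relevant entries for one
        domain, up to a fixed budget, instead of dumping everything."""
        ranked = sorted(
            (e for e in self.entries if e.domain == domain),
            key=lambda e: e.relevance,
            reverse=True,
        )
        return [e.text for e in ranked[:budget]]


store = MemoryStore()
store.remember("codebase", "auth lives in src/auth.rs", 0.9)
store.remember("codebase", "CI config uses GitHub Actions", 0.4)
store.remember("decisions", "chose BM25 + vectors for retrieval", 0.8)
print(store.hydrate("codebase", budget=1))  # most relevant codebase entry only
```

The point of the sketch is the shape of the problem: memory is partitioned by domain, and hydration is a ranked, budgeted load rather than a full dump.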
The Benchmark Results
Comprehensive benchmarking on January 1, 2026 validated the impact. I ran all four memory evaluation frameworks through the memory-benchmark-harness: 20 questions per benchmark, 2 trials, plus a no-memory baseline.
Key finding: Memory lifts accuracy 34 percentage points over baseline (28% → 62%), and the learning effect adds another 26 points from Trial 1 to Trial 2 (36% → 62%).
| Metric | Value |
|---|---|
| Baseline Accuracy (no memory) | 28% (22/80) |
| Trial 1 Accuracy (with memory) | 36% (57/160) |
| Trial 2 Accuracy (with learning) | 62% (99/160) |
| Improvement vs Baseline | +34% |
| Learning Effect (Trial 1→2) | +26% |
Model: gpt-4o-mini. Each trial runs all 4 benchmarks with 20 questions each (80 total per trial, 160 across both trials). Hybrid vector + BM25 retrieval.
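The arithmetic behind these numbers is easy to sanity-check; note that the improvement figures are percentage-point deltas:

```python
def accuracy(correct: int, total: int) -> int:
    # Round to whole percentage points, matching the table above
    return round(100 * correct / total)


baseline = accuracy(22, 80)   # no-memory baseline
trial_1 = accuracy(57, 160)   # with memory
trial_2 = accuracy(99, 160)   # with learning

print(f"baseline={baseline}% trial1={trial_1}% trial2={trial_2}%")
print(f"vs baseline: +{trial_2 - baseline} pts, learning: +{trial_2 - trial_1} pts")
```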
Learning Progression
The learning system stores Q&A pairs when the model answers incorrectly, then retrieves them on subsequent trials:
| Trial | Correct | Accuracy | vs Baseline |
|---|---|---|---|
| Baseline | 22/80 | 28% | — |
| Trial 1 | 57/160 | 36% | +8% |
| Trial 2 | 99/160 | 62% | +34% |
The insight: Trial 1 shows immediate benefit from memory retrieval (+8%). Trial 2 shows dramatic improvement (+34%) because wrong answers from Trial 1 were stored and retrieved directly in Trial 2. The system learns from its mistakes.
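A toy version of that feedback loop, with a hypothetical `answer` function standing in for the model call and the question invented for illustration:

```python
# Toy feedback loop: store Q&A pairs for wrong answers, retrieve them next trial.
memory: dict[str, str] = {}  # question -> correct answer learned from a miss


def answer(question: str, truth: str) -> str:
    if question in memory:           # retrieved from memory: guaranteed recall
        return memory[question]
    guess = "unknown"                # stand-in for a model call that misses
    if guess != truth:
        memory[question] = truth     # store the corrected Q&A pair
    return guess


questions = {"Who wrote subcog?": "zircote"}

trial_1 = sum(answer(q, t) == t for q, t in questions.items())
trial_2 = sum(answer(q, t) == t for q, t in questions.items())
print(trial_1, trial_2)  # Trial 2 benefits from Trial 1's stored mistake
```

Trial 1 misses and stores the correction; Trial 2 retrieves it and answers correctly, which is the compounding effect the table shows.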
Per-Benchmark Breakdown
Final results from Trial 2 show subcog reaching near-perfect accuracy on memory-dependent benchmarks (77/80 questions correct = 96% aggregate):
| Benchmark | Subcog | No-Memory | Delta |
|---|---|---|---|
| LoCoMo (conversations) | 100% (20/20) | 0% (0/20) | +100% |
| LongMemEval (factual recall) | 100% (20/20) | 0% (0/20) | +100% |
| ContextBench (multi-hop) | 95% (19/20) | 35% (7/20) | +60% |
| MemoryAgentBench (consistency) | 90% (18/20) | 75% (15/20) | +15% |
LoCoMo and LongMemEval at 100%: These test memory of facts from extended conversations and retrieval of personal details from prior sessions. Without memory, the model has no access to this information (0% baseline). With subcog, perfect recall.
ContextBench at 95%: Multi-hop questions like “What is the age of someone who works with X?” require navigating entity relationships across files. Memory stores these connections, enabling 95% accuracy versus 35% baseline.
MemoryAgentBench at 90%: Tests accurate retrieval, test-time learning, long-range understanding, and conflict resolution. The modest improvement (75% → 90%) reflects that many questions embed context directly; memory still helps with conflict resolution and consistency.
What Makes Memory Work
Three technical insights emerged from the benchmarking:
- Semantic search beats keyword matching: Hybrid vector + BM25 retrieval finds relevant context even when question phrasing differs from stored information.
- Learning from mistakes compounds: Storing incorrect answers (score < 1.0) as Q&A pairs creates a feedback loop. Trial 2 benefits from Trial 1’s errors, showing +26% improvement.
- Progressive hydration matters: Loading only relevant context (not everything) keeps latency low while maintaining accuracy. Here, “entries” are individual memory records (chunked documents, code spans, conversation logs, and embeddings), not just Q&A pairs, so each question can generate many indexed entries. Memory grew from 0 to ~681 indexed entries in Trial 1, then to ~76,154 by Trial 2 as the system accumulated context across both trials.
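One common way to combine the two retrieval signals is a weighted sum of the vector similarity and a normalized BM25 score. The sketch below is illustrative, with invented candidate scores; subcog's actual weighting may differ:

```python
def hybrid_score(vec_sim: float, bm25: float, max_bm25: float,
                 alpha: float = 0.5) -> float:
    """Blend a [0,1] vector similarity with a min-max-normalized BM25 score."""
    bm25_norm = bm25 / max_bm25 if max_bm25 > 0 else 0.0
    return alpha * vec_sim + (1 - alpha) * bm25_norm


# Toy corpus: (vector similarity, raw BM25 score) per candidate entry
candidates = {"entry_a": (0.92, 1.3), "entry_b": (0.40, 7.8)}
max_bm25 = max(b for _, b in candidates.values())

ranked = sorted(
    candidates,
    key=lambda k: hybrid_score(*candidates[k], max_bm25),
    reverse=True,
)
print(ranked)  # → ['entry_b', 'entry_a']: strong keyword match outranks weaker semantic match
```

Blending means a strong exact-keyword hit can still win over a middling semantic match, and vice versa, which is what makes the hybrid robust to phrasing differences.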
🦀 From Python to Rust: The Evolution of git-notes-memory
zircote/git-notes-memory started as a Python proof of concept: semantic memory stored in git notes with vector embeddings for retrieval. It worked, validated the architecture, and taught me what mattered for memory systems.
Then I rewrote it in Rust.
The Rust version isn’t just “faster Python”—it’s a complete rethinking of how memory systems should work. Where Python afforded rapid prototyping, Rust brought:
- Performance: Semantic search is compute-intensive. Rust makes it fast enough to be invisible.
- Safety: Memory bugs in a memory system would be ironic. Rust’s ownership model eliminates entire classes of bugs.
- Type safety: Compile-time guarantees about data structures mean fewer runtime surprises.
- Concurrency: Parallel processing of embeddings and retrieval without data races.
The benchmark results speak for themselves: 95-100% success rates on LoCoMo, LongMemEval, and ContextBench. This wasn’t an incremental improvement; it validated that the approach works at production scale.
Python was the right choice for exploration. Rust is the right choice for production.
🛠️ Writing My First Rust Project with Claude
I resisted Rust longer than I should have. The learning curve looked steep, the ownership model seemed pedantic, and “it compiles, it works” felt too good to be true.
It is true.
Building subcog with @claude as my pair programmer made the Rust learning curve manageable. Claude understands the ownership model deeply, catches borrow checker issues before the code ever compiles, and explains in plain English why the compiler is complaining.
Not without friction—early on, Claude’s tendency to generate function stubs rather than complete implementations was frustrating. But once we established a working pattern, the collaboration clicked.
The experience was revelatory:
The Rust compiler is a teacher. Error messages don’t just say “this is broken”—they explain what’s wrong, why it’s wrong, and often suggest fixes. Combined with Claude’s ability to translate compiler errors into conceptual explanations, I learned Rust idioms faster than any tutorial could teach them.
Performance is exhilarating. Watching semantic search queries that took seconds in Python complete in single-digit milliseconds in Rust is viscerally satisfying. You feel the machine working with you instead of despite you.
“If it compiles, it works” isn’t hype. The type system and borrow checker are aggressive, but once your code satisfies them, entire categories of bugs simply don’t exist. No null pointer dereferences. No data races. No use-after-free. The safety guarantees are real.
Spec-driven development helped: I had the architecture validated in Python, which meant I could focus on learning Rust’s patterns rather than solving architectural problems simultaneously. The zircote/subcog spec translated cleanly from Python to Rust, proving that good abstractions transcend languages.
Would I do it again? Absolutely. Rust is now my default for systems that need both performance and reliability.
🎄 The Annual Holiday Research Tradition
This marks another year of what’s become a personal tradition: spending the quiet time between Christmas and New Year focused on research, reading papers and journals, and developing ideas that don’t fit normal work schedules.
Last year it was obscure thesis papers on plasma physics and energy systems. This year, AI long-horizon memory management.
There’s something valuable about dedicating uninterrupted time to deep technical exploration without immediate deliverables. No sprint deadlines, no stakeholder reviews—just curiosity-driven research and the space to follow interesting threads wherever they lead.
The results speak for themselves. zircote/subcog emerged from this dedicated focus, as did the benchmark harness and the insights about what makes semantic memory systems work (or not work).
It’s also been genuinely enjoyable. There’s a particular pleasure in having days where you can:
- Read a paper on memory architectures over coffee
- Implement an idea from that paper before lunch
- Benchmark it in the afternoon
- Iterate based on results
- Repeat
No context switching. No meetings. Just deep work on problems that matter.
I’m grateful to have the privilege of spending holiday time this way. Not everyone can dedicate a week to pure research, and I don’t take that lightly. But if you can carve out time for deep technical exploration—even just a day or two—I highly recommend it. The ROI on focused, curiosity-driven work is remarkable.
💡 What’s Next
Week 1 sets the tone for the year. zircote/subcog is in active development with early adopters providing feedback. The benchmark harness is available at zircote/memory-benchmark-harness for anyone interested in evaluating their own memory systems.
More Rust projects are planned—once you experience the performance and safety guarantees, it’s hard to go back to languages that make you guess whether your code is correct.
And the Friday Roundup series continues. If this format is useful, let me know. Feedback on content mix, structure, or topics is always welcome.
What are you building in 2026?
I’m curious what projects you’re excited about this year. Working on memory systems? Learning Rust? Dedicating time to research? Drop a comment or reach out—I’d love to hear what you’re focused on.
This is Week 1 of the Friday Roundup series. These posts cover projects, research, and developments across AI, developer tools, and agriculture tech. If you find this format useful, let me know and I’ll continue it throughout 2026.