Agent-Native Durability: Why I Built AXME Code
A few months ago I was caught in a cycle of zero-knowledge cold starts with Claude Code — re-explaining my stack, my architecture, and my safety rules every session. So I built AXME Code: a structured knowledge base, hook-based safety guardrails, and ~10x better token efficiency on LongMemEval than the closest competitor.
A few months ago, I was caught in a cycle of zero-knowledge cold starts. Every time I ran Claude, it was Day Zero. I was spending my own tokens re-explaining my stack, my architectural preferences, and my safety requirements on every session. I felt like I was working with a genius architect who had severe short-term memory loss. I was tired of being the “Context Engineer” for my own tools. So I built AXME Code.
The Infrastructure: The Project Oracle
AXME doesn’t just store strings. It manages a persistent, structured knowledge base in your project root. This is the Project Oracle I’ve been working on for a while now. It extracts decisions and patterns from your code, configs, and history so your agent starts every session at “senior dev” level context. Not a database. Not one long markdown file. A typed store with explicit decisions, memories, safety rules, and a per-session handoff — all readable as plain markdown and YAML, all versionable, all greppable.
Why Not Existing Market Solutions?
I looked at the market for a solution, but everything I found — from Mastra to Zep — treated agents like a search-engine problem. They weren’t building for the non-deterministic, long-running reality of software engineering. They lacked the hard guardrails and the structured persistence needed for production work.
| AXME Code | MemPalace | Mastra | Zep | Mem0 | Supermemory | |
|---|---|---|---|---|---|---|
| Capabilities | ||||||
| Structured decisions with enforcement levels | ✓ | — | — | — | — | — |
| Pre-execution safety hooks | ✓ | — | partial | — | — | — |
| Structured session handoff | ✓ | — | — | — | partial | — |
| Automatic knowledge extraction | ✓ | — | ✓ | ✓ | ✓ | ✓ |
| Project map and codebase context | ✓ | — | — | — | — | — |
| Multi-repo workspace | ✓ | — | — | — | — | — |
| Local-only storage | ✓ | ✓ | ✓ | — | — | — |
| Semantic memory search | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Multi-client support | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Capabilities total | 9/9 | 3/9 | 4/9 | 3/9 | 3/9 | 3/9 |
| Benchmarks | ||||||
| ToolEmu safety (accuracy) | 100.00% | — | — | — | — | — |
| ToolEmu safety (FPR) | 0.00% | — | — | — | — | — |
| LongMemEval E2E | 89.20% | — | 84.23% / 94.87% | 71.20% | 49.00% | 85.40% |
| LongMemEval R@5 | 97.80% | 96.60% | — | — | — | — |
| Tokens per correct | ~10K | — | ~105–119K | ~70K | ~31K | ~29K |
AXME values and MemPalace R@5 are measured. Competitor LongMemEval scores are from published results; token counts are estimates from each system's methodology. Full breakdown, footnotes, and reproduction instructions in the benchmarks README.
While other tools provide basic semantic search, they miss the critical primitives I needed: structured decisions with enforcement, pre-execution safety hooks, and multi-repo workspace awareness. I didn’t want a “memory palace”; I wanted a durable harness.
Deterministic Guardrails: Safety Hooks
Prompts are a weak defense. If an agent has terminal access, I need a deterministic blocking layer. Relying on an agent’s “best effort” to remember a safety rule is how your Friday night gets ruined.
AXME Code implements hook-based safety guardrails — pre-tool-use intercepts at the Claude Code harness level, independent of the agent’s prompt. In our ToolEmu benchmark, AXME achieved 100% safety accuracy with 0% false positives on the tested cases.
Agent: "I'll force push to update the branch..."
$ git push --force origin main
❌ BLOCKED by pre-tool-use hook
Rule: force push denied on protected branch 'main'
Result: the command was NOT executed
The hook fires before the tool call reaches the shell. There is no path that lets the agent talk its way around it — even in bypassPermissions mode.
Performance at Scale: ~10x Fewer Tokens per Correct Answer
One thing that always bothered me about existing memory systems was the background LLM noise. Most tools use an Observer/Reflector pattern — the agent is constantly narrating its own thoughts to maintain state. This leads to 100K+ tokens per simple answer.
I engineered AXME Code on a semantic Reader + Judge architecture: 2 LLM calls per question at evaluation time, no continuous narration loop.
Token efficiency on LongMemEval
Top-right = best. In our LongMemEval-based benchmark, AXME used ~10× fewer tokens per correct answer than Mastra while delivering competitive accuracy. Model-agnostic — pricing changes, token counts don't.
AXME’s tokens are measured from a 500-question run; competitor tokens are estimated from each system’s published methodology (Observer per turn, Reflector aggregation, fact extraction, graph construction). Full methodology and reproduction instructions in the benchmarks README.
By moving the “thinking” into the pre-execution phase, we maximize the agent’s focus on actual code generation.
The Anxiety of the Pause: Block, Handoff, Resume
In a professional repo, work is rarely synchronous. But in a stateless shell, “waiting” is where context goes to die. I used to feel a constant pressure to answer the agent immediately. I knew that if I closed the terminal while I went to find an API key, I’d return to a “cold” session and pay the re-onboarding tax all over again.
AXME’s approach is more boring than “asynchronous HITL”, and that’s the point:
- A safety hook blocks the dangerous command and returns the rule that triggered it.
- When the session closes, the background auditor writes a structured handoff: what was being attempted, what blocked it, what the in-progress state is.
- When you start the next session, the agent loads that handoff before doing anything else. It picks up at the exact point of the block, not at the top of the conversation.
The result is the same as the result you actually want: you can walk away mid-task without losing the context, and when you come back the agent already knows what’s unresolved. The work doesn’t have to be synchronous to be durable.
Conclusion: A Durable Harness, Not a Memory Palace
AXME Code is my response to the exhaustion of zero-knowledge development. We’ve moved past treating agent memory as just “store the chat log” — toward a durable harness where the Project Oracle provides the ground truth, structured decisions carry enforcement levels, and hooks turn safety from a polite request into a hard intercept.
It’s time to move from prompting agents to running them on infrastructure that survives the gaps between sessions — the way our codebases already do.