The AI Memory Week That Changed Knowledge Management
Karpathy's LLM wiki on April 2. Mem0 Series A on April 4. MemPalace at 23K stars in 72 hours. Letta Code launch. Hindsight Hermes. Five projects, one week, one realization: knowledge workers are fed up with AI amnesia.
A week from now, when the dust settles, the first week of April 2026 is going to look like the moment AI memory stopped being an engineering subreddit topic and became a category. Five projects shipped between April 2 and April 8. None of them coordinated. All of them were arguing the same thing from slightly different angles: the dominant approach to giving language models a memory is wrong, and the next approach is already here.
This is the editorial pass — not a launch round-up, not a benchmark wars piece, and not a product pitch. We have two longer articles on the Karpathy wiki specifically and on its honest weaknesses, and this article is the wider frame around both. What follows is the briefing version: what happened, why it happened now, what's actually new, what's overhyped, and what it means for the next twelve months.
The week, in one timeline
- April 2. Andrej Karpathy posts his "LLM Knowledge Base" workflow on X. The thread crosses sixteen million views inside five days. Quote that ends up everywhere: "a large fraction of my recent token throughput is going less into manipulating code, and more into manipulating knowledge."
- April 4. Karpathy publishes the formal write-up as a GitHub gist. The same day, Mem0 publishes its "State of AI Agent Memory 2026" report and announces a Series A round.
- April 6. MemPalace launches on GitHub. Letta releases Letta Code, a memory-first coding agent. Hindsight Hermes ships its MCP integration.
- April 6 – 8. MemPalace goes from zero to roughly 23,000 GitHub stars in seventy-two hours. A benchmark controversy erupts, the community pushes back, the project corrects itself in public.
Five releases, one week, one shared diagnosis: knowledge workers are tired of AI amnesia. Every conversation begins from scratch. Every query rediscovers context the model already saw yesterday. Nothing accumulates. Karpathy was the first major voice to say it out loud, and the moment he did, four other teams had a finished answer ready to ship within four days. Independent convergence is rarely a coincidence. It usually means the problem has been ripe for longer than the tweets suggest.
Why this week happened now
For about two years, the answer to "how do we give an LLM knowledge?" has been the same: chunk your documents, embed them in a vector database, retrieve the top-k matches at query time, generate. RAG. The architecture absorbed billions of dollars of investment and became the default story every enterprise AI startup told.
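The loop itself is small enough to sketch. Below is a toy, stdlib-only version of retrieve-and-generate, with a bag-of-words counter standing in for the embedding model; every name in it is illustrative rather than any particular library's API:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; production RAG uses dense vectors.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Top-k similarity search over pre-chunked documents.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The Q3 deadline moved to October after the platform migration.",
    "We chose Postgres over MySQL for the analytics workload.",
    "Lunch options near the office include three taco places.",
]
context = retrieve("why did we choose postgres", chunks)
# `context` gets stuffed into the prompt, the model generates an answer,
# and nothing is written back into `chunks` -- the retrieve-and-forget loop.
```

The missing write-back step is the point: nothing the model learns from a query ever changes the store.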
It also stopped feeling like enough about a year ago. The frustration was diffuse but consistent. Power users would describe it the same way in different words: the model never learns anything from me. I explain my project on Monday, I explain it again on Tuesday, I explain it again on Friday. The vector store remembers strings; it does not remember relationships, and it certainly does not remember the editorial decisions I keep making about what matters. Retrieve-and-forget is a fundamentally amnesic loop, and once you have lived inside it for long enough, you start wanting something different even if you cannot articulate what.
What April 2 – 8 demonstrated is that several teams had already articulated it, in private, and were waiting for the cultural moment. Karpathy's post was that moment. He was not the first person to build a compiled wiki — markdown wikis are older than most of the people building them — but he was the first widely-trusted figure to say "I have tried this seriously and it works better than RAG at the scale I actually live in." The dam broke because a lot of people had been holding the same idea behind the same dam.
The two architectural poles
The most useful frame for the week is to ignore the marketing and look at the architectural philosophies. Three are now visible, and two of them are directly at odds.
Compile-then-query. This is Karpathy's wiki. The LLM reads raw sources at ingest time and makes editorial decisions: what concepts matter, what links to draw, how to structure the resulting markdown article, where to update an existing page versus create a new one. At query time, the LLM navigates the compiled wiki, not the raw archive. Knowledge is, in Karpathy's framing, compiled once and kept current, not re-derived on every query. The strengths are real: minimal infrastructure, structured reasoning preserved across passes, knowledge that compounds because each ingest enriches the existing structure rather than getting flattened back into a chunk store. The weaknesses are also real: index files outgrow context windows past a few hundred articles, citation precision is loose, and the same model that compiles is the model that lints, with all the self-checking limitations that implies.
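A minimal sketch of the compile-then-query shape, with a plain callback standing in for the model's editorial pass. The directory names echo the raw/-versus-wiki/ split described later in this piece, but the layout and the `editorial_pass` callback are illustrative assumptions, not Karpathy's actual code:

```python
from pathlib import Path

RAW = Path("kb/raw")    # immutable source layer: never edited after ingest
WIKI = Path("kb/wiki")  # compiled layer: regeneratable from raw/ at any time

def ingest(source_name, text, editorial_pass):
    """Archive the source verbatim, then let the (stubbed) model decide
    which wiki page the material belongs to and enrich or create it."""
    RAW.mkdir(parents=True, exist_ok=True)
    WIKI.mkdir(parents=True, exist_ok=True)
    (RAW / f"{source_name}.txt").write_text(text)
    topic, summary = editorial_pass(text)  # the compile step: an LLM call in practice
    page = WIKI / f"{topic}.md"
    if page.exists():
        # Update the existing article rather than flattening into a chunk store.
        page.write_text(page.read_text().rstrip() + "\n\n" + summary)
    else:
        page.write_text(f"# {topic}\n\n{summary}")
    return page

# Stubbed editorial passes so the sketch runs without a model:
page = ingest("standup-04-02", "We agreed to ship Friday.",
              lambda t: ("shipping", "Decision: ship Friday."))
page = ingest("standup-04-03", "Friday slipped to Monday.",
              lambda t: ("shipping", "Update: slipped to Monday."))
# The second ingest enriched the existing "shipping" page instead of creating a new one.
```

The design choice worth noticing is that wiki/ is disposable: delete it and re-run ingest over raw/ with a better model, and you get a better wiki.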
Store-then-retrieve. This is MemPalace's pole, and it is openly an argument against the third approach we will get to in a moment. MemPalace's pitch is that you should never let a language model decide what is worth remembering, because the moment you do, you have lost information you cannot get back. Store everything verbatim. Retrieve via metadata-filtered vector search over a local SQLite-and-ChromaDB stack. The README puts the philosophical position cleanly: "Other memory systems try to fix this by letting AI decide what's worth remembering. It extracts 'user prefers Postgres' and throws away the conversation where you explained why." The strengths: no information loss, fast local retrieval, real data sovereignty, no cloud dependency. The weaknesses: there is no inherent synthesis layer, retrieval still depends on the same vector similarity that everyone else depends on, and the much-hyped "palace structure" is, when you read the code, mostly metadata filtering on top of ChromaDB. It is a good engineering choice. It is not a new shape of memory.
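The store-then-retrieve shape reduces, in miniature, to verbatim rows plus a metadata filter plus a similarity ranking. This stdlib sketch is not MemPalace's code: a real stack would rank with ChromaDB embeddings rather than word overlap, and the `project` field stands in for whatever metadata the "palace structure" actually filters on:

```python
import re
import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (id INTEGER PRIMARY KEY, project TEXT, text TEXT)")

def remember(project, text):
    # Store the whole conversation verbatim: no model decides what to keep.
    db.execute("INSERT INTO memory (project, text) VALUES (?, ?)", (project, text))

def _words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def recall(project, query, k=5):
    # Metadata filter first (the "palace" step), similarity ranking second.
    rows = db.execute("SELECT text FROM memory WHERE project = ?", (project,)).fetchall()
    q = _words(query)
    ranked = sorted(rows, key=lambda r: sum((q & _words(r[0])).values()), reverse=True)
    return [text for (text,) in ranked[:k]]

remember("atlas", "We picked Postgres because JSONB fits the analytics schema.")
remember("atlas", "Standup: CI was flaky again, rerun fixed it.")
remember("zephyr", "Postgres tuning notes for the other team.")
hits = recall("atlas", "why postgres")
```

Note what the sketch preserves that extraction throws away: the "because JSONB fits the analytics schema" clause comes back with the hit, verbatim.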
Extract-then-store. This is the third pole, and it is the one being attacked from both sides at once. An LLM extracts facts during ingestion — "Patrik prefers Postgres," "the Q3 deadline moved to October" — and stores compressed representations as the unit of memory. This is what most existing memory systems, Mem0 included, actually do. Karpathy's wiki is an argument that the extracted facts should be linked into a structured artifact rather than stored as flat fragments. MemPalace's pitch is that the act of extracting at all is the original sin — you throw away the explanation, the qualifier, the caveat, the conversation around the fact. Both are saying, in different vocabularies, that lossy summarization at ingest time is the enemy.
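The information loss is easy to make concrete. A hypothetical extraction step, with a hard-coded stand-in for the ingestion-time model call:

```python
conversation = (
    "Patrik: let's use Postgres. "
    "Ana: why not MySQL? "
    "Patrik: JSONB support, and the team already knows it. "
    "Ana: fine, Postgres then."
)

def extract_fact(conv):
    # Stand-in for the ingestion-time model call; extract-then-store
    # systems keep a compressed fact like this as the unit of memory.
    return "Patrik prefers Postgres"

memory_store = [extract_fact(conversation)]
# The stored unit keeps the conclusion and drops the reasoning: the JSONB
# rationale and Ana's objection are unrecoverable from the store alone.
```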
The honest production answer is going to involve all three. We will get to that in the closing section.
The benchmark wars
This is the part of the week that needs to be told carefully, because it is the part where the most damage to public understanding got done.
MemPalace launched with a claim of 100% on LongMemEval, the standard benchmark for long-term conversational memory. A perfect score is the kind of number that gets a project to twenty-three thousand stars in three days, and that is roughly what happened. The community then pushed back, hard, in a way that ended up being a model for how this kind of disagreement should go.
The actual story has four pieces, and all four matter.
First, MemPalace measures recall@5 — a retrieval-only metric. The official LongMemEval leaderboard measures end-to-end QA accuracy: retrieve, generate, get judged on whether the answer is right. These are not the same metric, and it is not close. As GitHub user dial481 put it on the issue thread that ended up being the cleanest summary anywhere: "A system that retrieves perfectly and then answers wrong scores 100% under recall_any@5 and 0% on the leaderboard." You can be perfect on one and zero on the other.
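The two metrics are easy to confuse precisely because they look similar in code. A simplified sketch of the distinction (real LongMemEval scoring uses an LLM judge rather than a string match, and the document IDs here are made up):

```python
def recall_any_at_k(retrieved_ids, gold_evidence_ids, k=5):
    # Retrieval-only: did ANY gold evidence item land in the top k?
    return any(doc_id in gold_evidence_ids for doc_id in retrieved_ids[:k])

def qa_correct(generated_answer, gold_answer):
    # End-to-end: is the final generated answer actually right?
    return generated_answer.strip().lower() == gold_answer.strip().lower()

# A system can retrieve the right evidence and still answer wrong:
retrieved = ["doc-17", "doc-3", "doc-9", "doc-21", "doc-4"]
gold_evidence = {"doc-3"}
perfect_retrieval = recall_any_at_k(retrieved, gold_evidence)  # True
right_answer = qa_correct("The deadline is June.", "October")  # False
```

Under recall_any@5 the example above scores perfectly; under end-to-end QA accuracy it scores zero, which is dial481's point in miniature.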
Second, the 100% score itself was achieved with three targeted patches written specifically for the three failing questions in the dev set. After the patches were removed and the system was run against the held-out 450 questions, the score dropped to 98.4%. Still excellent. Considerably more honest. The project's own BENCHMARKS.md, to its credit, ended up acknowledging the issue in plain language: "This is teaching to the test." That is a sentence you almost never see in a launch repo, and it is a sentence the community deserves more of.
Third, the headline raw-mode score is 96.6% on LongMemEval recall@5, and that one is real. Independent reproduction by user @gizmax on an M2 Ultra confirmed the number on the published evaluation set. So the underlying retrieval engine is genuinely strong. The problem was never that MemPalace was bad. The problem was that the marketing number was a different metric than the leaderboard number, and the difference was not made obvious until the community made it obvious.
Fourth, Mem0's published numbers have not had a clean ride either. Independent evaluation on LongMemEval put Mem0 at roughly 49.0% on the end-to-end task. Letta tried to reproduce Mem0's own claimed benchmarks and could not. The pattern across the week is that almost every memory project has a published headline number that does not survive a careful third-party run, and the reasons are almost always about which metric, which subset, and which judge, rather than outright dishonesty. Penfield Labs' write-up is the cleanest external take on the MemPalace-specific case.
The takeaway is the same takeaway that applied to the early days of LLM evals in general: in a category this young, treat the headline numbers as marketing until somebody else has run them. Read the methodology section, not the press release. And give credit to the projects that publish their own caveats — MemPalace's BENCHMARKS.md is, in retrospect, one of the more honest documents to come out of the week, even though it had to be dragged into existence by a public argument.
What's actually new
Strip the launch energy out and ask the harder question: what did this week genuinely contribute to the field?
Karpathy's wiki contributed an argument, well-articulated, for compile-then-query at personal scale. The schema-file pattern (a CLAUDE.md describing how the wiki should be shaped) is novel and useful. The architectural separation between raw/ and wiki/ — between the immutable source layer and the regeneratable compiled layer — is the most defensible idea in the post. Most of what people will steal from Karpathy in the next year is not the workflow itself. It is that one architectural distinction.
MemPalace contributed nineteen read-write MCP tools, more than all other memory MCPs combined have shipped to date. Most memory MCPs are read-only — you can ask them to recall something, but you cannot ask them to store something. The read-write era starts here. MemPalace also contributed a thoughtful four-layer loading system that scales context cost from roughly 170 tokens at startup up to deep search on demand, which is a real engineering choice with measurable consequences. And it contributed a local-first architecture that uses SQLite for the knowledge graph instead of requiring Neo4j the way Zep does, which is the difference between "self-hostable in theory" and "self-hostable on a laptop in practice." Finally, and this is the soft contribution that gets undervalued: the project responded to its own benchmark controversy by correcting in public, merging the critical PRs, updating the README, and walking back the overstated claims. That behavior is itself a contribution to a category that badly needs it.
Mem0 contributed the first formal "State of the field" report with its "State of AI Agent Memory 2026," and the first signal that VCs are pricing memory as a category rather than a feature. The Series A is the part that will get the press, but the report is the part that will actually shape how product teams talk about this for the rest of the year.
Letta Code contributed the framing of "memory as the OS layer." A coding agent built memory-first rather than memory-bolted-on. Whether the framing survives contact with reality is a separate question, but the framing itself is going to be quoted a lot.
Hindsight Hermes contributed a 91.4% score on LongMemEval end-to-end — a real end-to-end number, not a recall-only number — under an MIT license, with embedded PostgreSQL, MCP-first. It is the strongest evidence from the week that the open-source memory ecosystem is maturing fast and that the gap between the closed and open implementations is narrower than it looked a month ago.
That is the real contribution sheet. It is meaningful. It is also smaller than the Twitter coverage suggests, which is fine. Most weeks contribute less.
What everyone is missing
Read the codebases and the launch posts together and one absence becomes loud. None of these systems treat meeting transcripts as a first-class source.
MemPalace's repo does not mention the words "meeting" or "transcript" in any meaningful context. Mem0's chat-centric framing assumes the input is a stream of chat turns, which is the wrong shape for a meeting. Letta Code is built around the developer workflow. Hindsight Hermes is general-purpose but unopinionated about the source layer. The one project that actually names meeting transcripts as a valid raw source is, ironically, Karpathy's gist — he lists them explicitly as an input his pattern would handle — but no implementation has built the meeting-specific ingestion path.
This is the gap, and it is a strange gap to find at the end of a week about memory, because meetings are where the most valuable organizational knowledge actually gets created. Decisions, context, commitments, the why behind the what — these things happen in conversation, not in documents. A memory system that cannot ingest meetings is missing the largest source of unstructured knowledge in any organization. The longer version of this argument lives in the meeting-to-wiki gap article, and it is the single most important piece of the week's puzzle that the launches ignored.
The verbatim-versus-extraction debate, applied to meetings
There is a useful way to ground the MemPalace-versus-Mem0 philosophical disagreement in a domain people actually understand: the long-running argument inside the meeting transcription space about whether to keep the raw transcript or just keep the AI-extracted notes.
The verbatim camp is Otter.ai, Read.ai, every legal and compliance use case, any setting where the conversation itself might be the evidence. Read.ai's framing has been quoted a lot lately and it lines up almost exactly with MemPalace's pitch: "not a vague summary, but the actual discussion... captures the nuances: the edge case someone raised at minute 32." The argument is that the moment you let a model summarize, you have thrown away the minute-32 edge case, and the minute-32 edge case is almost always the part you needed.
The extraction camp is Granola, most AI meeting assistants, the entire "summary plus action items" school. The argument is that nobody re-reads transcripts, so the summary is the whole product, and the raw transcript is just an artifact of the pipeline. In Mem0's vocabulary, this is "extract the facts and throw the rest away."
The honest production answer, in meetings as in memory, is both. Store the verbatim transcript, because future questions you have not thought of yet will need it. Also extract the structured layer — decisions, entities, commitments, topics — because nobody wants to grep an audio file. Treat the structured layer as a regeneratable cache on top of the immutable raw layer. That is the same pattern Karpathy points at with his raw/ and wiki/ split, and it is the pattern that the meeting-tool conversation has been circling around for two years without quite naming. Proudfrog's bet, for what it is worth, is that meetings are the right scope for an AI-maintained knowledge base specifically because the input volume is naturally bounded — the meetings happen whether you want them to or not, and the set you care about is finite in a way that "interesting articles I read on the internet" never is.
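The both-layers pattern has a simple shape in code: an immutable raw record, and a structured layer that can always be rebuilt from it. A sketch with a keyword heuristic standing in for the extraction model; the class and field names are illustrative, not any shipping product's schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RawTranscript:
    # Immutable source of truth: the verbatim record, never rewritten.
    meeting_id: str
    text: str

@dataclass
class StructuredLayer:
    # Regeneratable cache: rebuild it from raw at any time, for example
    # with a better extraction model than the one you had at ingest.
    decisions: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

def extract(raw):
    # Stand-in for the model pass; real extraction is an LLM call.
    layer = StructuredLayer()
    for line in raw.text.splitlines():
        lowered = line.lower()
        if lowered.startswith("decision:"):
            layer.decisions.append(line.split(":", 1)[1].strip())
        elif lowered.startswith("todo:"):
            layer.action_items.append(line.split(":", 1)[1].strip())
    return layer

raw = RawTranscript("m-2026-04-02",
                    "Decision: ship Friday\nTODO: update docs\nsmall talk")
layer = extract(raw)  # discard and re-extract whenever the model improves
```

The frozen raw record is the load-bearing choice: the structured layer can be wrong, stale, or regenerated, and nothing of value is lost.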
What this week means for the next twelve months
Forward-looking, with the usual caveat that forecasts in this category have a half-life of about four weeks.
Memory becomes a first-class architectural component, not a feature. Every AI product launch from here forward is going to have to answer the question "how does this remember?" — and "we use a vector store" is going to stop being a sufficient answer the way "we use an LLM" stopped being sufficient eighteen months ago.
Local-first wins the philosophy battle, even if it does not win every market. MemPalace, OMEGA, the Karpathy wiki, Cognee, Hindsight Hermes — every interesting memory project this week emphasized local execution, data sovereignty, and user ownership of the underlying store. The Nordic angle here is real and not just rhetorical: the regulatory environment in the EU has been pushing this direction for two years, and the projects that built for it are now better-positioned than the ones that built for unconstrained cloud retention.
MCP becomes the universal interface, and the read-only era ends. Read-write memory MCPs were the missing primitive. MemPalace ships nineteen of them. The next six months will be about whether other tools follow.
The benchmark wars get worse before they get better. LongMemEval is the de facto standard, but the metric confusion exposed this week is not going away on its own. Expect a cleaner, harder, end-to-end-only benchmark to land in Q3 2026, almost certainly from one of the academic groups that has been quietly tired of the leaderboard shuffling.
Meeting transcripts are the next obvious source. Karpathy listed them. Nobody built it. Someone will, probably within ninety days, and the version that wins will be the one that treats the meeting as the unit of memory rather than treating the chat turn as the unit of memory.
Compile-versus-store converges into hybrids. Both poles have weaknesses that the other pole solves. Production systems twelve months from now will store verbatim, extract structured artifacts on top, and compile a navigable wiki on top of those. Three layers, not one. The architectural diagram already exists in a dozen private Notion docs. It will exist in public soon.
Read this if you cared about Karpathy
In the histories that get written later, the week of April 2 – 8 is not going to look like a dramatic category inflection. It is going to look like the moment four or five teams realized they had been working on the same problem from different angles and decided, in roughly the same seventy-two hours, to ship. Inflections rarely look like inflections from the inside. They look like a busy week.
The Proudfrog hypothesis, stated once, late, and without a pitch attached: meetings are the right scope for an AI-maintained knowledge base because the input is naturally bounded, the source is intrinsically conversational and high-context, and the maintenance burden scales with input volume rather than with user discipline. Whether that hypothesis survives 2026 is a question we are not going to pretend to have settled.
If you came in through the Karpathy thread and want to keep going, the rest of our coverage of the week lives in three pieces:
- Andrej Karpathy's LLM Wiki Workflow: How It Works (and Why Meetings Are the Missing Source) — the explainer.
- The Complete Guide to Karpathy's LLM Wiki Workflow — the practical replication guide.
- The LLM Wiki at Scale: Token Costs, Hallucination, and the Second Brain Graveyard — the skeptic's briefing.
And if the meeting-shaped gap in the memory landscape is the part of this that grabbed you, the meeting-to-wiki gap, the knowledge worker workflow, and the Proudfrog feature page are the rest of the argument.
Frequently Asked Questions
What's the fastest way to understand the difference between Karpathy's wiki and MemPalace?
Karpathy compiles. MemPalace stores. Karpathy's pattern reads your raw sources, makes editorial decisions about structure and links, and writes a maintained markdown wiki that you query later. MemPalace stores everything verbatim and retrieves via metadata-filtered vector search at query time, on the philosophical principle that letting an LLM decide what to remember is the original sin. They are arguing opposite halves of the same problem, and the production answer is going to use both.
Is MemPalace's 100% benchmark score real?
Partly. The 96.6% raw-mode score on LongMemEval recall@5 is real and has been independently reproduced. The 100% number was achieved with three targeted patches written for the three failing questions in the dev set, and the project's own BENCHMARKS.md now acknowledges that as "teaching to the test." The deeper issue is that recall@5 measures retrieval only, while the official LongMemEval leaderboard measures end-to-end QA accuracy — those are different metrics, and conflating them is how perfect scores get manufactured. The underlying engine is genuinely strong. The marketing was ahead of the methodology.
Why is everyone building local-first memory systems?
Three reasons stacked on top of each other. Privacy and data sovereignty, especially in the European regulatory environment. Cost — once you eliminate the cloud retrieval layer, the marginal cost of recall goes to zero. And trust — users have learned, often the hard way, that any memory store hosted by someone else is a memory store that can be deprecated, repriced, or acquired and shut down. The Limitless shutdown in December 2025 made that concrete in a way that nobody who lived through it has forgotten.
Does any of this work for meeting transcripts?
Not yet, in the sense that none of the projects that shipped this week treat meeting transcripts as a first-class source. Karpathy's gist names them as a valid input but does not implement the ingestion. MemPalace's repo does not mention meetings. Mem0's framing assumes chat turns. The structural fit is excellent — meetings are bounded, high-context, and naturally relational — but the meeting-native version of these systems still has to be built. That is the gap we keep coming back to.
What's MCP and why does it keep coming up?
The Model Context Protocol is the emerging standard for how AI systems connect to external tools and data sources. Until very recently, most memory MCPs were read-only — you could ask them to recall something but not to store something. MemPalace shipped nineteen read-write memory tools at launch, which is more than the rest of the category combined, and that is the moment the read-only era starts to end. Expect every serious memory project to ship a read-write MCP in the next six months.
Should I wait for this category to settle before building on it?
Depends on whether you can afford to be wrong. If you are building a production system that needs to be stable for the next two years, wait — the architectures are still moving fast and the benchmark situation is unreliable enough that today's leader could be tomorrow's footnote. If you are exploring the category for your own work or running a bounded internal experiment, build now, set a budget cap, treat the compiled artifact as regeneratable, and plan to throw away your first implementation. The ideas will outlast the specific tools, and the discipline of having built one will outlast both.