The LLM Wiki at Scale: Token Costs, Hallucination Contamination, and the Second Brain Graveyard

Karpathy's LLM wiki is the most viral knowledge management idea in years. Here's what nobody is saying about the token costs, the hallucination problem, and why every previous second brain tool died.

Tags: knowledge-graph, ai-knowledge, skeptic, critique, llm

In the five days after Andrej Karpathy posted his "LLM Knowledge Base" workflow on X, the thread collected more than sixteen million views. Within a week, GitHub had fifteen-plus implementations with names like llm-wiki-compiler and karpathy-wiki. DAIR.AI announced a virtual event to dissect the pattern. Obsidian forum threads filled with screenshots of self-maintaining vaults. The idea — that an LLM can compile your raw inputs into a linked, linted, queryable markdown wiki — became, briefly, the most viral knowledge management concept since Roam launched block references in 2020.

Most of what has been written so far is celebration. We wrote one of those pieces ourselves, and we stand behind it: the architecture is genuinely interesting, and the anti-RAG finding deserves attention.

But celebration is not analysis. There are three things almost nobody is saying about Karpathy's pattern, and if you are about to spend a weekend building one of these for yourself, you deserve to hear them before you start. This is the skeptic's briefing — not cynical, not a takedown, just the parts of the conversation that the hype cycle has skipped.

1. The hallucination contamination problem

Start with the deepest technical concern, because it is also the one that gets waved away most quickly.

When an LLM compiles a wiki, it is not just transcribing. It is making editorial decisions — which concepts link to which, which claims generalize, which details to preserve. Every one of those decisions is a place where a hallucination can enter the record. And once it enters, it is no longer labeled as a hallucination. It is labeled as a line in your notes.

Sebastian Raschka's Antigravity team put it bluntly in their write-up of the pattern: "If the LLM hallucinates a connection between two concepts, that false link now lives in your wiki and could influence future queries." Sergey Lyapustin's Your Second Brain Has Amnesia essay went further: "When an LLM makes up a meeting that didn't happen, or puts the wrong words in someone's mouth from your own notes, the failure mode is not that you catch it — it's that you might actually believe it."

The standard response is: "That's what the linting pass is for." Karpathy's workflow includes a linting step where the LLM sweeps the wiki looking for contradictions and gaps. In theory, a contaminated claim gets caught on the next pass.

In practice, the linter is the same class of model that introduced the hallucination. Self-checking has real limits. We have several years of research showing that LLMs are worse at catching their own errors than at catching errors written by a different system, and even cross-model checking leaves long tails of undetected mistakes. If Claude writes a plausible-but-false cross-reference on Monday, Claude on Wednesday is likely to read it, find it internally consistent, and certify it.

There is also an unsolved problem underneath all of this: citation precision. A good human wiki lets you trace a claim back to "page 47, paragraph 3" of the source. An LLM-compiled wiki has no built-in mechanism to produce that link. You can ask it to attach citations, and it will — but the citation itself can be wrong, and at that point you are auditing the auditor.

Steph Ango, Obsidian's co-creator, gave advice that reads differently once you sit with the contamination problem. He suggested keeping any AI-generated content in a separate vault so the AI can "make a mess in their own space." That is not a quirky preference. That is someone who understands markdown knowledge bases deeply telling you that the output of these systems is not yet trustworthy enough to mix with your real notes.

The practical implication is important and often missed: the contamination risk is roughly fine for personal knowledge work, where the downside of a stray error is small. It is meaningfully dangerous in any context where someone will later act on the wiki as if it were ground truth — legal research, medical notes, regulated business decisions, anything involving other people's money or safety. The Karpathy pattern is not a high-stakes knowledge store. Not yet.

2. The token cost nobody is publishing

Here is an odd gap in the discourse. Karpathy's post mentions that his wiki has grown to roughly a hundred articles and four hundred thousand words. It says nothing at all about what that costs to maintain. Neither do most of the enthusiastic write-ups. Neither do the GitHub repos.

Let's do the math that nobody else seems to want to do, with the clear caveat that it is back-of-envelope and your numbers will vary.

A typical ingest in this pattern is not cheap. When a new document arrives, the LLM has to read the document, pull up the ten to fifteen existing wiki pages most likely to be affected, decide what cross-references to update, rewrite the affected sections, and emit the diff. If each touched page averages around three thousand tokens, plus a five-thousand-token system prompt, plus reasoning overhead, you are looking at roughly thirty-five thousand tokens for a single ingest. That is for one new input.

Now add the linting pass. Linting is a full-wiki sweep in Karpathy's framing, which means it scales with the size of your wiki. A hundred-article wiki at three thousand tokens per article is three hundred thousand tokens per lint, before any reasoning. Run it weekly and you are in multi-million-token territory just for hygiene.

At Claude Opus pricing (~$15 per million input, ~$75 per million output), an active wiki user doing daily ingests and weekly lints could plausibly spend anywhere from five to fifty dollars a month depending on how aggressive they are. Sonnet cuts that substantially. Haiku or Gemini Flash cut it more. But the point is that nobody is publishing the real numbers. The community is operating on faith.
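The arithmetic above fits in one place. This is a toy model, not a measurement: the token counts are the rough figures from this section, and `OUTPUT_RATIO` is a pure assumption (the workflow emits diffs, so output volume should sit well below input volume).

```python
# Toy monthly cost model for a Karpathy-style wiki.
# Token counts are the rough figures from this section; OUTPUT_RATIO is
# an assumption, since the workflow emits diffs rather than full rewrites.

INGEST_TOKENS = 35_000       # one ingest: new doc + ~10-15 pages + prompt
LINT_TOKENS = 300_000        # full sweep of a 100-article wiki
INGESTS_PER_MONTH = 30       # daily ingest
LINTS_PER_MONTH = 4          # weekly lint
OUTPUT_RATIO = 0.10          # output tokens as a fraction of input tokens

def monthly_cost(price_in: float, price_out: float) -> float:
    """Dollars per month, given $-per-million-token input/output prices."""
    tokens_in = INGEST_TOKENS * INGESTS_PER_MONTH + LINT_TOKENS * LINTS_PER_MONTH
    tokens_out = tokens_in * OUTPUT_RATIO
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

print(f"Opus-class  (~$15 in / $75 out): ${monthly_cost(15, 75):.2f}/mo")
print(f"Sonnet-class (~$3 in / $15 out): ${monthly_cost(3, 15):.2f}/mo")
```

At these assumptions the Opus-class number lands at the very top of the five-to-fifty range; prompt caching, which this kind of repetitive-context workflow can exploit heavily, would pull it down.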

The one benchmark that does exist — the "84% token reduction per session" figure from ussumant/llm-wiki-compiler — is real, but it measures something different from what most people think it measures. It is a per-query efficiency claim: once the wiki is built, querying it uses fewer tokens than running the same query over raw sources. That is true and useful. It tells you nothing about the cost of building and maintaining the wiki in the first place, which is the cost that actually scales with your life.

Local LLMs are the obvious escape hatch. Ollama, LM Studio, llama.cpp — run a 30B or 70B model on your own machine and the marginal token cost goes to zero. The problem is quality. Compilation tasks are hard. They need long context, careful instruction following, and consistent formatting across many passes. Current open-weight models can do this, but the drop in quality from Sonnet or Opus is noticeable in exactly the dimensions that matter most for a wiki — hallucination rate, cross-reference accuracy, and stylistic consistency.

The honest summary is that nobody yet knows what this pattern costs at sustained personal scale, and the people building the loudest implementations are not the ones paying the bill at the end of the month. If you build one, set a budget cap on your API key before you start. You will learn more from the cap hitting than from any blog post.

3. The second brain graveyard

This is the section that should temper the most enthusiasm, because it is the part of the conversation with the most evidence behind it. The history of "store everything, find anything" tools is not a history of triumph. It is a history of impressive launches, big rounds, enthusiastic early users, and slow collapse.

Evernote. Once valued above a billion dollars, once serving two hundred million users, once the flagship product of the entire second brain era. The Bending Spoons acquisition gutted the free plan down to fifty notes and pushed through near-hundred-percent price increases. In 2026 the product still exists, but it is dying in public, and nobody in the knowledge management community treats it as a destination anymore.

Roam Research. Nine million dollars raised. The "tools for thought" darling of 2020 and 2021. Block references, bidirectional links, a passionate cult. By 2024 the momentum was gone. Fifteen dollars a month with an indifferent mobile experience and a founder more interested in philosophy than shipping. Active user counts trended down. It still exists. It is no longer winning.

Skiff. Fourteen million raised, two million users, privacy-forward, a genuinely good product. Acquired by Notion in February 2024. Fully shut down by August 2024. Six months from acquisition to gone.

Limitless (formerly Rewind). The always-on context layer. A pendant, a Mac app, the ambition to record your entire life and let AI make sense of it. Meta acquired them on December 5, 2025. Pendant hardware sales stopped that day. The Mac app was permanently shut down on December 19. Service ceased entirely in the EU, UK, and Brazil — the last detail reading as a quiet acknowledgment that GDPR was never really solved for "record everything."

Mem.ai. Twenty-nine million dollars raised, led by OpenAI's fund. Positioned as the AI-native note-taker that would finally crack automatic organization. Critics took to calling it "the forty-million-dollar second brain failure" when it failed to find retention. Relaunched as Mem 2.0 in October 2025. Better. Not fixed.

This is not a list of bad products. Most of these were well-built, well-funded, and run by people who understood knowledge management deeply. They failed anyway, and the failure pattern is consistent enough to be worth naming.

The productivity trap

Glasp's analysis of knowledge-management failure, echoing earlier work by APQC, identifies tool hopping as the single most common failure mode. Users cycle through Notion, Obsidian, Logseq, Capacities, Tana, Mem, ending with "five half-populated systems and no usable knowledge base." The second killer is over-engineering: users spend more time building and tweaking the system than doing the knowledge work the system was supposed to support. "The system becomes the project, and the actual knowledge work never happens."

The broader context is grim. Roughly 92% of SaaS startups fail within three years (myICOR's longitudinal analysis, 2024). The specific subset of "personal knowledge management" startups fails at or above that rate. The market has been trying and failing to solve this problem for about twenty years.

None of which is a reason not to try. But if you are reading a viral thread about Karpathy's wiki and feeling the urge to rebuild your whole information life around it, that urge should be sitting right next to the memory of Roam, Mem, Skiff, Limitless, and Evernote. They all felt inevitable too.

Why this time might actually be different

Now the steel-man, because the skeptical case is not a closed argument.

The single clearest distinction between the Karpathy pattern and every previous second brain tool is this: the old tools required you to maintain the knowledge base. That was the wall users hit. You captured into Evernote on day one, organized feverishly for two weeks, and six months later you had three thousand untagged notes and a vague sense of guilt. The maintenance load, not the capture load, was what killed these systems.

The Karpathy pattern is fundamentally different in exactly this dimension. The LLM does the maintenance. It reads, it links, it lints, it rewrites. You are no longer responsible for the shape of the knowledge base — you are responsible for the inputs and the occasional audit. That is a meaningfully different job.

Tool hopping was, in retrospect, rational behavior. Users were searching for a tool that would do the work for them, and no such tool existed. AI agents are the first thing that plausibly can. It is not that users were fickle. It is that the thing they wanted did not exist yet.

The hallucination problem is real but bounded. If you keep your raw sources immutable and treat the compiled wiki as ephemeral and regeneratable — essentially, as a cache on top of a trusted source layer — the blast radius of any given hallucination is small. You can rebuild the wiki from scratch whenever you lose trust in it. That is a property no previous note-taking tool had.
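The cache analogy can be made literal. A minimal sketch, with an illustrative directory layout and stamp-file name, of fingerprinting the source layer so you always know whether the compiled wiki still corresponds to your raw inputs or should be rebuilt:

```python
import hashlib
import json
from pathlib import Path

# Illustrative layout: sources/ is the immutable raw layer, wiki/ is derived.

def source_fingerprint(src_dir: Path) -> str:
    """Hash every raw source so the wiki can be validated like a cache entry."""
    h = hashlib.sha256()
    for f in sorted(src_dir.glob("**/*")):
        if f.is_file():
            h.update(str(f.relative_to(src_dir)).encode())
            h.update(f.read_bytes())
    return h.hexdigest()

def wiki_is_stale(src_dir: Path, wiki_dir: Path) -> bool:
    """True if sources changed since the last compile, or no wiki exists yet."""
    stamp = wiki_dir / ".compiled-from.json"
    if not stamp.exists():
        return True
    return json.loads(stamp.read_text())["fingerprint"] != source_fingerprint(src_dir)

def mark_compiled(src_dir: Path, wiki_dir: Path) -> None:
    """Record which source state this wiki was compiled from."""
    wiki_dir.mkdir(parents=True, exist_ok=True)
    stamp = wiki_dir / ".compiled-from.json"
    stamp.write_text(json.dumps({"fingerprint": source_fingerprint(src_dir)}))
```

The point of the stamp is psychological as much as technical: the moment `wiki_is_stale` returns true, the cheap and correct move is `rm -rf wiki` and a recompile, not a hand-edit.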

The cost problem may solve itself on a shorter horizon than people expect. Local models in the Gemma 3, Llama 4, and Qwen 3 generation are closing the gap on compilation-style tasks faster than most of the cost analyses assume. Twelve to eighteen months from now, the math for running this locally may simply be different.

And the shift in value proposition matters. The previous wave was about storing knowledge and hoping it would pay off later. This wave is about querying a structured version of your inputs on demand, and rebuilding the structure when it rots. That is a different deal with the user, and it might be a better one.

The honest question that decides it

Strip away the hype and the history, and the open question is narrower than it looks.

Does AI-maintained knowledge actually solve the maintenance burden, or does it just relocate the burden from "organizing notes" to "curating sources and running health checks"?

The optimistic answer: source curation is finite work. You only have so many sources you genuinely care about. Note organization, by contrast, is infinite — every new note creates new organizational decisions that compound.

The pessimistic answer: source curation is also infinite if you take it seriously. You become the editor of your own micro-Wikipedia, and editing is a full-time job for a reason.

The realistic answer is that it depends entirely on whether your inputs are bounded or unbounded. If you point the Karpathy pattern at "everything I read on the internet," you will drown. If you point it at a specific research project, a single domain you are trying to master, or a naturally constrained firehose like your own meetings, it has a real chance of working — because the maintenance problem is proportional to the input volume, and the input volume is actually capped.

Where this leaves meeting transcripts specifically

Meetings happen to have a property that directly addresses one of the "second brain graveyard" failure modes: the inputs are naturally bounded. You do not have to decide what to capture. The meetings happen whether you want them to or not, and the set of meetings you care about is finite in a way that "interesting articles" or "thoughts I had today" never are.

That makes the maintenance burden genuinely lower than a capture-everything-you-read wiki. It also makes the hallucination risk lower in a subtle way: full conversational transcripts contain a lot of context per token, so LLM extraction is more grounded than extraction from a terse note. There is more signal for the model to hold onto. The privacy and cost questions remain, and they are not small. But the structural fit is better than most domains.

Proudfrog is built around this hypothesis — that meetings are the right scope for an AI-maintained knowledge base precisely because the input is naturally constrained. Whether that hypothesis is correct is still an open question, and we are not going to pretend otherwise in a skeptical article about exactly this kind of hypothesis. If you want the more detailed version of the argument, our pieces on the meeting-to-wiki gap and the complete workflow guide dig further in, along with the case for meetings as a knowledge-worker workflow.

What the smart skeptic should do

If you have read this far and still want to try, here is the version of the experiment that will teach you the most with the least wasted time.

  • Start with a bounded use case. One specific research project, your meetings, one team's documentation. Do not start with "all of my knowledge." You will lose.
  • Keep raw sources immutable and Git-versioned. The wiki is a derived artifact. You should be able to delete it entirely and regenerate it without losing anything important.
  • Treat the wiki as regeneratable, not precious. The moment you feel protective of the compiled version, you have started building the same trap the previous generation built.
  • Set a hard budget cap on your LLM API key before you start. Not after. The budget hitting is data. Surprise bills are not.
  • Audit for hallucinations weekly for the first month. Pick five claims at random each week and trace them back to source. If you cannot, that is what you learned.
  • Decide in advance when you will quit. Write down what would prove the experiment has failed — a contamination rate, a monthly cost, a maintenance time budget — and honor it when you hit it. Otherwise sunk cost will keep you in the system long past its useful life, which is exactly how the previous generation of second brains ended up as graveyards.
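The weekly audit in the list above is easy to script. A minimal sketch, assuming the compiled wiki is a folder of markdown files; the `wiki/` path, the sample size, and the claim heuristic (any non-heading prose line) are all illustrative:

```python
import random
from pathlib import Path

def sample_claims(wiki_dir: Path, k: int, seed=None):
    """Pick k random prose lines across the wiki to trace back to source by hand."""
    rng = random.Random(seed)
    lines = []
    for page in sorted(wiki_dir.glob("**/*.md")):
        for n, raw in enumerate(page.read_text().splitlines(), start=1):
            text = raw.strip()
            # crude claim heuristic: skip blanks, headings, and list scaffolding
            if text and not text.startswith(("#", "-", "*", ">", "|")):
                lines.append((page, n, text))
    return rng.sample(lines, min(k, len(lines)))

# Print five audit targets as file:line plus the start of the claim text.
for page, n, claim in sample_claims(Path("wiki"), k=5):
    print(f"{page}:{n}  {claim[:80]}")
```

Each audit target then gets one question: can you trace this line back to a specific passage in a raw source? If not, you have found your contamination rate.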

This is not cynicism. It is the discipline that the previous wave of tools did not have. The Karpathy pattern might be the first knowledge management architecture that genuinely clears the maintenance wall. It also might not be. The only way to find out is to try it the way a skeptic tries things — bounded, auditable, cheap, and with a pre-committed exit.

See how Proudfrog approaches meeting knowledge if the bounded-input version of this argument is interesting to you.


Frequently Asked Questions

Is the LLM wiki approach actually new, or just RAG repackaged?

It is not RAG. RAG retrieves chunks at query time based on similarity; the Karpathy pattern compiles a structured artifact in advance and queries the artifact directly. They solve different problems. RAG scales better to millions of documents. Compilation works better at personal scale because it preserves reasoning and explicit cross-references that similarity search discards. The novelty is not the wiki format — wikis themselves are thirty years old — it is using the LLM as the continuous maintainer of the wiki rather than as the query engine on top of raw sources.

How much does it actually cost to run a Karpathy-style wiki?

Nobody has published a credible number. Back-of-envelope math suggests an active user doing daily ingests and weekly full-wiki lints could spend roughly five to fifty dollars a month on a frontier model like Claude Sonnet or Opus, depending on wiki size and usage intensity. Cheaper models cut that substantially; local models eliminate the marginal cost but trade away quality. Set a budget cap on your API key before you start. Treat any numbers you see in blog posts as estimates until someone publishes a real study.

What happens when the LLM hallucinates a fact and writes it into my wiki?

It stays there until you catch it. The linting pass is supposed to find contradictions, but the linter is the same class of model that introduced the error, and self-checking has real limits. The best current defense is to keep raw sources immutable and treat the compiled wiki as regeneratable — if you lose trust in the wiki, rebuild it from sources. For high-stakes contexts (legal, medical, financial decisions), the hallucination risk is currently high enough that you should not use this pattern as a primary knowledge store.

Why have so many "second brain" products died?

The consistent failure mode is that the maintenance burden eventually outweighs the value. Users capture enthusiastically, organize for a few weeks, then drift. Tool hopping compounds the problem — users cycle through systems looking for one that will do the work for them. Evernote, Roam, Mem, Skiff, and Limitless all hit some version of this wall. The Karpathy pattern is interesting specifically because it targets the maintenance wall directly, by letting the LLM do the maintenance. Whether that is enough to break the pattern is still an open question.

Is this safer with local LLMs?

Safer for privacy and cost, yes. Safer for hallucination, not really — open-weight models in the current generation hallucinate more on compilation tasks than Claude Sonnet or GPT-class models, not less. The right answer for most users in 2026 is a hybrid: use a frontier model for the compilation and linting passes, use a local model for casual querying against the finished wiki. Within twelve to eighteen months, local models will probably be good enough to handle the compilation passes too, and the cost calculus will shift.

Should I bother trying this if I'm not a developer?

Probably not yet, if you mean building one yourself from Karpathy's description. The community tooling is early, the workflows are fragile, and the debugging requires comfort with markdown, shell scripts, and prompt engineering. If you want the value without the build, wait a few months for the first wave of productized versions to shake out — or use a domain-specific tool that already implements the pattern for a bounded input like meetings, research papers, or a specific codebase. The ideas will still be here when the tools catch up.