LLM Wiki System & Docs Site

Overview

I built a compounding knowledge base for the homelab: every working session gets distilled into wiki pages, embedded into pgvector, and retrieved by AI agents in future sessions. Once it got dense enough I put a public frontend on it at homelab.nbkelley.com, a Hugo static site with 92 pages across 16 sections that rebuilds itself after every session.

Why I Built It

AI agents forget everything between sessions. Every new conversation starts from scratch, re-explaining the whole setup and what broke last time, and it gets old.

I wanted something that gets more useful the longer I use it. Every session the interesting stuff gets distilled into wiki pages; the next session the agent reads them at bootstrap and already knows the stack and what's been going on. 75 pages in, it knows things I've already forgotten.

Once the wiki got dense enough I built a public frontend for it. The interesting part was keeping the two in sync without maintaining two separate things, so the Hugo site is a fully generated artifact of the wiki source: I never touch the content/docs directory directly.

How It Works

The wiki lives on a dedicated VM. Two agents maintain it:

Agent        | Role
Claude Code  | Active sessions — reads context, writes and updates pages, answers queries
Local models | Scheduled batch — ingests transcripts and docs overnight

Raw sources drop into raw/ and the pipeline handles the rest.

Wiki Pipeline

Step 1 — Clean: either a small model stripping boilerplate or just deterministic Python if the source is already clean enough.
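
For the deterministic path, a rough sketch of the kind of cleaning involved; the patterns and paths here are illustrative assumptions, not the pipeline's actual rules:

```python
import re
from pathlib import Path

# Illustrative boilerplate patterns: timestamps and chat-UI noise.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\[\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}(:\d{2})?\]\s*"),
    re.compile(r"^\s*(system|assistant) is typing.*$", re.IGNORECASE),
]

def clean(raw_path: Path, out_dir: Path) -> Path:
    """Strip boilerplate from a raw/ source and write the cleaned text."""
    kept = []
    for line in raw_path.read_text(encoding="utf-8").splitlines():
        for pat in BOILERPLATE_PATTERNS:
            line = pat.sub("", line)
        if line.strip():
            kept.append(line)
    out_path = out_dir / raw_path.name
    out_path.write_text("\n".join(kept) + "\n", encoding="utf-8")
    return out_path
```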

Step 2 — Crystallize: a larger model reads the cleaned text alongside semantically similar existing wiki pages and decides whether to update something or write something new.
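
Roughly what that decision looks like in code, with retrieve() and run_model() standing in for the vector lookup and the local model call (both names are placeholders, not the pipeline's actual API):

```python
import json
from typing import Callable

def crystallize(
    cleaned_text: str,
    retrieve: Callable[[str, int], list[tuple[str, str]]],  # -> [(path, body), ...]
    run_model: Callable[[str], str],
    top_k: int = 3,
) -> dict:
    """Ask the larger model whether to update an existing page or create a new one."""
    similar = retrieve(cleaned_text, top_k)
    context = "\n\n".join(f"--- {path} ---\n{body}" for path, body in similar)
    prompt = (
        "You maintain a homelab wiki. Given new session notes and the most "
        "similar existing pages, decide whether to UPDATE one page or CREATE "
        'a new one. Return JSON: {"action": "update"|"create", "path": ..., "body": ...}\n\n'
        f"Existing pages:\n{context}\n\nNew notes:\n{cleaned_text}"
    )
    return json.loads(run_model(prompt))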

Step 3 — Embed: nomic-embed-text runs against the page body and writes the vector to wiki_embeddings in pgvector along with the file path, namespace, wiki, and page title. At session start Claude Code embeds the current task, does a cosine distance query against the whole table, and anything above 0.5 similarity gets loaded into context. The bootstrap reads the relevant pages before you type anything.
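
A sketch of what that embed-and-query step can look like, assuming nomic-embed-text is served by a local Ollama instance and guessing at the column names (the real table stores the file path, namespace, wiki, and page title; the DSN and endpoint here are placeholders):

```python
import psycopg2
import requests

def embed(text: str) -> list[float]:
    """Get a nomic-embed-text vector from a local Ollama instance (assumed setup)."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

def relevant_pages(task: str, threshold: float = 0.5) -> list[tuple]:
    """Embed the current task and pull every page above the similarity threshold."""
    vec = "[" + ",".join(str(x) for x in embed(task)) + "]"  # pgvector literal
    with psycopg2.connect("dbname=wiki") as conn, conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator; similarity = 1 - distance.
        cur.execute(
            """
            SELECT file_path, page_title,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM wiki_embeddings
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY similarity DESC
            """,
            (vec, vec, threshold),
        )
        return cur.fetchall()
```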

The Frontmatter Embedding Problem

Every wiki page starts with a block of nearly identical YAML frontmatter (title, version, date, namespace, tags), roughly 200 characters of the same stuff on all 75 pages, and that was enough to skew the embeddings before nomic-embed-text ever got to the actual content. Pairwise similarity was sitting at 0.725, which means everything looked like everything else and the 0.5 threshold wasn't doing anything useful.

The fix was to strip the frontmatter before embedding and run the model on the body only. Pairwise similarity dropped to 0.684 and the threshold started working like it was supposed to. One of those bugs you don't see until you happen to look at the right number.
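
A minimal sketch of that strip step, assuming the standard ----delimited YAML block at the top of each page (the real pipeline may parse the YAML properly instead of using a regex):

```python
import re

# Match a leading frontmatter block delimited by --- lines.
FRONTMATTER_RE = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)

def body_only(page_text: str) -> str:
    """Drop the frontmatter so only the page body is embedded."""
    return FRONTMATTER_RE.sub("", page_text, count=1).lstrip()
```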

Docs Site Pipeline

After every crystallize session, build_and_push.py runs and handles the whole thing: it converts everything to Hugo format, builds the site, pushes it, and Cloudflare picks it up and deploys. The conversion does most of the work because the wiki and Hugo formats are fairly different: wiki pages use [[wikilink]] syntax and carry internal frontmatter fields like namespace and version that Hugo doesn't know what to do with, so the converter builds a title-to-URL map and resolves everything at convert time. content/docs/ is completely generated; I never touch it.
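
A sketch of the conversion's core moves, the title-to-URL map and the wikilink rewrite; the slug scheme and field names here are assumptions about the real converter:

```python
import re

# [[Title]] or [[Title|label]]
WIKILINK_RE = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")
# Frontmatter fields the wiki uses internally that Hugo shouldn't see.
INTERNAL_FIELDS = {"namespace", "version", "wiki"}

def build_url_map(titles: list[str]) -> dict[str, str]:
    """Map page titles to Hugo URLs (flat slugified scheme, assumed)."""
    return {t: "/docs/" + t.lower().replace(" ", "-") + "/" for t in titles}

def resolve_wikilinks(body: str, url_map: dict[str, str]) -> str:
    """Rewrite [[wikilinks]] into standard Markdown links at convert time."""
    def repl(m: re.Match) -> str:
        title, label = m.group(1).strip(), m.group(2) or m.group(1)
        url = url_map.get(title)
        return f"[{label}]({url})" if url else label  # unresolved -> plain text
    return WIKILINK_RE.sub(repl, body)
```

Degrading unresolved titles to plain text is one way to avoid shipping dead links; the actual converter may handle misses differently.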

The visual style is hugo-book with a custom Shibui layer: monospace at 13px everywhere, headings differentiated by weight rather than size, sticky breadcrumbs, bordered code blocks. Took like 20 commits in one sitting, honestly, but it got there.

Result

75 pages across 16 sections, all embedded, with zero lint errors and no broken wikilinks. It's been running long enough that the bootstrap surfaces useful context every session, even when I'm not thinking about what I need to look up.

homelab.nbkelley.com is live with 92 pages, and it rebuilds itself automatically after every session, which is the whole point. I never have to think about whether it's current; it stays current on its own as a side effect of doing the normal wiki work.