[Discover] Inside the NERV // Open Source Division: A One-Person Semantic Archive

How Tom Doerr built a 13,000-repo visual search engine entirely on GitHub Pages.

The Architect of the Archive

In the vast, noisy expanse of GitHub, discovery is often broken. We rely on star counts or trending lists, but these rarely capture the semantic nuance of what a project actually does. Enter repo_posts, a project by Tom Doerr that functions less like a personal blog and more like a high-fidelity intelligence dashboard, which the developer calls the 'MAGI//ARCHIVE' in keeping with its Evangelion (EVA) theme.

The One-Person Army

What makes repo_posts exceptional is not just its aesthetic, but its technical ambition. This is a one-person project that manages a massive dataset of 13,000 repositories using nothing but static site hosting, clever Python automation, and client-side vector search. Instead of a heavy backend, the repo leverages GitHub Actions to transform raw repository data into a searchable, 3D-navigable knowledge graph.

Under the Hood: Semantic Engineering

This isn’t your typical Jekyll blog. The project uses a sophisticated pipeline:

Embedding Pipeline: tools/generate_related.py calculates embeddings using sentence-transformers/all-MiniLM-L6-v2.
Binary Export and In-Browser Search: Rather than relying on a server-side database, the repo exports embeddings to binary files (.f32) and uses docs/assets/js/sem.js to execute real-time cosine similarity directly in the visitor’s browser via WebGPU/ONNX.
3D Visualization: For the truly curious, tools/export_3d_coords.py applies UMAP dimensionality reduction to map these 13,000 repositories into a Three.js 3D space. It is a breathtaking way to see how software projects cluster based on their actual purpose, not just their names.
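
The export-and-search step in the pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the toy four-dimensional vectors, the repo names, and the helpers pack_f32, unpack_f32, and top_k are all hypothetical, and the real search runs in JavaScript in the browser. It also assumes the .f32 file is a flat run of little-endian float32 values, which the article does not confirm.

```python
import math
import struct

DIM = 4  # toy dimension; all-MiniLM-L6-v2 produces 384-dimensional vectors


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def pack_f32(vectors):
    """Flatten normalized vectors into little-endian float32 bytes (row-major).

    Assumption: this layout is a guess at what a .f32 export might contain.
    """
    flat = [x for vec in vectors for x in normalize(vec)]
    return struct.pack(f"<{len(flat)}f", *flat)


def unpack_f32(blob, dim):
    """Rebuild the list of vectors from raw float32 bytes."""
    count = len(blob) // 4  # four bytes per float32
    flat = struct.unpack(f"<{count}f", blob)
    return [list(flat[i:i + dim]) for i in range(0, count, dim)]


def top_k(query, vectors, k=2):
    """Return (index, score) pairs for the k rows most similar to the query."""
    q = normalize(query)
    scores = [(i, sum(a * b for a, b in zip(q, row))) for i, row in enumerate(vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings standing in for real repository vectors.
repos = ["vector-db", "static-site-generator", "embedding-search"]
embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.0, 0.8, 0.6, 0.0],
    [0.8, 0.2, 0.1, 0.3],
]

blob = pack_f32(embeddings)        # what a build step might write to disk
restored = unpack_f32(blob, DIM)   # what a client would read back
results = top_k([1.0, 0.0, 0.0, 0.2], restored, k=2)
ranked = [repos[i] for i, _ in results]
print(ranked)
```

Normalizing the vectors once at export time means the query-time work reduces to a dot product per row, which is cheap enough to run over thousands of rows on the client.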

Why It Matters

Most developers, asked to build semantic search over 13,000 repositories, would immediately spin up a PostgreSQL instance with pgvector or reach for a managed search service. Tom Doerr took the 'GitHub Pages as a platform' philosophy to its absolute limit. By handling the heavy lifting of semantic indexing in pre-build workflows, the live site remains completely static, lightning-fast, and essentially free to host. It is a masterclass in 'static-first' architecture.
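
To make the 'static-first' idea concrete, here is a rough sketch of what moving work into a pre-build step can look like: a script computes each repository's nearest neighbours once and writes the result as JSON, so the live site only has to serve a file. The article does not describe what the project's workflow actually emits, so the function build_related, the repo names, and the two-dimensional vectors are hypothetical.

```python
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_related(embeddings, k=2):
    """Map each repo name to its k most similar other repos (hypothetical helper)."""
    related = {}
    for name, vec in embeddings.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        related[name] = [other for other, _ in scored[:k]]
    return related


# Hypothetical toy vectors; a real build would load model-generated embeddings.
embeddings = {
    "repo-a": [1.0, 0.0],
    "repo-b": [0.9, 0.1],
    "repo-c": [0.0, 1.0],
}

related = build_related(embeddings, k=1)
payload = json.dumps(related, sort_keys=True)  # a CI job could commit this as a static asset
print(payload)
```

The all-pairs comparison here is quadratic in the number of repositories, which is acceptable for a build-time job that runs occasionally but is exactly the kind of cost a static site avoids paying per request.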

The Rough Edges

As a one-person project, the repository is highly opinionated. The contribution policy is strict: the site is curated automatically, and PRs for repo suggestions are rejected. It is a personal archive, not a community wiki. Additionally, the 'Liquid numeric gotcha' (where math on integer operands yields integer results, so a filter such as divided_by truncates unless one operand is a float) serves as a reminder that working within the constraints of Jekyll can occasionally require some unconventional workarounds.

The Bottom Line

If you want to see how far you can push a static site, look at the repo_posts architecture. It proves that with a bit of Python, some clever CI/CD, and a flair for UI, you can build a discovery engine that rivals commercial SaaS tools. It’s a love letter to open source, wrapped in a beautiful, glitch-aesthetic interface.

[Read full article on The Gap →](https://blog.teum.io/inside-the-nerv-open-source-division-a-one-person-semantic-archive/)
