Building askgit: semantic search for GitHub repositories
Repository
Keyword search is useful when you already know what a symbol, filename, or package is called. It is much less helpful when the question is closer to intent than exact syntax.
That gap shows up quickly when working in unfamiliar repositories. You might want to ask where authentication is enforced, how a job is scheduled, or which part of a system is responsible for chunking and indexing content. Those are natural questions, but they do not always map to a single identifier or exact string.
askgit is an attempt to make that mode of repository exploration practical. The project clones a GitHub repository, chunks the codebase with strategies that depend on file type, generates embeddings, stores the results in PostgreSQL with pgvector, and exposes search through an MCP server so assistant tooling can query it directly.
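That pipeline can be sketched as a composition of those stages. Everything below, including the function names, is illustrative rather than the project's actual API; the stage functions are passed in as parameters so each backend could be swapped or tested in isolation.

```python
# Hypothetical end-to-end pipeline sketch: clone, chunk, embed, store.
# The stage callables are injected; none of these names come from askgit itself.
def index_repository(repo_url, *, clone, iter_files, chunk_file, embed, store):
    """Clone, chunk, embed, and store one repository; returns the chunk count."""
    repo_root = clone(repo_url)                # git clone into a working dir
    chunks = [c for path in iter_files(repo_root)   # file-type-aware walk
                for c in chunk_file(path)]          # strategy depends on file type
    store(chunks, embed(chunks))               # embeddings into PostgreSQL + pgvector
    return len(chunks)
```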
At a high level, the repository contains four important pieces:
- repository ingestion, which clones a GitHub repository and walks its files
- file-type-aware chunking of the codebase
- embedding generation, with the results stored in PostgreSQL with pgvector
- an MCP server that exposes search to assistant tooling

It also includes the surrounding operational pieces needed to run the system locally: environment configuration, Docker Compose for PostgreSQL, Alembic migrations, and a small example script for agent integration.
The architecture is intentionally straightforward.
A repository is cloned, chunked according to file type, embedded, and stored in PostgreSQL with pgvector.

The value is not in a large number of moving parts. It is in choosing a pipeline that is simple enough to operate, but still smart enough to produce useful chunks for retrieval.
The ingestion path is fairly simple on paper, but it sets the baseline for everything that comes after it. It is responsible for cloning a repository and turning it into a stream of files that the rest of the pipeline can work with. If that step pulls in the wrong files, skips important directories, or loses too much structural context too early, the quality of the search results drops quickly.
Most of the retrieval quality is decided here.
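A minimal sketch of that file-walking step, with a hypothetical skip list and suffix filter; the real project's filters will differ:

```python
from pathlib import Path

# Hypothetical filters; askgit's actual ignore rules may be more elaborate.
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv", "dist"}
TEXT_SUFFIXES = {".py", ".js", ".ts", ".go", ".md", ".txt", ".toml", ".yaml"}

def iter_source_files(repo_root: str):
    """Yield indexable files from a cloned repository, skipping noise dirs."""
    for path in Path(repo_root).rglob("*"):
        # Drop anything under a skipped directory (build output, VCS metadata).
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in TEXT_SUFFIXES:
            yield path
```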
Instead of treating every file the same way, askgit uses a few chunking strategies depending on the content:

- AST-based chunking for code in languages with full AST support, preserving complete functions, methods, and classes
- language-aware splitters for code where full AST support is not available
- semantic chunking for markdown and plain text
Treating every file as if it were just plain text with a different extension turns out to be too crude. Fixed-size chunks are easy to implement, but they cut straight through function boundaries and usually lose too much context. Semantic chunking helps for prose, but it is not enough for code where functions, classes, and module boundaries carry most of the meaning. AST-aware splitting costs a bit more, but it gives back chunks that line up much better with how source code is actually read.
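To make the AST idea concrete, here is a deliberately minimal sketch for Python files using the standard library's `ast` module. A real implementation would also handle decorators, nested definitions, and chunks that exceed a size budget:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks aligned with top-level definitions."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    cursor = 0
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            # Emit loose module-level code between definitions as its own chunk.
            between = "\n".join(lines[cursor:start]).strip()
            if between:
                chunks.append(between)
            # A complete function or class becomes one chunk.
            chunks.append("\n".join(lines[start:node.end_lineno]))
            cursor = node.end_lineno
    trailing = "\n".join(lines[cursor:]).strip()
    if trailing:
        chunks.append(trailing)
    return chunks
```

Compared with a fixed window, every chunk here is a unit a reader would recognize: a full function, a full class, or the module-level glue between them.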
Embeddings are generated through LiteLLM, which keeps the provider boundary flexible. For a project like this, that matters because it keeps the focus on indexing and retrieval quality instead of tying the whole system too tightly to a single embedding backend.
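The call itself stays small because LiteLLM normalizes provider responses to a common (OpenAI-style) shape. A sketch, where the default model name is just an illustrative choice and the import is deferred so the module loads without provider configuration:

```python
def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of chunks; swapping providers only changes `model`."""
    import litellm  # deferred: no provider credentials needed at import time

    response = litellm.embedding(model=model, input=texts)
    # LiteLLM returns OpenAI-shaped data regardless of the backend.
    return [item["embedding"] for item in response.data]
```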
pgvector on PostgreSQL felt like the right tradeoff for this stage of the project. It is familiar, easy to run locally, and good enough to support the search workflow without adding another datastore just for vectors. For a developer tool that starts as a local or small shared service, that simplicity matters quite a bit.
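The storage side reduces to ordinary SQL plus the vector type and distance operator that pgvector adds. The table layout below is a hypothetical sketch; in askgit the actual schema is managed by Alembic migrations:

```python
# Hypothetical schema; the vector dimension depends on the embedding model.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        BIGSERIAL PRIMARY KEY,
    repo      TEXT NOT NULL,
    path      TEXT NOT NULL,
    content   TEXT NOT NULL,
    embedding VECTOR(1536)
);
"""

# `<=>` is pgvector's cosine-distance operator; nearest chunks come first.
SEARCH_SQL = """
SELECT path, content
FROM chunks
WHERE repo = %s
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""
```

Because both queries run inside the same PostgreSQL instance as the rest of the metadata, there is no second datastore to provision, back up, or keep in sync.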
The MCP layer is what turns askgit from a standalone indexing script into something that can participate in an assistant workflow. Once the search functionality is exposed as tools, it becomes much easier to use the repository index from the same place where the questions are already being asked.
If there is one design choice that carries most of the weight in askgit, it is this one.
Code is not just text, and the retrieval pipeline works better when it reflects that. Different languages and file types benefit from different splitting strategies. AST-based chunking helps preserve complete functions, methods, and classes. Language-aware splitters still keep more structure than plain fixed windows when full AST support is not available. Semantic chunking then covers markdown and plain text so the non-code parts of the repository remain useful too.
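That policy can be expressed as a small dispatch function. The extension-to-strategy mapping below is illustrative, not the project's exact table:

```python
from pathlib import Path

def pick_chunking_strategy(path: Path) -> str:
    """Map a file to a chunking strategy by suffix (hypothetical mapping)."""
    ast_languages = {".py"}                      # full AST support
    splitter_languages = {".js", ".ts", ".go"}   # language-aware splitters
    prose = {".md", ".txt", ".rst"}              # semantic chunking
    suffix = path.suffix.lower()
    if suffix in ast_languages:
        return "ast"
    if suffix in splitter_languages:
        return "language_aware"
    if suffix in prose:
        return "semantic"
    return "fixed_window"  # fallback for everything else
```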
For a system like this, PostgreSQL is a very reasonable default:

- it is familiar and easy to run locally
- pgvector adds vector search without introducing a separate datastore
- chunk metadata and embeddings live in one place, queried with plain SQL
That does not make PostgreSQL the universal answer, but for a project like this it keeps the whole system approachable and easy to run.
MCP is a good fit here because a retrieval engine on its own is rarely the end goal. What usually matters is repository search as one capability inside a larger assistant workflow. MCP makes that integration much more natural.
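With the Python MCP SDK, exposing search as a tool is mostly decorator plumbing. The sketch below uses `FastMCP` from the official `mcp` package; the tool body is a placeholder, and the import is deferred so the sketch loads even without the SDK installed:

```python
def build_server():
    """Build an MCP server exposing repository search as a tool (sketch)."""
    from mcp.server.fastmcp import FastMCP  # deferred: SDK is optional here

    mcp = FastMCP("askgit")

    @mcp.tool()
    def search_code(query: str, repo: str, limit: int = 10) -> list[dict]:
        """Semantic search over an indexed repository (placeholder body)."""
        # The real tool would embed `query` and run a pgvector search.
        return []

    return mcp
```

Once registered like this, the assistant calling the tool does not need to know anything about pgvector or the chunking pipeline; it just asks questions.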
The happy path is intentionally small:
- install the dependencies with uv
- copy .env.template to .env and fill in the configuration
- start PostgreSQL with Docker Compose
- run the Alembic migrations
- index a repository and start the MCP server

The exact commands are already in the repository README, which is where the operational detail belongs. For the purposes of this post, the more important point is that askgit is structured like something you can actually run and inspect, not just something that sounds plausible in a diagram.
The current version is a strong technical foundation, but it is still early in a few ways.
That is not especially surprising for a project at this stage. The more important point is that the repository already supports a real workflow and provides a solid base for improving the retrieval side over time.
If I were pushing this further, the obvious next steps would be on the retrieval side.
askgit is the kind of project this blog is meant to focus on: a practical tool with a clear job, a public repository, and a few design decisions that are worth unpacking properly.
If you want to try it or build on it, start with the linked repository above.