To answer the question more directly, I've spent the last couple of years with a few different quant models mostly running on llama.cpp and ollama, depending. The results are way slower than the paid token api versions, but they are completely free of external influence and cost.
However the models I've tested generally turn out to be pretty dumb at the quant level I have to run them at to stay relatively fast. And their code generation capabilities are just a mess not worth dealing with.
I'm lucky enough to have 95% of my docs in small markdown files so I'm just... not (+). I'm using SQLite FTS5 (full text search) to build a normal search index and using that. Well, I already had the index so I just wired it up to my mastra agents.
Each file has a short description field, so if a keyword search surfaces the doc, the agents check the description and, if it matches, load the whole doc (rough sketch below).
This took about one hour to set up and works very well.
(+) At least, I don't think this counts as RAG. I'm honestly a bit hazy on the definition. But there's no vectordb anyway.
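For the curious, a minimal sketch of that setup, assuming Python + sqlite3 and made-up table/column names (the real thing is wired into mastra agents, but the FTS5 part looks roughly like this):

    import sqlite3

    con = sqlite3.connect("docs.db")
    # One row per markdown file: path, the short description field, full body.
    con.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS docs
        USING fts5(path, description, body)
    """)

    def index_doc(path: str, description: str, body: str) -> None:
        con.execute("INSERT INTO docs (path, description, body) VALUES (?, ?, ?)",
                    (path, description, body))
        con.commit()

    def keyword_search(keywords: str, limit: int = 5):
        # bm25() is FTS5's built-in ranking; more negative = better match.
        return con.execute(
            "SELECT path, description FROM docs WHERE docs MATCH ? "
            "ORDER BY bm25(docs) LIMIT ?",
            (keywords, limit)).fetchall()

    # The agent then reads the short descriptions and decides which whole
    # docs are worth loading into context.
    for path, description in keyword_search("deploy staging"):
        print(path, "-", description)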
Don't use a vector database for code; embeddings are slow and a poor fit for code. Code likes bm25+trigram, which gets better results while keeping search responses snappy.
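If you want that combo without standing up a search service, SQLite FTS5 happens to ship both a bm25() rank function and a trigram tokenizer; a rough sketch (names made up, and you'd want to escape the MATCH query properly in real code):

    import sqlite3

    con = sqlite3.connect("code.db")
    # Word-level table for BM25 ranking; trigram-tokenized table so partial
    # identifiers (e.g. "UserById") still match.
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS code_words USING fts5(path, body)")
    con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS code_trgm USING fts5(path, body, tokenize='trigram')")

    def search_code(query: str, k: int = 10) -> list[str]:
        bm25_hits = [r[0] for r in con.execute(
            "SELECT path FROM code_words WHERE code_words MATCH ? "
            "ORDER BY bm25(code_words) LIMIT ?", (query, k))]
        trgm_hits = [r[0] for r in con.execute(
            "SELECT path FROM code_trgm WHERE code_trgm MATCH ? "
            "ORDER BY bm25(code_trgm) LIMIT ?", (query, k))]
        # Naive fusion: keep the BM25 order, then append trigram-only hits.
        seen, merged = set(), []
        for path in bm25_hits + trgm_hits:
            if path not in seen:
                seen.add(path)
                merged.append(path)
        return merged[:k]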
I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search but the constant need to re-index files was annoying and a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and if you're not using a re-ranker.
For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool-calling god compared to everything else I've tried, and fast.
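The while loop is about this much code; a sketch against Ollama's /api/chat endpoint (model name, tool schema, and the exact response fields are assumptions to check against your Ollama version):

    import subprocess, requests

    OLLAMA = "http://localhost:11434/api/chat"
    TOOLS = [{
        "type": "function",
        "function": {
            "name": "ripgrep",
            "description": "Search the repo with ripgrep and return matching lines",
            "parameters": {
                "type": "object",
                "properties": {"pattern": {"type": "string"}},
                "required": ["pattern"],
            },
        },
    }]

    def ripgrep(pattern: str) -> str:
        out = subprocess.run(["rg", "-n", "--max-count", "20", pattern],
                             capture_output=True, text=True)
        return out.stdout[:8000] or "no matches"

    messages = [{"role": "user", "content": "Where is the retry logic implemented?"}]
    while True:
        resp = requests.post(OLLAMA, json={"model": "gpt-oss:20b", "messages": messages,
                                           "tools": TOOLS, "stream": False}).json()
        msg = resp["message"]
        messages.append(msg)
        if not msg.get("tool_calls"):
            print(msg["content"])
            break
        for call in msg["tool_calls"]:
            result = ripgrep(call["function"]["arguments"]["pattern"])
            messages.append({"role": "tool", "content": result})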
I'm finding static embedding models quite fast.
lee101/gobed (https://github.com/lee101/gobed) is about 1ms on GPU :) It would need to be trained for code, though; the bigger code LLM embeddings can be high quality too, so it's really a question of where on the Pareto frontier is ideal. Often, yeah, you're right that it tends to be bm25 or rg even for code, but more complex solutions are possible too if it's really important that the search is high quality.
I am surprised to see very few setups leveraging LSP (Language Server Protocol) support.
It was added to Claude Code last month.
Most setups rely on naive grep.
For the purposes of learning, I’ve built a chatbot using ollama, streamlit, chromadb and docling. Mostly playing around with embedding and chunking on a document library.
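For anyone after the same starting point, the chromadb side is only a few lines; a sketch with naive paragraph chunking (docling would sit in front of this to turn PDFs etc. into text, and chromadb's default embedding model handles the rest):

    import chromadb

    client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep it
    collection = client.create_collection("library")

    def chunk(text: str, max_chars: int = 1200) -> list[str]:
        # Naive chunking: split on blank lines, pack paragraphs up to max_chars.
        chunks, current = [], ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(current)
        return chunks

    pieces = chunk(open("notes/some_document.md").read())
    collection.add(documents=pieces,
                   ids=[f"some_document-{i}" for i in range(len(pieces))])

    hits = collection.query(query_texts=["how do I rotate the API keys?"], n_results=3)
    print(hits["documents"][0])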
The Nextcloud MCP Server [0] supports Qdrant as a vectordb to store embeddings and provide semantic search across your personal documents. This turns any LLM & MCP client (e.g. Claude Code) into a RAG system that you can use to chat with your files.
For local deployments, Qdrant supports storing embeddings in memory as well as in a local directory (similar to sqlite) - for larger deployments Qdrant supports running as a standalone service/sidecar and can be made available over the network.
[0] https://github.com/cbcoutinho/nextcloud-mcp-server
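A minimal sketch of those local modes with the qdrant-client Python package (collection name, vector size, and the toy vectors are placeholders; real embeddings would come from whatever local model you use):

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    # ":memory:" keeps everything in RAM; path="./qdrant_data" persists to a local
    # directory (the sqlite-like mode); a URL would point at a standalone server.
    client = QdrantClient(":memory:")

    client.create_collection(
        collection_name="notes",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    client.upsert(collection_name="notes", points=[
        PointStruct(id=1, vector=[0.1] * 384, payload={"path": "notes/todo.md"}),
    ])

    hits = client.search(collection_name="notes", query_vector=[0.1] * 384, limit=3)
    for hit in hits:
        print(hit.payload["path"], hit.score)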
TL;DR:
- chunk files, index chunks
- vector/hybrid search over the index
- node app to handle requests (was the quickest to implement, LLMs understand OpenAPI well)
I wrote about it here: https://laurentcazanove.com/blog/obsidian-rag-api
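The request-handling piece is small in any language; a sketch of the same shape in Python/FastAPI rather than node, purely to keep the snippets in this thread in one language (the search() body is a stand-in for whatever chunk index you built):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="notes-search")

    class Hit(BaseModel):
        path: str
        snippet: str
        score: float

    def search(query: str, k: int) -> list[Hit]:
        # Stand-in: replace with your vector/hybrid lookup over the chunk index.
        return [Hit(path="notes/example.md", snippet="...", score=0.42)][:k]

    @app.get("/search", response_model=list[Hit])
    def search_endpoint(q: str, k: int = 5) -> list[Hit]:
        # FastAPI serves an OpenAPI spec at /openapi.json, which is what the
        # LLM reads to figure out how to call this.
        return search(q, k)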
lee101/gobed (https://github.com/lee101/gobed): static embedding models, so documents are embedded in milliseconds, plus on-GPU search with a CAGRA-style on-GPU index and a few things for speed like int8 quantization of the embeddings and fusing embedding and search into the same kernel. The embedding really is just a trained map of embeddings per token, averaged.
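The int8 quantization trick is roughly this (a numpy sketch of the general idea, not gobed's actual code):

    import numpy as np

    def quantize_int8(vectors: np.ndarray):
        # Per-vector symmetric quantization: int8 values plus a float scale.
        scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0 + 1e-12
        return np.round(vectors / scale).astype(np.int8), scale

    def search(query: np.ndarray, q_index: np.ndarray, scales: np.ndarray, k: int = 5):
        # Dequantize on the fly (or keep the math in int8 on GPU) and rank by dot product.
        scores = (q_index.astype(np.float32) * scales) @ query.astype(np.float32)
        return np.argsort(-scores)[:k]

    docs = np.random.randn(10_000, 256).astype(np.float32)
    q_index, scales = quantize_int8(docs)
    print(search(docs[0], q_index, scales))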
Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 - BM25 search implemented in PL/pgSQL (Unlicense / public domain).
The repo also includes plpgsql_bm25rrf.sql, a PL/pgSQL function for hybrid search (plpgsql_bm25 + pgvector) with Reciprocal Rank Fusion, plus Jupyter notebook examples.
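For reference, Reciprocal Rank Fusion itself is tiny; the same idea as the SQL function, sketched here in Python (k=60 is the usual constant):

    def rrf(rankings, k: int = 60):
        # rankings: one ranked list of doc ids per retriever,
        # e.g. [bm25_results, pgvector_results].
        scores = {}
        for ranked in rankings:
            for rank, doc_id in enumerate(ranked, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(rrf([["doc3", "doc1", "doc7"], ["doc1", "doc9", "doc3"]]))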
Question being: WHY would I be doing RAG locally?
https://pypi.org/project/faiss-cpu/
If the total size of your data isn't too large...?
Data being a plural gets me.
You might have small datums but a lot of kilobytes!
Works well, but I haven't tested it at larger scale.
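For completeness, the faiss-cpu version of the keep-it-in-RAM approach is about ten lines (flat index, exact search; fine as long as the data fits in memory):

    import faiss
    import numpy as np

    dim = 384
    # Random vectors stand in for your embeddings; normalize so inner product = cosine.
    docs = np.random.randn(5_000, dim).astype("float32")
    faiss.normalize_L2(docs)

    index = faiss.IndexFlatIP(dim)  # exact, brute-force search, all in RAM
    index.add(docs)

    query = docs[:1].copy()
    scores, ids = index.search(query, 5)
    print(ids[0], scores[0])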
Also, I've got no idea what this product does; this is just a generic page of topical AI buzzwords.
Don't tell me what it is, /show me why/ you built it. Then go back and keep that reasoning in; show me why I should care.