
Don't use a vector database for code; embeddings are slow and a poor fit for code search. Code responds well to BM25 + trigrams, which gets better results while keeping search responses snappy.
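For intuition, here's a minimal pure-Python sketch of the BM25-over-trigrams idea (a toy, not a production indexer; the class and helper names are mine, and real systems would use an inverted index rather than scoring every document):

```python
import math
import re
from collections import Counter

def trigrams(text):
    # Lowercase, split identifiers on non-alphanumerics, then emit character
    # trigrams; trigrams tolerate partial matches like "getUser" vs "get_user".
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    out = []
    for tok in tokens:
        if len(tok) < 3:
            out.append(tok)
        else:
            out.extend(tok[i:i + 3] for i in range(len(tok) - 2))
    return out

class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.docs = [Counter(trigrams(d)) for d in docs]
        self.n = len(docs)
        self.avgdl = sum(sum(d.values()) for d in self.docs) / self.n
        self.k1, self.b = k1, b
        self.df = Counter()  # document frequency per trigram
        for d in self.docs:
            self.df.update(d.keys())

    def score(self, query, i):
        d = self.docs[i]
        dl = sum(d.values())
        s = 0.0
        for t in trigrams(query):
            if t not in d:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            tf = d[t]
            s += idf * tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def search(self, query, k=5):
        order = sorted(range(self.n), key=lambda i: self.score(query, i),
                       reverse=True)
        return order[:k]
```

Because scoring happens over trigrams, a camelCase query still hits snake_case definitions, which is much of why this combination works well on code.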


You can do hybrid search in Postgres.

Shameless plug: https://github.com/jankovicsandras/plpgsql_bm25 — BM25 search implemented in PL/pgSQL (Unlicense / public domain).

The repo also includes plpgsql_bm25rrf.sql, a PL/pgSQL function for hybrid search (plpgsql_bm25 + pgvector) with Reciprocal Rank Fusion, plus Jupyter notebook examples.
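Reciprocal Rank Fusion itself is tiny; here's a generic Python sketch (not the repo's PL/pgSQL code) that merges a BM25 ranking with a vector-search ranking. The constant k=60 is the conventional value from the original RRF paper:

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids via Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; documents that
    appear high in multiple rankings float to the top of the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second by both BM25 and the vector search typically beats one ranked first by only a single retriever, which is exactly the behavior you want from hybrid search.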


Wow, very impressive library — great work!


I agree. Someone here posted a drop-in for grep that added the ability to do hybrid text/vector search, but the constant need to re-index files was a drag. Moreover, vector search can add a ton of noise if the model isn't meant for code search and you're not using a re-ranker.

For all intents and purposes, running gpt-oss 20B in a while loop with access to ripgrep works pretty dang well. gpt-oss is a tool-calling god compared to everything else I've tried, and fast.


Say more!


Anybody know of a good service / docker that will do BM25 + vector lookup without spinning up half a dozen microservices?


Here's a Dockerfile that will spin up postgres with pgvector and paradedb https://gist.github.com/cipherself/5260fea1e2631e9630081fb7d...

You can use pgvector for the vector lookup and ParadeDB for BM25.


For BM25 + trigram, SQLite FTS5 works well.
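A minimal sketch of that with Python's stdlib sqlite3 (assumes your SQLite build ships FTS5, which standard CPython builds do; the table and column names are mine). FTS5's bm25() auxiliary function returns lower-is-better scores, and SQLite ≥ 3.34 additionally offers `tokenize='trigram'`:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Default unicode61 tokenizer; on SQLite >= 3.34 you could instead create the
# table with: USING fts5(path, body, tokenize='trigram') for substring matching.
con.execute("CREATE VIRTUAL TABLE code USING fts5(path, body)")
con.executemany("INSERT INTO code VALUES (?, ?)", [
    ("auth.py", "def verify_token(token): ..."),
    ("db.py", "def connect(dsn): ..."),
])
# bm25(code) is smaller for better matches, so ascending ORDER BY ranks best first.
rows = con.execute(
    "SELECT path, bm25(code) FROM code WHERE code MATCH ? ORDER BY bm25(code)",
    ("token",),
).fetchall()
```

Zero services to run: the whole "search engine" is one file (or, here, in-memory).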


Elasticsearch / OpenSearch is the industry standard for this.


Used to be, but they're very complicated to operate compared to more modern alternatives and have just gotten more and more bloated over the years. They also require a bunch of different applications for different parts of the stack in order to do the same basic stuff as e.g. Meilisearch, Manticore or Typesense.


>very complicated to operate compared to more modern alternatives

Can you elaborate? What makes the modern alternatives easier to operate? What makes Elasticsearch complicated?

Asking because in my experience, Elasticsearch is pretty simple to operate unless you have a huge cluster with nodes operating in different modes.


Sure, I've managed both clusters and single-node deployments in production until 2025, when I changed jobs. Elastic definitely does have its strengths, but they're increasingly enterprise-oriented and appear not to care much about open source deployments. At one point Elastic itself had a severe regression in an irreversible patch update (!?) which took weeks to fix, forcing us to recover from backup and recreate the index.

The documentation is, or has been, ambiguous and self-contradictory on a lot of points. The Debian Elastic Enterprise Search package upgrade script was incomplete, so there's a significant manual process for updating the index even for patch updates. The interfaces between the different components of the ELK stack are incoherent and there are literally a thousand ways to configure them. Default setups have changed a lot over the years, leading to incoherent documentation. You really need to be an expert at Elastic in order to run it well, or pay handsomely for the service. It's simply too complicated and costly for what it is, compared to more recent alternatives.


Meilisearch


This is true in general with LLMs, not just for code. LLMs can be told that their RAG tool is using BM25 + N-grams, and will search accordingly. Keyword search is superior to embeddings-based search. The moment Google switched to BERT-based embeddings for search, everyone agreed it was going downhill. Most forms of early enshittification were simply switching from BM25 to embeddings-based search.

BM25/tf-idf and N-grams have always been extremely hard-to-beat baselines in information retrieval. This is why embeddings still haven't led to a "ChatGPT moment" in information retrieval.


I'm finding static embedding models quite fast: lee101/gobed (https://github.com/lee101/gobed) runs in about 1 ms on GPU :) It would need to be trained for code, though; the bigger code-LLM embeddings can be high quality too, so it's really a question of where the ideal point sits on the Pareto frontier. Often, yeah, you're right, it tends to be BM25 or rg even for code, but more complex solutions are possible too if search quality is really important.


I've gotten great results applying it to file paths + signatures. Even better if you also fuse those results with BM25.


I like embeddings for natural language documents where your query terms are unlikely to be unique, and overall document direction is a good disambiguator.


With AI needing more access to documentation, WDYT about using RAG for documentation retrieval?


IME most documentation is coming from the web via web search. I like agentic RAG for this case, which you can achieve easily with a Claude Code subagent.




