Ask HN: How to build a cost effective MVP for niche web search? (coda.io)
2 points by wbarber on Feb 15, 2023 | 6 comments


I'm a minimally technical entrepreneur who could use some guidance on piecing together an MVP for a search product.

Some quick background on the idea (also in the linked document):

What Users Want: Ability to quickly generate a list of company websites that match their custom industry definition/query.

MVP Goal: Complex enough to properly validate or reject the idea, but simple enough that reaching that validation/rejection step isn't prohibitively expensive or time-consuming. Ideally, I can add features/complexity post-validation without completely changing the architecture (for example, starting with something like BM25 and adding vector search after initial validation).

Example user input queries: "software for catering businesses", "crane inspection service", "laboratory reagent suppliers".

Output for each case would be a list of relevant businesses they can further filter by relevant criteria like employee count, location, etc.

Some key questions I could use some help answering: a) At present, how much value will vector search add beyond BM25/BM25F or similar? b) Given the recent rate of progress in LLMs, I expect embeddings for search to improve at a similar rate; is it reasonable to assume I'll be implementing vector-based search in the near future even if it's not part of the MVP?

I've shared some of my research so far in the linked document and would really appreciate some feedback on it. How would you build this MVP if you were trying to do it bootstrapped/solo?


The basic trouble with LLMs is that they have a fixed attention window, often 512 (BERT) to 4096 (ChatGPT) tokens. If you are handling documents that fit in that window they are magical, but once you go outside of what they were trained for they don't really beat BM25 and other classical methods anymore.

Certainly larger models will come and people might find ways to make more scalable LLMs, but for now you are going to be crunching your documents down to size.
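To give a sense of what "crunching down to size" looks like in practice, here's a minimal chunking sketch (assuming the Hugging Face transformers package and a BERT-sized 512-token window; the file name is just a placeholder):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def chunk_document(text, max_tokens=510):
        # 510 leaves room for the [CLS] and [SEP] special tokens
        ids = tokenizer.encode(text, add_special_tokens=False)
        return [tokenizer.decode(ids[i:i + max_tokens])
                for i in range(0, len(ids), max_tokens)]

    chunks = chunk_document(open("company_page.txt").read())

Each chunk then gets embedded or indexed on its own, and you have to decide how to combine chunk scores back into a per-document score.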

It is a path less taken in the industry but there is a methodology for evaluating search engines, see

https://github.com/usnistgov/trec_eval

You can certainly try using BM25 and decide off the cuff whether you like it or not, but if you want to try a lot of different things you're going to need a set of documents, queries, and evaluated responses ("is this relevant?").
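If you just want to kick the tires on BM25 before building any of that, a minimal sketch using the rank_bm25 package (the corpus and query here are placeholders) is:

    from rank_bm25 import BM25Okapi

    corpus = [
        "Acme Catering Software - point of sale for caterers",
        "Crane Safety Inspections Ltd - certified crane inspection services",
        "LabChem Inc - laboratory reagents and chemical supplies",
    ]
    tokenized_corpus = [doc.lower().split() for doc in corpus]

    bm25 = BM25Okapi(tokenized_corpus)
    query = "software for catering businesses".lower().split()
    print(bm25.get_top_n(query, corpus, n=2))  # best-matching documents first

The evaluation set is what tells you whether a change to tokenization, ranking, or anything else actually helped.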

I'd imagine you could train a retrieval model on that kind of data much like they train ChatGPT. It's probably not as hard, but it would be a substantial project that needs a lot of training data; I bet you could beat cosine similarity on the vectors, though.
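For reference, the cosine-similarity baseline I mean is roughly this (a sketch assuming the sentence-transformers package; the model name is just one common choice):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = [
        "Acme Catering Software - point of sale for caterers",
        "Crane Safety Inspections Ltd - certified crane inspection services",
    ]
    doc_embeddings = model.encode(corpus, convert_to_tensor=True)
    query_embedding = model.encode("crane inspection service", convert_to_tensor=True)

    # 1 x len(corpus) matrix of similarities; rank documents by these scores
    print(util.cos_sim(query_embedding, doc_embeddings))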


It also depends on how you process your documents to create embedding(s). The open-source tensor search project marqo (https://github.com/marqo-ai/marqo) does a great job of dealing with these types of fixed attention window problems.


There seem to be a few "long range transformer" options (https://huggingface.co/blog/long-range-transformers). They're slower (not an issue for my use case, though). One talk I heard on this issue mentioned there are some "tricks" you can use with web docs (HTML), like hrefs, to get creative. But more commonly, from what I've read, people try different approaches for segmenting the documents and then embed those segments. I've heard good things about using a Hidden Markov Model for the segmenting step.
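As I understand the segment-then-embed idea, it's roughly this (a naive sketch assuming sentence-transformers, with fixed-size word segments rather than an HMM): score the query against each segment and let the best segment stand in for the whole document.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def segment(text, words_per_segment=200):
        words = text.split()
        return [" ".join(words[i:i + words_per_segment])
                for i in range(0, len(words), words_per_segment)]

    def score_document(query, document_text):
        segments = segment(document_text)
        segment_embeddings = model.encode(segments, convert_to_tensor=True)
        query_embedding = model.encode(query, convert_to_tensor=True)
        # the document's score is its best-matching segment
        return float(util.cos_sim(query_embedding, segment_embeddings).max())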

I hadn't seen that approach to evaluating search engines. Looking at the GitHub repo, though, I'm not quite following how I would use it, what the standard approach is for scoring relevance-ranking methods, or how this approach differs from that standard. If it's not too much trouble, a tl;dr on that would be a really useful intro.

I do intend to set up the UX so the user base can re-rank results, or at least flag which results were irrelevant, to help with re-training the model over time. For certain labels where very limited domain knowledge is required (for example, is this a hardware product or a software product? is this a product or a service?), I can also use Mechanical Turk or similar to label data, and I fully intend to do that.


I've found Longformer models on Hugging Face that come mated to a tokenizer and are already trained, and those look usable; for most of the other long-range transformers that still seems aspirational for now.
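For example, allenai/longformer-base-4096 on Hugging Face ships with its own tokenizer and loads with the standard transformers API (whether it actually helps your retrieval task is a separate question):

    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = AutoModel.from_pretrained("allenai/longformer-base-4096")

    inputs = tokenizer("text of a long web page ...", return_tensors="pt",
                       truncation=True, max_length=4096)
    outputs = model(**inputs)  # hidden states for up to 4096 tokens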

If you want to understand how to evaluate search engines, the best resource is this book:

https://www.amazon.com/TREC-Experiment-Evaluation-Informatio...

Basically you make a set of queries, then you make a list of judgements to the effect that "document A (is|is not) relevant to query B"; the code in that GitHub repository is used by conference-goers to merge those judgements with search results to make a precision-recall curve. You can download document collections and judgements from the TREC website to get started, but what works for their collection may or may not work well for yours.
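Concretely, trec_eval reads two whitespace-delimited text files: a qrels file of judgements ("query_id iteration doc_id relevance") and a run file of your system's ranked results ("query_id Q0 doc_id rank score run_name"). A sketch of writing both from Python and calling the tool (all IDs and scores are placeholders):

    import subprocess

    # Judgements: for query 1, acme.com is relevant (1), bobsburgers.com is not (0)
    with open("qrels.txt", "w") as f:
        f.write("1 0 acme.com 1\n")
        f.write("1 0 bobsburgers.com 0\n")

    # Your engine's ranked output for query 1
    with open("run.txt", "w") as f:
        f.write("1 Q0 acme.com 1 12.7 my_bm25_run\n")
        f.write("1 Q0 bobsburgers.com 2 3.1 my_bm25_run\n")

    # Prints MAP, precision at various cutoffs, etc.
    result = subprocess.run(["trec_eval", "qrels.txt", "run.txt"],
                            capture_output=True, text=True)
    print(result.stdout)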

The story of that TREC conference was that "99% of what you think will improve your search results won't", and the BM25 algorithm was the first great discovery that came out of it after 5 years of disappointment. I learned to be skeptical about any kind of "break a document up into subdocuments and rank the subdocuments" because that was one of the many ideas that people struggled to make work back in the day.

There definitely are ways to look at documents a piece at a time systematically for particular tasks and sometimes simple answers will work ("Who won the sports game?" is almost always answered at the beginning of a newspaper article.) Most of the simple ways of doing it people try (like averaging vectors) are like having Superman huff some Kryptonite before you test him though.

Look up my profile and send me an email.


I'll follow up via email. Thanks Paul.



