There seem to be a few "long range transformer" options (https://huggingface.co/blog/long-range-transformers). They're slower (not an issue for my use case though). One talk I heard on this issue mentioned that there are some "tricks" you can use with web docs (HTML), like hrefs, to get creative. But more commonly, from what I've read, people try different approaches for segmenting the documents and then embed those segments; I've heard good things about using a Hidden Markov Model for the segmenting step.
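For what it's worth, the segment-then-embed idea I'm describing looks roughly like this. It's a minimal sketch: the sentence-transformers library and the all-MiniLM-L6-v2 model are placeholders, and a plain fixed-size word window stands in for the HMM-based segmentation mentioned above.

```python
from sentence_transformers import SentenceTransformer

def segment(text: str, window: int = 200, stride: int = 150) -> list[str]:
    """Split a long document into overlapping word windows (stand-in for real segmentation)."""
    words = text.split()
    last_start = max(len(words) - window, 0)
    return [" ".join(words[i:i + window]) for i in range(0, last_start + 1, stride)]

# Placeholder model; any sentence encoder with a reasonable max length works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

def embed_document(text: str):
    """One embedding per segment; index and search these instead of one vector per document."""
    segments = segment(text)
    return model.encode(segments)  # shape: (num_segments, embedding_dim)
```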
I hadn't seen that approach to evaluating search engines. But looking at the GitHub repo, I'm not quite following how I would use it, what the standard approach is for scoring relevance-ranking methods, or how this approach differs from that standard. If it's not too much trouble, a tl;dr on that would be a really useful intro.
I do intend to set the UX up so that the user base can re-rank, or at least provide feedback on which results were irrelevant, to help with re-training the model over time. For certain applications where very limited domain knowledge is required (for example: is this a hardware product or a software product? is this a product or a service?), I can also use Mechanical Turk or similar to label data, and I fully intend to do that.
I've found Longformer models on Hugging Face that come mated to a tokenizer and already trained, which look usable, but for most of the other long-range transformers that still seems aspirational right now.
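Something like this is what I mean by "mated to a tokenizer and trained"; it's a rough sketch, allenai/longformer-base-4096 is just one example of such a checkpoint, and the pooling at the end is a naive placeholder rather than anything tuned for retrieval.

```python
from transformers import AutoTokenizer, AutoModel

# One example of a long-range model that ships with a matching tokenizer.
checkpoint = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

long_document_text = "replace this with the full text of a long document " * 200

inputs = tokenizer(long_document_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)

# Naive pooling: take the hidden state of the first token as a document vector.
doc_embedding = outputs.last_hidden_state[:, 0]
```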
If you want to understand how to evaluate search engines, the best resource is this book.
Basically you make a set of queries, then you make a list of judgements to the effect that "document A (is|is not) relevant to query B"; the code in that GitHub repository is what conference-goers use to merge those judgements with search results to produce a precision-recall curve. You can download document collections and judgements from the TREC website to get started, but what works for their collections may or may not work well for yours.
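As a toy illustration of that merge step (not the actual code in the repository), computing precision and recall at each rank cutoff from a set of judgements and a ranked result list looks something like this:

```python
# Relevance judgements: query -> set of documents judged relevant.
qrels = {
    "q1": {"doc3", "doc7", "doc9"},
}
# Run: query -> documents in the order your engine returned them.
run = {
    "q1": ["doc7", "doc1", "doc3", "doc4", "doc9", "doc2"],
}

def precision_recall_points(query: str):
    """Merge judgements with a ranked list to get (precision@k, recall@k) at each cutoff."""
    relevant = qrels[query]
    hits = 0
    points = []
    for k, doc in enumerate(run[query], start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / k, hits / len(relevant)))
    return points

for p, r in precision_recall_points("q1"):
    print(f"precision={p:.2f} recall={r:.2f}")
```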
The story of that TREC conference was that "99% of what you think will improve your search results won't"; the BM25 algorithm was the first great discovery to come out of it, after five years of disappointment. I learned to be skeptical of any kind of "break the document up into subdocuments and rank the subdocuments" scheme, because that was one of the many ideas people struggled to make work back in the day.
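For reference, BM25 scores a document against a query roughly as in the sketch below; the k1 and b values are the commonly used defaults, and the idf and length statistics are assumed to come from your index rather than being computed here.

```python
def bm25_score(query_terms, doc_term_freqs, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Sum a saturated, length-normalized term-frequency weight over the query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_term_freqs.get(term, 0)
        if tf == 0:
            continue
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf.get(term, 0.0) * norm
    return score
```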
There definitely are ways to look at documents a piece at a time systematically for particular tasks, and sometimes simple answers work ("Who won the sports game?" is almost always answered at the beginning of a newspaper article). Most of the simple approaches people try (like averaging vectors), though, are like having Superman huff some Kryptonite before you test him.
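To be concrete, the "averaging vectors" shortcut I'm warning about is just mean-pooling the per-segment embeddings into a single document vector, something like:

```python
import numpy as np

# Each row stands in for the embedding of one segment of a long document.
segment_embeddings = np.random.rand(12, 384)

# The naive shortcut: collapse them into one document vector by averaging,
# which tends to wash out the signal from the one segment that actually
# answers the query.
doc_vector = segment_embeddings.mean(axis=0)
```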