Yes! We've been running Milvus in production for about three years now, powering workloads for customers who really do query at that scale. It has its foibles like all of these systems (the lack of non-int id fields in the 1.x line is maddening and has forced a bunch of extra engineering on our side to integrate with our other systems), but it has held up pretty well in our experience.
(I can't speak to Milvus 2.x as we are probably not going to upgrade to that for a number of non-performance reasons)
So just use their base model and fine-tune with a non-restrictive dataset (e.g. the databricks-dolly-15k instruction set behind Dolly 2.0)? You can get a decent LoRA fine-tune done in a day or so on consumer GPU hardware, I would imagine.
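For the curious, a minimal sketch of what that kind of run looks like with the Hugging Face peft/transformers stack. The base model, prompt format, and hyperparameters here are illustrative assumptions (and 8-bit loading needs bitsandbytes), not a tested recipe:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "EleutherAI/pythia-2.8b"  # stand-in: any permissively licensed base

tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token

# 8-bit loading (via bitsandbytes) is what makes this fit on a consumer GPU.
model = AutoModelForCausalLM.from_pretrained(
    base, load_in_8bit=True, device_map="auto")
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["query_key_value"]))  # attention proj in GPT-NeoX models

# databricks-dolly-15k is the instruction dataset Dolly 2.0 was tuned on.
ds = load_dataset("databricks/databricks-dolly-15k", split="train")
def fmt(ex):
    return tok(f"### Instruction:\n{ex['instruction']}\n"
               f"### Response:\n{ex['response']}",
               truncation=True, max_length=512)
ds = ds.map(fmt, remove_columns=ds.column_names)

Trainer(
    model=model,
    train_dataset=ds,
    args=TrainingArguments("dolly-lora", per_device_train_batch_size=4,
                           gradient_accumulation_steps=4, num_train_epochs=1,
                           learning_rate=2e-4, fp16=True, logging_steps=50),
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```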
The point here is that you can use their bases in place of LLaMA and not have to jump through the hoops, so the fine-tuned models are really just there for a bit of flash…
I once travelled with a 5kg vat of fondant icing on a transatlantic flight. "Yes, it looks very much like Semtex, but it's fine!" Still not exactly sure how I got away with it…
It really does give you the best of both worlds: resilient to typos, handling synonyms without all the usual hand-written rules, yet still able to serve direct lookups like ISBNs.
(disclaimer: I work on Semantic Search at Lucidworks)
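A self-contained toy illustrating the point: fuse a lexical ranking (which nails exact matches like ISBNs) with a vector ranking (which tolerates typos and synonyms) via reciprocal rank fusion. This is a generic sketch with made-up doc ids, not Lucidworks' actual implementation:

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of doc ids."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# The lexical list nails the literal ISBN query; the vector list surfaces
# the synonym-ish matches; fusion keeps both strengths.
lexical = ["isbn-9780262046305", "doc-12"]
vector  = ["doc-7", "doc-12", "isbn-9780262046305"]
print(rrf([lexical, vector]))
```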
If you control the HNSW implementation, it can definitely do pre-filtering. Vespa does it, and you can modify open source HNSW libs easily. I added pre-filtering support to an internal fork of HNSWLIB last week, for example…
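For anyone wanting to try this without maintaining a fork: stock hnswlib releases from 0.7.0 on expose a filter callback on knn_query, which evaluates the predicate during graph traversal so rejected points never occupy a top-k slot. A minimal sketch (the tenant predicate is a made-up example):

```python
import hnswlib
import numpy as np

dim, n = 16, 1000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

# Pretend even-numbered labels belong to the tenant we're allowed to see.
allowed = lambda label: label % 2 == 0

# Python-level filters should run single-threaded (per the hnswlib docs).
labels, dists = index.knn_query(data[:1], k=5, num_threads=1, filter=allowed)
assert all(l % 2 == 0 for l in labels[0])  # only pre-filtered points returned
```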