We project the original vectors so that they match the dimensions of one of our optimized kernels. This generally gives us better recall than simple padding, since all bits are utilized.
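Roughly, the idea looks like this (a minimal sketch, not our actual projection: a random-sign, Johnson-Lindenstrauss-style matrix stands in for whatever projection you pick; the point is that every output dimension carries signal instead of being zero-padded):

```rust
/// Project `input` into `target_dim` dimensions so that every output
/// dimension carries signal, instead of zero-padding up to the kernel size.
/// A fixed random-sign matrix stands in for the real projection here;
/// any (approximately) distance-preserving projection works the same way.
fn project(input: &[f32], target_dim: usize, seed: u64) -> Vec<f32> {
    let scale = 1.0 / (target_dim as f32).sqrt();
    let mut out = vec![0.0_f32; target_dim];
    for (j, out_j) in out.iter_mut().enumerate() {
        let mut acc = 0.0_f32;
        for (i, &x) in input.iter().enumerate() {
            // Cheap deterministic hash (splitmix64-style) -> the +-1 entry at (i, j).
            let mut h = seed ^ (((i as u64) << 32) | j as u64);
            h = h.wrapping_mul(0x9E37_79B9_7F4A_7C15);
            h ^= h >> 31;
            let sign = if (h & 1) == 0 { 1.0 } else { -1.0 };
            acc += sign * x;
        }
        *out_j = acc * scale;
    }
    out
}

fn main() {
    // e.g. a 300-dim embedding mapped onto a 256-dim kernel layout
    let original: Vec<f32> = (0..300).map(|i| (i as f32).sin()).collect();
    let projected = project(&original, 256, 42);
    assert_eq!(projected.len(), 256);
    println!("first few projected values: {:?}", &projected[..4]);
}
```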
We actually tried to extend DataFusion at first but ultimately decided against it, since we can get most of the value by using Arrow and its compute kernels directly. DataFusion also executes filters in a way that rebuilds the underlying arrays (a data copy) and requires a strict schema, neither of which is a good fit for our schemaless, document-oriented model. In the end, switching from DataFusion to Reactor gave us 3x better latencies.
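To give a sense of what "using Arrow directly" means, here is a minimal sketch (not our engine, just the arrow-rs crate; the `cmp` kernels assume a reasonably recent arrow version): comparison kernels give you boolean masks over columns, which you can combine and either keep as selection vectors or materialize with the filter kernel, with no query plan or fixed schema in between.

```rust
use arrow::array::{Array, BooleanArray, Float32Array, Scalar, StringArray};
use arrow::compute::kernels::cmp::{eq, gt_eq};
use arrow::compute::{and, filter};
use arrow::error::ArrowError;

fn main() -> Result<(), ArrowError> {
    // Two columns of candidate documents; each is just an Arrow array
    // looked up by field name, no up-front table schema required.
    let scores = Float32Array::from(vec![0.91_f32, 0.42, 0.77, 0.15]);
    let categories = StringArray::from(vec!["docs", "blog", "docs", "docs"]);

    // Comparison kernels produce boolean masks without rewriting the column data.
    let score_ok: BooleanArray =
        gt_eq(&scores, &Scalar::new(Float32Array::from(vec![0.5_f32])))?;
    let category_ok: BooleanArray =
        eq(&categories, &Scalar::new(StringArray::from(vec!["docs"])))?;

    // Combine the predicates; the mask can drive scoring directly,
    // or be materialized with the `filter` kernel when needed.
    let mask = and(&score_ok, &category_ok)?;
    let surviving = filter(&scores, &mask)?;
    println!("{} of {} candidates pass the filter", surviving.len(), scores.len());
    Ok(())
}
```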
Founder of TopK here. There are legit use cases for vector-based retrieval (semantic search, recommendations, multi-modal search, etc.), but those only require supporting vectors as a data type, not building the whole database around vectors as a first-class citizen (which is what vector DBs do). In practice, you also want to combine multiple vectors, text filters, and metadata alongside custom scoring functions to optimize relevance in your domain, which is not possible with a database built around a vector index.
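To make "custom scoring" concrete, here is a purely hypothetical relevance function (not TopK's API, just an illustration of blending several signals that a pure vector index can't express):

```rust
/// Hypothetical candidate record: two vector similarities, a lexical
/// text score, and a piece of metadata used for recency boosting.
struct Candidate {
    title_vec_score: f32,   // similarity against the title embedding
    body_vec_score: f32,    // similarity against the body embedding
    text_score: f32,        // e.g. a BM25-style score over the raw text
    days_since_update: f32, // metadata field
}

/// Domain-specific blend of the signals; the weights are tuning knobs
/// that a database built purely around a vector index cannot express.
fn relevance(c: &Candidate) -> f32 {
    let vector_part = 0.6 * c.title_vec_score + 0.4 * c.body_vec_score;
    let recency_boost = 1.0 / (1.0 + c.days_since_update / 30.0);
    0.7 * vector_part + 0.2 * c.text_score + 0.1 * recency_boost
}

fn main() {
    let c = Candidate {
        title_vec_score: 0.82,
        body_vec_score: 0.64,
        text_score: 0.73,
        days_since_update: 12.0,
    };
    println!("relevance = {:.3}", relevance(&c));
}
```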