There is a PR from me (Daniël, committer) with for db-benchmark. For the group by benchmarks, on my machine, it is currently somewhat slower than the fastest (Polars).
Also we do support running TPC-H benchmarks. For the queries we can run, those are already finishing faster than Spark. We are planning to do more benchmarking and optimizations in the future.
Yes, that's pretty exciting!
There is even support in Polars to execute the dataframe compute in DataFusion (as it can handle larger than memory datasets).
There is experimental support for distributed query execution with spill-to-disk between stages to support larger than memory datasets. This is implemented in the Ballista crate, which extends DataFusion.
- Is it possible to handle data larger than fits into RAM?
- Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )