
Question:

- Is it possible to handle data larger than fits into RAM?

- Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )



There is a PR from me (Daniël, committer) for db-benchmark. For the group-by benchmarks, on my machine, it is currently somewhat slower than the fastest entry (Polars).

https://github.com/h2oai/db-benchmark/pull/182

We also support running TPC-H benchmarks. The queries we can run already finish faster than Spark's. We plan to do more benchmarking and optimization in the future.


> it is currently somewhat slower than the fastest (Polars).

That's really promising, considering how fast Polars is. Both are written in Rust and use Apache Arrow, so they can even co-exist in the same context.


Yes, that's pretty exciting! Polars even has support for executing dataframe computations in DataFusion (since it can handle larger-than-memory datasets).


> - Is it possible to handle data larger than fits into RAM?

Not at the moment, but the community plans to add support for spilling to disk.

> - Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )

One of the committers, Daniël, is working on an h2oai db-benchmark PR for DataFusion :)


There is experimental support for distributed query execution with spill-to-disk between stages to support larger-than-memory datasets. This is implemented in the Ballista crate, which extends DataFusion.

https://github.com/apache/arrow-datafusion/tree/master/balli...



