
Question:

- Is it possible to handle data larger than fits into RAM?

- Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )



There is a PR from me (Daniël, committer) for db-benchmark. For the group-by benchmarks, on my machine, it is currently somewhat slower than the fastest entry (Polars).

https://github.com/h2oai/db-benchmark/pull/182

We also support running TPC-H benchmarks. The queries we can run already finish faster than Spark's. We plan to do more benchmarking and optimization in the future.


> it is currently somewhat slower than the fastest (Polars).

That's really promising, considering how fast Polars is. Both are written in Rust and use Apache Arrow, so they can even co-exist in the same context.


Yes, that's pretty exciting! Polars even has support for executing dataframe computations in DataFusion (since it can handle larger-than-memory datasets).


> - Is it possible to handle data larger than fits into RAM?

Not at the moment, but the community plans to add support for spilling to disk.

> - Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )

One of the committers, Daniël, is working on an h2oai db-benchmark PR for DataFusion :)


There is experimental support for distributed query execution with spill-to-disk between stages to support larger-than-memory datasets. This is implemented in the Ballista crate, which extends DataFusion.

https://github.com/apache/arrow-datafusion/tree/master/balli...



