
One of the Arrow Datafusion committers here. Happy to help answer any questions.


I've been following Arrow and Datafusion dev for a little bit, mostly because the architecture and goals look interesting.

What I'd be curious about is one of the possible use cases mentioned in the Readme: ETL processes. I have yet to come across any projects that are building ETL/ELT/pipeline tools that leverage Datafusion. Might not be looking in the right places.

Would anyone have insight into whether this is simply unexplored territory, or just not as good of a fit as other use cases?


Disclosure: I am a contributor to Datafusion.

I have done a lot of work in the ETL space in Apache Spark to build Arc (https://arc.tripl.ai/) and have ported a lot of the basic functionality of Arc to Datafusion as a proof-of-concept. The appeal to me of the Apache Spark and Datafusion engines is the ability to a) separate compute and storage and b) express transformation logic in SQL.

Performance: From those early experiments, Datafusion would frequently finish processing an entire job _before_ the SparkContext could be started - even on a local Spark instance. Obviously this is at smaller data sizes, but in my experience a lot of ETL is about repeatable processes, not necessarily huge datasets.

Compatibility: Those experiments were done a few months ago and the SQL compatibility of the Datafusion engine has improved extremely rapidly (WINDOW functions were recently added). There is still some missing SQL functionality (for example to run all the TPC-H queries https://github.com/apache/arrow-datafusion/tree/master/bench...) but it is moving quickly.
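
To give a rough idea of what an SQL-driven step looks like, here is a minimal sketch - not the actual Arc port; the file path, table and column names are made up, and it assumes the DataFusion Rust API as it stands (ExecutionContext with a synchronous sql() call), which has been shifting between releases:

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        // Register a CSV source. Path and column names are placeholders
        // for whatever the pipeline actually reads.
        let mut ctx = ExecutionContext::new();
        ctx.register_csv("orders", "data/orders.csv", CsvReadOptions::new())?;

        // The transformation itself is plain SQL, so the logic stays
        // portable across engines (Spark today, DataFusion tomorrow).
        let df = ctx.sql(
            "SELECT customer_id, SUM(amount) AS total
             FROM orders
             GROUP BY customer_id",
        )?;

        // Executes in parallel across partitions; a real job would write
        // the resulting Arrow batches back out (e.g. as Parquet) rather
        // than just counting rows.
        let batches = df.collect().await?;
        let rows: usize = batches.iter().map(|b| b.num_rows()).sum();
        println!("aggregated {} rows", rows);
        Ok(())
    }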


Oh hey, thanks for the info!

I spent some time evaluating Arc for my team's ETL purposes and I was really impressed. I hesitated somewhat to move forward with it because it seemed really tied into the Spark ecosystem (for great reasons). We just weren't at all familiar with deploying and operating Spark, so ended up rolling our own scripts on top of (an existing) Airflow cluster for now.

Besides performance reasons, are there any other advantages to porting Arc to run on top of datafusion? If the porting effort was shared somewhere I'd love to dig in and see what the proof-of-concept looks like.


Hi eduren. Give me a few days and I'll see what I can publish as a WIP repo. The aim of Arc was always to allow swapping the execution engine whilst retaining the logic - hence SQL - so this should hopefully be easy.


Rust stuff tends to be a bit more resource efficient than Java.

We're currently using DataFusion from Rust, and being more resource efficient means we can use smaller machines, which means our costs go down. Deploying services is also faster (smaller Docker images, faster startup times) and puts less extraneous load on our machines.

I imagine Arc, and thus downstream users, would see similar benefits.


ETL pipeline is a perfect fit for Datafusion and its distributed version Ballista. Personally, this is the main reason I am investing my time into Datafusion.


What is "DataFusion"?

- not in the FAQ ( https://arrow.apache.org/faq/ )

- not on the Release page.


OK, I have found it: https://github.com/apache/arrow-datafusion

"DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.DataFusion supports both an SQL and a DataFrame API for building logical query plans as well as a query optimizer and execution engine capable of parallel execution against partitioned data sources (CSV and Parquet) using threads. DataFusion also supports distributed query execution via the Ballista crate."

"Use Cases: DataFusion is used to create modern, fast and efficient data pipelines, ETL processes, and database systems, which need the performance of Rust and Apache Arrow and want to provide their users the convenience of an SQL interface or a DataFrame API."


You beat me to it, I was about to post the GitHub link :) The readme is a good starting place to learn more about the project.


Questions:

- Is it possible to handle data larger than fits into RAM?

- Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )


There is a PR from me (Daniël, committer) for db-benchmark. For the group-by benchmarks, on my machine, it is currently somewhat slower than the fastest entry (Polars).

https://github.com/h2oai/db-benchmark/pull/182

We also support running the TPC-H benchmarks. The queries we can run already finish faster than Spark. We are planning to do more benchmarking and optimization in the future.


> it is currently somewhat slower than the fastest (Polars).

That's really promising, considering how fast Polars is. Both are written in Rust and use Apache Arrow, so they can even co-exist in the same context.


Yes, that's pretty exciting! There is even support in Polars to execute the dataframe compute in DataFusion (as it can handle larger than memory datasets).


> - Is it possible to handle data larger than fits into RAM?

Not at the moment, but the community has plans to add support for disk spill.

> - Any benchmark? like: https://h2oai.github.io/db-benchmark/ ( see 50GB + Join -> "timeout" | "out of memory" )

One of the committers, Daniël, is working on an h2oai db-benchmark PR for Datafusion :)


There is experimental support for distributed query execution with spill-to-disk between stages to support larger than memory datasets. This is implemented in the Ballista crate, which extends DataFusion.

https://github.com/apache/arrow-datafusion/tree/master/balli...


Hi, how does this compare to Vaex? https://vaex.io/docs/tutorial.html

Vaex can join larger-than-memory datasets, but it doesn't support SQL per se; it uses its own lazy dataframe DSL.


I didn't dive into Vaex's implementation, but based on the example code, I would say they are similar in the sense that both provide a Dataframe interface for end users to perform compute on relational data.

It looks like Vaex focuses more on end users like data scientists, while Datafusion focuses more on being a composable embedded library for building analytical engines. For example, InfluxDB IOx, Ballista and ROAPI all use Datafusion as their compute engine.

On top of that, Datafusion also comes with a built-in SQL planner, so users can choose between the Dataframe and SQL interfaces.
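
To show the Dataframe side, here is a minimal sketch of the same kind of query built programmatically instead of as SQL (the path and column names are made up, and the exact method names have been shifting a bit between releases):

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let mut ctx = ExecutionContext::new();

        // Build the query with expressions instead of writing SQL.
        let df = ctx
            .read_csv("data/orders.csv", CsvReadOptions::new())?
            .filter(col("amount").gt(lit(0.0)))?
            .aggregate(vec![col("customer_id")], vec![sum(col("amount"))])?;

        // Same engine underneath, so the plan is optimized and executed
        // exactly like the equivalent SQL query.
        let batches = df.collect().await?;
        println!("got {} record batches", batches.len());
        Ok(())
    }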


As far as I understand, if this continues in the direction it's going, the intention is to retire Apache Spark, correct?

If correct, how far off are we (ball park), in terms of supporting the functionality, and ability to execute in a distributed fashion?


The Apache Spark project is many, many years ahead of DataFusion & Ballista, with more than a decade of work from more than 1,700 contributors, and it is going strong.

I don't see DataFusion as a competitor to Spark since it is specifically designed as an embedded library and is optimized for in-memory processing with low overhead.

Ballista is highly influenced by Spark and is capable of running some of the same queries that Spark supports. There is enough functionality to run a subset of the TPC-H benchmarks, for example, with reasonable performance at scale. So for users wanting to run those kinds of SQL queries, maybe Ballista isn't so far off, but Spark has much more functionality than this, and it could take years of effort from a community to catch up. It will be interesting to see what happens, for sure.


I might be wrong on this, but I don't believe this is a replacement for Spark. Rather this is similar to the Spark SQL execution engine.

I don't believe there is any focus on providing a distributed execution environment; rather, platforms like Spark and Flink could integrate DataFusion as an implementation and expose the API for Apache Arrow operations.


Datafusion, and by extension Ballista, also provides a Dataframe API that lets you construct queries programmatically. It also has preliminary support for UDFs.
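
To sketch roughly what a scalar UDF looks like - hedging a bit here, since the create_udf / make_scalar_function helpers have changed their exact signatures and import paths between releases, and the "square" function itself is just a made-up example:

    use std::sync::Arc;

    use datafusion::arrow::array::{ArrayRef, Float64Array};
    use datafusion::arrow::datatypes::DataType;
    use datafusion::physical_plan::functions::make_scalar_function;
    use datafusion::prelude::*;

    fn register_square_udf(ctx: &mut ExecutionContext) {
        // The kernel works on whole Arrow arrays, one record batch at a time.
        let square = |args: &[ArrayRef]| {
            let values = args[0]
                .as_any()
                .downcast_ref::<Float64Array>()
                .expect("expected a Float64 column");
            let result: Float64Array =
                values.iter().map(|v| v.map(|x| x * x)).collect();
            Ok(Arc::new(result) as ArrayRef)
        };

        // Declare the signature so the planner can call it from SQL
        // ("SELECT square(x) FROM t") or from the Dataframe API.
        let udf = create_udf(
            "square",
            vec![DataType::Float64],
            Arc::new(DataType::Float64),
            make_scalar_function(square),
        );
        ctx.register_udf(udf);
    }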

We also have community members implementing Spark native executors using Datafusion, which showed significant speed improvements in the initial PoC.


Hi. Is it possible to use Datafusion remotely, like a query service? Perhaps using Arrow Flight? I would like to query data with different clients: Python in Jupyter, straight to the browser, and perhaps even something like Nu shell. This way each tool won’t need to open its own copy of the Arrow/Parquet data.


Yes. The Ballista crate (part of the arrow-datafusion repo) provides distributed query execution and the scheduler has a gRPC service. Flight is used internally as well but not directly exposed to users. There is also work in progress to add Python bindings for Ballista (they already exist for DataFusion).


Thank you. I went through its GitHub repo for docs. It seems I need to dig a bit deeper perhaps. How to get started with my Parquet files isn’t immediately obvious.

I assume Python bindings would talk through gRPC. I could use gRPC directly perhaps?


The best "Getting Started" documentation right now is that on docs.rs - https://docs.rs/ballista/0.5.0/ballista/

This demonstrates using the Rust client (BallistaContext + DataFrame). There are already Python bindings for DataFrame but not BallistaContext yet.

Documentation for Ballista is severely lacking right now and this will be an area of focus for the next release.


Thanks. I’m experimenting with Rust currently, so this might fit the bill. I am curious, though, why the client needs to use async Rust. I hadn’t gotten that far in my learning. I would have guessed that a synchronous approach would work as well.


Are there any good resources on using DataFusion in Python beyond the README [1]?

[1] https://github.com/apache/arrow-datafusion/tree/master/pytho...


Is this the best current view of a roadmap for Datafusion? https://www.mail-archive.com/dev@arrow.apache.org/msg23576.h...


<DataFusion committer here>

I do think that is the best current view of a RoadMap and Vision -- it would be great to flesh it out a bit more.

In fact, I'll make a note to try and add some more high-level context about our goals into the project.


If the vision is to pseudo-copy the best bits of Postgres, I'd be very interested in seeing features that tackle PostGIS-type spatial problems. Native spatial work that actually scales to handle global-level data on a single node still feels like a pipe dream a lot of the time. Adding things like Dask or xarray feels like a hack on imperfect base layers just to get a base system to be barely operational.



