As far as I understand, if this continues in the direction it's going, the inten...

andygrove · on Aug 25, 2021

The Apache Spark project is many, many years ahead of DataFusion & Ballista, with more than a decade of work from more than 1,700 contributors, and is still going strong.

I don't see DataFusion as a competitor to Spark since it is specifically designed as an embedded library and is optimized for in-memory processing with low overhead.

Ballista is highly influenced by Spark and is capable of running some of the same queries that Spark can support. There is enough functionality to run a subset of the TPC-H benchmarks, for example, with reasonable performance at scale. So for users wanting to run those kinds of SQL queries, maybe Ballista isn't so far off, but Spark has much more functionality than this, and it could potentially take years of effort from a community to catch up with Spark. It will be interesting to see what happens, for sure.

peytoncasper · on Aug 24, 2021

I might be wrong on this, but I don't believe this is a replacement for Spark. Rather, it is similar to the Spark SQL execution engine.

I don't believe there is any focus on providing a distributed execution environment. Rather, platforms like Spark and Flink could integrate DataFusion as an implementation and expose the API for Apache Arrow operations.

houqp · on Aug 25, 2021

DataFusion, and Ballista by definition, also provide a DataFrame API that lets you construct queries programmatically. They also have preliminary support for UDFs.

We also have community members implementing Spark native executors using DataFusion, which showed significant speed improvements in the initial PoC.