Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

As far as I understand, if this continues in the direction it's going, the intention is to retire Apache Spark, correct?

If correct, how far off are we (ball park), in terms of supporting the functionality, and ability to execute in a distributed fashion?



The Apache Spark project is many many years ahead of DataFusion & Ballista with more than a decade of work from more than 1,700 contributors and is going strong.

I don't see DataFusion as a competitor to Spark since it is specifically designed as an embedded library and is optimized for in-memory processing with low overhead.

Ballista is highly influenced by Spark and is capable of running some of the same queries that Spark can support. There is enough functionality to be able to run a subset of the TPC-H benchmarks for example, with reasonable performance at scale. So for users wanting to run those kind of SQL queries, maybe Ballista isn't so far off, but Spark has much more functionality than this and it could potentially take years of effort from a community to try and catch up with Spark. It will be interesting to see what happens for sure.


I might be wrong on this, but I don't believe this is a replacement for Spark. Rather this is similar to the Spark SQL execution engine.

I don't believe there is any focus on providing a distributed execution environment, rather platforms like Spark and Flink could integrate DataFusion as an implementation and expose the API for Apache Arrow operations.


Datafusion, and Ballista by definition, also provides a Dataframe API that let's you construct queries programmatically. It also has preliminary support for UDFs.

We also have community members implementing Spark native executors using Datafusion, which showed significant speed improvements in the initial PoC.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: