I've just started exploring adding OpenTelemetry support to the Comet subproject of DataFusion. I'm excited to see the integration with Apache Arrow (Rust) and potentially DataFusion in the future.
I contributed to the NVIDIA Spark RAPIDS project for ~4 years and for the past year have been contributing to DataFusion Comet, so I have some experience in Spark acceleration and I have some questions!
1. Given the momentum behind the existing OSS Spark accelerators (Spark RAPIDS, Gluten + Velox, DataFusion Comet), have you considered collaborating with and/or extending these projects? All of them are multi-year efforts with dedicated teams, and Spark RAPIDS is leveraging GPUs already.
2. You mentioned that "we're fully compatible with Spark SQL (and Spark)", and that is very impressive if true. None of the existing accelerators claim this. Spark compatibility is notoriously difficult for accelerators built with non-JVM languages and on alternate hardware architectures. You have to deal with different floating-point implementations and regex engines, for example.
Also, Spark has some pretty quirky behavior. Do you match Spark when casting the string "T2" to a timestamp, for example? Spark compatibility has been pretty much the bulk of the work in my experience so far.
Providing acceleration while guaranteeing the same behavior as Spark is difficult, and the existing accelerators provide many configuration options that let users choose between performance and compatibility. I'm curious to hear your take on this topic and where your focus lies on performance vs. compatibility.
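To make the "T2" example concrete, here is a tiny sketch (in Rust, just as a stand-in test harness) of the kind of cast edge-case suite an accelerator ends up maintaining. The notes about Spark's behavior are from memory and should be verified against the specific Spark version, ANSI mode, and session time zone.

```rust
// Hypothetical edge cases for Spark's string-to-timestamp CAST.
// The expected behaviors in the comments are illustrative only.
fn main() {
    let edge_cases = [
        ("T2", "time-only form; Spark may resolve it to 02:00 on the current date"),
        ("2020-01-01T12:34:56.123456789", "extra fractional digits may be truncated to microseconds"),
        (" 2020-01-01\t", "leading/trailing whitespace is typically trimmed"),
        ("2020-13-01", "invalid month: NULL in non-ANSI mode rather than an error"),
    ];
    for (input, note) in edge_cases {
        // In a real suite, each input would be run through both Spark and the
        // native engine and the results compared byte for byte.
        println!("CAST('{}' AS TIMESTAMP) -- {}", input.escape_debug(), note);
    }
}
```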
1. Yes! Would love to contribute back to these projects, since I am already using RAPIDS under the hood. My general goal is to bring GPU acceleration to more workloads. Though, as a solo founder, I am finding it difficult to find any time for this at the moment, haha.
2. Hmm, maybe I should mention that we're not "accelerating all operations" -- merely compatible. Spark-RAPIDS has the goal of being byte-for-byte compatible unless incompatible ops are specifically allowed. But... you might be right about that kind of quirk. Would not be surprising, and reminds me of checking behavior between compilers.
I'd say the default should be a focus on compatibility, and work through any extra perf stuff with our customers. Maybe a good quick way to contribute back to open source is to first upstream some tests?
Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.
The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.
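To spell out that fallback idea, here is a rough illustration (not Comet's actual code, which lives in its JVM-side planner and translates Spark's physical plan):

```rust
// Sketch of the pragmatic approach: translate the operators the native engine
// supports, and leave everything else running as normal Spark operators so
// the query still completes end to end. Operator names here are illustrative.
const NATIVE_SUPPORTED: &[&str] = &["FileScan", "Filter", "Project", "HashAggregate"];

fn plan_operator(op: &str) -> String {
    if NATIVE_SUPPORTED.contains(&op) {
        format!("{op}: run natively")
    } else {
        format!("{op}: fall back to Spark")
    }
}

fn main() {
    for op in ["FileScan", "Filter", "SortMergeJoin", "PythonUDF"] {
        println!("{}", plan_operator(op));
    }
}
```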
One of the challenges is that most Spark users don't care if you 2x performance.
We are in the enterprise with large cloud budgets and can simply change instance types. If you're 20x faster, then that is a different story, but then (a) you need to have feature parity and (b) you need support from the cloud vendors, which Spark has.
For the longest time, searching for Ballista linked to its old archived repo, which didn't even have a link to the new repo; there was no search result for the new repo at all. This misled people into thinking that Ballista was a dead project when it wasn't, and it wasted so much opportunity.
I don't think it's fair to say that Ballista failed in any way. It just needs substantial effort to bring it on par with Spark. The performance benefits are meaningful. Ballista could then not only take the crown from Spark, but also revalidate Rust as a language for this kind of work.
I do see a new opportunity for Ballista. By leveraging all of the Spark-compatible operators and expressions being built in Comet, it would be able to support a wider range of queries much more quickly.
Ballista already uses protobuf for sending plans to executors and Comet accepts protobuf plans (in a similar, but different format).
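For the curious, here is a minimal sketch of what that looks like with the datafusion-proto crate, round-tripping a DataFusion logical plan through protobuf bytes. Exact function names shift a bit between DataFusion versions, and it assumes an example.csv with a column named a.

```rust
use datafusion::prelude::*;
use datafusion_proto::bytes::{logical_plan_from_bytes, logical_plan_to_bytes};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("example", "example.csv", CsvReadOptions::new()).await?;

    let plan = ctx
        .sql("SELECT a, count(*) FROM example GROUP BY a")
        .await?
        .into_optimized_plan()?;

    // Serialize the plan to protobuf bytes, as a scheduler would before
    // shipping it to an executor...
    let bytes = logical_plan_to_bytes(&plan)?;

    // ...and reconstruct it on the receiving side.
    let roundtrip = logical_plan_from_bytes(&bytes, &ctx)?;
    assert_eq!(format!("{plan:?}"), format!("{roundtrip:?}"));
    Ok(())
}
```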
There seems to be a history of data technologies requiring a serious corporate sponsor. Arrow gets so much dev and marketing effort from Voltron, Spark from Databricks, etc. Did Ballista have anything similar? I loved the project, but it never seemed to move very fast on integrating with other tools and platforms.
Original author of DataFusion/Ballista here. Having alamb and others from InfluxData involved has been a huge help in driving the project forward and helping build an active community behind the project. It is genuinely hard to keep up with the momentum these days!
Hi, I just had a glance over the DataFusion project. Very interesting work, which I will definitely be keeping track of, but I've got a genuine question: do you sometimes find development in Rust a bit challenging for large-scale, performance-sensitive work?
I ask because I've noticed more than a few PRs fixing (large) performance regressions which, to my understanding, were mostly introduced by unforeseen or unexpected Rust compiler subtleties that led to less-than-optimal code generation. One example was a naive, simple-looking abstraction that was introduced and brought performance down by something like 50% on the TPC-H benchmarks. This struck me, especially because it seems quite hard to identify the root cause of such regressions, and I would like to hear about the experience first-hand. Thanks a bunch!
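For example, the kind of thing I have in mind (purely illustrative, not the actual change behind that regression): moving a per-value closure behind dynamic dispatch in a hot kernel blocks inlining, and usually auto-vectorization, and the cost only shows up in end-to-end benchmarks like TPC-H.

```rust
// Two versions of the same hot loop. The generic version is monomorphized and
// the closure is inlined; the dyn version pays an indirect call per element.
fn sum_generic(values: &[f64], f: impl Fn(f64) -> f64) -> f64 {
    values.iter().map(|&v| f(v)).sum()
}

fn sum_dyn(values: &[f64], f: &dyn Fn(f64) -> f64) -> f64 {
    values.iter().map(|&v| f(v)).sum()
}

fn main() {
    let values: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();
    let a = sum_generic(&values, |v| v * 2.0 + 1.0);
    let b = sum_dyn(&values, &|v| v * 2.0 + 1.0);
    assert_eq!(a, b); // same result, very different generated code
}
```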
I think it is worth pointing out that this tool does support querying Delta Lake (the author of ROAPI is also a major contributor to the native Rust implementation of Delta Lake). Delta Lake certainly supports transactions, so ROAPI can query transactional data, although the writes would not go through ROAPI.
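ROAPI's Delta support comes from that same delta-rs library. As a rough sketch of what the read path looks like if you wire it up yourself with the deltalake and datafusion crates (assuming the deltalake crate's datafusion feature is enabled; the exact API moves around between versions):

```rust
use std::sync::Arc;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open an existing Delta table (this reads the transaction log to find
    // the current snapshot) and expose it to DataFusion as a table provider.
    let table = deltalake::open_table("./data/my_delta_table").await?;
    let ctx = SessionContext::new();
    ctx.register_table("events", Arc::new(table))?;

    // Read-only SQL over the transactional table; writes would still go
    // through a Delta writer, not through this query path.
    ctx.sql("SELECT count(*) FROM events").await?.show().await?;
    Ok(())
}
```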
FWIW, my dealer is not adding any markup. Here is the email I received from them explaining the process:
"We finally received an update from Ford regarding the Lightning! This will be a completely unique process from Ford and they will send a detailed guideline in January, however I wanted to let you know in advance that things are in motion.
"This process is going to be based on invitation to convert as demand has outweighed production capability by far. Due to demand, not all reservation holders will receive an invitation to place an order for a 2022 model year. Ford will begin inviting reservation holders to place orders in waves starting in January. Subsequent waves will receive an invitation in two-week intervals until 2022 model year capacity is reached. Invitations are based on reservation timing, and our estimated allocation. You will be directed to an online configurator (Build and Price tool) where you will then be able to spec out your order and then submit your finalized order. At that time, we will provide you with complete pricing including MSRP, taxes, and dealer handling fee. We are not charging any dealer markup above MSRP. Once the 22MY production capacity is met, all remaining reservation holders will be notified that their next ordering opportunity will be for the next model year."
Sounds like you have a local dealer that's transparent and (relatively!) honest. From other posts it seems that Ford can't necessarily set the prices that dealers charge... but what they can do is let dealers know that their allocations are at risk. In other words -- Ford can separately survey invitation recipients and find out what prices they paid, and excessive markups can then be used as grounds to reduce those dealers' allocations.
The Apache Spark project is many, many years ahead of DataFusion and Ballista, with more than a decade of work from more than 1,700 contributors, and it is still going strong.
I don't see DataFusion as a competitor to Spark since it is specifically designed as an embedded library and is optimized for in-memory processing with low overhead.
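For context, "embedded" means the whole engine is a library call away inside your own process. A minimal sketch (it assumes a data.parquet file with category and amount columns):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Everything runs inside this one process: no cluster, no daemons to
    // manage, just a library embedded in the host application.
    let ctx = SessionContext::new();
    let df = ctx
        .read_parquet("data.parquet", ParquetReadOptions::default())
        .await?
        .filter(col("amount").gt(lit(100)))?
        .select(vec![col("category"), col("amount")])?
        .limit(0, Some(10))?;
    df.show().await?;
    Ok(())
}
```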
Ballista is highly influenced by Spark and is capable of running some of the same queries that Spark supports. There is enough functionality to run a subset of the TPC-H benchmarks, for example, with reasonable performance at scale. So for users wanting to run those kinds of SQL queries, maybe Ballista isn't so far off, but Spark has much more functionality than this, and it could take years of effort from a community to catch up with Spark. It will be interesting to see what happens, for sure.
Ballista started out as a separate project and was donated to the Apache Arrow project in April 2021. The two projects currently share a release schedule (but have different version numbers), and this was the first DataFusion release to include the Ballista crate.
My hope is that Ballista and DataFusion become more integrated over time but remain separate, with DataFusion being an embedded / single-process query engine and Ballista providing distributed execution.