MrPowers · 2025-09-24T21:29:16 1758749356

Rust is a good language for performant computing in general, but especially for data projects because there are so many great OSS data libraries like DataFusion and Arrow.

SedonaDB currently supports SQL, Python, R, and Rust APIs. We can support APIs for other languages in the future. That's another nice part about Rust: there are lots of libraries for exposing Rust projects through bindings in other languages.

MrPowers · 2025-09-24T21:23:39 1758749019

You can generate the dataset with the instructions in this readme: https://github.com/apache/sedona-spatialbench/tree/main

Here are the queries: https://github.com/apache/sedona-spatialbench/blob/main/prin...

They should be fairly easy to replicate!

MrPowers · 2025-09-24T17:22:12 1758734532

The "DuckDB is probably the most important geospatial software of the last decade" post has a nice related discussion: https://news.ycombinator.com/item?id=43881468

MrPowers · 2025-09-24T16:55:39 1758732939

There is a project called GeoPolars: https://github.com/geopolars/geopolars

From the README:

> Update (August 2024): GeoPolars is blocked on Polars supporting Arrow extension types, which would allow GeoPolars to persist geometry type information and coordinate reference system (CRS) metadata. It's not feasible to create a geopolars.GeoDataFrame as a subclass of a polars.DataFrame (similar to how the geopandas.GeoDataFrame is a subclass of pandas.DataFrame) because polars explicitly does not support subclassing of core data types.

orlp · 2025-09-24T20:22:03 1758745323

I'm working on implementing extension types in Polars. Stay tuned.

MrPowers · 2025-09-24T16:50:01 1758732601

SedonaDB builds on libraries in the Rust ecosystem, like Apache DataFusion, to provide users with a nice geospatial DataFrame experience. It has functions like ST_Intersects that are common in spatial libraries, but not standard in most DataFrame implementations.
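For anyone who hasn't used a spatial predicate before, here is a rough sketch of what a function like ST_Intersects does in a spatial join. This is plain Python and not the SedonaDB API: the `Box` class, `intersects`, `spatial_join`, and the sample data are all made up for illustration, and real engines work on full geometries with spatial indexes rather than a nested loop over rectangles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle standing in for a real geometry (illustrative only)."""
    name: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: Box, b: Box) -> bool:
    # Two boxes intersect when their extents overlap on both axes.
    # Touching edges count as intersecting in this sketch.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def spatial_join(left, right):
    # Naive nested-loop join: keep every pair whose shapes intersect.
    return [(l.name, r.name) for l in left for r in right if intersects(l, r)]


# Hypothetical sample data.
parks = [Box("park_a", 0, 0, 2, 2), Box("park_b", 5, 5, 6, 6)]
zones = [Box("zone_1", 1, 1, 3, 3), Box("zone_2", 10, 10, 12, 12)]

pairs = spatial_join(parks, zones)
print(pairs)  # [('park_a', 'zone_1')]
```

A regular DataFrame join matches rows on equal keys; a spatial join matches rows on a geometric relationship like the one above, which is why these functions are not standard in most DataFrame implementations.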

There are other good alternatives, such as GeoPandas and DuckDB Spatial. SedonaDB has Python/SQL APIs and is very fast. New features like full raster support and compatibility with lakehouse formats are coming soon!

MrPowers · on June 19, 2024

IMO, it would have been better to donate the repos to a shared org and motivate the community to continue maintaining them.

But pretty awesome this individual is retiring from programming / taking a sabbatical. There is nothing wrong with taking some time off and pursuing other interests when you lose your passion.

MrPowers · on June 18, 2024

> A Data Lakehouse is fine but what benefit does it give you over a much more simple solution of ETL/ELTing the data in batches (weekly, daily, hourly, etc) and letting it sit in some kind of DB.

Lots of engines like Polars, PyTorch, Spark, and Ray can read structured data from databases, but Lakehouses are more efficient.

Databases aren't as good for storing unstructured data.

Databases can also be much more expensive than a Data Lakehouse.

Databases are awesome and have lots of amazing use cases of course. Like you mentioned, data lakehouses are great for high data volume and throughput, but there are other use cases as well IMO.

MrPowers · on May 31, 2024

Lots of Spark workloads are executed with the C++ Photon engine on the Databricks platform, so we ironically have partially moved back to C++. Disclosure: I work for Databricks.

OutOfHere · on May 31, 2024

The continued use of C++ is not exactly something to be proud of, although in this case at least it presumably is for short-running jobs, not for long-running services that accumulate leaks.

_bohm · on May 31, 2024

There is a ton of reliable load-bearing software out there written in C++. I don't think the fact that a piece of software is written in C++ is enough to presume that it has memory leaks.

threeseed · on May 31, 2024

Python would be just another PHP-level language if it wasn't for C++.

It's what powers all of the DE/ML/AI libraries.

MrPowers · on May 31, 2024

The OP is the original creator of Ballista, so he's well aware of the project.

Ballista is much less mature than Spark and needs a lot of work. It's awesome they're making Spark faster with Comet.

andygrove · on May 31, 2024

Yes, Ballista failed to gain traction. I think that one of the challenges was that it only supported a small subset of Spark, and there was too much work involved to try and get to parity with Spark.

The Comet approach is much more pragmatic because we just add support for more operators and expressions over time and fall back to Spark for anything that is not supported yet.
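To make the fallback idea concrete, here is a simplified sketch. It is not Comet's actual code: the operator names, the `NATIVE_SUPPORTED` set, `PlanNode`, and `choose_engine` are hypothetical, and the real planner has more rules than a per-operator lookup.

```python
from dataclasses import dataclass, field

# Hypothetical set of operators the native engine can run today.
NATIVE_SUPPORTED = {"scan", "filter", "project"}


@dataclass
class PlanNode:
    """One operator in a query plan, with its child operators."""
    op: str
    children: list = field(default_factory=list)


def choose_engine(node, supported=NATIVE_SUPPORTED):
    """Return (operator, engine) pairs in pre-order.

    Operators in `supported` run natively; everything else falls
    back to Spark, so the query still runs end to end.
    """
    engine = "native" if node.op in supported else "spark"
    result = [(node.op, engine)]
    for child in node.children:
        result.extend(choose_engine(child, supported))
    return result


# project -> window -> filter -> scan
plan = PlanNode("project", [PlanNode("window", [PlanNode("filter", [PlanNode("scan")])])])

assignments = choose_engine(plan)
print(assignments)
# [('project', 'native'), ('window', 'spark'), ('filter', 'native'), ('scan', 'native')]

# Adding support for one more operator later moves it off the fallback path.
later = choose_engine(plan, NATIVE_SUPPORTED | {"window"})
print(all(engine == "native" for _, engine in later))  # True
```

The point is that coverage can grow one operator at a time without ever breaking a query, which is a lot less work than reaching full parity up front.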

threeseed · on May 31, 2024

One of the challenges is that most Spark users don't care if you 2x performance.

We are in the enterprise with large cloud budgets and can simply change instance types. If you're 20x faster, that is a different story, but then you need (a) feature parity and (b) support from cloud vendors, which Spark has.

OutOfHere · on May 31, 2024

For the longest time, searching for Ballista linked to its old archived repo that didn't even have a link to the new repo. There was no search result for the new repo. This misled people into thinking that Ballista was a dead project, but it wasn't. It wasted so much opportunity.

I don't think it's a fair criticism of Ballista to say that it failed in any way. It just looks to need substantial effort to bring it on par with Spark. The performance benefits are meaningful. Ballista can then not only take the crown from Spark, but also revalidate Rust as a language.

andygrove · on May 31, 2024

I wish I'd known about the search issue.

I do see a new opportunity for Ballista. By leveraging all of the Spark-compatible operators and expressions being built in Comet, it would be able to support a wider range of queries much more quickly.

Ballista already uses protobuf for sending plans to executors and Comet accepts protobuf plans (in a similar, but different format).

OutOfHere · on May 31, 2024

Did Databricks sponsor Comet?

andygrove · on June 1, 2024

spenczar5 · on May 31, 2024

There seems to be a history of data technologies requiring a serious corporate sponsor. Arrow gets so much dev and marketing effort from Voltron, Spark from Databricks, etc. Did Ballista have anything similar? I loved the project but it never seemed to move very fast on integrating with other tools and platforms.

MrPowers · on March 6, 2024

I love Medellin and lived there for many years, but the air quality is terrible and getting worse. You can talk with any locals and they say that the climate is noticeably different than it was in the past.

Medellin is surrounded by mountains and the contaminated air cannot escape. There didn't use to be a lot of cars, but now there is financing, so the number of cars is growing significantly.

The hills are steep and old busses spew black smoke.

Here is some more info on pollution in Medellin: https://medellinguru.com/medellin-pollution/

Saying Medellin's temp decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" is a misinterpretation. I think this article is quite misleading.

jp191919 · on March 6, 2024

Air quality in Bogota is terrible as well.

I think a good first step would be ditching all the diesel vehicles that have minimal/non-existent exhaust emissions systems.

gfarah · on March 6, 2024

Many factories are relocating outside the valley, and the use of electric vehicles (including cars and motorcycles) is increasing.

danlugo92 · on March 9, 2024

> I love Medellin and lived there for many years, but the air quality is terrible and getting worse

The good thing about hill cities such as Medellin (sadly not a format available for big cities in, say, Europe or the US) is that you can choose your altitude. At around 2000 meters (the city starts at ~1500m) the air quality is not so bad. It used to be worse years ago (maybe you lived there 2 or 3 years ago), but now it's much better.

> You can talk with any locals and they say that the climate is noticeably different than it was in the past.

Yeah, the city is much warmer compared to, say, 10 years ago. Whether this is due to the city growing into previously forested areas or /global/ warming I don't know... but yeah, locals agree it was MUCH colder 10 years ago...

> Medellin is surrounded by mountains and the contaminated air cannot escape.

See comments above about living at 2000m altitude (up in the mountains, a bit away from the high-rise buildings and such; think of Beverly Hills or something like that).

> The hills are steep and old busses spew black smoke.

As of now, there are almost no old busses left spewing black smoke, but some cargo trucks are still doing it.

> Saying Medellin's temp decreased by 2 degrees Celsius based on "Mejorar el microclima hasta 2°C" is a misinterpretation. I think this article is quite misleading.

I wouldn't know, but locals do say that it was a much colder city in the past...