Back in December 2024, Dewey Dunnington (spatially) joined 130,000,000 building centroid points from the Microsoft U.S. buildings dataset to 33,000 postal code areas. The idea was to benchmark various tools: QGIS, PostGIS, GeoPandas1 (using two different approaches), DuckDB Geography2, and Apache Sedona. That 2024 article is interesting in itself.
Fast forward to the present day and a new article: Dewey now works at Wherobots and has been building SedonaDB3 - which is “a tool purpose-built to make exactly that sort of wrangling and joining effortless” (Dewey’s characterisation) (and which is different from Apache Sedona!). In the current article, Dewey re-ran the 2024 analysis but using new approaches: SedonaDB and DuckDB Spatial (which is different from the afore-mentioned DuckDB Geography!):
I checked the summary on my initial post, and it looks like the fastest version I was able to come up with was somewhere around 3-5 minutes with quite a bit of explicit wrangling/tuning and 50 GB of memory involved.
If you have the files locally (as I did in the original post), SedonaDB 0.3.0 executes the join in 6 seconds on my laptop (Apple M4, 24G RAM) without any tuning or wrangling.
For DuckDB Spatial, “the ST_Contains() version of the query takes ~1m 30s; however, the ST_DWithin(..., 0) approximation is much faster (13 seconds).”
So a 30x to 50x speed-up within a bit more than a year. If you (have to) care about fast and efficient data pipelines, I’ve got to agree with Dewey’s conclusion: “It’s an exciting time to be working in spatial!”
Footnotes
GeoPandasis an extension to the popular data science libraryPandasthat enables support for geospatial data.↩︎DuckDBis an open-source column-oriented Relational Database Management System (RDBMS) designed for analytics use-cases.↩︎SedonaDBis an open-source single-node analytical database engine with geospatial as a first-class citizen.↩︎