Joining 130 million points – Spatialists

Back in December 2024, Dewey Dunnington (spatially) joined 130,000,000 building centroid points from the Microsoft U.S. buildings dataset to 33,000 postal code areas. The idea was to benchmark various tools: QGIS, PostGIS, GeoPandas¹ (using two different approaches), DuckDB Geography ², and Apache Sedona. That 2024 article is interesting in itself.

Fast forward to the present day and a new article: Dewey now works at Wherobots and has been building SedonaDB ³ - which is “a tool purpose-built to make exactly that sort of wrangling and joining effortless” (Dewey’s characterisation) (and which is different from Apache Sedona!). In the current article, Dewey re-ran the 2024 analysis but using new approaches: SedonaDB and DuckDB Spatial (which is different from the afore-mentioned DuckDB Geography!):

I checked the summary on my initial post, and it looks like the fastest version I was able to come up with was somewhere around 3-5 minutes with quite a bit of explicit wrangling/tuning and 50 GB of memory involved.

If you have the files locally (as I did in the original post), SedonaDB 0.3.0 executes the join in 6 seconds on my laptop (Apple M4, 24G RAM) without any tuning or wrangling.

For DuckDB Spatial, “the ST_Contains() version of the query takes ~1m 30s; however, the ST_DWithin(..., 0) approximation is much faster (13 seconds).”

So a 30x to 50x speed-up within a bit more than a year. If you (have to) care about fast and efficient data pipelines, I’ve got to agree with Dewey’s conclusion: “It’s an exciting time to be working in spatial!”

Footnotes

GeoPandas is an extension to the popular data science library Pandas that enables support for geospatial data.↩︎
DuckDB is an open-source column-oriented Relational Database Management System (RDBMS) designed for analytics use-cases.↩︎
SedonaDB is an open-source single-node analytical database engine with geospatial as a first-class citizen.↩︎