At The Stack we regularly look at emerging open source projects, including by dipping into the Apache Software Foundation incubator -- the primary entry path into the ASF for new open source projects and codebases. The latest on our list (tackled in, we admit, no particular order) is Apache Sedona. So, what is Apache Sedona?
We asked creator and main architect Dr Mohamed Sarwat, an expert on database systems and spatial analytics who is assistant professor of computer science at Arizona State University, where he is also director of the DataSys lab, and an affiliate member of the Center for Assured and Scalable Data Engineering.
What is Apache Sedona?
Apache Sedona is a cluster computing system for processing large-scale spatial data. Geospatial data comes from various sources, which include GPS traces, IoT sensors, and socioeconomic data.
Sedona provides a set of out-of-the-box Spatial Resilient Distributed Datasets and Dataframes that can efficiently load, process, and analyze large-scale spatial data across a cluster of machines. Sedona provides Scala, SQL, and Python APIs for programming analytics pipelines on top of geospatial data.
What was the genesis of this project? What are the commercial alternatives?
Before Sedona, a data processing system such as Spark could not properly handle geospatial data. The user would have to implement the entire geospatial data processing stack from scratch, and even then the resulting pipeline might not be efficient.
On the other hand, existing GIS tools such as ESRI ArcGIS were not originally designed with scalability and cloud integrability in mind. Instead, such tools were primarily designed for data that can fit on a single machine.
Sedona bridges the gap between big data systems and geospatial tools, e.g., ESRI ArcGIS. In other words, Sedona provides geospatial support for modern scalable data pipelines and scalability to classic GIS tools. To achieve that, Sedona extends the core engine of Spark and SparkSQL to support spatial data types, indexes, and geometrical operations in an easy-to-scale distributed system environment.
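To make the idea of a spatial index concrete, here is a minimal sketch in plain Python of a uniform grid index answering a rectangular range query. It is an illustration only, not Sedona's API or its implementation: the function names, the cell size, and the sample points are all hypothetical, and a uniform grid is just one simple partitioning scheme.

```python
from collections import defaultdict
from math import floor

CELL_SIZE = 1.0  # hypothetical cell width, in the same units as the coordinates


def cell_of(x, y, cell_size=CELL_SIZE):
    """Map a coordinate to the integer grid cell that contains it."""
    return (floor(x / cell_size), floor(y / cell_size))


def build_grid_index(points, cell_size=CELL_SIZE):
    """Group (id, x, y) points by the grid cell they fall into."""
    index = defaultdict(list)
    for pid, x, y in points:
        index[cell_of(x, y, cell_size)].append((pid, x, y))
    return index


def range_query(index, min_x, min_y, max_x, max_y, cell_size=CELL_SIZE):
    """Return sorted ids of points inside the rectangle.

    Only the grid cells that overlap the rectangle are visited, so points
    in far-away cells are never compared against the query.
    """
    lo_cx, lo_cy = cell_of(min_x, min_y, cell_size)
    hi_cx, hi_cy = cell_of(max_x, max_y, cell_size)
    hits = []
    for cx in range(lo_cx, hi_cx + 1):
        for cy in range(lo_cy, hi_cy + 1):
            for pid, x, y in index.get((cx, cy), []):
                if min_x <= x <= max_x and min_y <= y <= max_y:
                    hits.append(pid)
    return sorted(hits)


# Hypothetical sample data: (id, x, y)
points = [("a", 0.5, 0.5), ("b", 2.2, 3.1), ("c", 2.9, 3.8), ("d", -1.5, 4.0)]
index = build_grid_index(points)
result = range_query(index, 2.0, 3.0, 3.0, 4.0)
print(result)
```

In a distributed setting, cells like these could also serve as partitions, so that each machine handles the points in its own cells. That is the general intuition only; the sketch does not describe how Sedona partitions data.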
What is your personal involvement in the project?
I am the creator and main architect of Apache Sedona. The system was formerly named GeoSpark, but we renamed it Sedona when we joined the Apache Software Foundation in summer 2020.
Can you give us an example of real-world use-cases?
Since Sedona is a general-purpose geospatial data processing engine, it has many use cases across industries. For instance, in the real estate industry, users may leverage Sedona to: (1) load home listings and integrate them with census data, school information, and traffic data based on geospatial location; (2) calculate the average school rating per neighborhood; (3) calculate the geospatial autocorrelation between listed home prices and nearby schools; (4) prepare the listings near each customer's location; and (5) perhaps fit a geospatial regression model to predict home prices based on various geospatial factors.
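As a rough illustration of step (2), the sketch below joins schools to neighborhoods with a point-in-polygon test and averages the ratings per neighborhood. It is plain Python on a single machine, not Sedona code: the neighborhood polygons, school records, and function names are all made up for the example.

```python
from collections import defaultdict


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical neighborhoods (name -> polygon vertices)
neighborhoods = {
    "north": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "south": [(0, 0), (10, 0), (10, 5), (0, 5)],
}

# Hypothetical schools: (name, x, y, rating)
schools = [
    ("school_1", 2.0, 7.0, 8.0),
    ("school_2", 6.0, 9.0, 6.0),
    ("school_3", 4.0, 2.0, 9.0),
]


def average_rating_by_neighborhood(schools, neighborhoods):
    """Spatially join schools to neighborhoods, then average the ratings."""
    ratings = defaultdict(list)
    for _name, x, y, rating in schools:
        for hood, polygon in neighborhoods.items():
            if point_in_polygon(x, y, polygon):
                ratings[hood].append(rating)
                break  # assign each school to at most one neighborhood
    return {hood: sum(vals) / len(vals) for hood, vals in ratings.items()}


averages = average_rating_by_neighborhood(schools, neighborhoods)
print(averages)
```

The nested loop compares every school against every neighborhood, which is exactly the kind of work that becomes impractical at scale and that a distributed engine with spatial indexes is meant to avoid.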
In addition, Sedona can be used to process big geospatial data to benefit a plethora of other applications, which include: epidemic analysis, disaster management, fintech, smart transportation, insurance, and healthcare.
How active is the community?
We have an active community of about 40 contributors, some of whom contribute code to the project regularly.
From a user count perspective, the system has gained a lot of traction. For instance, last month Sedona's PyPI (Python API) and Maven Central (Scala and SQL) packages were downloaded 46K and 120K times, respectively. Several companies are using Apache Sedona to power their geospatial data analysis pipelines.