Working with OpenStreetMap using Apache Spark and GeoTrellis

What we’ll cover
● OpenStreetMap (OSM) and its data model
● A Missing Maps use case that needed big data tooling to
process OSM History
● OSMesa, what it is, and what it can do
● The future of distributed OSM processing, and what it will
enable

OSM Data Model
The OSM data model consists mainly of 3 elements:
● Nodes - Points
● Ways - LineStrings, Polygons
● Relations - GeometryCollections, Polygon with holes,
MultiPolygons
As well as the tag-based metadata that applies to each
element, and changesets that group edits

OSM Data Model: Changesets
● Edits are grouped into changesets, which have their own
metadata such as user comments (for developers, think
commit messages)
● Adding hashtags to user comments allows downstream
processing to group changes - for example, #HOTLunch

Backfilling Missing Maps
● The Missing Maps leaderboard processes OSM change files to
increment user and campaign statistics
● The statistics were correct from the point the streaming
calculation started, but edits made before then did not
count towards users’ totals.
● So, there was a need to “backfill” the statistics based on
OSM history.

● Through the Red Cross and a grant
from Microsoft Philanthropies, Seth
Fitzsimmons of Pacific Atlas was
hired to solve the backfilling problem.
● Seth was previously involved with
releasing OSM data as a public
dataset on AWS and early work on
distributed processing of OSM data

Reducing the “time to first question”

Source: Seth’s blog post about processing OSM with Athena

Backfill: Athena approach
● Seth first tried to use Athena to calculate the backfill
statistics. This approach didn’t work.
● The complexity of the queries made the jobs blow up or
never finish
● Also, Athena's geospatial support hadn't been announced
yet, and once it was, it still didn’t work with the complicated
set of queries

Backfill: New approach
● Seth started showing interest in a set of tools that Azavea
was building at the time that used Apache Spark and
GeoTrellis to calculate similar statistics
● He ported his complicated SQL queries for Athena to
SparkSQL and started contributing to that effort

What is OSMesa?
● It's a loose term for a workflow for OSM data processing
● Still being defined - useful, but amorphous
● More a group of tools and techniques than, say, a library
● Uses Spark, GeoTrellis and AWS to process OSM data into
geometries, vector tiles, and statistics

Apache Spark
● A distributed computation engine.
● An API that lets you work with distributed data as a
collection, including a DataFrames API
● Written in Scala, with language bindings for use with Java,
Python, and R.

● Spark DataFrames provide an API that is similar to R or
Pandas DataFrames and allows working with data in a SQL-like
manner
● Very powerful, and can express complicated queries
● (partially) Abstracts away the complexities of distributed
computing

GeoTrellis
● Core geospatial library in Scala
● Enables Spark with geospatial types and operations
● Generally focused on raster data; wraps JTS for vector
support
● Vector Tile module for reading and writing vector tiles

OSMesa workflow
[Diagram. Labels: AWS EMR Cluster; AWS S3; ORC files; ORC; Statistics; Vector Tiles]

Creating features from History
● With OSMesa, we can create full historical geometries.
● To do this, we needed to create a concept of “minor
versions” of geometries

[Diagram: way v1 (highway=unclassified) with its nodes at v1; some of those nodes change to v2 while the way is still v1; way v2 (highway=primary) with nodes at v1 and v2]

[Diagram: the same history with a minor version: way v1 (nodes at v1); way v1.1 (two nodes now at v2, the minor version change); way v2 (highway=primary)]

Full historical geometries
● With minor versions, we can bake new ORC files that
contain geometries of every element in OSM history, with
ways and relations reflecting every edit to the element
itself as well as to the elements it contains
● Then, we compute statistics per changeset based on
geometries, and roll up the statistics per user and hashtag
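The roll-up step on this slide can be sketched in a few lines of Python. OSMesa does this with Spark at scale; the version here is an in-memory illustration only, and the record layout and statistic names (`buildings_added`, `road_km_added`) are hypothetical.

```python
from collections import Counter, defaultdict


def roll_up(changeset_stats):
    """Sum per-changeset statistics per user and per hashtag."""
    by_user = defaultdict(Counter)
    by_hashtag = defaultdict(Counter)
    for cs in changeset_stats:
        # Counter.update adds the values key by key.
        by_user[cs["user"]].update(cs["stats"])
        for tag in cs["hashtags"]:
            by_hashtag[tag].update(cs["stats"])
    return by_user, by_hashtag


# Made-up per-changeset statistics (field and stat names are assumptions).
changeset_stats = [
    {"user": "alice", "hashtags": ["hotlunch"],
     "stats": {"buildings_added": 3, "road_km_added": 1.5}},
    {"user": "alice", "hashtags": ["hotlunch", "missingmaps"],
     "stats": {"buildings_added": 2}},
    {"user": "bob", "hashtags": ["missingmaps"],
     "stats": {"road_km_added": 0.25}},
]
by_user, by_hashtag = roll_up(changeset_stats)
```

A changeset with several hashtags counts towards each of them, so hashtag totals can add up to more than the user totals.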

Processing OSM data at scale
● Processing of full history into features in under 40 minutes
(cluster of 255 m3.2xlarge nodes)
● This is not a small cluster (≈$65/hour). YMMV with smaller
clusters.
● We are building update mechanisms to avoid refreshing the
entire dataset

Some data created by OSMesa...

Viewing time slices of Rhode Island OSM

Historical edits for several hashtag campaigns

Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies

OSMesa: Other current uses
● Building matching between OSM and other vector datasets
● Generating vector tiles for URCHN containing a subset of
historical data to front-end analytics

The Future: Validation workflows, Reputation scores
● Better validation workflows are a big question in the OSM
community right now (according to SOTM US 2017)
● HOT Tasking Manager does some; we can do better
● One way to improve validation workflows is to suggest that
validation be done by veteran mappers, and that the work of
more junior mappers be suggested for validation
(“reputation score”)
● Development Seed, who contribute to and use OSMesa,
have great ideas in this space.

The Future: Data Science notebooks, production workflows
● We are aiming to create a Python notebook environment for
doing data science on OSM, in combination with raster data
● By building on Spark and projects like GeoMesa’s
“JTSFrames”, RasterFrames, and GeoTrellis, we’re creating
a platform that works both for data scientists poking around
in a Jupyter notebook and for production systems.

The Future: Machine Learning pre- and post-processing
● Pre-processing geospatial imagery and OSM into training
chips - a distributed label-maker
● Managing data into and out of Raster Vision
● Post-processing by cleaning the model output, matching to
OSM or other vector data to remove duplicates, conflation
workflows
● Matching OSM to imagery dates: e.g. pre- and post-disaster.

Join in the fun
● There are a lot of interesting development challenges that
need to be met in the OSM world
● OSM has many different voices in the room, but they all
have one goal: building a better map
● Join the effort to build a better map

If you could ask OpenStreetMap any question, at any scale,
what would you ask it?

THANKS!
Rob Emanuele, Azavea
@lossyrob (Twitter, GitHub)
www.azavea.com
Seth Fitzsimmons, Pacific Atlas
@mojodna (Twitter, GitHub)
www.pacatlas.com
github.com/azavea/osmesa

OSM Data Model: Nodes
● Single location; only OSM element with geospatial data
● Can represent points of interest, or be solely for inclusion in
ways
● Represents a Point geometry

OSM Data Model: Ways
● References a sequence of ordered nodes
● Represents a LineString geometry
● Closed ways can represent Polygon geometries

OSM Data Model: Relations
● Group of nodes, ways, and other relations
● Used for representing a Polygon with holes,
MultiPolygons, and more generally GeometryCollections

OSM Data Model: Tags
● Each Node, Way and Relation can have a sequence of
tags, which are string-based keys and values. These
describe the role of each element on the map, e.g.
○ highway=residential
○ landuse=grass
○ amenity=fast_food
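The element types from the data model slides can be sketched as plain Python types. This is an illustrative model only: the class and field names are assumptions, not OSMesa's or any OSM library's API, and the closed-way check assumes a ring needs at least four node references (three distinct nodes plus the repeated first node).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    id: int
    lon: float
    lat: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_ids: List[int]  # ordered references to nodes
    tags: Dict[str, str] = field(default_factory=dict)

    def is_closed(self) -> bool:
        """True if the way starts and ends on the same node."""
        return len(self.node_ids) >= 4 and self.node_ids[0] == self.node_ids[-1]


@dataclass
class Relation:
    id: int
    members: List[Tuple[str, int, str]]  # (member type, member id, role)
    tags: Dict[str, str] = field(default_factory=dict)


def way_coordinates(way, nodes_by_id):
    """Resolve a way's node references into (lon, lat) pairs."""
    return [(nodes_by_id[n].lon, nodes_by_id[n].lat) for n in way.node_ids]


# Made-up elements
nodes = {
    1: Node(1, 0.0, 0.0),
    2: Node(2, 1.0, 0.0),
    3: Node(3, 1.0, 1.0, tags={"amenity": "fast_food"}),
}
road = Way(10, [1, 2, 3], tags={"highway": "residential"})
lawn = Way(11, [1, 2, 3, 1], tags={"landuse": "grass"})
```

Only nodes carry coordinates, so a way's geometry has to be resolved through its node references. Whether a closed way is a Polygon rather than a closed LineString also depends on its tags, which this sketch does not model.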

Source: Dongpo Deng, https://www.slideshare.net/dongpo/the-one-and-many-maps-participatory-and-temporal-diversities-in-openstreetmap

https://planet.openstreetmap.org/

Ways to work with OSM snapshots
● Import OSM data into PostGIS
○ osm2pgsql
○ imposm3
● Render into raster tiles or vector tiles
○ Mapnik
○ Tegola
● Utilize for routing software
○ pgRouting

Ways to work with OSM history
● Clip it using osmium, and import a subset into PostGIS
● After that … not a lot of mature tooling available

Why is OSM history useful
● Calculating user history statistics
● Calculating campaign history statistics
● Calculating complete answers to the question, “what has
changed?”
● Taking a snapshot of OSM at any point in history
● Analytics for research
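As one illustration, taking a snapshot at a point in history amounts to keeping, for each element, its latest version at or before that time and dropping elements whose latest version is a deletion. The sketch below is in-memory Python; the record layout (`type`, `id`, `version`, `timestamp`, `visible`) is an assumption modelled on OSM history records, and the data is made up.

```python
from datetime import datetime


def snapshot(history, at):
    """Return {(type, id): record} for the elements that exist at time `at`."""
    latest = {}
    for rec in history:
        if rec["timestamp"] > at:
            continue  # this version did not exist yet
        key = (rec["type"], rec["id"])
        if key not in latest or rec["version"] > latest[key]["version"]:
            latest[key] = rec
    # A latest version with visible=False means the element was deleted.
    return {key: rec for key, rec in latest.items() if rec["visible"]}


# Made-up history: node 1 is edited in 2017, node 2 is deleted in 2018.
history = [
    {"type": "node", "id": 1, "version": 1,
     "timestamp": datetime(2015, 1, 1), "visible": True},
    {"type": "node", "id": 1, "version": 2,
     "timestamp": datetime(2017, 1, 1), "visible": True},
    {"type": "node", "id": 2, "version": 1,
     "timestamp": datetime(2016, 1, 1), "visible": True},
    {"type": "node", "id": 2, "version": 2,
     "timestamp": datetime(2018, 1, 1), "visible": False},
]
mid_2016 = snapshot(history, datetime(2016, 6, 1))
in_2019 = snapshot(history, datetime(2019, 1, 1))
```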

Why ORC?
● On-demand querying + predicate push-down is possible if
OSM data is in a format that is well understood by the
Hadoop ecosystem
● Bespoke formats have their place, especially when size or
other considerations are all-consuming, but it's really
frustrating to see people continually implementing OSM PBF
parsers to be slightly faster when those parsers are typically
single-use (for a specific application). I wanted to sidestep
the whole process and use a well-known, well-supported

The Approach: Features from OSM data
● Join element data to the other elements that contain them;
for example, join each node to the way(s) it belongs to.
● Assign a minor version to ways and relations modified
because the underlying elements change; e.g. a minor
version increments for a way if someone moves the nodes
belonging to it.
● Create Points, Lines, Polygons, and MultiPolygons for each
major and minor version of the element.
ProcessOSM.scala on GitHub
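The minor-version rule in the second bullet can be illustrated with a toy example. This is a sketch under stated assumptions, not the logic in ProcessOSM.scala: timestamps are plain integers, every distinct node-edit timestamp between two major versions of a way counts as one minor version, and the function and argument names are hypothetical.

```python
def minor_versions(way_versions, node_versions):
    """Assign (major, minor) versions to one way.

    way_versions:  [(major, timestamp, node_ids), ...] sorted by major version.
    node_versions: {node_id: [(version, timestamp), ...]}.
    Returns [(major, minor, timestamp), ...] in chronological order.
    """
    result = []
    for i, (major, start, node_ids) in enumerate(way_versions):
        # This major version is current until the next major version starts.
        end = way_versions[i + 1][1] if i + 1 < len(way_versions) else None
        # Distinct times at which a member node changed during that window.
        changes = sorted({
            ts
            for node_id in set(node_ids)
            for _version, ts in node_versions.get(node_id, [])
            if ts > start and (end is None or ts < end)
        })
        result.append((major, 0, start))
        for minor, ts in enumerate(changes, start=1):
            result.append((major, minor, ts))
    return result


# Way created at t=10, two of its nodes moved at t=20, way retagged at t=30.
way_versions = [(1, 10, [1, 2, 3, 4]), (2, 30, [1, 2, 3, 4])]
node_versions = {
    1: [(1, 5)],
    2: [(1, 5)],
    3: [(1, 5), (2, 20)],
    4: [(1, 5), (2, 20)],
}
versions = minor_versions(way_versions, node_versions)
```

Here the two node moves share a timestamp, so they produce a single minor version: the way goes v1, v1.1, v2.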

The Approach: Vector Tile Generation

Analytic Vector Tiles
● The name we’ve been using for Vector Tiles that contain
information for analysis, not (necessarily) for display
● OSMesa/VectorPipe can create sets of Analytic Vector Tiles
from arbitrary subsets of OSM History and publish them to
AWS S3
● Think custom Mapbox QA Tiles, containing relations and
historical elements
● We are creating streaming update workflows to keep
Analytic Vector Tile sets up-to-the-minute (almost).

Other work in this space
● Mapbox’s Jennings Anderson gave a talk at SOTM and
wrote a blog post around quarterly QA tiles
● Uses a work-in-progress project called osm-wayback to
create the historical QA tiles. Goal of project is “...to create
historic geometries for each intermediate version of an OSM
feature.”
● RocksDB on the backend, which creates a ≈ 600GB index
● We have been collaborating and are looking to collaborate
further; the work is awesome

Animation of Rhode Island OSM edits over time

How to get started with OSMesa
● GitHub
● Gitter
● Docs are a TODO

An Aside - “Push vs Pull” models for AI tooling for OSM (and in general)
