OpenStreetMap in the age of Spark
@adrianulbona
OpenStreetMap
- The Wikipedia of Maps
- https://www.openstreetmap.org
- nodes: geo-localized points on the map
- ways: roads, building contours, borders (multiple nodes), …
- relations: highways (multiple ways), schools (building contours), ...
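A minimal sketch of the three entity types as plain Python data classes. The class and field names are assumptions chosen for illustration, not the official OSM schema, and relation members are simplified (real members also carry a role). The sample coordinates are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A geo-localized point on the map."""
    id: int
    latitude: float
    longitude: float
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Way:
    """An ordered list of node ids, e.g. a road or a building contour."""
    id: int
    node_ids: List[int]
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    """A group of members, e.g. the ways that make up a highway."""
    id: int
    members: List[Tuple[str, int]]  # (member type, member id)
    tags: Dict[str, str] = field(default_factory=dict)


# Hypothetical sample data: two nodes joined by one road, grouped in a relation.
a = Node(1, 46.7712, 23.6236)
b = Node(2, 46.7700, 23.5900)
road = Way(10, [a.id, b.id], {"highway": "residential"})
route = Relation(100, [("way", road.id)], {"type": "route"})
```

Note that a way stores only node ids, so any geometric question about a way requires a lookup (or join) against the nodes.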
OpenStreetMap - the data
The data is available at http://planet.openstreetmap.org and comes in two
formats:
- XML ~ 53 GB
- PBF ~ 34 GB
Take a look at http://osmstats.neis-one.org if you are interested in how the map
evolves on a daily basis.
OpenStreetMap - story 1
- download the PBF/XML files
- wait days to import the data into PostgreSQL
- write some SQL
- grab a coffee
- grab a second coffee
- see some query results
- …
- manage scary scripts that keep your OSM DB updated
OpenStreetMap - story 2
- download the PBF/XML files
- extract various pieces of information into obscure CSVs
- write MapReduce (MR) jobs full of string-parsing bugs
- run the jobs
- grab a coffee
- fix your job
- grab a second coffee
- …
- MR is not cool anymore
OpenStreetMap - story 3
- one day some weird guy comes and asks:
what is the total road network length in OSM?
- you have ways as collections of node ids
- you have nodes with ids and coordinates (latitude, longitude)
- all of this is mixed up in a huge protobuf file
- options?
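To make the question concrete, here is a small single-machine sketch of the computation: resolve each way's node ids to coordinates, then sum the great-circle (haversine) length of consecutive segments. The function names and the sample data are hypothetical, and the haversine formula is one reasonable choice of distance, not something the slides prescribe. A real job would also keep only ways that are roads (for example, those tagged `highway`).

```python
import math
from typing import Dict, Iterable, List, Tuple

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

Coordinate = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(p: Coordinate, q: Coordinate) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def total_length_km(nodes: Dict[int, Coordinate],
                    ways: Iterable[List[int]]) -> float:
    """Sum the length of every segment of every way."""
    total = 0.0
    for node_ids in ways:
        # The "join": replace each node id with its coordinates.
        coords = [nodes[node_id] for node_id in node_ids]
        total += sum(haversine_km(a, b) for a, b in zip(coords, coords[1:]))
    return total


# Hypothetical data: three nodes on the equator, one degree apart.
nodes = {1: (0.0, 0.0), 2: (0.0, 1.0), 3: (0.0, 2.0)}
ways = [[1, 2, 3]]
length = total_length_km(nodes, ways)
```

The dictionary lookup stands in for a join between ways and nodes. At planet scale neither side fits comfortably in one process, which is why the data has to be structured and readable in parallel.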
what is the actual problem?
we need parallel data access
we need the data structured
Spark will handle the rest
Parquet will give you more …
performance
Apache Parquet
Apache Parquet is a columnar storage format available to any project in the
Hadoop ecosystem, regardless of the choice of data processing framework, data
model or programming language.
Google paper: Dremel: Interactive Analysis of Web-Scale Datasets
Twitter blog: Dremel Made Simple with Parquet
Apache Parquet
[Figure: the same data laid out in row-based storage and in column-based storage]
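The difference between the two layouts can be illustrated with plain Python containers. This is only an in-memory analogy with hypothetical sample values, not the Parquet file format itself, which additionally organizes columns into row groups and applies encoding and compression.

```python
# Row-based layout: one record per entity, all fields stored together.
rows = [
    {"id": 1, "latitude": 46.77, "longitude": 23.62},
    {"id": 2, "latitude": 44.43, "longitude": 26.10},
    {"id": 3, "latitude": 45.75, "longitude": 21.23},
]


def to_columns(records):
    """Pivot a list of records into one list of values per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


# Column-based layout: one contiguous list per field.
columns = to_columns(rows)

# A query touching a single field reads only that column.
mean_latitude = sum(columns["latitude"]) / len(columns["latitude"])
```

With the columnar layout, a query that needs only `latitude` never has to touch `id` or `longitude`, which is the main source of the performance gain for analytical workloads.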
Apache Parquet
- protobuf-like
- primitives, arrays, structs
- bonus: nested structs
Parquet + Spark
osm-parquetizer
- github.com/adrianulbona/osm-parquetizer
- input: one OSM PBF file
- output: one Parquet file for each entity type (nodes, ways, relations)
- runtime: minutes for countries like Romania
- runtime: between 2 and 4 hours for the entire planet
- Parquet file size ~ 3 x the original PBF (~ 100 GB for the planet)
http://bit.ly/2n9TRF3
