Q4 2016 GeoTrellis Presentation
What we’ll be covering
What GeoTrellis is, what it can do
Demo of GeoTrellis in action
The next steps for GeoTrellis, some of the
possible use cases that we’re excited about,
and our roadmap
Feel free to ask questions throughout!
Where did GeoTrellis come from?
2011 - 2013
2013 - Present
What is GeoTrellis?
GeoTrellis
A Scala library for geospatial data types and
operations.
Adds geospatial capabilities to Spark.
Stores and queries raster data on HDFS, S3,
Accumulo, and Cassandra (HBase soon).
Geo +
Rasters +
Rasters, some Vector +
v1.0
Q4 2016
Rasters, Vector,
VectorTiles, Point Cloud +
ROADMAP
v1.1
w/
Vector Data with GeoTrellis (non-Spark)
Wraps JTS
GeoJSON, WKT, WKB reading/writing
Reprojection (Proj4j)
Kriging Interpolation
Rasters with GeoTrellis (non-Spark)
Read GeoTiffs
Map Algebra (local, focal, zonal)
Polygonal Summaries
Generally transform and combine raster data
Kernel Density, rasterization, vectorization
Get histograms
Render via color breaks
GeoTrellis & Spark
Ingest data to local file system, HDFS, Accumulo,
S3, or Cassandra
Distributed computations on spatial and
spatiotemporal raster data
Map algebra on distributed tile sets
General ways to transform and combine
distributed tile sets
BACKGROUND
PROCESSING GEOSPATIAL DATA
@ SCALE
Geospatial Data
Core of GIS (Geographic information system)
Raster (images, weather data)
Vector (points of interest, country boundaries)
VectorTiles, Point Cloud
Raster Data
Vector Data (Points)
Vector Data (Lines)
Vector Data (Polygons)
Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/
Vector Data
PROCESSING GEOSPATIAL DATA
@ SCALE
Contains
Heatmap (Kernel Density)
Zonal Statistics
Feature Extraction (Image Segmentation)
Source: http://www.professeurs.polymtl.ca/christopher.pal/
Map Algebra
Local Operation
Focal Operation
Map Algebra in GeoTrellis
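To make the local/focal distinction concrete, here is a minimal sketch in plain Python. It is not the GeoTrellis API: the function names `local_add` and `focal_mean`, and the sample rasters, are hypothetical and chosen only for illustration. A local operation combines rasters cell by cell; a focal operation computes each output cell from a neighborhood of the input.

```python
from statistics import mean


def local_add(a, b):
    """Local operation: combine two equally sized rasters cell by cell."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def focal_mean(raster, radius=1):
    """Focal operation: each output cell is the mean of its square
    neighborhood. The window is clipped at the raster edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(mean(window))
        out.append(out_row)
    return out


# Hypothetical 3 x 3 rasters.
elevation = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
offset = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]

summed = local_add(elevation, offset)   # cell-by-cell sum
smoothed = focal_mean(elevation)        # 3 x 3 neighborhood mean
```

Zonal operations follow the same pattern, except that the cells are grouped by a zone raster or polygon instead of by a fixed window.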
PROCESSING GEOSPATIAL DATA
WITH
Polygonal Summary Statistics
PROCESSING GEOSPATIAL DATA
@ SCALE
NED 1/3 arc second
Landsat 8
• 170 x 180 km
• 2 GB each
• 11 bands
• 700 scenes per day
• 1.4 TB / day
• 255,500 scenes / year
• 0.25 PB / year
Landsat 8 on AWS
• All Landsat 8 scenes from 2015 and beyond.
• Selection of cloud-free scenes from 2013 and 2014.
645,763 scenes
≈1 Petabyte
64 GB
32 Landsat 8 Scenes
It would take this many people’s phones to hold all the Landsat 8
data that AWS is holding.
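The phone comparison can be worked through with the figures quoted on the slides (64 GB per phone, 32 scenes per phone, 645,763 scenes on AWS). The sketch below assumes a flat 2 GB per scene and decimal units; the variable names are illustrative only.

```python
phone_storage_gb = 64      # from the slide
scenes_per_phone = 32      # from the slide
scenes_on_aws = 645_763    # from the slide

# 64 GB / 32 scenes gives the per-scene size assumed here.
gb_per_scene = phone_storage_gb / scenes_per_phone

# Ceiling division: a partly filled phone still counts as a phone.
phones_needed = -(-scenes_on_aws // scenes_per_phone)

# Total volume at a flat 2 GB per scene, in decimal petabytes.
total_pb = scenes_on_aws * gb_per_scene / 1_000_000
```

At a flat 2 GB per scene this comes to roughly 20,000 phones and about 1.3 PB, the same order of magnitude as the ≈1 Petabyte quoted above.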
PROCESSING GEOSPATIAL DATA
@ SCALE
Nutch: a project to build a better search engine,
back in the early 2000s.
Worked for small datasets, but was not scalable.
The Google papers
After reading the papers, Nutch developers
added a distributed file system and MapReduce
model to Nutch.
In 2006, those portions were spun out of Nutch
to form…
Apache Hadoop
Heavily supported by Yahoo, which moved its
large data processing to Hadoop.
By 2007, Twitter, Facebook, LinkedIn, and many
others were doing serious work with Hadoop.
In 2008, Hadoop graduated to a top-level Apache
project.
Hadoop
Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
Matei Zaharia
Worked with Hadoop at UC Berkeley
Noticed Hadoop was not a good fit for
Machine Learning algorithms and other
iterative models.
So in 2009, he created…
Open sourced in 2010 under BSD license
Maintained by UC Berkeley’s AMPLab
Donated to the Apache Software Foundation in
2013 and relicensed as Apache 2.0
Graduated to a top-level Apache project in 2014
Apache Spark
Apache Spark
a distributed computation engine.
An API that lets you work with distributed data
as a collection.
Written in Scala, with language bindings for use
with Java, Python, and R.
Hey Flyers Fans, what is the total count of
Landsat 8 Scenes on your phones?
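One way to read this question is as a distributed count: every phone counts its own scenes independently (the map step), and the partial counts are then summed (the reduce step). The sketch below runs in a single process with made-up data; in Spark the same shape would run across a cluster.

```python
from functools import reduce

# Hypothetical data: each inner list is the scenes stored on one fan's phone.
phones = [
    ["scene-001", "scene-002"],
    ["scene-003"],
    [],
    ["scene-004", "scene-005", "scene-006"],
]

# Map: every phone counts its own scenes; no phone needs to see another's data.
partial_counts = map(len, phones)

# Reduce: the partial counts are combined into a single total.
total = reduce(lambda a, b: a + b, partial_counts, 0)
```

Because addition is associative, the partial sums can be combined in any grouping, which is what lets the reduce step be spread over many machines.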
[Diagram: HDFS with a Name Node and three Data Nodes; Accumulo with a Master and three Tablet Servers]
Accumulo
BigTable clone (columnar database)
Records stored on HDFS
Lexicographically sorted table index
Apache Accumulo
Created by the NSA in 2008
Donated to the Apache Software Foundation in 2011
Graduated to a top-level project in 2012
2006
(Sec. 929) Prohibits any DOD component from utilizing the
cloud computing database developed by the National Security
Agency (NSA) and known as "Accumulo" after the end of
FY2013, unless the DOD CIO certifies that: (1) there are no
viable commercial open source databases that have such security
features, or (2) Accumulo itself has become a successful open
source database project. Requires DOD and intelligence
community officials to coordinate the use by DOD components
of cloud computing infrastructure and services offered by the
intelligence community for purposes other than intelligence
analysis.
Hey Flyers Fans, what is the total count of
Landsat 8 Scenes on your phones per month?
PROCESSING GEOSPATIAL
DATA @ SCALE
Hey Flyers Fans, can you take the average pixel value of each scene’s
band and derive an EPSG:3857 tile set of PNGs to be served on web
maps?
How does GeoTrellis work?
Polygonal Summaries
SPACE FILLING CURVES
Hey Flyers Fans, what is the total count of
Landsat 8 Scenes on your phones per month, per country?
Hey Flyers Fans, what is the total count of
Landsat 8 Scenes on your phones per country?
SPACE FILLING CURVES
Z curve
Hilbert Curve
Space Filling Curves
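As a rough illustration of the Z curve, the sketch below interleaves the bits of a tile's column and row into a single Morton index. It is a plain-Python sketch, not the GeoTrellis implementation; the names `z_index` and `z_decode`, the 16-bit limit, and the choice to put column bits in the even positions are all assumptions made for the example.

```python
def z_index(col, row, bits=16):
    """Interleave the bits of col and row into one Z-order (Morton) index.
    Column bits go to even positions, row bits to odd positions."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)
        index |= ((row >> i) & 1) << (2 * i + 1)
    return index


def z_decode(index, bits=16):
    """Inverse of z_index: split the interleaved bits back into (col, row)."""
    col = row = 0
    for i in range(bits):
        col |= ((index >> (2 * i)) & 1) << i
        row |= ((index >> (2 * i + 1)) & 1) << i
    return col, row


# Z-order index of every cell in a 4 x 4 grid, listed row by row.
grid = [[z_index(c, r) for c in range(4)] for r in range(4)]
```

Cells that are close in two dimensions tend to get nearby indices, which is what makes a one-dimensional sorted key usable for spatial lookups. A Hilbert curve serves the same purpose with a different traversal order.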
Range Decomposition
70 -> 75
92 -> 99
116 -> 121
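Range decomposition turns a rectangular query into a handful of contiguous index ranges that a sorted store can scan. The sketch below does this by brute force for a Z curve: it indexes every cell in the bounding box and merges consecutive indices. It does not attempt to reproduce the ranges shown above, and the helper names and bit layout are assumptions; a production implementation would split the curve recursively instead of enumerating cells.

```python
def z_index(col, row, bits=16):
    """Z-order index with column bits in even positions (an assumed layout)."""
    index = 0
    for i in range(bits):
        index |= ((col >> i) & 1) << (2 * i)
        index |= ((row >> i) & 1) << (2 * i + 1)
    return index


def decompose(col_min, row_min, col_max, row_max):
    """Return the inclusive (start, end) index ranges covering a bounding box."""
    indices = sorted(
        z_index(c, r)
        for r in range(row_min, row_max + 1)
        for c in range(col_min, col_max + 1)
    )
    ranges = []
    for i in indices:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1][1] = i          # extend the current run
        else:
            ranges.append([i, i])      # start a new run
    return [tuple(r) for r in ranges]


aligned = decompose(0, 0, 1, 1)   # one quadrant: a single range
split = decompose(2, 0, 3, 2)     # crosses a quadrant boundary: two ranges
```

Fewer, longer ranges mean fewer scans against the backing store, which is one reason the choice of curve matters.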
S3 key: layerName/zoom/[SFC index (Hilbert or Z-order)]
S3 value: Avro-encoded Seq[(K, V)], where
K = key type (e.g. SpatialKey)
V = value type (e.g. Tile)
Hey Flyers Fans, what is the total count of
Landsat 8 Scenes on your phones A) per month, B) per country,
C) per both?
Why Spark?
Sharding raster data across the cluster
Caching operation results across cluster
HDFS support
Advanced fault tolerance
Advanced task scheduling
Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
Example Problem: Filtering
Say we have a large set of imagery, and would
like to apply two filters to each band:
First, a simple threshold filter: if a value is
above 10,000, we discard it.
Second, a 5 x 5 median filter.
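The two filters can be sketched on a single in-memory band as follows. This is plain Python for illustration, not GeoTrellis code: discarded values are represented as `None`, the median window is clipped at the edges and skips discarded cells, and the function names and sample band are hypothetical.

```python
from statistics import median

NODATA = None  # assumed marker for discarded cells


def threshold(band, limit=10_000):
    """Discard (set to NODATA) every value above the limit."""
    return [
        [v if v is not NODATA and v <= limit else NODATA for v in row]
        for row in band
    ]


def median_filter(band, size=5):
    """Replace each cell with the median of its size x size neighborhood,
    ignoring NODATA cells and clipping the window at the band edges."""
    half = size // 2
    rows, cols = len(band), len(band[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                band[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
                if band[rr][cc] is not NODATA
            ]
            out_row.append(median(window) if window else NODATA)
        out.append(out_row)
    return out


# Hypothetical 5 x 5 band with one saturated pixel in the middle.
band = [[100] * 5 for _ in range(5)]
band[2][2] = 20_000

thresholded = threshold(band)
filtered = median_filter(thresholded)
```

The threshold is a local operation and needs no neighboring data. The median filter is focal, so on a distributed tile set each tile also needs the border pixels of its neighbors, which is where the partitioning of tiles across nodes starts to matter.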
[Diagram: a 3 x 3 grid of tiles keyed by (c, r), from (0, 0) to (2, 2), distributed across Node 1, Node 2, and Node 3]
Example Problem: Querying
We want to retrieve all imagery for the city of
Rio de Janeiro taken in March 2016, find the
maximum NDVI value for each pixel, and save the
result as a GeoTiff.
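The per-pixel part of this query can be sketched as below, using the standard definition NDVI = (NIR - Red) / (NIR + Red). The spatial and temporal filtering and the GeoTiff output are left out. The scene data and function names are made up for the example, and pixels where NIR + Red is zero are treated as missing.

```python
def ndvi(red, nir):
    """NDVI per pixel: (NIR - Red) / (NIR + Red); None where the sum is zero."""
    return [
        [(n - r) / (n + r) if (n + r) != 0 else None for r, n in zip(red_row, nir_row)]
        for red_row, nir_row in zip(red, nir)
    ]


def pixelwise_max(rasters):
    """Maximum value at each pixel across rasters, skipping missing values."""
    def best(values):
        valid = [v for v in values if v is not None]
        return max(valid) if valid else None

    return [[best(cell) for cell in zip(*rows)] for rows in zip(*rasters)]


# Hypothetical red and near-infrared bands for two 2 x 2 scenes.
scenes = [
    {"red": [[1, 2], [2, 0]], "nir": [[3, 2], [6, 0]]},
    {"red": [[1, 1], [4, 0]], "nir": [[1, 3], [4, 0]]},
]

max_ndvi = pixelwise_max([ndvi(s["red"], s["nir"]) for s in scenes])
```

Taking the maximum is associative and commutative, so scenes can be combined in any order, which suits a distributed reduce over tiles that share a spatial key.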
What are uses of GeoTrellis?
Rendering elevation with hillshade + NLCD on AWS EMR
100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr
400 CPUs / ≈1.5 TB memory
1 master m3.xlarge on-demand instance @ $0.26 / hr
EMR cluster charge, $0.07 / hr
$4.37 / hr
NED 1/3 arc second + NLCD
GLOBAL CIRCULATION MODELS
Models for predicting world temperature and precipitation.
GLOBAL CIRCULATION MODELS
NASA NEX Downscaled Climate Projections (NEX-DCP30)
• Monthly data over conterminous US
• Historical from 1950 - 2006
• 4 RCP scenarios from 2006 - 2099
• 8190 netCDF files on S3 - s3://nasanex/NEX-DCP30
• 15.3 TB in compressed GeoTiff tiles.
• RCP 8.5, max for datatype/model combo: 90.92 GB
Landsat NDVI/NDWI change detection demo
Static vs Dynamic
Serving static data pre-processed through a batch
transformation pipeline vs. serving data
processed dynamically, on demand, from
unprocessed source data
Static vs Dynamic
GeoTrellis systems tend to have two major
components:
A batch pre-processing pipeline, which
processes large amounts of data into some
static data at rest.
A dynamic pipeline which processes data at
the time the user requests it.
[Diagram series, labels: “Raw” Data, Processing Pipeline, Application Data, Served Data]
Completely dynamic: processing at request time
Completely static: batch data pre-processing
Mix of static and dynamic: batch data pre-processing (Ingest/ETL), then processing at request time (Server)
More static: faster to serve, less flexibility
More dynamic: more flexible, slower to serve
Ingesting Landsat data
Landsat images are pulled off of S3 or Google’s
public Earth Engine storage.
In a Spark job run on EMR, these images are
reprojected, tiled, indexed, and saved to
Accumulo or HDFS.
The indexed tile set is now ready to be used by
the server application.
[Diagram: Landsat GeoTiffs on S3 → Ingest/ETL → EPSG:3857 tiled imagery in Accumulo → Server → PNGs, JSON]
DEPLOYMENT
Example Deployment
Servicing User Requests
ROAD MAP
Release Schedule
v1.0
Q4 2016
v1.1
Q2 2017
Graduation
Rasters, Vector,
VectorTiles, Point Cloud +
ROADMAP
v1.1
w/
DOCUMENTATION!
IMPROVED DEPLOYMENT WITH
Integration work
VECTORTILES
Image: osm2vectortile
POINT CLOUD
MACHINE LEARNING PIPELINES
http://blog.tomnod.com/finding-pools-with-deep-learning