GeoMesa: Using Accumulo for optimized spatio-temporal processing
Dr. James Hughes, CCRi
james.hughes@ccri.com
GeoMesa is
● A collection of libraries and modules which can be used to
solve Big Geo Data problems
○ Great for managing billions to trillions of vector features
○ Great for streaming vector data
● Open source under Eclipse’s LocationTech working group; it has
graduated from incubation
● Built on top of great open source libraries
GeoMesa Background
Such architectures allow for live views and near-real time processing (speed layer)
while persisting the data for historic queries and batch analysis (batch layer).
Client access to both layers can be handled by GeoServer.
GeoMesa enables Lambda architectures
Suppose we wish to monitor and understand a group of GPS-enabled and
internet-enabled devices (ex: sensors, vehicles).
● GeoMesa’s ETL / converter library aids in re-usable data modeling.
● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest
into Accumulo and Kafka topics.
● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as
1) geo-fencing, 2) location trackers, and 3) complex alerting rules.
● Effective storage in Accumulo allows for fast query returns.
● End-to-end visualization and analysis support allows aggregations to be pushed
down to the Accumulo tablet servers.
● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc
interactive analysis and data discovery.
All of this adds up to “Speed! Speed! Speed!” whether you are looking at
a live view of the data or pulling back an analysis product.
Example Use Case: Managing Internet-Aware Devices
Enabling and making visualization and analysis quick has been a journey and this
talk is about our steps so far
1. Space-filling curves and storing spatio-temporal data
2. Improvements to GeoMesa’s use and implementation of Accumulo Iterators
3. Spark and MapReduce for distributed computation
Not in this talk
1. Storm / NiFi - Streaming Ingest
2. Live views and online processing with Kafka
3. Command line tools
4. ETL / parser library
5. Machine learning / Deep Analytics
Talk Outline
● Accumulo Key Design
● Space Filling Curves 101
● Indices for Points with Time
● Indices for Lines and Polygons
● Lessons Learned
GeoMesa's evolution of Accumulo schemas
In a traditional stack, the application
issues queries to a database which is
responsible for query planning.
Overview of query planning in Accumulo
With Accumulo, the query planning is
handled by library code in the
application.
● Goal: Index 2+ dimensional data
● Approach: Use Space Filling Curves
● First, ‘grid’ the data space into bins.
● Next, order the grid cells with a space
filling curve.
○ Label the grid cells by the order
in which the curve visits them.
○ Associate the data in that grid cell
with a byte representation of the
label.
● We prefer “good” space filling curves:
○ Want recursive curves and locality.
● Space filling curves have higher
dimensional analogs.
Space Filling Curves (in one slide!)
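The gridding and labelling steps above can be sketched in a few lines of Python. This is a minimal illustration that assumes a Z-order (Morton) curve over a whole-globe lon/lat grid; the function names and the 4-bit default resolution are hypothetical and are not taken from GeoMesa.

```python
def interleave2(x, y, bits=4):
    """Interleave the low `bits` bits of x and y into one Z-order label."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits land on odd positions
    return z


def grid_cell(lon, lat, bits=4):
    """Bin a lon/lat pair into a 2^bits x 2^bits grid covering the globe."""
    cells = 1 << bits
    col = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    row = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return col, row


def z2_label(lon, lat, bits=4):
    """Byte representation of the curve label for the cell holding (lon, lat)."""
    col, row = grid_cell(lon, lat, bits)
    z = interleave2(col, row, bits)
    return z.to_bytes((2 * bits + 7) // 8, "big")
```

Because the label bytes sort in the same order as the curve visits the cells, data from nearby cells tends to land close together in a sorted key/value store.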
To query for points in the grey rectangle, the
query planner enumerates a collection of index
ranges which cover the area.
Note: Most queries won’t line up perfectly with the
gridding strategy.
Further filtering can be run on the Accumulo
tablet servers with Iterators (next section)
or we can return ‘loose’ bounding box results
(likely more quickly).
Query planning with Space Filling Curves
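As a rough sketch of range enumeration, the code below labels every grid cell in a query rectangle with a Z-order value and merges consecutive labels into ranges. It is a brute-force illustration under assumed names (`interleave2`, `covering_ranges`); a production planner would more likely decompose the curve recursively rather than visit every cell.

```python
def interleave2(x, y, bits=3):
    """Interleave the low `bits` bits of x and y into one Z-order label."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def covering_ranges(xmin, ymin, xmax, ymax, bits=3):
    """Return merged (lo, hi) label ranges covering a rectangle of grid cells."""
    labels = sorted(
        interleave2(x, y, bits)
        for x in range(xmin, xmax + 1)
        for y in range(ymin, ymax + 1)
    )
    ranges = []
    for z in labels:
        if ranges and z == ranges[-1][1] + 1:
            ranges[-1][1] = z            # extend the current run
        else:
            ranges.append([z, z])        # start a new run
    return [tuple(r) for r in ranges]
```

A rectangle aligned with the curve collapses to one range (`covering_ranges(0, 0, 1, 1)` gives `[(0, 3)]`), while the same-sized rectangle shifted by one cell needs four separate ranges, which is why misaligned queries produce many scan ranges.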
GeoMesa has several tables; each optimized for a particular use case.
The Z3 table is optimized for point data with a time component. (Think sensor
observations, track reports, or other events which happen at a particular location.)
GeoMesa Key Structure for the ‘Z3’ table
Key
  Row: Shard (1 byte) + Epoch Week (2 bytes) + Z3(x,y,t) (8 bytes)
  Column Family: ‘F’
  Column Qualifier
Value
  Record
Here and now:
(38.9864985, -76.9561856)
10:15am, Tuesday, Oct. 11th, 2016
Epoch Week: 2440
X value: 1275689
Y value: 151972
T value: 2097151
Z3 (as a long):
6430470637115132837
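The row layout can be illustrated with a short sketch that packs a shard byte, a two-byte epoch week, and an eight-byte Z3 value into an 11-byte row. The big-endian byte order, the shard value of 0, and the conversion of 10:15am to UTC (assuming US Eastern time) are assumptions for illustration; the Z3 long is copied from the slide rather than computed.

```python
import struct
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_week(when):
    """Whole weeks elapsed since 1970-01-01 UTC."""
    return (when - EPOCH).days // 7


def z3_row_key(shard, week, z3):
    """Pack shard (1 byte), epoch week (2 bytes), and Z3 value (8 bytes)."""
    return struct.pack(">BHQ", shard, week, z3)


# 10:15am on Oct. 11th, 2016, assumed to be US Eastern time (14:15 UTC)
when = datetime(2016, 10, 11, 14, 15, tzinfo=timezone.utc)
week = epoch_week(when)
key = z3_row_key(0, week, 6430470637115132837)
```

Here `week` works out to 2440, matching the epoch week on the slide, and `key` is 11 bytes long.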
Most approaches to indexing non-point
geometries involve covering the
geometry with a number of grid cells
and storing a copy with each index.
This means that the client has to
deduplicate results which is expensive.
Böhm, Klump, and Kriegel describe an
indexing strategy that allows such
geometries to be stored once.
GeoMesa has implemented this
strategy in XZ2 (spatial-only) and XZ3
(spatio-temporal) tables.
The key is to store data by resolution,
separate geometries by size, and then
index them by their lower left corner.
This does require consideration on the
query planning side, but avoiding
deduplication is worth the trade-off.
Indexing non-point geometries: New XZ Index
For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial
extension.” 6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China.
(http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
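To make "store data by resolution" concrete, the sketch below picks a resolution level for a bounding box normalized to the unit square. It assumes the XZ-ordering idea that a cell at each level is enlarged to twice its width and height, anchored at its lower-left corner, and it chooses the finest level whose enlarged cell still contains the box. The function name and the `max_level` default are hypothetical; this is not GeoMesa's implementation.

```python
import math


def xz_level(xmin, ymin, xmax, ymax, max_level=12):
    """Pick the finest level whose enlarged cell contains the bounding box.

    Coordinates are assumed to be normalized to [0, 1].
    """
    max_dim = max(xmax - xmin, ymax - ymin)
    if max_dim <= 0.0:
        return max_level                      # points go to the finest level
    # Coarsest candidate: cells at level l1 are at least as wide as the box.
    l1 = int(math.floor(math.log(max_dim) / math.log(0.5)))
    if l1 >= max_level:
        return max_level
    w2 = 0.5 ** (l1 + 1)                      # cell width one level finer

    def fits(lo, hi):
        # Enlarged cell starts at the cell containing `lo` and spans 2 * w2.
        return hi <= math.floor(lo / w2) * w2 + 2 * w2

    return l1 + 1 if fits(xmin, xmax) and fits(ymin, ymax) else l1
```

Two boxes of the same size can land on different levels depending on where their lower-left corner falls relative to the grid, which is the extra consideration the query planner has to account for.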
● Accumulo Iterator Overview
● GeoMesa Iterators for Analysis
and Visualization
● Iterator Lessons Learned
GeoMesa's use of Accumulo Iterators
“Iterators provide a modular mechanism for adding functionality to be executed by
TabletServers when scanning or compacting data. This allows users to efficiently
summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation
Part of the modularity is that the iterators can be stacked:
the output of one can be wired into the next.
Example: The first iterator might read from disk, the second could filter with
Authorizations, and a final iterator could filter by column family.
Other notes:
● Iterators provide a sorted view of the key/values.
● Iterator code can be loaded from HDFS and namespaced!
Accumulo Iterators
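The stacking idea can be mimicked with Python generators, where each stage consumes the sorted output of the one below it. This is a conceptual sketch only: real Accumulo iterators are Java classes, and the `Entry` type, the function names, and the simplified visibility check (a single label instead of a boolean expression) are assumptions for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    row: str
    family: str
    visibility: str
    value: bytes


def source(entries):
    """Bottom of the stack: yield entries in sorted key order."""
    yield from sorted(entries, key=lambda e: (e.row, e.family))


def visibility_filter(upstream, authorizations):
    """Pass entries that are unlabelled or whose label the reader holds."""
    for e in upstream:
        if e.visibility == "" or e.visibility in authorizations:
            yield e


def family_filter(upstream, family):
    """Pass only entries from one column family."""
    for e in upstream:
        if e.family == family:
            yield e


data = [
    Entry("b", "F", "secret", b"2"),
    Entry("a", "F", "", b"1"),
    Entry("c", "B", "", b"3"),
]
# Wire the stages together: source -> visibility filter -> family filter.
stack = family_filter(visibility_filter(source(data), {"secret"}), "F")
rows = [e.row for e in stack]
```

Each stage only filters, so the sorted order established by the source is preserved all the way up the stack.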
Visualization Example: Heatmaps
Without powerful visualization options,
big data is big nonsense.
Consider this view of shipping in the
Mediterranean sea
Heatmaps help show patterns and
they can be accelerated with
GeoMesa
[Diagram labels: Heatmap Request, HeatMap WPS, Query Hints]
A request to GeoMesa consists of two broad pieces:
1. A filter restricting the data to act on, e.g.:
a. Records in Maryland with ‘Accumulo’ in the text field.
b. Records during the first week of 2016.
2. A request for ‘how’ to return the data, e.g.:
a. Return the full records
b. Return a subset of the record (either a projection or ‘bin’ file format)
c. Return a histogram
d. Return a heatmap / kernel density
Generally, a filter can be handled partially by selecting which ranges to scan; the
remainder can be handled by an Iterator.
Modifications to selected data can also be handled by a GeoMesa Iterator.
GeoMesa Data Requests
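For the heatmap case, the split between server-side work and client-side work might look like the sketch below: each tablet server bins its own points into a pixel grid, and the client only sums the small partial grids. The function names and the plain `(x, y)` tuples are assumptions; this shows the aggregation pattern, not GeoMesa's density iterator.

```python
from collections import Counter


def partial_density(points, bbox, width, height):
    """Server-side step: bin points into a width x height pixel grid."""
    xmin, ymin, xmax, ymax = bbox
    grid = Counter()
    for x, y in points:
        if not (xmin <= x <= xmax and ymin <= y <= ymax):
            continue                          # outside the requested view
        col = min(int((x - xmin) / (xmax - xmin) * width), width - 1)
        row = min(int((y - ymin) / (ymax - ymin) * height), height - 1)
        grid[(col, row)] += 1
    return grid


def merge_density(partials):
    """Client-side step: sum the partial grids from every server."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


bbox = (0.0, 0.0, 10.0, 10.0)
server_a = partial_density([(1, 1), (2, 2), (9, 9)], bbox, 2, 2)
server_b = partial_density([(1, 3), (11, 11)], bbox, 2, 2)
heatmap = merge_density([server_a, server_b])
```

The amount of data returned to the client depends on the grid size rather than on the number of matching records.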
The first pass of GeoMesa iterators separated concerns into separate iterators.
The GeoMesa query planner assembled a stack of iterators to achieve the desired
result.
Initial GeoMesa Iterator design
Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by
Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon
The key benefit to having decomposed iterators is that they are easier to
understand and re-mix.
In terms of performance, each one needs to understand the bytes in the Key and
Value. In many cases, this will lead to additional serialization/deserialization.
Now, we prefer to write Iterators which handle transforming the underlying data
into what the client code is expecting in one go.
Second GeoMesa Iterator design
1. Using fewer iterators in the stack can be beneficial
2. Using lazy evaluation / deserialization for filtering Values can power speed
improvements.
3. Iterators take in Sorted Keys + Values and *must* produce Sorted Keys and
Values.
4. Accumulo 1.8.0 has an Iterator Test Harness!
https://accumulo.apache.org/release_notes/1.8.0#iterator-test-harness
https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterator_testing
Lessons learned about Iterators
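Lesson 2 can be illustrated with a small sketch in which a filter reads only the bytes it needs and the rest of the value is decoded on first access. The value layout (two big-endian doubles followed by UTF-8 text), the `LazyRecord` class, and the decode counter are all hypothetical and chosen only to make the effect visible.

```python
import struct


class LazyRecord:
    """Wraps a serialized value and decodes fields only when asked."""

    HEADER = struct.Struct(">dd")   # hypothetical layout: x, y as doubles

    def __init__(self, raw):
        self._raw = raw
        self._text = None
        self.text_decodes = 0       # counts how often the payload was decoded

    @property
    def xy(self):
        # Cheap: reads the fixed-size header without touching the payload.
        return self.HEADER.unpack_from(self._raw, 0)

    @property
    def text(self):
        if self._text is None:
            self._text = self._raw[self.HEADER.size:].decode("utf-8")
            self.text_decodes += 1
        return self._text


def encode(x, y, text):
    return LazyRecord.HEADER.pack(x, y) + text.encode("utf-8")


def in_bbox(record, xmin, ymin, xmax, ymax):
    x, y = record.xy
    return xmin <= x <= xmax and ymin <= y <= ymax


records = [LazyRecord(encode(1, 1, "a")), LazyRecord(encode(50, 50, "b"))]
hits = [r for r in records if in_bbox(r, 0, 0, 10, 10)]
first_text = hits[0].text
```

The record that fails the spatial filter never has its text payload decoded, which is where the savings come from when most scanned values are rejected.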
Through our use of a) space filling curves, b) a cost-based query optimizer, and
c) carefully configured iterators, the GeoMesa query planner has a lot going on.
The GeoMesa query explainer logs 1) which index was used, 2) which ranges
were scanned, 3) Iterator configuration, etc.
Putting it all together: the GeoMesa Query Explainer
geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who = 'Bierce'"
Planning 'AccumuloQuickStart' Who = 'Bierce'
Original filter: Who = 'Bierce'
Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none]
Sort: none
Transforms: None
Strategy selection:
Query processing took 69ms and produced 1 options
Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]]
Strategy selection took 8ms for 1 options
Strategy 1 of 1: AttributeIdxStrategy
Strategy filter: ATTRIBUTE[Who = 'Bierce'][None]
Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan
Table: geomesa_attr
Deduplicate: false
Column Families: all
Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;)
Iterators (0):
Query planning took 119ms
Verify hints
Inspect strategies considered
See table and ranges to be scanned
Quantify planning time
● GeoMesa + Spark Setup
● GeoMesa + Spark Analytics
● GeoMesa powered notebooks
(Jupyter and Zeppelin)
GeoMesa’s Spark Support: Data Analysis and Discovery
Using Accumulo Iterators, we’ve seen how one can easily
perform simple ‘MapReduce’ style jobs without needing more
infrastructure.
NB: Those tasks are limited. One can filter inputs,
transform/map records and aggregate partial results on each
tablet server.
To implement more complex processes, we look to
MapReduce and Spark.
Accumulo provides an implementation of the MapReduce InputFormat interface.
Spark provides a way to change InputFormats into RDDs.
So with a little glue code and Spark classpath/environment
management, GeoMesa has Spark support!
GeoMesa MapReduce and Spark Support
GeoMesa Spark Example 1: Time Series
Step 1: Get an RDD[SimpleFeature]
Step 2: Calculate the time series
Step 3: Plot the time series in R.
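The slide's code did not survive extraction, so the following is only a plain-Python stand-in for Step 2: it counts features per day, which is the same shape of computation as mapping each feature to a `(day, 1)` pair and reducing by key in Spark. The dictionary-based features and the `dtg` field name are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime


def daily_counts(features, date_field="dtg"):
    """Count features per calendar day, returned in date order."""
    counts = Counter(f[date_field].date().isoformat() for f in features)
    return sorted(counts.items())


# Hypothetical features; in a Spark job these would come from an RDD.
features = [
    {"dtg": datetime(2016, 1, 1, 3)},
    {"dtg": datetime(2016, 1, 1, 20)},
    {"dtg": datetime(2016, 1, 2, 1)},
]
series = daily_counts(features)
```

The resulting list of `(day, count)` pairs is small enough to collect and hand to a plotting tool.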
Using one dataset (country boundaries) to group another (here, GDELT) is
effectively a join.
Our summer intern, Atallah, worked out the details of doing this analysis in Spark
and created a tutorial and blog post.
This picture shows ‘stability’ of a region from GDELT Goldstein values
GeoMesa Spark Example 2: Aggregating by Regions
http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/
http://www.geomesa.org/documentation/tutorials/shallow-join.html
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
Virginia Polygon CQL
GeoMesa RDD
Aggregate by County
Calculate ratio of #traffic
Store back to GeoMesa
GeoMesa Spark Example 3: Aggregating Tweets about #traffic
#traffic by Virginia county
Darker blue has a higher count
Problem: Another developer came by and mentioned that his Spark job using
GeoMesa had quite a few tasks (far more than expected).
Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the
Accumulo user list…
In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split.
With space filling curves, it is easy to enumerate plenty of ranges for a query.
Solution: The short-term solution was to create a custom InputFormat which
produces Splits that contain more than one range.
A small bump in the road…
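The grouping step can be sketched as follows: instead of one split per range, the sorted ranges are dealt into a bounded number of splits. The function name and the even-sized grouping are assumptions; an actual InputFormat would probably also take tablet locations into account.

```python
def group_ranges(ranges, max_splits):
    """Distribute sorted ranges over at most max_splits splits, in order."""
    if max_splits < 1:
        raise ValueError("max_splits must be at least 1")
    ranges = sorted(ranges)
    n = min(max_splits, len(ranges))
    if n == 0:
        return []
    size, extra = divmod(len(ranges), n)
    splits, start = [], 0
    for i in range(n):
        # The first `extra` splits take one additional range each.
        end = start + size + (1 if i < extra else 0)
        splits.append(ranges[start:end])
        start = end
    return splits


ranges = [(i, i) for i in range(10)]
splits = group_ranges(ranges, 3)
```

With ten ranges and a cap of three, the job gets three tasks of sizes 4, 3, and 3 instead of ten single-range tasks.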
Interactive Data Discovery at Scale in GeoMesa Notebooks
Writing (and debugging!) MapReduce /
Spark jobs is slow and requires
expertise.
A long development cycle for an
analytic saps energy and creativity.
The answer to both is interactive
‘notebook’ servers like Apache
Zeppelin and Jupyter (formerly
IPython Notebook).
There are two big things to work out:
1. Getting the right libraries on the
classpath.
2. Wiring up visualizations.
GeoMesa Notebook Roadmap:
● Improved JavaScript integration
● D3.js and other visualization
libraries
● OpenLayers and Leaflet
● Python Bindings
Questions?
Find out more at http://geomesa.org
Connect with us on Gitter:
https://gitter.im/locationtech/geomesa
See applications at CCRi’s blog:
http://www.ccri.com/blog/
Backup slides
http://www.eichelberger.org/sfseize/index.html
Talk filling curves
GeoMesa Converter Library
The Converter library is used in
1. The GeoMesa command line tools
2. GeoMesa’s NiFi processors
Configurations support XML, CSV, TSV, JSON, Avro, and more!
Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
Live view with the GeoMesa Kafka DataStore
Q: How did you get billions of points?
A: Data is streaming in continually.
Examples come from IoT related
applications:
10 thousand sensors reporting
every 5 seconds generate 1.2 billion
records in a week.
In these cases, we want to see where
things are right now.
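The 1.2 billion figure follows directly from the numbers on the slide, as this short calculation shows (the constant names are just for illustration):

```python
SENSORS = 10_000
REPORT_INTERVAL_S = 5
SECONDS_PER_WEEK = 7 * 24 * 60 * 60                          # 604,800

reports_per_sensor = SECONDS_PER_WEEK // REPORT_INTERVAL_S   # 120,960
records_per_week = SENSORS * reports_per_sensor              # 1,209,600,000
```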
GeoMesa Kafka DataStore Architecture
We have two issues to address:
1. In-memory index of
SimpleFeatures
2. Durable message passing system
For indexing, we use a combination of
Guava and CQEngine (efficient Java
collections).
Kafka serves as the message passing
system.
Consumer KDSes can be run in Storm
(for event processing), GeoServer (OGC
access), etc.
Z-Order Hilbert
Around 100 years ago, mathematicians asked the question,
“Is there a continuous function from the unit interval to the unit square
which covers it?”
Space Filling Curves: The Math
Row-Major
Streaming Data Architecture; Part 1
Continuous ingest:
GeoMesa-NiFi
leverages the
GeoMesa converter
library

Accumulo Summit 2016: GeoMesa: Using Accumulo for Optimized Spatio-Temporal Processing

  • 1. GeoMesa: Using Accumulo for optimized spatio-temporal processing Dr. James Hughes, CCRi james.hughes@ccri.com
  • 2. GeoMesa is ● A collection of libraries and modules which can be used to solve Big Geo Data problems ○ Great for managing billions to trillions of vector data ○ Great for streaming vector data ● Open sourced through Eclipse’s LocationTech working group and has graduated incubation ● Built on top of great open source libraries GeoMesa Background
  • 3. Such architectures allow for live views and near-real time processing (speed layer) while persisting the data for historic queries and batch analysis (batch layer). Client access to both layers can be handled by GeoServer. GeoMesa enables Lambda architectures
  • 4. Suppose we wish to monitor and understand a group of GPS-enabled and internet-enabled devices (ex: sensors, vehicles). ● GeoMesa’s ETL / converter library aids in re-usable data modeling. ● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest into Accumulo and Kafka topics. ● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as 1) geo-fencing, 2) location trackers, and 3) complex alerting rules. ● Effective storage in Accumulo allows for fast query returns. ● End-to-end visualization and analysis supports allows aggregations to pushed down to the Accumulo tablet servers. ● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc interactive analysis and data discovery. Example Use Case: Managing Internet-Aware Devices
  • 5. Suppose we wish to monitor and understand a group of GPS-enabled and internet-enabled devices (ex: sensors, vehicles). ● GeoMesa’s ETL / converter library aids in re-usable data modeling. ● GeoMesa’s NiFi support will let us move Flow Files around easily and ingest into Accumulo and Kafka topics. ● Leveraging GeoMesa’s Kafka DataStore, one can implement CEP such as 1) geo-fencing, 2) location trackers, and 3) complex alerting rules. ● Effective storage in Accumulo allows for fast query returns. ● End-to-end visualization and analysis supports allows aggregations to pushed down to the Accumulo tablet servers. ● GeoMesa’s Spark + Jupyter support allows for quick prototyping, ad hoc interactive analysis and data discovery. All of this adds up to “Speed! Speed! Speed!” whether you are looking at a live view of the data or pulling back an analysis product. Example Use Case: Managing Internet-Aware Devices
  • 6. Enabling and making visualization and analysis quick has been a journey and this talk is about our steps so far Talk Outline
  • 7. Enabling and making visualization and analysis quick has been a journey and this talk is about our steps so far 1. Space-filling curves and storing spatio-temporal data 2. Improvements to GeoMesa use and implementation of Accumulo Iterators 3. Spark and MapReduce for distributed computation Talk Outline
  • 8. Enabling and making visualization and analysis quick has been a journey and this talk is about our steps so far 1. Space-filling curves and storing spatio-temporal data 2. Improvements to GeoMesa use and implementation of Accumulo Iterators 3. Spark and MapReduce for distributed computation Not in this talk 1. Storm / NiFi - Streaming Ingest 2. Live views and online processing with Kafka 3. Command line tools 4. ETL / parser library 5. Machine learning / Deep Analytics Talk Outline
  • 9. ● Accumulo Key Design ● Space Filling Curves 101 ● Indices for Points with Time ● Indices for Lines and Polygons ● Lessons Learned GeoMesa's evolution of Accumulo schemas
  • 10. In a traditional stack, the application issues queries to a database which is responsible for query planning. Overview of query planning in Accumulo
  • 11. In a traditional stack, the application issues queries to a database which is responsible for query planning. Overview of query planning in Accumulo With Accumulo, the query planning is handled by library code in the application.
  • 16. ● Goal: Index 2+ dimensional data ● Approach: Use Space Filling Curves ● First, ‘grid’ the data space into bins. ● Next, order the grid cells with a space filling curve. ○ Label the grid cells by the order in which the curve visits them. ○ Associate the data in each grid cell with a byte representation of the label. ● We prefer “good” space filling curves: ○ Want recursive curves and locality. ● Space filling curves have higher dimensional analogs. Space Filling Curves (in one slide!)
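The gridding and labeling steps above can be sketched with a Z-order (Morton) curve, which interleaves the bits of the column and row numbers. This is a minimal illustration, not GeoMesa's implementation: the function names, the 2-bit default resolution, and the lon/lat normalization are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int) -> int:
    """Z-order index: bits of x land in even positions, bits of y in odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def grid_cell(lon: float, lat: float, bits: int) -> tuple[int, int]:
    """Bin a lon/lat pair into a (2^bits x 2^bits) grid over the whole world."""
    cells = 1 << bits
    # min() keeps the upper boundary (lon=180, lat=90) inside the last cell.
    col = min(int((lon + 180.0) / 360.0 * cells), cells - 1)
    row = min(int((lat + 90.0) / 180.0 * cells), cells - 1)
    return col, row


def z2_label(lon: float, lat: float, bits: int = 2) -> bytes:
    """Byte representation of the label of the grid cell containing the point."""
    col, row = grid_cell(lon, lat, bits)
    z = interleave_bits(col, row, bits)
    # Big-endian bytes sort in the same order as the integer labels.
    return z.to_bytes((2 * bits + 7) // 8, "big")
```

Because the label is stored big-endian, a sorted key/value store keeps cells that are adjacent on the curve next to each other on disk, which is the locality property the slide asks for.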
  • 17. To query for points in the grey rectangle, the query planner enumerates a collection of index ranges which cover the area. Note: Most queries won’t line up perfectly with the gridding strategy. Further filtering can be run on the Accumulo tablet servers with Iterators (next section) or we can return ‘loose’ bounding box results (likely more quickly). Query planning with Space Filling Curves
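As a rough sketch of the range enumeration described above, the following brute-force version labels every grid cell in the query rectangle and merges consecutive labels into inclusive ranges. It assumes a Z-order curve and integer cell coordinates; a real planner would decompose the curve recursively instead of visiting every cell, and the function names here are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int) -> int:
    """Z-order index: bits of x in even positions, bits of y in odd positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def covering_ranges(xmin: int, xmax: int, ymin: int, ymax: int, bits: int):
    """Sorted, merged, inclusive (start, end) index ranges covering a cell rectangle."""
    labels = sorted(
        interleave_bits(x, y, bits)
        for x in range(xmin, xmax + 1)
        for y in range(ymin, ymax + 1)
    )
    ranges = []
    for z in labels:
        if ranges and z == ranges[-1][1] + 1:
            # Extend the current range when the label is consecutive.
            ranges[-1] = (ranges[-1][0], z)
        else:
            ranges.append((z, z))
    return ranges
```

On a 4x4 grid (`bits=2`), a query aligned with the curve, such as columns 0..1 and rows 0..1, collapses to the single range `(0, 3)`, while the misaligned rectangle of columns 1..2 and rows 0..1 needs three ranges. That is why a query with many small ranges may trade exactness for fewer, looser ranges.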
  • 18. GeoMesa Key Structure for the ‘Z3’ table. GeoMesa has several tables, each optimized for a particular use case. The Z3 table is used with and optimized for temporal point data. (Think sensor observations, track reports, or other events which happen at a particular location.) Key layout: Row = Shard (1 byte) + Epoch Week (2 bytes) + Z3(x,y,t) (8 bytes); Column Family = ‘F’; Column Qualifier; Value = Record. Example, here and now: (38.9864985, -76.9561856), 10:15am, Tuesday, Oct. 11th, 2016. Epoch Week: 2440; X value: 1275689; Y value: 151972; T value: 2097151; Z3 (as a long): 6430470637115132837
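The row layout on this slide (1-byte shard, 2-byte epoch week, 8-byte Z3 value) can be sketched as a byte-packing exercise. This is an illustration under stated assumptions rather than GeoMesa's code: the helper names are hypothetical, the timestamp is assumed to be UTC, the week is treated as an unsigned 16-bit value, and the Z3 value is taken from the slide rather than computed.

```python
import struct
from datetime import datetime, timezone

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def epoch_week(ts: datetime) -> int:
    """Whole weeks elapsed since 1970-01-01T00:00:00Z."""
    return int(ts.timestamp()) // SECONDS_PER_WEEK


def z3_row_key(shard: int, week: int, z3: int) -> bytes:
    """Pack shard (1 byte), epoch week (2 bytes) and Z3 (8 bytes), big-endian."""
    # Big-endian keeps byte-wise sort order equal to numeric order
    # for the non-negative values used here.
    return struct.pack(">BHq", shard, week, z3)


# Values from the slide; the shard number 0 is an arbitrary choice.
when = datetime(2016, 10, 11, 10, 15, tzinfo=timezone.utc)
key = z3_row_key(0, epoch_week(when), 6430470637115132837)
```

With these inputs `epoch_week(when)` is 2440, matching the slide, and the packed row is 11 bytes long.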
  • 20. Indexing non-point geometries: New XZ Index. Most approaches to indexing non-point geometries involve covering the geometry with a number of grid cells and storing a copy under each cell's index. This means that the client has to deduplicate results, which is expensive. Böhm, Klump, and Kriegel describe an indexing strategy that allows such geometries to be stored once. GeoMesa has implemented this strategy in XZ2 (spatial-only) and XZ3 (spatio-temporal) tables. The key is to store data by resolution, separate geometries by size, and then index them by their lower left corner. This does require consideration on the query planning side, but avoiding deduplication is worth the trade-off. For more details, see Böhm, Klump, and Kriegel. “XZ-ordering: a space-filling curve for objects with spatial extension.” 6th Int. Symposium on Large Spatial Databases (SSD), 1999, Hong Kong, China. (http://www.dbs.ifi.lmu.de/Publikationen/Boehm/Ordering_99.pdf)
  • 21. ● Accumulo Iterator Overview ● GeoMesa Iterators for Analysis and Visualization ● Iterator Lessons Learned GeoMesa's use of Accumulo Iterators
  • 22. “Iterators provide a modular mechanism for adding functionality to be executed by TabletServers when scanning or compacting data. This allows users to efficiently summarize, filter, and aggregate data.” -- Accumulo 1.7 documentation. Part of the modularity is that the iterators can be stacked: the output of one can be wired into the next. Example: The first iterator might read from disk, the second could filter with Authorizations, and a final iterator could filter by column family. Other notes: ● Iterators provide a sorted view of the key/values. ● Iterator code can be loaded from HDFS and namespaced! Accumulo Iterators
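The stacking idea can be illustrated with plain Python generators, where each stage consumes the sorted output of the stage below it. This is only an analogy: Accumulo iterators are Java classes that run inside the tablet server, and the `Entry` type, the stage names, and the simplified visibility check below are all assumptions made for the sketch.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Entry(NamedTuple):
    row: str
    family: str
    visibility: str
    value: bytes


def source(entries: Iterable[Entry]) -> Iterator[Entry]:
    """Bottom of the stack: stands in for the sorted read from disk."""
    yield from sorted(entries)


def visibility_filter(upstream: Iterator[Entry], auths: Set[str]) -> Iterator[Entry]:
    """Drop entries whose (simplified, single-label) visibility is not authorized."""
    for entry in upstream:
        if entry.visibility == "" or entry.visibility in auths:
            yield entry


def family_filter(upstream: Iterator[Entry], family: str) -> Iterator[Entry]:
    """Keep only entries in the requested column family."""
    for entry in upstream:
        if entry.family == family:
            yield entry


entries = [
    Entry("r2", "F", "public", b"b"),
    Entry("r1", "F", "secret", b"a"),
    Entry("r3", "G", "public", b"c"),
    Entry("r0", "F", "", b"z"),
]
# Wire the output of each stage into the next.
stack = family_filter(visibility_filter(source(entries), {"public"}), "F")
rows = [entry.row for entry in stack]
```

Each filter only drops entries, so the sorted order established by the bottom stage survives all the way up the stack.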
  • 25. Visualization Example: Heatmaps Without powerful visualization options, big data is big nonsense. Consider this view of shipping in the Mediterranean Sea. Heatmaps help show patterns, and they can be accelerated with GeoMesa. (Diagram: Heatmap Request → HeatMap WPS → Query Hints)
  • 26. A request to GeoMesa consists of two broad pieces: 1. A filter restricting the data to act on, e.g.: a. Records in Maryland with ‘Accumulo’ in the text field. b. Records during the first week of 2016. 2. A request for ‘how’ to return the data, e.g.: a. Return the full records b. Return a subset of the record (either a projection or ‘bin’ file format) c. Return a histogram d. Return a heatmap / kernel density Generally, a filter can be handled partially by selecting which ranges to scan; the remainder can be handled by an Iterator. Modifications to selected data can also be handled by a GeoMesa Iterator. GeoMesa Data Requests
  • 27. Initial GeoMesa Iterator design. The first pass of GeoMesa iterators separated concerns into separate iterators. The GeoMesa query planner assembled a stack of iterators to achieve the desired result. (Image from “Spatio-temporal Indexing in Non-relational Distributed Databases” by Anthony Fox, Chris Eichelberger, James Hughes, Skylar Lyon.)
  • 28. The key benefit of decomposed iterators is that they are easier to understand and re-mix. The performance cost is that each iterator must interpret the bytes in the Key and Value itself, which in many cases leads to additional serialization/deserialization. Now, we prefer to write Iterators which transform the underlying data into what the client code expects in one go. Second GeoMesa Iterator design
  • 29. 1. Using fewer iterators in the stack can be beneficial. 2. Using lazy evaluation / deserialization when filtering Values can yield speed improvements. 3. Iterators take in sorted Keys and Values and *must* produce sorted Keys and Values. 4. Accumulo 1.8.0 has an Iterator Test Harness! https://accumulo.apache.org/release_notes/1.8.0#iterator-test-harness https://accumulo.apache.org/1.8/accumulo_user_manual.html#_iterator_testing Lessons learned about Iterators
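One way to picture the lazy deserialization lesson is a value layout that stores attribute offsets up front, so a filter can read the one attribute it needs without decoding the rest. The layout below is hypothetical and is not GeoMesa's serialization format; `encode` and `read_attr` are names invented for the sketch.

```python
import struct
from typing import List


def encode(attrs: List[str]) -> bytes:
    """Header: attribute count (2 bytes) + end offset of each attribute (4 bytes each)."""
    blobs = [a.encode("utf-8") for a in attrs]
    offsets, pos = [], 0
    for blob in blobs:
        pos += len(blob)
        offsets.append(pos)
    header = struct.pack(">H", len(blobs)) + struct.pack(f">{len(blobs)}I", *offsets)
    return header + b"".join(blobs)


def read_attr(raw: bytes, index: int) -> str:
    """Decode a single attribute by slicing; the other attributes stay as raw bytes."""
    (count,) = struct.unpack_from(">H", raw, 0)
    body_start = 2 + 4 * count
    (end,) = struct.unpack_from(">I", raw, 2 + 4 * index)
    start = struct.unpack_from(">I", raw, 2 + 4 * (index - 1))[0] if index > 0 else 0
    return raw[body_start + start:body_start + end].decode("utf-8")


raw = encode(["Bierce", "2016-10-11", "POINT (-76.96 38.99)"])
matches = read_attr(raw, 0) == "Bierce"  # the filter touches only attribute 0
```

A filter such as `Who = 'Bierce'` then pays for decoding one string per record, and the full record is only materialized for entries that pass.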
  • 30. Putting it all together: the GeoMesa Query Explainer. Through our use of a) space filling curves, b) a cost-based query optimizer, and c) carefully configured iterators, the GeoMesa query planner has a lot going on. The GeoMesa query explainer logs 1) which index was used, 2) which ranges were scanned, 3) the Iterator configuration, etc. The output lets you verify hints, inspect the strategies considered, see the table and ranges to be scanned, and quantify planning time:
        geomesa> geomesa explain -u USER -p PASS -i INSTANCE -c geomesa -z zoo1,zoo2,zoo3 -f AccumuloQuickStart -q "Who = 'Bierce'"
        Planning 'AccumuloQuickStart' Who = 'Bierce'
        Original filter: Who = 'Bierce'
        Hints: density[false] bin[false] stats[false] map-aggregate[false] sampling[none]
        Sort: none
        Transforms: None
        Strategy selection:
        Query processing took 69ms and produced 1 options
        Filter plan: FilterPlan[ATTRIBUTE[Who = 'Bierce'][None]]
        Strategy selection took 8ms for 1 options
        Strategy 1 of 1: AttributeIdxStrategy
        Strategy filter: ATTRIBUTE[Who = 'Bierce'][None]
        Plan: org.locationtech.geomesa.accumulo.index.BatchScanPlan
        Table: geomesa_attr
        Deduplicate: false
        Column Families: all
        Ranges (1): [%01;%00;%00;Bierce%00;::%01;%00;%00;Bierce%01;)
        Iterators (0):
        Query planning took 119ms
  • 31. ● GeoMesa + Spark Setup ● GeoMesa + Spark Analytics ● GeoMesa powered notebooks (Jupyter and Zeppelin) GeoMesa’s Spark Support: Data Analysis and Discovery
  • 35. Using Accumulo Iterators, we’ve seen how one can easily perform simple ‘MapReduce’ style jobs without needing more infrastructure. NB: Those tasks are limited. One can filter inputs, transform/map records, and aggregate partial results on each tablet server. To implement more complex processes, we look to MapReduce and Spark. Accumulo implements the MapReduce InputFormat interface. Spark provides a way to change InputFormats into RDDs. So with a little glue code and Spark classpath/environment management, GeoMesa has Spark support! GeoMesa MapReduce and Spark Support
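The "filter, map, and aggregate partial results on each tablet server" pattern can be sketched with the heatmap case from earlier in the talk: each server bins its own points into a small grid, and the client only merges the grids. This is a conceptual sketch in plain Python, with hypothetical function names and made-up sample points; it does not use the Accumulo or GeoMesa APIs.

```python
from collections import Counter


def tablet_density(points, bbox, width, height):
    """Stands in for server-side work: bin one tablet's points into a width x height grid."""
    xmin, ymin, xmax, ymax = bbox
    grid = Counter()
    for lon, lat in points:
        if not (xmin <= lon <= xmax and ymin <= lat <= ymax):
            continue  # filter step: outside the requested bounding box
        col = min(int((lon - xmin) / (xmax - xmin) * width), width - 1)
        row = min(int((lat - ymin) / (ymax - ymin) * height), height - 1)
        grid[(col, row)] += 1  # map + partial aggregate
    return grid


def merge_partials(partials):
    """Client-side step: sum the per-tablet grids into the final heatmap."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


bbox = (0.0, 0.0, 10.0, 10.0)
tablet_a = [(1.0, 1.0), (2.0, 2.0), (9.0, 9.0)]
tablet_b = [(1.0, 4.0), (6.0, 1.0), (20.0, 20.0)]
heatmap = merge_partials(
    tablet_density(pts, bbox, 2, 2) for pts in (tablet_a, tablet_b)
)
```

The amount of data sent back to the client depends on the grid size rather than the number of matching records, which is what makes this pattern attractive for visualization. Anything that needs data from several servers at once, such as a join, falls outside it and motivates MapReduce or Spark.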
  • 36. GeoMesa Spark Example 1: Time Series Step 1: Get an RDD[SimpleFeature] Step 2: Calculate the time series Step 3: Plot the time series in R.
  • 37. GeoMesa Spark Example 2: Aggregating by Regions. Using one dataset (country boundaries) to group another (here, GDELT) is effectively a join. Our summer intern, Atallah, worked out the details of doing this analysis in Spark and created a tutorial and blog post. This picture shows the ‘stability’ of a region from GDELT Goldstein values. http://www.ccri.com/2016/08/17/new-geomesa-tutorial-aggregating-visualizing-data/ http://www.geomesa.org/documentation/tutorials/shallow-join.html
  • 38. GeoMesa Spark Example 3: Aggregating Tweets about #traffic. Pipeline: Virginia Polygon CQL → GeoMesa RDD → Aggregate by County → Calculate ratio of #traffic → Store back to GeoMesa
  • 39. GeoMesa Spark Example 3: Aggregating Tweets about #traffic. Map: #traffic by Virginia county. Darker blue indicates a higher count.
  • 40. Problem: Another developer came by and mentioned that his Spark job using GeoMesa had quite a few tasks (far more than expected). Around the same time, Eugene Cheipesh (Azavea / GeoTrellis) wrote in to the Accumulo user list… In Accumulo 1.6.x, each range in the Accumulo InputFormat becomes a Split. With space filling curves, it is easy to enumerate plenty of ranges for a query. Solution: The short-term solution was to create a custom InputFormat which produces Splits that each contain more than one range. A small bump in the road…
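The fix described above amounts to packing many ranges into each split so that the number of tasks no longer tracks the number of ranges. A minimal sketch of that grouping follows; the function name and the even, order-preserving chunking are assumptions, and a production InputFormat would more likely group ranges by the tablet server that hosts them.

```python
def group_ranges(ranges, max_splits):
    """Distribute ranges over at most max_splits groups, preserving their order."""
    if max_splits < 1:
        raise ValueError("max_splits must be at least 1")
    if not ranges:
        return []
    size = -(-len(ranges) // max_splits)  # ceiling division
    return [ranges[i:i + size] for i in range(0, len(ranges), size)]


# Ten single-cell ranges, as a space filling curve query might produce.
ranges = [(i, i) for i in range(10)]
splits = group_ranges(ranges, 3)
```

Here ten ranges become three splits (of 4, 4, and 2 ranges) instead of ten one-range splits, so the Spark job launches three tasks.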
  • 42. Interactive Data Discovery at Scale in GeoMesa Notebooks Writing (and debugging!) MapReduce / Spark jobs is slow and requires expertise. A long development cycle for an analytic saps energy and creativity. The answer to both is interactive ‘notebook’ servers like Apache Zeppelin and Jupyter (formerly IPython Notebook). There are two big things to work out: 1. Getting the right libraries on the classpath. 2. Wiring up visualizations.
  • 43. Interactive Data Discovery at Scale in GeoMesa Notebooks GeoMesa Notebook Roadmap: ● Improved JavaScript integration ● D3.js and other visualization libraries ● OpenLayers and Leaflet ● Python Bindings
  • 44. Questions? Find out more at http://geomesa.org Connect with us on Gitter: https://gitter.im/locationtech/geomesa See applications at CCRi’s blog: http://www.ccri.com/blog/
  • 47. GeoMesa Converter Library The Converter library is used in 1. the GeoMesa command line tools and 2. GeoMesa’s NiFi processors. Configurations support XML, CSV, TSV, JSON, Avro, and more! Examples are available for GeoNames, GDELT, OSM-GPX, Twitter, and others.
  • 48. Live view with the GeoMesa Kafka DataStore Q: How did you get billions of points? A: Data is streaming in continually. Examples come from IoT related applications: 10 thousand sensors reporting every 5 seconds generate 1.2 billion records in a week. In these cases, we want to see where things are right now.
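The 1.2 billion figure follows directly from the reporting rate, as this short worked calculation shows (the constant names are just for illustration):

```python
SENSORS = 10_000
REPORT_INTERVAL_SECONDS = 5
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

reports_per_sensor = SECONDS_PER_WEEK // REPORT_INTERVAL_SECONDS  # 120,960
records_per_week = SENSORS * reports_per_sensor  # 1,209,600,000
```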
  • 49. GeoMesa Kafka DataStore Architecture We have two issues to address: 1. In-memory index of SimpleFeatures 2. Durable message passing system For indexing, we use a combination of Guava and CQEngine (efficient Java collections). Kafka serves as the message passing system. Consumer KDSes can be run in Storm (for event processing), GeoServer (OGC access), etc.
  • 50. Space Filling Curves: The Math. Around 100 years ago, mathematicians asked the question, “Is there a continuous function from the unit interval to the unit square which covers it?” (Figure panels: Row-Major, Z-Order, Hilbert.)
  • 51. Streaming Data Architecture; Part 1 Continuous ingest: GeoMesa-NiFi leverages the GeoMesa converter library