1
High Performance and Scalable Geospatial
Analytics on Cloud with Open Source
James Hughes – CCRI
Constantin Stanca – Hortonworks
3
Summary
• Loading geospatial data into the cloud and GeoTools datastores never seems as easy as
it should be. There are sensor networks, GPS devices, Twitter streams, FTP servers and all
sorts of other data that you need to parse, convert to SimpleFeatures, and then ingest.
• GeoMesa, NiFi and Spark provide a fully open source solution to ease the pain of
ingesting and analyzing data using ANY GeoTools data store.
• DataPlane Services Cloud Manager (powered by Cloudbreak) helps you to deploy
ephemeral geospatial analytics clusters to support increased computation
requirements, all decoupled from storage.
• We will show how real-time streaming data such as satellite AIS can be ingested and
managed in real time with NiFi. We will also show how geospatial data stored in S3, HDFS, or
HBase, or in ORC or Parquet files, can be queried at scale using GeoMesa, Spark and Zeppelin.
4
Geospatial Analytics
Challenges
5
Data Movement & System Complexity with Added Pressure of Big Data
[Diagram labels: Acquire Data (×4), Store Data (×5), Data Flow, Process and Analyze Data]
6
If That Was Not Enough …
Spatial Data Types
Points
Locations
Events
Instantaneous
Positions
Lines
Road networks
Voyages
Trips
Trajectories
Polygons
Administrative
Regions
Airspaces
7
If That Was Not Enough …
Spatial Data Relationships
equals
disjoint
intersects
touches
crosses
within
contains
overlaps
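The relationships above are defined for arbitrary geometries. As a minimal sketch, assuming only axis-aligned rectangles with positive area, several of them reduce to coordinate comparisons. The `Box` type and the function names are hypothetical illustrations, not a GeoTools or JTS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; a hypothetical stand-in for a real geometry type."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def disjoint(a: Box, b: Box) -> bool:
    # No point in common: one box lies entirely to one side of the other.
    return a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin


def intersects(a: Box, b: Box) -> bool:
    # Intersects is the negation of disjoint.
    return not disjoint(a, b)


def contains(a: Box, b: Box) -> bool:
    # Every point of b lies inside a (valid for boxes with positive area).
    return a.xmin <= b.xmin and a.ymin <= b.ymin and a.xmax >= b.xmax and a.ymax >= b.ymax


def within(a: Box, b: Box) -> bool:
    # Within is contains with the arguments swapped.
    return contains(b, a)


def touches(a: Box, b: Box) -> bool:
    # The boxes share boundary points but their interiors do not overlap.
    if disjoint(a, b):
        return False
    interiors_overlap = (a.xmin < b.xmax and b.xmin < a.xmax
                         and a.ymin < b.ymax and b.ymin < a.ymax)
    return not interiors_overlap
```

Real geometry libraries evaluate these predicates for points, lines and polygons of any shape, which is considerably more involved than the rectangle case shown here.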
8
If That Was Not Enough …
Topology Operations
Algorithms
Convex Hull
Buffer
Validation
Dissolve
Polygonization
Simplification
Triangulation
Voronoi
Linear Referencing
and more...
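To give a feel for one of the algorithms listed above, here is a minimal sketch of a convex hull computation using Andrew's monotone chain method on plain `(x, y)` tuples. It is an illustration in pure Python; the function name is an assumption and it is not how any particular geometry library implements the operation.

```python
def convex_hull(points):
    """Return the convex hull vertices in counter-clockwise order.

    Uses Andrew's monotone chain algorithm; collinear points on the
    hull boundary are dropped.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Z component of the cross product (a - o) x (b - o);
        # positive means a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]
```

For example, calling `convex_hull` on the four corners of a square plus an interior point returns only the four corners.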
9
Requirements for a High
Performance Geospatial
Analytics Platform
10
Traditional Approach
• GIS, data crunching and web serving were three very separate worlds.
• If a web app wanted access to the analysis there was a long process of ETL, DB work,
imports and exports, and bribing various network and storage people for the resources
you needed.
11
Requirements for a High Performance Geospatial Analytics Platform
• IoT sensors present an opportunity to understand the world right now
• A map of the current state of the world enables faster reactions
• The variety of sensors and data sources presents data management challenges
• Adding new, varied data sources must be easy
• Big data requires distributed storage / computation and scalable infrastructure
• The data layer has to scale
• Analysis has to be easy
12
Scalable Geospatial
Analytics on the Cloud
13
How Cloud Helps to Address Geospatial Big Data Challenges
• Challenges:
• Big data problem (derive insights from all data)
• Compute resources when they are needed (easy scale, easy access to data)
• Solution:
• The cloud elastically provides the needed compute resources, all decoupled from storage, whether
that is an object store, file system or NoSQL database.
14
Importance for Geospatial Analytics
• Spatial streaming visualizations and analytics can present near real-time insights
• Decision makers can respond more rapidly when they see live data feeds on a map
• Spatial batch analytics can fuse multiple data sources together to understand a region
• Patterns of life emerge
• Advertisers can plan their next campaigns
• Businesses can locate their new store sites
15
Cloudbreak
• Cloudbreak can address geospatial
computational capacity needs
• Easily spin up auto-scaling clusters for
different workloads and purposes, whether
it is a geospatial ingest cluster with NiFi and
GeoMesa, or a geospatial analytics cluster
with Spark and GeoMesa.
• Data can reside in your object store or even
in a persistent data store.
• These ephemeral clusters can be scheduled
for a period of time or only until the job is
done, so you pay only for what you use.
16
LocationTech GeoMesa
17
How GeoMesa Helps with Geospatial Data Type Challenges
• Challenges:
• Vector & raster data
• Geospatial data types
• Solution:
• GeoMesa tools for streaming, persisting, managing, and analyzing spatio-temporal data at scale
18
What Is GeoMesa?
A suite of tools for streaming, persisting, managing, and
analyzing spatio-temporal data at scale
23
Proposed Reference Architecture
24
How Does HDP/HDF + GeoMesa Stream Data?
• The GeoMesa Kafka DataStore allows data producers to write CRUD messages to a Kafka
topic.
• Consumers of that topic build up an in-memory representation of the current state of
the world.
• This allows for
• live maps,
• real time analytics, and
• complex event processing.
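The consumer side of this pattern can be sketched without Kafka at all. The example below assumes a hypothetical message shape of `(operation, feature id, attributes)`; it is not GeoMesa's wire format, only an illustration of how replaying create, update and delete messages yields the current state of the world.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical message shape: (operation, feature id, attributes or None).
Message = Tuple[str, str, Optional[dict]]


def apply_messages(state: Dict[str, dict], messages: Iterable[Message]) -> Dict[str, dict]:
    """Replay CRUD messages in order and return the updated in-memory state."""
    for op, fid, attrs in messages:
        if op == "create":
            state[fid] = dict(attrs or {})
        elif op == "update":
            # Merge new attribute values over whatever is already known.
            state[fid] = {**state.get(fid, {}), **(attrs or {})}
        elif op == "delete":
            state.pop(fid, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


# Example: two ships appear, one moves, one leaves the feed.
messages = [
    ("create", "ship-1", {"lon": -94.80, "lat": 29.70}),
    ("create", "ship-2", {"lon": -90.10, "lat": 28.90}),
    ("update", "ship-1", {"lon": -94.75}),
    ("delete", "ship-2", None),
]
current = apply_messages({}, messages)
```

After the replay, `current` holds only `ship-1` at its latest position, which is the kind of snapshot a live map would render.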
25
How Does HDP/HDF + GeoMesa Persist Data?
GeoMesa integrates with HBase and Accumulo:
• Key structures use space filling curves
• Complex geospatial filters and processing can be
‘pushed down’ using Filters, Coprocessors, and Iterators
GeoMesa’s File System Datastore provides the ability to
store spatio-temporally indexed data in a cloud object
store such as S3, using storage formats like ORC or Parquet.
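One well-known space-filling curve is the Z-order (Morton) curve, which interleaves the bits of two grid coordinates into a single sortable integer. The sketch below illustrates that general idea only; the function names and the 16-bit resolution are assumptions, and it does not reproduce GeoMesa's actual key layout.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions, y in odd)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def z2_key(lon: float, lat: float, bits: int = 16) -> int:
    """Map a longitude/latitude pair to a Z-order key on a 2^bits x 2^bits grid."""
    cells = 1 << bits
    # Scale each coordinate into [0, cells) and clamp the upper edge.
    x = min(cells - 1, int((lon + 180.0) / 360.0 * cells))
    y = min(cells - 1, int((lat + 90.0) / 180.0 * cells))
    return interleave_bits(x, y, bits)
```

Because points that are close on the map tend to receive numerically close keys, a sorted key-value store can answer a bounding-box query by scanning a limited number of key ranges rather than the whole table.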
26
Geospatial Data Flow Transformation
with NiFi and GeoMesa
27
Geospatial Data Flow Transformation with NiFi and GeoMesa
[Diagram labels: Satellite AIS; Spatial Data; Edge Geo Data; Edge Analytics; Geo Data in Motion (on-premises, cloud); Geo Data at Rest (on-premises, cloud); Closed Loop Analytics; Machine Learning; Deep Historical Analysis; On-Prem; Cloud]
28
GeoMesa NiFi
• GeoMesa-NiFi allows you to ingest data into GeoMesa straight from NiFi by leveraging
custom processors.
• NiFi allows you to ingest data into GeoMesa from every source GeoMesa supports and
more.
[Diagram labels: Data; SimpleFeatureType Schema; GeoMesa NiFi Processors; enabled datastores]
29
GeoMesa NiFi Processors
• PutGeoMesaAccumulo: Ingest data into a GeoMesa Accumulo datastore with a
GeoMesa converter or from geoavro
• PutGeoMesaHBase: Ingest data into a GeoMesa HBase datastore with a GeoMesa
converter or from geoavro
• PutGeoMesaFileSystem: Ingest data into a GeoMesa File System datastore with a
GeoMesa converter or from geoavro
• PutGeoMesaKafka: Ingest data into a GeoMesa Kafka datastore with a GeoMesa
converter or from geoavro
• PutGeoTools: Ingest data into an arbitrary GeoTools datastore using a GeoMesa
converter or avro
• ConvertToGeoAvro: Use a GeoMesa converter to create geoavro
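Conceptually, a converter turns raw records into features with an id, a geometry and a timestamp. The sketch below shows that step in plain Python for a delimited AIS-like feed. The column names (`mmsi`, `timestamp`, `lon`, `lat`), the timestamp format and the output dictionary are all assumptions for illustration; this is not the GeoMesa converter configuration format.

```python
import csv
from datetime import datetime, timezone
from io import StringIO


def convert(lines):
    """Yield feature-like dicts from CSV lines, skipping records that fail to parse."""
    for row in csv.DictReader(lines):
        try:
            lon = float(row["lon"])
            lat = float(row["lat"])
            dtg = datetime.strptime(row["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
            dtg = dtg.replace(tzinfo=timezone.utc)
            fid = f'{row["mmsi"]}-{row["timestamp"]}'
        except (KeyError, TypeError, ValueError):
            continue  # malformed or incomplete record
        if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            continue  # coordinates outside the valid range
        yield {"id": fid, "mmsi": row["mmsi"], "dtg": dtg, "geom": (lon, lat)}


# Hypothetical sample input: one good record, one bad date, one bad longitude.
sample = StringIO(
    "mmsi,timestamp,lon,lat\n"
    "367000001,2017-08-25T12:00:00Z,-94.75,29.70\n"
    "367000002,not-a-date,-94.00,29.00\n"
    "367000003,2017-08-25T12:05:00Z,-200.00,29.00\n"
)
features = list(convert(sample))
```

Only the first record survives conversion; the other two are dropped rather than ingested with bad values.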
30
Analyze Geospatial Data with
GeoMesa and Spark
31
How does HDP + GeoMesa analyze geospatial data?
• GeoMesa integrates deeply with Spark to:
• create spatial User Defined Types and User Defined Functions
• (based on LocationTech JTS, a geometry library)
• optimize spatial queries against GeoMesa DataSources
• persist output data back to GeoMesa
• leverage Zeppelin notebooks to allow for rapid innovation and creativity
• Zeppelin allows analysts to visualize results easily
32
DEMO
Data Ingest and Interactive Insights with
GeoMesa, NiFi, Spark and Zeppelin
33
Demo
• Introduce EE dataset
• Data management / NiFi overview
• Real-time view + historical recall
• Spark Analysis
34
NiFi-GeoMesa Data Flow
35
Satellite AIS
36
Setup
● Import GeoMesa
dependency
● Create dataframe
backed by GeoMesa
relation
● Create SQL temporary
view so we can query
it
37
Sub-select Data
● Create rough sub-selection of data
■ Bound by time
■ Bound by bounding box roughly around
the Gulf of Mexico
● Create a new temporary view from this
sub-selection
● Cache the data (pull into memory)
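The sub-selection step amounts to a time filter combined with a bounding-box filter. The sketch below shows the same logic on plain Python records rather than a Spark view; the record fields, the bounding-box coordinates and the dates are hypothetical values chosen for illustration.

```python
from datetime import datetime


def subselect(records, bbox, start, end):
    """Keep records inside the bounding box and within [start, end)."""
    xmin, ymin, xmax, ymax = bbox
    # Building a list materializes the result in memory, loosely
    # analogous to caching the sub-selection.
    return [
        r for r in records
        if start <= r["dtg"] < end
        and xmin <= r["lon"] <= xmax
        and ymin <= r["lat"] <= ymax
    ]


# Hypothetical rough box around the Gulf of Mexico: (xmin, ymin, xmax, ymax).
GULF_BBOX = (-98.0, 18.0, -80.0, 31.0)

records = [
    {"mmsi": "a", "dtg": datetime(2017, 8, 25, 12), "lon": -94.8, "lat": 29.7},
    {"mmsi": "b", "dtg": datetime(2017, 8, 25, 12), "lon": -70.0, "lat": 40.0},
    {"mmsi": "c", "dtg": datetime(2017, 10, 1), "lon": -94.8, "lat": 29.7},
]
gulf = subselect(records, GULF_BBOX, datetime(2017, 8, 1), datetime(2017, 9, 15))
```

Here only record `a` is kept: `b` falls outside the box and `c` falls outside the time window.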
38
Data Exploration
● Query for Tankers in the
Gulf
● Get counts for each type
of Tanker
● Group the counts by day
● Graph counts to see
trends
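The counting and grouping described above can be sketched in a few lines of plain Python. The field names (`dtg`, `vessel_type`) and the sample vessel types are assumptions; in the demo this kind of aggregation would be expressed as a SQL query over the temporary view.

```python
from collections import Counter
from datetime import date, datetime


def daily_counts(records):
    """Count records per (day, vessel type), returned in sorted key order."""
    counts = Counter()
    for r in records:
        counts[(r["dtg"].date(), r["vessel_type"])] += 1
    return dict(sorted(counts.items()))


# Hypothetical observations.
observations = [
    {"dtg": datetime(2017, 8, 25, 1), "vessel_type": "Tanker"},
    {"dtg": datetime(2017, 8, 25, 9), "vessel_type": "Tanker"},
    {"dtg": datetime(2017, 8, 26, 3), "vessel_type": "Tanker - Hazard A"},
]
per_day = daily_counts(observations)
first_key = (date(2017, 8, 25), "Tanker")
```

The resulting mapping can be fed directly to a charting tool to plot one series per vessel type over time.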
39
Data Exploration
● Restrict our search to
just Trinity Bay
40
Data Exploration
● Create a new
temporary view of
the number of ships
in Trinity Bay
41
Extra Data
● Pull in Gas price data
○ Acquired from
EIA.gov
○ Two Gas Price
Indexes
■ NYH: New York
Harbor
■ GC: Gulf Coast
● Create temporary view
so we can analyze with
SQL
42
Data Exploration
● Graph data over
time period of
Harvey
● Notice we don’t
have daily values
43
Data Exploration
● Create temporary
view of gas price
data around our
time of interest
44
Data Exploration
● Backfill missing days in the price data with
the last known value to give us
day-continuous data
● Min/max normalize gas prices and
ship counts
● Graph gas prices and ship
counts together
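Both preparation steps are simple enough to sketch in plain Python. The first function carries the last known price forward over days with no value; the second rescales a series to the range 0 to 1 so that prices and ship counts can share an axis. The function names and the sample prices are hypothetical.

```python
from datetime import date, timedelta


def fill_daily(prices, start, end):
    """Return one value per day in [start, end], carrying the last known value forward."""
    filled = {}
    last = None
    day = start
    while day <= end:
        if day in prices:
            last = prices[day]
        if last is not None:  # days before the first observation stay absent
            filled[day] = last
        day += timedelta(days=1)
    return filled


def min_max(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical prices with gaps (for example, no values on weekends).
prices = {date(2017, 8, 25): 1.60, date(2017, 8, 28): 1.80}
daily = fill_daily(prices, date(2017, 8, 25), date(2017, 8, 29))
scaled = min_max(list(daily.values()))
```

Once both series are normalized this way, they can be plotted together even though their raw units differ.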
45
Resources
46
Resources
• GeoMesa Project: http://www.geomesa.org/
• GeoMesa-NiFi: http://www.geomesa.org/documentation/user/nifi.html
• GeoMesa-Spark: http://www.geomesa.org/documentation/user/spark/index.html
• Articles:
• http://www.ccri.com/2017/03/20/new-geomesa-spark-sql-zeppelin-notebooks-support/
• http://www.ccri.com/2018/02/26/interactive-insights-hurricane-harveys-impact-energy-production-geomesa-jupyter-notebooks/
47
Thank you

