GeoMesa on Apache Spark SQL with Anthony Fox

Anthony Fox
Director of Data Science, Commonwealth Computer Research Inc
GeoMesa Founder and Technical Lead
anthony.fox@ccri.com
linkedin.com/in/anthony-fox-ccri
twitter.com/algoriffic
www.ccri.com
GeoMesa on Spark SQL
Extracting Location Intelligence from Data

Intro to Location Intelligence and GeoMesa
Spatial Data Types, Spatial SQL
Extending Spark Catalyst for Optimized Spatial SQL
Density of activity in San Francisco
Speed profile of San Francisco

Location Intelligence
Questions, from easy to hard:
Show me all the coffee shops in this neighborhood
How many users commute through this intersection every day?
Which users commute within 250 meters of a coffee shop?
Should I place an ad for coffee on this user’s mobile device?
Where should I build my next coffee shop?

What is GeoMesa?
A suite of tools for persisting, querying, analyzing, and
streaming spatio-temporal data at scale

Spatial Data Types
Points (PointUDT, MultiPointUDT)
    Locations
    Events
    Instantaneous Positions
Lines (LineUDT, MultiLineUDT)
    Road networks
    Voyages
    Trips
    Trajectories
Polygons (PolygonUDT, MultiPolygonUDT)
    Administrative Regions
    Airspaces

Spatial SQL
SELECT
  activity_id, user_id, geom, dtg
FROM
  activities
WHERE
  st_contains(st_makeBBOX(-78,37,-77,38), geom) AND
  dtg > cast('2017-06-01' as timestamp) AND
  dtg < cast('2017-06-05' as timestamp)

Parts of the Spatial SQL query
Geometry constructor: st_makeBBOX(-78,37,-77,38)
Spatial column in schema: geom
Topological predicate: st_contains(...)

Sample Spatial UDFs: Geometry Constructors
st_geomFromWKT
    Create a point, line, or polygon from a WKT string
    st_geomFromWKT('POINT(-122.40 37.78)')
st_makeLine
    Create a line from a sequence of points
    st_makeLine(collect_list(geom))
st_makeBBOX
    Create a bounding box from (left, bottom, right, top)
    st_makeBBOX(-123,37,-121,39)
...

Sample Spatial UDFs: Topological Predicates
st_contains
    Returns true if the second argument is contained within the first argument
    st_contains(
      st_geomFromWKT('POLYGON…'),
      geom
    )
st_within
    Returns true if the first argument geometry is entirely within the second argument
    st_within(
      geom,
      st_geomFromWKT('POLYGON…')
    )
st_dwithin
    Returns true if the geometries are within a specified distance from each other
    st_dwithin(geom1, geom2, 100)

Sample Spatial UDFs: Processing
st_bufferPoint
    Create a buffer around a point for distance-within type queries
    st_bufferPoint(geom, 10)
st_envelope
    Extract the envelope of a geometry
    st_envelope(geom)
st_geohash
    Encode the geometry using a Z-order space-filling curve. Useful for grid analysis.
    st_geohash(geom, 35)
st_closestpoint
    Find the point on the target geometry that is closest to the given geometry
    st_closestpoint(geom1, geom2)
st_distanceSpheroid
    Find the great circle distance using the WGS84 ellipsoid
    st_distanceSpheroid(geom1, geom2)
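To make the 35-bit precision argument of st_geohash concrete, here is a minimal sketch of the standard geohash encoding in plain Python. It is not GeoMesa's implementation: the function name geohash and its argument order are assumptions made for illustration, and it handles points only. Bits alternate between longitude and latitude, and every 5 bits become one base-32 character, so 35 bits yield a 7-character cell id.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lon, lat, bits):
    """Encode a lon/lat point as a base-32 geohash with `bits` bits of precision."""
    if bits % 5 != 0:
        raise ValueError("bits must be a multiple of 5 (one character per 5 bits)")
    lon_lo, lon_hi = -180.0, 180.0
    lat_lo, lat_hi = -90.0, 90.0
    value = 0
    for i in range(bits):
        if i % 2 == 0:
            # even-numbered bits halve the longitude interval
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            # odd-numbered bits halve the latitude interval
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
    # emit one base-32 character per group of 5 bits, most significant first
    return "".join(BASE32[(value >> shift) & 31] for shift in range(bits - 5, -1, -5))
```

Two points that share a geohash fall in the same grid cell, which is what makes the value usable as a GROUP BY key for grid analysis.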

Optimizing Spatial SQL
Only load partitions that have
records that intersect the query
geometry.

Extending Spark’s Catalyst Optimizer
Catalyst exposes hooks to insert optimization rules at various points in the
query processing logic.
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html

/**
* :: Experimental ::
* A collection of methods that are considered experimental, but can be used to hook into
* the query planner for advanced functionality.
*
* @group basic
* @since 1.3.0
*/
@Experimental
@transient
@InterfaceStability.Unstable
def experimental: ExperimentalMethods = sparkSession.experimental

SQL optimizations for Spatial Predicates
Parts of the query that matter to the optimizer:
GeoMesa Relation: the activities table in the FROM clause
Relational Projection: the SELECT column list
Topological Predicate: the spatial test in the WHERE clause
Geometry Literal: the constant geometry passed to the predicate
Date range predicate: the bounds on dtg

object STContainsRule extends Rule[LogicalPlan] with PredicateHelper {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    plan.transform {
      case filt @ Filter(f, lr @ LogicalRelation(gmRel: GeoMesaRelation, _, _)) =>
        …
        val relation = gmRel.copy(filt = ff.and(gtFilters :+ gmRel.filt))
        lr.copy(expectedOutputAttributes = Some(lr.output),
                relation = relation)
    }
  }
}

What the rule does:
Intercept a Filter on a GeoMesa Logical Relation.
Extract the predicates that can be handled by GeoMesa, create a new GeoMesa relation with
the predicates pushed down into the scan, and return a modified tree with the new relation and
the filter removed.
GeoMesa will compute the minimal ranges necessary to cover the query region.
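The rewrite described above can be illustrated with a toy model in plain Python. This is a sketch of the general predicate-pushdown idea only: the Relation and Filter classes, the string-based predicates, and the PUSHABLE_PREFIXES list are all hypothetical stand-ins, not Catalyst or GeoMesa APIs.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class Relation:
    """Toy stand-in for a data-source relation that can filter during its scan."""
    name: str
    pushed_filters: Tuple[str, ...] = ()

@dataclass(frozen=True)
class Filter:
    """Toy stand-in for a logical Filter node sitting on top of a relation."""
    predicates: Tuple[str, ...]
    child: Relation

# Hypothetical: predicates the data store is assumed to evaluate itself
PUSHABLE_PREFIXES = ("st_contains", "st_within", "st_dwithin")

def push_down(plan):
    """Move pushable predicates from a Filter into the relation beneath it."""
    if not isinstance(plan, Filter):
        return plan
    pushable = tuple(p for p in plan.predicates if p.startswith(PUSHABLE_PREFIXES))
    residual = tuple(p for p in plan.predicates if not p.startswith(PUSHABLE_PREFIXES))
    relation = replace(plan.child, pushed_filters=plan.child.pushed_filters + pushable)
    # drop the Filter node entirely when nothing is left for it to evaluate
    return Filter(residual, relation) if residual else relation
```

When every predicate is pushable the Filter node disappears from the tree, which corresponds to the "filter removed" outcome described above; otherwise a smaller Filter remains on top of the new relation.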

Plan before optimization:
Relational Projection -> Filter -> GeoMesa Relation
After pushing down the filter:
Relational Projection -> GeoMesa Relation <topo predicate>
After pushing down the projection:
GeoMesa Relation <topo predicate> <relational projection>

SELECT
  *
FROM
  activities <pushdown filter and projection>
Reduced I/O, reduced network overhead, reduced
compute load - faster Location Intelligence
answers

Density of Activity in San Francisco
1. Constrain to San Francisco
2. Snap location to 35 bit geohash
3. Group by geohash and count records per geohash
SELECT
  geohash,
  count(geohash) as count
FROM (
  SELECT st_geohash(geom,35) as geohash
  FROM sf
  WHERE
    st_contains(st_makeBBOX(-122.4194-1,37.77-1,-122.4194+1,37.77+1),
                geom)
)
GROUP BY geohash

Visualize using Jupyter and Bokeh
from bokeh.plotting import figure, show

# x_range, y_range, projecteddf, colors, and tonerlines are defined earlier in the notebook
p = figure(title="STRAVA",
           plot_width=900, plot_height=600,
           x_range=x_range, y_range=y_range)
p.add_tile(tonerlines)
p.circle(x=projecteddf['px'],
         y=projecteddf['py'],
         fill_alpha=0.5,
         size=6,
         fill_color=colors,
         line_color=colors)
show(p)

Speed Profile of a Metro Area
Inputs: STRAVA Activities
An activity is sampled once per second
Each observation has a location and time
{
"type": "Feature",
"geometry": { "type": "Point", "coordinates": [-122.40736,37.807147] },
"properties": {
"activity_id": "**********************",
"athlete_id": "**********************",
"device_type": 5,
"activity_type": "Ride",
"frame_type": 2,
"commute": false,
"date": "2016-11-02T23:58:03",
"index": 0
},
"id": "6a9bb90497be6f64eae009e6c760389017bc31db:0"
}

1. Select all activities within metro area
2. Sort each activity by dtg ascending
3. Window over each set of consecutive samples
4. Create a temporary table
spark.sql("""
  SELECT
    activity_id,
    index,
    geom as s,
    lead(geom) OVER (PARTITION BY activity_id ORDER BY dtg asc) as e,
    dtg as start,
    lead(dtg) OVER (PARTITION BY activity_id ORDER BY dtg asc) as end
  FROM activities
  WHERE
    activity_type = 'Ride' AND
    st_contains(
      st_makeBBOX(-122.4194-1,37.77-1,-122.4194+1,37.77+1),
      geom)
  ORDER BY dtg ASC
""").createOrReplaceTempView("segments")
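The windowing step pairs each sample with the next sample of the same activity. As a rough illustration of what lead(...) OVER (PARTITION BY activity_id ORDER BY dtg) produces, here is a plain-Python sketch; the function name segments and the tuple layout are assumptions for this example, and rows come out ordered by activity and then time rather than by the query's final global ORDER BY.

```python
from itertools import groupby
from operator import itemgetter

def segments(samples):
    """samples: iterable of (activity_id, dtg, geom) tuples.

    Yields (activity_id, s, e, start, end), where e and end come from the
    next sample of the same activity, or are None for the last sample.
    """
    # sort by activity, then time, so each activity forms one ordered run
    ordered = sorted(samples, key=itemgetter(0, 1))
    for activity_id, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        # pair each row with its successor; the last row gets None, like lead() yielding NULL
        for cur, nxt in zip(rows, rows[1:] + [None]):
            e, end = (nxt[2], nxt[1]) if nxt is not None else (None, None)
            yield (activity_id, cur[2], e, cur[1], end)
```

Each yielded row is one segment between consecutive observations, which is the shape the later distance and time computations rely on.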

5. Compute the distance between consecutive points
6. Compute the time difference between consecutive points
7. Compute the speed
8. Snap the location to a grid based on a GeoHash
9. Create a temporary table
spark.sql("""
  SELECT
    st_geohash(s,35) as gh,
    st_distanceSpheroid(s, e) /
      cast(cast(end as long) - cast(start as long) as double)
      as meters_per_second
  FROM segments
""").createOrReplaceTempView("gridspeeds")
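The speed computation is distance divided by elapsed seconds. The sketch below shows the arithmetic in plain Python under a stated simplification: it uses the haversine formula on a sphere with the mean Earth radius, which only approximates the WGS84 ellipsoidal distance that st_distanceSpheroid is described as computing. The function names and the choice to return None for a missing successor or zero elapsed time are assumptions of this example.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; a spherical stand-in for the WGS84 ellipsoid

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def meters_per_second(s, e, start, end):
    """s, e: (lon, lat) pairs; start, end: epoch seconds.

    Returns None when there is no successor sample or no elapsed time.
    """
    if e is None or end is None or end == start:
        return None
    return haversine_m(s[0], s[1], e[0], e[1]) / float(end - start)
```

With samples taken once per second, the denominator is usually 1, so the speed is close to the raw distance between consecutive points.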

10. Group the grid cells
11. For each grid cell, compute the median and standard deviation of the speed
12. Extract the location of the grid cell
SELECT
  st_centroid(st_geomFromGeoHash(gh,35)) as p,
  percentile_approx(meters_per_second,0.5) as med_meters_per_second,
  stddev(meters_per_second) as std_dev
FROM gridspeeds
GROUP BY gh
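The per-cell aggregation can be mimicked in plain Python with the standard library. This is a sketch rather than an equivalent of the Spark query: median here is exact while percentile_approx is approximate, and the summarize name, the row layout, and returning None for a single-sample cell are choices made for this example.

```python
from collections import defaultdict
from statistics import median, stdev

def summarize(rows):
    """rows: iterable of (gh, meters_per_second) pairs.

    Returns {gh: (median speed, sample standard deviation or None)}.
    """
    cells = defaultdict(list)
    for gh, speed in rows:
        if speed is not None:  # skip segments without a computed speed
            cells[gh].append(speed)
    return {
        gh: (median(speeds), stdev(speeds) if len(speeds) > 1 else None)
        for gh, speeds in cells.items()
    }
```

The median is less sensitive than the mean to the occasional GPS glitch that produces an implausibly fast segment, which is one reason to prefer it for a speed profile.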

Visualize using Jupyter and Bokeh
p = figure(title="STRAVA",
           plot_width=900, plot_height=600,
           x_range=x_range, y_range=y_range)
p.add_tile(tonerlines)
p.circle(x=projecteddf['px'],
         y=projecteddf['py'],
         fill_alpha=0.5,
         size=6,
         fill_color=colors,
         line_color=colors)
show(p)

Thank You.
geomesa.org
github.com/locationtech/geomesa
twitter.com/algoriffic
linkedin.com/in/anthony-fox-ccri
anthony.fox@ccri.com
www.ccri.com

Indexing Spatio-Temporal Data in Bigtable
Moscone Center coordinates
37.7839°N, 122.4012°W

• Bigtable clones have a single-dimension, lexicographically sorted index
• What if we concatenated latitude and longitude?
• Fukushima sorts lexicographically near Moscone
Center because they have the same latitude
Row Key
37.7839,-122.4012
37.7839,140.4676
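A few lines of Python show the sorting problem with concatenated keys. The two row keys from the slide are used as-is; the third key, for New York City, is added here purely for contrast.

```python
# Row keys built by concatenating latitude and longitude as text
row_keys = [
    "40.7128,-74.0060",   # New York City (added for contrast, not from the slides)
    "37.7839,140.4676",   # Fukushima
    "37.7839,-122.4012",  # Moscone Center
]

# A lexicographically sorted index orders keys as plain strings
ordered = sorted(row_keys)
```

After sorting, the Moscone Center and Fukushima keys are adjacent because the latitude prefix dominates the comparison, even though the two places are an ocean apart.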

Space-filling Curves
2-D Z-order Curve, 2-D Hilbert Curve

Space-filling curve example
Moscone Center coordinates
37.7839° N, 122.4012° W
Encode coordinates to a 32 bit Z
1. Scale latitude and longitude to use 16 available bits each
   scaled_x = (-122.4012 + 180)/360 * 2^16
            = 10485 (rounded down)
   scaled_y = (37.7839 + 90)/180 * 2^16
            = 46524 (rounded down)
2. Take binary representation of scaled coordinates
   bin_x = 0010100011110101
   bin_y = 1011010110111100
3. Interleave bits of x and y and convert back to an integer
   bin_z = 01001101100100011110111101110010
   z = 1301409650
Distance preserving hash
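The three steps of the worked example can be reproduced with a short Python sketch. The function names scale, interleave, and z_encode are made up for this illustration and are not GeoMesa APIs; the bit order (x bit first, then y bit, starting from the most significant bit) is the one that reproduces the numbers on the slide.

```python
def scale(value, lo, hi, bits=16):
    """Map value from [lo, hi] onto an integer cell index in [0, 2^bits - 1]."""
    cell = int((value - lo) / (hi - lo) * (1 << bits))
    return min(cell, (1 << bits) - 1)  # clamp so value == hi still fits in `bits` bits

def interleave(x, y, bits=16):
    """Interleave the bits of x and y, x bit first, most significant bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def z_encode(lon, lat, bits=16):
    """Encode a lon/lat point as a Z-order value using `bits` bits per dimension."""
    x = scale(lon, -180.0, 180.0, bits)
    y = scale(lat, -90.0, 90.0, bits)
    return interleave(x, y, bits)
```

Calling z_encode(-122.4012, 37.7839) walks through the same scaled values and bit strings as the example above.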

Space-filling curves linearize a multi-dimensional space
Bigtable Index [0,2^32]
0 ... 1301409650 ... 4294967296

Regions translate to range scans
Bigtable Index [0,2^32]
0 ... 1301409650 ... 1301409657 ... 4294967296
scan 'geomesa', {STARTROW => 1301409650, ENDROW => 1301409657}
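Why a region maps to range scans can be seen from a property of the Z-order curve: the Z value grows when either coordinate grows, so every cell inside a box has a Z value between those of the box's lower-left and upper-right corners. The sketch below, with hypothetical helper names and a tiny 4-bit grid, computes that single covering range and counts how many keys in it lie outside the box. It is an illustration only and does not reproduce how GeoMesa computes its scan ranges.

```python
def interleave(x, y, bits):
    """Interleave the bits of x and y, x bit first, most significant bit first."""
    z = 0
    for i in range(bits - 1, -1, -1):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def covering_range(x0, y0, x1, y1, bits):
    """Single Z range that contains every cell of the box [x0, x1] x [y0, y1]."""
    return interleave(x0, y0, bits), interleave(x1, y1, bits)

def false_positives(x0, y0, x1, y1, bits):
    """Number of keys in the covering range that belong to cells outside the box."""
    lo, hi = covering_range(x0, y0, x1, y1, bits)
    inside = {interleave(x, y, bits)
              for x in range(x0, x1 + 1)
              for y in range(y0, y1 + 1)}
    return (hi - lo + 1) - len(inside)
```

For the box from cell (2, 3) to cell (5, 6) on a 4-bit grid, the single range spans 42 keys for only 16 cells, which is why splitting a query region into several tighter ranges reduces the amount of data scanned.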

Provisioning Spatial RDDs

Accumulo
params = {
    "instanceId": "geomesa",
    "zookeepers": "X.X.X.X",
    "user": "user",
    "password": "******",
    "tableName": "geomesa.strava"
}
(spark
    .read
    .format("geomesa")
    .options(**params)
    .option("geomesa.feature", "activities")
    .load())

HBase and Bigtable
params = {
    "bigtable.table.name": "geomesa.strava"
}
(spark
    .read
    .format("geomesa")
    .options(**params)
    .load())

Flat files
params = {
    "geomesa.converter": "strava",
    "geomesa.input": "s3://path/to/data/*.json.gz"
}
(spark
    .read
    .format("geomesa")
    .options(**params)
    .load())

Inputs
Approach
● Select all activities within metro area
● Sort each activity by dtg ascending
● Window over each set of consecutive samples
● Compute summary statistics of speed
● Group by grid cell
● Visualize
