Andrés de la Peña
Stratio's Cassandra Lucene index:
Geospatial use cases
Jonathan Nappée
• Big Data Company
• Certified Spark distribution
• Founded in 2013
• 200+ employees
• Offices in Madrid, San Francisco and Bogotá
2/40
1 Lucene-based secondary indexes
2 Geospatial search features
3 Business use cases
3/40
Lucene-based Cassandra secondary indexes
Apache Lucene
• General purpose search library
• Created by Doug Cutting in 1999
• Core of popular search engines:
‒ Apache Nutch, Compass, Apache Solr, ElasticSearch
• Tons of features:
‒ Full-text search, inequalities, sorting, geospatial, aggregations…
• Rich implementation:
‒ Multiple index structures, smart query planning, cool merge policy…
5/40
A Lucene-based C* 2i implementation
• Each node indexes its own data
• Keep P2P architecture
• Distribution managed by C*
• Replication managed by C*
• Just a single pluggable JAR file
[Diagram: a client connected to three C* nodes; each node runs in its own JVM with an embedded Lucene index]
6/40
Creating Lucene indexes
CREATE TABLE tweets (
user text,
date timestamp,
message text,
hashtags set<text>,
PRIMARY KEY (user, date));
• Built in the background
• Dynamic updates
• Immutable mapping schema
• Many columns per index
• Many indexes per table
CREATE CUSTOM INDEX tweets_idx ON tweets()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{fields : {
user : {type: "string"},
date : {type: "date", pattern: "yyyy-MM-dd"},
message : {type: "text", analyzer: "english"},
hashtags: {type: "string"}}}'};
7/40
Querying Lucene indexes
SELECT * FROM tweets WHERE expr(tweets_idx, '{
filter: {
must: {type: "phrase", field: "message", value: "cassandra is cool"},
not: {type: "wildcard", field: "hashtags", value: "*cassandra*"}
},
sort: {field: "date", reverse: true}
}') AND user = 'adelapena' AND date >= '2016-01-01';
• Custom JSON syntax
• Multiple query types
• Multivariable conditions
• Multivariable sorting
• Separate filtering and relevance queries
8/40
Java query builder
import static com.datastax.driver.core.querybuilder.QueryBuilder.*;
import static com.stratio.cassandra.lucene.builder.Builder.*;
{…}
String search = search().filter(phrase("message", "cassandra is cool"))
.filter(not(wildcard("hashtags", "*cassandra*")))
.sort(field("date").reverse(true))
.build();
session.execute(select().from("tweets")
.where(eq("lucene", search))
.and(eq("user", "adelapena"))
.and(gte("date", "2016-01-01")));
• Available for JVM languages: Java, Scala, Groovy…
• Compatible with most Cassandra clients
9/40
Apache Spark integration
• Computes large amounts of data
• Maximizes parallelism
• Filtering push-down
• Avoids full scans
[Diagram: a Spark master on top of three C* nodes; each node runs in its own JVM with an embedded Lucene index]
10/40
Geospatial search features
Geo point mapper
CREATE TABLE restaurants(
name text PRIMARY KEY,
stars bigint,
lat double,
lon double,
lucene text -- dummy column targeted by the index and by the queries
);
CREATE CUSTOM INDEX restaurants_idx
ON restaurants (lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds' : '1',
'schema' : '{
fields : {
location : {
type : "geo_point",
latitude : "lat",
longitude : "lon"
},
stars: {type : "integer" }
}
}
'};
14/40
Bounding box search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_bbox",
field : "location",
min_latitude : 40.425978,
max_latitude : 40.445886,
min_longitude : -3.808252,
max_longitude : -3.770999
}
}';
15/40
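A bounding box filter such as the geo_bbox query above reduces to two range checks, one per coordinate. The sketch below only illustrates that predicate in plain Java; it is not the plugin's implementation, which evaluates the filter against the Lucene index. The class and method names are hypothetical, and the box is assumed not to cross the antimeridian.

```java
/** Illustrative bounding-box predicate; hypothetical helper, not part of the Stratio plugin. */
final class BoundingBoxSketch {
    private final double minLat, maxLat, minLon, maxLon;

    BoundingBoxSketch(double minLat, double maxLat, double minLon, double maxLon) {
        if (minLat > maxLat || minLon > maxLon) {
            throw new IllegalArgumentException("min must not exceed max");
        }
        this.minLat = minLat;
        this.maxLat = maxLat;
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    /** True when the point lies inside the box, borders included. */
    boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        // Same limits as the geo_bbox query above.
        BoundingBoxSketch box =
                new BoundingBoxSketch(40.425978, 40.445886, -3.808252, -3.770999);
        System.out.println(box.contains(40.44, -3.79)); // point inside both ranges
        System.out.println(box.contains(40.40, -3.79)); // latitude below the box
    }
}
```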
Distance search
SELECT * FROM restaurants
WHERE lucene =
'{
filter :
{
type : "geo_distance",
field : "location",
latitude : 40.443270,
longitude : -3.800498,
min_distance : "100m",
max_distance : "2km"
}
}';
16/40
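The geo_distance filter above keeps points whose distance to the reference location falls between min_distance and max_distance. As a rough sketch of that predicate, the code below assumes great-circle (haversine) distance on a spherical Earth; the plugin's actual distance computation may differ, and the class, method names and radius constant are illustrative choices rather than plugin API.

```java
/** Illustrative distance-band predicate; hypothetical helper, not part of the Stratio plugin. */
final class GeoDistanceSketch {
    static final double EARTH_RADIUS_KM = 6371.0088; // mean Earth radius (assumed spherical)

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    /** True when the point is at least minKm and at most maxKm away from the center. */
    static boolean withinBand(double centerLat, double centerLon,
                              double lat, double lon, double minKm, double maxKm) {
        double d = haversineKm(centerLat, centerLon, lat, lon);
        return d >= minKm && d <= maxKm;
    }

    public static void main(String[] args) {
        // Reference point and 100 m to 2 km band taken from the query above.
        double lat = 40.443270, lon = -3.800498;
        System.out.println(withinBand(lat, lon, lat + 0.009, lon, 0.1, 2.0)); // about 1 km north
        System.out.println(withinBand(lat, lon, lat + 0.05, lon, 0.1, 2.0));  // about 5.5 km north
    }
}
```

The same distance function could serve as a sort key for a nearest-first ordering like the one in the distance sorting query.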
Distance sorting
SELECT * FROM restaurants
WHERE lucene =
'{
sort:
{
type : "geo_distance",
field : "location",
reverse : false,
latitude : 40.442163,
longitude : -3.784519
}
}' LIMIT 10;
17/40
Indexing complex geospatial shapes
CREATE TABLE places(
id uuid PRIMARY KEY,
shape text -- WKT formatted
);
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: []
}
}
}'
};
• Points, lines, polygons & multiparts
• JTS index-time transformations
18/40
Index-time shape transformations
• Example: Index only centroid of shapes
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: [{type: "centroid"}]
}
}
}'
};
19/40
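To make the centroid transformation concrete, here is a minimal sketch of the area-weighted centroid of a simple polygon ring, computed with the shoelace formula on planar coordinates. The slides say the transformations are JTS based, so the plugin presumably delegates to JTS; this hypothetical helper only illustrates the geometry and ignores holes, multipart shapes and the curvature of the Earth.

```java
/** Illustrative planar centroid of a closed ring; hypothetical helper, not JTS or plugin code. */
final class CentroidSketch {
    /**
     * Returns {x, y} of the area-weighted centroid.
     * The ring is a list of {x, y} vertices whose last vertex repeats the first.
     */
    static double[] centroid(double[][] ring) {
        double twiceArea = 0, cx = 0, cy = 0;
        for (int i = 0; i < ring.length - 1; i++) {
            // Signed cross product of consecutive vertices (shoelace term).
            double cross = ring[i][0] * ring[i + 1][1] - ring[i + 1][0] * ring[i][1];
            twiceArea += cross;
            cx += (ring[i][0] + ring[i + 1][0]) * cross;
            cy += (ring[i][1] + ring[i + 1][1]) * cross;
        }
        if (twiceArea == 0) {
            throw new IllegalArgumentException("degenerate ring with zero area");
        }
        // Cx = sum / (6 * area) = sum / (3 * twiceArea), and likewise for Cy.
        return new double[] {cx / (3 * twiceArea), cy / (3 * twiceArea)};
    }

    public static void main(String[] args) {
        double[][] unitSquare = {{0, 0}, {1, 0}, {1, 1}, {0, 1}, {0, 0}};
        double[] c = centroid(unitSquare);
        System.out.println(c[0] + ", " + c[1]);
    }
}
```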
Index-time shape transformations
• Example: Index 50 km buffer zone around shapes
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 15,
transformations: [{
type: "buffer",
min_distance: "50km"}]
}
}
}'
};
20/40
Index-time shape transformations
• Example: Index the convex hull of the shape
CREATE CUSTOM INDEX places_idx ON places()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields: {
shape: {
type: "geo_shape",
max_levels: 8,
transformations:
[{type: "convex_hull"}]
}
}
}'
};
21/40
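The convex hull transformation replaces a shape with the smallest convex polygon that contains it, which typically has far fewer vertices than a complex outline. The sketch below uses Andrew's monotone chain algorithm on planar points to show what that means; it is an illustration with hypothetical names, not the JTS routine the plugin relies on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative convex hull (Andrew's monotone chain); hypothetical helper, not JTS code. */
final class ConvexHullSketch {
    /** Positive when o -> a -> b turns counter-clockwise, zero when collinear. */
    private static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    /** Returns the hull vertices in counter-clockwise order, first vertex not repeated. */
    static List<double[]> convexHull(double[][] points) {
        double[][] p = points.clone();
        // Sort by x, then by y, so the chain can be built left to right.
        Arrays.sort(p, (a, b) -> a[0] != b[0]
                ? Double.compare(a[0], b[0]) : Double.compare(a[1], b[1]));
        int n = p.length;
        if (n < 3) {
            return new ArrayList<>(Arrays.asList(p));
        }
        double[][] hull = new double[2 * n][];
        int k = 0;
        for (int i = 0; i < n; i++) { // lower hull
            while (k >= 2 && cross(hull[k - 2], hull[k - 1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        for (int i = n - 2, t = k + 1; i >= 0; i--) { // upper hull
            while (k >= t && cross(hull[k - 2], hull[k - 1], p[i]) <= 0) k--;
            hull[k++] = p[i];
        }
        // The last point equals the first one, so it is dropped.
        return new ArrayList<>(Arrays.asList(hull).subList(0, k - 1));
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {2, 0}, {2, 2}, {0, 2}, {1, 1}, {1, 0}};
        System.out.println(convexHull(pts).size()); // interior and collinear points are dropped
    }
}
```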
Search by geo shape
• Can search points and shapes using shapes
• Operations define how you search: Intersects, Is_within, Contains
• Can use transformations before searching
‒ Bounding box
‒ Buffer
‒ Centroid
‒ Convex Hull
‒ Difference
‒ Intersection
‒ Union
22/40
Geo Search
• Example: search within a polygon
SELECT * FROM cities
WHERE expr(cities_index, '{
filter: {
type: "geo_shape",
field: "place",
operation: "is_within",
shape: {
type: "wkt",
value: "POLYGON((-0.07 51.63,
0.03 51.54,
0.05 51.65,
-0.07 51.63))"
}
}
}');
23/40
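For an indexed point, an is_within search against a polygon amounts to a point-in-polygon test. The sketch below shows the classic ray casting version on planar longitude/latitude pairs, using the triangle from the query above; it is a hypothetical illustration, not the plugin's code, and it ignores holes and geodesic edges.

```java
/** Illustrative ray casting point-in-polygon test; hypothetical helper, not plugin code. */
final class PointInPolygonSketch {
    /**
     * ring: polygon vertices as {x = longitude, y = latitude}; it may or may not
     * repeat the first vertex at the end. Points exactly on an edge are not handled specially.
     */
    static boolean contains(double[][] ring, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = ring.length - 1; i < ring.length; j = i++) {
            double xi = ring[i][0], yi = ring[i][1];
            double xj = ring[j][0], yj = ring[j][1];
            // Does a horizontal ray going right from (x, y) cross the edge j -> i?
            boolean crosses = (yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi;
            if (crosses) {
                inside = !inside; // an odd number of crossings means the point is inside
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        // WKT order is "longitude latitude", as in the POLYGON of the query above.
        double[][] polygon = {{-0.07, 51.63}, {0.03, 51.54}, {0.05, 51.65}, {-0.07, 51.63}};
        System.out.println(contains(polygon, 0.0, 51.60));
        System.out.println(contains(polygon, 0.2, 51.60));
    }
}
```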
Business use cases
• Investment fund with large exposures to natural catastrophe insurance on properties
• Many geographical data sets:
‒ properties details
‒ natural catastrophe event data
o Hurricane tracks and affected zones
o Earthquakes impact zones
• Risks and portfolios
23/40
Use cases data set
• We indexed all the US census block shapes from the Hazus database
‒ https://www.fema.gov/hazus
‒ These blocks contain revenue and building stats that are useful for
pricing insurance premiums and potential losses
o Average revenue
o Number of stories
‒ Some of them are very complex
o First attempt with convex hull
o Composite indexing strategy with ±2km geohash and doc values in
borders
• We also indexed all police and fire stations in the US
24/40
Use cases data set
CREATE TABLE blocks (
state text,
bucket int,
id int,
area double,
type text,
income_ratio double,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY ((state, bucket),
id)
);
CREATE CUSTOM INDEX block_idx ON blocks(lucene)
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
'refresh_seconds': '1',
'schema': '{
fields : {
state : {type: "string"},
type : {type: "string"},
...
center: {type: "geo_point",
max_levels: 11,
latitude: "latitude",
longitude: "longitude"},
shape : {type: "geo_shape",
max_levels: 5}
}
}'};
25/40
Use cases data set
CREATE TABLE fire_stations(
state text,
id text,
city text,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY (state, id)
);
CREATE TABLE police_stations(
state text,
id text,
city text,
latitude double,
longitude double,
shape text,
...
lucene text,
PRIMARY KEY (state, id)
);
• Analogous indexing for police and fire stations tables
26/40
Composite spatial strategy
• Meant for indexing complex polygons
• Two spatial strategies combined
‒ GeoHash recursive prefix tree for speed
‒ Serialized doc values for accuracy
• Reduced number of geohash terms
• Doc values only for polygon borders
David Smiley blog post:
http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy
27/40
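The geohash half of the composite strategy relies on the fact that a geohash is a prefix code: each extra character subdivides the previous cell, so nearby points share prefixes and a shape can be covered by a limited set of prefix terms. The sketch below implements only the standard geohash encoding to show that property; it is not Lucene's prefix tree or composite strategy, and the class name is hypothetical.

```java
/** Illustrative standard geohash encoder; hypothetical helper, not Lucene's prefix tree. */
final class GeoHashSketch {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    /** Encodes a latitude/longitude pair (degrees) into a geohash of the given length. */
    static String encode(double lat, double lon, int length) {
        double[] latRange = {-90.0, 90.0};
        double[] lonRange = {-180.0, 180.0};
        StringBuilder hash = new StringBuilder(length);
        boolean evenBit = true; // even bits refine longitude, odd bits refine latitude
        int bits = 0;
        int value = 0;
        while (hash.length() < length) {
            double[] range = evenBit ? lonRange : latRange;
            double coord = evenBit ? lon : lat;
            double mid = (range[0] + range[1]) / 2;
            value <<= 1;
            if (coord >= mid) {
                value |= 1;     // upper half of the current interval
                range[0] = mid;
            } else {
                range[1] = mid; // lower half of the current interval
            }
            evenBit = !evenBit;
            if (++bits == 5) {  // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(57.64911, 10.40744, 6)); // prints "u4pruy"
        System.out.println(encode(57.65, 10.41, 4));       // nearby point, same 4-character cell
    }
}
```

Shorter hashes denote larger cells, which is why limiting the tree depth (max_levels in the mappers above) reduces the number of indexed terms at the cost of precision.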
Use cases: Search blocks in a shape
• We search which census blocks intersect with a shape
SELECT * FROM blocks
WHERE expr(blocks_index, '{
filter: {
type: "geo_shape",
field: "shape",
operation: "intersects",
shape: {
type: "buffer",
max_distance: "10km",
shape: {
type: "wkt",
value: "LINESTRING(-80.90 29.05...)"
}
}
}
}');
28/40
Use cases: Search blocks far from police and fire stations
• Proximity to police and fire stations can have an impact on damage when
a natural catastrophe event happens
• We can use this information to search for blocks in our portfolio that are more
than 8 miles from any station to highlight their risk
29/40
Use cases: Search blocks far from fire stations
SELECT * FROM fire_stations WHERE lucene = '{
filter : {
type: "geo_shape",
field: "centroid",
shape: {value: "POLYGON(…)"}}
}';
SELECT * FROM blocks WHERE lucene = '{
filter : {
must: {
type: "geo_shape",
field: "shape",
shape: {value: "POLYGON(…)"}},
not: {
type: "geo_shape",
field: "shape",
shape: {
type: "buffer",
max_distance: "8mi",
shape: {value: "MULTIPOINT(…)"}}}
}}';
30/40
Use cases:
Find which blocks are affected by a moving hurricane and their
maximum wind speed exposures
• If we are modelling a hurricane we end up with a changing shape every 6
hours, with different location and wind speeds
• We want to find for each state which blocks are hit and at which maximum
wind speed
• We use transformations to represent the moving hurricane and within that the
different wind speeds
31/40
Use cases: Blocks affected by a moving hurricane
SELECT * FROM blocks WHERE expr(idx, '{
filter : {
type: "geo_shape",
field: "shape",
shape: {
type: "union",
shapes: [{
type: "convex_hull",
shape: {
type: "union",
shapes: [
{type: "buffer",
max_distance: "6mi",
shape: {value: "POINT(…)"}},
{type: "buffer",
max_distance: "3mi",
shape: {value: "POINT(…)"}}
]}},
...
]
}
}}');
Conclusions
• New pluggable geospatial features in Cassandra
‒ Complex polygon search
‒ Geometrical transformations API
• Can be combined with other search predicates
• Compatible with MapReduce frameworks
• Preserves Cassandra's functionality
34/40
It's open source
github.com/stratio/cassandra-lucene-index
• Published as plugin for Apache Cassandra
• Apache License Version 2.0
35/40
THANK YOU
UNITED STATES
Tel: (+1) 408 5998830
EUROPE
Tel: (+34) 91 828 64 73
contact@stratio.com
www.stratio.com

Stratio's Cassandra Lucene index: Geospatial Use Cases (Andrés de la Peña & Jonathan Nappee, Nephila) | C* Summit 2016

Editor's Notes

  • #2: Hello everyone, my name is Andrés de la Peña, from Stratio, and this is Jonathan Nappée, from Nephila Capital. Today, we are going to talk about how to index geospatial data in Cassandra using Stratio's pluggable Lucene secondary index, and we will show some examples of how to apply these features to several of Nephila's use cases.
  • #3: To begin with, I'd like to introduce Stratio. Stratio is a big data company founded in 2013 that currently has more than 200 employees. Our technical team is currently located in Madrid, but we also have offices in San Francisco and Bogotá. We focus on offering a big data platform based on the Spark ecosystem, and we are one of the existing certified Spark distributions.
  • #4: The presentation has three main points. First, a quick overview of Stratio's Lucene secondary indexes. Then, we will review the geospatial search features of the plugin. Finally, Jonathan will show how these geospatial features are applied to three of Nephila's business use cases.
  • #5: Stratio's Lucene index is an open source implementation of Cassandra's secondary indexes based on Lucene. It was first created in 2014 as a fork of Cassandra, and it became a plugin for Apache Cassandra during last year. It extends Cassandra index functionality to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multidimensional, geospatial and bitemporal search.
  • #6: Rather than building our own index structures, we chose to use Apache Lucene as the underlying technology for several reasons: - It is a proven, stable and fast indexing solution. - It has a lot of interesting features, such as boolean queries, range queries or relevance search. - Solr and ElasticSearch are successful examples of distributed search engines built on top of Lucene. - We also like that Lucene is just a small library that can be embedded directly in Cassandra, and not an external service. - Finally, it is an Apache project, like Cassandra, fully open source and with a large user community.
  • #7: Here we can see how the integration between Cassandra and Lucene works. There is a Lucene índex embedded in each Cassandra node, so each node indexes its own data. This way, Lucene doesn't have anything to do with distribution and replication, which are responsibility of Cassandra. The peer-to-peer architecture is preserved, so each node is able to coordinate any query. So, no master nodes or external coordinators are required. A cool feature of the Lucene index is that it allows to paginate over rows sorted with a different order than the defined by the partitioner. Sorted pagination is possible thanks to a custom CQL query handler able to intercept and rewrite the Cassandra internal read commands. If we are performing one of these top-k queries, then all the involved nodes will be queried in a parallel fashion. Otherwise, nodes will be sequentially scanned until we find the requested results.
  • #8: Now we are going to see how Lucene indexes are created using the Cassandra Query Language. Let's say that we have a table containing tweets. We store the tweet ID, its creation date, the message body, the user ID and the user name. Then, we create the index in CQL, specifying the Stratio index class and the indexing properties. We specify an index reader refresh period of one second, and which columns are going to be indexed and how. The creation date will be indexed as a date, using a pattern composed of year, month and day, which defines a precision of days. The message field will be treated as English tokenized text. And the user ID and the hashtags will be indexed as untokenized text.
  • #9: Here we have an example showing how to search using the Lucene index. It includes filtering, negative filtering, sorting and routing. We embed the Lucene index JSON syntax inside the Cassandra Query Language using the recently created clause for custom expressions. The JSON Lucene expression specifies that we are searching for all the tweets containing the phrase 'cassandra is cool'. We also require that the matched tweets must not be labeled with the hashtag 'cassandra'. Additionally, results should be returned sorted by descending creation date. We also add some regular CQL clauses specifying that we are only interested in tweets created by a specific user during this year. 'user' is the partition key of the table, so the search will be routed to a single node, avoiding the performance problems of unrestricted secondary index queries. So, with this example we can see how the Lucene index makes it possible to perform quite complex searches.
  • #10: The index is distributed together with a fluent Java query builder that lets you programmatically build index-related JSON queries. The built Lucene index clauses are managed as plain strings, so the query builder should be compatible with most JVM-based Cassandra clients, including the popular DataStax Java driver. This is useful because it can be easily integrated into your existing programs. In this example, we show how to use the query builder to write the query that we saw in the previous slide. The produced JSON string can be easily used as a clause of the query builder provided by the DataStax Java driver.
  • #11: A very important feature of our indexes is that they can be combined with MapReduce frameworks, especially with Spark. Using Lucene predicates with Spark makes it possible to filter the rows at the database level. This way, we retrieve from Cassandra only the information that we need. It avoids the unnecessary reads produced by the usual systematic full table scan, and it can reduce the amount of data to be processed, speeding up the jobs. As you know, in this kind of deployment there is a Spark worker running on each Cassandra node. This is done in order to parallelize jobs while preserving data locality. Since each node indexes its own data, locality is guaranteed when using Lucene indexes.
  • #12: Now we are going to talk about the spatial search capabilities that the Lucene plugin adds to Cassandra. These features are based on the Lucene spatial module and on the Java Topology Suite. A small set of these spatial capabilities was presented during our talk at the last Cassandra Summit. During this year, the number of spatial features has grown significantly, and Nephila has had an important role in this.
  • #13: Here we have an example showing how to index latitude-longitude points stored in Cassandra using CQL. Imagine an application where we want to find restaurants around you. In this example, the table (on the left) contains the restaurant name and its location. There isn't a native point type in Cassandra, so we will use two numeric columns, latitude and longitude, to represent the location. A tuple or a UDT could also have been used. Then, we create the index using the statement on the right. In order to index the location, we add a 'geo_point' mapper named 'location'. With this mapping we must specify which columns in the indexed table store the latitude and the longitude. We may combine the geo point mapper with any other non-geospatial mapper. For example, we index the 'stars' column as an integer.
  • #14: Now that the locations have been indexed, we can start searching for geospatial data. The simplest type of query that we have is the bounding box. In this example, we search for restaurants placed inside the visible screen of a hypothetical mobile application. To define the bounding box, we specify, in the query, the minimum and maximum latitude and longitude values. This way, we will collect all the restaurants within the specified coordinates.
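As a rough illustration of the bounding box search described above, the following Python sketch assembles the CQL statement as a plain string. The table name `restaurants`, the index name `restaurants_idx` and the field name `location` are assumptions made for this example, and the `geo_bbox` attribute names follow my recollection of the plugin documentation, so they should be checked against the plugin version in use.

```python
import json


def bbox_search_cql(table, index, field, min_lat, max_lat, min_lon, max_lon):
    """Build a CQL SELECT embedding a bounding box filter as a JSON expression.

    The JSON attribute names are assumed from the plugin documentation.
    """
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    search = {
        "filter": {
            "type": "geo_bbox",
            "field": field,
            "min_latitude": min_lat,
            "max_latitude": max_lat,
            "min_longitude": min_lon,
            "max_longitude": max_lon,
        }
    }
    # json.dumps emits double quotes only, so the expression can be wrapped
    # in the single quotes that CQL uses for string literals.
    return "SELECT * FROM {} WHERE expr({}, '{}');".format(
        table, index, json.dumps(search))


# Hypothetical table, index and screen coordinates.
query = bbox_search_cql("restaurants", "restaurants_idx", "location",
                        40.40, 40.45, -3.72, -3.66)
print(query)
```

Because the expression is just a string, it can be passed to any Cassandra client as an ordinary statement.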
  • #15: Another possible query is to search for restaurants placed inside a specific distance range from a fixed point. For this, we must specify the latitude and the longitude of the reference point, and the desired distance range. Max distance is mandatory, whereas min distance is optional. Along with the distance value we can specify a distance unit. In our example, we search for restaurants located at least one hundred meters away but no more than two kilometers away from our position.
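To make the semantics of the distance range concrete, here is a minimal Python sketch that applies the same rule in memory: a row matches when its great-circle distance to the reference point is between the optional minimum and the mandatory maximum. It uses the haversine formula on a spherical Earth, which is an assumption of this sketch and not necessarily how the index computes distances. The restaurant names and coordinates are made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical model)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within_range(rows, lat, lon, max_m, min_m=0.0):
    """Keep rows whose distance to (lat, lon) lies in [min_m, max_m]."""
    return [r for r in rows
            if min_m <= haversine_m(lat, lon, r["latitude"], r["longitude"]) <= max_m]


# Hypothetical rows: roughly 22 m, 1 km and 5 km north of the reference point.
restaurants = [
    {"name": "too close", "latitude": 40.4170, "longitude": -3.7038},
    {"name": "nearby", "latitude": 40.4258, "longitude": -3.7038},
    {"name": "far", "latitude": 40.4618, "longitude": -3.7038},
]

# At least 100 m away but no more than 2 km away.
matches = within_range(restaurants, 40.4168, -3.7038, max_m=2000, min_m=100)
print([r["name"] for r in matches])
```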
  • #16: Additionally, it is also possible to sort the results of any search by their distance to a specific location. In the example we request the restaurants closest to the user's location. The 'reverse' attribute controls whether the order should be ascending or descending, that is, if we are going to retrieve the closest or the farthest locations. Finally, we use the CQL limit clause to select only the ten closest restaurants. Although pure sorting queries are perfectly possible, it is usually a good idea to combine sorting with any other filter. This way, we will reduce the number of matched rows, and so the number of locations to be sorted.
  • #17: One of the most exciting Lucene features that we have recently added is the ability to index complex geographical shapes, and not only latitude-longitude points. The shapes should be stored in Cassandra in text columns in the WKT format. WKT is a popular markup language able to represent points, line strings, polygons and their multipart variants. The indexing is based on the JTS library and its integration with Spatial4j. Although WKT is the only currently supported format, we plan to add support for other popular formats such as GeoJSON. In the example we can see a table storing places, where the text column 'shape' stores a WKT geographical shape. We create a Lucene index that maps this column as a geographical shape, specifying the maximum number of levels in the geohash search prefix tree. It is also possible to specify a sequence of geometric transformations to be applied in order to the shape before indexing it. Now we will see some examples that demonstrate the utility of these transformations.
  • #18: One of the available index-time transformations is calculating the centroid of the indexed shape. In this example the indexed table stores polygons, but we are only interested in indexing the center of the shapes.
  • #19: Another very useful transformation is indexing a buffer around the initial shape. In the example we are applying a 'buffer' transformation to index the region which is 50km around the stored shape. This transformation could be used, for example, for storing the coverage area of a set of antennas given their lat-lon locations. It could also be used for storing the area around roads or borders defined as line strings.
  • #20: Index size and performance depend greatly on the complexity of the indexed shapes. Shapes with a lot of points and many decimals of precision will produce many terms in the search tree. This increases the size of the index and reduces performance. So, if your use case allows it, it can be worthwhile to use transformations to simplify the indexed shapes. Both centroid and bounding box are typical transformations for this. Convex hull can also be an interesting, more accurate way to reduce precision. In this example we show how the convex hull transformation is used to reduce a complex shape with more than two thousand points to a simple polygon with only eight points, dramatically increasing both indexing and searching performance.
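The point reduction achieved by a convex hull can be illustrated with a small, self-contained Python sketch. It uses Andrew's monotone chain algorithm on planar coordinates. This is only a stand-in to show the effect, since the plugin relies on the JTS library for the actual transformation, and the input shape here is an artificial grid of points rather than a real geographical shape.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Artificial shape with 121 points; its hull only keeps the four corners.
shape = [(x, y) for x in range(11) for y in range(11)]
hull = convex_hull(shape)
print(len(shape), "->", len(hull), hull)
```

The hull always covers the original shape, so searches against it can return extra matches near concave regions but never miss a shape that the original would have matched for intersection.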
  • #21: The indexed polygons can be retrieved by the previously shown bounding box and distance searches. We have also recently added a geo-shape search type that makes it possible to search for shapes that are related to other shapes (and points). The currently supported spatial relations are intersects, is within, and contains. It is also possible to apply transformations to the shapes used in the search. This makes it possible to build complex shapes at search time from other WKT shapes.
  • #22: In this example we are searching for all the indexed shapes within a triangle. In this case we search for places within the Bermuda Triangle. Please note how we define both the spatial operation to be applied and the format of the search shape, which is WKT. In addition, we can recursively define the search shape as transformations of other shapes, as we will see in the Nephila business use cases that Jonathan will show us.
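A search like the one just described could be assembled along the following lines. This Python sketch builds a closed WKT polygon from latitude-longitude vertices and wraps it in a shape search expression. The vertex coordinates are approximate, the field name `shape` is taken from the earlier places example, and the JSON attribute names (`geo_shape`, `operation`, `is_within`, and the nested `wkt` shape) are assumptions based on my reading of the plugin documentation.

```python
import json


def polygon_wkt(vertices):
    """Build a WKT polygon from (latitude, longitude) vertices.

    WKT expects "x y" pairs, that is longitude first, and a closed ring,
    so the first vertex is repeated at the end when needed.
    """
    ring = list(vertices)
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    coords = ", ".join("{} {}".format(lon, lat) for lat, lon in ring)
    return "POLYGON(({}))".format(coords)


# Approximate vertices: Bermuda, Miami and San Juan.
triangle = polygon_wkt([(32.30, -64.78), (25.76, -80.19), (18.47, -66.11)])

# Attribute names assumed from the plugin documentation.
search = {
    "filter": {
        "type": "geo_shape",
        "field": "shape",
        "operation": "is_within",
        "shape": {"type": "wkt", "value": triangle},
    }
}
print(json.dumps(search, indent=2))
```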
  • #23: Thank you, Andrés. I am Jonathan Nappée, and I have been working with Nephila Capital for a year now in the Bermuda office. I first started talking to Andrés while trying to solve some of our geospatial challenges with the Stratio Lucene index in Cassandra. It already contained basic point indexing and distance search features, but I had more complex indexing and search cases to implement. It turned out Andrés also wanted to improve this aspect of the index. So when we met in London in February we very quickly came up with this idea of transformations. I want to show you now a couple of simplified examples of how these features could be used in the context of Nephila.
  • #24: To begin with, let me introduce Nephila Capital and briefly explain its business. We are an investment fund that specializes in natural catastrophe property insurance. That means we deal with house or building insurance against disasters such as hurricanes, earthquakes or floods. As you can imagine, we manipulate many different kinds of geospatial data sets, from the properties we insure to the impacts of natural catastrophes.
  • #25: Let me explain now the setup for these examples. We started by indexing all US census blocks, that is to say shapes of blocks of buildings, inside a Cassandra cluster. Some of the block shapes are very complex and can contain hundreds of points and multiple polygons. To improve efficiency we first tried indexing the convex hull, but then switched to a composite indexing strategy. I'll go into more detail on this strategy in a few slides. We also indexed the locations of all US fire and police stations.
  • #26: This is what the blocks table looks like. Each block contains its shape, centroid location, state income ratio and other information. We then indexed the different fields and in particular the blocks' shapes.
  • #27: This is what the stations table looks like. Each station contains its shape, centroid location, state, city and other information. We then indexed the different fields and in particular the locations.
  • #28: Before I start showing the examples, let me explain a bit more about the composite indexing strategy. This strategy uses two separate index structures, one to achieve speed and the other to achieve accuracy. The first search structure is a geohash recursive prefix tree, usually with low precision. This geohash tree is used to quickly discard most of the non-matching documents. Then the second search structure, which is a simple covering index, is used to discard the false positives produced by the geohash tree. This composite approach makes it possible to quickly retrieve results with complete accuracy while keeping the index relatively small, and it is especially useful with our dataset, which is composed of very complex polygons that would produce too many terms in a regular high-accuracy geohash search tree.
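The two-stage idea can be sketched in a few lines of Python. This is a simplified stand-in and not the Lucene implementation: a coarse grid over planar coordinates plays the role of the low-precision geohash tree, and an exact ray-casting test on the stored polygon plays the role of the second structure that removes false positives. The cell size, shape names and coordinates are all hypothetical.

```python
from collections import defaultdict
from math import floor

CELL = 1.0  # coarse cell size (hypothetical, deliberately low precision)


def cell_of(x, y):
    """Grid cell containing the point (x, y)."""
    return (floor(x / CELL), floor(y / CELL))


def covering_cells(polygon):
    """All grid cells overlapped by the polygon's bounding box."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    x0, y0 = cell_of(min(xs), min(ys))
    x1, y1 = cell_of(max(xs), max(ys))
    return {(cx, cy) for cx in range(x0, x1 + 1) for cy in range(y0, y1 + 1)}


def contains(polygon, x, y):
    """Exact point-in-polygon test using ray casting."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def build_index(shapes):
    """Stage one structure: coarse cell -> names of candidate shapes."""
    index = defaultdict(set)
    for name, polygon in shapes.items():
        for cell in covering_cells(polygon):
            index[cell].add(name)
    return index


def search(shapes, index, x, y):
    """Return (coarse candidates, exact matches) for shapes containing (x, y)."""
    candidates = index.get(cell_of(x, y), set())
    matches = {name for name in candidates if contains(shapes[name], x, y)}
    return candidates, matches


shapes = {
    "A": [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9)],               # triangle
    "B": [(0.6, 0.6), (0.95, 0.6), (0.95, 0.95), (0.6, 0.95)],
    "C": [(2.1, 2.1), (2.9, 2.1), (2.9, 2.9), (2.1, 2.9)],
}
index = build_index(shapes)
candidates, matches = search(shapes, index, 0.8, 0.8)
print(sorted(candidates), sorted(matches))
```

The coarse stage discards shape C without ever touching its geometry, while the exact stage removes the false positive A, which shares the cell but does not contain the query point.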
  • #29: For our first use case we want to search for blocks that intersect with a given shape. This is an important feature for us, as the properties most damaged when a hurricane makes landfall are usually the ones closest to the shore. We can index the coastline, but we usually want to see different buffer zones: 1 km, 5 km and so on. So we use the buffer transformation to give us this flexibility.
  • #30: In this second use case, we are interested in finding which blocks are far from any police or fire station. A property far from a fire station will, on average, probably suffer more damage from a fire than one close to a station. Thus we can use that information in evaluating the insurance risk. We consider that beyond an 8-mile radius the fire station response takes longer and is thus less efficient.
  • #31: So we define a rectangular zone of interest in which we find all the fire stations. We then build the 8-mile-radius shapes around the fire stations and merge them. Finally, we search for all the blocks in the rectangular zone that are not within the stations' safer zone.
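The logic of this search can be approximated in memory with the following Python sketch. It is a simplification under stated assumptions: blocks are reduced to their centroids instead of full shapes, distances use the haversine formula on a spherical Earth, and being outside the merged 8-mile buffers is expressed as being more than 8 miles from every station in the zone. The zone, block and station coordinates are invented for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in miles (spherical model)


def distance_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points."""
    phi1, phi2 = radians(a[0]), radians(b[0])
    dphi = phi2 - phi1
    dlmb = radians(b[1] - a[1])
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(h))


def in_zone(point, zone):
    """True if the (lat, lon) point lies inside the rectangular zone."""
    min_lat, min_lon, max_lat, max_lon = zone
    lat, lon = point
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def underserved_blocks(blocks, stations, zone, radius_miles=8.0):
    """Blocks in the zone whose centroid is outside every station's radius."""
    zone_stations = [s for s in stations.values() if in_zone(s, zone)]
    return sorted(
        name for name, centroid in blocks.items()
        if in_zone(centroid, zone)
        and all(distance_miles(centroid, s) > radius_miles for s in zone_stations)
    )


# Hypothetical data: (min_lat, min_lon, max_lat, max_lon) and centroids.
zone = (29.0, -91.0, 31.0, -89.0)
stations = {"fs1": (30.0, -90.0), "fs2": (35.0, -90.0)}
blocks = {
    "b1": (30.05, -90.0),  # about 3.5 miles from fs1
    "b2": (30.50, -90.0),  # about 34.5 miles from fs1
    "b3": (35.00, -90.0),  # outside the zone of interest
}
far_blocks = underserved_blocks(blocks, stations, zone)
print(far_blocks)
```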
  • #32: For the last use case we consider that a hurricane has just hit the US. In this case we use Hurricane Katrina as the example. Every 6 hours we know the location of the hurricane and two zones: the larger zone with medium wind speeds and the smaller zone with the highest wind speeds.
  • #33: In this example we search for all the blocks in our portfolio of insurance that are hit by the hurricane. So we merge the two wind speed zones together, further merge with all the other shapes of the hurricane at different times and then look for the blocks inside.
  • #36: And last but not least, our implementation is completely open source. It is published under the Apache License and it can be found on GitHub. We encourage you to take a look at it, and, of course, any contribution is more than welcome.
  • #37: Thanks for your attention! Any questions?