© 2017 IBM Corporation
Using Pluggable Apache Spark SQL Filters
to help GridPocket users keep up with the
Jones’ (and save the planet)
Paula Ta-Shma,
Guy Gerson
IBM Research
Contact: paula@il.ibm.com
Joint work with:
Tal Ariel, IBM
Filip Gluszak, GridPocket
Gal Lushi, IBM
Yosef Moatti, IBM
Papa Niamadio, GridPocket
Nathaël Noguès, GridPocket
© 2017 IBM Corporation2
You have 5 minutes to pack your bags…
© 2017 IBM Corporation3
?
© 2017 IBM Corporation4
?
© 2017 IBM Corporation
Using Pluggable Apache Spark SQL Filters
to help GridPocket users keep up with the
Jones’ (and save the planet)
Paula Ta-Shma,
Guy Gerson
IBM Research
Contact: paula@il.ibm.com
Joint work with:
Tal Ariel, IBM
Filip Gluszak, GridPocket
Gal Lushi, IBM
Yosef Moatti, IBM
Papa Niamadio, GridPocket
Nathaël Noguès, GridPocket
© 2017 IBM Corporation6
What would convince you to reduce your energy usage ?
#EUres2
© 2017 IBM Corporation7
What would convince you to reduce your energy usage ?
§ Save $ ?
#EUres2
© 2017 IBM Corporation8
What would convince you to reduce your energy usage ?
§ Save $ ?
§ Save the planet ?
#EUres2
© 2017 IBM Corporation9
Competing to Save Energy
§ Research shows that utility
customers are most influenced by
peer pressure to save energy
§ => Help utilities compare their
customers’ energy consumption
with that of their neighbours
– Queries are anonymized
#EUres2
© 2017 IBM Corporation10
GridPocket
§ A smart grid company developing
energy management applications
and cloud services for electricity,
water and gas utilities
§ HQ based in France
§ http://www.gridpocket.com/
§ Developed open source data
generator
§ Provided industry specific use
cases and algorithms
The GridPocket Dataset
§ Our target: 1 million meters
reporting every 15 minutes
§ Records are ~100 bytes
§ Generated ~1TB in 3 months
§ Allowing for 1 order of magnitude
growth in each dimension gives
1PB in 3 months
§ => Use object storage
§ Typically at least order of
magnitude lower storage cost
than NoSQL databases
§ NoSQL dataset – one large
denormalized table containing
meter reading information
– SQL query for nearest neighbours
#EUres2
© 2017 IBM Corporation11
What is Object Storage ?
§ Objects contain data and metadata
§ Written once and not modified
– Cannot append or update data
- Can overwrite completely
- Can update metadata
– No rename operation
§ Accessed through RESTful HTTP
– PUT/GET/POST/DELETE object/bucket
– Flat namespace
§ High capacity, low cost
§ Storage of choice for Big datasets
– Analytics works best on equally sized
objects
§ Examples
– Amazon S3, IBM COS, Google Cloud
Storage, OpenStack Swift
Swift
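The flat namespace plus the '/'-in-name convention can be sketched with a toy in-memory store (a hypothetical Python illustration, not any real client library; stores such as S3 and Swift expose the same listing-by-prefix idea over REST):

```python
class ToyObjectStore:
    """Flat key -> bytes mapping; '/' in keys only simulates folders."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        # PUT overwrites the whole object; there is no append or in-place update.
        self._objects[name] = data

    def get(self, name):
        return self._objects[name]

    def list(self, prefix=""):
        # "Folders" are just a prefix filter over the flat namespace.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("meters/dt=2015-09-14/part-0.csv", b"...")
store.put("meters/dt=2015-09-15/part-0.csv", b"...")
print(store.list("meters/dt=2015-09-14/"))
```

Listing by prefix is what makes Hive-style "virtual folders" work on a store with no real directories.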
© 2017 IBM Corporation12
Object Storage and Spark are separately managed
microservices
© 2017 IBM Corporation13
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
© 2017 IBM Corporation14
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
© 2017 IBM Corporation15
Object Storage and Spark are separately managed
microservices
Want to query data on object storage directly
Our goal:Minimize
1. Number of bytes shipped
2. Number of REST requests
THE key factors affecting cost
(and performance)
© 2017 IBM Corporation16
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
© 2017 IBM Corporation17
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
We want one solution for all data
formats
© 2017 IBM Corporation18
How is this done today ?
1. Use specialized column based
formats such as Parquet, ORC
– Column wise compression
– Column pruning
– Specialized metadata
We want one solution for all data
formats
2. Use Hive Style Partitioning to layout
Object Storage data
– Partition pruning
© 2017 IBM Corporation19
Hive Style Partitioning and Partition Pruning
§ Relies on a dataset naming convention
§ Object storage has a flat namespace, but a virtual folder hierarchy can be created
– Use ‘/’ in the object name
§ Data can be partitioned according to a column e.g. dt
– Information about object contents is encoded in object name e.g. dt=2015-09-14
§ Spark SQL can query fields in object name as well as in data e.g. “dt”
– Filters the objects which need to be read from Object Storage and sent to Spark
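The pruning step described above can be sketched in a few lines (an illustrative Python sketch with hypothetical object names; Spark's actual partition discovery is built into its file catalog):

```python
import re

def partition_value(name, column):
    """Extract a Hive-style partition value, e.g. 'dt=2015-09-14', from an object name."""
    m = re.search(rf"{column}=([^/]+)", name)
    return m.group(1) if m else None

def prune(names, column, wanted):
    """Keep only the objects whose encoded partition value matches the predicate."""
    return [n for n in names if partition_value(n, column) == wanted]


objects = [
    "meters/dt=2015-09-14/part-0.csv",
    "meters/dt=2015-09-15/part-0.csv",
]
print(prune(objects, "dt", "2015-09-14"))
```

Only the surviving names are then fetched from object storage, which is what saves both bytes shipped and GET requests.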
© 2017 IBM Corporation20
Limitations of Today’s Hive Style Partitioning in Spark
§ Only one hierarchy is possible
– Like a database primary key; there are no secondary keys
§ Changing partitioning scheme requires rewriting entire dataset !
– Hierarchy cannot be changed without renaming all objects
§ No range partitioning
– Only supports partitioning on discrete types e.g. Gender: M/F, date, age etc.
– Doesn’t work well for timestamps, arbitrary floating point numbers etc.
§ A deep hierarchy may result in small and non-uniform object sizes
– May reduce performance
© 2017 IBM Corporation21
Can we do more ?
NY
© 2017 IBM Corporation22
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
© 2017 IBM Corporation23
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
Min/max values
© 2017 IBM Corporation24
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
One or more bounding boxes
© 2017 IBM Corporation25
Can we do more ?
§ Generate metadata per
object column and index it
§ Various index types
– Min/max, bounding boxes,
value lists
§ Filter objects according to
this metadata
§ Applies to all formats e.g.
json, csv
NY
A set of values
© 2017 IBM Corporation26
Filter According to Metadata
§ Store one metadata record per object
– Unlike a database's fully inverted index
§ Various index types
– Min and max value for ordered columns
– Bounding boxes for geospatial data
– Bloom filters as space efficient value lists
§ Users can choose which columns to index
and index type per column
– Can index additional columns later on
§ Main requirement: no false negatives
§ Avoids touching irrelevant objects altogether
§ Handles updates (PUT/DELETE object)
– Filtering out irrelevant objects always works
#EUres2
NY
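The no-false-negatives contract for min/max metadata can be sketched as follows (illustrative Python; the column and object names are hypothetical, not from the actual system):

```python
def build_metadata(records, column):
    """One metadata record per object: min and max of an ordered column."""
    values = [r[column] for r in records]
    return {"min": min(values), "max": max(values)}

def may_contain(meta, low, high):
    """No false negatives: skip an object only if its range cannot overlap the query."""
    return not (meta["max"] < low or meta["min"] > high)


objects = {
    "obj1": [{"lat": 40.1}, {"lat": 41.0}],
    "obj2": [{"lat": 43.5}, {"lat": 43.9}],
}
index = {name: build_metadata(recs, "lat") for name, recs in objects.items()}
to_scan = [n for n, m in index.items() if may_contain(m, 43.3, 44.1)]
print(to_scan)  # only obj2 can hold matching rows
```

False positives are acceptable (the object is scanned and its rows filtered normally); false negatives would silently drop results, which is why it is the one hard requirement.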
© 2017 IBM Corporation27
Spark SQL query execution flow
Query
Prune
partitions
Read data
Query
Prune
partitions
Optional file
filter
Read data
Metadata
Filter
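The two-stage flow above, partition pruning followed by the optional file filter, can be sketched as (illustrative Python, not Spark's internal API; the metadata lookup is a stand-in for whatever index the filter consults):

```python
def execute_scan(object_names, partition_pred, file_filter):
    # Stage 1: Hive-style partition pruning on object names.
    candidates = [n for n in object_names if partition_pred(n)]
    # Stage 2: the optional pluggable file filter, e.g. a metadata lookup.
    return [n for n in candidates if file_filter(n)]


names = [
    "meters/dt=2015-09-14/part-0.csv",
    "meters/dt=2015-09-14/part-1.csv",
    "meters/dt=2015-09-15/part-0.csv",
]
# Hypothetical metadata verdicts: "may this object contain relevant rows?"
metadata = {names[0]: True, names[1]: False, names[2]: True}
result = execute_scan(names, lambda n: "dt=2015-09-14" in n, lambda n: metadata[n])
print(result)
```

Only the objects surviving both stages are actually read from object storage.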
© 2017 IBM Corporation28
The Interface
Filters should extend this trait and implement application-specific filtering logic.
trait ExecutionFileFilter {
/**
* @param dataFilters query predicates for data columns
* @param f represents an object which exists in the file catalog
* @return true if the object needs to be scanned during execution
*/
def isRequired(dataFilters: Seq[Filter], f: FileStatus) : Boolean
}
Turn the filter on by specifying the filter class:
sqlContext.setConf("spark.sql.execution.fileFilter", "TestFileFilter")
< 20 lines of code to integrate into Spark
#EUres2
© 2017 IBM Corporation29
Data Ingest: How to best Partition the Data
§ Organizing data in
objects in a smart way
generates more effective
metadata
§ Partition during data
ingestion
§ Example: Kafka allows
user to define partitioning
function
§ Needs to scale
horizontally – want
stateless partitioner
#EUres2
[Diagram: Clients 1–3 write through a stateless partitioning function to Partitions 0–3]
© 2017 IBM Corporation30
Geospatial Partitioning: Grid Partitioner
§ Divide the world map into a grid
– Precision depends on use case
§ Each data point belongs to a cell
§ Each cell is hashed to a partition
§ A partition can contain multiple
cells
§ Partitions periodically generate
objects
§ Each object is described using a
list of bounding boxes
– 1 per participating cell
#EUres2
[Diagram: a world grid with cells 1–8 hashed across Partitions 0–2; each partition periodically writes objects, each described by the bounding boxes of its participating cells]
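The Grid Partitioner steps above can be sketched as (illustrative Python; the real partitioner runs inside the Kafka ingest path, and the function names here are hypothetical):

```python
def cell_of(lat, lng, precision=1.0):
    # Snap a point to a grid cell; precision sets the cell edge in degrees.
    return (int(lat // precision), int(lng // precision))

def partition_of(cell, num_partitions):
    # Stateless: any ingest node maps the same cell to the same partition,
    # so the partitioner scales horizontally with no coordination.
    return hash(cell) % num_partitions

def bounding_box(cell, precision=1.0):
    # The per-cell metadata an object advertises: its lat/lng bounds.
    i, j = cell
    return (i * precision, j * precision, (i + 1) * precision, (j + 1) * precision)


cell = cell_of(43.7, 7.2)
print(cell, partition_of(cell, 100), bounding_box(cell))
```

Because every object's bounding boxes come straight from the cells it covers, a bounding-box query can discard most objects from metadata alone.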
© 2017 IBM Corporation31
Experimental Results
The GridPocket Dataset
§ 1 million meters reporting every 15 minutes
§ Records are ~100 bytes
§ Generated ~1TB in 3 months
§ Partitioned using the Grid Partitioner at precision 1 (cells are roughly 10 km2)
§ Compared using 50 and 100 Kafka partitions
#EUres2
Get my neighbours’ average usage
SELECT AVG(usage)
FROM (
SELECT vid as meter_id,
(MAX(index)-MIN(index)) as usage
FROM dataset
WHERE (lat BETWEEN 43.300 AND 44.100)
AND (lng BETWEEN 6.800 AND 7.600)
GROUP BY vid
)
© 2017 IBM Corporation33
#GB Transferred for bounding box queries
#EUres2
[Bar chart: #GB transferred (scale 0–1200 GB) for 20x20 km, 10x10 km and 5x5 km bounding box queries, no filter vs. filter, with 50 and 100 partitions]
1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
© 2017 IBM Corporation34
#GET Requests for bounding box queries
#EUres2
1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
[Bar chart: #GET requests (scale 0–12000) for 20x20 km, 10x10 km and 5x5 km bounding box queries, no filter vs. filter, with 50 and 100 partitions]
© 2017 IBM Corporation35
Time (sec) for bounding box queries
#EUres2
1 TB dataset, precision 1, average of 10 randomly located queries
[Bar chart: time in seconds (scale 0–2500) for 20x20 km, 10x10 km and 5x5 km bounding box queries, no filter vs. filter, with 50 and 100 partitions]
© 2017 IBM Corporation36
Demo
§ Runs on IOSTACK testbed
– 3 Spark and 3 Object
Storage nodes
§ Demo dataset
– 1 million meters
– Report every 15 mins
– 1 day’s worth of data
– 11 GB (csv)
§ Grid Partitioner Config
– Cells have precision 1
- ~10 km2
– 100 partitions
Get my neighbours’ average usage
SELECT AVG(usage)
FROM (
SELECT vid as meter_id,
(MAX(index)-MIN(index)) as usage
FROM dataset
WHERE (lat BETWEEN 43.300 AND 44.100)
AND (lng BETWEEN 6.800 AND 7.600)
GROUP BY vid
)
© 2017 IBM Corporation37
Conclusions: Don’t let your vacation turn into a relocation
§ Running SQL queries directly
on Big Datasets in Object
Storage is viable using
metadata
§ A small change to Spark
enables dramatic
performance improvements
§ We focused here on
geospatial data and the
GridPocket use case,
although it also applies to
other data types and use
cases
§ IBM is investing in Spark
SQL as a backbone for SQL
processing services in the
cloud
#EUres2
© 2017 IBM Corporation38
Thanks!
Contact:
paula@il.ibm.com
https://www.linkedin.com/in/paulatashma/
guyger@il.ibm.com
https://www.linkedin.com/in/guy-gerson-82619164/
© 2017 IBM Corporation39
Backup
#EUres2
© 2017 IBM Corporation40
GridPocket Use Case and Dataset
§ NoSQL dataset – one large denormalized table
– date, index, sumHC, sumHP, type, vid, size, temp, city, region, lat, lng
– index = meter reading
– sumHC = total energy consumed since midnight during off-peak hours
– sumHP = total energy consumed since midnight during peak hours
– type = elec/gas/water
– vid = meter id
– size = apartment size in square meters
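As a worked illustration of the nearest-neighbours query over this schema, the same max(index) − min(index) aggregation can be written in plain Python (hypothetical sample records; the demo runs this as Spark SQL):

```python
from collections import defaultdict

def neighbours_avg_usage(records, lat_range, lng_range):
    """Mirror of the demo SQL: per-meter usage = max(index) - min(index),
    averaged over the meters inside the bounding box."""
    per_meter = defaultdict(list)
    for r in records:
        if (lat_range[0] <= r["lat"] <= lat_range[1]
                and lng_range[0] <= r["lng"] <= lng_range[1]):
            per_meter[r["vid"]].append(r["index"])
    usages = [max(v) - min(v) for v in per_meter.values()]
    return sum(usages) / len(usages) if usages else 0.0


records = [
    {"vid": "A", "lat": 43.5, "lng": 7.0, "index": 100},
    {"vid": "A", "lat": 43.5, "lng": 7.0, "index": 110},
    {"vid": "B", "lat": 43.6, "lng": 7.1, "index": 50},
    {"vid": "B", "lat": 43.6, "lng": 7.1, "index": 70},
    {"vid": "C", "lat": 48.8, "lng": 2.3, "index": 999},  # outside the box
]
print(neighbours_avg_usage(records, (43.3, 44.1), (6.8, 7.6)))
```

Meter A uses 10, meter B uses 20, meter C is outside the bounding box, so the neighbourhood average is 15.0.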
© 2017 IBM Corporation41
Example Filter Scenario
§ Want to analyze data from active sensors only
§ External DB contains sensor activity info
Data Layout
Archives/dt=01-01-2014
Archives/dt=01-01-2014/sensor1.json (500MB)
Archives/dt=01-01-2014/sensor2.json (500MB)
Archives/dt=01-01-2014/sensor3.json (500MB)
Archives/dt=02-01-2014
Archives/dt=02-01-2014/sensor1.json (500MB)
Archives/dt=02-01-2014/sensor2.json (500MB)
Archives/dt=02-01-2014/sensor3.json (500MB)
more...
#EUres2
sensor active
sensor1 FALSE
sensor2 TRUE
sensor3 FALSE
© 2017 IBM Corporation42
Example Filter
class LiveSensorFilter extends ExecutionFileFilter {
  // get the list of live sensors from an external service
  val activeSensors = SensorService.getLiveSensors

  // returns true if the object represents a live sensor
  override def isRequired(dataFilters: Seq[org.apache.spark.sql.sources.Filter],
                          fileStatus: FileStatus): Boolean = {
    activeSensors.contains(Utils.getSensorIdFromPath(fileStatus))
  }
}
Turn the filter on:
sqlContext.setConf("spark.sql.execution.fileFilter", "LiveSensorFilter")
#EUres2
More Related Content

PDF
Rental Cars and Industrialized Learning to Rank with Sean Downes
PDF
Spark Summit EU talk by Ruben Pulido Behar Veliqi
PDF
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
PDF
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
PDF
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
PDF
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
PDF
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
PDF
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Rental Cars and Industrialized Learning to Rank with Sean Downes
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman...
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...

What's hot (20)

PDF
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
PDF
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
PDF
Spark Summit EU talk by Michael Nitschinger
PDF
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
PDF
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
PDF
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
PDF
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
PDF
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
PDF
Big Telco - Yousun Jeong
PDF
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
PDF
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
PDF
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
PDF
From R Script to Production Using rsparkling with Navdeep Gill
PDF
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
PPTX
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
PDF
Overview of the Hive Stinger Initiative
PDF
Spark Under the Hood - Meetup @ Data Science London
PDF
Art of Feature Engineering for Data Science with Nabeel Sarwar
PPTX
The Past, Present, and Future of Hadoop at LinkedIn
PDF
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Spark Summit EU talk by Michael Nitschinger
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Building a Unified Data Pipeline with Apache Spark and XGBoost with Nan Zhu
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
Big Telco - Yousun Jeong
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
From R Script to Production Using rsparkling with Navdeep Gill
Using SparkR to Scale Data Science Applications in Production. Lessons from t...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Overview of the Hive Stinger Initiative
Spark Under the Hood - Meetup @ Data Science London
Art of Feature Engineering for Data Science with Nabeel Sarwar
The Past, Present, and Future of Hadoop at LinkedIn
Spark Summit EU talk by Patrick Baier and Stanimir Dragiev
Ad

Viewers also liked (10)

PDF
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
PDF
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
PDF
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
PDF
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
PDF
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
PDF
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
PDF
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
PDF
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
PDF
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
PDF
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Near Data Computing Architectures for Apache Spark: Challenges and Opportunit...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
An Adaptive Execution Engine for Apache Spark with Carson Wang and Yucai Yu
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Ad

Similar to Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Jones' (and save the planet) Paula Ta-Shma and Guy Gerson (20)

PDF
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
PDF
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
PDF
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
PDF
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
PDF
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
PDF
A Brave new object store world
PDF
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
PDF
Ingesting Data at Blazing Speed Using Apache Orc
PDF
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
PDF
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
PDF
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
PPTX
IBM THINK 2018 - IBM Cloud SQL Query Introduction
PDF
Why you should care about data layout in the file system with Cheng Lian and ...
PDF
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
PDF
Owning time series with team apache Strata San Jose 2015
PDF
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
PDF
Serverless SQL
PDF
A Java Implementer's Guide to Better Apache Spark Performance
PDF
Apache Spark's Built-in File Sources in Depth
PDF
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
How The Weather Company Uses Apache Spark to Serve Weather Data Fast at Low Cost
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
Advanced Apache Spark Meetup Spark and Elasticsearch 02-15-2016
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
A Brave new object store world
Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich
Ingesting Data at Blazing Speed Using Apache Orc
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
IBM THINK 2018 - IBM Cloud SQL Query Introduction
Why you should care about data layout in the file system with Cheng Lian and ...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Owning time series with team apache Strata San Jose 2015
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Serverless SQL
A Java Implementer's Guide to Better Apache Spark Performance
Apache Spark's Built-in File Sources in Depth
Advanced Apache Spark Meetup Data Sources API Cassandra Spark Connector Spark...

More from Spark Summit (20)

PDF
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
PDF
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
PDF
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
PDF
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
PDF
Apache Spark and Tensorflow as a Service with Jim Dowling
PDF
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
PDF
Next CERN Accelerator Logging Service with Jakub Wozniak
PDF
Powering a Startup with Apache Spark with Kevin Kim
PDF
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
PDF
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
PDF
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
PDF
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
PDF
Goal Based Data Production with Sim Simeonov
PDF
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
PDF
Getting Ready to Use Redis with Apache Spark with Dvir Volk
PDF
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
PDF
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
PDF
Variant-Apache Spark for Bioinformatics with Piotr Szul
PDF
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Next CERN Accelerator Logging Service with Jakub Wozniak
Powering a Startup with Apache Spark with Kevin Kim
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Goal Based Data Production with Sim Simeonov
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Variant-Apache Spark for Bioinformatics with Piotr Szul
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed

Recently uploaded (20)

PPT
Miokarditis (Inflamasi pada Otot Jantung)
PPTX
1_Introduction to advance data techniques.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PDF
Introduction to Business Data Analytics.
PDF
Lecture1 pattern recognition............
PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
PPTX
Database Infoormation System (DBIS).pptx
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
Acceptance and paychological effects of mandatory extra coach I classes.pptx
PPTX
Computer network topology notes for revision
Miokarditis (Inflamasi pada Otot Jantung)
1_Introduction to advance data techniques.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Introduction to Business Data Analytics.
Lecture1 pattern recognition............
climate analysis of Dhaka ,Banglades.pptx
CEE 2 REPORT G7.pptxbdbshjdgsgjgsjfiuhsd
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Reliability_Chapter_ presentation 1221.5784
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Business Ppt On Nestle.pptx huunnnhhgfvu
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
Business Acumen Training GuidePresentation.pptx
Database Infoormation System (DBIS).pptx
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
Acceptance and paychological effects of mandatory extra coach I classes.pptx
Computer network topology notes for revision

Using Pluggable Apache Spark SQL Filters to Help GridPocket Users Keep Up with the Jones' (and save the planet) Paula Ta-Shma and Guy Gerson

  • 1. © 2017 IBM Corporation Using Pluggable Apache Spark SQL Filters to help GridPocket users keep up with the Jones’ (and save the planet) Paula Ta-Shma, Guy Gerson IBM Research Contact: paula@il.ibm.com Joint work with: Tal Ariel, IBM Filip Gluszak, GridPocket Gal Lushi, IBM Yosef Moatti, IBM Papa Niamadio, GridPocket Nathaël Noguès, GridPocket
  • 2. © 2017 IBM Corporation2 You have 5 minutes to pack your bags…
  • 3. © 2017 IBM Corporation3 ?
  • 4. © 2017 IBM Corporation4 ?
  • 5. © 2017 IBM Corporation Using Pluggable Apache Spark SQL Filters to help GridPocket users keep up with the Jones’ (and save the planet) Paula Ta-Shma, Guy Gerson IBM Research Contact: paula@il.ibm.com Joint work with: Tal Ariel, IBM Filip Gluszak, GridPocket Gal Lushi, IBM Yosef Moatti, IBM Papa Niamadio, GridPocket Nathaël Noguès, GridPocket
  • 6. © 2017 IBM Corporation6 What would convince you to reduce your energy usage ? #EUres2
  • 7. © 2017 IBM Corporation7 What would convince you to reduce your energy usage ? § Save $ ? #EUres2
  • 8. © 2017 IBM Corporation8 What would convince you to reduce your energy usage ? § Save $ ? § Save the planet ? #EUres2
  • 9. © 2017 IBM Corporation9 Competing to Save Energy § Research shows that utility customers are most influenced by peer pressure to save energy § => Help utilities compare their customers’ energy consumption with that of their neighbours – Queries are anonymized #EUres2
  • 10. © 2017 IBM Corporation10 GridPocket § A smart grid company developing energy management applications and cloud services for electricity, water and gas utilities § HQ based in France § http://www.gridpocket.com/ § Developed open source data generator § Provided industry specific use cases and algorithms The GridPocket Dataset § Our target: 1 million meters reporting every 15 minutes § Records are ~100 bytes § Generated ~1TB in 3 months § Allowing for 1 order of magnitude growth in each dimension gives 1PB in 3 months § => Use object storage § Typically at least order of magnitude lower storage cost than NoSQL databases § NoSQL dataset – one large denormalized table containing meter reading information – SQL query for nearest neighbours #EUres2
  • 11. © 2017 IBM Corporation11 What is Object Storage ? § Objects contain data and metadata § Written once and not modified – Cannot append or update data - Can overwrite completely - Can update metadata – No rename operation § Accessed through RESTful HTTP – PUT/GET/POST/DELETE object/bucket – Flat namespace § High capacity, low cost § Storage of choice for Big datasets – Analytics works best on equally sized objects § Examples – Amazon S3, IBM COS, Google Cloud Storage, OpenStack Swift Swift
  • 12. © 2017 IBM Corporation12 Object Storage and Spark are separately managed microservices
  • 13. © 2017 IBM Corporation13 Object Storage and Spark are separately managed microservices Want to query data on object storage directly
  • 14. © 2017 IBM Corporation14 Object Storage and Spark are separately managed microservices Want to query data on object storage directly
  • 15. © 2017 IBM Corporation15 Object Storage and Spark are separately managed microservices Want to query data on object storage directly Our goal:Minimize 1. Number of bytes shipped 2. Number of REST requests THE key factors affecting cost (and performance)
  • 16. © 2017 IBM Corporation16 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata
  • 17. © 2017 IBM Corporation17 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata We want one solution for all data formats
  • 18. © 2017 IBM Corporation18 How is this done today ? 1. Use specialized column based formats such as Parquet, ORC – Column wise compression – Column pruning – Specialized metadata We want one solution for all data formats 2. Use Hive Style Partitioning to layout Object Storage data – Partition pruning
  • 19. © 2017 IBM Corporation19 Hive Style Partitioning and Partition Pruning § Relies on a dataset naming convention § Object storage has flat namespace, but can create a virtual folder hierarchy – Use ‘/’ in object name § Data can be partitioned according to a column e.g. dt – Information about object contents is encoded in object name e.g. dt=2015-09-14 § Spark SQL can query fields in object name as well as in data e.g. “dt” – Filters the objects which need to be read from Object Storage and sent to Spark
  • 20. © 2017 IBM Corporation20 Limitations of Today’s Hive Style Partitioning in Spark § Only one hierarchy is possible – Like database primary key, no secondary keys § Changing partitioning scheme requires rewriting entire dataset ! – Hierarchy cannot be changed without renaming all objects § No range partitioning – Only supports partitioning with discrete types e.g. Gender:M/F, date, age etc. – doesn't work well for timestamps, arbitrary floating point numbers etc. § A deep hierarchy may result in small and non uniform object sizes – May reduce performance
  • 21. © 2017 IBM Corporation21 Can we do more ? NY
  • 22. © 2017 IBM Corporation22 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY
  • 23. © 2017 IBM Corporation23 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY Min/max values
  • 24. © 2017 IBM Corporation24 Can we do more ? § Generate metadata per object column and index it § Various index types – Min/max, bounding boxes, value lists § Filter objects according to this metadata § Applies to all formats e.g. json, csv NY One or more bounding boxes
  • 25. © 2017 IBM Corporation25 Can we do more?
  § Generate metadata per object column and index it
  § Various index types
  – Min/max, bounding boxes, value lists
  § Filter objects according to this metadata
  § Applies to all formats, e.g. json, csv
  [map: NY – a set of values]
  • 26. © 2017 IBM Corporation26 Filter According to Metadata
  § Store one metadata record per object
  – Unlike a database's fully inverted index
  § Various index types
  – Min and max values for ordered columns
  – Bounding boxes for geospatial data
  – Bloom filters as space-efficient value lists
  § Users can choose which columns to index and the index type per column
  – Can index additional columns later on
  § Main requirement: no false negatives
  § Avoids touching irrelevant objects altogether
  § Handles updates (PUT/DELETE object)
  – Filtering out irrelevant objects always works
  #EUres2
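As a sketch of the min/max index type described above (the column name, object names and values are invented for illustration): one small metadata record per object lets a query skip any object whose value range provably misses the query range, and the check can never produce a false negative:

```python
def build_metadata(records, column):
    """One metadata record per object: min and max of an ordered column."""
    values = [r[column] for r in records]
    return {"min": min(values), "max": max(values)}

def may_contain(meta, low, high):
    """No false negatives: skip an object only when its [min, max]
    range cannot intersect the query range [low, high]."""
    return not (meta["max"] < low or meta["min"] > high)

# Two objects' contents, indexed on the "temp" column
obj_a = [{"temp": 3}, {"temp": 9}]
obj_b = [{"temp": 21}, {"temp": 30}]
catalog = {"a.json": build_metadata(obj_a, "temp"),
           "b.json": build_metadata(obj_b, "temp")}

# Query: WHERE temp BETWEEN 10 AND 25 — only b.json must be scanned
required = [name for name, meta in catalog.items() if may_contain(meta, 10, 25)]
```

A false positive (scanning an object that turns out to have no matching rows) only costs time; a false negative would silently drop results, which is why it is the hard requirement.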
  • 27. © 2017 IBM Corporation27 Spark SQL query execution flow
  [diagram]
  – Before: Query → Prune partitions → Read data
  – After: Query → Prune partitions → Optional file filter (uses metadata) → Read data
  • 28. © 2017 IBM Corporation28 The Interface
  Filters should extend this trait and implement application-specific filtering logic.

    trait ExecutionFileFilter {
      /**
       * @param dataFilters query predicates for data columns
       * @param f represents an object which exists in the file catalog
       * @return true if the object needs to be scanned during execution
       */
      def isRequired(dataFilters: Seq[Filter], f: FileStatus): Boolean
    }

  Turn the filter on by specifying the filter class:

    sqlContext.setConf("spark.sql.execution.fileFilter", "TestFileFilter")

  < 20 lines of code to integrate into Spark
  #EUres2
  • 29. © 2017 IBM Corporation29 Data Ingest: How Best to Partition the Data
  § Organizing data into objects in a smart way generates more effective metadata
  § Partition during data ingestion
  § Example: Kafka allows the user to define a partitioning function
  § Needs to scale horizontally – we want a stateless partitioner
  [diagram: Clients 1-3 writing to Partitions 0-3]
  #EUres2
  • 30. © 2017 IBM Corporation30 Geospatial Partitioning: Grid Partitioner
  § Divide the world map into a grid
  – Precision depends on the use case
  § Each data point belongs to a cell
  § Each cell is hashed to a partition
  § A partition can contain multiple cells
  § Partitions periodically generate objects
  § Each object is described using a list of bounding boxes
  – 1 per participating cell
  [diagram: grid cells 1-8 hashed across Partitions 0-2, each partition writing objects]
  #EUres2
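The grid partitioner steps above can be sketched as follows (the cell size, rounding scheme and hash-based placement are illustrative assumptions, not GridPocket's exact code). Snapping a point to its cell before hashing is what keeps the partitioner stateless: every ingest node maps the same point to the same partition with no coordination:

```python
def cell(lat, lng, precision=1):
    """Snap a point to its grid cell; precision 1 gives 0.1-degree cells."""
    return (round(lat, precision), round(lng, precision))

def partition_for(lat, lng, num_partitions, precision=1):
    """Stateless placement: hash the cell, not the raw point,
    so nearby points land in the same partition."""
    return hash(cell(lat, lng, precision)) % num_partitions

def cell_bounding_box(c, precision=1):
    """Bounding box covering one cell; an object's metadata is a
    list of these, one per cell that contributed data to it."""
    half = 10 ** (-precision) / 2
    lat, lng = c
    return (lat - half, lng - half, lat + half, lng + half)
```

A geospatial query then intersects its own bounding box with each object's list of cell boxes, skipping objects with no overlap.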
  • 31. © 2017 IBM Corporation31 Experimental Results
  The GridPocket Dataset
  § 1 million meters reporting every 15 minutes
  § Records are ~100 bytes
  § Generated ~1TB in 3 months
  § Partitioned using the Grid Partitioner with precision 1 (cells are roughly 10 km2)
  § Compared using 50 and 100 Kafka partitions

  Get my neighbours' average usage:

    SELECT AVG(usage) FROM (
      SELECT vid AS meter_id, (MAX(index) - MIN(index)) AS usage
      FROM dataset
      WHERE (lat BETWEEN 43.300 AND 44.100)
        AND (lng BETWEEN 6.800 AND 7.600)
      GROUP BY vid
    )

  #EUres2
  • 33. © 2017 IBM Corporation33 #GB Transferred for bounding box queries
  [chart: #GB transferred, filter vs. no filter, for 20x20 km, 10x10 km and 5x5 km queries, with 50 and 100 partitions]
  – 1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
  #EUres2
  • 34. © 2017 IBM Corporation34 #GET Requests for bounding box queries
  [chart: #GET requests, filter vs. no filter, for 20x20 km, 10x10 km and 5x5 km queries, with 50 and 100 partitions]
  – 1 TB dataset, grid partitioner, precision 1, average of 10 randomly located queries
  #EUres2
  • 35. © 2017 IBM Corporation35 Time (sec) for bounding box queries
  [chart: query time in seconds, filter vs. no filter, for 20x20 km, 10x10 km and 5x5 km queries, with 50 and 100 partitions]
  – 1 TB dataset, precision 1, average of 10 randomly located queries
  #EUres2
  • 36. © 2017 IBM Corporation36 Demo
  § Runs on the IOSTACK testbed
  – 3 Spark and 3 Object Storage nodes
  § Demo dataset
  – 1 million meters
  – Reporting every 15 mins
  – 1 day's worth of data
  – 11 GB (csv)
  § Grid Partitioner config
  – Cells have precision 1 – ~10 km2
  – 100 partitions

  Get my neighbours' average usage:

    SELECT AVG(usage) FROM (
      SELECT vid AS meter_id, (MAX(index) - MIN(index)) AS usage
      FROM dataset
      WHERE (lat BETWEEN 43.300 AND 44.100)
        AND (lng BETWEEN 6.800 AND 7.600)
      GROUP BY vid
    )
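For intuition, the demo query's logic can be reproduced over plain Python records (field names follow the dataset schema in the backup slides; the sample readings are invented): per meter, usage is the spread between the largest and smallest meter index seen inside the bounding box, and the result averages that spread across meters:

```python
def neighbours_average_usage(records, lat_range, lng_range):
    # WHERE clause: keep only readings inside the bounding box
    in_box = [r for r in records
              if lat_range[0] <= r["lat"] <= lat_range[1]
              and lng_range[0] <= r["lng"] <= lng_range[1]]
    # GROUP BY vid: track min and max meter index per meter
    per_meter = {}
    for r in in_box:
        lo, hi = per_meter.get(r["vid"], (r["index"], r["index"]))
        per_meter[r["vid"]] = (min(lo, r["index"]), max(hi, r["index"]))
    # AVG(MAX(index) - MIN(index)) over all meters in the box
    usages = [hi - lo for lo, hi in per_meter.values()]
    return sum(usages) / len(usages) if usages else None

readings = [
    {"vid": "m1", "index": 100, "lat": 43.5, "lng": 7.0},
    {"vid": "m1", "index": 110, "lat": 43.5, "lng": 7.0},
    {"vid": "m2", "index": 200, "lat": 43.6, "lng": 7.1},
    {"vid": "m2", "index": 230, "lat": 43.6, "lng": 7.1},
    {"vid": "m3", "index": 999, "lat": 48.8, "lng": 2.3},  # outside the box
]
avg = neighbours_average_usage(readings, (43.300, 44.100), (6.800, 7.600))
```

Because a meter's index is cumulative, max minus min over the queried window is the energy consumed in that window, which is why the query needs no per-interval deltas.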
  • 37. © 2017 IBM Corporation37 Conclusions: Don't let your vacation turn into a relocation
  § Running SQL queries directly on big datasets in Object Storage is viable using metadata
  § A small change to Spark enables dramatic performance improvements
  § We focused here on geospatial data and the GridPocket use case, but the approach also applies to other data types and use cases
  § IBM is investing in Spark SQL as a backbone for SQL processing services in the cloud
  #EUres2
  • 38. © 2017 IBM Corporation38 Thanks! Contact: paula@il.ibm.com https://www.linkedin.com/in/paulatashma/ guyger@il.ibm.com https://www.linkedin.com/in/guy-gerson-82619164/
  • 39. © 2017 IBM Corporation39 Backup #EUres2
  • 40. © 2017 IBM Corporation40 GridPocket Use Case and Dataset
  § NoSQL dataset – one large denormalized table
  – date, index, sumHC, sumHP, type, vid, size, temp, city, region, lat, lng
  – index = meter reading
  – sumHC = total energy consumed since midnight during off-peak hours
  – sumHP = total energy consumed since midnight during peak hours
  – type = elec/gas/water
  – vid = meter id
  – size = apartment size in square meters
  • 41. © 2017 IBM Corporation41 Example Filter Scenario
  § Want to analyze data from active sensors only
  § External DB contains sensor activity info:
      sensor   active
      sensor1  FALSE
      sensor2  TRUE
      sensor3  FALSE

  Data Layout:
    Archives/dt=01-01-2014
    Archives/dt=01-01-2014/sensor1.json (500MB)
    Archives/dt=01-01-2014/sensor2.json (500MB)
    Archives/dt=01-01-2014/sensor3.json (500MB)
    Archives/dt=02-01-2014
    Archives/dt=02-01-2014/sensor1.json (500MB)
    Archives/dt=02-01-2014/sensor2.json (500MB)
    Archives/dt=02-01-2014/sensor3.json (500MB)
    more...
  #EUres2
  • 42. © 2017 IBM Corporation42 Example Filter

    class LiveSensorFilter extends ExecutionFileFilter {
      // get the list of live sensors from an external service
      val activeSensors = SensorService.getLiveSensors

      // returns true if the object represents a live sensor
      override def isRequired(dataFilters: Seq[org.apache.spark.sql.sources.Filter],
                              fileStatus: FileStatus): Boolean = {
        activeSensors.contains(Utils.getSensorIdFromPath(fileStatus))
      }
    }

  Turn the filter on:

    sqlContext.setConf("spark.sql.execution.fileFilter", "LiveSensorFilter")

  #EUres2