Optimising geospatial queries
with dynamic file pruning
Matthew Slack
Principal Data Architect - Wejo
Agenda
Wejo
Typical use cases
Data lake indexing strategies
Dynamic file pruning
Z-ordering strategies
Measuring effectiveness of file pruning
Optimizing queries to activate file pruning
Wejo
Processing data from over 15m vehicles
OEM A
Streamed
OEM A
Micro-batch (Ign OFF)
OEM B - n
Streamed
15m+ Vehicles
18 billion data points per day
0.4 trillion per month
Peak ~900,000 per second
5 PB of data (and growing)
Wejo and Databricks
[Architecture diagram: OEM data arrives from on-prem, AWS, Google Cloud, and Microsoft Azure sources via Ingress (Transform, Filter & Aggregate, Stream) into the Core Data Lake (raw assets, aggregates, bespoke DS output), and flows out via Egress: Adept Stream, BI and dashboards, batch insights, bespoke insights, sample preview and generation, BI datamart, data aggregates, Infosec tooling, and a customer Portal. Supporting functions: data analysis, data science, data governance, incident management, change and release, 24x7 monitoring and alerting, SecOps, DevOps, Infosec compliance.]
Databricks underpins our ad-hoc analysis of data for data science.
We use it to populate both the datamart for BI and the derived data used in WIM.
Introduction of Delta Lake to support geospatial workloads and CCPA and GDPR.
Typical use cases
▪ Traffic intelligence
▪ Map matching
▪ Anomaly detection
▪ Journey intelligence
▪ Origin-destination analysis
All require both geospatial and temporal indexing
Spark geospatial support
Magellan, GeoPandas, and many others…
▪ Many existing libraries and frameworks
▪ However…
▪ Most are designed to efficiently join geospatial datasets that are already loaded to the cluster
▪ They do not optimize the reading of data from underlying storage
▪ GeoMesa can be configured to utilise the underlying partitioning on disk, but this is complex
Data Lake Indexing Strategies (Recap)
Data lake partitioning
▪ Choice of partition columns in data lakes is always a trade-off
▪ Partitioning in data lakes is usually done using a hierarchical directory structure
year=2020
• month=11
• day=17
• day=18
ingest_date=2020-11-17
• ingest_hour=10
• ingest_hour=11
country=US
• state=NY
• state=TX
ingest_date=2020-11-17
• state=NY
[Diagram: partitioning is a trade-off between complexity and selectivity, balanced against the number of partitions (< 10k) and the size of each partition (> 10s of MB)]
Dynamic partition pruning
▪ Allows partitions to be dynamically skipped at query runtime
▪ Partitions that do not match the query filters are excluded by the
optimizer
▪ Significantly improved in Spark 3.0
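At its core, partition pruning amounts to filtering Hive-style partition paths against the query predicate before any file is opened. A minimal sketch in plain Python (hypothetical helper names, not Spark internals):

```python
# Hypothetical sketch: partition pruning as path filtering. Each directory
# level encodes a partition column value; directories whose values fail the
# query predicate are never listed or read.
def parse_partition(path):
    """Extract key=value pairs from a Hive-style partition path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts

def prune_partitions(paths, predicate):
    """Keep only paths whose partition values satisfy the predicate."""
    return [p for p in paths if predicate(parse_partition(p))]

paths = [
    "ingest_date=2020-11-17/ingest_hour=10/part-0.parquet",
    "ingest_date=2020-11-17/ingest_hour=11/part-0.parquet",
    "ingest_date=2020-11-18/ingest_hour=10/part-0.parquet",
]
survivors = prune_partitions(
    paths, lambda p: p["ingest_date"] == "2020-11-17" and p["ingest_hour"] == "10"
)
print(survivors)  # only the 2020-11-17 / hour 10 file remains
```

Dynamic partition pruning does the same thing, except the predicate values are only discovered at runtime (e.g. from the build side of a join).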
Data skipping via effective partitioning strategies, combined with columnar file formats such as Parquet and ORC, was historically the main lever for optimizing a data lake
Dynamic File Pruning
Dynamic file pruning overview
▪ Introduced with Databricks Delta Lake in early 2020
▪ Allows files to be dynamically skipped within a partition
▪ Files that do not match the query filters are excluded by the
optimizer
▪ Databricks collects metadata on a subset of columns in all files
added to the dataset, stored in _delta_log folder
▪ Relies on the data having been pre-sorted into files
Anatomy of _delta_log/00000000000000000000.json
delta.dataSkippingNumIndexedCols
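The per-file min/max statistics are what make file pruning possible: a file whose value range cannot overlap the query filter is never read. A sketch of the mechanism with illustrative data (not the actual _delta_log schema):

```python
# Illustrative per-file statistics, in the spirit of what Delta records in
# _delta_log for its first dataSkippingNumIndexedCols columns.
files = [
    {"path": "part-0.parquet", "min_lon": -98.2, "max_lon": -97.5},
    {"path": "part-1.parquet", "min_lon": -97.9, "max_lon": -97.6},
    {"path": "part-2.parquet", "min_lon": -96.0, "max_lon": -95.1},
]

def files_to_scan(files, lo, hi):
    """A file is skipped when its [min, max] range cannot overlap [lo, hi]."""
    return [f["path"] for f in files if not (f["max_lon"] < lo or f["min_lon"] > hi)]

scanned = files_to_scan(files, -98.0, -97.7)
print(scanned)  # part-2 is skipped: its longitudes lie outside the filter range
```

This is also why pre-sorting matters: without it, every file's min/max range spans the whole dataset and nothing can be skipped.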
Test environment
▪ Test cluster
▪ Spark 2.4.5, Databricks Runtime 6.6
▪ i3.4xlarge x 10 executors
▪ Auto scaling off
▪ CLEAR CACHE before each test
▪ Input dataset
▪ One day of connected car data (6th March)
▪ Nested parquet in AWS S3
▪ Over 16 billion rows, ~40 columns
▪ 2491 files, ~1.4 TB data, ~0.5GB per file
▪ Partition strategy
▪ ingest_date, ingest_hour
Naïve example on un-optimized Parquet
▪ Generate all geohashes that
cover the Austin polygon
▪ Spark scans all files that
match the partition filters
▪ Just over 97M datapoints are
covered by the geohashes
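The geohash generation step can be sketched with a minimal encoder (this is the standard geohash algorithm; the Austin coordinates are illustrative):

```python
# Minimal geohash encoder: alternately bisect longitude and latitude,
# emitting one bit per bisection, 5 bits per base-32 character.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, ch, even, code = 0, 0, True, []
    while len(code) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        ch <<= 1
        if val >= mid:
            ch |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)

# Downtown Austin, TX falls in the "9v6…" cells
print(geohash_encode(30.2672, -97.7431))
```

Covering the Austin polygon means enumerating every cell like this that intersects the polygon, then filtering the table on those cell values.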
However…
▪ Slow… over 5 minutes
▪ Datapoints spread randomly
across all of the 2491 input
files
Z-Ordering is a technique to colocate related information in the
same set of files. This co-locality is automatically used by Delta
Lake on Databricks data-skipping algorithms to dramatically
reduce the amount of data that needs to be read.
Databricks
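The co-location idea behind Z-ordering can be illustrated with a Morton curve in plain Python. This is a sketch of the principle, not the Delta implementation: interleaving the bits of two quantized coordinates gives nearby points nearby sort keys, so they land in the same files.

```python
# Sketch of Z-ordering: interleave the bits of quantized (lon, lat) so that
# sorting by the resulting key co-locates spatially nearby points.
def z_order_key(x, y, bits=16):
    """Interleave the bits of two non-negative integers (Morton code)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def quantize(value, lo, hi, bits=16):
    """Map a coordinate into [0, 2^bits) for interleaving."""
    return int((value - lo) / (hi - lo) * ((1 << bits) - 1))

# Two nearby points in Austin get close keys; a point in NYC does not.
austin_a = z_order_key(quantize(-97.74, -180, 180), quantize(30.27, -90, 90))
austin_b = z_order_key(quantize(-97.75, -180, 180), quantize(30.26, -90, 90))
nyc = z_order_key(quantize(-74.01, -180, 180), quantize(40.71, -90, 90))
print(abs(austin_a - austin_b) < abs(austin_a - nyc))  # True: keys cluster by locality
```

After sorting by such a key, each file covers a tight range of both longitude and latitude, which is exactly what the min/max statistics need in order to skip files.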
Z-Ordering the data
Convert to Delta, then OPTIMIZE
▪ New dataset
▪ 6233 files, ~0.8 TB data, 128MB per file
▪ Partition strategy
▪ ingest_date, ingest_hour
▪ Z-ordered columns
▪ longitude, latitude
Co-location of
geospatial data
▪ Spread of data across files,
after Z-ORDER by geohash
Measuring file pruning
▪ input_file_name
▪ EXPLAIN
Measuring file pruning – continued
df.queryExecution.optimizedPlan.collect {
  // WARNING: this info will not be available in DBR >= 7.0
  case DeltaTable(prepared: PreparedDeltaFileIndex) =>
    val stat = prepared.preparedScan
    // % of compressed bytes skipped, relative to what the partition filters selected
    val skippingPct = 100 -
      (stat.scanned.bytesCompressed.get.toDouble /
        partitionReadBytes.get) * 100
}
Candidate z-order
columns
▪ Longitude/latitude
▪ Geohash
▪ Zipcode
▪ State
▪ H3 (Uber - https://eng.uber.com/h3/)
▪ S2 (Google - https://s2geometry.io/)
Z-ordering by one geospatial column
WILL also enable dynamic file
pruning for queries that filter on a
different geospatial column
For example, z-ordering on geohash
will also improve performance for
queries on zipcode or state
Comparing performance
(rows: columns Z-ordered; columns: query filter type; — = not measured)

| Columns Z-Ordered | INNER JOIN (with broadcast) | geohash range | h3 range | longitude/latitude range | state | zipcode range (10) | make/model range |
|---|---|---|---|---|---|---|---|
| longitude, latitude | 5.55 mins, 0% skipping | 33s, 82% skipping | 30s, 81% skipping | 12s, 93% skipping | 26s, 74% skipping | 8s, 97% skipping | 1.6 mins, 0% skipping |
| h3 | — | 27s, 80% skipping | 14s, 96% skipping | 18s, 92% skipping | 38s, 68% skipping | 11s, 95% skipping | 1.7 mins, 0% skipping |
| geohash | — | 12s, 98% skipping | 16s, 84% skipping | 16s, 92% skipping | 20s, 76% skipping | 7s, 97% skipping | 1.3 mins, 0% skipping |
| make/model | — | 1.4 mins, 0% skipping | 1.4 mins, 0% skipping | 1.4 mins, 0% skipping | — | — | 20s, 84% skipping |
| geohash, longitude, latitude | — | 27s, 90% skipping | 23s, 82% skipping | 23s, 91% skipping | — | — | — |
| geohash, make/model | — | 17s, 83% skipping | 28s, 55% skipping | 35s, 58% skipping | 43s, 38% skipping | 15s, 84% skipping | 36s, 61% skipping |
| s2 | — | 17s, 83% skipping | 36s, 82% skipping | 22s, 91% skipping | 30s, 73% skipping | 11s, 96% skipping | 1.8 mins, 0% skipping |
Spot the difference
Only one of these queries activates dynamic file pruning
Check the query plan!
Query caveats
▪ Only certain query filters will activate file pruning
Activates:
• Simple filters, e.g. =, >, <, BETWEEN
• IN (with up to 10 values)
• INNER JOIN, but:
  • must be BROADCAST
  • must be included in the WHERE clause
Doesn't activate:
• RLIKE
• IN (with over 10 values)
Skipping the OPTIMIZE step
▪ OPTIMIZE is effective but expensive
▪ Requires an additional step in your data pipelines
▪ repartitionByRange
▪ works for batch and stream
▪ does not ZORDER, but does shuffle datapoints into files based on the selected
columns
▪ can use a column such as geohash, which is already a z-order over the data
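A rough illustration of what repartitionByRange achieves, in pure Python (hypothetical helper; Spark samples the data to pick range boundaries rather than sorting everything):

```python
# Sketch: range partitioning over a geohash sort key. Each output "file"
# receives a contiguous key range, so its min/max geohash stats stay tight
# and prunable - without a separate OPTIMIZE ... ZORDER BY step.
import bisect

def range_partition(keys, num_files):
    """Assign each key to a file using equi-spaced boundaries over sorted keys."""
    srt = sorted(keys)
    step = len(srt) // num_files
    boundaries = [srt[i * step] for i in range(1, num_files)]
    return {k: bisect.bisect_right(boundaries, k) for k in keys}

geohashes = ["9v6kp", "9v6kq", "9v6m1", "dr5ru", "dr5rv", "dr72h"]
assignment = range_partition(geohashes, 2)
print(assignment)  # Austin-area cells ("9v…") and NYC-area cells ("dr…") separate
```

Because a geohash is itself a bit-interleaving of longitude and latitude, range-partitioning on it gives the same spatial co-location that an explicit Z-ORDER would.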
Process overview
▪ Importing data
  • Import data in stream or batch
  • Process: add geohash column (implicit Z-ORDER)
  • repartitionByRange (requires shuffle)
  • Write to parquet (non-delta table)
  • Import to delta (does not require shuffle)
▪ Querying data
  • Convert all geospatial queries to a set of covering geohashes
  • Lookup into table using a BROADCAST join
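The querying side reduces to a membership test of each row's geohash against the covering cells, which is what the broadcast join performs. A sketch with illustrative cell values:

```python
# Sketch: a polygon/bbox query is first converted to a set of covering
# geohash cells; the broadcast join then reduces to a hash lookup of each
# row's cell in that set. Cell values here are illustrative.
rows = [
    {"vehicle": "a", "geohash": "9v6kp"},
    {"vehicle": "b", "geohash": "9v6kq"},
    {"vehicle": "c", "geohash": "dr5ru"},
]
covering_cells = {"9v6kp", "9v6kq", "9v6m1"}  # cells covering the query polygon

matched = [r["vehicle"] for r in rows if r["geohash"] in covering_cells]
print(matched)  # only vehicles inside the covering cells survive
```

Because the join key is the same column the files were sorted by, the broadcast join's filter values also drive dynamic file pruning against the geohash min/max statistics.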
Up to 100x reduction in query read times from object storage (S3)
With the right query adjustments, dynamic file
pruning is a very effective tool for optimizing
geospatial queries which read from object
storage such as S3
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.