Scalable Air Quality Analytics with Apache Spark and Apache Sedona

DEBS’2021 Grand Challenge
June 28-July 2, 2021, Virtual Event, Italy
Scalable Analytics of Air Quality Batches
with Apache Spark and Apache Sedona
Dr. Rim Moussa
University of Carthage

Context
Air pollution: particulate matter (PM2.5 and PM10), noxious gases … affect human and animal health and the earth's ecosystems (lakes, streams, and soils) (NIEHS).
● Sensing the air quality: Luftdaten
● Collected data is big data (volume and velocity)
● Big data frameworks (Spark, Flink, ...); efficient algorithms to query data
● Decision making: analyze the collected data and extract insights

Outline
1. DEBS'2021 Contest
2. Solution Overview
3. Q1: Top K cities (year-to-year comparison)
4. Q2: Longest Streaks Calculus
5. Conclusion
6. Future Work

Solution Overview
Open-source frameworks
● Apache Spark: allows in-memory processing across a cluster of machines.
○ Fault tolerance: if one node fails, the failed tasks are redistributed across the other nodes.
○ Scalability: the cluster scales horizontally with no downtime.
● Apache Sedona
○ Reverse geocoding and other spatial operations
○ Creation of spatial indexes over large-scale spatial data
Solution Design
● SparkSQL, Dataset<Row>: Q1 and active cities' tracking
● Workflow and JavaPairRDDs: Q2

Data
RDDs and Dataset<Row>:
● Batch
● German spatial shapes (8k polygons)
● active-cities-lastTSs
● current-year-AQI-summary
● previous-year-AQI-summary
● cities-streaks-summaries

Q1: Top k cities with respect to yearly AQI improvement
1. Batch: measurements
2. Batch is timestamped (ingestion-TS)
3. Reverse geocoding: (longitude, latitude) --> city, through a spatial join using Apache Sedona
4. Refresh cities lastTSs, in order to track active cities
5. AQI calculus
● Calculate AVG, SUM, COUNT of each pollutant per city and date
● Calculate max(AQI) for each city and date, and the yearly AQI improvement
● Rank active cities
6. Calculate summaries over batches
● Current-Year-Summary and Previous-Year-Summary save SUM, COUNT of pollutants per city and date (these summaries are used to run Q1 on snapshots)
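The summaries over batches keep SUM and COUNT rather than AVG because averages cannot be merged across batches, whereas sums and counts can simply be added. The following is a minimal plain-Python sketch of that idea, not the Spark implementation from the talk. The function names, the dictionary layout keyed by (city, date), and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def merge_summaries(summary, batch):
    """Merge two {(city, date): (sum, count)} summaries of one pollutant.

    Hypothetical helper: the data layout is an assumption, not the talk's schema.
    """
    merged = defaultdict(lambda: (0.0, 0))
    for part in (summary, batch):
        for key, (s, c) in part.items():
            ms, mc = merged[key]
            merged[key] = (ms + s, mc + c)  # sums and counts are additive
    return dict(merged)

def averages(summary):
    """Derive AVG per (city, date) from the merged SUM and COUNT."""
    return {key: s / c for key, (s, c) in summary.items() if c > 0}

# Illustrative values only (not taken from the contest data).
current_year = {("Mainz", "2021-05-19"): (30.0, 3)}
new_batch = {("Mainz", "2021-05-19"): (10.0, 1),
             ("Berlin", "2021-05-19"): (8.0, 2)}
refreshed = merge_summaries(current_year, new_batch)
print(averages(refreshed))
```

The same merge applies to the previous-year summary, so Q1 can be answered on any snapshot without revisiting earlier batches.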

Active Cities Tracking
Batch: measurements; the batch is timestamped (ingestion-TS)
REFRESH cities-last-TSs
● Case 1: new city data, no match for the city in cities-last-TSs
--> Add the entry <city_i, batch-ingestion-TS> to cities-last-TSs
● Case 2: a match exists
--> Update the TS of entry city_i: set last-TS to batch-ingestion-TS
● Case 3: no match exists for the city in the received batch
--> The TS of entry city_i is unchanged

Implementation
○ Apache SparkSQL for Q1 and Active Cities Tracking
○ Algebraic expression of Active Cities Tracking:
$$ {}_{city}G_{\max(lastTS)}\Big(\pi_{city,\ ingestion\text{-}TS \rightarrow lastTS}(\text{CY-batch-measurements}) \cup \text{cities-last-TSs}\Big) $$
where ∪ denotes "union all" (duplicates are not checked) and G denotes "group by".
Example
Batch ingestion-TS: 1621416077911
● Entries marked ❋ correspond to the cities Mainz, Magdeburg and Langenau. They are either new entries or were updated, and they have the batch ingestion-TS as last-TS.
● Entries marked ☘ correspond to the cities Ludwigsfelde and Berlin Grunewald, whose last-TS is unchanged, i.e. no measurement in the received batch corresponds to these two cities.
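The refresh of cities-last-TSs can be sketched as a "union all" followed by a group-by with max, which covers the three cases at once. This is a plain-Python illustration under stated assumptions, not the SparkSQL code from the talk: the function name is hypothetical, and the earlier last-TS values in the sample data are invented (only the batch ingestion-TS 1621416077911 comes from the example).

```python
def refresh_last_ts(cities_last_ts, batch_cities, batch_ingestion_ts):
    """Return the refreshed {city: lastTS} map.

    cities_last_ts: existing {city: lastTS} entries
    batch_cities:   cities seen in the received batch (duplicates allowed)
    """
    # Projection: every batch measurement becomes (city, ingestion-TS as lastTS).
    rows = [(city, batch_ingestion_ts) for city in batch_cities]
    # Union all: append the existing entries without removing duplicates.
    rows += list(cities_last_ts.items())
    # Group by city, keep max(lastTS).
    refreshed = {}
    for city, ts in rows:
        refreshed[city] = max(ts, refreshed.get(city, ts))
    return refreshed

BATCH_TS = 1621416077911
previous = {  # hypothetical earlier timestamps
    "Mainz": 1621416000000,
    "Berlin Grunewald": 1621415500000,
}
batch = ["Mainz", "Magdeburg", "Langenau", "Mainz"]
print(refresh_last_ts(previous, batch, BATCH_TS))
```

Magdeburg and Langenau are added (case 1), Mainz is updated (case 2), and Berlin Grunewald keeps its last-TS (case 3).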

Q1: screenshots
[Screenshots: current-year-AQI-summary; Top 20]

Q2: Longest Streaks' Calculus
A streak is defined as a time duration during which a city has a good air quality index.
Streak: start timestamp, end timestamp, duration Δt = end-TS − start-TS
Cities-Streaks-Summaries stores, for each city_i:
● Best Streak: the best streak ever received, to compare with future streaks
● Last Streak: might be merged with the next streak if they are sequential in time

1. COMPUTE batch-streaks-summary-per-city
Each new batch triggers the refresh of city-streaks-summaries
○ Measurements are grouped by city
○ For each city, we summarize all measurements as follows:
Batch streaks' summary per city
● First Streak: to merge with last-streak
● Middle Best Streak: to compare with best-streak
● Last Streak: to merge with the upcoming first-streak
● Pattern Identifier: patterns' enumeration for simple processing
2. REFRESH Cities-streaks-summaries

Pattern identifier
Data extract (N pairs of <TS, good>) for each city: event TSs ts1 ... ts10 and their good boolean values

Pattern identifier | good values for ts1 ... ts10 | First streak | Best mid streak | Last streak
0 (base2: 000) | 0 0 0 0 0 0 0 0 0 0 | null | null | null
1 (base2: 001) | 0 0 0 0 0 0 0 1 1 1 | null | null | Streak(ts8, ts10, Δt)
2 (base2: 010) | 0 1 1 0 1 1 1 1 1 0 | null | Streak(ts5, ts9, Δt) | null
⋮ | ⋮ | ⋮ | ⋮ | ⋮
7 (base2: 111) | 1 1 0 0 1 1 1 0 1 1 | Streak(ts1, ts2, Δt) | Streak(ts5, ts7, Δt) | Streak(ts9, ts10, Δt)
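The pattern examples above can be reproduced with a short sketch that scans the (TS, good) pairs of one city and returns the triplet together with its 3-bit pattern identifier (first = 4, best mid = 2, last = 1). This is a plain-Python illustration, not the JavaPairRDD code from the talk. The function name and tuple layout are assumptions, and the treatment of a run that covers the whole batch (reported here as both first and last streak) is a guess, since the slides do not show that case.

```python
def summarize_streaks(samples):
    """samples: list of (ts, good) pairs sorted by ts.

    Returns (pattern_id, first, best_mid, last); each streak is a
    (start_ts, end_ts, duration) tuple or None.
    """
    runs, start, prev = [], None, None
    for ts, good in samples:
        if good:
            if start is None:
                start = ts
            prev = ts
        elif start is not None:
            runs.append((start, prev, prev - start))  # close the current run
            start = None
    if start is not None:
        runs.append((start, prev, prev - start))
    if not runs:
        return 0, None, None, None
    # A run is "first"/"last" only if it touches the batch boundary.
    first = runs[0] if runs[0][0] == samples[0][0] else None
    last = runs[-1] if runs[-1][1] == samples[-1][0] else None
    middle = [r for r in runs if r is not first and r is not last]
    best_mid = max(middle, key=lambda r: r[2]) if middle else None
    pattern = ((4 if first is not None else 0)
               | (2 if best_mid is not None else 0)
               | (1 if last is not None else 0))
    return pattern, first, best_mid, last

def extract(bits):
    """Pair the hypothetical timestamps 1..10 with the given good flags."""
    return list(zip(range(1, 11), bits))

print(summarize_streaks(extract([1, 1, 0, 0, 1, 1, 1, 0, 1, 1])))
```

With timestamps 1 to 10 standing in for ts1 to ts10, the four rows of the table map to patterns 0, 1, 2 and 7.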

Case 1: no match for the city in the batch streaks' summaries
● The right-hand part resulting from the full outer join is null
● The city keeps its recorded best-streak and last-streak unchanged
[Diagram: the entry of city_i (best streak, last streak) is kept as is; null match on the batch side]

Case 2: no match for the city in the streaks' summaries over batches
● The left-hand part resulting from the full outer join is null: new city data
● Example: pattern-identifier = 6, 7
[Diagram: the batch streaks' summary of city_i (first streak, middle best streak, last streak, pattern identifier) yields a new entry (best streak, last streak); the best of the batch streaks becomes the best streak; null match on the stored side]

Case 3: a match on the city exists
● Processing of pattern-identifier = 6, 7
[Diagram: the batch streaks' summary of city_i (first streak, middle best streak, last streak, pattern identifier) is combined with the stored entry (best streak, last streak) into a refreshed entry (best streak, last streak); arrows are labeled "merge" and "best"]
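The three refresh cases follow from which side of the full outer join is null. The sketch below only shows that routing. It is plain Python rather than the Spark join used in the talk, the function names are hypothetical, and the streak merge itself is left out because the slides give it only as diagrams.

```python
def full_outer_join(stored, batch):
    """Join two {city: summary} maps on city; a missing side is None."""
    return {city: (stored.get(city), batch.get(city))
            for city in stored.keys() | batch.keys()}

def refresh_case(stored_entry, batch_entry):
    """Return the case number (1, 2 or 3) for one joined row."""
    if batch_entry is None:
        return 1  # right-hand side null: keep best-streak and last-streak
    if stored_entry is None:
        return 2  # left-hand side null: new city, create a new entry
    return 3      # match on city: merge and compare streaks

# Placeholder summaries; real entries would hold streak tuples.
stored = {"Mainz": "stored-summary", "Berlin Grunewald": "stored-summary"}
batch = {"Mainz": "batch-summary", "Langenau": "batch-summary"}
cases = {city: refresh_case(left, right)
         for city, (left, right) in full_outer_join(stored, batch).items()}
print(cases)
```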

3. COMPUTE longest streak for each active city
● Select active cities: $\text{city-streaks-summaries} \bowtie \sigma_{lastTS \,\ge\, \text{batch-ingestion-TS} - 10\,\text{min}}(\text{cities-lastTS})$
● Calculate longest-streak for each active city
4. BUILD the histogram (buckets $b_0, b_1, \dots, b_j, \dots, b_{13}$; % cities per bucket)
● Count the number of cities for each bucket: if the streak duration $s_i$ of city_i is in $[\,j \times hw/14,\ (j+1) \times hw/14\,]$, add 1 to $b_j$
● Divide each count by the total number of active cities
hw denotes the histogram window: $hw = \text{current-batch-ingestion-TS} - \text{first-batch-ingestion-TS}$
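The active-city selection and the 14-bucket histogram can be sketched as follows. This is a plain-Python illustration with hypothetical function names and invented sample values. It assumes timestamps and durations are integers in milliseconds and that hw > 0, and it resolves the shared bucket boundaries by treating buckets as half-open, with a duration equal to hw clamped into the last bucket; the slides do not state how boundaries are handled.

```python
TEN_MINUTES_MS = 10 * 60 * 1000
BUCKETS = 14

def active_cities(cities_last_ts, batch_ingestion_ts, window_ms=TEN_MINUTES_MS):
    """Cities whose lastTS >= batch-ingestion-TS - 10 min."""
    threshold = batch_ingestion_ts - window_ms
    return {city for city, ts in cities_last_ts.items() if ts >= threshold}

def build_histogram(longest_streaks, hw, buckets=BUCKETS):
    """longest_streaks: {city: duration}; returns the share of cities per bucket."""
    counts = [0] * buckets
    for duration in longest_streaks.values():
        # Bucket j covers [j*hw/14, (j+1)*hw/14); duration == hw goes to the last one.
        j = min(duration * buckets // hw, buckets - 1)
        counts[j] += 1
    total = len(longest_streaks)
    return [c / total for c in counts] if total else [0.0] * buckets

# Invented sample values for illustration.
last_ts = {"A": 1_000_000, "B": 300_000, "C": 400_000}
active = active_cities(last_ts, batch_ingestion_ts=1_000_000)
streaks = {"A": 0, "B": 700, "C": 1400}
histogram = build_histogram({c: streaks[c] for c in active}, hw=1400)
print(sorted(active), histogram)
```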

Q2: screenshots
[Screenshots: evaluation run, batch-size = 10,000]

Conclusion
Solution Design and Implementation
● Optimized workflow
● Parallel processing
Preliminary Evaluation
● On the provided VM with 16 GB, our system processed 22 batches of 10,000 measurements each in 15.5 min (928.097 s)
Limitations of Spark
● RDDs are immutable: each refresh recalculates the active-cities-lastTSs, current-year-AQI-summary, previous-year-AQI-summary and cities-streaks-summaries RDDs

Future Work
Performance analysis on a cluster
● Apache Sedona spatial indexes (R-tree, quad-tree), Spark cluster setup (partitioning, ...)
Streaks' calculus
● For each city, N measurements are analyzed and summarized into a triplet <first-streak, best-mid-streak, last-streak>
--> Investigate approximate approaches: measurements close in time are likely to relate to the same air quality
Spatial analysis on historical and real-time data
● Both queries Q1 and Q2 perform analysis per city. Cities differ from each other in population, area and sensors' coverage, and may include parks and industrial sites; consequently, aggregating by city is not appropriate
--> Uber H3 or geohash indexes to adjust the zone dimensions
--> High-Low Clustering (Getis-Ord index) and Spatial Autocorrelation (Global Moran's I)

Thank you for your Attention
Q&A
Rim Moussa
