Building a real-time Tweet map with Flink in six weeks
OSTMap
Fast PoC development with Flink
Proof of concept - an important tool in the industry
• A PoC is often necessary to show feasibility to customers
• PoCs touch several topics:
  • Scalability
  • Stream processing
  • Batch processing
  • Storage and querying of data
• OSTMap as an example PoC
Goals for OSTMap
• Increase trust in big data technologies on the customer side
  • It is easy to build an application with current technologies
  • Even with almost no experience
• Teach students big data technologies
  • Recruiting
  • Bring big data to the university
• Build a real-time application to view recent geotagged tweets on a map
• Search for terms and users, show these tweets on a map
• Analytics:
  • First data science jobs
  • …
Industry in practice: IT-Ringvorlesung 2016
• A course at the University of Leipzig
• Students work on projects of local companies
• Six students
• Over a period of six weeks, not full time
• Weekly meetings
• GitHub project: github.com/IIDP/OSTMap
OSTMap team: Nico Graebling, Vincent Märkl, Hans Dieter Pogrzeba, Christopher Schott, Christopher Rost, Kevin Shrestha, Michael Schmeißer, Martin Grimmer, Matthias Kricke
mgm technology partners
We bring applications into production!
• Innovative software solution provider with application responsibility
• Specialist for highly scalable, transactional online applications
• Central lines of business: Insurance, E-Commerce, E-Government
• Founded in 1994
• 347 employees, 9 offices (2014)
• Revenue: €43.7 million (2014)
• Part of Allgeier SE
ScaDS
Competence Center for Scalable Data Services and Solutions Dresden/Leipzig
• Bundles the Big Data research expertise of TU Dresden and Leipzig University
• Drive Big Data innovations
• Bring industry and science together
• Knowledge exchange and transfer
Walking skeleton
“A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel.”
- Alistair Cockburn
GIF from http://blog.codeclimate.com/blog/2014/03/20/kickstart-your-next-project-with-a-walking-skeleton
Milestone 1
Read the stream, store data as JSON files, show tweets, read data from JSON files
Milestone 2
Write to and read from Accumulo, show tweets on a map, full table scans, slow visualization
Milestone 3
Term index, geotemporal index, UI improvements, clustering, …
OSTMap – stream, batch, storage and querying
geotagged tweets
webservice
a) stream processing
b) batch processing
c) querying data
Stream processing of incoming data – first version
Operators: GeoTweetSource, DateExtraction, KeyGeneration, RawTweetSink
RawTweetSink writes the data for the time index.
This enabled us to build a slow term search and a slow map search via full table scans.
Stream processing of incoming data – final version
Operators: GeoTweetSource, DateExtraction, KeyGeneration, TermExtraction (tokenizing), UserExtraction, LanguageExtraction, GeoTemporalIndexCreation
Sinks and the data they write:
• RawTweetSink: time index
• TermIndexSink: term index
• LanguageFrequencySink: language statistics (1 minute window, sum by language)
• GeoTemporalIndexSink: geotemporal index
Now we were able to build a faster term and map search and a language frequency visualization.
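The "1 minute window, sum by language" step can be illustrated outside Flink as a plain tumbling-window count. This is a minimal sketch under stated assumptions, not the OSTMap job itself: the function names and the `(epoch_seconds, language_tag)` input shape are hypothetical, and only the bucket format YYYYMMDDhhmm is taken from the deck's Accumulo table design.

```python
from collections import Counter
from datetime import datetime, timezone


def time_bucket(epoch_seconds: int) -> str:
    """Format a timestamp as a one-minute bucket label (YYYYMMDDhhmm, UTC)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M")


def count_by_language(tweets):
    """Count tweets per (minute bucket, language tag).

    `tweets` is an iterable of (epoch_seconds, language_tag) pairs; this
    input shape is an assumption made for the sketch.
    """
    counts = Counter()
    for epoch_seconds, language in tweets:
        counts[(time_bucket(epoch_seconds), language)] += 1
    return dict(counts)
```

Each resulting entry corresponds to one LanguageFrequency cell: the bucket as row, the language tag as column family, the count as value.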
Batch processing
• Initial creation of the term index and geotemporal index for already processed tweets
• Data export
• Other statistics like:
  • Area or tweet distance a user covers with their tweets
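One way to read the "tweet distance a user covers" statistic is the summed great-circle distance between a user's consecutive geotagged tweets. The sketch below assumes that reading; the deck does not define the metric, and the function names and the mean Earth radius of 6371 km are choices made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_covered_km(points):
    """Sum the distances between consecutive (lat, lon) points.

    `points` is a list of (lat, lon) tuples in tweet order (assumed input).
    """
    return sum(haversine_km(*a, *b) for a, b in zip(points, points[1:]))
```

In a batch job this would run per user after grouping tweets by user and sorting them by timestamp.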
Storage
Accumulo table design
Table                    | Row                                 | Column Family      | Column Qualifier            | Value
RawTweetData (TimeIndex) | timestamp, hash (8 bytes + 4 bytes) | -                  | -                           | raw tweet JSON
TermIndex                | term                                | field (user, text) | RawTweetData key (12 bytes) | -
LanguageFrequency        | time bucket (YYYYMMDDhhmm)          | language-tag       | -                           | tweet count (4 bytes)
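The RawTweetData row key (an 8-byte timestamp followed by a 4-byte hash, 12 bytes in total) could be built roughly as follows. This is a sketch under assumptions: the deck does not name the hash function or the byte order, so CRC32 and big-endian packing are stand-ins, and `raw_tweet_key` is a hypothetical name.

```python
import struct
import zlib


def raw_tweet_key(timestamp_ms: int, tweet_json: str) -> bytes:
    """Build a 12-byte row key: 8-byte timestamp followed by a 4-byte hash."""
    # CRC32 is an assumed stand-in; the deck does not say which hash is used.
    tweet_hash = zlib.crc32(tweet_json.encode("utf-8"))
    # ">qI" = big-endian signed 64-bit integer + unsigned 32-bit integer.
    return struct.pack(">qI", timestamp_ms, tweet_hash)
```

Big-endian packing keeps keys with non-negative timestamps in chronological order under byte-wise sorting, which is what makes the table usable as a time index. The hash suffix separates tweets that share a timestamp.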
Geotemporal Index for OSTMap
Geo index
• Geohashes of the geo data are used as row keys in Accumulo
• Partitioned by geohash (Z-curve): a function from the 2D coordinate space to the 1D key space
• Geohash grid:
    9z db dc df dg
    9x d8 d9 dd de
    9r d2 d3 d6 d7
    9p d0 d1 d4 d5
    3z 6b 6c 6f 6g
• Row keys in sorted order: …, 3z, 6b, 6c, 6f, 6g, 9p, 9r, 9x, 9z, d0, d1, d2, d3, d4, d5, d6, …, dg
Row | CF | CQ
geohash | RawTweetKey | -
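The mapping from 2D coordinates to a 1D key can be made concrete with the standard geohash encoding, which interleaves longitude and latitude bits and emits one base-32 character per five bits. This is a minimal sketch of that public algorithm, not code from the OSTMap repository; the function name and the default precision of 2 (matching the two-character cells on the slide) are assumptions.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 2) -> str:
    """Encode a coordinate as a geohash with `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bit interleaving starts with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Points that share a prefix lie in the same cell, so nearby tweets tend to end up in neighbouring rows.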
Geotemporal Index for OSTMap
Geo index – querying?
• Partitioned by geohash
• Calculate the coverage of the bounding box
• Calculate scan ranges from the coverage: [9p], [9r], [d0, d1, d2, d3]
• Scan the ranges with Accumulo iterators to produce the result
Row | CF | CQ
geohash | RawTweetKey | lat/lon
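The step from coverage to scan ranges can be sketched as merging geohashes that are consecutive in key order. The example assumes the coverage is already known and that all cells have the same length; `scan_ranges` and `geohash_to_int` are hypothetical helper names, not Accumulo API calls.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def geohash_to_int(geohash: str) -> int:
    """Interpret a geohash as a base-32 number to find its key position."""
    value = 0
    for ch in geohash:
        value = value * 32 + BASE32.index(ch)
    return value


def scan_ranges(coverage):
    """Merge equal-length geohashes into inclusive (start, end) ranges."""
    cells = sorted(set(coverage), key=geohash_to_int)
    ranges = []
    for cell in cells:
        if ranges and geohash_to_int(cell) == geohash_to_int(ranges[-1][1]) + 1:
            ranges[-1] = (ranges[-1][0], cell)  # extend the current range
        else:
            ranges.append((cell, cell))  # start a new range
    return ranges
```

For the coverage on the slide, `scan_ranges(["9p", "9r", "d0", "d1", "d2", "d3"])` yields three ranges: 9p alone, 9r alone, and d0 through d3. Fewer, longer ranges mean fewer seeks on the tablet servers.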
Geotemporal Index for OSTMap
Add some time!
• Partitioned by geohash, with time buckets (day 1, day 2, …, day i, …)
• Dimensions: day, lon, lat
• Row keys are prefixed with the day:
    day 1: …, 13z, 16b, 16c, 16f, 16g, 19p, 19r, 19x, 19z, 1d0, 1d1, 1d2, 1d3, 1d4, 1d5, 1d6, …, 1dg
    day 2: …, 23z, 26b, 26c, 26f, 26g, 29p, 29r, 29x, 29z, 2d0, 2d1, 2d2, 2d3, 2d4, 2d5, 2d6, …, 2dg
Row | CF | CQ
day, geohash | RawTweetKey | lat/lon
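With the day as a row prefix, a query over a bounding box and a time span becomes one scan range per day and geohash range. The sketch below uses the slide's notation (day "1" plus geohash "9p" gives "19p"); the function names and the string representation of the day are assumptions.

```python
def geotemporal_row(day: str, geohash: str) -> str:
    """Concatenate the day bucket and the geohash into one row key."""
    return day + geohash


def geotemporal_scan_ranges(days, geo_ranges):
    """Expand geohash ranges into one (start, end) row range per day.

    `days` is a list of day labels and `geo_ranges` a list of inclusive
    (start, end) geohash pairs; both shapes are assumed for this sketch.
    """
    return [
        (geotemporal_row(day, start), geotemporal_row(day, end))
        for day in days
        for start, end in geo_ranges
    ]
```

A real key would need a fixed-width day label so that lexicographic order matches chronological order; the single-digit labels here only mirror the slide.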
Geotemporal Index for OSTMap
What about Hotspots?
• Partitioned by geohash, with time buckets (dimensions: day, lon, lat)
• A spreading byte (sb) is prepended to the row key to distribute rows over node 0, node 1, node 2, …, node n
• Example row keys (sb, day, geohash): …, 01d2, 01d3, 01d4, …, 02d2, 02d3, 02d4, …, 11d2, 11d3, 11d4, …, 12d2, 12d3, 12d4, …, 21d2, 21d3, 21d4, …, 22d2, 22d3, 22d4, …
Row | CF | CQ
sb, day, geohash | RawTweetKey | lat/lon
• spreading byte = hash(tweet) % 255
• reproducible
• pre-split tables in Accumulo
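The spreading byte can be sketched as a deterministic hash of the tweet taken modulo 255 and prepended to the day and geohash. The hash function is an assumption (CRC32 here, since the deck does not name one), as are the function names and the ASCII encoding of the key. Because the byte depends on the tweet, a geo query presumably has to fan out over every possible prefix, which the second helper illustrates.

```python
import zlib

NUM_BUCKETS = 255  # from the slide: spreading byte = hash(tweet) % 255


def spreading_byte(tweet_json: str) -> int:
    """Derive a reproducible bucket in 0..254 from the tweet content."""
    # CRC32 is an assumed stand-in; Python's built-in hash() is salted per
    # process and would not be reproducible.
    return zlib.crc32(tweet_json.encode("utf-8")) % NUM_BUCKETS


def row_key(tweet_json: str, day: str, geohash: str) -> bytes:
    """Build the row key: spreading byte, then day, then geohash."""
    return bytes([spreading_byte(tweet_json)]) + (day + geohash).encode("ascii")


def fan_out(day: str, start: str, end: str):
    """Create one (start, end) row range per spreading byte value."""
    return [
        (bytes([sb]) + (day + start).encode("ascii"),
         bytes([sb]) + (day + end).encode("ascii"))
        for sb in range(NUM_BUCKETS)
    ]
```

Pre-splitting the table at each spreading byte value would then let the tablets for different prefixes live on different nodes, so a busy geohash on a busy day is no longer served by a single node.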
demo
Thank you
Martin Grimmer, grimmer[at]informatik.uni-leipzig.de
Matthias Kricke, kricke[at]informatik.uni-leipzig.de
Michael Schmeißer, michael.schmeisser[at]mgm-tp.com
www.mgm-tp.com
www.scads.de
