Changing the way cities move
Modern in-house analytics pipeline
Sergey Burkov
R&D Engineering Lead
DATA STRATEGY:
Agenda
1. About Mobimeo
2. Importance of analytics
3. Buy or build choice
4. Vision vs. Reality
5. Scoping the requirements
6. Choice of tech stack
7. Implementation details
8. Big picture
9. Challenges
Mobimeo – Changing
the way cities move
Easy access to daily mobility
Our technology empowers mobility providers to
orchestrate existing and new modes of public transport.
Together we create an effortless transport experience
to make mobility service attractive to millions of users.
More mobility. Less traffic.
We help our mobility partners to boost customer
happiness
Develop white label products - together with our mobility
partners
Fast implementation of tailor-made solutions in the
apps of our partners
Continuous improvement of products – our customers
are always the focal point
Selection of current mobility partners
We are rethinking urban mobility
Personalized
information
Public transport connected
with multimodal offers
Step by step
navigation
Easy ticket
purchase
Data Strategy: BI and Analytics Is Not Just a Tool
Deloitte: “As organization investment in data modernization increases, value grows exponentially and changes from hindsight to insights to foresight.”
Gartner Research: “Data and analytics leaders should augment or upgrade traditional BI platforms to modern platforms that improve business value and speed time to insight.”
Data Strategy: Buy or build
Outsource
+ Speed of launch
+ Advanced UI reporting
+ Easy integration into app
+ Less maintenance overhead
- Costs based on data/MAU volume
- Tailored to APP analytics mainly
- Privacy/GDPR concerns
- Code libs dependency
In house
+ Full privacy transparency
+ Flexibility
+ Speed of transition from insight to product
+ Complete understanding of user context
- Steep implementation curve
- Maintenance work
- Time to launch
Data Strategy: Ideal analytics/data pipeline
Low Event Latency
Scalability
Interactive Querying
Versioning
Monitoring
Testing
Data Strategy: starting point & reality
• No single owner for the system
• Different flavours of event definitions between platforms and apps
• Inconsistent names, non-homogeneous structure
• No strict schema, data sent as raw JSON (some > 500 KB)
• JSON blob as a field value
• Data stored in S3 in one bucket as individual JSON files
• …
• Data discovery and query building are a complicated and fragile process
• Adding a new business event is a full-blown cross-team project
Messy Pipeline
Data Strategy: where do we want to be
• Ease of adoption
• Extensibility
• Re-usability
• Streaming & Batch
• Strict schema enforcement
• Schema evolution
• Multiple schema versions
• Boilerplate code generation
• Transparency of execution
• Handling duplicates
• Data flow monitoring
• Robustness for recovery
• Backfilling
• Self-service
• External data sources
• Smoke tests
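One of the goals listed above, handling duplicates, can be illustrated with a minimal sketch. The deck does not say how Mobimeo deduplicates events, so everything here is an assumption: the `eventId` field name, the `Deduplicator` class, and the idea of remembering only a bounded number of recent IDs are hypothetical choices made for illustration.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` event IDs and flags repeats."""

    def __init__(self, capacity=10000):
        self.capacity = capacity
        self._seen = OrderedDict()  # insertion order doubles as recency order

    def is_duplicate(self, event_id):
        if event_id in self._seen:
            # Refresh recency so frequently repeated IDs are evicted last.
            self._seen.move_to_end(event_id)
            return True
        self._seen[event_id] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest remembered ID
        return False


def drop_duplicates(events, dedup, id_field="eventId"):
    """Keeps the first occurrence of each event ID (hypothetical field name)."""
    return [e for e in events if not dedup.is_duplicate(e[id_field])]
```

A bounded memory keeps the check cheap on a stream, at the cost of missing duplicates that arrive after their ID has been evicted. A batch job over the data lake could deduplicate exactly instead.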
Data Strategy: distributed streaming platform
Tactics: Schema definitions reside in central repository
@namespace("analytics.mobile")
protocol ActiveTicketClickProtocol {
  import idl "../common/AnalyticsCommon.avdl";
  import idl "../common/EventPropertiesCommon.avdl";

  record ActiveTicketClickProperties {
    analytics.common.GenericClickEventProperties @inlineFields(true) clickEvent;
  }

  @register(true)
  record ActiveTicketClickEvent {
    analytics.common.CommonEventProperties @inlineFields(true) commonEventProperties;
    ActiveTicketClickProperties eventProperties;
    analytics.common.FrontendDeviceProperties deviceProperties;
    analytics.common.FrontendAppProperties appProperties;
  }
}
Tactics: safe schema change process
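The slide itself shows only a diagram, so the following is a sketch of one check a safe schema change process might run, not a description of Mobimeo's actual process. It tests a simplified form of Avro backward compatibility on flat record schemas given as Python dictionaries: a reader using the new schema can still read old data if every newly added field carries a default. The function name is hypothetical, and the check is stricter than Avro itself, which also permits some type promotions such as int to long.

```python
def backward_compatibility_problems(old_schema, new_schema):
    """Lists reasons why data written with old_schema can't be read with new_schema.

    Simplified: flat records only, and any type change counts as breaking.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        name = field["name"]
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default value.
            if "default" not in field:
                problems.append(f"new field '{name}' has no default")
        elif old_fields[name]["type"] != field["type"]:
            problems.append(f"field '{name}' changed type")
    # Fields removed in new_schema are fine: the reader simply ignores them.
    return problems
```

Run in continuous integration against the central schema repository, a check like this could reject a breaking change before it reaches producers. A schema registry can enforce similar compatibility rules at registration time.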
Tactics: documentation is always up-to-date
Tactics: data ingestion
• Helper library for Kafka with generated classes for backend services
• DTO classes with JSON serializers for mobile apps
• Custom data connectors (REST API source, S3 Parquet sink)
• JSON REST API service (resolves JSON against the schema registry)
• Replay service from the raw dead-letter S3 bucket
• Monitoring/Alerting: ingestion rate, dead-letter queue
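The combination of schema resolution and a dead-letter queue in the list above can be sketched as follows. This is an illustration under stated assumptions, not the service described in the deck: the schema is a flat Avro-style record held in a dictionary rather than fetched from a schema registry, only four primitive types are covered, and `validate`, `route`, and the two in-memory lists standing in for the accepted stream and the dead-letter queue are hypothetical names.

```python
import json

# Type checks for a few Avro primitive types (bool is excluded from numbers).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "boolean": lambda v: isinstance(v, bool),
    "long": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "double": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}


def validate(event, schema):
    """Returns a list of violations of a flat record schema (empty if valid)."""
    errors = []
    for field in schema["fields"]:
        name, ftype = field["name"], field["type"]
        if name not in event:
            if "default" not in field:
                errors.append(f"missing field '{name}'")
        elif not TYPE_CHECKS[ftype](event[name]):
            errors.append(f"field '{name}' is not a {ftype}")
    return errors


def route(raw_payload, schema, accepted, dead_letter):
    """Appends a valid event to `accepted`; otherwise keeps the raw payload
    and the rejection reasons in `dead_letter` so it can be replayed later."""
    try:
        event = json.loads(raw_payload)
    except json.JSONDecodeError as exc:
        dead_letter.append({"raw": raw_payload, "errors": [f"invalid JSON: {exc.msg}"]})
        return
    if not isinstance(event, dict):
        dead_letter.append({"raw": raw_payload, "errors": ["payload is not a JSON object"]})
        return
    errors = validate(event, schema)
    if errors:
        dead_letter.append({"raw": raw_payload, "errors": errors})
    else:
        accepted.append(event)
```

Storing the untouched raw payload next to the rejection reasons is what makes a replay service possible: once the schema or the producer is fixed, the same payloads can be pushed through `route` again.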
Tactics: data lake organization
• Access permissions
• Catalogue of data assets
• Events are stored in dedicated S3 buckets
• Data in Avro, Parquet and raw JSON
• Partitioned by origin, event type, tenant, date, hour
• Dead-letter bucket contains events rejected by pipeline
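The partitioning scheme above can be sketched as a function that derives an object key prefix from an event. The `event=` key matches the table location shown on the data access slide, and `tenant`, `date` and `hour` match its partition columns. The `origin=` key, the function name, and the assumption that timestamps are epoch milliseconds interpreted in UTC are illustrative guesses.

```python
from datetime import datetime, timezone


def partition_prefix(origin, event_type, tenant, ts_millis):
    """Builds a Hive-style key prefix: one key=value path segment per partition."""
    ts = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    return (
        f"origin={origin}/event={event_type}/tenant={tenant}/"
        f"date={ts:%Y-%m-%d}/hour={ts.hour}"
    )
```

Hive-style `key=value` segments let query engines such as Athena map directories to partition columns and skip data outside the requested tenant, date, or hour.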
Tactics: data access
-- External table over the Avro files of one event type.
-- `timestamp` and `date` are reserved words in DDL, so they are quoted with backticks.
CREATE EXTERNAL TABLE `ticketclickevent_avro` (
  analyticsid string,
  `timestamp` bigint,
  eventproperties struct<loggedin:boolean,screenname:string>
)
PARTITIONED BY (tenant string, `date` date, hour int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
  'avro.schema.literal'='{"type":"record","name":"TicketClickEvent" ...')
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://.../avro/event=analytics.mobile.TicketClickEvent';

-- Register the partitions that already exist under LOCATION.
MSCK REPAIR TABLE `ticketclickevent_avro`;
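MSCK REPAIR TABLE scans the whole table location to discover partitions. As a hedged alternative that the deck does not mention, a single new partition can be registered explicitly with ALTER TABLE ... ADD PARTITION once its files have landed. The sketch below only builds the statement text. The function name and the example bucket in the usage note are hypothetical, and the backtick quoting of `date` should be checked against the query engine's reserved-word rules.

```python
def add_partition_statement(table, tenant, date, hour, location):
    """Builds an ALTER TABLE statement that registers one hourly partition."""
    for value in (table, tenant, date, location):
        # Refuse values that could break out of the quoted literals.
        if "'" in value or "`" in value:
            raise ValueError(f"unsafe character in {value!r}")
    return (
        f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
        f"PARTITION (tenant='{tenant}', `date`='{date}', hour={int(hour)}) "
        f"LOCATION '{location}'"
    )
```

For example, `add_partition_statement("ticketclickevent_avro", "demo", "2020-03-01", 13, "s3://example-bucket/avro/event=analytics.mobile.TicketClickEvent/tenant=demo/date=2020-03-01/hour=13/")` yields one statement per landed hour, which avoids rescanning every existing partition.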
High Level Architecture
Challenges
• Open source monitoring and configuration tools
• Overcoming Avro schema inheritance / re-usability limitations
• Parsing multiple versions of events from JSON to Avro
• Schema -> Java, Scala, Kotlin, Swift, documentation generation
• Tools and libraries don't support the latest Avro specification or use incompatible forks
• Splitting classes according to domains
• AWS Athena / Glue limitations
Sergey Burkov
R&D Engineering Lead
sergey.burkov@mobimeo.com
linkedin.com/in/burkov
Mobimeo GmbH
Hallesches Ufer 60
10963 Berlin
info@mobimeo.com
Open source reference
github.com/confluentinc
redash.io
github.com/plinioj/json-schema-avro
github.com/mobimeo/avrohugger
github.com/mobimeo/gatling-kafka
github.com/mobimeo/avrodoc-plus
github.com/mobimeo/avro4s
github.com/lemastero/kafka-manager
github.com/linkedin/Burrow
github.com/yahoo/CMAK
github.com/linkedin/cruise-control
github.com/linkedin/cruise-control-ui
github.com/tchiotludo/kafkahq
github.com/lensesio/kafka-connect-ui
github.com/pusher/oauth2_proxy
