Using Spark at Vungle


This deck describes the evolution of Vungle's ETL pipeline from a legacy batch architecture to a streaming one. The old architecture was an hourly Java process whose heap grew from 12 GB to 24 GB in under a year and whose runs could take over an hour. The new architecture uses Spark Streaming for horizontal scalability and real-time processing, and decouples ingestion of raw data from processing. Events are ingested into MongoDB as they arrive, then processed to calculate metrics and sent to various destinations.


Introduction Old Architecture New Architecture Decoupling Streaming Conclusion
1

2
● Introduction
● Old Architecture
● New Architecture
● Decoupling
● Streaming
● Conclusion

3
● Legacy Java Process
○ “Crunches” data
○ Sends data downstream to our own datastores and to 3rd party analytics
○ Runs every hour
● Growth
○ Process can run over an hour
○ Heap grew from 12 GB to 24 GB in less than 1 year
○ Cron is a horrible job management system
○ A failure requires rerunning a job from the beginning
● 2.0
○ Horizontally scalable
○ Real-time ETL
○ Reusable

4
ETL @ Vungle
● ~1 Billion Events / Day
● Deduplication
● Calculating $$$
● Outputting data to various destinations
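The deduplication and revenue steps listed above can be sketched in a few lines. This is only an illustration, not the pipeline from the talk: the field names `event_id`, `campaign` and `payout_cents` are hypothetical, and the in-memory `set` stands in for whatever keyed store a job handling about a billion events per day would actually need.

```python
from collections import defaultdict


def dedupe(events):
    """Yield each event once, keyed on its (hypothetical) event_id field."""
    seen = set()  # stand-in for a persistent keyed store
    for event in events:
        if event["event_id"] in seen:
            continue  # a repeat of an event that was already handled
        seen.add(event["event_id"])
        yield event


def revenue_by_campaign(events):
    """Sum the (hypothetical) payout_cents field per campaign.

    Integer cents are used so that money is never held in a float.
    """
    totals = defaultdict(int)
    for event in events:
        totals[event["campaign"]] += event["payout_cents"]
    return dict(totals)


sample_events = [
    {"event_id": "a1", "campaign": "c1", "payout_cents": 150},
    {"event_id": "a1", "campaign": "c1", "payout_cents": 150},  # duplicate
    {"event_id": "b2", "campaign": "c2", "payout_cents": 300},
]
sample_totals = revenue_by_campaign(dedupe(sample_events))
```

Deduplicating before summing is the point of the ordering: a repeated event would otherwise be paid out twice.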

5
Old Architecture

14
New Architecture

22
Decoupling

31
Set up connection and Spark streams
Map each log line to a Mongo object and insert it into MongoDB

32
Set up connection and Spark streams

33
Mapping to Mongo objects and insertions
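The code on this slide did not survive extraction. As a rough sketch of the mapping step it names, the function below turns one JSON log line into a dict shaped like a MongoDB document. Everything here is an assumption made for illustration: the `event_id` field, the choice to use it as `_id`, and the plain dict that stands in for a MongoDB collection. In a real job a function like `line_to_document` would be applied inside a Spark Streaming map and the result written through a MongoDB driver.

```python
import json


def line_to_document(line):
    """Parse one JSON log line into a dict shaped like a MongoDB document.

    Returns None for blank or malformed lines so the caller can skip them.
    """
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "event_id" not in record:
        return None
    # Assumption: keying the document on the event ID makes a repeated
    # write of the same event overwrite itself instead of duplicating.
    doc = dict(record)
    doc["_id"] = doc.pop("event_id")
    return doc


def write_all(collection, lines):
    """Write every parseable line into `collection` and count the writes.

    `collection` is a plain dict standing in for a MongoDB collection.
    """
    written = 0
    for line in lines:
        doc = line_to_document(line)
        if doc is None:
            continue
        collection[doc["_id"]] = doc
        written += 1
    return written
```

Returning None for bad lines, rather than raising, keeps one malformed record from failing a whole batch.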

34
Questions

35
Streaming

39
Ingestion

40
Ingestion Table Schema
Columns: Event ID, Request, View, Install, ..., Request Added, View Added, Install Added, Value
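One way to read the ingestion schema is as a single row per event ID, with one slot each for the request, view and install, plus an "added" flag per slot. The sketch below follows that reading, which is an assumption: the slide gives only column names, the flag semantics (whether a slot has already been counted downstream) are a guess, and the Value column is left out because its meaning cannot be recovered from the slide. A dict stands in for the table.

```python
KINDS = ("request", "view", "install")


def new_row(event_id):
    """Build an empty ingestion row for one event ID (hypothetical layout)."""
    row = {"event_id": event_id}
    for kind in KINDS:
        row[kind] = None  # payload of the event, once seen
        row[kind + "_added"] = False  # guess: set once counted downstream
    return row


def ingest(table, event_id, kind, payload):
    """Record one event under its event ID.

    Returns True if the slot was empty and has now been filled, or False
    if an event of this kind was already stored for this ID (a duplicate).
    """
    if kind not in KINDS:
        raise ValueError("unknown event kind: " + repr(kind))
    row = table.setdefault(event_id, new_row(event_id))
    if row[kind] is not None:
        return False
    row[kind] = payload
    return True
```

Under this reading, deduplication falls out of the layout: a second request for the same event ID finds its slot already filled and is ignored.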

41
Fact Table Schema
Columns: ..., Date Time, Deliveries, Views, Installs, Processed Deliveries, Processed Views, Processed Installs
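The fact table columns suggest counters per time bucket. As a hedged sketch of how such counters might be filled, the code below rolls views and installs up by UTC hour and flags each one once counted, so that a rerun does not count it twice. The row layout, the hourly bucket size and the flag names are all assumptions. Deliveries are omitted because the slides do not say which ingested event they are derived from, and the Processed columns are not modelled.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts):
    """Truncate a Unix timestamp in seconds to the start of its UTC hour."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    return moment.replace(minute=0, second=0, microsecond=0).isoformat()


def roll_up(rows, facts):
    """Add views and installs not yet counted to hourly counters.

    Each row holds a timestamp (or None) under "view" and "install" and a
    matching "*_added" flag. The flag is set once the event is counted,
    which makes a second call over the same rows a no-op.
    """
    for row in rows:
        for kind, column in (("view", "views"), ("install", "installs")):
            ts = row.get(kind)
            if ts is None or row[kind + "_added"]:
                continue
            facts[(hour_bucket(ts), column)] += 1
            row[kind + "_added"] = True
    return facts
```

Because progress is recorded per event rather than per job, a failed run can resume where it stopped instead of starting over, which is the failure mode the introduction complains about.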

42
Ingestion

49
Process

55
Next Steps
● Switching from JSON to ProtoBuf
● Using YARN to run multiple jobs on one cluster
● Data Science
● Who knows?

56
Questions

57
Thank you!