Build DMP on top of GCP
VMFive - Randy Huang
Agenda
• Migrated Pipeline to GCP
• Cost Comparison
• Business Use Case
• Fluentd Demo
ELK + AWS EMR
Kinesis Lambda
Pros & Cons
• Pros:
• Well supported.
• Well documented.
• Easy to find references.
• Cons:
• High cost.
• Not open source.
• Have to set the scale up front.
Pipeline on GCP
Dataflow
BigQuery
Machine Learning
Data Visualization
Compute Engine
Global Load Balancing
Data Studio
The Products and Services logos may be used to accurately reference Google's technology and tools, for instance in architecture diagrams.
Batch
Streaming
BI Analysis
• Storage: Cloud Storage
• Processing: Cloud Dataflow
• Time Series Streaming: Cloud Pub/Sub
• Storage: BigQuery
Targeting Engines
Data Sources, Machine Learning, Applications
• API Backend: Compute Engine
• Spark MLlib: Cloud Dataproc
• App Engine
• Transform Data
• Hosted Models: Cloud Machine Learning
• Real-Time: Prediction API
• Device Related: Cloud Pub/Sub
• Behavior Related: Cloud Pub/Sub
• 3rd Party Data: Cloud Pub/Sub
• Redis: Compute Engine
Pros & Cons
• Pros:
• Cost-effective.
• Operationally effective.
• Google has your back.
• Cons:
• APIs/SDKs change every day.
• Some services are still in beta.
• Docs are scattered everywhere.
Workflow Monitoring
• Digdag (compared with Airflow/Oozie/Luigi)
• Native support for Python & Ruby
• Multi-cloud
• Modular
• Workflow as code
• Docker support
• Alerting to Slack
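The "workflow as code" and alerting bullets can be illustrated with a small sketch. This is not Digdag code (Digdag defines workflows in its own .dig files); it is a plain-Python illustration of the idea, and the task names and the `alert` callback (standing in for a Slack notification) are hypothetical.

```python
from typing import Callable, List, Tuple

# A task is a (name, callable) pair; both names below are illustrative only.
Task = Tuple[str, Callable[[], None]]


def run_workflow(tasks: List[Task], alert: Callable[[str], None]) -> List[str]:
    """Run tasks in order; on the first failure, send an alert and stop."""
    completed: List[str] = []
    for name, task in tasks:
        try:
            task()
        except Exception as exc:
            # In a real setup this callback would post to a chat channel.
            alert(f"task '{name}' failed: {exc}")
            break
        completed.append(name)
    return completed


def extract() -> None:
    """Hypothetical task that succeeds."""


def load() -> None:
    """Hypothetical task that fails, to exercise the alert path."""
    raise RuntimeError("BigQuery unavailable")


alerts: List[str] = []
done = run_workflow(
    [("extract", extract), ("load", load), ("report", extract)],
    alerts.append,
)
```

Because the workflow is an ordinary list of named steps, it can be versioned and reviewed like any other code, which is the point of the "workflow as code" bullet.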
Digdag Sample
Digdag
The journey of Moving from AWS ELK to GCP Data Pipeline
Cost Comparison
• $2,000 per month on AWS
• About $200 per month on GCP for production
• About another $200 per month on GCP for dev
• 50M events per month
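To put the figures on a common footing, the arithmetic below converts them to a cost per million events. It is a rough sketch that assumes the 50M events per month applies to both setups and treats the approximate dollar amounts as exact; the function name is illustrative.

```python
def cost_per_million(monthly_cost_usd: float, events_per_month: int) -> float:
    """Monthly cost divided by the number of events, in millions."""
    return monthly_cost_usd / (events_per_month / 1_000_000)


EVENTS_PER_MONTH = 50_000_000  # figure from the slide

aws = cost_per_million(2000, EVENTS_PER_MONTH)             # AWS
gcp_prod = cost_per_million(200, EVENTS_PER_MONTH)         # GCP production only
gcp_total = cost_per_million(200 + 200, EVENTS_PER_MONTH)  # GCP production + dev

print(f"AWS:            ${aws:.2f} per million events")
print(f"GCP (prod):     ${gcp_prod:.2f} per million events")
print(f"GCP (prod+dev): ${gcp_total:.2f} per million events")
print(f"AWS vs GCP prod:     {aws / gcp_prod:.0f}x")
print(f"AWS vs GCP prod+dev: {aws / gcp_total:.0f}x")
```

Under these assumptions AWS comes to $40 per million events, against $4 for GCP production alone and $8 with the dev environment included.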
Business Use Case
• Digital Ads Targeting
• User Behavior Tagging
• BI
• GEO Reporting
• KPI Reporting
• User Demographic
Some Tips
• BigQuery
• https://status.cloud.google.com/incident/bigquery/18022
• Solved by Fluentd’s retry and HA
• Dataflow’s SDK and docs are out of sync
• Dataflow side inputs have a bug in streaming mode
• Compute Engine SLB - TCP/UDP setup for forwarding
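The retry point can be sketched as follows. This is not Fluentd's implementation, only a minimal Python illustration of retrying a failed write with exponentially growing waits so that a temporary BigQuery outage does not lose data. The function and parameter names are hypothetical, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time
from typing import Any, Callable, Dict, List


def send_with_retry(
    send: Callable[[Any], None],
    payload: Any,
    max_retries: int = 5,
    base_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send the payload; on ConnectionError, wait and retry.

    The wait doubles after each failed attempt (base_wait, 2x, 4x, ...).
    Returns True on success, False once the retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            send(payload)
            return True
        except ConnectionError:
            if attempt == max_retries:
                return False
            sleep(base_wait * 2 ** attempt)
    return False


# Simulated sink that fails twice (an "outage") and then recovers.
calls: Dict[str, int] = {"n": 0}


def flaky_insert(rows: Any) -> None:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("simulated outage")


waits: List[float] = []
ok = send_with_retry(flaky_insert, [{"event": "click"}], sleep=waits.append)
```

In this run the first two attempts fail, the waits recorded are 1.0 and 2.0 seconds, and the third attempt succeeds.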
Fluentd Update
• Release notes for v0.14
• Sub-second event flush
• New plugin APIs: support formatting configuration values dynamically (e.g., path /my/dest/${tag}/mydata.%Y-%m-%d.log)
• Secure Forward
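To make the dynamic path example concrete, here is a small Python sketch of how such a template could resolve for one event. It only mimics the idea (a `${tag}` placeholder plus strftime-style time placeholders) and is not Fluentd's code; the function name, tag, and date are made up for illustration.

```python
from datetime import datetime


def expand_path(template: str, tag: str, event_time: datetime) -> str:
    """Fill strftime-style time placeholders, then the ${tag} placeholder."""
    # Time placeholders are expanded first so that a '%' inside the tag
    # is never interpreted as a format directive.
    return event_time.strftime(template).replace("${tag}", tag)


path = expand_path(
    "/my/dest/${tag}/mydata.%Y-%m-%d.log",
    tag="nginx.access",                # hypothetical tag
    event_time=datetime(2017, 1, 15),  # hypothetical event time
)
print(path)  # /my/dest/nginx.access/mydata.2017-01-15.log
```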
Demo
• Nginx -> Fluentd -> BigQuery -> DataStudio
• MySQL -> Fluentd -> BigQuery
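The first demo path starts by turning Nginx access-log lines into structured records. The deck does not include the demo configuration, so the following is only a Python sketch of that parsing step, assuming Nginx's "combined" log format. The field names, sample line, and resulting row shape are assumptions, not the schema used in the demo.

```python
import json
import re
from typing import Any, Dict, Optional

# Pattern for the Nginx "combined" access-log format (assumed here).
LINE_RE = re.compile(
    r'(?P<remote_addr>\S+) \S+ (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one access-log line into a flat dict, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    row: Dict[str, Any] = match.groupdict()
    # Numeric fields become integers so they can load as integer columns.
    row["status"] = int(row["status"])
    row["body_bytes_sent"] = int(row["body_bytes_sent"])
    return row


sample = (
    '203.0.113.7 - - [15/Jan/2017:10:00:00 +0000] '
    '"GET /ads?id=42 HTTP/1.1" 200 512 "-" "curl/7.50.0"'
)
row = parse_line(sample)
print(json.dumps(row))  # one JSON object per log line
```

A flat record like this maps naturally onto one table row, which is why a log collector can stream parsed lines to a warehouse with little transformation.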
