Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP
Who am I?
Mariano is an engineer with more than 15 years of
experience with the JVM. He enjoys working with
and exploring a variety of big data technologies. He is
an avid open-source contributor.
Data/Platform Architect at Otus
Mariano Gonzalez
Most importantly, I am just a person trying to learn about and share big
data technologies and approaches.
Agenda
● Goal for this session
● Overview of GCP services
● Apache Beam and GCP Dataflow
● Natural Language Processing for sentiment analysis
● Demo ETL/Analytics
● Q&A
Goal for this Session
Find an elegant way to build and deploy data/analytic
pipelines that:
● Support multiple workloads
● Scale compute and storage independently
● Are backed by managed services
● Are cost-effective
Common Architecture Analytics Pipeline
[Diagram. Labels: Different Types and Formats of Data; Analytic/Data Pipelines; Data Storage; User]
Overview of GCP services - App Engine
● Good alternative if K8s infrastructure is not in place
● Easy deployment
○ Similar to AWS SAM from a CLI perspective
○ Similar to AWS Elastic Beanstalk from a deployment perspective
● Well integrated with other cloud services
○ GCP Docker registry
● Multiple Runtimes
○ Custom (Docker)
○ JVM/Node/Python
Overview of GCP services - Storage
● Hot - durable, highly available, high-performance object storage for frequently accessed data
○ Amazon S3 Standard
○ Microsoft Azure Hot Blob Storage
○ Google Cloud Storage Standard
● Cool - storage class for data that is accessed less frequently, but requires rapid access when needed
○ Amazon S3 Standard-IA and S3 One Zone-IA
○ Microsoft Azure Cool Blob Storage
○ Google Cloud Storage Nearline
● Cold - secure, durable, and low-cost storage service for data archiving
○ Amazon S3 Glacier
○ Microsoft Azure Blob Archive Storage
○ Google Cloud Storage Coldline
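The tiers above map naturally onto how often the data is read. A minimal sketch of that mapping for Google Cloud Storage follows. The function name and the cutoffs are assumptions for illustration: the cutoffs reuse the 30-day and 90-day minimum storage durations of Nearline and Coldline as a rule of thumb, and a real decision should also weigh retrieval fees.

```python
def suggest_gcs_class(days_between_reads):
    """Suggest a storage class from the expected gap between reads (rule of thumb)."""
    if days_between_reads < 30:
        return "STANDARD"   # hot: read more often than about once a month
    if days_between_reads < 90:
        return "NEARLINE"   # cool: read roughly monthly
    return "COLDLINE"       # cold: read about once a quarter or less


daily_dashboard = suggest_gcs_class(1)
monthly_report = suggest_gcs_class(45)
yearly_audit = suggest_gcs_class(365)
```

Treat the result as a starting point for a lifecycle policy rather than a pricing calculation.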
Overview of GCP services - Pub/Sub
Why not just use Kafka?
● Fully managed services
○ Both systems are available as fully managed services in the cloud
● Cloud vs On-prem
○ Pub/Sub is only offered as part of the GCP ecosystem, whereas Apache Kafka can be used both as a cloud service and on-prem
● Message duplication
○ Kafka tracks consumer progress with offsets (kept in ZooKeeper by older clients)
○ Pub/Sub tracks progress by having subscribers acknowledge each message
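To make the duplication point concrete, here is a toy, in-memory sketch of the two bookkeeping styles. It uses neither the Pub/Sub nor the Kafka client libraries; the class and method names (`AckSubscription`, `OffsetLog`, `expire_deadlines`) are hypothetical and only illustrate why a consumer that crashes before acknowledging or committing sees a message again.

```python
from collections import deque


class AckSubscription:
    """Toy ack-based subscription: a message is redelivered until it is acked."""

    def __init__(self, messages):
        self._pending = deque(enumerate(messages))  # (ack_id, payload) pairs
        self._outstanding = {}  # delivered but not yet acknowledged

    def pull(self):
        # Deliver the next message and remember it until an ack arrives.
        ack_id, payload = self._pending.popleft()
        self._outstanding[ack_id] = payload
        return ack_id, payload

    def ack(self, ack_id):
        self._outstanding.pop(ack_id)

    def expire_deadlines(self):
        # Anything still unacked goes back on the queue, so it can be seen twice.
        for ack_id, payload in sorted(self._outstanding.items()):
            self._pending.append((ack_id, payload))
        self._outstanding.clear()


class OffsetLog:
    """Toy offset-based log: the consumer resumes from its last committed offset."""

    def __init__(self, messages):
        self._log = list(messages)
        self.committed = 0

    def poll(self, max_records):
        return self._log[self.committed:self.committed + max_records]

    def commit(self, count):
        self.committed += count


sub = AckSubscription(["a", "b"])
first_id, first = sub.pull()
sub.expire_deadlines()          # consumer crashed before acking "a"
_, second = sub.pull()          # "b" is next in the queue
_, redelivered = sub.pull()     # "a" comes back: a duplicate delivery

log = OffsetLog(["a", "b"])
batch = log.poll(2)             # consumer crashed before committing
replayed = log.poll(2)          # the same batch is read again after restart
```

In both models the consumer should be idempotent, since either style delivers a message again when progress was not recorded.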
Overview of GCP services - Pub/Sub
Why not just use Kafka?
● Retention policy
○ Both Kafka and Pub/Sub have options to configure the maximum retention time
● Consumer Groups vs Subscriptions
○ Pub/Sub uses subscriptions: you create a subscription and then start reading messages from it
○ Kafka uses the concepts of "consumer group" and "partition"
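The difference between the two models can be sketched in a few lines. This is a simplified illustration, not client code: `fan_out` and `assign_partitions` are hypothetical helpers, and the round-robin assignment is only one of several strategies Kafka supports.

```python
from collections import defaultdict


def fan_out(message, subscriptions):
    """Pub/Sub style: every subscription receives its own copy of the message."""
    for queue in subscriptions.values():
        queue.append(message)


def assign_partitions(partitions, consumers):
    """Kafka style: each partition is owned by exactly one consumer in the group."""
    assignment = defaultdict(list)
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return dict(assignment)


subscriptions = {"etl": [], "audit": []}
fan_out("tweet-1", subscriptions)

group = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
```

With subscriptions, adding a new reader means creating another subscription; with consumer groups, parallelism within one group is capped by the number of partitions.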
Overview of GCP services - BigQuery
● Query engines are probably one of the most competitive service categories today:
○ Snowflake
○ Presto
○ Redshift
● How are these warehouses different?
● Presto
○ Self-hosted open-source solution
● Pre-RA3 Redshift
○ Somewhat more fully managed, but still requires the user to configure individual compute clusters with a fixed amount of memory, compute and storage
● Redshift RA3
○ Closer to the user experience of Snowflake by separating compute from storage
● Snowflake
○ The user only configures the size and number of compute clusters
○ Every compute cluster sees the same data
○ Compute clusters can be created and removed in seconds
BigQuery
● Flat-rate mode is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number of "compute slots"
● On-demand mode is a pure serverless model, where the user submits queries one at a time and pays per query
● On-demand mode can be much more expensive, or much cheaper, depending on the nature of your workload
A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode. A "spiky" workload that contains periodic large queries separated by long periods of idleness or lower utilization will be much cheaper in on-demand mode.
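The steady-versus-spiky trade-off can be checked with simple arithmetic. The sketch below assumes on-demand is billed per terabyte scanned and flat-rate per reserved slot per month; the prices (5.0 per TB, 20.0 per slot-month) are illustrative placeholders, not current list prices, and the function names are hypothetical.

```python
def on_demand_cost(tb_scanned, price_per_tb):
    """Monthly cost when every query is billed by the data it scans."""
    return tb_scanned * price_per_tb


def flat_rate_cost(slots, price_per_slot_month):
    """Monthly cost of a fixed slot reservation, regardless of usage."""
    return slots * price_per_slot_month


def cheaper_mode(tb_scanned, slots, price_per_tb=5.0, price_per_slot_month=20.0):
    """Return the cheaper pricing mode for one month under the assumed prices."""
    demand = on_demand_cost(tb_scanned, price_per_tb)
    flat = flat_rate_cost(slots, price_per_slot_month)
    return "on-demand" if demand < flat else "flat-rate"


steady = cheaper_mode(tb_scanned=3000, slots=500)  # busy 24/7, scans a lot
spiky = cheaper_mode(tb_scanned=200, slots=500)    # a few large queries a month
break_even_tb = flat_rate_cost(500, 20.0) / 5.0    # TB/month where both cost the same
```

Under these placeholder prices the break-even sits at 2000 TB scanned per month for a 500-slot reservation; plugging in real prices and measured scan volume gives the actual crossover.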
Overview of GCP services - Dataflow
What is Google Cloud Dataflow?
● Data processing service for both:
○ batch
○ real-time data streaming applications
● Benefits
○ Enables developers to set up analytic pipelines immediately
● Nextgen MapReduce
○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce brought to a single type of computation for batch processing jobs
○ It's based partly on MillWheel and Flume (two Google-developed frameworks for data ingestion and low-latency processing).
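For readers unfamiliar with the MapReduce style mentioned above, here is a minimal single-process word count split into map, shuffle and reduce phases. It is plain Python, not the Apache Beam or Dataflow API, and the function names are hypothetical; a runner such as Dataflow distributes equivalent steps across workers.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data on GCP", "data pipelines on GCP"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
```

Each phase only needs the output of the previous one, which is what lets a pipeline engine run the phases in parallel over partitions of the input.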
Apache Beam SDK and Dataflow Runner
Google Cloud Dataflow overlaps with services such as:
● Amazon Kinesis
● Apache Storm
● Apache Spark
● Facebook Flux
$ java -jar build/libs/transformation-1.0-all.jar \
    --project=ccc-2020-289323 \
    --runner=DataflowRunner \
    --streaming=true \
    --region=us-east1 \
    --tempLocation=gs://chicago-cloud-conference-2020/temp/ \
    --stagingLocation=gs://chicago-cloud-conference-2020/jars/ \
    --filesToStage=build/libs/transformation-1.0-all.jar \
    --maxNumWorkers=2 \
    --numWorkers=1
Apache Beam SDK and Dataflow Runner
Overview of GCP services - Dataproc
On-demand Hadoop cluster
● Of the three managed Hadoop cluster services (Amazon EMR, Azure HDInsight, Dataproc), Dataproc is the fastest to provision
● Easy runtime customization via pip packages
● Not as well integrated with third-party services (compare Azure HDInsight with Databricks, Amazon EMR with Apache Zeppelin)
$ gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --bucket=chicago-cloud-conference-2020 \
    --region=us-east1 \
    --project=ccc-2020-289323 \
    --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'
Overview of GCP services - Cloud Natural Language API
● What can we do with the Cloud Natural Language API?
○ Reveal the structure and meaning of text via machine learning models
○ Extract information about people, places, and events mentioned in text documents, news articles or blog posts
○ Understand sentiment about a product on social media or parse intent from customer conversations happening in a call center or a messaging app
● How can we use it?
○ Analyze text uploaded as part of an HTTP request
○ Integrate with Google Cloud Storage
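The two ways of supplying text differ only in one field of the request body. The sketch below builds both bodies for the REST method `documents:analyzeSentiment` without sending them; authentication and the HTTP call are left out. The helper names and the object path under the bucket are hypothetical.

```python
import json

# REST endpoint for sentiment analysis (v1 of the API).
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSentiment"


def inline_request(text):
    """Request body that carries the text inside the HTTP request itself."""
    return {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "encodingType": "UTF8",
    }


def gcs_request(gcs_uri):
    """Request body that points the API at an object in Cloud Storage."""
    return {
        "document": {"type": "PLAIN_TEXT", "gcsContentUri": gcs_uri},
        "encodingType": "UTF8",
    }


inline_body = json.dumps(inline_request("I love this product"))
# Hypothetical object path; only the bucket name appears in the deck.
gcs_body = json.dumps(gcs_request("gs://chicago-cloud-conference-2020/tweets/sample.txt"))
```

Either body would be sent as a POST to `ENDPOINT` with valid credentials; the response carries the score and magnitude values.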
NLP - Sentiment Analysis
Two types of metrics to consider:
1. Score
a. Ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the general emotional tendency of the text
2. Magnitude
a. Indicates the general intensity of emotion (both positive and negative) in a given text, between 0.0 and +inf
b. Magnitude is not normalized; each expression of emotion in the text (both positive and negative) contributes to the value
Sentiment Sample Values
Positive score: 0.8, magnitude: 3.0
Negative score: -0.6, magnitude: 4.0
Neutral score: 0.1, magnitude: 0.0
Mixed score: 0.0, magnitude: 4.0
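The sample values above suggest a simple way to turn the two metrics into a label. The sketch below is one possible reading, not an official rule: the function name and both cutoffs (0.25 for score, 1.0 for magnitude) are assumptions and should be tuned to the data at hand.

```python
def classify(score, magnitude, score_cutoff=0.25, magnitude_cutoff=1.0):
    """Label a text from its sentiment score and magnitude (assumed cutoffs)."""
    if score >= score_cutoff:
        return "positive"
    if score <= -score_cutoff:
        return "negative"
    # A near-zero score is either no emotion or emotions that cancel out;
    # magnitude tells the two cases apart.
    return "mixed" if magnitude >= magnitude_cutoff else "neutral"


samples = [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]
labels = [classify(score, magnitude) for score, magnitude in samples]
```

Because magnitude grows with text length, a fixed magnitude cutoff works best when the inputs are of similar size, such as tweets.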
Demo - ETL
• Extract – Different sources (Twitter in this case)
• Transform – Cleanup and data presentation
• Load – Columnar format
https://github.com/eschizoid/ccc-2020
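As a rough illustration of the transform and load steps, the sketch below cleans tweet text and pivots the rows into columns. It is not the code from the linked repository: the record shape, the cleanup rule (dropping URLs and @mentions) and the function names are assumptions, and the in-memory dictionary stands in for a real columnar format.

```python
import re

# Assumed cleanup rule: remove links and @mentions from the tweet text.
URL_OR_MENTION = re.compile(r"https?://\S+|@\w+")


def transform(tweet):
    """Strip URLs and mentions, then collapse repeated whitespace."""
    text = URL_OR_MENTION.sub("", tweet["text"])
    return {"id": tweet["id"], "text": " ".join(text.split())}


def to_columns(rows):
    """Pivot a list of row dicts (same keys in each) into one list per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


raw = [
    {"id": 1, "text": "Loving @GCP Dataflow! https://example.com/x"},
    {"id": 2, "text": "  Pub/Sub   is neat "},
]
columns = to_columns([transform(tweet) for tweet in raw])
```

Storing values column by column is what lets an analytic engine scan only the fields a query touches.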
Demo - Analytics
Conclusion
• Cost-effective solution if you know your data access patterns
• Fully serverless architecture
• Extensible workloads
Q&A

