Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020

Who am I?
Mariano Gonzalez
Data/Platform Architect at Otus
Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid open-source contributor.
Most importantly, I am just a person trying to learn about and share big data technologies and approaches.

Agenda
● Goal for this session
● Overview of GCP services
● Apache Beam and GCP Dataflow
● Natural Language Processing for sentiment analysis
● Demo ETL/Analytics
● Q&A

Goal for this Session
Find an elegant way to build and deploy data/analytic pipelines that:
● Support multiple workloads
● Scale compute and storage independently
● Are backed by managed services
● Are cost effective

Common Analytics Pipeline Architecture
[Diagram components: Different Types and Formats of Data, Analytic/Data Pipelines, Data Storage, User]

Overview of GCP services - App Engine
● Good alternative if K8s infrastructure is not in place
● Easy deployment
○ Similar to AWS SAM from a CLI perspective
○ Similar to AWS Elastic Beanstalk from a deployment perspective
● Well integrated with other cloud services
○ GCP Docker registry
● Multiple Runtimes
○ Custom (Docker)
○ JVM/Node/Python

Overview of GCP services - Storage
● Hot - durable, highly available, high-performance object storage for frequently accessed data
○ Amazon S3 Standard
○ Microsoft Azure Hot Blob Storage
○ Google Cloud Storage Standard
● Cool - storage class for data that is accessed less frequently but requires rapid access when needed
○ Amazon S3 Standard-IA and S3 One Zone-IA
○ Microsoft Azure Cool Blob Storage
○ Google Cloud Storage Nearline
● Cold - secure, durable, and low-cost storage service for data archiving
○ Amazon S3 Glacier
○ Microsoft Azure Blob Archive Storage
○ Google Cloud Storage Coldline
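
To make the hot/cool/cold split concrete for Google Cloud Storage, here is a minimal sketch that suggests a storage class from how often an object is expected to be read. The function name and the rule of thumb are assumptions for illustration: the 30-day and 90-day cutoffs simply mirror the minimum storage durations of Nearline and Coldline, and a real decision should also weigh retrieval fees and object size.

```python
def suggest_gcs_storage_class(days_between_reads: float) -> str:
    """Suggest a GCS storage class from the expected read interval.

    Hypothetical rule of thumb: the cutoffs follow the minimum storage
    durations of Nearline (30 days) and Coldline (90 days).
    """
    if days_between_reads < 0:
        raise ValueError("days_between_reads must be non-negative")
    if days_between_reads < 30:
        return "STANDARD"   # hot: frequently accessed data
    if days_between_reads < 90:
        return "NEARLINE"   # cool: read roughly once a month or less
    return "COLDLINE"       # cold: archival, read a few times a year


daily_logs = suggest_gcs_storage_class(1)
monthly_reports = suggest_gcs_storage_class(45)
yearly_audit = suggest_gcs_storage_class(365)
```

Data whose access pattern is unknown is usually safest in the hot tier until real access logs are available.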

Overview of GCP services - Pubsub
Why not just use Kafka?
● Fully managed services
○ Both systems are available as fully managed services in the cloud
● Cloud vs. on-prem
○ Pubsub is only offered as part of the GCP ecosystem, whereas Apache Kafka can run both as a cloud service and on-prem
● Message duplication
○ Kafka tracks consumer progress with offsets (historically managed via ZooKeeper)
○ Pubsub tracks delivery through message acknowledgements
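
The difference between offset tracking and per-message acknowledgement can be sketched with a toy in-memory model. This is not the Kafka or Pub/Sub client API; the class names `OffsetPartition` and `AckSubscription` are hypothetical and only illustrate which messages get redelivered after a consumer crash.

```python
from dataclasses import dataclass, field


@dataclass
class OffsetPartition:
    """Kafka-style toy: one committed offset per partition."""
    records: list
    committed_offset: int = 0

    def poll(self):
        # Everything from the last committed offset onward is delivered.
        return self.records[self.committed_offset:]

    def commit(self, next_offset):
        self.committed_offset = next_offset


@dataclass
class AckSubscription:
    """Pub/Sub-style toy: each message is acknowledged individually."""
    messages: dict  # message id -> payload
    acked_ids: set = field(default_factory=set)

    def pull(self):
        # Only messages that were never acknowledged are delivered again.
        return [mid for mid in self.messages if mid not in self.acked_ids]

    def ack(self, message_id):
        self.acked_ids.add(message_id)


# Offset model: the consumer processes "a" and "b" but crashes before
# committing, so a restart re-reads from offset 0 and sees them again.
partition = OffsetPartition(records=["a", "b", "c"])
first_batch = partition.poll()
redelivered_by_offset = partition.poll()

# Ack model: "m1" and "m3" were acknowledged before the crash, so only
# "m2" comes back, even though it sits between the other two.
subscription = AckSubscription(messages={"m1": "a", "m2": "b", "m3": "c"})
subscription.ack("m1")
subscription.ack("m3")
redelivered_by_ack = subscription.pull()
```

In both models a message can be seen more than once, so downstream processing should be idempotent.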

Overview of GCP services - Pubsub
Why not just use Kafka?
● Retention policy
○ Both Kafka and Pubsub have options to configure the maximum retention time
● Consumer groups vs. subscriptions
○ Pubsub uses subscriptions: you create a subscription and then read messages from it
○ Kafka uses the concepts of "consumer group" and "partition"

Overview of GCP services - BigQuery
● Query engines are probably one of the most competitive service categories today:
○ Snowflake
○ Presto
○ Redshift
● How are these warehouses different?

● Presto
○ Self hosted open source solution
● Pre-RA3 Redshift
○ More fully managed, but still requires the user to configure individual compute clusters with a fixed amount of memory, compute, and storage
● Redshift RA3
○ Closer to the user experience of Snowflake by separating compute from storage
● Snowflake
○ The user only configures the size and number of compute clusters
○ Every compute cluster sees the same data
○ Compute clusters can be created and removed in seconds

BigQuery
● Flat-rate mode is similar to Snowflake, except there is no concept of a compute cluster, just a configurable number of "compute slots"
● On-demand mode is a pure serverless model, where the user submits queries one at a time and pays per query
● On-demand mode can be much more expensive, or much cheaper, than flat-rate mode, depending on the nature of your workload
A "steady" workload that utilizes your compute capacity 24/7 will be much cheaper in flat-rate mode. A
"spiky" workload that contains periodic large queries spaced with long periods of idleness or lower utilization
will be much cheaper in on-demand mode.
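
The steady versus spiky trade-off comes down to simple arithmetic, sketched below. The prices are placeholder figures chosen for illustration, not current BigQuery list prices, and the function names are hypothetical. On-demand cost is modelled as a price per TB scanned, flat-rate as a fixed monthly fee.

```python
# Placeholder prices for illustration only (assumptions, not list prices).
PRICE_PER_TB_USD = 5.0         # on-demand: charged per TB scanned
FLAT_RATE_MONTHLY_USD = 2000.0  # flat-rate: fixed fee for reserved slots


def on_demand_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly on-demand bill for a given scan volume."""
    return tb_scanned * price_per_tb


def break_even_tb(flat_rate_monthly: float, price_per_tb: float) -> float:
    """Monthly scan volume at which both modes cost the same."""
    return flat_rate_monthly / price_per_tb


def cheaper_mode(tb_scanned: float, flat_rate_monthly: float,
                 price_per_tb: float) -> str:
    """Return the cheaper pricing mode for a monthly scan volume."""
    if on_demand_cost(tb_scanned, price_per_tb) < flat_rate_monthly:
        return "on-demand"
    return "flat-rate"


threshold = break_even_tb(FLAT_RATE_MONTHLY_USD, PRICE_PER_TB_USD)
spiky = cheaper_mode(100, FLAT_RATE_MONTHLY_USD, PRICE_PER_TB_USD)
steady = cheaper_mode(1000, FLAT_RATE_MONTHLY_USD, PRICE_PER_TB_USD)
```

With these placeholder numbers, a workload scanning less than the break-even volume per month is cheaper on-demand, and anything above it favors flat-rate.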

Overview of GCP services - Dataflow
What is Google Cloud Dataflow?
● Data processing service for both:
○ batch
○ real-time data streaming applications
● Benefits
○ Enables developers to set up analytic pipelines quickly, without managing the underlying infrastructure
● Nextgen MapReduce
○ Designed to bring to entire analytics pipelines the style of fast parallel execution that MapReduce brought to a single type of computation, batch processing jobs
○ Based partly on MillWheel and FlumeJava (two Google-developed systems for data ingestion and low-latency processing)

Apache Beam SDK and Dataflow Runner
Google Cloud Dataflow overlaps with services such as:
● Amazon Kinesis
● Apache Storm
● Apache Spark
● Facebook Flux
$ java -jar build/libs/transformation-1.0-all.jar \
    --project=ccc-2020-289323 \
    --runner=DataflowRunner \
    --streaming=true \
    --region=us-east1 \
    --tempLocation=gs://chicago-cloud-conference-2020/temp/ \
    --stagingLocation=gs://chicago-cloud-conference-2020/jars/ \
    --filesToStage=build/libs/transformation-1.0-all.jar \
    --maxNumWorkers=2 \
    --numWorkers=1

Overview of GCP services - Dataproc
On-demand Hadoop cluster
● Of the three managed Hadoop cluster services (Amazon EMR, Azure HDInsight, Dataproc), Dataproc is the fastest to provision
● Easy runtime customization via pip packages
● Not as well integrated with third-party services (Azure HDInsight - Databricks, Amazon EMR - Apache Zeppelin)
$ gcloud beta dataproc clusters create cluster-name \
    --optional-components=ANACONDA,JUPYTER \
    --image-version=1.4 \
    --enable-component-gateway \
    --bucket=chicago-cloud-conference-2020 \
    --region=us-east1 \
    --project=ccc-2020-289323 \
    --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage numpy pandas matplotlib'

Overview of GCP services - Cloud Natural Language API
● What can we do with the Cloud Natural Language API?
○ Reveal the structure and meaning of text via machine learning models
○ Extract information about people, places, and events mentioned in text documents, news articles, or blog posts
○ Understand sentiment about a product on social media, or parse intent from customer conversations happening in a call center or a messaging app
● How can we use it?
○ Analyze text uploaded as part of an HTTP request
○ Analyze documents stored in Google Cloud Storage
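
Both ways of using the API can be sketched as the JSON body sent to the REST method `documents:analyzeSentiment`. This sketch only builds the payload and does not call the service; the helper name `build_sentiment_request` and the sample bucket path are assumptions, and authentication is out of scope here.

```python
import json

# REST endpoint for sentiment analysis (v1 of the API).
ANALYZE_SENTIMENT_URL = (
    "https://language.googleapis.com/v1/documents:analyzeSentiment"
)


def build_sentiment_request(text=None, gcs_uri=None):
    """Build the request body for inline text or a Cloud Storage object.

    Exactly one of `text` or `gcs_uri` must be given (hypothetical helper).
    """
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    document = {"type": "PLAIN_TEXT"}
    if text is not None:
        document["content"] = text          # text sent in the HTTP request
    else:
        document["gcsContentUri"] = gcs_uri  # text read from Cloud Storage
    return {"document": document, "encodingType": "UTF8"}


inline_body = build_sentiment_request(text="The demo went really well!")
# The object path below is a made-up example.
gcs_body = build_sentiment_request(gcs_uri="gs://example-bucket/tweets.txt")
inline_json = json.dumps(inline_body)
```

The resulting body would be sent as an HTTP POST to `ANALYZE_SENTIMENT_URL` with valid credentials.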

NLP - Sentiment Analysis
Two types of metrics to consider:
1. Score
a. Ranges between -1.0 (negative) and 1.0 (positive) and corresponds to the general emotional tendency of the text
2. Magnitude
a. Indicates the general intensity of emotion (both positive and negative) in a given text, between 0.0 and +inf
b. Magnitude is not normalized; each expression of emotion in the text (both positive and negative) contributes to the value
Sentiment Sample Values
Positive - score: 0.8, magnitude: 3.0
Negative - score: -0.6, magnitude: 4.0
Neutral - score: 0.1, magnitude: 0.0
Mixed - score: 0.0, magnitude: 4.0
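
One way to turn score and magnitude into the four labels above is a pair of thresholds. The cutoffs below (0.25 for score, 1.0 for magnitude) are assumptions picked so that the sample values land in their stated categories; they are not values defined by the API and should be tuned per use case.

```python
# Hypothetical thresholds, chosen for illustration only.
SCORE_THRESHOLD = 0.25
MAGNITUDE_THRESHOLD = 1.0


def classify_sentiment(score: float, magnitude: float) -> str:
    """Label a document from its sentiment score and magnitude."""
    if score >= SCORE_THRESHOLD:
        return "positive"
    if score <= -SCORE_THRESHOLD:
        return "negative"
    # Score near zero: a high magnitude means strong emotions that
    # cancel out (mixed); a low magnitude means little emotion (neutral).
    if magnitude >= MAGNITUDE_THRESHOLD:
        return "mixed"
    return "neutral"


samples = [(0.8, 3.0), (-0.6, 4.0), (0.1, 0.0), (0.0, 4.0)]
labels = [classify_sentiment(score, magnitude) for score, magnitude in samples]
```

Because magnitude grows with text length, a threshold that works for tweets may be too low for long articles.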

Demo - ETL
● Extract – Different sources (Twitter in this case)
● Transform – Cleanup and data presentation
● Load – Columnar format
https://github.com/eschizoid/ccc-2020
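
The three ETL stages can be sketched in a few lines of plain Python. This is not the code from the demo repository: the sample rows stand in for the Twitter source, the cleanup rules are assumptions, and the "columnar format" is approximated with a dict of column lists rather than a real columnar file.

```python
import re


def extract():
    """Stand-in for a real source such as Twitter (made-up sample rows)."""
    return [
        {"id": 1, "text": "Loving   #GCP at @chicloudconf https://example.com/a"},
        {"id": 2, "text": "  Dataflow job failed again :(  "},
    ]


def transform(rows):
    """Clean up the text: drop links and mentions, normalize whitespace."""
    cleaned = []
    for row in rows:
        text = re.sub(r"https?://\S+", "", row["text"])  # remove URLs
        text = re.sub(r"@\w+", "", text)                 # remove mentions
        text = re.sub(r"\s+", " ", text).strip()         # collapse spaces
        cleaned.append({"id": row["id"], "text": text})
    return cleaned


def load(rows):
    """Pivot row-oriented records into a column-oriented layout."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


table = load(transform(extract()))
```

Storing each column contiguously is what lets analytic engines scan only the columns a query touches.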

Conclusion
● Cost-effective solution if you know your data access patterns
● Fully serverless architecture
● Extensible workloads
