Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016

Serverless Data Architecture at
scale on Google Cloud Platform
Lorenzo Ridi
MILAN 25-26 NOVEMBER 2016

Black Friday (ˈblæk fraɪdɪ)
noun
The day following Thanksgiving Day in the
United States. Since 1932, it has been
regarded as the beginning of the Christmas
shopping season.

Black Friday in the US
2012 - 2016
source: Google Trends, November 23rd 2016

Black Friday in Italy
2012 - 2016
source: Google Trends, November 23rd 2016

What are we doing
Processing
+ analytics
Tweets about
black friday
insights

Pub/Sub
Container
Engine
(Kubernetes)
How we’re gonna do it

What is Google Cloud Pub/Sub?
● Google Cloud Pub/Sub is a
fully-managed real-time
messaging service.
○ Guaranteed delivery
■ “At least once” semantics
○ Reliable at scale
■ Messages are replicated in
different zones

From Twitter to Pub/Sub
$ gcloud beta pubsub topics create blackfridaytweets
Created topic [blackfridaytweets].
SHELL

?
Pub/Sub Topic
Subscription A
Subscription B
Subscription C
Consumer A
Consumer B
Consumer C

● Simple Python application using the TweePy library
# somewhere in the code, track a given set of keywords
stream = Stream(auth, listener)
stream.filter(track=['blackfriday', [...]])
[...]
# somewhere else, write messages to Pub/Sub
for line in data_lines:
pub = base64.urlsafe_b64encode(line)
messages.append({'data': pub})
body = {'messages': messages}
resp = client.projects().topics().publish(
topic='blackfridaytweets', body=body).execute(num_retries=NUM_RETRIES)
PYTHON

App
+
Libs

VM
App
+
Libs

App
+
Libs
Container

App
+
Libs
Container
FROM google/python
RUN pip install --upgrade pip
RUN pip install pyopenssl ndg-httpsclient pyasn1
RUN pip install tweepy
RUN pip install --upgrade google-api-python-client
RUN pip install python-dateutil
ADD twitter-to-pubsub.py /twitter-to-pubsub.py
ADD utils.py /utils.py
CMD python twitter-to-pubsub.py
DOCKERFILE

App
+
Libs
Container Pod

What is Kubernetes (K8S)?
● An orchestration tool for managing a
cluster of containers across multiple
hosts
○ Scaling, rolling upgrades, A/B testing, etc.
● Declarative – not procedural
○ Auto-scales and self-heals to desired
state
● Supports multiple container runtimes,
currently Docker and CoreOS Rkt
● Open-source: github.com/kubernetes

App
+
Libs
Container Pod
apiVersion: v1
kind: ReplicationController
metadata:
[...]
Spec:
replicas: 1
template:
metadata:
labels:
name: twitter-stream
spec:
containers:
- name: twitter-to-pubsub
image: gcr.io/codemotion-2016-demo/pubsub_pipeline
env:
- name: PUBSUB_TOPIC
value: ...
YAML

App
+
Libs
Container Pod Node

Node
Pod A Pod B

Node 1
Node 2

$ gcloud container clusters create codemotion-2016-demo-cluster
Creating cluster cluster-1...done.
Created [...projects/codemotion-2016-demo/.../clusters/codemotion-2016-demo-cluster].
$ gcloud container clusters get-credentials codemotion-2016-demo-cluster
Fetching cluster endpoint and auth data.
kubeconfig entry generated for cluster-1.
$ kubectl create -f ~/git/kube-pubsub-bq/pubsub/twitter-stream.yaml
replicationcontroller “twitter-stream” created.
SHELL

Pub/Sub
Kubernetes

Pub/Sub
Kubernetes
Dataflow

Pub/Sub
Kubernetes
Dataflow
BigQuery

What is Google Cloud Dataflow?
● Cloud Dataflow is a collection
of open source SDKs to
implement parallel processing
pipelines.
○ same programming model for
streaming and batch pipelines
● Cloud Dataflow is a managed
service to run parallel
processing pipelines on
Google Cloud Platform

What is Google BigQuery?
● Google BigQuery is a fully-
managed Analytic Data
Warehouse solution allowing
real-time analysis of Petabyte-
scale datasets.
● Enterprise-grade features
○ Batch and streaming (100K
rows/sec) data ingestion
○ JDBC/ODBC connectors
○ Rich SQL-2011-compliant query
language
○ Supports updates and deletes
new!
new!

From Pub/Sub to BigQuery
Pub/Sub Topic
Subscription
Read tweets
from
Pub/Sub
Format
tweets for
BigQuery
Write tweets
on BigQuery
BigQuery
Table
Dataflow Pipeline

● A Dataflow pipeline is a Java program.
// TwitterProcessor.java
public static void main(String[] args) {
Pipeline p = Pipeline.create();
PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
formattedTweets.apply(BigQueryIO.Write.to(tableReference));
p.run();
}
JAVA

● A Dataflow pipeline is a Java program.
// Do Function (to be used within a ParDo)
private static final class DoFormat extends DoFn<String, TableRow> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(DoFn<String, TableRow>.ProcessContext c) {
c.output(createTableRow(c.element()));
}
}
// Helper method
private static TableRow createTableRow(String tweet) throws IOException {
return JacksonFactory.getDefaultInstance().fromString(tweet, TableRow.class);
}
JAVA

● Use Maven to build, deploy or update the Pipeline.
$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor
-Dexec.args="--streaming"
[...]
INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-
19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------
SHELL

● You can monitor your pipelines from Cloud Console.

● Data start flowing into BigQuery tables. You can run queries
from the CLI or the Web Interface.

Pub/Sub
Kubernetes
Dataflow
BigQuery
Data
Studio

Pub/Sub
Kubernetes
Dataflow
BigQuery
Natural
Language
API
Data
Studio

Sentiment Analysis with Natural Language API
Polarity: [-1,1]
Magnitude: [0,+inf)
Text

Polarity: [-1,1]
Magnitude: [0,+inf)
Text
sentiment = polarity x magnitude

Pub/Sub Topic
Read tweets
from
Pub/Sub
Write tweets
on BigQuery BigQuery
Tables
Dataflow Pipeline
Filter and
Evaluate
sentiment
Format
tweets for
BigQuery
Write tweets
on BigQuery
Format
tweets for
BigQuery

● We just add the additional necessary steps.
public static void main(String[] args) {
Pipeline p = Pipeline.create();
PCollection<String> tweets = p.apply(PubsubIO.Read.topic("...blackfridaytweets"));
PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));
PCollection<TableRow> formattedTweets = tweets.apply(ParDo.of(new DoFormat()));
formattedTweets.apply(BigQueryIO.Write.to(tableReference));
p.run();
}
JAVA
PCollection<String> sentTweets = tweets.apply(ParDo.of(new DoFilterAndProcess()));
PCollection<TableRow> formSentTweets = sentTweets.apply(ParDo.of(new DoFormat()));
formSentTweets.apply(BigQueryIO.Write.to(sentTableReference));

● The update process preserves all in-flight data.
$ mvn compile exec:java -Dexec.mainClass=it.noovle.dataflow.TwitterProcessor
-Dexec.args="--streaming --update --jobName=twitterprocessor-lorenzo-1107222550"
[...]
INFO: To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=codemotion-2016-demo cancel 2016-11-
19_15_49_53-5264074060979116717
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 18.131s
[INFO] Finished at: Sun Nov 20 00:49:54 CET 2016
[INFO] Final Memory: 28M/362M
[INFO] ------------------------------------------------------------------------
SHELL

Pub/Sub
Kubernetes
Dataflow
BigQuery
Data
Studio
We did it!
Natural
Language
API

Polarity: -1.0
Magnitude: 1.5
Polarity: -1.0
Magnitude: 2.1

Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016

More Related Content

What's hot (20)

Viewers also liked (20)

Similar to Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016 (20)

More from Codemotion (20)

Recently uploaded (20)

Serverless Data Architecture at scale on Google Cloud Platform - Lorenzo Ridi - Codemotion Milan 2016

Editor's Notes