JCConf Dataflow Workshop Labs
{Simon Su / 20161015}
Index
Lab 1: Prepare the Dataflow environment and build your first project
  Create a GCP project and install the Eclipse development environment
  Install the Google Cloud SDK
  Enable the Dataflow API
  Create your first Dataflow project
  Run your project
Lab 2: Deploy your first project to Google Cloud Platform
  Preparation
  Run the deployment
  Check the results
  Implement Input/Output/Transform functionality
Lab 3: Build a streaming Dataflow pipeline
  Create a Pub/Sub topic / subscription
  Deploy the Dataflow streaming sample
  Streaming sample 1
  Streaming sample 2
  Monitor the Dataflow streaming task from the dashboard
After the lab
Lab 1: Prepare the Dataflow environment and build your first project
Create a GCP project and install the Eclipse development environment
Please refer to the pre-workshop setup guide: JCConf 2016 - Dataflow Workshop Setup
Install the Google Cloud SDK
● Install the Cloud SDK by following: https://cloud.google.com/sdk/?hl=en_US#download
● Authenticate the Cloud SDK:
> gcloud auth login
> gcloud auth application-default login
● Set the default project
> gcloud config set project <your-project-id>
● Verify the installation
> gcloud config list
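If the SDK is configured correctly, the output should list your account and the default project. The exact format varies between SDK versions and the values below are placeholders, so treat this as a rough sketch:
[core]
account = you@example.com
project = your-project-id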
Enable the Dataflow API
Go to the API Manager section of your project:
In the API Manager Dashboard, click Enable API:
Search for the Dataflow item:
Enable it:
Create your first Dataflow project
The Eclipse Dataflow wizard helps you create a Dataflow project. The steps are as follows:
Step 1: Choose New > Other...
Step 2: Choose Google Cloud Platform > Cloud Dataflow Java Project
Step 3: Enter your project information
Step 4: Enter your Google Cloud Platform project ID and Cloud Storage settings
Step 5: Once the project has been created, review the project status
The generated sample program looks roughly like the sketch below:
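The exact file depends on the plugin and SDK version, but the wizard typically generates a small starter pipeline that upper-cases a couple of hard-coded strings and logs them. The class and variable names below are assumptions, not the verbatim generated code:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

// A minimal sketch of the wizard-generated starter pipeline (names are assumptions).
@SuppressWarnings("serial")
public class StarterPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    // Build the pipeline from the command-line options (project, staging location, runner, ...).
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(Create.of("Hello", "World"))           // two hard-coded input strings
     .apply(ParDo.of(new DoFn<String, String>() {
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().toUpperCase());      // convert each element to upper case
       }
     }))
     .apply(ParDo.of(new DoFn<String, Void>() {
       @Override
       public void processElement(ProcessContext c) {
         LOG.info(c.element());                    // log each element
       }
     }));

    p.run();
  }
}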
Run your project
Click the button in the upper-right corner to create a new Dataflow Run Configuration...
Set the Run Configuration name
Set the Runner type:
Review the deployment log output...
Lab 2: Deploy your first project to Google Cloud Platform
Preparation
Before starting Lab 2, first make sure that your project from Lab 1 runs correctly. You can then adjust the project to your needs and experiment with the changes...
Run the deployment
Open the Run Configurations window via "Run As > Run Configurations..."
The configuration window looks like this:
Click the "New Launch Configuration" button (marked in red in the figure below) to create a new configuration...
In this lab, two parts of the new configuration need to be set (a sample set of arguments follows this list):
1. Set the Main method
2. Set the Pipeline Arguments
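The exact values depend on your own project and bucket. Based on the command-line options listed later in this document's starter Javadoc, a typical argument set for running on the Dataflow service looks roughly like the following (the staging path is a placeholder):
--project=<YOUR_PROJECT_ID>
--stagingLocation=gs://<YOUR_BUCKET>/staging
--runner=BlockingDataflowPipelineRunner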
Check the results
While the job runs, the Console view shows the execution progress; the output looks roughly like this:
While it is running, you can follow the instructions in the IDE console to open the Web Console and check the status of the Dataflow task:
The detail view of that job looks like this:
You can inspect the execution status via the "LOGS" link...
Implement Input/Output/Transform functionality
Modify your project so that it reads a file from Google Cloud Storage...
@SuppressWarnings("serial")
public class TestMain {
private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
}
}
Next, modify the program further so that the results are written to Google Cloud Storage...
@SuppressWarnings("serial")
public class TestMain {
private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
p.run();
}
}
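To check the result, you can list and read the output objects with gsutil. Note that TextIO.Write may shard the output into several files with numeric suffixes appended to the prefix used in the code; the exact shard naming is an assumption about the default behaviour:
gsutil ls gs://jcconf2016-dataflow-workshop/result/
gsutil cat "gs://jcconf2016-dataflow-workshop/result/book-sample.txt*"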
Add a transform function that splits each line into individual words
@SuppressWarnings("serial")
public class TestMain {
private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(TextIO.Read.named("sample-book").from("gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
.apply(ParDo.of(new DoFn<String, String>() {
private final Aggregator<Long, Long> emptyLines =
createAggregator("emptyLines", new Sum.SumLongFn());
@Override
public void processElement(ProcessContext c) {
if (c.element().trim().isEmpty()) {
emptyLines.addValue(1L);
}
// Split the line into words.
String[] words = c.element().split("[^a-zA-Z']+");
// Output each word encountered into the output PCollection.
for (String word : words) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}))
.apply(TextIO.Write.named("output-book").to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
p.run();
}
}
Word Count sample: count how many times each word appears in the document
@SuppressWarnings("serial")
public class TestMain {
static class MyExtractWordsFn extends DoFn<String, String> {
private final Aggregator<Long, Long> emptyLines = createAggregator(
"emptyLines", new Sum.SumLongFn());
@Override
public void processElement(ProcessContext c) {
if (c.element().trim().isEmpty()) {
emptyLines.addValue(1L);
}
// Split the line into words.
String[] words = c.element().split("[^a-zA-Z']+");
// Output each word encountered into the output PCollection.
for (String word : words) {
if (!word.isEmpty()) {
c.output(word);
}
}
}
}
public static class MyCountWords extends
PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
@Override
public PCollection<KV<String, Long>> apply(PCollection<String> lines) {
// Convert lines of text into individual words.
PCollection<String> words = lines.apply(ParDo.of(new MyExtractWordsFn()));
// Count the number of times each word occurs.
PCollection<KV<String, Long>> wordCounts = words.apply(Count.<String> perElement());
return wordCounts;
}
}
public static class MyFormatAsTextFn extends DoFn<KV<String, Long>, String> {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().getKey() + ": " + c.element().getValue());
}
}
private static final Logger LOG = LoggerFactory.getLogger(TestMain.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args)
.withValidation().create());
p.apply(TextIO.Read.named("sample-book").from(
"gs://jcconf2016-dataflow-workshop/sample/book-sample.txt"))
.apply(new MyCountWords())
.apply(ParDo.of(new MyFormatAsTextFn()))
.apply(TextIO.Write.named("output-book")
.to("gs://jcconf2016-dataflow-workshop/result/book-sample.txt"));
p.run();
}
}
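Each line of the resulting output file has the "word: count" form produced by MyFormatAsTextFn, for example (the counts below are purely illustrative):
the: 1023
and: 875
dataflow: 3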
Lab 3: Build a streaming Dataflow pipeline
Create a Pub/Sub topic / subscription
Create the topic
gcloud beta pubsub topics create jcconf2016
Create a subscription on that topic
gcloud beta pubsub subscriptions create --topic jcconf2016 jcconf2016-sub001
Deploy the Dataflow streaming sample
Streaming sample 1
Listen on the subscription as the input source and write each message to the log (the Options interface and LOG logger used here are the same ones defined in the StreamingPipeline class shown in Streaming sample 2)...
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
p.apply(PubsubIO.Read.named("my-pubsub-input")
.subscription("projects/sunny-573/subscriptions/jcconf2016-sub001"))
.apply(ParDo.of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().toUpperCase());
}
}))
.apply(ParDo.of(new DoFn<String, Void>() {
@Override
public void processElement(ProcessContext c) {
LOG.info(c.element());
}
}));
p.run();
}
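Once the job is running, you can push a few test messages into the topic and watch them appear upper-cased in the job's logs. The command below reflects the gcloud beta Pub/Sub syntax of that era and is only a sketch; adjust it to whatever gcloud pubsub commands your SDK version provides:
gcloud beta pubsub topics publish jcconf2016 "hello dataflow"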
Streaming sample 2
Integrate the Word Count sample and write the results into a BigQuery dataset...
/*
* Copyright (C) 2015 Google Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License"); you may not
* use this file except in compliance with the License. You may obtain a copy of
* the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations under
* the License.
*/
package com.jcconf2016.demo;
import java.util.ArrayList;
import java.util.List;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.Default;
import com.google.cloud.dataflow.sdk.options.Description;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.options.StreamingOptions;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;
/**
 * Streaming sample 2: reads messages from a Pub/Sub topic, applies fixed
 * windows, counts the words in each window, and writes the per-window word
 * counts to a BigQuery table.
 *
 * <p>
 * To run this example locally using DirectPipelineRunner, just execute it
 * without any additional parameters from your favorite development
 * environment. In Eclipse, this corresponds to the existing 'LOCAL' run
 * configuration.
 *
 * <p>
 * To run this example using managed resources in Google Cloud Platform,
 * you should specify the following command-line options:
 * --project=<YOUR_PROJECT_ID>
 * --stagingLocation=<STAGING_LOCATION_IN_CLOUD_STORAGE>
 * --runner=BlockingDataflowPipelineRunner
 * In Eclipse, you can just modify the existing 'SERVICE' run configuration.
 */
@SuppressWarnings("serial")
public class StreamingPipeline {
static final int WINDOW_SIZE = 1; // Default window duration in minutes
public static interface Options extends StreamingOptions {
@Description("Fixed window duration, in minutes")
@Default.Integer(WINDOW_SIZE)
Integer getWindowSize();
void setWindowSize(Integer value);
@Description("Whether to run the pipeline with unbounded input")
boolean isUnbounded();
void setUnbounded(boolean value);
}
private static TableReference getTableReference(Options options) {
TableReference tableRef = new TableReference();
tableRef.setProjectId("sunny-573");
tableRef.setDatasetId("jcconf2016");
tableRef.setTableId("pubsub");
return tableRef;
}
private static TableSchema getSchema() {
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("word").setType("STRING"));
fields.add(new TableFieldSchema().setName("count").setType("INTEGER"));
fields.add(new TableFieldSchema().setName("window_timestamp").setType(
"TIMESTAMP"));
TableSchema schema = new TableSchema().setFields(fields);
return schema;
}
static class FormatAsTableRowFn extends DoFn<KV<String, Long>, TableRow> {
@Override
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("word", c.element().getKey())
.set("count", c.element().getValue())
// include a field for the window timestamp
.set("window_timestamp", c.timestamp().toString());
c.output(row);
}
}
private static final Logger LOG = LoggerFactory
.getLogger(StreamingPipeline.class);
public static void main(String[] args) {
Options options = PipelineOptionsFactory.fromArgs(args)
.withValidation().as(Options.class);
options.setStreaming(true);
Pipeline p = Pipeline.create(options);
PCollection<String> input = p.apply(PubsubIO.Read.topic("projects/sunny-573/topics/jcconf2016"));
PCollection<String> windowedWords = input.apply(
Window.<String>into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new TestMain.MyCountWords());
wordCounts.apply(ParDo.of(new FormatAsTableRowFn())).apply(
BigQueryIO.Write.to(getTableReference(options)).withSchema(getSchema()));
p.run();
}
}
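To verify that rows are arriving in BigQuery, you can query the destination table configured above (sunny-573:jcconf2016.pubsub). The bq invocation below is a sketch and assumes the legacy-SQL defaults of the 2016 tooling:
bq query "SELECT word, count, window_timestamp FROM [sunny-573:jcconf2016.pubsub] ORDER BY window_timestamp DESC LIMIT 20"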
Monitor the Dataflow streaming task from the dashboard
Open the GCP Web Console and use the Dataflow dashboard to inspect the execution status of each step in the pipeline.
You can also inspect the execution logs through Cloud Logging...
After the lab
When the lab is over, remember to cancel the Dataflow job as indicated in the IDE log output; otherwise the streaming Dataflow job keeps running and its worker machines cannot be shut down...
gcloud alpha dataflow jobs --project=sunny-573 cancel 2016-10-14_08_38_48-17987270960467929246
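If you no longer have the job ID from the IDE output, you can first list the jobs in the project. The command below is a sketch based on the same alpha gcloud surface used above:
gcloud alpha dataflow jobs --project=sunny-573 list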