Qian Yu
yuqian8@staff.weibo.com
Weibo Machine Learning Team
Machine Learning with Flink in Weibo
Agenda
1 About Weibo
2 Weibo ML Platform (WML) overview
3 Flink in WML
4 Next steps with Flink
About Weibo
• Most popular and largest social media platform in China
• Built on social relationships: users create, post, and share information with the world
• Weibo has built a social network that connects people, interests, and content
222M DAU
516M MAU
Machine learning Platform in Weibo (WML) —— Overview
Platform layers:
• Cluster: Offline computing cluster, High-performance computing cluster, Online computing cluster
• Schedule: WeiBox (self-developed), K8s, Yarn, Weiflow (self-developed)
• Compute Platform: WeiLearn (self-developed), Flink/Storm, Hadoop/Spark, TensorFlow
• Model Training: LR, GBDT, FM/FFM, DNN/RNN, CF/MF, …
• Online Predicting: WeiServing (self-developed), WeiPS (self-developed)
• Application: Feature Generating, Sample Service, Online Model Training, Online Inference
• Model Zoo, WeiClient, WAIC UI, Sample Pool, Server Lake
Machine learning Platform in Weibo (WML) —— Development
Two-layer DAG design:
• WeiLearn (inner DAG): Input -> Process -> Output (Source -> Process -> Sink)
• WeiFlow: DAG of Tasks
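The two-layer design can be pictured with a minimal Python sketch, assuming an outer DAG of tasks where each task runs its own inner Input -> Process -> Output pipeline. `InnerPipeline` and `OuterDag` are hypothetical names used only for illustration; they are not WeiFlow or WeiLearn APIs.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class InnerPipeline:
    """Inner layer (hypothetical): a linear Input -> Process -> Output pipeline."""
    source: Callable[[], Iterable[int]]
    process: Callable[[int], int]
    sink: List[int] = field(default_factory=list)

    def run(self) -> None:
        for record in self.source():
            self.sink.append(self.process(record))


@dataclass
class OuterDag:
    """Outer layer (hypothetical): named tasks plus their upstream dependencies."""
    tasks: Dict[str, InnerPipeline] = field(default_factory=dict)
    deps: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, name: str, pipeline: InnerPipeline, after: Iterable[str] = ()) -> None:
        self.tasks[name] = pipeline
        self.deps[name] = set(after)

    def run(self) -> List[str]:
        # Run every task only after all of its upstream tasks have finished.
        order = list(TopologicalSorter(self.deps).static_order())
        for name in order:
            self.tasks[name].run()
        return order


extract = InnerPipeline(source=lambda: [1, 2, 3], process=lambda x: x * 10)
train = InnerPipeline(source=lambda: extract.sink, process=lambda x: x + 1)

dag = OuterDag()
dag.add("extract", extract)
dag.add("train", train, after=["extract"])
order = dag.run()
```

Here the outer layer only decides ordering, while each task keeps its own source, processing step, and sink.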
Versions: 1.0, 2.0, 3.0, 4.0, 5.0, WeiLearn 6.0
Metrics shown on the slide, in ascending order (parameter scale / model count / model service QPS / iteration cycle):
• 100M / 10+ / 50k / monthly
• 1B / 20+ / 100k / weekly
• 2B / 30+ / 230k / weekly
• 10B / 80+ / 550k / daily
• 100B / 190+ / 950k / 10min
Models added at each stage:
• + LR. Offline machine learning: LR
• + GBDT. Offline machine learning: LR, GBDT, GBDT+LR
• + DNN. Offline machine learning: LR, GBDT, GBDT+LR. Deep learning: Wide&Deep
• + FM. Offline machine learning: LR, GBDT, GBDT+LR, FM, MF. Deep learning: Wide&Deep
• + Online FM. Online machine learning: Online FM/FFM. Offline machine learning: LR/GBDT/GBDT+LR, FM/MF. Deep learning: Wide&Deep, DeepFM, DSSM
• + Online DNN. Online deep learning: enhance the expressive power of the online machine learning model
After successive iterations, the Weibo machine learning platform (WML) now supports over 100B parameters, 1M QPS, and an iteration cycle of around 10 minutes.
Machine learning Platform in Weibo (WML) —— CTR model iteration
Flink in WML —— Overview
• Business: Interest Feed, Relation Feed, Video Recommendation, Image Recommendation, Content monitoring
• Applications: Multimedia feature generating platform, Content Deduplication, Data Synchronization
• Platform: Real-time feature engineering, Sample service, Online Model Training
• Computation: WeiPig + WeiStream, WeiLearn + WeiFlink
• Infrastructure: Storm Cluster, Flink Cluster (K8s, Yarn), MCQ, Redis, Kafka, HDFS
Apply unified Flink APIs to both online and offline ML pipelines
• Online pipeline: Stream (MCQ) -> Stream processing -> Stream (MCQ) -> Online training -> Model Serving -> Recommendation System
• Offline pipeline: Stream (MCQ) / HDFS -> Batch processing -> HDFS -> Offline training -> Model Serving -> Recommendation System
• Unified pipeline: Stream (MCQ) / HDFS -> Batch/Stream processing -> Stream (MCQ) / HDFS -> Model training (Online/Offline mode) -> Model Serving -> Recommendation System
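The idea behind the unified pipeline can be illustrated with a small Python sketch, assuming the processing logic is written once against a generic record iterator and then fed either a bounded source (standing in for HDFS files) or an unbounded one (standing in for a message queue). The function names and record fields are hypothetical and this does not use the Flink API.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def to_sample(records: Iterable[Dict]) -> Iterator[Dict]:
    """Shared logic: drop records without a user ID, then label clicks as 1."""
    for r in records:
        if r.get("uid") is None:
            continue
        yield {"uid": r["uid"], "label": 1 if r.get("action") == "click" else 0}


def bounded_source() -> Iterable[Dict]:
    """Stands in for a batch read of stored logs."""
    return [
        {"uid": 1, "action": "click"},
        {"uid": None, "action": "click"},
        {"uid": 2, "action": "expo"},
    ]


def unbounded_source() -> Iterator[Dict]:
    """Stands in for a queue: an endless stream of events."""
    n = 0
    while True:
        yield {"uid": n, "action": "click" if n % 2 == 0 else "expo"}
        n += 1


# Offline mode consumes the whole bounded input.
offline_samples = list(to_sample(bounded_source()))
# Online mode consumes the stream incrementally; only three samples are taken here.
online_samples = list(islice(to_sample(unbounded_source()), 3))
```

Because `to_sample` never assumes the input ends, the same code path serves both modes; only the source and the way results are consumed differ.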
Flink in WML —— Overview
• Input logs: Interaction log, Click log, Read log, Expo log
• Sample Service (Flink/timer/state): Data filter & map, Multi-stream joining
• Sample Pool
• Model Training (WeiLearn): Feature Processor, Forward Predictor, Gradient Computer, WeiPS agent
• WeiPS: PSscheduler, PSserver (×3), checkpoint (shown twice on the slide); ZooKeeper; File/Queue
• Model Zoo
• Model Predicting (WeiServing): WeiPS proxy, Feature Processor, Predictor, Model Loader, Model/Version Manager, fbthrift RPC
• Model Evaluation: Metric/AUC/ACC/MSE, Model stabilization, Consistency check
• Model Deployment: CICD / CTCD, Blue-green deployment, Model Collection, Flow shift
• Service discovery, Sorting
• Real-time logs: 100+
• Click behavior delay: minutes -> seconds
• User behavior count: 200+
Flink in WML —— Online model training
• Sample alignment
• Performance improvement
• High availability & fault tolerance
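The training loop named on this slide (Feature Processor, Forward Predictor, Gradient Computer, parameter server agent) can be sketched in Python as one online step of logistic regression against an in-memory parameter store. This is a minimal sketch under stated assumptions: `ParamServer`, its `pull`/`push` methods, and the sample layout are hypothetical and do not describe the WeiPS or WeiLearn interfaces.

```python
import math
from collections import defaultdict
from typing import Dict, List


class ParamServer:
    """In-memory stand-in for a parameter server (hypothetical API)."""

    def __init__(self, lr: float = 0.1) -> None:
        self.lr = lr
        self.weights: Dict[str, float] = defaultdict(float)

    def pull(self, keys: List[str]) -> Dict[str, float]:
        return {k: self.weights[k] for k in keys}

    def push(self, grads: Dict[str, float]) -> None:
        # Plain SGD update applied on the server side.
        for k, g in grads.items():
            self.weights[k] -= self.lr * g


def feature_processor(sample: Dict) -> List[str]:
    """Turn raw features into one-hot feature keys such as 'uid=1'."""
    return [f"{name}={value}" for name, value in sorted(sample["features"].items())]


def forward_predictor(weights: Dict[str, float]) -> float:
    """Logistic regression over the active one-hot features."""
    z = sum(weights.values())
    return 1.0 / (1.0 + math.exp(-z))


def gradient_computer(keys: List[str], p: float, label: int) -> Dict[str, float]:
    """Log-loss gradient for each active one-hot weight is (p - label)."""
    return {k: p - label for k in keys}


def train_step(ps: ParamServer, sample: Dict) -> float:
    keys = feature_processor(sample)
    p = forward_predictor(ps.pull(keys))
    ps.push(gradient_computer(keys, p, sample["label"]))
    return p
```

Calling `train_step` repeatedly on a stream of samples moves the stored weights toward the observed labels, one sample at a time.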
Flink in WML —— Sample Service
• Sample Service: Offline Data -> Offline data processing; Real-time Logs -> Online data processing
• Flow: Data Processing -> Sample Pool -> Model Training
• Feature Generating: Post, User, Relationship, Content, Multimedia (Picture, Video, Audio)
• Data Processing: General computing, Multi-stream joining, Deep learning …
Multiple stream joining:
1. Data Source 1, Data Source 2, …, Data Source N
2. Input filter & map (UDF), one per data source
3. Distributed by key
4. Joining window (State -> RocksDB/Gemini -> Checkpoint)
5. After-joining filter & map & append features (UDF)
6. WML Sample Pool
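The "filter & map, then distribute by key" stages can be sketched in Python as follows. This is only an illustration under assumptions: the UDF signature (return `None` to drop a record), the `route` helper, and the field names are hypothetical, and a real Flink job would rely on the framework's own keyed partitioning rather than this hand-rolled hash.

```python
import zlib
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]
Udf = Callable[[Record], Optional[Record]]  # returning None filters the record out


def partition_for(key: str, parallelism: int) -> int:
    """Stable hash partitioning; crc32 avoids Python's per-process str hash salt."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def route(sources: Dict[str, Iterable[Record]], udfs: Dict[str, Udf],
          key_field: str, parallelism: int) -> List[List[Record]]:
    """Apply each source's UDF, then send records to partitions by join key."""
    partitions: List[List[Record]] = [[] for _ in range(parallelism)]
    for name, records in sources.items():
        for record in records:
            mapped = udfs[name](record)
            if mapped is None:
                continue
            tagged = dict(mapped, source=name)  # remember which stream it came from
            index = partition_for(str(tagged[key_field]), parallelism)
            partitions[index].append(tagged)
    return partitions
```

Because the partition depends only on the key, records with the same key from different sources always meet in the same partition, which is what makes a local join possible.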
Flink in WML —— Sample Service
Joining time window & sample alignment
Flink in WML —— Sample Service
• Input streams: Interaction log, Expo log, Click log, Read log
• Joining time window (10min) -> Sample stream
• State: (key, value) in RocksDB/Gemini, kept from timestamp until timestamp + 10min
• HDFS
Improvements:
1. Flush a sample out immediately once its join is complete
2. Sample compensation
3. RocksDB vs Gemini
4. Balance between join success rate and time window size
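The joining window with early flush can be sketched in Python as keyed state plus a timer. This is a minimal sketch, assuming a sample counts as complete once both an exposure and a click record have arrived for the same key; the `WindowJoiner` class, the `REQUIRED` set, and the choice to emit expired partial samples are assumptions made for illustration, not the behavior of the Weibo job.

```python
from typing import Dict, List, Optional, Tuple

REQUIRED = frozenset({"expo", "click"})  # hypothetical: streams needed for a full sample


class WindowJoiner:
    """Keyed join state with a per-key deadline (first event time + window)."""

    def __init__(self, window_s: int) -> None:
        self.window_s = window_s
        self.state: Dict[str, Tuple[int, Dict[str, dict]]] = {}

    def on_event(self, key: str, stream: str, ts: int, payload: dict) -> Optional[dict]:
        deadline, parts = self.state.get(key, (ts + self.window_s, {}))
        parts[stream] = payload
        if REQUIRED.issubset(parts):
            # Early flush: emit as soon as the join is complete, do not wait for the timer.
            self.state.pop(key, None)
            return {"key": key, "complete": True, **parts}
        self.state[key] = (deadline, parts)
        return None

    def on_timer(self, now: int) -> List[dict]:
        # Window expiry: emit whatever was collected and release the state.
        expired = [k for k, (deadline, _) in self.state.items() if deadline <= now]
        out = []
        for k in expired:
            _, parts = self.state.pop(k)
            out.append({"key": k, "complete": False, **parts})
        return out
```

A larger `window_s` lets more late records find their partner, at the cost of holding more state and delaying incomplete samples, which is the trade-off named in improvement 4.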
• Job management (Twinkle/VVP -> Flink Cluster): Job status management, Metrics monitoring & alerting, Job history
• Input: Sample ID, Data source conf, Filter & Map UDFs, …
• Process: Filter & Map UDFs, Joining window, …
• Output: Sample ID, Data sink conf, …
• Interaction layer: WeiClient, WAIC UI

• Further encapsulation -> WeiFlink + Weiplugin
• Use Jenkins for UDF CI/CD
• Unified streaming data using sample IDs
• Develop inner DAG logic inside WeiLearn
• Grafana: monitoring and alerting
• Twinkle/VVP & WAIC UI to manage jobs
• Review job history in HDFS
• WAIC UI: select your data and UDFs
• WeiClient: command-line tool (CLT) for submitting jobs to different clusters
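The Input / Process / Output configuration listed on this slide could be modeled along the following lines. This is a hypothetical Python sketch: the class names, field names, and validation rules are assumptions chosen to mirror the slide's labels, not the actual WeiFlink configuration format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InputConf:
    sample_id: str
    data_source: str
    udfs: List[str] = field(default_factory=list)  # names of filter & map UDFs


@dataclass
class ProcessConf:
    joining_window_s: int
    udfs: List[str] = field(default_factory=list)


@dataclass
class OutputConf:
    sample_id: str
    data_sink: str


@dataclass
class SampleJob:
    inputs: List[InputConf]
    process: ProcessConf
    output: OutputConf

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the job looks consistent."""
        errors: List[str] = []
        if not self.inputs:
            errors.append("at least one input is required")
        if self.process.joining_window_s <= 0:
            errors.append("joining window must be positive")
        ids = {i.sample_id for i in self.inputs} | {self.output.sample_id}
        if len(ids) != 1:
            errors.append("inputs and output must share one sample ID")
        return errors
```

Validating a declarative job description before submission is one way a UI or command-line client can reject a misconfigured job early.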
Flink in WML —— Sample Service
Deep Learning & online inference with Flink: Multimedia online computing and inference platform
• Offline training, Offline GPU (K8s): Distributed model training service
  ✓ One-click CICD
  ✓ Distributed training
  ✓ Supports different types of models
• Online inference, Online GPU (K8s): Multimedia online inference
  ✓ Success rate: 99.99%+
  ✓ Delay: seconds
  ✓ Develop mode: Configurable
• Data source: Text stream, Image stream, Video stream
• Multimedia features processing
• Model Zoo
• Data center, Sample Pool
• Business & Applications
• Service assurance: Reconciliation system, End-to-end monitoring, Case tracing
Flink in WML —— Multimedia feature generating
Monitoring & Case tracing
Flink in WML —— Multimedia feature generating
High availability & fault tolerance
1. Automatic restart
2. Restore from checkpoints and snapshots
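The restart-and-restore behavior can be illustrated with a small Python simulation, assuming an operator whose state includes its input offset so that a restart replays from the last checkpoint. `CountingOperator` and `run_with_restarts` are hypothetical names; Flink's own checkpointing is considerably more involved than this sketch.

```python
from typing import Dict, List, Optional


class CountingOperator:
    """Keyed counter whose state can be snapshotted and restored (hypothetical)."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = {}
        self.offset = 0  # position in the input, stored with the state

    def process(self, key: str) -> None:
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def snapshot(self) -> dict:
        return {"counts": dict(self.counts), "offset": self.offset}

    def restore(self, snap: dict) -> None:
        self.counts = dict(snap["counts"])
        self.offset = snap["offset"]


def run_with_restarts(events: List[str], checkpoint_every: int,
                      fail_at: Optional[int]) -> Dict[str, int]:
    op = CountingOperator()
    checkpoint = op.snapshot()
    failed_once = False
    while op.offset < len(events):
        if fail_at is not None and op.offset == fail_at and not failed_once:
            failed_once = True
            op = CountingOperator()  # automatic restart with empty state
            op.restore(checkpoint)   # roll back to the last checkpoint and replay
            continue
        op.process(events[op.offset])
        if op.offset % checkpoint_every == 0:
            checkpoint = op.snapshot()
    return op.counts
```

Because the offset is restored together with the counts, events processed after the last checkpoint are replayed exactly once and the final result matches a run without failure.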
Next steps with Flink —— Real time data warehousing
Real-time data warehousing: unifying batch and stream schema & APIs
• Development: one single code base (SQL + UDF)
• Abstract APIs: everything is a table (unified table registration APIs and metadata)
• Table schema: Schema + format + connector
• Engines: Offline Engine (Mapred), Online Engine (Flink)
• Storage: Offline Storage (HDFS), Online Storage
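The "schema + format + connector" idea can be sketched in Python as a catalog that keeps one schema per logical table and one physical binding per execution mode. Everything here is hypothetical: the `Catalog` and `Binding` classes, the mode names, and the example formats and connectors are illustrative assumptions, not Flink's catalog API.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Schema = Tuple[Tuple[str, str], ...]  # ((column name, type name), ...)


@dataclass(frozen=True)
class Binding:
    """Physical side of a table: how rows are encoded and where they live."""
    format: str
    connector: str


class Catalog:
    """One schema per logical table, one binding per execution mode (hypothetical)."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Schema] = {}
        self._bindings: Dict[Tuple[str, str], Binding] = {}

    def register(self, name: str, mode: str, schema: Schema, binding: Binding) -> None:
        known = self._schemas.setdefault(name, schema)
        if known != schema:
            # Batch and stream views of a table must agree on the schema.
            raise ValueError(f"schema mismatch for table {name!r}")
        self._bindings[(name, mode)] = binding

    def resolve(self, name: str, mode: str) -> Tuple[Schema, Binding]:
        return self._schemas[name], self._bindings[(name, mode)]


catalog = Catalog()
clicks: Schema = (("uid", "BIGINT"), ("ts", "TIMESTAMP"))
catalog.register("clicks", "offline", clicks, Binding("parquet", "hdfs"))
catalog.register("clicks", "online", clicks, Binding("json", "mcq"))
```

Query code would then refer only to the logical name `clicks`, and the engine picks the binding that matches its mode.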
Deep learning with Flink: dealing with complex computation online
• Training: Distributed model training service
  ✓ One-click CICD
  ✓ Distributed training
  ✓ Supports different types of models
• Inference: Multimedia online inference
  ✓ Success rate: 99.99%+
  ✓ Delay: seconds
  ✓ Develop mode: Configurable
• Online GPU (K8s), TF on Flink
• Data source: Text stream, Image stream, Video stream
• Multimedia features processing
• Model Zoo
• Data center, Sample Pool
• Business & Applications
• Service assurance: Reconciliation system, End-to-end monitoring, Case tracing
Next steps with Flink —— DL in Flink
Qian Yu
yuqian8@staff.weibo.com
Weibo Machine Learning Team
Thanks!


Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
