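Among All Martial Arts, Only Speed Is Unbeatable: 
 Building a Real-Time Classifier and a Real-Time Recommender System from Streaming Data 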
Yahoo! Taiwan EC Data Team
Who I am 
▪ Norman Huang (normany@yahoo-inc.com) 
▪ Software & Data Engineer of Yahoo! Taiwan 
▪ Aims to retrieve and deliver data insights via BI 
platform and data mining algorithms. 
Who I am 
▪ Jason Lin (jasonysl@yahoo-inc.com) 
▪ Software & Data Engineer of Yahoo! Taiwan 
▪ Responsible for recommendation system 
personalization mechanisms and cloud 
computing development. 
Agenda 
▪ Challenges 
▪ Solution: Pinball 
▪ Q&A 
Challenges 
▪ Static content until the next batch job. 
▪ Batch product recommendation algorithms have become a common 
feature of e-commerce platforms. 
Challenges 
▪ Nearly 72% of visitors made their decision on the same day. 
▪ A real-time solution is needed to interact with potential buyers. 
[Figure: timeline of data absorbed into batch views vs. the most recent several hours of data, not yet absorbed] 
Solution: Pinball 
Pinball 
[Figure: a user passes through classifiers and is assigned to Profile A, Profile B, or Profile C] 
Pinball 
▪ Real-time classifier 
▪ Detects buyers' preferences through streaming data processing 
▪ Delivers personalized ads and product recommendations on the fly 
▪ Challenges 
› How to do it in real time? 
› Storm! 
› How to determine customers' purchasing desire? 
› Buying Intention Detection 
Solution: Pinball 
▪ Storm Overview 
▪ Buying Intention (BI) 
▪ Architecture and Design 
Pinball 
[Figure: Storm learns from buyers (Learning); for each visitor, Pinball asks "Is Potential Buyer?" and serves promotions, so that the visitor becomes a buyer] 
Storm Concepts 
▪ Tuple & Streams 
▪ Spouts & Bolts 
▪ Topologies 
Yahoo Confidential & Proprietary 
Tuple & Streams 
▪ Tuple: a list of named fields (Field 1, Field 2, Field 3, Field 4, Field 5) 
▪ Stream: an unbounded sequence of tuples (Tuple 1, Tuple 2, Tuple 3, ..., Tuple n) 
Spouts & Bolts 
[Figure: a spout emits a stream of tuples (T) to a bolt, which in turn emits tuples (T)] 
Topology 
[Figure: a spout feeding streams through two bolts] 
▪ Hadoop map-reduce job vs. Storm topology
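The spout -> bolt -> bolt wiring can be sketched in plain Python. This is only an analogy, not the Storm API: the generator functions, the log layout, and all names below are hypothetical. One real difference worth keeping in mind is that a MapReduce job eventually finishes, while a Storm topology keeps running until it is killed; the sketch uses a finite list so that it terminates.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical log lines in "userid,type" form (layout assumed for illustration).
LOG_LINES = ["userA,view", "userB,view", "userA,buy", "userA,view"]


def log_spout(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Spout: the source of the stream; emits one (userid, type) tuple per line."""
    for line in lines:
        userid, event_type = line.split(",")
        yield (userid, event_type)


def view_filter_bolt(stream: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Bolt: consumes tuples and re-emits only the page-view events."""
    for userid, event_type in stream:
        if event_type == "view":
            yield (userid, event_type)


def count_bolt(stream: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keeps a running page-view count per user and emits (userid, count)."""
    counts = {}
    for userid, _ in stream:
        counts[userid] = counts.get(userid, 0) + 1
        yield (userid, counts[userid])


# "Topology": the wiring of spout -> bolt -> bolt.
results = list(count_bolt(view_filter_bolt(log_spout(LOG_LINES))))
print(results)
```

Each stage processes tuples one at a time as they arrive, which is the property that distinguishes stream processing from a batch job over a complete data set.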
Storm Concepts 
          Computational Primitives   Use Case            High-level Language 
Hadoop    Map & Reduce               Batch Processing    Pig 
Storm     Spout & Bolt               Stream Processing   Trident
Storm 
[Figure: one Nimbus node, three Zookeeper nodes, five Supervisor nodes] 
▪ Nimbus: master node, similar to the Hadoop JobTracker 
▪ Zookeeper: coordinates the Storm cluster 
▪ Supervisor: runs worker processes 
Buying Intention 
▪ Based on our findings: 
› The more page views a visitor makes, the higher the chance that the visitor will buy. 
› However, the buying intention value varies by category. 
How to leverage 
Storm with Buying Intention (BI)?
Data Flow Diagram 
Adaptive Learning 
Learning & Classifier 
▪ Online Binary Classification 
› Simple and computationally efficient 
› Update rule: BI' = BI + (PV − BI) × γ 
▪ e.g. 
› assumptions: γ = 0.1, BI = 3 
› scenario: a user makes 6 page views (PV) before purchasing 
• BI' = 3 + (6 − 3) × 0.1 
• BI' = 3.3 
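The update rule BI' = BI + (PV − BI) × γ can be sketched as a few lines of Python. The per-category dictionary follows the earlier point that the buying intention value varies by category; the category names, their starting values, and the rule "page views at or above BI means potential buyer" are assumptions made for illustration, not something the slides specify.

```python
from typing import Dict


def update_bi(bi: float, page_views: int, gamma: float) -> float:
    """One online update step: move BI toward the observed page-view count."""
    return bi + (page_views - bi) * gamma


# Hypothetical per-category BI values (names and numbers are assumptions).
bi_by_category: Dict[str, float] = {"shoes": 3.0, "cameras": 8.0}


def observe_purchase(category: str, page_views: int, gamma: float = 0.1) -> float:
    """Update a category's BI after a purchase that took `page_views` views."""
    bi_by_category[category] = update_bi(bi_by_category[category], page_views, gamma)
    return bi_by_category[category]


def is_potential_buyer(category: str, page_views: int) -> bool:
    """Assumed classification rule: enough page views relative to the category BI."""
    return page_views >= bi_by_category[category]


# The slide's example: gamma = 0.1, BI = 3, purchase after 6 page views.
new_bi = observe_purchase("shoes", 6)
print(round(new_bi, 2))
```

With a small γ, a single unusual visitor shifts the threshold only slightly, while a sustained change in behavior moves it step by step.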
Buying Intention Qualification 
Topology Design
Lambda Architecture 
▪ Term coined by Nathan Marz (creator of Apache Storm) 
▪ Batch + real-time processing 
› Hybrid batch and real-time processing 
› Batch processing is treated as the source of truth, and real-time processing updates 
models/insights between batches. 
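A minimal sketch of how a query might combine the two layers, assuming the views are simple per-user page-view counts. The class name, method names, and sample numbers are hypothetical; a production system would discard only the real-time data that the finished batch job has actually absorbed, which this sketch simplifies to a full reset.

```python
from collections import Counter
from typing import Dict


class LambdaViews:
    """Serves queries by merging a batch view with a real-time view."""

    def __init__(self, batch_view: Dict[str, int]) -> None:
        self.batch_view = dict(batch_view)       # source of truth, rebuilt per batch
        self.realtime_view: Counter = Counter()  # increments since the last batch

    def on_event(self, userid: str) -> None:
        """Real-time path: absorb one page view immediately."""
        self.realtime_view[userid] += 1

    def query(self, userid: str) -> int:
        """Answer = batch result + whatever arrived after the batch ran."""
        return self.batch_view.get(userid, 0) + self.realtime_view[userid]

    def on_batch_complete(self, new_batch_view: Dict[str, int]) -> None:
        """Batch path: replace the batch view and reset the real-time view."""
        self.batch_view = dict(new_batch_view)
        self.realtime_view.clear()


views = LambdaViews({"userA": 12, "userB": 3})
views.on_event("userA")
views.on_event("userC")
print(views.query("userA"), views.query("userC"))
```

Because the batch view is recomputed from the full data set, any error in the real-time path is overwritten at the next batch run.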
Lambda Architecture 
[Figure: Lambda Architecture diagram, annotated with "Storm Streaming" and "Summingbird"] 
[REF] http://lambda-architecture.net/
Pinball Demonstration
How to keep it generic and flexible? 
▪ to add more signals 
▪ to add more online learning algorithms 
▪ to add more channels
How to keep it generic and flexible? 
▪ Signals: Click, Login, Buy, View, Bounce, Time Spent, Webpage Sequence 
▪ Algorithms: Buying Intention, Fraud Detection 
▪ Channels: Email, Y! Webpages, Mobile Apps, Messenger
Summary 
▪ Scales to process real-time data 
▪ Supports online learning algorithms 
▪ Enables flexible interactions with visitors 
▪ Increases user engagement 
▪ Increases the conversion rate 
▪ Creates synergy by combining the batch recommender with Pinball 
Simple Hands-on 
-> Find out the heavy users!
Find out the heavy users! 
▪ Keep track of the number of page views for each user 
▪ If the number is greater than 3, the user is a heavy user 
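The two rules above can be sketched as a small in-memory counter. This is plain Python rather than a Storm bolt, and the class name and sample events are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class HeavyUserDetector:
    """Remembers page views per user and flags users above a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.page_views: DefaultDict[str, int] = defaultdict(int)

    def record_view(self, userid: str) -> bool:
        """Count one page view; return True once the user is a heavy user."""
        self.page_views[userid] += 1
        return self.page_views[userid] > self.threshold


# Hypothetical stream of page views, one userid per event.
events: List[str] = ["userA", "userB", "userB", "userC", "userB", "userB"]

detector = HeavyUserDetector()
heavy_users: Set[str] = set()
for userid in events:
    if detector.record_view(userid):
        heavy_users.add(userid)

print(heavy_users)
```

The counter lives in the memory of a single process here, which is why the way tuples are routed to bolts matters once several bolt instances run in parallel.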
Find out the heavy users! 
[Figure: User Log Spout -> Learning Bolt; tuple fields: userid, type, catlv1, catlv2, timestamp]
Find out the heavy users! 
[Figure: User Log Spout -> two Learning Bolts via shuffle grouping; tuples (userid, type, catlv1, catlv2, timestamp) for userA, userB, userC, userD, userE are distributed randomly, so tuples for the same user may land on different bolts]
Find out the heavy users! 
[Figure: User Log Spout -> two Learning Bolts via fields grouping on userid; tuples (userid, type, catlv1, catlv2, timestamp) with the same userid (e.g. all userB tuples) go to the same bolt]
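The difference between the two groupings can be simulated in plain Python. This is a sketch, not Storm code: the CRC32 hash is a stand-in for whatever hash the framework applies to the grouping field, and the number of bolts and the sample events are assumptions.

```python
import random
import zlib
from collections import Counter
from typing import Callable, List

NUM_BOLTS = 2
# Hypothetical userids in arrival order.
EVENTS = ["userA", "userB", "userD", "userB", "userE", "userC", "userB", "userC"]


def fields_grouping(userid: str) -> int:
    """Route by a hash of the userid field: same user, same bolt."""
    return zlib.crc32(userid.encode("utf-8")) % NUM_BOLTS


def run(route: Callable[[str], int]) -> List[Counter]:
    """Send every event to the bolt chosen by `route`; each bolt counts locally."""
    bolts = [Counter() for _ in range(NUM_BOLTS)]
    for userid in EVENTS:
        bolts[route(userid)][userid] += 1
    return bolts


def owners(bolts: List[Counter], userid: str) -> List[int]:
    """Indices of the bolts that have seen at least one tuple for this user."""
    return [i for i, bolt in enumerate(bolts) if userid in bolt]


rng = random.Random(42)
shuffled = run(lambda userid: rng.randrange(NUM_BOLTS))  # shuffle grouping
grouped = run(fields_grouping)                           # fields grouping

print("shuffle:", owners(shuffled, "userB"))
print("fields: ", owners(grouped, "userB"))
```

With shuffle grouping, a user's page views can be split across bolts, so no single bolt holds the full count. With fields grouping on userid, exactly one bolt owns each user, which is what a per-user counter needs.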
Find out the heavy users! 
[Figure: User Log Spout -> two Learning Bolts via fields grouping on userid -> Qualification Bolt; the Learning Bolts emit (userid, totalPV) tuples, e.g. for userA, userB, userC, and userF, to the Qualification Bolt]
Questions? 
Norman: @normanyhuang, www.linkedin.com/in/normany 
Jason: @kalijason, www.linkedin.com/pub/jason-lin/67/93/743
