Huawei Advanced Data Science
With Spark Streaming
Jianfeng Qian, Cheng He
Huawei Research Institute
Contents
• streamDM: Stream Mining in Spark Streaming
(Jianfeng Qian)
• Business scenarios in Huawei with Spark Streaming
(Cheng He)
Open Source Machine Learning Projects
▪ Apache Mahout
• May 2010, v0.1: supports Hadoop
• Apr 2014, Mahout-Samsara, v0.10: supports Spark and H2O
• Apr 2016, v0.12: R-like DSL, supports Flink
▪ oryx & oryx2
• Dec 2013, v0.3.0: real-time large-scale machine learning, supports Hadoop
• Dec 2015, v2.0: supports Spark Streaming
▪ Apache SAMOA: Scalable Advanced Massive Online Analysis
• Jul 2015, v0.3.0: supports Storm, Samza and Flink
Stream Data Mining?
Stream Data Mining
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it is
discarded or archived
• Data is evolving
Approximation algorithms
• Small error rate with high probability
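As an illustration of "small error rate with high probability" in sublinear space, here is a minimal Python sketch of a Count-Min sketch. It is not part of streamDM; the class name, the item names, and the choice of `blake2b` as the hash family are assumptions made for this example. With width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, an estimate never falls below the true count and exceeds it by more than εN (N items seen) with probability at most δ.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate frequency counter using O(log(1/delta) / epsilon) counters."""

    def __init__(self, epsilon: float, delta: float) -> None:
        # Width controls the size of the error, depth the probability of exceeding it.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, item: str, row: int) -> int:
        # One salted hash per row; blake2b gives the same value on every run.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item: str, count: int = 1) -> None:
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the row minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical stream: two recurring items plus 1000 items seen once each.
sketch = CountMinSketch(epsilon=0.01, delta=0.01)
stream = ["alarm_a"] * 500 + ["alarm_b"] * 30 + [f"alarm_{i}" for i in range(1000)]
for item in stream:
    sketch.add(item)

print(sketch.width, sketch.depth, sketch.estimate("alarm_a"))
```

The table size depends only on ε and δ, not on the number of distinct items, which is what makes the structure usable on a potentially infinite stream.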
streamDM?
streamDM
streamDM is incremental
• streamDM is designed specifically to be used inside Spark
Streaming.
• All algorithms are incremental
streamDM!
streamDM for users
• Download streamDM
git clone https://github.com/huawei-noah/streamDM.git
• Build streamDM
sbt package
• Execute streamDM tasks
./spark.sh "EvaluatePrequential \
-l (SGDLearner -l 0.01 -o LogisticLoss -r ZeroRegularizer) \
-s (FileReader -k 100 -d 60 -f ../data/mydata)" \
1> ../sgd.log 2> ../sgd.result
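To make the task above more concrete, the following Python sketch shows what prequential ("test, then train") evaluation of an SGD learner with logistic loss and no regularization looks like in principle. It is an illustration only, not streamDM code: the class and function names are hypothetical, the synthetic stream stands in for the file reader, and it assumes that `-l 0.01` in the command denotes the learning rate.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Clamp the score so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class LogisticSGD:
    """Incremental logistic regression: one gradient step per example."""

    def __init__(self, n_features: int, learning_rate: float = 0.01) -> None:
        self.weights = [0.0] * (n_features + 1)  # last entry is the bias
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # zip stops after len(x) entries, so the bias is added separately.
        return sigmoid(self.weights[-1] + sum(w * v for w, v in zip(self.weights, x)))

    def predict(self, x) -> int:
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def train(self, x, y: int) -> None:
        # Gradient of the logistic loss with respect to the score is (p - y).
        error = self.predict_proba(x) - y
        for i, v in enumerate(x):
            self.weights[i] -= self.learning_rate * error * v
        self.weights[-1] -= self.learning_rate * error


def prequential_accuracy(stream, learner) -> float:
    correct = total = 0
    for x, y in stream:
        correct += int(learner.predict(x) == y)  # test on the example first
        learner.train(x, y)                      # then use it for training
        total += 1
    return correct / total if total else 0.0


rng = random.Random(0)


def synthetic_stream(n: int):
    # Hypothetical data: label is 1 when the two features sum to a positive value.
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


learner = LogisticSGD(n_features=2, learning_rate=0.01)
accuracy = prequential_accuracy(synthetic_stream(5000), learner)
print(f"prequential accuracy: {accuracy:.3f}")
```

Because every example is used for testing before the model sees its label, no separate hold-out set is needed, which suits a stream that cannot be stored.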
streamDM for Programmers
• StreamReader: reads and parses Examples and creates a stream
• Learner: provides the train method for an input stream
• Model: the data structure and set of methods used by the Learner
• Evaluator: evaluates predictions
• StreamWriter: outputs streams
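The division of labor between these components can be sketched in a few lines of Python. This is a simplified illustration of the roles, not the streamDM API (which is written in Scala on top of Spark Streaming): all class names below are hypothetical, and a majority-class learner is used only to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    features: List[float]
    label: int


class ListStreamReader:
    """Reader role: parses 'label,f1,f2,...' lines into micro-batches of Examples."""

    def __init__(self, lines: List[str], batch_size: int) -> None:
        self.lines = lines
        self.batch_size = batch_size

    def batches(self) -> Iterator[List[Example]]:
        batch: List[Example] = []
        for line in self.lines:
            label, *values = line.split(",")
            batch.append(Example([float(v) for v in values], int(label)))
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:
            yield batch


class MajorityClassModel:
    """Model role: the state the learner updates and queries."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def update(self, example: Example) -> None:
        self.counts[example.label] += 1

    def predict(self, example: Example) -> int:
        return self.counts.most_common(1)[0][0] if self.counts else 0


class MajorityClassLearner:
    """Learner role: exposes train and predict over batches."""

    def __init__(self) -> None:
        self.model = MajorityClassModel()

    def train(self, batch: List[Example]) -> None:
        for example in batch:
            self.model.update(example)

    def predict(self, batch: List[Example]) -> List[int]:
        return [self.model.predict(example) for example in batch]


class AccuracyEvaluator:
    """Evaluator role: scores predictions against the true labels."""

    def evaluate(self, batch: List[Example], predictions: List[int]) -> float:
        hits = sum(int(ex.label == p) for ex, p in zip(batch, predictions))
        return hits / len(batch)


class ListStreamWriter:
    """Writer role: collects output (a real writer would emit a stream)."""

    def __init__(self) -> None:
        self.rows: List[float] = []

    def write(self, value: float) -> None:
        self.rows.append(value)


def run(reader, learner, evaluator, writer) -> None:
    for batch in reader.batches():
        predictions = learner.predict(batch)  # predict before seeing the labels
        writer.write(evaluator.evaluate(batch, predictions))
        learner.train(batch)                  # then update the model


lines = ["1,0.5", "1,0.7", "0,0.1", "1,0.9", "1,0.3", "0,0.2"]
writer = ListStreamWriter()
run(ListStreamReader(lines, batch_size=2), MajorityClassLearner(), AccuracyEvaluator(), writer)
print(writer.rows)
```

Keeping the five roles separate means a different learner or evaluator can be swapped in without touching the reader or the writer.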
streamDM
• Advanced machine learning methods, including streaming decision
trees and streaming clustering methods such as CluStream and StreamKM++.
• Ease of use: experiments can be run from the command line,
as in WEKA or MOA.
• High extensibility
• No dependence on third-party libraries, in particular on the linear algebra
package Breeze.
streamDM
First Release 31/12/15
• SGD Learner and Perceptron
• Naive Bayes
• CluStream
• Hoeffding Decision Trees
• Bagging
• StreamKM++
Next Release 31/12/16 (supports Spark 2.0)
• Random Forests
• Frequent Itemset Miner: IncMine
Something else…
Platform: Google Cloud Dataflow, Spark, Flink, Storm
Stream Mining: streamDM, SAMOA
Machine Learning (Batch): mllib, Mahout, oryx2
Something else…
Interface & SDK: Beam
Platform: Google Cloud Dataflow, Spark, Flink, Storm
Stream Mining: streamDM, SAMOA
BeamML
Machine Learning (Batch): mllib, Mahout, oryx2
or something else…
Contents
• streamDM: Stream Mining in Spark Streaming
(Jianfeng Qian)
• Business scenarios in Huawei with Spark Streaming
(Cheng He)
Special Challenges in Telecom Big Data Analytics
Case 1: Alarm Analysis
Cloud Service
Alarm
Trouble Tickets
Field Operation / Maintenance
Root Cause Analysis
• 40-100 million alarms / day
• Change rapidly, with unnoticed phenomena
AABD: Automatic Alarm Behavior Discovery
Project A, Project B, Project C, Project D, …
Database, …, Topology, Configuration
• Alarm correlation: automatic generation and optimization of correlation rules
• Root Cause Identification: quick troubleshooting, reduce MTTR
• Fault Penetration: multi-dimensional analysis of historical alarms to identify potential network risks or faults
• Forecasting: forecasting the next alarm/fault
Rule Database
Algorithms Library: flapping behavior mining, frequent item sets mining, causal inference
Correlation Method, Cross Domain Performance, Service Impact, …
MLlib -> streamDM
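As a rough illustration of how frequent item sets mining can suggest alarm correlation rules, the Python sketch below counts alarm types that co-occur in the same time window. It is a deliberately simplified stand-in, not the AABD algorithm: the function name, the fixed tumbling windows, the restriction to pairs, and the alarm names are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_alarm_pairs(events, window_seconds, min_support):
    """Return alarm-type pairs whose window support is at least min_support.

    events: iterable of (timestamp_seconds, alarm_type) tuples.
    Support = fraction of non-empty windows in which both alarm types occur.
    """
    windows = {}
    for timestamp, alarm in events:
        windows.setdefault(int(timestamp // window_seconds), set()).add(alarm)

    pair_counts = Counter()
    for alarms in windows.values():
        # Sorting gives each pair one canonical ordering.
        for pair in combinations(sorted(alarms), 2):
            pair_counts[pair] += 1

    n_windows = len(windows)
    return {
        pair: count / n_windows
        for pair, count in pair_counts.items()
        if count / n_windows >= min_support
    }


# Hypothetical alarm log: (seconds since start, alarm type).
events = [
    (0, "LINK_DOWN"), (5, "LOS"),
    (62, "LINK_DOWN"), (70, "LOS"), (75, "FAN_FAIL"),
    (130, "LINK_DOWN"),
    (190, "LOS"), (191, "LINK_DOWN"),
]
rules = frequent_alarm_pairs(events, window_seconds=60, min_support=0.5)
print(rules)  # {('LINK_DOWN', 'LOS'): 0.75}
```

A pair that survives the support threshold is only a candidate rule; deciding which alarm is the root cause would need the topology and configuration data named on the slide.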
AABD for Automatic Alarm Management
• 30X: Rule Design Efficiency
• 30%: Alarm Correlation Implementation Process Time
AABD: Automatic Alarm Behavior Discovery
Rule Database
Algorithms Library: flapping behavior mining, frequent item sets mining, causal inference
Correlation Method, Cross Domain Performance, Service Impact, …
Fault Management: Auto Troubleshooting Rule, Auto Alarm Rule, Auto Diagnostic Expert Systems
AABD: Results from practical applications
• Deployed at 13 operator sites all over the world
• Deployment effort reduced from 5 man-months to 3 man-days
• Millions of alarm sequences processed in < 1 hour
Case 2: Fault Localization for Customer Care
Panels: Traditional CCA, Intelligent CCA
Diagram labels: Complaint; Query; Log file / Signaling events; Fault Localization Engine; Network KB; Knowledge Base; Maintenance; Complaint Process; Application KB; Network/Charging/Support Dept. …; Front line Customer Service Staff; Expert / Manager Review; Maintenance Dept.; Complaints; Fault Localization Results; Ensemble Classification Model; Automatic Feature Engineering; Expert Rules; KB; Answers; Difficult issues
Challenges on Discriminative sequence pattern mining
Original issues
1. It is an NP-hard problem to mine patterns from massive sequence data for accurate classification
2. The upper bound of the pattern search space is $O(S^{s})$
3. Usually, important and discriminative patterns occur infrequently, so we will probably search the total space of $O(S^{s(1-p)})$
• Open dataset: 192 samples; p = 12%, no. of patterns 8,600; p = 4%, no. of patterns 92,000
• In CCA: 538 samples, p = 5%, no. of patterns 687,402
State of art
1. Based on divide-and-conquer, we can reduce the search space; even when we adopt a larger p, we can still get discriminative patterns
2. The scale-down ratio is $1/S^{sp}$
• In CCA: 538 samples, p = 20%, no. of patterns 29,696
Research goals
1. Online stream pattern mining: dynamically generate trees for pattern mining, further reducing the search space to $O(2^{l} \cdot m^{m(1-p)})$
2. Approximation analysis: balance complexity and accuracy
$G(F_{1st}) - G(F_{2nd}) > \epsilon, \quad \epsilon = \sqrt{\dfrac{R^{2} \ln(1/\delta)}{2n}}$
Ref: "Direct mining of discriminative and essential frequent
patterns via model-based search tree", W. Fan et al., KDD 2008
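The expression on this slide appears to be the standard Hoeffding bound used by Hoeffding decision trees: after n observations of a quantity with range R, the observed gap between the best and second-best split scores, G(F_1st) − G(F_2nd), can be trusted with probability 1 − δ once it exceeds ε = sqrt(R² ln(1/δ) / (2n)). The Python sketch below works through that arithmetic under this assumption; the function names and the sample numbers are illustrative only.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best: float, gain_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    # Split only when the observed gap is larger than the statistical uncertainty.
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)


# Illustrative numbers: gains in [0, 1], confidence 1 - 1e-7.
for n in (100, 1000, 10000):
    eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=n)
    print(n, round(eps, 4), should_split(0.30, 0.25, 1.0, 1e-7, n))
```

Since ε shrinks as 1/sqrt(n), a small but real gap between two candidates is eventually accepted once enough examples have arrived, which is how the approximation trades accuracy against the amount of data consumed.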
Experiment results
• Mines discriminative and essential patterns with extremely low global support directly.
• Better efficiency for stream sequential data mining while preserving accuracy.
• Detects concept drift quickly and adapts fast to the new concept.
Design of Fault Localization Solution
Data preprocessing
Sequence Encoding
Sequential Pattern Mining
Modeling
Pattern Matching
Online Classification
Label Feedback
Model Update
• 4-6 TB data / day
• 8M - 10M sequences
• E2E feedback support: < 3.6 sec
• Model update: 5 hours
Spark Streaming
Spark
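The pattern matching, online classification, and label feedback stages can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the deployed solution: the event names and patterns are made up, sequential patterns are matched as ordered subsequences with gaps allowed, and a simple perceptron stands in for the ensemble classification model mentioned in the deck.

```python
def contains_subsequence(sequence, pattern) -> bool:
    """True if the pattern's events occur in the sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # 'event in remaining' advances the iterator, so order is enforced.
    return all(event in remaining for event in pattern)


def encode(sequence, patterns):
    """Pattern matching: one binary feature per mined pattern."""
    return [1 if contains_subsequence(sequence, p) else 0 for p in patterns]


class OnlinePerceptron:
    """Online classifier that is updated whenever a label is fed back."""

    def __init__(self, n_features: int) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def predict(self, x) -> int:
        score = self.bias + sum(w * v for w, v in zip(self.weights, x))
        return 1 if score > 0 else 0

    def feedback(self, x, label: int) -> None:
        error = label - self.predict(x)
        if error:  # update only on a mistake
            self.weights = [w + error * v for w, v in zip(self.weights, x)]
            self.bias += error


# Hypothetical mined patterns and labelled signaling sequences (1 = fault class).
patterns = [("attach", "auth_fail"), ("attach", "handover", "drop")]
labelled = [
    (["attach", "auth_req", "auth_fail"], 1),
    (["attach", "handover", "data", "drop"], 0),
    (["attach", "data", "detach"], 0),
]

model = OnlinePerceptron(n_features=len(patterns))
for _ in range(2):  # replay the labelled feedback twice
    for sequence, label in labelled:
        model.feedback(encode(sequence, patterns), label)

predictions = [model.predict(encode(sequence, patterns)) for sequence, _ in labelled]
print(predictions)  # [1, 0, 0]
```

In this arrangement the expensive step, mining the patterns, can run as a periodic batch job, while matching and classification stay cheap enough to run per sequence.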
Lessons from practical applications
▪ About Spark:
  • Advantages:
    • Easy to use
    • Perfect community & ecosystem
  • Limitations:
    • Delay
    • Throughput
▪ About big data analytics in telecom networks:
  • Efficient sequential pattern mining framework
  • Deep reinforcement learning
  • Robust ML / AI & domain knowledge
  • Closed-loop evaluation
StreamSMART
Scenarios & Apps:
▪ App recommendation system: 100M+ customers; 30M - 100M features
▪ Anti-DDoS solution: 4M - 10M flows / sec
Huawei Innovation Research Program
• The Huawei Innovation Research Program (HIRP) provides funding opportunities to
leading universities and research institutes conducting innovative research in
communication technology, computer science, engineering, and related fields. HIRP
seeks to identify and support world-class, full-time faculty members pursuing
innovation of mutual interest. Outstanding HIRP winners may be invited to establish
further long-term research collaboration with Huawei.
• Call for Proposals for Big Data & Artificial Intelligence
• HIRPO20160606: Novel Algorithm Design and Use Cases for Data Stream Mining based on streamDM
https://innovationresearch.huawei.com/IPD/hirp/portal/index.html
Join us to build a better connected world
THANK YOU
jianfeng.qian@outlook.com, hecheng@huawei.com
