RPig: A Scalable Framework for Machine Learning and Advanced Statistical Functionalities
MingXue Wang
Sidath B. Handurukande
Mohamed Nassar
Network Management Lab, Ericsson Ireland
CloudCom 2012
Ericsson | Page 2
Agenda
› Context and technology background
– Big data analytics for network management
– Hadoop, Pig, R
– Related work
› RPig
– RPig framework and RPig script
– Case study
› Conclusion
Big data analytics in network management
› Capabilities of big data analytics
– Service assurance
– Predictive analysis
› Large amount of network data
– Thousands of cells, nodes
– Millions of connected devices, terminals
– Billions of sessions, events
› Machine learning and advanced statistical algorithms
– Network fault, KPI prediction
– CDR, traffic data analysis
RPig framework context
› VoIP use case: Network KPIs -> Service KPIs -> Alarm events
› Figure labels: Service Assurance; RPig; RPig execution platform; SVM-based algorithm; VoIP QoE alarm models; Network KPIs (packet loss, jitter, delay, etc.); VoIP QoE alarms, triggers
Hadoop and MapReduce
› Hadoop ecosystem (figure)
– Zookeeper: Coordination
– Hadoop DFS: Hadoop Distributed File System
– Hadoop MapReduce: Distributed parallel programming framework
– Pig: Data flow
– Mahout: ML/DM
– Hive: SQL
– HBase: NoSQL
– S4: Streaming
– Hama: BSP
– Giraph: Graphs
– Ambari: Management
– Our Framework (ML/DM)
– …
› Big data management system
– terabytes/petabytes of data
– hundreds/thousands of nodes
› MapReduce
– map(k1,v1)-> list(k2,v2); reduce(k2,list(v2))->list(v3)
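The map/reduce signatures above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop code: the event records, the "cell" field and the helper run_mapreduce are hypothetical names introduced here.

```python
from collections import defaultdict

def map_fn(_key, event):
    # map(k1, v1) -> list(k2, v2): emit (cell, 1) for each session event
    return [(event["cell"], 1)]

def reduce_fn(_cell, counts):
    # reduce(k2, list(v2)) -> list(v3): total number of sessions for one cell
    return [sum(counts)]

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle phase: group every mapped value by its intermediate key
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Reduce phase: one reduce call per intermediate key
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Hypothetical input: (record offset, event) pairs
events = [
    (0, {"cell": "A", "client": "Skype"}),
    (1, {"cell": "B", "client": "Other"}),
    (2, {"cell": "A", "client": "Other"}),
]
sessions_per_cell = run_mapreduce(events, map_fn, reduce_fn)
```

In a real cluster the map and reduce calls run on different nodes and the grouping step is done by the framework; here everything runs in one process for readability.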
Pig and Pig Latin
› Pig - Big data management system
– Similar to SQL in RDBMS
– Pig Latin - A high level data flow language
› Events = FILTER Events BY (client == 'Skype' OR ...);
– Define data processing flows on unstructured raw data
– Execution in MapReduce model
› Other similar
– JAQL from IBM, …
› Pro: Scalable; Distributed parallel processing
› Con: Not for ML and advanced statistical functionalities
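To make the data-flow idea concrete, here is a rough Python analogue of a FILTER step followed by a grouping step. It is a sketch only: the records, the "bytes" field and the helpers filter_by and group_sum are hypothetical, and Pig itself would compile such steps into MapReduce jobs rather than run them in memory.

```python
# Hypothetical event records; the field names are illustrative only
events = [
    {"client": "Skype", "bytes": 1200},
    {"client": "Web", "bytes": 300},
    {"client": "Skype", "bytes": 800},
]

def filter_by(relation, predicate):
    # Rough counterpart of FILTER ... BY: keep rows that satisfy the predicate
    return [row for row in relation if predicate(row)]

def group_sum(relation, key, field):
    # Rough counterpart of GROUP ... BY followed by SUM over one field
    totals = {}
    for row in relation:
        totals[row[key]] = totals.get(row[key], 0) + row[field]
    return totals

# Each step takes a relation and produces a new relation, as in a Pig script
voip_events = filter_by(events, lambda row: row["client"] == "Skype")
bytes_per_client = group_sum(voip_events, "client", "bytes")
```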
R and R packages
› R - Traditional statistical software
– A software environment and language for statistical computing and advanced data analysis
– Thousands of R packages
– EMA calculation using the TTR package
› library(TTR); results <- EMA(temp, 20)
› Other similar:
– Matlab, Weka, …
› Pro: Sophisticated statistical algorithms for advanced analysis
– Clustering, regression, etc.
› Con: Not scalable; data must be loaded into memory and processed on a single computer
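For reference, the exponential moving average (EMA) behind a call such as EMA(temp, 20) is usually defined by the recursion below. The smoothing factor 2/(n+1) is the common convention and is an assumption here; the exact defaults and seeding used by TTR should be checked against its documentation.

```latex
\mathrm{EMA}_t = \alpha\, x_t + (1 - \alpha)\, \mathrm{EMA}_{t-1},
\qquad \alpha = \frac{2}{n + 1}
```

With n = 20 this gives a smoothing factor of 2/21, roughly 0.095, so each new observation contributes about 9.5% of the updated average.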
Related work - Extending R
› Extending traditional statistical software
› Scaling memory size
– Use hard disk as external memory
– E.g. RevoScaleR, bigmemory
› Scaling storage size
– Directly read/write data in a large-scale DMS
– E.g. Ricardo, RJDBC, RMySQL
› Scaling CPU power
– MapReduce based (e.g. RHIPE, RHadoop)
› Require manually designing complex key-value-pair-based map and reduce functions
– Non-MapReduce based (e.g. Rmpi, snow, cloudRmpi, Elastic-R)
› Do not support parallel data read/write as Hadoop does
› Require writing programs with complex MPI APIs
Related work - Other solutions
› Developing new frameworks
› E.g.
– Mahout
› In a preliminary stage
› Lacking many commonly used algorithms, e.g. SVM
› It does not provide a high-level language such as R or Pig
– SystemML
› DML (a new ML language) is not as flexible as the R language
› Lacks implementations of commonly used statistical algorithms
› Con: Lacking algorithm implementations; no high-level language support, or else a new language must be learned.
RPig framework
› Our approach: “RPig”
– Integrated framework
› R + Pig
– Integrated language
› Fast algorithm development
– Automatic distributed parallel execution
› Figure labels: Development; Execution
RPig script
› Pig prepares the data movement; R does the statistical tasks
› RPigEditor (screenshot labels: Pig operations; R function)
Forecasting with EMA – case 1
› Case scenario
– Forecasting VoIP traffic in the next time period
› Design: Reduce the data size, then apply the EMA calculation
› RPig implementation summary
– Pig operations are used as pre-processing steps to summarize data
– Any statistical algorithm implementation in R can then be used directly on the summarized data, as in the traditional single-machine R approach
› Data flow (figure): Raw events -> Pig operations -> Summarized events -> R functions -> output
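The two-stage design of this case (summarize first, then forecast) can be sketched as below. This is a plain-Python illustration, not an RPig script: summarize stands in for the Pig pre-processing, ema_forecast stands in for the R call, and the event tuples, the period length and the choice to seed the EMA with the first observation are all assumptions made for the example.

```python
from collections import Counter

def summarize(raw_events):
    # Pre-processing step (the role Pig plays): count events per time period
    counts = Counter(period for period, _client in raw_events)
    return [counts[p] for p in sorted(counts)]

def ema_forecast(series, n):
    # Statistical step (the role R plays): EMA with alpha = 2 / (n + 1),
    # seeded with the first observation (an assumed, simple seeding choice)
    alpha = 2.0 / (n + 1)
    ema = float(series[0])
    for x in series[1:]:
        ema = alpha * x + (1 - alpha) * ema
    return ema

# Hypothetical raw events: (time period, client); 2, 4 and 6 events per period
raw_events = [(p, "Skype") for p, k in [(0, 2), (1, 4), (2, 6)] for _ in range(k)]

traffic = summarize(raw_events)          # small enough for one machine
forecast = ema_forecast(traffic, n=3)    # forecast for the next period
```

The point of the design is that only the summarization has to scale with the raw data; the statistical step sees a short series and can run exactly as it would on a single machine.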
Reduced Development Effort
› 15 configured nodes, 128 MB/block
› Two approaches
– Pig: EMA implemented in Java to extend Pig
– RPig
› Small overhead
› Pig approach: > 100 lines of code; our RPig approach: less than 10 lines of code
Prediction with SVM – case 2
› Case scenario
– Training a model for predicting Service KPIs based on Network KPIs
› Design: Split the data into small SVM training tasks, then execute them in parallel
› RPig implementation summary
– Parallel or iterative statistical algorithms are expressed as parallel R executions in a Pig data flow
› Data flow (figure): Training data -> Pig operations -> Split training data (several splits) -> R functions (one per split) -> output
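The split-then-train-in-parallel pattern can be sketched as follows. Several things here are assumptions for illustration only: a plain perceptron is used as a stand-in for SVM training (it is not an SVM), a thread pool stands in for the parallel R executions scheduled by Pig, and the sub-models are combined by majority vote, which the slides do not specify.

```python
from concurrent.futures import ThreadPoolExecutor

def train_perceptron(samples, max_epochs=100):
    # Stand-in for SVM training on one split: a plain perceptron, NOT an SVM
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in samples:
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:
                w[0] += y * x1
                w[1] += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # converged on this split
            break
    return w, b

def predict(models, point):
    # Combine the sub-models by majority vote (an assumed combination rule)
    votes = sum(1 if w[0] * point[0] + w[1] * point[1] + b > 0 else -1
                for w, b in models)
    return 1 if votes > 0 else -1

# Toy, linearly separable training data with labels +1 and -1
data = []
for i in range(1, 7):
    data.append(((i, i + 1), 1))
    data.append(((-i, -i - 1), -1))

# Split step (the role of the Pig operations): three interleaved splits
splits = [data[j::3] for j in range(3)]

# Parallel step (the role of the R functions): one training task per split
with ThreadPoolExecutor(max_workers=3) as pool:
    models = list(pool.map(train_perceptron, splits))
```

Python threads do not give real CPU parallelism for this kind of work; the thread pool only shows the structure. In the setting described here, each split would be trained by a separate task on the cluster.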
ML Scalability
› Machine Learning (SVM training phase)
– CPU intensive rather than I/O intensive
– 6K training samples
Conclusions
› RPig
– Scalable ML and statistical functionalities while minimizing the development effort
› Big data analytics in a high-level language
– Without needing to learn new languages or APIs, or to rewrite complex statistical algorithms.
› Parallelizes executions automatically
– Handles low-level operations (data transformation, fault handling, etc.) itself.
› Future work
– Will focus on minimizing the overhead and increasing the usability of our framework

Editor's Notes

  • #2: Scaling statistical analysis and machine learning on Hadoop for service assurance.
  • #8: For example, IBM has its own alternative to Pig, and so does Microsoft; IBM also has its own alternative to S4. (deduct) Hstreaming () Foundation layer
  • #9: Pig allows data analysis flows, similar to SQL, to be defined on unstructured raw data stored in HDFS. Pig can automatically generate MapReduce functions from Pig scripts for scalable data processing.
  • #21: Real experiment results. Same training dataset, 10-fold cross-validation, one kernel, …