Training on a Pluggable Machine Learning Platform
Machine Learning on Hadoop at Huffington Post | AOL
A Little Bit about Us
Core Services Team at HPMG | AOL
  • Thu Kyaw (thu.kyaw@teamaol.com), Principal Software Engineer. Worked on machine learning, data mining, and natural language processing.
  • Sang Chul Song, Ph.D. (sangchul.song@teamaol.com), Senior Software Engineer. Worked on data-intensive computing: data archiving and information retrieval.
Machine Learning: Supervised Classification
  • 1. Learning Phase: Train → Model (example labels on the slide: “Business”, “Entertainment”, “Politics”)
  • 2. Classifying Phase: “capital gains to be taxed …” → Classify (using the Model) → Result
Two Machine Learning Use Cases at HuffPost | AOL
  • Comment Moderation: evaluate all new HuffPost user comments every day; identify abusive / aggressive comments; auto delete / publish ~25% of comments every day.
  • Article Classification: tag articles for advertising, e.g. scary, salacious, …
Our Classification Tasks
  • Comment Moderation: abusive vs. non-abusive
  • Article Classification: e.g. scary, sexy
In Order to Meet Our Needs, We Require…
  • Support for important algorithms, including SVM, Perceptron / Winnow, Bayesian, Decision Tree, AdaBoost, …
  • The ability to build a large number of models on a regular basis and pick the best, because in general it is difficult to know in advance which algorithm and parameter set will work best.
However,
  • With N algorithms, K parameters each, and L values for each parameter, there are N × L^K combinations, which is often too many to deal with sequentially.
  • For example, N=5, K=5, L=10 gives 5 × 10^5 = 500K combinations.
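The count on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function names, algorithm names, and parameter names (`p1`, `p2`) are made up for the example and do not come from the deck.

```python
from itertools import product


def count_combinations(n_algorithms: int, k_params: int, l_values: int) -> int:
    """Number of models to train: N algorithms times L^K parameter settings each."""
    return n_algorithms * l_values ** k_params


def enumerate_parameter_sets(algorithms, param_names, values):
    """Yield (algorithm, {param: value}) for every point of the full grid."""
    for algorithm in algorithms:
        for combo in product(values, repeat=len(param_names)):
            yield algorithm, dict(zip(param_names, combo))


# The slide's example: N=5, K=5, L=10.
total = count_combinations(5, 5, 10)

# A tiny grid (N=2, K=2, L=3) that is small enough to list in full.
small = list(enumerate_parameter_sets(["svm", "winnow"], ["p1", "p2"], [1, 2, 3]))

print(total)       # 500000
print(len(small))  # 18
```

Even this toy grid grows quickly: adding one more parameter with three values triples the number of models per algorithm.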
So, We Parallelize on Hadoop
  • Good news: Mahout, a parallel machine learning tool, is already available. Mallet, libsvm, Weka, … also support the algorithms we need.
  • Bad news: Mahout doesn’t support all the necessary algorithms yet, and the other tools do not run natively on Hadoop.
Therefore, We…
  • Build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.
  • On top of our platform, generate and test hundreds of thousands of models, and choose the best.
  • Use Pig for the Hadoop implementation.
Our Approach
  • Conventional: Training Data → Train (sequential)
  • Our approach: Training Data → Train Request → 1000s of Models (one for each param set) → Select → Best Model → Return
  • Algorithms: AdaBoost, SVM, Decision Tree, Bayesian, and many others
  • More algorithms (thus a better model), and faster parallel processing
What Parallelization?
  • Figure: several separate Training Task boxes, one per training task.
General Processing Flow
  • Training Docs → Preprocess → Vectorized Docs → Train → Model
  • Preprocess parameters: stopword use, n-gram size, stemming, etc.
  • Train parameters: algorithm and algorithm-specific parameters (e.g. for SVM: C, ε, and other kernel parameters)
Our Parallel Processing Flow
  • Figure: Training Docs → several sets of Vectorized Docs → several Models for each set of Vectorized Docs
Preprocessing on Hadoop
  • Training Data (label, then text):
        business    Investments are taxed as capital gains.....
        business    It was the overleveraged and underregulated banks …
        none        I am afraid we may be headed for …
        none        In the famous words of Homer Simpson, “it takes 2 to lie …”
        …
  • Preprocessing Request (a parameter set per line, tab-separated):
        279    68    ngram_stem_stopword    1    snowball    true
        279    68    ngram_stem_stopword    2    snowball    true
        279    68    ngram_stem_stopword    3    snowball    true
        279    68    ngram_stem_stopword    1    porter      true
        279    68    ngram_stem_stopword    2    porter      true
        279    68    ngram_stem_stopword    3    none        false
        …
  • Preprocessing on Hadoop (see next slide) turns the training data and the requests into Vector 1, Vector 2, …, Vector k.
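A request line like the ones above could be parsed as follows. This is a sketch under assumptions: the slide does not label the columns, so the field names here are hypothetical, and reading the last three fields as n-gram size, stemmer name, and stopword flag is a guess based on the values shown.

```python
from typing import NamedTuple


class PreprocessRequest(NamedTuple):
    # All field names are hypothetical; the slide does not label the columns.
    field_a: str        # meaning unknown (279 on the slide)
    field_b: str        # meaning unknown (68 on the slide)
    pipeline: str       # e.g. "ngram_stem_stopword"
    ngram_size: int     # assumed: 1, 2, or 3
    stemmer: str        # assumed: "snowball", "porter", or "none"
    use_stopwords: bool # assumed: "true" / "false"


def parse_request(line: str) -> PreprocessRequest:
    """Split one tab-separated request line into typed fields."""
    a, b, pipeline, n, stemmer, flag = line.rstrip("\n").split("\t")
    return PreprocessRequest(a, b, pipeline, int(n), stemmer, flag == "true")


lines = [
    "279\t68\tngram_stem_stopword\t1\tsnowball\ttrue",
    "279\t68\tngram_stem_stopword\t3\tnone\tfalse",
]
requests = [parse_request(line) for line in lines]
```

Keeping one parameter set per line means each line can be handed to a separate task without any shared state.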
Preprocessing on Hadoop: Big Picture
  • Pig script:
        par = LOAD param_file AS par1, par2, …;
        run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
        STORE run ..;
  • Through the UDF call, RunPreprocess() produces Vector 1, Vector 2, …, Vector k.
  • Preprocessors (pluggable pipes): Stemmer, Tokenizer, StopwordFilter, Vectorizer, FeatureSelector
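The "pluggable pipes" idea can be sketched as a chain of small functions that each take and return a token list. This is not the team's implementation: the stage names echo the slide, but every function body below is a toy stand-in written for this example (in particular, the stemmer is not Porter or Snowball).

```python
from collections import Counter
from typing import Callable, Iterable, List

# A pipe maps a token list to a token list, so pipes can be chained in any order.
Pipe = Callable[[List[str]], List[str]]


def tokenizer(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def stopword_filter(stopwords: Iterable[str]) -> Pipe:
    stop = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in stop]


def suffix_stemmer(tokens: List[str]) -> List[str]:
    # Toy stand-in for a real stemmer: strips a trailing "s" from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def ngrams(n: int) -> Pipe:
    return lambda tokens: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def run_preprocess(text: str, pipes: List[Pipe]) -> Counter:
    """Tokenize, apply each pipe in order, and return a bag-of-features vector."""
    tokens = tokenizer(text)
    for pipe in pipes:
        tokens = pipe(tokens)
    return Counter(tokens)


vector = run_preprocess(
    "Investments are taxed as capital gains",
    [stopword_filter({"are", "as"}), suffix_stemmer, ngrams(1)],
)
```

Swapping `ngrams(1)` for `ngrams(2)`, or dropping the stemmer, yields a different vector from the same text, which is what one preprocessing request per line amounts to.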
Training on Hadoop
  • Vectors: shown on the slide as long bit strings (figure).
  • Train Request (a parameter set per line, tab-separated):
        73    923    balanced_winnow    5    1    10    …
        73    923    balanced_winnow    5    2    10    …
        73    923    balanced_winnow    5    3    10    …
        73    923    balanced_winnow    5    1    20    …
        73    923    balanced_winnow    5    2    20    …
        73    923    balanced_winnow    5    3    20    …
        …
  • Training on Hadoop (see next slide) uses Mahout, Weka, Mallet or libsvm to turn the vectors and the requests into Model 1, Model 2, …, Model k.
Training on Hadoop: Big Picture
  • Pig script:
        par = LOAD param_file AS par1, par2, …;
        run = FOREACH par GENERATE RunTrainer(par1, par2, …);
        STORE run ..;
  • Through the UDF call, RunTrainer() produces Model 1, Model 2, …
  • Mallet: AdaBoost (M2), Bagging, Balanced Winnow, C45, Decision Tree, …
  • Mahout: Bayesian
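One way to picture a trainer that dispatches to different libraries is a registry keyed by the algorithm name in the request line. The sketch below is an assumption about the shape of such a dispatcher, not the deck's RunTrainer(): the registry, the request layout (algorithm name first), and both toy trainers are hypothetical and stand in for calls into Mallet, Mahout, Weka, or libsvm.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[List[float], int]  # (feature vector, label in {-1, +1})
Trainer = Callable[[List[Example], List[str]], object]

# Hypothetical registry: algorithm name -> training function.
TRAINERS: Dict[str, Trainer] = {}


def register(name: str):
    """Decorator that plugs a trainer into the registry under the given name."""
    def wrap(fn: Trainer) -> Trainer:
        TRAINERS[name] = fn
        return fn
    return wrap


@register("majority")
def train_majority(data, params):
    # Toy baseline: the "model" is simply the most frequent label.
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)


@register("perceptron")
def train_perceptron(data, params):
    # params[0] is the number of passes over the data.
    epochs = int(params[0])
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def run_trainer(request_line: str, data: List[Example]):
    """Look up the trainer named in the request and pass it the remaining fields."""
    algorithm, *params = request_line.split("\t")
    return TRAINERS[algorithm](data, params)


data = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]
weights = run_trainer("perceptron\t5", data)
baseline = run_trainer("majority", data)
```

Adding another algorithm then means registering one more function; the code that reads request lines does not change.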


Machine Learning with Hadoop

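Training on Hadoop: Big Picture (continued)
  • …, Model k
  • libsvm: SVM
Training on Hadoop: Trick #1
  • Each model can be generated independently, so this is an easy parallelization problem (aka ‘embarrassingly parallel’).
  • But how do we achieve parallelism with Pig? The straightforward script below calls the UDF in the map phase, so a small parameter file is presumably handled by very few map tasks:
        par = LOAD param_file AS par1, par2, …;
        run = FOREACH par GENERATE RunTrainer(par1, par2, …);
        STORE run ...;
  • Grouping by the parameters with a PARALLEL clause moves the UDF calls into the reduce phase and spreads them over 50 reduce tasks:
        par = LOAD param_file AS par1, par2, …;
        grp = GROUP par BY (par1, par2, …) PARALLEL 50;
        fltn = FOREACH grp GENERATE group.par1 AS par1, …;
        run = FOREACH fltn GENERATE RunTrainer(par1, …);
        STORE run …;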
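Because each model depends only on its own parameter set, the same pattern can be shown on a single machine with a worker pool. This is a sketch, not the Pig job: the thread pool merely plays the role of the reduce tasks, and `run_trainer` with its made-up scoring formula is a hypothetical stand-in for real training and evaluation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_trainer(params):
    """Hypothetical stand-in for RunTrainer(): returns (params, score)."""
    c, threshold = params
    # Made-up score that peaks at c=3, threshold=20.
    score = -((c - 3) ** 2) - (threshold - 20) ** 2 / 100.0
    return params, score


# One entry per parameter set, like one line per request in the parameter file.
param_sets = list(product([1, 2, 3, 4, 5], [10, 20, 30]))

# No task needs another task's result, so they can run in any order on any worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_trainer, param_sets))

# Choose the best model once all of them are built.
best_params, best_score = max(results, key=lambda r: r[1])
```

For CPU-bound training in pure Python a process pool would be the more realistic choice; threads are used here only to keep the example short and portable.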
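Training on Hadoop: Trick #2
  • We call ML functions from the UDF.
  • Some functions can take too long to return, and Hadoop will kill the task if they do.
  • So, inside RunTrainer(), the main thread runs the training while a separate “Pig Heartbeat” thread presumably keeps reporting progress so that the task is not timed out.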
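The two-thread arrangement can be sketched in plain Python: the calling thread does the slow work while a background thread periodically invokes a progress callback. This is an illustration of the pattern only. `report_progress` is a hypothetical callback; in a real Pig UDF it would be whatever progress-reporting hook the framework provides.

```python
import threading
import time


def run_with_heartbeat(task, report_progress, interval=0.05):
    """Run task() on the calling thread while a daemon thread calls report_progress()."""
    done = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout, which means the task is still running.
        while not done.wait(interval):
            report_progress()

    beat = threading.Thread(target=heartbeat, daemon=True)
    beat.start()
    try:
        return task()
    finally:
        # Stop the heartbeat whether the task succeeded or raised.
        done.set()
        beat.join()


beats = []


def slow_training():
    time.sleep(0.3)  # stands in for a long-running library call
    return "model"


model = run_with_heartbeat(slow_training, lambda: beats.append(time.time()))
```

Stopping the heartbeat in a `finally` block matters: if training fails, the task should stop claiming progress rather than appear alive indefinitely.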
As a Result, We Now See…
  • We are now able to build tens of thousands of models within an hour and choose the best. Previously, the same task took us days.
  • Because we can generate more models more frequently, we adapt more quickly to the fast-changing Internet community, catching up with newly coined terms, etc.
Useful Resources
  • Mahout: http://mahout.apache.org/
  • Mallet: http://mallet.cs.umass.edu/
  • Weka: http://www.cs.waikato.ac.nz/ml/weka/
  • libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
  • OpenNLP: http://incubator.apache.org/opennlp/
  • Pig: http://pig.apache.org/