© Hortonworks Inc. 2011–2016. All Rights Reserved
Revolutionize Text Mining with Spark and Zeppelin
April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks
Agenda
Text mining workflow on Big Data
Text mining with Spark and MLlib
Spark and Zeppelin as the platform
Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising
Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R
Text Mining on Big Data
Data Scientists
Software engineers
Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross validation
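To get a feel for the scale quoted for the ALS example, the figures can be plugged into a quick back-of-the-envelope calculation. This is only a sketch: it treats the lower bounds from the slide (50 million users, 30 million songs, 50 billion ratings, rank 10) as exact values, which is an assumption.

```python
# Back-of-the-envelope numbers for the ALS example.
# Assumption: the "50+" and "30+" lower bounds are treated as exact values.
users = 50_000_000
songs = 30_000_000
ratings = 50_000_000_000
rank = 10

possible_cells = users * songs        # size of the full user x song matrix
density = ratings / possible_cells    # fraction of cells that carry a rating

# ALS learns one rank-sized vector per user and one per song.
factor_values = (users + songs) * rank

print(f"possible cells: {possible_cells:.2e}")
print(f"density:        {density:.2e}")
print(f"factor values:  {factor_values:,}")
```

Under these assumptions only a tiny fraction of the user x song matrix is observed, and the learned factors are far smaller than the rating data itself.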
Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Data Science
Software engineering
Load data
Text                     Label
I bought the game…       4
Do NOT bother try…       1
This shirt is awesome…   5
never got it. Seller…    1
I ordered this to…       3
Dataset → Feature engineering → Model training → Model evaluation
Extract features
Text                     Label  Words                 Features
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]
Dataset → Feature engineering → Model training → Model evaluation
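The step from raw text to a Words column and then to a numeric Features column can be sketched in plain Python. This is an illustration of the idea only, not the Spark API: the tokenizing pattern, the alphabetical vocabulary order, and the three sample reviews are assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on runs of non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_vocabulary(token_lists):
    """Assign an index to every distinct token, in alphabetical order."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: i for i, token in enumerate(vocab)}

def count_vector(tokens, vocab):
    """Return a dense list of term counts aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocab]

# Hypothetical reviews, written out in full for the example.
docs = ["I bought the game", "This shirt is awesome", "The game is the best"]
words = [tokenize(d) for d in docs]
vocab = build_vocabulary(words)
features = [count_vector(w, vocab) for w in words]

for w, f in zip(words, features):
    print(w, f)
```

Each position in a feature vector corresponds to one vocabulary word, so every document ends up with a vector of the same length regardless of how long the text is.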
Fit a model
Text                     Label  Words                 Features           Probability  Prediction
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]    0.84
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]   0.62
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]    0.95
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]    0.71
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]    0.74
Dataset → Feature engineering → Model training → Model evaluation
Evaluate
Text                     Label  Words                 Features           Probability  Prediction
I bought the game…       4      “i”, “bought”, …      [1, 0, 3, 9, …]    0.84
Do NOT bother try…       1      “do”, “not”, …        [0, 0, 11, 0, …]   0.62
This shirt is awesome…   5      “this”, “shirt”, …    [0, 2, 3, 1, …]    0.95
never got it. Seller…    1      “never”, “got”, …     [1, 2, 0, 0, …]    0.71
I ordered this to…       3      “i”, “ordered”, …     [1, 0, 0, 3, …]    0.74
Dataset → Feature engineering → Model training → Model evaluation
Key abstractions of the Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).
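The relationship between the three abstractions can be sketched in plain Python. The classes below are hypothetical stand-ins written for this illustration, not Spark classes: a Transformer maps a dataset to a dataset, an Estimator's `fit` returns a model that is itself a Transformer, and an Evaluator reduces scored data to one number.

```python
from collections import Counter

class Lowercaser:
    """Transformer: maps a dataset to a new dataset, no training needed."""
    def transform(self, rows):
        return [{**row, "text": row["text"].lower()} for row in rows]

class MajorityLabelModel:
    """Trained model: also a Transformer, it appends a prediction column."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [{**row, "prediction": self.label} for row in rows]

class MajorityLabel:
    """Estimator: fit() learns from the data and returns a model."""
    def fit(self, rows):
        most_common = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityLabelModel(most_common)

class AccuracyEvaluator:
    """Evaluator: reduces a dataset with predictions to a single metric."""
    def evaluate(self, rows):
        hits = sum(1 for r in rows if r["prediction"] == r["label"])
        return hits / len(rows)

# Hypothetical toy dataset.
rows = [
    {"text": "Great", "label": 1},
    {"text": "Awful", "label": 0},
    {"text": "Loved it", "label": 1},
    {"text": "Superb", "label": 1},
]
model = MajorityLabel().fit(Lowercaser().transform(rows))
scored = model.transform(rows)
accuracy = AccuracyEvaluator().evaluate(scored)
print(accuracy)
```

Because a fitted model exposes the same `transform` method as a feature transformer, stages of both kinds can be chained in one pipeline.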
Spark’s Text Mining algorithms
• LDA for topic modeling
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word counts
• HashingTF and IDF together weight the words of a document by their importance with respect to the corpus
• And much more
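The hashing and IDF idea can be sketched in a few lines of plain Python. This is not Spark's implementation: `zlib.crc32` is a stand-in hash chosen because it is deterministic, the 16-bucket feature space is deliberately tiny, and the sample documents are made up. The smoothed formula log((m + 1) / (df + 1)) is the one described in the MLlib documentation for IDF.

```python
import math
import zlib

NUM_FEATURES = 16  # assumption: tiny on purpose; real feature spaces are far larger

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map tokens to term counts by hashing each token to a bucket index."""
    vec = [0] * num_features
    for token in tokens:
        # crc32 is a stand-in hash for this sketch, chosen for determinism.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        vec[bucket] += 1
    return vec

def idf_weights(tf_vectors):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(tf_vectors)
    weights = []
    for j in range(len(tf_vectors[0])):
        df = sum(1 for vec in tf_vectors if vec[j] > 0)
        weights.append(math.log((m + 1) / (df + 1)))
    return weights

def tf_idf(tf_vectors):
    """Scale every term count by the IDF weight of its bucket."""
    weights = idf_weights(tf_vectors)
    return [[tf * w for tf, w in zip(vec, weights)] for vec in tf_vectors]

# Hypothetical tokenized documents.
docs = [
    ["spark", "is", "fast"],
    ["zeppelin", "is", "a", "notebook"],
    ["spark", "and", "zeppelin"],
]
tf = [hashing_tf(d) for d in docs]
features = tf_idf(tf)
print(features[0])
```

No vocabulary has to be built or stored, which is the appeal of hashing; the trade-off is that distinct words can collide in the same bucket.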
MLlib Text Mining Pipeline - classification
Stages: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer
Models: NaiveBayes, LogisticRegression, SVM, MLP
Output: text classification
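Among the stages listed for classification, StringIndexer is the one that prepares the label column: classifiers expect numeric labels, so category names are mapped to indices, with the most frequent label getting index 0 by default. A minimal plain-Python sketch of that mapping follows; the alphabetical tie-break and the sample categories are assumptions of this sketch.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Return a label -> index mapping, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical document categories.
categories = ["sports", "tech", "sports", "politics", "tech", "sports"]
mapping = fit_string_indexer(categories)
indexed = [mapping[c] for c in categories]
print(mapping)
print(indexed)
```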
MLlib Text Mining Pipeline - topic model
Stages: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF
Model: LDA
Output: topic model
MLlib Text Mining Pipeline - recommendation
Stages: Dataset, RegexTokenizer
Model: Word2Vec
Output: recommendation
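One way word vectors can feed a recommendation is to average the vectors of the words in each item description and rank items by cosine similarity to a query. The slides do not say how the recommendation step is built, so the sketch below is an assumption throughout: the three-dimensional vectors, the catalog, and the function names are all hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
WORD_VECTORS = {
    "guitar": [0.9, 0.1, 0.0],
    "strings": [0.8, 0.2, 0.1],
    "piano": [0.7, 0.3, 0.0],
    "laptop": [0.0, 0.9, 0.4],
    "charger": [0.1, 0.8, 0.5],
}

def document_vector(tokens):
    """Average the vectors of the known words in a document."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(known) for col in zip(*known)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recommend(query_tokens, catalog):
    """Return catalog item names ordered by similarity to the query."""
    q = document_vector(query_tokens)
    return sorted(
        catalog,
        key=lambda name: cosine(q, document_vector(catalog[name])),
        reverse=True,
    )

# Hypothetical catalog of tokenized item descriptions.
catalog = {
    "music item": ["piano", "strings"],
    "computer item": ["laptop", "charger"],
}
ranking = recommend(["guitar"], catalog)
print(ranking)
```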
MLlib Text Mining Pipeline
Stages: Dataset, RegexTokenizer, StopWordsRemover, CountVectorizer, HashingTF, IDF, StringIndexer
Models: NaiveBayes, LogisticRegression, SVM, MLP, LDA, Word2Vec
Outputs: text classification, topic model, recommendation
Demo
• Load the file contents and the categories
• Extract feature vectors suitable for machine learning
• Train a linear model to perform categorization
• Use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017
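The grid search step can be illustrated with a small plain-Python sketch that tunes one feature-extraction parameter and one classifier parameter together, scoring each combination with k-fold cross validation. It is not the demo's code: the toy spam data, the keyword-count feature, the threshold learner, and the parameter names `lowercase` and `margin` are all assumptions made for the example.

```python
from itertools import product

# Hypothetical toy data: (text, label), where label 1 means spam.
DATA = [
    ("FREE prize FREE", 1),
    ("meeting at noon", 0),
    ("free free offer", 1),
    ("lunch is free today", 0),
    ("Free FREE gift", 1),
    ("see you tomorrow", 0),
]

def extract(text, lowercase):
    """Feature extraction step: count occurrences of the word 'free'."""
    if lowercase:
        text = text.lower()
    return text.split().count("free")

def mean(values):
    return sum(values) / len(values) if values else 0.0

def fit(rows, lowercase, margin):
    """Learn a threshold halfway between the class means, shifted by margin."""
    pos = [extract(t, lowercase) for t, y in rows if y == 1]
    neg = [extract(t, lowercase) for t, y in rows if y == 0]
    threshold = (mean(pos) + mean(neg)) / 2 + margin
    return lambda text: 1 if extract(text, lowercase) > threshold else 0

def cross_val_accuracy(rows, k, **params):
    """Average held-out accuracy over k contiguous folds."""
    fold = len(rows) // k
    scores = []
    for i in range(k):
        test = rows[i * fold:(i + 1) * fold]
        train = rows[:i * fold] + rows[(i + 1) * fold:]
        model = fit(train, **params)
        scores.append(mean([1.0 if model(t) == y else 0.0 for t, y in test]))
    return mean(scores)

# The grid spans both the feature extraction and the classifier settings.
grid = {"lowercase": [False, True], "margin": [0.0, 0.5]}
best_score, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = cross_val_accuracy(DATA, k=3, **params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

Searching both kinds of parameter in one grid matters because the best classifier setting can depend on how the features were extracted.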
Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized Transformers & Models
Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator
Spark virtual environment
[Figure: Data Scientist A and Data Scientist B; environments labeled Python 2.7 and Python 3.5]
Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Data Science
Software engineering
Duplicated and error-prone
ML persistence
• Prototype (Python/R)
• Create Pipeline
• Persist model or Pipeline:
  – model.save("s3n://…")
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
Data Science
Software engineering
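The save-then-load handoff can be sketched in plain Python. This is not Spark's persistence format: `KeywordModel` is a hypothetical stand-in for a fitted model, JSON is an assumed storage format, and a temporary local file stands in for the shared storage location that the slide elides.

```python
import json
import os
import tempfile

class KeywordModel:
    """Hypothetical stand-in for a fitted model: a keyword and a threshold."""
    def __init__(self, keyword, threshold):
        self.keyword = keyword
        self.threshold = threshold

    def predict(self, text):
        count = text.lower().split().count(self.keyword)
        return 1 if count >= self.threshold else 0

    def save(self, path):
        """Write the fitted parameters to disk in a language-neutral format."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"keyword": self.keyword, "threshold": self.threshold}, f)

    @classmethod
    def load(cls, path):
        """Rebuild the model from the saved parameters, with no retraining."""
        with open(path, encoding="utf-8") as f:
            params = json.load(f)
        return cls(params["keyword"], params["threshold"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")  # local path standing in for shared storage
    KeywordModel("free", 2).save(path)       # data science side
    restored = KeywordModel.load(path)       # software engineering side

print(restored.predict("free free offer"))
```

Because only fitted parameters cross the boundary, the production side does not need to re-implement the training logic, which is the duplication the workflow slide calls error-prone.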
Data scientists work with software engineers
Data Scientists
• Explore data
• Create pipeline
• Find best params
• Save model
Software engineers
• Load model
• Deploy in production
• Scoring on batch/streaming data
