Machine learning by example

Michał Matłoka @mmatloka

“Can machines think?”
– Alan Turing (1950)

Turing test
http://www.smbc-comics.com/?id=2999

What is ML?
• Make machines “think” like humans
• Learn from data and make predictions

Learning types
• Supervised learning
◦ Classification
◦ Regression
• Unsupervised learning
◦ Clustering
◦ Dimensionality reduction
• Semi-supervised learning
• Reinforcement learning
◦ E.g. AlphaGo

UNSUPERVISED - DIMENSIONALITY REDUCTION

REINFORCEMENT LEARNING - ALPHAGO
https://cdn.arstechnica.net/wp-content/uploads/sites/3/2017/05/GettyImages-688097364-800x549.jpg

Use cases
• Voice recognition
• Fraud analysis
• Face detection
• Ad click-through rate prediction
• Spam detection
• Shop recommendations
• Photo description
• Self-driving cars
• Healthcare
• ...

Learning process
1. Data gathering
2. Data cleaning & feature engineering
3. Dataset -> training & test set
4. Learning -> Model
5. Evaluation -> Accuracy
6. New observation -> Prediction
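Steps 3 and 5 can be sketched in plain Scala without Spark. This is only an illustration: the object name LearningProcessSketch, the function names, the seed and the 80/20 split are assumptions made for this example and do not come from the talk's code.

```scala
import scala.util.Random

// Hypothetical helper names; not taken from the talk's repository.
object LearningProcessSketch {

  // Step 3: shuffle with a fixed seed, then split into (training, test) sets.
  def trainTestSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    require(trainFraction > 0.0 && trainFraction < 1.0, "fraction must be in (0, 1)")
    val shuffled = new Random(seed).shuffle(data)
    val trainSize = (shuffled.size * trainFraction).toInt
    shuffled.splitAt(trainSize)
  }

  // Step 5: accuracy = number of correct predictions / number of predictions.
  def accuracy[L](predicted: Seq[L], actual: Seq[L]): Double = {
    require(predicted.size == actual.size && actual.nonEmpty, "need equally sized, non-empty inputs")
    predicted.zip(actual).count { case (p, a) => p == a }.toDouble / actual.size
  }
}
```

For example, LearningProcessSketch.trainTestSplit(1 to 10, 0.8, 42L) yields 8 training and 2 test elements, and three correct predictions out of four give an accuracy of 0.75.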

Classify conference talk abstracts into tracks

• RDD (Resilient Distributed Dataset) - map, filter, count, etc. - and DataFrame
• Spark SQL
• MLlib
• GraphX
• Spark Streaming
• API: Scala, Java, Python, R*

Code.
https://github.com/mmatloka/machine-learning-by-example

What can be improved?
• Bigger data set
• Smarter tokenizer
• Stemming & lemmatization
• IDF - Inverse Document Frequency
• K-fold Cross-Validation
• Parameter tuning (grid search)

What can be improved? (Tokenizer)
“I will present Big Data topics”
-> “I”, “will”, “present”, “Big”, “Data”, “topics”
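A minimal tokenizer sketch in plain Scala, assuming that a "smarter" tokenizer should at least ignore punctuation. The object name TokenizerSketch and the splitting rule are assumptions for illustration, not the tokenizer used in the talk.

```scala
// Hypothetical helper; splits on anything that is not a letter or a digit.
object TokenizerSketch {

  def tokenize(text: String): Seq[String] =
    text
      .split("[^\\p{L}\\p{N}]+") // one or more non-letter, non-digit characters act as a separator
      .filter(_.nonEmpty)        // a leading separator would otherwise produce an empty token
      .toSeq
}
```

TokenizerSketch.tokenize("I will present Big Data topics") returns the six tokens shown on the slide. Punctuation as in "Big-Data, topics!" is dropped rather than kept inside the tokens.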

What can be improved? (n-grams)
“I will present Big Data topics”
-> “I will”, “will present”, “present Big”, “Big Data”, “Data topics”
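The bigrams above can be produced with a sliding window over the token list. A minimal sketch in plain Scala; the object name NGramSketch is an assumption for illustration.

```scala
// Hypothetical helper; builds word n-grams from an already tokenized sentence.
object NGramSketch {

  def ngrams(tokens: Seq[String], n: Int): Seq[String] = {
    require(n > 0, "n must be positive")
    // sliding(n) would return one shorter group for inputs with fewer than n tokens, so guard against it.
    if (tokens.size < n) Seq.empty
    else tokens.sliding(n).map(_.mkString(" ")).toList
  }
}
```

With n = 2 and the tokens “I”, “will”, “present”, “Big”, “Data”, “topics” this yields the five bigrams listed on the slide.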

What can be improved? (Porter Stemmer)
“communities”, “community”
-> “commun”

What can be improved? (Lemmatization)
“communities”, “community”
-> “community”

What can be improved? (TF & IDF)
TF - Term Frequency (HashingTF!) - number of times a term occurs in a given document
IDF - Inverse Document Frequency - lowers the weight of terms that occur in many documents of the set
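A small plain-Scala sketch of both quantities. It keeps exact terms in a Map instead of hashing them into a fixed-size vector as HashingTF does, and it assumes the smoothed formula idf = log((N + 1) / (df + 1)), where N is the number of documents and df the number of documents containing the term. The object and function names are assumptions for illustration.

```scala
// Hypothetical helper; documents are given as already tokenized sequences of terms.
object TfIdfSketch {

  // TF: how many times each term occurs in one document.
  def termFrequency(doc: Seq[String]): Map[String, Int] =
    doc.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

  // IDF: log((N + 1) / (df + 1)); terms present in every document get weight 0.
  def inverseDocumentFrequency(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size
    val docFreq = docs
      .flatMap(_.distinct) // count each term at most once per document
      .groupBy(identity)
      .map { case (term, hits) => term -> hits.size }
    docFreq.map { case (term, df) => term -> math.log((n + 1.0) / (df + 1.0)) }
  }

  // TF-IDF weight of every term in one document; unknown terms get weight 0.
  def tfIdf(doc: Seq[String], idf: Map[String, Double]): Map[String, Double] =
    termFrequency(doc).map { case (term, tf) => term -> tf * idf.getOrElse(term, 0.0) }
}
```

In a set of three documents where “big” appears in all of them and “akka” in only one, “big” gets an IDF of 0 and “akka” gets log(2), so the rare term dominates the weighted vector.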

What can be improved? (K-fold Cross-Validation)
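In k-fold cross-validation the data is divided into k folds; each fold serves once as the validation set while the remaining k - 1 folds are used for training, and the k scores are averaged. A plain-Scala sketch of the splitting step only; the name KFoldSketch and the round-robin assignment of elements to folds are assumptions for illustration.

```scala
// Hypothetical helper; returns k (training, validation) pairs.
object KFoldSketch {

  def kFold[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2 && k <= data.size, "k must be between 2 and the data size")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      // Element i belongs to validation fold (i % k); everything else is training data.
      val (validation, training) = indexed.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }
}
```

In practice the data would usually be shuffled before splitting, so that the folds do not mirror the original ordering.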

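What can be improved? (Grid search)

import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// hashingTF, pipeline, evaluator and trainData are assumed to be defined elsewhere.
// Candidate feature-vector sizes for HashingTF; each value is tried in every fold.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(256, 512, 1024, 2048, 4096))
  .build()

// 5-fold cross-validation over the whole pipeline.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

// Fits every parameter combination and keeps the best model according to the evaluator.
val cvModel = cv.fit(trainData)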

Articles & references
• https://www.csee.umbc.edu/courses/471/papers/turing.pdf
• http://spark.apache.org/
• https://databricks.com/try-databricks
• https://research.googleblog.com/2016/09/introducing-open-images-dataset.html
• https://arstechnica.co.uk/information-technology/2017/05/deepmind-alphago-go-ke-jie-china/
• https://www.kaggle.com
• https://github.com/dylanmei/docker-zeppelin
• https://github.com/databricks/spark-corenlp
• https://www.coursera.org/learn/scala-spark-big-data

Thank you, Q&A?
@mmatloka
mmatloka
softwaremill.com/blog
