Dat Tran - Head of Data Science
Dat Tran (Head of Data)
@datitran
Demystifying the Buzz in Machine Learning! (This time for real)
23/11/2018 - Data Natives 2018 Berlin
#idealoTech
echo $(whoami)
What do we do at idealo? Some examples...
● Hotel image ranking for both aesthetic and technical quality
● Low-to-high resolution
● Recommendation engine
Check us out! #idealoTech
https://github.com/idealo
https://medium.com/idealo-tech-blog
Let me start with the Gartner Hype Cycle…
...Why? Because they’re “always” right
Machine Learning
Guidelines for successful and realistic data projects
1. Think simple first and then, if it’s really needed, get more complex
Minimum Viable Model
Not like this…
Like this!
Sales Prediction
Problem Statement
● For over 50% of the lead-outs, we don’t know whether users bought or not
● We know it for Amazon & eBay, but with a 2-day lag; another problem is direct vs. indirect sales
● Predicting sales is valuable, for example for CRM, the recommendation engine and many other use cases
Supervised Learning
Samples:
price: 80, pis: 5, ... → sale
price: 5, pis: 1, ... → non-sale
price: 17, pis: 3, ... → sale
ML model training
Predictions:
price: 99, pis: 8, ... → non-sale
price: 65, pis: 2, ... → sale (82%)
price: 32, pis: 9, ... → sale (30%)
price: 40, pis: 5, ... → sale (50%)
price: 20, pis: 2, ... → sale (71%)
Deep Learning????
Interpretation of your models matters!
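A "minimum viable model" for the sale / non-sale problem could be as small as a logistic regression, whose weights can be read directly. The sketch below is only an illustration under assumptions: the first three rows reuse the labeled samples from the slide, the remaining rows and all hyperparameters are made up, and the talk does not say which model idealo actually used.

```python
import math

FEATURES = ("price", "pis")
# First three rows are the labeled samples from the slide; the rest are made up.
rows = [(80, 5), (5, 1), (17, 3), (99, 8), (60, 2), (30, 6)]
labels = [1, 0, 1, 0, 1, 0]  # 1 = sale, 0 = non-sale


def standardize(data):
    """Scale every feature column to zero mean and unit variance."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]
    return scaled, means, stds


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, w, b):
    """Predicted probability of a sale for one standardized row."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def log_loss(X, y, w, b):
    """Mean binary cross-entropy of the model on (X, y)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(y)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Plain batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(y)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


X, means, stds = standardize(rows)
weights, bias = fit_logistic(X, labels)
for name, weight in zip(FEATURES, weights):
    # Sign and size of each weight show how a feature pushes the prediction.
    print(f"{name}: weight {weight:+.3f}")
```

Because the features are standardized, the weights are comparable with each other, which is the kind of interpretation a deep network would not give for free.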
2. Define your data product MVP and release as early as possible
MVP for Recommendations
Not like this…
Like this!
Classifying Hotel Aesthetics Photos
Problem Statement
● 2,306,658 accommodations
● 308,519,299 images
● ~ 133 images per accommodation
Humans?
Deep Learning??
How to start a Deep Learning project
1. Computer Vision: ImageNet, AlexNet
2. NLP: Language models (still immature)
Automate Image Quality Assessment
To automate the image quality assessment we trained:
● Aesthetic model → Predicts aesthetic score of an image
● Technical model → Predicts the technical image quality (distortion, blur, etc.)
We followed the Google paper “NIMA: Neural Image Assessment” published 09/2017
Results - First Iteration
Aesthetic model - MobileNet
Linear correlation coefficient (LCC): 0.5987
Spearman's rank correlation coefficient (SRCC): 0.6072
Earth Mover's Distance: 0.2018
Accuracy (threshold at 5): 0.74
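For readers who want to see how metrics of this kind are computed, here is a rough sketch in plain Python. It assumes the NIMA-style setup in which the model predicts a distribution over the score buckets 1..10, compares mean scores for LCC and SRCC, and measures the Earth Mover's Distance as a normalized r-norm distance between cumulative distributions. The value of r, the strict ">" comparison at the threshold, and all function names are assumptions; the slide does not state how its numbers were produced.

```python
import math


def pearson(xs, ys):
    """Linear correlation coefficient (LCC)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def mean_score(dist):
    """Expected score of a distribution over the buckets 1..len(dist)."""
    return sum(score * p for score, p in enumerate(dist, start=1))


def earth_movers_distance(p, q, r=2):
    """Normalized r-norm distance between the two cumulative distributions."""
    cdf_p = cdf_q = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        total += abs(cdf_p - cdf_q) ** r
    return (total / len(p)) ** (1 / r)


def accuracy_at_threshold(true_means, pred_means, threshold=5.0):
    """Share of images where truth and prediction fall on the same side."""
    hits = sum((t > threshold) == (p > threshold)
               for t, p in zip(true_means, pred_means))
    return hits / len(true_means)
```

Each function takes plain lists, so the same code can be run on the output of any model that returns one score distribution per image.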
Example - First Iteration
Aesthetic model - MobileNet
Learnings
● The first results were not good, but we only learned that because we released early
○ More domain-specific data is needed
● We could load-test our applications, which was very valuable
○ We used MobileNet instead of VGG-16
Second Iteration
● We built a simple labeling application
● ~ 12 people from idealo Reise and Data Science labeled:
○ 1000 hotel images for aesthetics
○ 3000 hotel images for technical quality
● We fine-tuned the aesthetic model with 800 training images
● We built an aesthetic test dataset with 200 images
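The 800 / 200 split of the 1000 labeled aesthetic images can be made repeatable with a fixed seed. This is a minimal sketch only: the image IDs, the seed, and the function name are hypothetical, and the slides do not describe how the split was actually drawn.

```python
import random


def split_train_test(items, n_test, seed=42):
    """Shuffle a copy of items with a fixed seed and cut off n_test for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical IDs standing in for the 1000 labeled hotel images.
image_ids = [f"hotel_image_{i:04d}" for i in range(1000)]
train_ids, test_ids = split_train_test(image_ids, n_test=200)
print(len(train_ids), len(test_ids))
```

Keeping the test images fixed across iterations is what makes the first and second iteration comparable on the same 200 images.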
Example - Second Iteration
Aesthetic model - MobileNet
3. Creating data products is a team sport
UX/UI + Frontend engineer
Backend engineer
Data Scientist + Data Engineer
Google’s Smart Reply Feature
Apple’s Smart Photo Search Feature
4. Use the right tool for the right problem
This is our tech stack... only an extract ;)
PyData
Deep Learning
Big Data
Computer Vision
NLP
Production
Machine Learning
Visualization
Data Preparation
5. Use the cloud
Minimum Viable Platform
Not like this…
Like this!
Use the cloud!
6. Measure your model and improve it from time to time
Hotel Image Tagging Pipeline Day 1
Bedroom
Bedroom
Bedroom
Hotel Image Tagging Pipeline Day 2
Bedroom
Bedroom
Reception???
Learnings
● Data changes constantly, so monitor your model performance on a regular basis
● A re-training pipeline is also important
● Don’t do it manually; use appropriate tools for this, e.g. Apache Airflow
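One cheap way to notice that "Day 2" looks different from "Day 1" is to compare the distribution of predicted tags between two days. The sketch below does this with the total variation distance. Note that it watches prediction drift as a proxy rather than true model performance, which would need labels; the threshold and function names are assumptions. In practice a check like this could be one scheduled task in a tool such as Apache Airflow.

```python
from collections import Counter


def label_distribution(labels):
    """Relative frequency of each predicted tag."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over tags."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drift_detected(reference_labels, current_labels, threshold=0.2):
    """Return (alert, distance); alert is True when the distance exceeds threshold."""
    distance = total_variation_distance(
        label_distribution(reference_labels),
        label_distribution(current_labels),
    )
    return distance > threshold, distance


# Tags taken from the two pipeline slides.
day1 = ["Bedroom", "Bedroom", "Bedroom"]
day2 = ["Bedroom", "Bedroom", "Reception"]
alert, distance = drift_detected(day1, day2)
print(f"drift={alert}, distance={distance:.3f}")
```

An alert would not retrain anything by itself; it is a signal to look at the data and, if needed, trigger the re-training pipeline.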
7. Your results need to be reproducible
Data Science Product Life Cycle
Feature Engineering
Modeling
Evaluation
Operationalization
Feedback
Data Review
API Design
Problem Definition
Learnings
● Use git
● Dockerize (i.e. containerize) everything
● Use conda and/or pip for package management
● Automatic pipeline management (testing, data)
● TDD & API First strategy (everything as a Microservice)
● Don’t use Jupyter notebooks for production systems
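At the code level, reproducibility often starts with two habits: seed every source of randomness and record which configuration produced which result. The sketch below shows one possible shape for that, under stated assumptions: the config keys, the fake "training" step, and the manifest fields are all hypothetical and not taken from the talk.

```python
import hashlib
import json
import platform
import random


def config_fingerprint(config):
    """Stable SHA-256 hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_experiment(config):
    """Run a seeded stand-in for training and return a small run manifest."""
    rng = random.Random(config["seed"])
    # Placeholder for real training: the mean of seeded pseudo-random numbers.
    n = config["n_samples"]
    metric = sum(rng.random() for _ in range(n)) / n
    return {
        "config_sha256": config_fingerprint(config),
        "python_version": platform.python_version(),
        "metric": metric,
    }


config = {"seed": 7, "n_samples": 100, "model": "mobilenet"}
manifest = run_experiment(config)
print(json.dumps(manifest, indent=2))
```

Stored next to the git commit and the container image tag, a manifest like this makes it possible to say exactly which code, environment and settings produced a given number.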
8. Prioritize the projects with the biggest business impact
2 x 2 Business Impact vs. Technical Feasibility
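The 2 x 2 can also be kept as a small script so that the ranking is explicit and easy to revisit. Everything in this sketch is hypothetical: the project names, the 0..1 scores, the cutoff, and the quadrant labels are placeholders, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    name: str
    impact: float       # estimated business impact, 0..1 (made-up scale)
    feasibility: float  # estimated technical feasibility, 0..1 (made-up scale)


def quadrant(project, cutoff=0.5):
    """Place a project in one of the four cells of the 2 x 2."""
    high_impact = project.impact >= cutoff
    feasible = project.feasibility >= cutoff
    if high_impact and feasible:
        return "high impact / feasible"
    if high_impact:
        return "high impact / hard"
    if feasible:
        return "low impact / feasible"
    return "low impact / hard"


def prioritize(projects):
    """Sort by business impact first, then by feasibility, both descending."""
    return sorted(projects, key=lambda p: (p.impact, p.feasibility), reverse=True)


# Placeholder projects with invented scores.
backlog = [
    Project("project_a", impact=0.3, feasibility=0.9),
    Project("project_b", impact=0.8, feasibility=0.7),
    Project("project_c", impact=0.8, feasibility=0.2),
]
for project in prioritize(backlog):
    print(project.name, "->", quadrant(project))
```

Sorting on impact before feasibility mirrors guideline 8: feasibility only breaks ties between projects of equal business impact.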
Summary
1. Think simple first and then, if it’s really needed, get more complex
2. Define your data product MVP and release as early as possible
3. Creating data products is a team sport
4. Use the right tool for the right problem
5. Use the cloud
6. Measure your model and improve it from time to time
7. Your results need to be reproducible
8. Prioritize the projects with the biggest business impact
Questions?
Url: www.dat-tran.com
Twitter: @datitran
