1st edition
November 4-5, 2018
Machine Learning School in Doha
BigML, Inc X
Real-world Use Case
House Recommender
BigML, Inc X· @bigmlcom · @QatarComputing · #MLSD18 ·
Let’s build a recommender
Typical way to shop for a home…
Recommender Idea
[Diagram labels: All Homes For Sale, Sample, Preference Data, Preference Model]
… then use the Preference Model to
filter all the homes on the market
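The slide does not say what kind of model the Preference Model is. As a rough illustration of the idea only, the sketch below learns from a few liked and disliked sample homes and then filters the remaining homes. The nearest-centroid rule, the field names and all data are assumptions chosen for brevity.

```python
def scale(rows, fields):
    """Min-max scale the chosen fields to [0, 1] so no single field dominates."""
    lo = {f: min(r[f] for r in rows) for f in fields}
    hi = {f: max(r[f] for r in rows) for f in fields}
    return {
        r["id"]: [(r[f] - lo[f]) / ((hi[f] - lo[f]) or 1) for f in fields]
        for r in rows
    }


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def recommend(all_homes, answers, fields):
    """Ids of unrated homes closer to the liked centroid than to the disliked one.

    `answers` maps home id -> True (liked) / False (disliked) for the sample
    and must contain at least one of each.
    """
    vectors = scale(all_homes, fields)
    liked = centroid([vectors[i] for i, ok in answers.items() if ok])
    disliked = centroid([vectors[i] for i, ok in answers.items() if not ok])
    return [
        home_id for home_id, v in vectors.items()
        if home_id not in answers and distance(v, liked) < distance(v, disliked)
    ]


# Hypothetical homes and sample answers (the user liked the small home).
all_homes = [
    {"id": "h1", "sqft": 1000, "beds": 2},
    {"id": "h2", "sqft": 1200, "beds": 2},
    {"id": "h3", "sqft": 3000, "beds": 5},
    {"id": "h4", "sqft": 3200, "beds": 5},
    {"id": "h5", "sqft": 1100, "beds": 2},
    {"id": "h6", "sqft": 3100, "beds": 4},
]
answers = {"h1": True, "h3": False}
suggestions = recommend(all_homes, answers, ["sqft", "beds"])
```

Any classifier trained on the Preference Data could take the place of the centroid rule; the filtering step stays the same.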
Recommender Problem #1
What if there are really unusual homes in the data?
• A mansion with 20 bathrooms
• A home with no bedrooms
• A lot that is smaller than the home
We don’t want to show these as suggestions
because they are unusual. How do we detect
anomalies?
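One simple way to surface homes like these is to score every listing by how far each field sits from its typical value. The sketch below uses a robust z-score (median and median absolute deviation) in plain Python. It is a stand-in for illustration, not the Anomaly Detector used in the demo, and the field names, sample rows and threshold are all assumptions.

```python
from statistics import median

# Hypothetical listings; field names and values are illustrative only.
homes = [
    {"id": "A", "beds": 3, "baths": 2, "sqft": 1800, "lot_sqft": 6000},
    {"id": "B", "beds": 4, "baths": 3, "sqft": 2400, "lot_sqft": 7500},
    {"id": "C", "beds": 3, "baths": 2, "sqft": 1600, "lot_sqft": 5000},
    {"id": "D", "beds": 5, "baths": 3, "sqft": 3100, "lot_sqft": 9000},
    {"id": "E", "beds": 2, "baths": 1, "sqft": 1100, "lot_sqft": 4000},
    {"id": "F", "beds": 9, "baths": 20, "sqft": 15000, "lot_sqft": 40000},  # mansion
    {"id": "G", "beds": 4, "baths": 2, "sqft": 2000, "lot_sqft": 6500},
]
FIELDS = ["beds", "baths", "sqft", "lot_sqft"]


def robust_scores(rows, fields):
    """Return {id: score}, where score is the largest robust z-score of any field."""
    scores = {row["id"]: 0.0 for row in rows}
    for field in fields:
        values = [row[field] for row in rows]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        spread = mad if mad > 0 else 1.0  # avoid dividing by zero on constant fields
        for row in rows:
            z = abs(row[field] - center) / spread
            scores[row["id"]] = max(scores[row["id"]], z)
    return scores


def top_anomalies(rows, fields, threshold=5.0):
    """Ids whose score exceeds the (assumed) threshold, most unusual first."""
    scores = robust_scores(rows, fields)
    flagged = [(score, home_id) for home_id, score in scores.items() if score > threshold]
    return [home_id for score, home_id in sorted(flagged, reverse=True)]


unusual = top_anomalies(homes, FIELDS)
```

A per-field score like this would not catch a lot that is smaller than the home, because that depends on two fields taken together.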
Anomaly Detection
What just happened?
• We wanted to find and remove unusual houses.
• We created an Anomaly Detector and examined
the top anomalies.
• We found some unusual houses to remove and
discovered bad data (missing values) that we want
to fix.
ML to fix missing data…
• Let’s use Machine Learning…
What happened to being easy?
SQFT    PRICE        BEDS   BATHS
3,125   US$530,000   5      3
2,100   US$460,000   ?      2
1,200   US$250,000   3      ?
3,950   US$610,000   6      4
Values to predict: BEDS for the second row (4), BATHS for the third row (1.5)
WhizzML
What just happened?
• We had a Dataset with missing values.
• We wanted to use Machine Learning to fill in the
missing values.
• Rather than write the algorithm, we found what we
needed in the WhizzML public gallery.
• Now that we have cloned the Script we can use it
again and again.
• We can write new ones too!
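The idea of fixing missing values with Machine Learning can be sketched in a few lines: fit a model on the rows where a field is known, then predict that field where it is missing. The example below is a minimal Python illustration using a one-variable least-squares line on SQFT. It is not the WhizzML script from the gallery, and reading the example table's figures as 3,125 sq ft, US$530,000 and so on is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def impute(rows, target, predictor="sqft"):
    """Return copies of rows with missing `target` values predicted from `predictor`."""
    known = [r for r in rows if r[target] is not None]
    intercept, slope = fit_line(
        [r[predictor] for r in known], [r[target] for r in known]
    )
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[target] is None:
            r[target] = intercept + slope * r[predictor]
        filled.append(r)
    return filled


# The four rows of the example table; None marks a missing value.
homes = [
    {"sqft": 3125, "price": 530000, "beds": 5, "baths": 3},
    {"sqft": 2100, "price": 460000, "beds": None, "baths": 2},
    {"sqft": 1200, "price": 250000, "beds": 3, "baths": None},
    {"sqft": 3950, "price": 610000, "beds": 6, "baths": 4},
]
completed = impute(impute(homes, "beds"), "baths")
```

With only three complete rows per field the estimates are rough; a real dataset and a richer model using several fields would be needed for dependable values.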
Recommender Problem #2
• How can we avoid showing essentially the
same house over and over?
[Diagram: All Homes split into groups such as "Modern" and "Lots of Land", with a sample drawn from each group]
• Great! What if we don’t know how to group
them? Or how many groups?
Clustering
What just happened?
• Since we don’t know how many groups of homes
there should be, we used G-means Clustering to find
the optimum number of groups
• Our recommender will use these groups to draw a
more varied sample of homes when collecting user preferences
• We also tried to understand the home Clusters using
“model clusters”, but the models were difficult to
interpret
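Once each home carries a cluster label, sampling by group can be as simple as drawing the same number of homes from every cluster. The sketch below assumes the cluster labels already exist (the clustering itself is not shown) and uses hypothetical home ids and labels.

```python
import random
from collections import defaultdict


def sample_per_cluster(assignments, per_cluster=1, seed=0):
    """Pick up to `per_cluster` home ids from each cluster.

    `assignments` maps home id -> cluster label (hypothetical data).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_cluster = defaultdict(list)
    for home_id, cluster in assignments.items():
        by_cluster[cluster].append(home_id)
    picks = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])
        picks.extend(rng.sample(members, min(per_cluster, len(members))))
    return picks


assignments = {
    "h1": "Cluster 1", "h2": "Cluster 1", "h3": "Cluster 1", "h4": "Cluster 1",
    "h5": "Cluster 3", "h6": "Cluster 3",
    "h7": "Cluster 5",
}
shortlist = sample_per_cluster(assignments, per_cluster=1)
```

Sampling per cluster keeps a large group of near-identical homes from crowding the smaller groups out of the sample shown to the user.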
Understanding Clusters
What if we could get rules like…
IF SQFT >= 3,125 THEN “Cluster 1”
SQFT    PRICE        BEDS   BATHS   CLUSTER
3,125   US$530,000   5      3       Cluster 1
2,100   US$460,000   4      2       Cluster 3
1,200   US$250,000   3      1.5     Cluster 5
3,950   US$610,000   6      4       Cluster 1
Association Discovery
What just happened?
• We used a Batch Centroid to add the Cluster
assignment of each home as a feature to the Dataset
• We used Association Discovery to find “interesting”
relationships between the features, including the
Cluster assignment
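Rules of the form "IF antecedent THEN consequent" are usually judged by a few standard measures. The sketch below computes support, confidence and lift for one rule over the four example rows. The helper name and the choice of rule are illustrative, and the code only scores a given rule; it does not search for rules the way an association miner does.

```python
def rule_metrics(rows, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    `antecedent` and `consequent` are predicates that take one row.
    """
    n = len(rows)
    n_a = sum(1 for r in rows if antecedent(r))
    n_c = sum(1 for r in rows if consequent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_both / n                          # share of rows matching the whole rule
    confidence = n_both / n_a if n_a else 0.0     # how often the rule holds when it applies
    lift = confidence / (n_c / n) if n_c else 0.0  # improvement over the base rate
    return {"support": support, "confidence": confidence, "lift": lift}


# Rows from the example table, with the cluster assignment as a feature.
homes = [
    {"sqft": 3125, "beds": 5, "baths": 3, "cluster": "Cluster 1"},
    {"sqft": 2100, "beds": 4, "baths": 2, "cluster": "Cluster 3"},
    {"sqft": 1200, "beds": 3, "baths": 1.5, "cluster": "Cluster 5"},
    {"sqft": 3950, "beds": 6, "baths": 4, "cluster": "Cluster 1"},
]
metrics = rule_metrics(
    homes,
    antecedent=lambda r: r["sqft"] >= 3125,
    consequent=lambda r: r["cluster"] == "Cluster 1",
)
```

A lift above 1 means the consequent is more common among rows matching the antecedent than in the dataset as a whole.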
Recommender Problem #3
The listing remarks hold much more interesting information
than just the number of BEDS, BATHS, etc.
• Unfortunately, these "remarks" are not available in the
Redfin download
• Adding them to our dataset requires crawling the
website
• Like most ML projects, preparing the data is 80% of
the difficulty (fortunately I already did it!)
Topic Modeling
What just happened?
• We extended the home dataset with the syndicated
remarks text field
• We built a Model to predict sale price and explored how
key words discovered in the remarks impacted price
• We used Topic Modeling to create a deeper thematic
understanding of the remarks
• For example, homes that are "in-town" or "out-of-town"
• We extended the Dataset with fields that represent
how related each home is to each of these topics
• This will allow our Clustering to group homes by a deeper
meaning than just BEDS, BATHS, etc.
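To make the "one field per topic" step concrete, the sketch below adds a numeric score per topic to a home record. It is deliberately crude: the hand-picked keyword lists stand in for topics that a topic model would learn from the remarks, and the score is just the share of matching words. All names, keywords and the sample remark are assumptions.

```python
import re

# Hand-picked keyword lists standing in for learned topics (an assumption;
# a topic model would discover its own terms from the remarks).
TOPIC_TERMS = {
    "in_town": {"downtown", "walk", "shops", "restaurants", "transit"},
    "out_of_town": {"acres", "barn", "pasture", "quiet", "views"},
}


def topic_scores(remarks):
    """Share of words in `remarks` that belong to each topic's term list."""
    words = re.findall(r"[a-z]+", remarks.lower())
    if not words:
        return {topic: 0.0 for topic in TOPIC_TERMS}
    return {
        topic: sum(1 for w in words if w in terms) / len(words)
        for topic, terms in TOPIC_TERMS.items()
    }


def add_topic_fields(home):
    """Return a copy of `home` with one numeric field per topic."""
    extended = dict(home)
    for topic, score in topic_scores(home.get("remarks", "")).items():
        extended["topic_" + topic] = score
    return extended


# Hypothetical home record with a remarks text field.
home = {"beds": 3, "baths": 2,
        "remarks": "Walk to downtown shops and restaurants"}
extended = add_topic_fields(home)
```

Because the new fields are plain numbers, they can sit next to BEDS and BATHS as inputs to the clustering step.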
Recommender Idea
[Diagram labels: Modern, Lots of Land, Small, Preference Data, Preference Model]
House Recommender
MLSD18. Real World Use Case II
