SEARCH Final Project (ILS-Z534)
Yelp Data Challenge
UNDER THE SUPERVISION OF PROFESSOR XIAOZHONG LIU
PRESENTED BY,
MILIND GOKHALE
NAMRATA JAGASIA
DEEPAK BHARANIKANA
SAMEEDHA BAIRAGI
SIDDHARTH JAYASANKAR
TASK – 1
PREDICTING CATEGORIES FOR EACH BUSINESS
USING AN INFORMATION RETRIEVAL APPROACH
INPUT: BUSINESS ID
OUTPUT: LIST OF CATEGORIES FOR EACH BUSINESS ID
Dataset Division
[Pie chart: Training Set vs. Test Set]
- 1.6M reviews and 500K tips for 61K businesses
- Data divided into a training set and a test set.
- 66% Training Set: ~38K businesses
  - Used for category feature extraction.
- 33% Test Set: ~20K businesses
  - Used for prediction
  - Evaluation
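
A split like the one described above could be produced as follows. This is a minimal sketch, not the authors' code: the class name `DatasetSplit`, the fixed seed, and the shuffle-then-cut strategy are all assumptions, since the slides do not say how businesses were assigned to each set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: splits business IDs into a training and a test set. */
class DatasetSplit {
    final List<String> train;
    final List<String> test;

    private DatasetSplit(List<String> train, List<String> test) {
        this.train = train;
        this.test = test;
    }

    /**
     * Shuffles a copy of the IDs with a fixed seed (so the split is
     * reproducible) and cuts the list at trainFraction.
     */
    static DatasetSplit split(List<String> businessIds, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(businessIds);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        return new DatasetSplit(
                new ArrayList<>(shuffled.subList(0, cut)),
                new ArrayList<>(shuffled.subList(cut, shuffled.size())));
    }
}
```

Calling `DatasetSplit.split(ids, 0.66, 42L)` keeps roughly two thirds of the businesses for feature extraction and leaves the rest for prediction and evaluation. Splitting by business rather than by review keeps all reviews and tips of a business on the same side.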
Toolset
- JSON Handling
- Indexing + Search
- String Utilities
- Database
- POS Tagging
- Java (Eclipse)
Task 1 – ALGORITHM
Start
Indexing [Business ID, Reviews, Tips]
Create Category Feature Map
Perform Business search on categories
Rank categories found
Comparison with Ground Truth
Evaluate precision and recall
End
Task 1 – Method
- Index Creation using Lucene

Business ID  Category                     Reviews and Tips Text
10001        Restaurant, Indian, Spicy    The chicken curry is great. Loved the food. …… ……
10002        Restaurant, American, Donut  The donuts are delicious. The ambiance is good. …… ……

- Category Feature Extraction from training set
  - Features are the words with the highest TF-IDF scores among all the words in the reviews and tips text for the category

Category  Search Query
Indian    Curry, mutter, spicy…..
Italian   Pizza, Pasta, Alfredo…..
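
The feature extraction step can be illustrated with a small TF-IDF computation. This is a sketch under stated assumptions, not the authors' Lucene-based implementation: each category is treated as one document made of all its reviews and tips, the IDF is the plain `log(N / df)` form, and `CategoryFeatures`, `tokenize`, and `topTerms` are hypothetical names. No stop word removal is applied here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class CategoryFeatures {
    /** Lowercases and splits on non-letters; a stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Maps each category to its n highest-scoring TF-IDF terms, best first. */
    static Map<String, List<String>> topTerms(Map<String, String> categoryText, int n) {
        Map<String, Map<String, Integer>> tf = new HashMap<>();
        Map<String, Integer> df = new HashMap<>();
        for (Map.Entry<String, String> e : categoryText.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String tok : tokenize(e.getValue())) counts.merge(tok, 1, Integer::sum);
            tf.put(e.getKey(), counts);
            // Document frequency: number of categories whose text contains the term.
            for (String term : counts.keySet()) df.merge(term, 1, Integer::sum);
        }
        int numCategories = categoryText.size();
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : tf.entrySet()) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                double idf = Math.log((double) numCategories / df.get(c.getKey()));
                scores.put(c.getKey(), c.getValue() * idf);
            }
            List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
            // Highest score first; ties broken alphabetically for stable output.
            ranked.sort((a, b) -> {
                int byScore = Double.compare(b.getValue(), a.getValue());
                return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
            });
            List<String> top = new ArrayList<>();
            for (int i = 0; i < Math.min(n, ranked.size()); i++) top.add(ranked.get(i).getKey());
            result.put(e.getKey(), top);
        }
        return result;
    }
}
```

A word that appears in every category (for example "food") gets an IDF of zero and drops to the bottom of the list, which is the intended effect: the surviving top terms form the search query for that category.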
Task 1 – Method
- Category Scores for Businesses

Business ID  Rank  Category    Score
10001        1     Indian      0.734
             2     Restaurant  0.678
             3     Asian       0.567
             .
             .
             783   Mexican     0.0
10002        1     Donut       0.67
             2     Cheese      0.56
             3     Restaurant  0.43
             .
             .
             783   Bar         0.0

- Predicted Results

Business ID  Predicted categories
10001        Indian, Restaurant, Asian, Authentic, traditional
10002        Donut, Cheese, Restaurant, American, Icecream
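
Turning the per-business category scores into a prediction amounts to sorting by score and keeping the first few entries. The sketch below assumes the scores are already available as a map from category name to score; `CategoryRanker` and `topK` are hypothetical names, and the cut-off `k` is a free parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class CategoryRanker {
    /** Returns the k category names with the highest scores, best first. */
    static List<String> topK(Map<String, Double> categoryScores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(categoryScores.entrySet());
        // Sort in descending order of score.
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

With the example scores for business 10001 (Indian 0.734, Restaurant 0.678, Asian 0.567, Mexican 0.0), `topK(scores, 3)` yields Indian, Restaurant, Asian.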
Task 1 – Evaluation
- Comparison of Ground Truth Value (provided by Yelp) with calculated predictions.
[Bar chart: Variation Across Number of Prediction Results. Series: Recall, Precision. Groups: 3, 5, 10, and 20 categories. Surviving data labels: 46.38364873, 74.19699061.]
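
The comparison with ground truth reduces to set overlap for each business. The following sketch computes precision and recall for a single business; `PrecisionRecall` and `evaluate` are hypothetical names, and how the per-business values were averaged into the reported figures is not stated on the slides.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecall {
    final double precision;
    final double recall;

    PrecisionRecall(double precision, double recall) {
        this.precision = precision;
        this.recall = recall;
    }

    /**
     * precision = correct predictions / all predictions
     * recall    = correct predictions / all ground-truth categories
     */
    static PrecisionRecall evaluate(List<String> predicted, Set<String> groundTruth) {
        Set<String> hits = new HashSet<>(predicted);
        int predictedCount = hits.size();
        hits.retainAll(groundTruth);  // keep only the correct predictions
        double precision = predictedCount == 0 ? 0.0 : (double) hits.size() / predictedCount;
        double recall = groundTruth.isEmpty() ? 0.0 : (double) hits.size() / groundTruth.size();
        return new PrecisionRecall(precision, recall);
    }
}
```

Taking the example business 10001 with ground truth {Restaurant, Indian, Spicy} and predictions {Indian, Restaurant, Asian, Authentic, traditional}, two predictions are correct, so precision is 2/5 = 0.4 and recall is 2/3. Returning more categories can only keep or raise recall, while precision usually falls, which is the trade-off the chart on this slide explores.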
Task 1 – Evaluation
[Bar chart: Precision Across Algorithms. Models: VSM, BM25, LMD, LMJ. Surviving data label: 46.38364873.]
[Bar chart: Recall Across Algorithms. Models: VSM, BM25, LMD, LMJ. Surviving data label: 54.03189606.]
Task 1 – Evaluation
[Bar chart: Impact of POS Tagging. Series: With POS Tagging, Without POS Tagging. Measures: Recall, Precision. Surviving data labels: 40.01825046, 35.49528383, 53.2457325, 44.02515811.]
Task 2
PREDICT MOST DISCUSSED ATTRIBUTES
IN EACH CITY
INPUT: CITY NAME
OUTPUT: LIST OF ATTRIBUTES THAT ARE MOST TALKED ABOUT IN THE CITY
Task 2 - Algorithm
Start
Split the data into test and train sets, and index the reviews and tips for each city separately.
Using WordNet, create an attribute map with the attribute name as key and the search text (related words) as value.
For the given input city, search for each attribute and retrieve its score and rank using the BM25 ranking function. Perform this step on both the test and train data.
Assign the top 10 ranked attributes to each city for both the test and train data.
Compare the test results with the train results.
Calculate precision and recall for this model.
End
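
The BM25 search step can be illustrated without Lucene. The sketch below implements the textbook BM25 formula with the common defaults k1 = 1.2 and b = 0.75 and a non-negative IDF variant. Summing document scores to get one score per attribute is an assumption made for this example, as are the names `Bm25AttributeRanker`, `scoreAttributes`, and `rank`; the slides do not say how per-document scores were combined for a city.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class Bm25AttributeRanker {
    static final double K1 = 1.2;   // term-frequency saturation
    static final double B = 0.75;   // document-length normalization

    /** BM25 score of one tokenized document for a bag of query terms. */
    static double score(List<String> query, List<String> doc, List<List<String>> corpus) {
        double avgLen = corpus.stream().mapToInt(List::size).average().orElse(1.0);
        double total = 0.0;
        for (String term : new HashSet<>(query)) {
            long tf = doc.stream().filter(term::equals).count();
            if (tf == 0) continue;  // term absent: contributes nothing
            long df = corpus.stream().filter(d -> d.contains(term)).count();
            double idf = Math.log(1.0 + (corpus.size() - df + 0.5) / (df + 0.5));
            total += idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * doc.size() / avgLen));
        }
        return total;
    }

    /** Assumed aggregation: an attribute's city score is the sum over all documents. */
    static Map<String, Double> scoreAttributes(Map<String, List<String>> attributeMap,
                                               List<List<String>> cityDocs) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<String>> e : attributeMap.entrySet()) {
            double sum = 0.0;
            for (List<String> doc : cityDocs) sum += score(e.getValue(), doc, cityDocs);
            scores.put(e.getKey(), sum);
        }
        return scores;
    }

    /** Attribute names ordered by descending score, ties broken by name. */
    static List<String> rank(Map<String, Double> scores) {
        List<String> names = new ArrayList<>(scores.keySet());
        names.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return names;
    }
}
```

Query terms and document tokens are compared exactly, so both sides need the same normalization (for example lowercasing) before scoring. Taking the first 10 entries of `rank(...)` gives the top 10 attributes for the city.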
Task 2 - Method
- Splitting and Indexing of data (City-wise)
[Diagram: the Business, Review, and Tip files are loaded into MongoDB collections and merged into a final collection used for indexing:]
{BusinessID : "1001" , City : "Las Vegas", Rev&Tips : ["Rev1","Rev2",…..,"Tip1","Tip2",…]}
[Diagram: the reviews and tips are indexed per city (Las Vegas, Phoenix, Tempe), with separate TRAIN INDEXES and TEST INDEXES for each city.]
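
The merge into one record per business, grouped by city, might look like the sketch below. It works on in-memory collections rather than MongoDB, and every name in it (`CityCorpusBuilder`, `BusinessDoc`, `groupByCity`, the two-element `{businessId, text}` rows) is a hypothetical stand-in for the actual collections.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CityCorpusBuilder {
    /** One merged record per business, mirroring the final collection's shape. */
    static class BusinessDoc {
        final String businessId;
        final String city;
        final List<String> reviewsAndTips = new ArrayList<>();

        BusinessDoc(String businessId, String city) {
            this.businessId = businessId;
            this.city = city;
        }
    }

    /** Appends each {businessId, text} row to the matching business record. */
    private static void attach(Map<String, BusinessDoc> byId, List<String[]> rows) {
        for (String[] row : rows) {
            BusinessDoc doc = byId.get(row[0]);
            if (doc != null) doc.reviewsAndTips.add(row[1]);  // skip unknown businesses
        }
    }

    /**
     * businessCity maps business ID to city; reviews and tips are
     * {businessId, text} pairs. Returns city -> merged business records.
     */
    static Map<String, List<BusinessDoc>> groupByCity(Map<String, String> businessCity,
                                                      List<String[]> reviews,
                                                      List<String[]> tips) {
        Map<String, BusinessDoc> byId = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : businessCity.entrySet()) {
            byId.put(e.getKey(), new BusinessDoc(e.getKey(), e.getValue()));
        }
        attach(byId, reviews);  // reviews first, then tips, as in the example record
        attach(byId, tips);
        Map<String, List<BusinessDoc>> byCity = new TreeMap<>();
        for (BusinessDoc doc : byId.values()) {
            byCity.computeIfAbsent(doc.city, c -> new ArrayList<>()).add(doc);
        }
        return byCity;
    }
}
```

Each city's list can then be divided into train and test portions and indexed separately, giving one train index and one test index per city.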
Task 2 - Method
- We used WordNet to create an attribute map.
- For the given city, we ran a search for each attribute on both the test and train data and retrieved the top 10 attributes for each.

Attribute      Attribute Map (related words from WordNet)
Good for Kids  Healthy, colorful, son, daughter,…….
Music          Jazz, Rock, Pop, melody…..
Liquor         Alcohol, spirits, vodka, Rum….
Smoking        Cigar, Cigarette, lighter, …..

[Diagram: the data is split into Train Data (60%) and Test Data (40%); the IR model produces one result list from the train data and one from the test data. The two example lists shown on the slide:]

Top 10 Attributes
Liquor - 4.35
Good for Kids - 3.5
Music - 2.1
Smoking - 1.6

Top 10 Attributes
Liquor - 5.35
Music - 4.0
Good for Kids - 2.5
Smoking - 0.5
Task 2 - Evaluation
- We compared the predicted results on the test data with the predicted results on the train data (treated as ground truth) and calculated the precision and recall.
[Per-city result charts: Charlotte, Phoenix, Las Vegas]
Challenges
- Task 1
  - Data cleaning and pre-processing time.
  - Even after stop word removal, many unwanted features had high TF-IDF scores.
  - Java heap space out-of-memory exception during feature extraction from categories.
- Task 2
  - Data cleaning and pre-processing.
  - Manual removal of some features from WordNet to improve output.
  - Evaluation metric
Questions ?
Thank You…!!

Yelp Dataset Challenge 2015
