Yelp Dataset Challenge 2015

SEARCH Final Project (ILS-Z534)
Yelp Data Challenge
UNDER THE SUPERVISION OF PROFESSOR XIAOZHONG LIU
PRESENTED BY,
MILIND GOKHALE
NAMRATA JAGASIA
DEEPAK BHARANIKANA
SAMEEDHA BAIRAGI
SIDDHARTH JAYASANKAR

TASK – 1
PREDICTING CATEGORIES FOR EACH BUSINESS USING INFORMATION RETRIEVAL APPROACH
INPUT: BUSINESS ID
OUTPUT: LIST OF CATEGORIES FOR EACH BUSINESS ID

Dataset Division
• 1.6M reviews and 500K tips for 61K businesses
• Data divided into training set and test set.
  • 66% Training Set: ~38K businesses
    • Used for category feature extraction.
  • 33% Test Set: ~20K businesses
    • Used for prediction
    • Evaluation

Toolset
• JSON Handling
• Indexing + Search
• String Utilities
• Database
• POS Tagging
• Java (Eclipse)

Task 1 – ALGORITHM
1. Start
2. Indexing [Business ID, Reviews, Tips]
3. Create Category Feature Map
4. Perform Business search on categories
5. Rank categories found
6. Comparison with Ground Truth
7. Evaluate precision and recall
8. End

Task 1 – Method
• Index Creation using Lucene
Business ID   Category                      Reviews and Tips Text
10001         Restaurant, Indian, Spicy     The chicken curry is great. Loved the food. ……
10002         Restaurant, American, Donut   The donuts are delicious. The ambiance is good. ……
• Category Feature Extraction from training set
  • Features are words with highest TFIDF score among all the words in reviews and tips text for the category
Category   Search Query
Indian     Curry, mutter, spicy…..
Italian    Pizza, Pasta, Alfredo…..
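The feature extraction step can be illustrated with a minimal sketch. It assumes each category is treated as one "document" (the concatenated review and tip text of its training businesses) and uses a plain tf × log(N/df) weighting; the slides do not state the exact TF-IDF variant or tokenizer, so the class name `CategoryFeatureSketch`, the crude tokenizer, and the tie-breaking rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CategoryFeatureSketch {

    // Lowercases the text and splits on anything that is not a letter (a deliberately crude tokenizer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // categoryText maps a category name to the concatenated review and tip text for that category.
    // Returns, per category, the topK terms ranked by tf * log(N / df), where N is the number of
    // categories and df is the number of categories whose text contains the term.
    static Map<String, List<String>> topFeatures(Map<String, String> categoryText, int topK) {
        Map<String, Map<String, Integer>> tf = new HashMap<>();
        Map<String, Integer> df = new HashMap<>();
        for (Map.Entry<String, String> e : categoryText.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            for (String tok : tokenize(e.getValue())) counts.merge(tok, 1, Integer::sum);
            tf.put(e.getKey(), counts);
            for (String term : counts.keySet()) df.merge(term, 1, Integer::sum);
        }
        int n = categoryText.size();
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : tf.entrySet()) {
            final Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                double idf = Math.log((double) n / df.get(c.getKey()));
                scores.put(c.getKey(), c.getValue() * idf);
            }
            List<String> terms = new ArrayList<>(scores.keySet());
            // Highest score first; ties are broken alphabetically so the output is deterministic.
            Collections.sort(terms, (a, b) -> {
                int cmp = Double.compare(scores.get(b), scores.get(a));
                return cmp != 0 ? cmp : a.compareTo(b);
            });
            result.put(e.getKey(), new ArrayList<>(terms.subList(0, Math.min(topK, terms.size()))));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> text = new HashMap<>();
        text.put("Indian", "The chicken curry is great. Spicy curry, loved the food.");
        text.put("Donut", "The donuts are delicious. Loved the food and the donuts.");
        System.out.println(topFeatures(text, 2));
    }
}
```

With only two toy categories, function words such as "and" can still rank highly, which is the kind of unwanted high-scoring feature that stop-word removal and POS tagging are meant to reduce.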

Task 1 – Method
• Category Scores for Businesses
Business ID   Rank   Category     Score
10001         1      Indian       0.734
              2      Restaurant   0.678
              3      Asian        0.567
              .
              .
              783    Mexican      0.0
10002         1      Donut        0.67
              2      Cheese       0.56
              3      Restaurant   0.43
              .
              .
              783    Bar          0.0
• Predicted Results
Business ID   Predicted categories
10001         Indian, Restaurant, Asian, Authentic, traditional
10002         Donut, Cheese, Restaurant, American, Icecream
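Turning the ranked scores into a prediction list amounts to a top-k selection. The sketch below is one possible way to do it, not the authors' code: the class name `TopCategoriesSketch` is hypothetical, and skipping zero-score categories is an assumption (the slides only show that the lowest-ranked categories score 0.0).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TopCategoriesSketch {

    // Returns up to k categories, best score first. Categories with a score of zero
    // are never predicted (an assumption made for this sketch).
    static List<String> topK(Map<String, Double> categoryScores, int k) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(categoryScores.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Double> e : entries) {
            if (top.size() == k || e.getValue() <= 0.0) break;
            top.add(e.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Scores taken from the example for business 10001.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Restaurant", 0.678);
        scores.put("Mexican", 0.0);
        scores.put("Indian", 0.734);
        scores.put("Asian", 0.567);
        System.out.println(topK(scores, 3));
    }
}
```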

Task 1 – Evaluation
• Comparison of Ground Truth Value (provided by Yelp) with calculated predictions.
[Figure: Variation Across Number of Prediction Results. Bar chart of Recall and Precision (axis 0 to 80) for 3, 5, 10, and 20 predicted categories. Surviving data labels: 46.38364873, 74.19699061.]
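For a single business, precision and recall against the Yelp ground truth can be computed as in the following sketch. The class name `PrecisionRecallSketch` is hypothetical, matching is exact and case-sensitive for simplicity, and the slides do not say how per-business values were averaged into the reported percentages, so that step is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PrecisionRecallSketch {

    // Counts distinct predicted categories that also appear in the ground truth.
    static int hits(List<String> predicted, Set<String> truth) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(truth);
        return common.size();
    }

    // Fraction of predicted categories that are correct.
    static double precision(List<String> predicted, Set<String> truth) {
        return predicted.isEmpty() ? 0.0 : (double) hits(predicted, truth) / predicted.size();
    }

    // Fraction of ground-truth categories that were predicted.
    static double recall(List<String> predicted, Set<String> truth) {
        return truth.isEmpty() ? 0.0 : (double) hits(predicted, truth) / truth.size();
    }

    public static void main(String[] args) {
        // Illustrative values for business 10001 from the example tables.
        Set<String> truth = new HashSet<>(Arrays.asList("Restaurant", "Indian", "Spicy"));
        List<String> predicted = Arrays.asList("Indian", "Restaurant", "Asian", "Authentic", "traditional");
        System.out.println("precision = " + precision(predicted, truth));
        System.out.println("recall    = " + recall(predicted, truth));
    }
}
```

In this example two of the five predictions are correct (precision 2/5) and two of the three true categories are found (recall 2/3). Predicting more categories can only keep recall the same or raise it, while precision tends to fall.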

[Figure: Precision Across Algorithms. Precision (axis 38 to 47) for VSM, BM25, LMD, and LMJ. Surviving data label: 46.38364873.]
[Figure: Recall Across Algorithms. Recall (axis 46 to 55) for VSM, BM25, LMD, and LMJ. Surviving data label: 54.03189606.]

[Figure: Impact of POS Tagging. Recall and Precision (axis 0 to 60), with and without POS tagging. Surviving data labels: 40.01825046, 35.49528383, 53.2457325, 44.02515811.]

Task 2
PREDICT MOST DISCUSSED ATTRIBUTES IN EACH CITY
INPUT: CITY NAME
OUTPUT: LIST OF ATTRIBUTES THAT ARE MOST TALKED ABOUT IN THE CITY

Task 2 - Algorithm
1. Start
2. Split the data into test and train sets, and index the reviews and tips for each city separately.
3. Using WordNet, create an attribute map with the attribute name as key and the search text (related words) as values.
4. For the given input city, perform a search for each attribute and retrieve its score and rank using the BM25 ranking function. Perform this step on both test and train data.
5. Assign the top 10 ranked attributes to each city for both test and train data.
6. Compare the test results with the train results. Calculate precision and recall for this model.
7. End
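To make the BM25 scoring step concrete, here is a small self-contained sketch of the ranking function. It is not Lucene's implementation: the class name `Bm25Sketch` is hypothetical, documents are assumed to be pre-tokenized, and the constants k1 = 1.2 and b = 0.75 are common defaults rather than values stated in the slides. How per-document scores were combined into one score per attribute and city is also not stated, so the sketch stops at per-document scores.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // document-length normalization (assumed default)

    // Scores every document (a list of tokens) against the query terms with BM25.
    static double[] score(List<List<String>> docs, List<String> queryTerms) {
        int n = docs.size();
        double totalLen = 0;
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            totalLen += doc.size();
            // Count each term once per document to get its document frequency.
            for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);
        }
        double avgLen = totalLen / n;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            List<String> doc = docs.get(i);
            for (String q : queryTerms) {
                int tf = Collections.frequency(doc, q);
                if (tf == 0) continue;
                int docFreq = df.get(q);
                double idf = Math.log(1.0 + (n - docFreq + 0.5) / (docFreq + 0.5));
                double norm = tf + K1 * (1 - B + B * doc.size() / avgLen);
                scores[i] += idf * tf * (K1 + 1) / norm;
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("great", "vodka", "and", "rum"),
                Arrays.asList("kids", "loved", "the", "food"));
        // Query built from the related words of a hypothetical "Liquor" attribute.
        double[] s = score(docs, Arrays.asList("vodka", "rum", "alcohol"));
        System.out.println(Arrays.toString(s));
    }
}
```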

Task 2 - Method
• Splitting and Indexing of data (City-wise)
[Diagram: the Business File, Review File, and Tip File are loaded into MongoDB collections and merged into a final collection used to index:]
{BusinessID : "1001", City : "Las Vegas", Rev&Tips : ["Rev1", "Rev2", ….., "Tip1", "Tip2", …]}
[Diagram: the reviews and tips of each city (e.g. Las Vegas, Phoenix, Tempe) are split into separate TRAIN INDEXES and TEST INDEXES.]

Task 2 - Method
• We used WordNet to create an attribute map.
• For the given city, we ran a search for each attribute on both test and train data and retrieved the top 10 attributes for each.
Attribute Map (built with WordNet)
Attribute       Related words
Good for Kids   Healthy, colorful, son, daughter, …….
Music           Jazz, Rock, Pop, melody…..
Liquor          Alcohol, spirits, vodka, Rum….
Smoking         Cigar, Cigarette, lighter, …..
[Diagram: Train Data (60%) and Test Data (40%) each pass through the IR Model, giving "Results from Train Data" and "Results from Test Data". Two example "Top 10 Attributes" lists are shown:]
Top 10 Attributes: Liquor - 4.35, Good for Kids - 3.5, Music - 2.1, Smoking - 1.6
Top 10 Attributes: Liquor - 5.35, Music - 4.0, Good for Kids - 2.5, Smoking - 0.5
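The attribute map is essentially a dictionary from attribute name to related words, and each entry can be turned into a search query. The sketch below assumes the related words are simply OR-ed together in Lucene classic query-parser syntax; the class name `AttributeMapSketch` and the lowercasing are illustrative choices, and the map is filled by hand from the example above rather than by calling a WordNet library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AttributeMapSketch {

    // Attribute name -> related words, copied from the example attribute map.
    static Map<String, List<String>> exampleMap() {
        Map<String, List<String>> map = new LinkedHashMap<>();
        map.put("Good for Kids", Arrays.asList("Healthy", "colorful", "son", "daughter"));
        map.put("Music", Arrays.asList("Jazz", "Rock", "Pop", "melody"));
        map.put("Liquor", Arrays.asList("Alcohol", "spirits", "vodka", "Rum"));
        map.put("Smoking", Arrays.asList("Cigar", "Cigarette", "lighter"));
        return map;
    }

    // Joins the related words into one query string such as "jazz OR rock OR pop".
    static String toQuery(List<String> relatedWords) {
        List<String> terms = new ArrayList<>();
        for (String w : relatedWords) terms.add(w.trim().toLowerCase());
        return String.join(" OR ", terms);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String>> e : exampleMap().entrySet()) {
            System.out.println(e.getKey() + " -> " + toQuery(e.getValue()));
        }
    }
}
```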

Task 2 - Evaluation
• We compared the predicted results on the test data with the predicted results on the train data (treated as ground truth) and calculated precision and recall.
[Figures: results for Charlotte, Phoenix, and Las Vegas.]

Challenges
• Task 1
  • Data cleaning and pre-processing time.
  • Even after stop-word removal, many unwanted features had high TFIDF scores.
  • Java heap space out-of-memory exceptions during feature extraction from categories.
• Task 2
  • Data cleaning and pre-processing.
  • Manual removal of some WordNet features to improve the output.
  • Evaluation metric
