A Supervised Modeling Approach to Determine the Elite Status
of Yelp Members Using Decision Trees and Linear Regression
Chithroobeni Shankar
Carnegie Mellon University
chithroobeni.shankar@sv.cmu.edu
Darshana Sivakumar
Carnegie Mellon University
darshana.sivakumar@sv.cmu.edu
Jennifer Li
Carnegie Mellon University
jennifer.li@sv.cmu.edu
Julie Tram
Carnegie Mellon University
julie.tram@sv.cmu.edu
Moustafa Aly
Carnegie Mellon University
moustafa.aly@sv.cmu.edu
Neil Everette
Carnegie Mellon University
neil.everette@sv.cmu.edu
Ravindra Udipi
Carnegie Mellon University
ravindra.udipi@sv.cmu.edu
Sahil Kumar
Carnegie Mellon University
sahil.kumar@sv.cmu.edu
Abstract
Yelp, which was founded in 2004 by two PayPal executives,
is a crowd-sourced multinational company headquartered in
San Francisco, CA. Yelp’s goal is to connect people with
great local businesses. Yelp has over 77 million cumulative
reviews from yelpers around the world. Yelpers share their
everyday local business experiences, giving voice to con-
sumers and bringing word of mouth online. On a monthly average, approximately 142 million unique visitors used Yelp's website and approximately 79 million visited Yelp via a mobile device [1].
Embedded among all these business reviews and yelpers is a classification between Elite and Non-Elite yelpers. Yelp Elite is a way for Yelp to recognize and reward users who are active on Yelp. Elite-worthiness is based on a number of things, including well-written reviews, high-quality tips, a detailed personal profile, an active voting and complimenting record, and a history of playing well with others [2]. Elite status is earned every year and is determined by a committee. Elite yelpers have profiles with special badges and are invited to private events and parties.
For the data analytics course project, our team will at-
tempt to crack the code using a systematic algorithm to pre-
dict users’ Elite worthiness. We will use the Yelp academic
set and the associated user attributes to determine the most
accurate algorithm to predict elite status. Our goal for the
project is to predict with 95% accuracy if a user obtains elite
status for any particular year within the Yelp Academic set.
We should note that there are some inherent risks in using the Yelp academic data set. Our team has no insight into any additional or hidden indicators that may be used in determining Elite status beyond the data fields provided in the Yelp Academic set. The academic dataset contains only 12% of the reviews from its 370K users. Our algorithms and models are based solely on the data that exists in the academic data set.
1. Introduction
The Yelp Academic Dataset has been provided by Yelp to be
used for academic purposes. The dataset is a rich resource
of the interaction information between customers and busi-
nesses on the Yelp platform. Yelp’s academic dataset in-
cludes information about businesses near 30 different pre-
mium schools, including Carnegie Mellon University in
Pittsburgh, Pennsylvania. The academic dataset is in the
form of different JSON files for different objects, with nested
json structures and arrays in it. It consists of five objects re-
lated to Businesses, Customers, Reviews, Customer Check-
Ins and Customer tips. Business objects contain basic infor-
mation about local businesses. Review objects contain the
review text, the star rating, and information on votes Yelp
users have cast on the review. User objects contain aggre-
gate information about a single user across all of Yelp. Table
1 shows the number of records for each of these categories
and describes the Yelp objects [3].
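Before any modeling, each of these files must be parsed into records. Below is a minimal Python sketch assuming the dataset's usual one-JSON-object-per-line layout; the file names are assumptions about the distribution, not part of the paper.

```python
# Minimal sketch of loading the academic dataset; each file holds one JSON
# object per line. File names are assumptions about the distribution layout.
import json

def load_objects(path):
    """Parse one Yelp academic dataset file into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

users = load_objects("yelp_academic_dataset_user.json")
reviews = load_objects("yelp_academic_dataset_review.json")
print(len(users), "users;", len(reviews), "reviews")
```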
2. Problem Statement
All Yelpers can nominate themselves or their friends to be an
elite member on the Yelp website. According to Yelp, there
isn’t a specific benchmark for a member to be selected as an elite member. Also, to be considered elite, a member needs to reapply every year [5].

Business (61,184 records): basic information about local businesses, including location, number of reviews, average star rating, and URL.
Review (1,569,264 records): the review text, the star rating, and information on votes Yelp users have cast on the review [4].
User (366,715 records): aggregate information about a single user across all of Yelp.
Check-ins (45,166 records): data on users' check-in patterns for businesses.
Customer tips (495,107 records): like reviews, tips carry a text column with quick tips about businesses.

Table 1. Various Objects Belonging to the Yelp Academic Dataset
Yelp's Elite Council's process for selecting elite members is a black box to the rest of the world. What if, using Yelp's historic data, we could create an automated process for determining whether a member is fit to be given elite status? This could ease the selection task for the Elite Council by automatically filtering out nominations that are predictably unfit for elite status, saving Yelp the overhead cost of preliminary filtering for the Elite Council.
2.1 Goal
Our goal is to create an algorithm to predict a user’s elite
status on Yelp. We want to predict a user’s elite status with
an accuracy of 95%.
3. Initial Data Investigation
There are five data objects provided in the Yelp academic dataset, comprising 1.6 million reviews and 500k tips by 366k users for 61k businesses in 10 cities and four countries [3]. Of the 366k Yelp members in the dataset, only 25k (6.8%) were elite members. For our initial investigation, we analyzed the 20 attributes of the user data object to find correlations that could distinguish elite from non-elite members.
3.1 Most Significant Attributes for 2015
The data set was initially reduced to user activity in the year 2015. The red outline in the box plot developed using Tableau, seen in Figure 1, identifies the non-elite members. Compared with non-elite members, elite members showed significant differences in the following attributes:
• Number of reviews written
• Number of user Fans
• Votes counted as Useful
• Votes counted as Cool
• Votes counted as Funny
Figure 1. Most Significant Attributes for 2015
These five attributes were initially flagged as attributes
for further analysis.
3.2 Review Count Past 10 Years, Elite vs Non-Elite
The box plot in Figure 2 depicts the most significant at-
tribute, Review Count.
When the user data attributes were expanded over a 10-year span, the findings from the 2015 data were confirmed. With small exceptions in 2005 (the first year of Elite qualification) and 2015 (an incomplete year), the attribute findings were consistent across the span.

From this initial analysis of the user attribute data alone, we concluded that the four attributes shown in Table 2 had a high correlation with Elite vs. Non-Elite membership. Additional manipulation of the data (merging the user data with the review data set) was required to test whether other conditions, such as previous years of Yelp Elite status, had any additional correlating effects.
Figure 2. Review Count Past 10 Years, Elite vs Non-Elite
Review Count     Elite        Non-Elite   Difference
75% Quartile     106 Reviews  11 Reviews  9.6x
Median           75 Reviews   5 Reviews   15x
25% Quartile     51 Reviews   2 Reviews   25x

Votes Useful     Elite        Non-Elite   Difference
75% Quartile     140 Votes    16 Votes    8.8x
Median           70 Votes     4 Votes     17.5x
25% Quartile     40 Votes     1 Vote      40x

Votes Cool       Elite        Non-Elite   Difference
75% Quartile     63 Votes     4 Votes     15.8x
Median           27 Votes     1 Vote      27x
25% Quartile     14 Votes     0 Votes     14x

Votes Funny      Elite        Non-Elite   Difference
75% Quartile     50 Votes     4 Votes     12.5x
Median           20 Votes     1 Vote      20x
25% Quartile     10 Votes     0 Votes     10x

Table 2. Initial Data Findings
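The quartile comparisons in Table 2 can be reproduced along these lines. This is a sketch assuming the user records sit in a pandas DataFrame with a boolean is_elite column (a name we introduce here) alongside the per-user activity counts.

```python
# A sketch reproducing the quartile comparison of Table 2 with pandas.
# `users` is assumed to be a DataFrame with a boolean is_elite column
# (a name we introduce) plus per-user activity counts.
import pandas as pd

def quartile_comparison(users: pd.DataFrame, attribute: str) -> pd.DataFrame:
    """25%/50%/75% quantiles of one attribute, split Elite vs. Non-Elite."""
    q = (users.groupby("is_elite")[attribute]
              .quantile([0.25, 0.50, 0.75])
              .unstack(level=0))            # rows: quantiles, cols: elite flag
    q.columns = ["Non-Elite", "Elite"]      # False sorts before True
    # Zero-valued Non-Elite quantiles make the ratio infinite.
    q["Difference"] = q["Elite"] / q["Non-Elite"]
    return q

# e.g. quartile_comparison(users, "review_count")
```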
Dataset Num. of Attributes
Users 23
Businesses 105
Table 3. Number of Attributes in Different Datasets
4. Feature Selection
Feature selection is a popular technique in data mining that helps reduce input data to a more manageable size for processing and analysis. It implies not only cardinality reduction, i.e., reducing the number of features based on a cutoff count, but also actively selecting features or attributes of a dataset based on their usefulness for analysis [4]. Some datasets contain too many attributes that are sparse in their information. This can lead to cumbersome fitting problems for a model and can even degrade the quality of the result by introducing noise into the analysis. For this reason we paid attention to feature selection and 'data massaging' early in our work.
As alluded to earlier, the raw Yelp datasets had a high number of attributes describing Users and Businesses, as seen in Table 3.
In our bid to create predictive models for Yelp Elite user selection, we found the models built on the raw dataset to be highly inaccurate. To determine the usefulness of the available attributes, we evaluated the correlation between a user's elite status and the other attributes describing the user's behavior on the Yelp platform, rendering correlation matrices for the available data. This helped us narrow down to the attribute groups of interest. Still cautious about discarding data, we decided to try combining related attributes. There were 10 different types of compliments and 10 attributes to represent them. Since they could all essentially be combined into one aggregated field representing overall compliments, we experimented with that, and noticed slightly higher accuracy in our predictive model. Encouraged by the change, we applied the same approach to other related attributes: three attributes represented the three different types of votes a user had received, so consolidating these three columns into one was the next step. Our model improved with this step too.
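The consolidation described above can be sketched as follows, assuming the user records were flattened into a DataFrame with dotted column names such as compliments.hot and votes.useful (the column-naming convention is an assumption about how the nested JSON was loaded).

```python
# A sketch of the attribute consolidation described above; dotted column
# names such as compliments.hot and votes.useful are assumptions.
import pandas as pd

def consolidate(users: pd.DataFrame) -> pd.DataFrame:
    compliment_cols = [c for c in users.columns if c.startswith("compliments.")]
    vote_cols = [c for c in users.columns if c.startswith("votes.")]
    out = users.copy()
    # Combine the ten compliment attributes into one aggregated field.
    out["aggregated_compliments"] = out[compliment_cols].sum(axis=1)
    # Consolidate the three vote attributes the same way.
    out["aggregated_votes"] = out[vote_cols].sum(axis=1)
    return out.drop(columns=compliment_cols + vote_cols)
```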
With a better-organized dataset to work with, we decided to trim it down further to highlight the patterns we were interested in and to use the more prominent correlations. We built new correlation matrices on the new dataset to filter down to the attributes with the highest impact in determining a user's elite status. Comparing the correlations of the newly generated aggregated attributes helped us find the areas to concentrate on in order to build an effective model, and we were able to improve our models by using this information to create appropriate rules that better exploited the correlations we had found.

Figure 3. Correlation Matrix

Figure 4. Correlation Matrix
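A sketch of the analysis behind Figures 3 and 4 is given below: compute the correlation matrix over the numeric attributes and rank them by their correlation with the elite flag. The is_elite column name is an assumption.

```python
# A sketch of the correlation analysis behind Figures 3 and 4; the
# is_elite column name is an assumption.
import pandas as pd

def elite_correlations(users: pd.DataFrame) -> pd.Series:
    numeric = users.select_dtypes("number").copy()
    numeric["is_elite"] = users["is_elite"].astype(int)
    corr = numeric.corr()                  # full attribute correlation matrix
    return corr["is_elite"].drop("is_elite").sort_values(ascending=False)

# e.g. elite_correlations(users).head(10) lists the strongest signals
```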
5. Algorithm Experiments
Our team decided to dive deeper into the Yelp user data set to gain better insights into it. As we focused on Yelp Elite member status, we began exploring different techniques to find correlations that would help establish our model. The team used supervised learning techniques to understand the criteria for Yelp Elite membership.
The criteria for selecting our classifier algorithms were:
• Rule-based classification: we are interested in generating a rules engine for evaluating Yelp Elite users.
• Reasonable computational complexity: the academic dataset is over 2 GB, with the review dataset alone over 1.4 GB. We need the best of both worlds: an algorithm fast enough to allow multiple experiments, yet able to produce a high-quality model.
Based on these criteria, the following initial set of algorithms was selected for experimentation; we evaluated their effectiveness over the project life cycle.
• Alternating decision tree
• kNN: k nearest neighbor classification
• Bayesian Algorithms
• Random Forest
• CART: Classification and regression tree
• Conjunctive Rule classification
Below, we briefly discuss our results with each type of classifier:
• Bayesian algorithms: We ran the data set against a number of Bayesian algorithms, and the results were a very weak true positive rate (62%). A quick look at the nature of our data and some visualizations explained why the Bayesian algorithms performed poorly: they assume strong independence among attributes, while in our data set we could see strong correlations between attributes, such as the star counts and the number of reviews. The statistical advantage of Bayesian algorithms was lost in our case.
• Regression models: We had a hunch that regression models were not our best option. The data is rich in attributes, and the percentile distribution of values in each attribute leads to multiple decision points. For example, if a user has fewer than 2 compliments, he or she is definitely not an elite member. We tried regression models anyway; the results were better than the Bayesian algorithms but far from our target: true positive rate (72%).
• Alternating decision trees: Based on our observations, we needed an algorithm that does not require attribute independence and is sensitive to different bands of data. Decision trees seemed a natural and logical progression, and we got better results: true positive rate (79%). We could not improve beyond this.
• Random forest: As it is a family of decision trees, the results were almost identical to the previous algorithm.
• kNN: k-nearest neighbors seemed a good choice, as it tends to do well as a binary classifier when k is odd. The results were very promising: true positive rate (84%). However, we could not improve further because the data is quite discrete, which degrades the algorithm's performance.

Figure 5. Dataset Relations
• CART: Classification and regression trees seemed to combine the best of both worlds: rules that can handle the percentile distribution, and regression that can identify the correlations between attributes and weight them. Indeed, we achieved the best results with the J48 tree and linear regression; our true positive rate was 94.2%.
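Our experiments were run in Weka. The sketch below is only a rough scikit-learn stand-in, where DecisionTreeClassifier approximates the J48 tree and LogisticRegression stands in for the regression classifier; feature names and tree parameters are assumptions, not the paper's exact configuration.

```python
# A rough scikit-learn stand-in for the Weka experiments; feature names
# and hyperparameters are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["review_count", "fans", "aggregated_votes", "aggregated_compliments"]

def evaluate(users: pd.DataFrame) -> None:
    X, y = users[FEATURES], users["is_elite"].astype(int)
    # test:train ratio of 1:2, matching the split used in the paper
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=0)
    for model in (DecisionTreeClassifier(min_samples_leaf=50, random_state=0),
                  LogisticRegression(max_iter=1000)):
        pred = model.fit(X_tr, y_tr).predict(X_te)
        # true positive rate on the Elite class = recall for label 1
        print(type(model).__name__, "Elite TP rate:",
              round(recall_score(y_te, pred), 3))
```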
6. Results and Analysis
Once we finalized our goal of creating a model to determine the elite-worthiness of a Yelp user, we focused on the User attributes and the Review attributes. During the experiment period we used this data in different combinations to arrive at our final model. In this section we describe the reasoning behind each combination and the results of our experiments on each data set.
6.1 Pennsylvania Data Set
Goal: We decided to focus on the data from one state, as it provides a balanced distribution of business types, users, and reviews and helps us understand the behavior of and correlations among these attributes. A smaller data set also makes it easier to try different algorithms.
Data Manipulation: We chose Pennsylvania because it contains the businesses around the CMU campus and ranked second on the restaurants-per-state metric in the academic dataset. The user data set carries no state information for users, so, assuming review activity is local, we picked the businesses in Pennsylvania, selected the reviews for those businesses, and then retrieved the users (and their attributes) behind those reviews. The relation between the Business, User, and Review objects is shown in Figure 5.
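The filtering chain just described can be sketched as three joins; the business_id, state, and user_id fields follow the dataset's JSON schema, while the DataFrame setup is an assumption.

```python
# A sketch of the filtering chain: Pennsylvania businesses, then their
# reviews, then the users behind those reviews. Field names follow the
# dataset's JSON schema; the DataFrame setup is an assumption.
import pandas as pd

def pennsylvania_users(businesses: pd.DataFrame,
                       reviews: pd.DataFrame,
                       users: pd.DataFrame) -> pd.DataFrame:
    pa_ids = businesses.loc[businesses["state"] == "PA", "business_id"]
    pa_reviews = reviews[reviews["business_id"].isin(pa_ids)]
    return users[users["user_id"].isin(pa_reviews["user_id"])]
```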
Datasets Used:
Data Size: 17,791
Elite:Non-Elite: 1:12
Attributes Used: review count, fans, votes.cool,
votes.funny, votes.useful, average stars, compliments.hot,
compliments.more, compliments.list
Results: The results obtained using the J48 Pruned Tree
and Regression Classifier are shown in Table 4.
Discussion: The Pennsylvania data was initially selected because it is smaller and takes less time when trying different algorithms, while still having a balanced distribution of business types, users, and reviews.
The data was divided into test and training data at a ratio of 1:2. After running the J48 graft pruned tree classifier, 95.40% of users were correctly classified, and the ROC area of 95.70% indicates the classification is quite accurate. However, the false positive rate on the Non-Elite class is quite high, meaning many users who could qualify as elite are falsely classified as Non-Elite. The goals for the next step are to expand the algorithm to a larger scale and to reduce the false positive rate.
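For reference, the per-class figures reported in Tables 4 through 8 can be derived from a confusion matrix. The sketch below shows the relationship, with label 1 for Elite and 0 for Non-Elite; it illustrates the metric definitions, not Weka's internals.

```python
# A sketch relating the reported per-class rates to a confusion matrix;
# y_true/y_pred use 1 for Elite, 0 for Non-Elite, and y_score is the
# classifier's probability for the Elite class.
from sklearn.metrics import confusion_matrix, roc_auc_score

def class_rates(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    elite_tp_rate = tp / (tp + fn)   # share of true elites found
    elite_fp_rate = fp / (fp + tn)   # share of non-elites labeled Elite
    roc_area = roc_auc_score(y_true, y_score)
    return elite_tp_rate, elite_fp_rate, roc_area
```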
6.2 Review Data Set
Goal: The academic data set has 1.6 million reviews spread across many users and businesses. The intention of this experiment is to predict elite Yelpers based on the review data alone.
Data Manipulation: To use the review data, we aggregated it by userId for each year. Elite status is granted to users on a yearly basis; being elite one year does not necessarily mean the status is kept the next year. The user data does not reflect this time sensitivity, since most attributes in the user dataset are aggregated across all years since the user joined Yelp. We therefore explored the Review dataset, which has timestamps of when each review was posted.
For a given userId we aggregated the star ratings (1, 2, 3, 4, and 5) the user gave and the votes (funny, cool, and useful) the user received. For each of these userIds we inserted an isElite flag based on the years the user was elite, which are available in the user data set. A sample record set aggregation is depicted in Figure 6.
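The aggregation in Figure 6 can be sketched with a pandas groupby. The column names (date, stars, and flattened votes.* fields) and the elite_years mapping (user_id to the set of years the user held elite status) are assumptions about how the JSON was loaded.

```python
# A sketch of the per-user, per-year aggregation shown in Figure 6.
# Column names and the elite_years mapping are assumptions.
import pandas as pd

def aggregate_reviews(reviews: pd.DataFrame, elite_years: dict) -> pd.DataFrame:
    reviews = reviews.assign(year=pd.to_datetime(reviews["date"]).dt.year)
    grouped = reviews.groupby(["user_id", "year"])
    votes = grouped.agg(funnyVoteCount=("votes.funny", "sum"),
                        usefulVoteCount=("votes.useful", "sum"),
                        coolVoteCount=("votes.cool", "sum"))
    # One NumberOfNStarReviews column per star rating (1 through 5).
    stars = grouped["stars"].value_counts().unstack(fill_value=0)
    stars.columns = [f"NumberOf{int(s)}StarReviews" for s in stars.columns]
    out = votes.join(stars).reset_index()
    out["isElite"] = [year in elite_years.get(uid, ())
                      for uid, year in zip(out["user_id"], out["year"])]
    return out
```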
Datasets Used:
Data Size: 500,967
Elite:Non-Elite: 1:13
Attributes Used: NumberOf5StarReviews, NumberOf4StarReviews, NumberOf3StarReviews, NumberOf2StarReviews, NumberOf1StarReviews, funnyVoteCount, usefulVoteCount, coolVoteCount
Results: The results obtained using the J48 Pruned Tree
and Regression Classifier are shown in Table 5.
Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Elite          85.50%   3.50%    74%        85.50%  79.30%     95.70%
Non-Elite      96.50%   14.50%   98.30%     96.50%  97.40%     95.70%
Weighted Avg.  95.40%   13.40%   95.80%     95.40%  95.50%     95.70%

Table 4. Pennsylvania Data Set Results

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Elite          21.80%   1.00%    62.30%     21.80%  32.30%     69.30%
Non-Elite      99.00%   78.20%   94.20%     99.00%  96.50%     69.40%
Weighted Avg.  93.40%   72.60%   91.90%     93.40%  91.90%     69.40%

Table 5. Review Data Set Results

Figure 6. Data Transformation

Discussion: The data was divided into test and training data at a ratio of 1:2. The weighted average true positive rate is 93.40%, not a significant change from the last experiment. However, while the true positive rate for Non-Elite is 99% and the false positive rate for Elite is 1%, the true positive rate for Elite is only 21.80% and the false positive rate for Non-Elite is 78.20%. This means the classifier tends to classify any given user as Non-Elite rather than Elite: almost 80% of users who could be elite are falsely classified as non-elite. The ROC area also dropped significantly, to 69.40%, indicating poor classification accuracy.
There are two main reasons behind this result:
• Review attributes are not as strongly associated with the
elite status as user attributes.
• Data is highly skewed towards non-elite users.
To achieve more accurate results, we need to stay with user attributes while still taking advantage of the review attributes.
6.3 All User Data
Goal: The academic data set has data on 366K users, with 23 attributes each. The goal of this experiment is to predict elite Yelpers based on the user data alone.
Data Manipulation: We massaged the user-level attributes slightly to obtain parsable data elements. From the feature selection process and the attribute correlation matrix, we identified that the user data set has attributes such as review count, fans, votes.cool, and votes.useful that play a significant role in obtaining elite status. We further aggregated the friends list and the total numbers of votes and compliments into measurable numeric counts.
Datasets Used:
Data Size: 366,715
Elite:Non-Elite: 1:15
Attributes Used: review count, friends, fans, average stars, yelping.since.months, aggregated compliments, aggregated votes
Results: The results obtained using the J48 Pruned Tree
and Regression Classifier are shown in Table 6.
Discussion: After expanding the algorithm to all user data, the result is quite satisfying. The data was again divided into test and training data at a ratio of 1:2.
The weighted average true positive rate is 97%, with the true positive rate for Elite users as high as 98.70%. The ROC area is 94.70%, which means the classification is relatively accurate. However, the false positive rate for Elite is as high as 24.90%, meaning that users who should not be elite are classified into the elite group. We would like to reduce the false positive rate while maintaining the accuracy of the prediction.
This result showed that user attributes are much more strongly associated with elite status than review attributes. However, since elite status is granted on a yearly basis, user attributes still cannot capture the impact of the time factor on elite status. So the next goal is to merge the two datasets, leveraging the strengths of the user attributes and the time dimension of the review data.
Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Elite          98.70%   24.90%   98.20%     98.70%  98.40%     94.70%
Non-Elite      75.10%   1.30%    80.50%     75.10%  77.70%     94.70%
Weighted Avg.  97.00%   23.20%   96.90%     97.00%  97.00%     94.70%

Table 6. All User Data Results

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Elite          79.40%   1.80%    76.30%     79.40%  77.80%     98.60%
Non-Elite      98.20%   20.60%   98.50%     98.20%  98.30%     98.60%
Weighted Avg.  96.90%   19.30%   96.90%     96.90%  96.90%     98.60%

Table 7. Merged Review and User Data Results

Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
Elite          94.20%   4.10%    90.20%     94.70%  92.40%     98.80%
Non-Elite      95.90%   5.80%    97.50%     95.10%  96.70%     98.80%
Weighted Avg.  95.30%   5.30%    95.90%     95.90%  94.90%     98.80%

Table 8. Balanced Merged Review and User Data Results
6.4 Merged Review and User Data
Goal: After independently analyzing the user-level attributes and the review-level attributes, we wanted to measure their combined impact on elite status. So we aggregated the review data and merged it with the user data to predict elite status.
Data Manipulation: We merged the user-level attributes into the review attributes so we could experiment with the combined data set. We converted the yelping since column into a measurable number-of-months field, a factor reflecting how long a user has been active on Yelp. Here reviewCount belongs to the user attributes, while starCount1 through starCount5 belong to the review attributes. The review data captures only about 12% of all reviews users have written, so the sum of all users' reviewCount does not equal the number of reviews in the dataset.
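The yelping-since conversion can be sketched as below; the raw field is a "YYYY-MM" string in the academic dataset, and the reference month used here is an assumption.

```python
# A sketch of the yelping_since conversion; the raw field is a "YYYY-MM"
# string, and the reference month is an assumption.
import pandas as pd

def yelp_months(users: pd.DataFrame, as_of: str = "2015-01") -> pd.Series:
    since = pd.to_datetime(users["yelping_since"], format="%Y-%m")
    ref = pd.Timestamp(as_of)
    # Whole months between each user's join month and the reference month.
    return (ref.year - since.dt.year) * 12 + (ref.month - since.dt.month)
```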
Datasets Used:
Data Size: 366,715
Elite:Non-Elite: 1:15
Attributes Used: yelpMonths, starCount5, starCount4, starCount3, starCount2, starCount1, averageStars, coolComplimentsCount, funnyComplimentsCount, useFulComplimentsCount, friendsCount, fanCount, reviewCount
Results: The results obtained using the J48 Pruned Tree
and Regression Classifier are shown in Table 7.
Discussion: The merged dataset yields better results than the review-only and user-only data. While the weighted average true positive rate stays as high as 96.90%, the average false positive rate dropped below 20%, and the ROC area rose to 98.60%, up from 94.70% on user data alone. Comparing the false positive rates of the Elite and Non-Elite classes, however, the classifier still tends to classify users as non-elite: 20.60% of users who should be elite are falsely classified as non-elite, so the false positive rates of the two classes remain very unbalanced.
This experiment showed that the combined attributes work better for classification. However, the skewed-data problem is not yet solved; the next step is to balance the dataset so that the false positive rate can be further reduced.
6.5 Balanced Merged Review and User Data
Goal: The goal here is to run our experiments on a balanced data set that is not skewed towards Non-Elite members.
Data Manipulation: In all of the datasets mentioned above, the proportion of elite users was very low, so the results were biased towards the non-elite class. To get the right balance, we drew from the merged dataset a subset with a more balanced mix (1:2) of elite vs. non-elite records.
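The rebalancing step amounts to keeping every elite record and downsampling the non-elite records to twice that count, as in the sketch below (the isElite flag follows the merged dataset's naming; the seed is an assumption).

```python
# A sketch of the rebalancing step: keep all elite records and downsample
# non-elite records to twice that count (the 1:2 mix described above).
import pandas as pd

def balance(merged: pd.DataFrame, ratio: int = 2, seed: int = 0) -> pd.DataFrame:
    elite = merged[merged["isElite"]]
    non_elite = merged[~merged["isElite"]].sample(n=ratio * len(elite),
                                                  random_state=seed)
    # Shuffle so train/test splits see both classes throughout.
    return pd.concat([elite, non_elite]).sample(frac=1, random_state=seed)
```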
Datasets Used:
Data Size: 82,000
Elite:Non-Elite: 1:2
Attributes Used: yelpMonths, starCount5, starCount4, starCount3, starCount2, starCount1, averageStars, coolComplimentsCount, funnyComplimentsCount, useFulComplimentsCount, friendsCount, fanCount, reviewCount
Results: The results obtained using the J48 Pruned Tree
and Regression Classifier are shown in Table 8.
Discussion: After balancing the dataset to 33% elite users and 67% non-elite users, we got the best result among all experiments. The data was divided into test and training data at a ratio of 1:2. The weighted average true positive rate is 95.3%, with both Elite and Non-Elite close to 95%. The average false positive rate is 5.3%, balanced between elite and non-elite, and the ROC area of 98.80% is higher than in all previous results.
Given this result, we can confidently conclude that our classifier classifies Yelp users with a weighted average accuracy of 95%.
7. Conclusion and Future Work
Our final model, developed using the J48 tree and linear regression, determines elite users with over 94% accuracy and gives an ROC area of 98.80%, supporting its correctness. However, this model was developed with the academic data set provided by Yelp, which is missing some attributes. With additional attributes, such as the device from which reviews were written, the time taken to write reviews after meals, the proximity from which the reviews were written, and user attributes divided by year, we believe the model could predict elite users with greater accuracy. The model also does not use natural language processing to analyze the content of reviews; applying NLP to this data may yield more conditions for determining elite status. Furthermore, the Yelp Elite Council does not disclose the factors it considers in awarding elite status; the developed model is based only on historic data.
In the future, we would like to try our models on Yelp's complete dataset and check whether they yield similar results; we may have to make some modifications to incorporate the new attributes to achieve similar accuracy. We also plan to submit our results to the 'Yelp Dataset Challenge' to evaluate our findings. Additionally, we will work with other qualitative factors, such as the content of reviews, in an effort to completely eliminate the manual process that Yelp uses to determine elite members.
References
[1] "Yelp Investor Relations." Web. 7 May 2015. http://goo.gl/Iz4ZEo.
[2] "What Is Yelp's Elite Squad?" Web. 7 May 2015. http://goo.gl/DcbkCX.
[3] "Yelp's Academic Dataset." Yelp. Accessed April 5, 2015. https://goo.gl/dHgVmn.
[4] "Feature Selection (Data Mining)." MSDN, Microsoft. 2015. Web. 7 May 2015. https://msdn.microsoft.com/en-us/ms175382.aspx.
[5] Stone, Madeline. "Elite Yelpers Hold Immense Power, And They Get Treated Like Kings By Bars And Restaurants Trying To Curry Favor." Business Insider. August 22, 2014. Accessed April 27, 2015. http://goo.gl/cZyOMN.
