Predicting Yelp Review Star Ratings with Language Feature Analysis
Sebastian Cheah
University of California, San Diego
9500 Gilman Dr
La Jolla, CA 92093
swcheah@ucsd.com

Tianqi Chen
University of California, San Diego
9500 Gilman Dr
La Jolla, CA 92093
tic011@ucsd.edu

Linjie Li
University of California, San Diego
9500 Gilman Dr
La Jolla, CA 92093
lil121@ucsd.com
ABSTRACT
For an assignment, we investigate multiple features of Yelp reviews in order to construct a predictor for review star ratings. Our supervised learning model uses linear/ridge regression to observe the correlation between a set of features and review star ratings. Basic, readily available features include the business's star rating, the user's average star rating, and the total number of votes associated with the review. For more advanced features, we discovered that some language processing techniques applied to review text yield features that correlate well with review star ratings. We combine Latent Dirichlet Allocation (LDA) with other optimizations such as stemming and rounding of edge cases to improve upon the basic feature model. We compare the model's results against a baseline model using mean squared error (MSE) as the metric. The baseline resulted in an MSE of 1.67836502285; our model, using the features we describe, resulted in an MSE of 0.732726208483, a 56.34% improvement over the baseline.

Keywords
Feature Analysis; Linear Regression; Natural Language Processing; Latent Dirichlet Allocation (LDA); Stemming; Regularization
1. INTRODUCTION
Supervised learning is a machine learning task that builds a predictor from an input training set by inferring feature relationships. As such, the extraction of useful features is key to the learning task.
For our assignment, we use the dataset from Yelp to conduct our experiments. This dataset contains reviews and other associated data.
Our goal is to extract features from this dataset to improve upon a baseline predictor (described later). In particular, we conduct an exploratory experiment with natural language processing techniques to determine their viability in extracting good features.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2015 ACM. ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235

Figure 1: Count of reviews with 1-5 star ratings
2. THE DATASET
The current dataset from Yelp we are using is advertised to have:
• 1.6M reviews and 500K tips by 366K users for 61K businesses
• 481K business attributes, e.g., hours, parking availability, ambience
• A social network of 366K users for a total of 2.9M social edges
• Aggregated check-ins over time for each of the 61K businesses
Using this dataset from Yelp, we examine some of its properties in order to determine good candidates for feature modeling. Yelp's dataset includes three data types useful for review star ratings: the user, business, and review schemas. Our hope is to find features that correlate with a review's star rating. First, we plot the count of reviews by star rating to get a first look at rating frequencies. In figure 1, the distribution is left-skewed, with the majority of review ratings at 4-5 stars.

We then examine other readily available features, including the user writing the review and the business being reviewed. Figure 2 shows that the business rating distribution is also left-skewed. However, a single business has multiple reviews in the dataset, so a correlation between the two distributions cannot be observed immediately. In contrast, user average ratings show a varied distribution that is likewise left-skewed (figure 3). Another available feature is the number of votes a review receives. Votes are divided among three categories: useful, funny, and cool; each user can vote once in each category. For simplicity, we sum the categories into a "total votes" feature. In the scatter plot in figure 4, the majority of ratings lie between 3-5 as the number of votes increases, with a small number of outliers combining high vote counts and low star ratings.

Figure 2: Count of business star ratings with 1-5 star rating (rounded to half-stars)

Figure 3: Histogram of user average star ratings
Figure 4: Average number of stars per review with n votes

We preprocess the data due to the immense size of the dataset. We first look at businesses separated by category in order to retain a well-represented subset of the dataset. Table 1 shows the top 10 categories. Many other categories have only a small number of businesses associated with them; these could prove useful for community tasks such as clustering, but that is outside the main focus of our predictive task.

Table 1: Top 10 Business Category Counts
Business Category Count
Restaurants 21892
Shopping 8919
Food 7862
Beauty & Spas 4738
Nightlife 4340
Bars 3628
Health & Medical 3213
Automotive 2965
Home Services 2853
Fashion 2566

We process reviews for restaurants only: 21,892 of the 61,184 total businesses are restaurants, accounting for 990,627 of the 1,569,264 total reviews. We then take 150,000 random reviews from this restaurant review dataset and split them into three sets with an 8:1:1 ratio: Training (size 120,000), Validation (size 15,000), and Test (size 15,000).
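As a concrete illustration, the sketch below shows this preprocessing in Python. The file names are placeholders, and the `categories`, `business_id`, and `stars` fields follow the Yelp dataset schema; this is a minimal sketch, not our exact pipeline.

```python
import json
import random

# Load businesses and keep only restaurants (hypothetical file names).
with open('yelp_academic_dataset_business.json') as f:
    businesses = [json.loads(line) for line in f]
restaurant_ids = {b['business_id'] for b in businesses
                  if 'Restaurants' in (b.get('categories') or [])}

# Keep only reviews written for restaurants.
with open('yelp_academic_dataset_review.json') as f:
    reviews = [json.loads(line) for line in f]
restaurant_reviews = [r for r in reviews if r['business_id'] in restaurant_ids]

# Sample 150,000 reviews and split 8:1:1 into train/validation/test.
random.seed(0)
sample = random.sample(restaurant_reviews, 150000)
train, valid, test = sample[:120000], sample[120000:135000], sample[135000:]
```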
3. PREDICTIVE TASKS
3.1 Baseline
For our baseline, we calculate the global average review star rating. Using this as our baseline prediction, we determine its accuracy by calculating the mean squared error (MSE):

\[ \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - X_i \cdot \theta)^2 \]

With a global average review rating of 3.61844166667 based on the training set, the baseline case (with θ equal to 1) resulted in an MSE of 1.67836502285 on the test set. The baseline is a good measure to compare against when determining whether or not a set of features correlates effectively.
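A minimal sketch of the baseline computation, reusing the train/test lists from the preprocessing sketch above:

```python
import numpy as np

# Predict the training set's global average rating for every test review.
train_stars = np.array([r['stars'] for r in train])
test_stars = np.array([r['stars'] for r in test])

global_avg = train_stars.mean()                          # ~3.618 in the paper
baseline_mse = np.mean((test_stars - global_avg) ** 2)   # ~1.678 in the paper
```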
3.2 Linear Regression
Linear regression models a target value $Y$ from feature values $X_f$ and their corresponding weights $\theta_f$. Note that a constant feature value of 1 provides the bias term. The following formula describes the predictor:

\[ Y = \sum_{f \in \mathrm{features}} \theta_f \, X_f \]
Table 2: Individual Basic Feature Correlations
Feature | Initial Bias | Correlation Coef. | Train MSE | Validation MSE | Test MSE
business stars | 0.23118219 | 0.93947228 | 1.38275398 | 1.33961795435 | 1.472758096
user average stars | 0.0074526 | 0.97277473 | 1.26107949 | 1.21922160516 | 1.28202036485
total votes | 3.63148846 | -0.00687633 | 1.6517479 | 1.60593037535 | 1.67404382551
Above 3 combined | -2.26595628 | [0.75764692, 0.85405507, -0.00926946] | 1.09015062 | 1.04932669113 | 1.14555894359

Table 3: LDA Topic Feature MSEs with 10 topics
Text | Train MSE | Validation MSE | Test MSE
Unchanged | 1.23284007 | 1.23991515279 | 1.24364137738
Stemmed | 1.09244281 | 1.11862814344 | 1.06806245847
Linear regression fits a linear model that minimizes the sum of squared residuals between predicted and true values. It is a simple model that can be used to quickly check the effectiveness of features in predicting review ratings.
3.3 Predictions with Basic Features
Previously, we viewed the distribution of review star ratings against individual features: the business's star rating, the user's average star rating, and the review's total number of votes. Applying linear regression to each feature individually reveals some degree of correlation, as observed in table 2.
When we combine all three features, we achieve a test MSE of 1.14555894359, a 31.75% improvement over the baseline case's test MSE of 1.67836502285. There are other possible features to extract from the dataset; in particular, we turn to the review's text to find new features that improve on the basic prediction model.
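A sketch of this regression with scikit-learn, reusing names from the earlier sketches and assuming a `users` list loaded like `businesses`; the `average_stars` and `votes` field names follow the Yelp schema.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Lookup tables from the business and user schemas.
biz_stars = {b['business_id']: b['stars'] for b in businesses}
user_avg = {u['user_id']: u['average_stars'] for u in users}

def basic_features(review):
    votes = sum(review['votes'].values())  # useful + funny + cool
    return [biz_stars[review['business_id']],
            user_avg[review['user_id']],
            votes]

X_train = np.array([basic_features(r) for r in train])
y_train = np.array([r['stars'] for r in train])
X_test = np.array([basic_features(r) for r in test])
y_test = np.array([r['stars'] for r in test])

model = LinearRegression()   # fit_intercept=True supplies the bias term
model.fit(X_train, y_train)
test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
```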
4. FEATURE ANALYSIS AND MODEL
4.1 Latent Dirichlet Allocation (LDA)
In natural language processing, Latent Dirichlet Allocation (LDA) is a topic model generated from input text documents; in our case, the documents are the texts associated with each review. Under this model, each review text can be viewed as a mixture of topics. Each generated topic is a distribution over words, and these distributions vary between topics. LDA generates a specified number of topics from the training set, and each word in a review text can be seen as being assigned to a topic. The feature vector generated from LDA is a probability distribution over the generated topics, based on the frequency and weight of the words associated with each topic. These distributions can be inferred for the validation and test sets when extracting topic model features from their review text.
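A sketch of topic-feature extraction with gensim; this is one plausible implementation, and the tokenizer here is deliberately simple.

```python
from gensim import corpora, models

# Build a bag-of-words corpus from the training review texts.
docs = [r['text'].lower().split() for r in train]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA with 10 topics (a single pass to start; see section 4.5).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=1)

def topic_features(text, lda, num_topics):
    # A review's feature vector is its inferred topic probability
    # distribution; minimum_probability=0.0 keeps every topic.
    bow = dictionary.doc2bow(text.lower().split())
    dist = dict(lda.get_document_topics(bow, minimum_probability=0.0))
    return [dist.get(k, 0.0) for k in range(num_topics)]
```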
4.2 Stemming (Pre-processing)
Initial results reveal a correlation between topic distributions and review rating: using linear regression with the topic distributions alone already improves upon the baseline case. However, we make note of another language technique: stemming. Stemming merges different inflections of a word, which can help model a more accurate distribution among the topics generated by LDA. On the other hand, stemming can also merge words with different meanings. To demonstrate the effectiveness of LDA with and without stemming, we perform linear regression with two models: one with the review text stemmed and the other unstemmed. Table 3 shows a substantial improvement when stemming is applied before topic generation. With this in mind, all our generated LDA feature distributions use stemming, as it consistently provides improvements across multiple tests. To illustrate the topic modeling, table 4 lists the top 10 stemmed words for 5 of the 10 topics generated by LDA from the training set of the restaurant review dataset.

Table 4: Top 10 stemmed words for 5 out of 10 generated topics
Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5
good | food | order | s | s
sushi | great | t | restaur | t
roll | place | food | locat | place
dish | servic | time | room | like
chicken | good | us | look | can
rice | alway | servic | park | get
order | love | back | old | go
food | time | wait | area | just
like | friendli | ask | strip | good
place | go | tabl | wall | don
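For stemming, one option is NLTK's Porter stemmer, sketched below; our exact tokenizer may differ, but splitting on non-letters reproduces the stray "s" and "t" fragments (from contractions like "it's" and "don't") visible in table 4.

```python
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    # Lowercase, split on non-letter characters (so "it's" -> "it", "s"),
    # then stem each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens]

stem_tokens("Friendly service, always great sushi rolls")
# -> ['friendli', 'servic', 'alway', 'great', 'sushi', 'roll']
```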
4.3 Combining Basic Features with LDA
We must now check whether each selected feature has a good relationship with the star ratings, since poorly selected features may add noise during the training phase of our predictor. To verify, we perform a simple linear regression task with the 10 generated topic features plus the basic features (business stars, user average stars, and total votes). The results finally show a predictor with MSE values below 1, as seen in table 5.
4.4 Rounding Edge Cases
We observe that some of our predictions are less than 1.0 or greater than 5.0, which lies outside the range of possible star ratings. We therefore clamp predictions below the minimum to the minimum, and predictions above the maximum to the maximum. This yields small improvements, as observed in table 5.
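This clamping is a one-liner with NumPy, reusing the fitted model from the earlier sketch:

```python
import numpy as np

# Clamp predictions to the valid star range [1.0, 5.0].
preds = np.clip(model.predict(X_test), 1.0, 5.0)
test_mse = np.mean((preds - y_test) ** 2)
```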
4.5 Number of Passes
An LDA model also takes as input the number of training passes through the training corpus. Previous results used a single pass. We increase the number of passes to further optimize our LDA features: more passes means more updates to the topic model, increasing the likelihood of reaching convergence during training. We tabulate our results in table 6, finding 20 passes to be an effective setting for topic modeling.

Table 5: Basic Features + LDA, 10 topics, with stemming
Rounding? | Train MSE | Validation MSE | Test MSE
No | 0.856218784176 | 0.854454074693 | 0.862671573968
Yes | 0.845763856351 | 0.844435046853 | 0.852894522549

Table 6: Increasing the number of passes on Basic Features + LDA, 10 topics, with stemming and rounding
Passes | Train MSE | Validation MSE | Test MSE
1 | 0.845763856351 | 0.844435046853 | 0.852894522549
20 | 0.810688013962 | 0.814545737192 | 0.82159026054
4.6 Number of Generated Topics

Figure 5: MSE of combined model with increasing number of topics

LDA takes as input the number of topics to generate from the training dataset. We model the relationship between the number of topics and the MSE of the resulting model, combined with the basic features and previous optimizations. In figure 5, we find that as the number of topics increases, the MSE fluctuates rather than decreasing monotonically. Previously, we set aside a validation set but did not use it to tune our model; this is where it plays a role. We choose the number of topics yielding the lowest validation MSE, which in our case was 50 topics.
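A sketch of this validation-set tuning, reusing names from the earlier sketches; the candidate topic counts are illustrative.

```python
# Sweep the number of LDA topics; keep the count with the lowest
# validation MSE of the combined (basic + topic) model.
best_mse, best_k = float('inf'), None
for k in (10, 20, 30, 40, 50, 60):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=k, passes=20)
    feats = lambda r: basic_features(r) + topic_features(r['text'], lda, k)
    reg = LinearRegression().fit([feats(r) for r in train],
                                 [r['stars'] for r in train])
    preds = np.clip(reg.predict([feats(r) for r in valid]), 1.0, 5.0)
    mse = np.mean((preds - np.array([r['stars'] for r in valid])) ** 2)
    if mse < best_mse:
        best_mse, best_k = mse, k
```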
4.7 Regularization
We introduce a regularization parameter to see whether model complexity is affecting our results. Regularization penalizes model complexity during training: the regression minimizes the sum of squared residuals plus a penalty term weighted by the regularization parameter λ (this is also known as ridge regression):

\[ \min_{\theta} \sum_{i=1}^{N} (y_i - X_i \cdot \theta)^2 + \lambda \lVert \theta \rVert_2^2 \]

Figure 6: MSE with regularization

Again, the validation set is used to determine which regularization parameter performs best. We plot the validation MSE against multiple values of λ in figure 6. Based on the results, regularization did not improve upon our model, as a regularization parameter of 0 yielded the best predictor.
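scikit-learn's Ridge implements this objective, with λ exposed as `alpha`. A sketch of the sweep, assuming `X_valid`/`y_valid` are built like `X_train`/`y_train`; the candidate values are illustrative.

```python
from sklearn.linear_model import Ridge

# Evaluate each regularization strength on the validation set
# (alpha=0.0 reduces to ordinary least squares).
for alpha in (0.0, 0.01, 0.1, 1.0, 10.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    preds = np.clip(ridge.predict(X_valid), 1.0, 5.0)
    print(alpha, np.mean((preds - y_valid) ** 2))
```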
5. RELATED LITERATURE
Personalizing Yelp's star ratings relies on topic modeling processes that learn the latent subtopics in review text. Traditional topic modeling lacks methods of incorporating star ratings or semantic analysis into the generative process. Jack Linshi's paper proposed an approximation of a modified latent Dirichlet allocation (LDA) in which the term distributions of topics are conditional on star ratings [1]. He assumed that ratings are an approximate function of positively and negatively connoted adjectives, and implemented this function by adding two different "codewords", indicating the presence of a positive or negative adjective, after each positive and negative adjective in the corpus. This approximation of the modified LDA, the codeword LDA, was shown to produce clearer and more semantically oriented topics than traditional LDA when examining documents' topic mixtures.
6. RESULTS AND CONCLUSION
After establishing a baseline case, we attempted to extract useful features from the dataset in order to predict Yelp review star ratings. Readily available features such as business ratings, user average ratings, and total votes provided a moderate increase in predictor accuracy over the baseline case. When we delved into language processing with an LDA model over the review text, our results improved even further, and the LDA features combined well with the readily available ones. We note that as our training MSE decreases, so do our validation and test MSEs; overfitting to the training data is a possibility, but the results indicate that this is not the case with our model.
Table 7: Final Results with our Model: Basic Features + LDA (50 topics, 20 passes) + Stemming + Rounding Edge Cases
Model Test MSE
Baseline 1.67836502285
Final 0.732726208483
We then further optimized our LDA model: stemming, rounding edge cases, and increasing the number of generated topics each improved our feature model. With this increase in feature complexity, we attempted to apply regularization using a ridge regression model; however, this showed no benefit.
Ultimately, we kept the techniques that benefited the LDA model and report final results using the basic features plus LDA with 50 topics and 20 passes, with stemming and rounding of edge cases, under linear regression. With these techniques and parameters, our final results are listed in table 7. Our final model achieved a 56.34% improvement over the baseline model.
Feature extraction using a natural language processing technique such as LDA produced a generous decrease in predictor error. Beyond LDA, there are other language processing techniques that can extract useful data from review text. One such example is sentiment analysis, another natural language processing technique that determines the polarity of a given text: whether the review is positive, negative, or neutral. We would hope such a feature correlates with a review's star rating. Unfortunately, sentiment analysis proved to have long processing times with our implementation, and we were unable to generate results after more than two days of running on our training data.
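For reference, a lightweight sentiment feature could be computed with, for example, NLTK's VADER analyzer; this is a sketch of the idea, not the implementation we attempted.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time resource download
sia = SentimentIntensityAnalyzer()

def sentiment_feature(text):
    # `compound` is a polarity score in [-1, 1]; negative, neutral, and
    # positive texts score below, near, and above zero, respectively.
    return sia.polarity_scores(text)['compound']
```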
7. REFERENCES
[1] J. Linshi. Personalizing Yelp Star Ratings: A Semantic Topic Modeling Approach.