Yelp Dataset Challenge: Sentiment and Business Attributes Analyses

Overview

Two teammates and I worked together on this project.
Our goal is to train models on the data provided by Yelp, specifically the business data on restaurants in the United States and the reviews written by their customers, in order to predict the relationship between business attributes and a business’ overall success, defined by its star rating (1-5). In addition to business feature insights, we perform a sentiment analysis using NLP on the text of customer reviews to help businesses better understand what drives customer preferences. Ultimately, we hope to provide a tool that lets restaurant owners predict their business’ long-run average star rating and supports their decision-making.
This Yelp Sentiment and Business Attribute Analysis project was completed as the final project for Applied Machine Learning at Cornell Tech.

Duration

Three months

Programming Language

Python

Role

Data Processing, Feature Extraction Models and Machine Learning Prediction Models for Business Factor Attributes Analysis

Background

Yelp is one of the most popular platforms for sharing reviews, photos, and ratings of local restaurants and businesses. These ratings carry valuable information about the quality and performance of a business. They also signal future popularity, since people rely on star ratings and reviews to decide where to go. Small businesses in particular stand to benefit from this kind of rating platform: it helps them gain word-of-mouth traction with new customers and gives them free feedback on how their restaurant performs across many attributes and categories.

Goals

Sentiment analysis of customer reviews using NLP
Business success prediction using feature extraction and machine learning prediction models

Sentiment Analysis Model

The Dataset

We work with the review JSON file from the Yelp Dataset. This dataset contains the following fields: Business ID, Review ID, Text, User ID, Useful, Cool, Funny, Date, Stars and Type.

We work specifically with the ‘text’ and ‘stars’ fields, where ‘stars’ serves as our target output. We split the dataset into 70% training data and 15% hold-out data, which we use to measure accuracy before generalizing our models. Finally, we test on the remaining 15% of the data.
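
A minimal sketch of a 70/15/15 split, assuming a simple random shuffle (the source does not say how rows were assigned to each set). The function name `split_70_15_15` and the `seed` parameter are hypothetical and used only for illustration.

```python
import random

def split_70_15_15(rows, seed=0):
    """Shuffle rows and split them into train / hold-out / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 70 // 100      # integer arithmetic avoids float rounding
    n_holdout = n * 15 // 100
    train = shuffled[:n_train]
    holdout = shuffled[n_train:n_train + n_holdout]
    test = shuffled[n_train + n_holdout:]  # whatever remains, about 15%
    return train, holdout, test

train, holdout, test = split_70_15_15(range(1000))
print(len(train), len(holdout), len(test))
```

The hold-out set is used for model selection, and the test set is touched only once at the end.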

Preprocessing

To preprocess the text reviews, we (1) remove white space, (2) remove punctuation, (3) remove stop words, and (4) lemmatize words. We also created a new column called ‘sentiment’, which converts the star ratings into a positive, neutral, or negative label. Ratings of 1 and 2 map to negative, 3 maps to neutral, and 4 and 5 map to positive.
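
The preprocessing steps and the star-to-sentiment mapping can be sketched roughly as follows. This is an illustration under assumptions: the stop-word list is a tiny made-up one, lowercasing is added so that stop words match, and lemmatization is left out because it needs an external lemmatizer. The names `preprocess` and `stars_to_sentiment` are hypothetical.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "to", "of", "it"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, collapse whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    # Lemmatization (step 4) is omitted in this sketch.
    return " ".join(t for t in tokens if t and t not in STOP_WORDS)

def stars_to_sentiment(stars):
    """Map 1-2 stars to negative, 3 to neutral, 4-5 to positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

print(preprocess("  The food was AMAZING, and the service...   great!  "))
print(stars_to_sentiment(4))
```
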
The review data contains 8,021,122 rows with the following distribution. We first (1) filtered out closed businesses, (2) kept only restaurant businesses, and (3) kept only reviews whose business ID appears in the business dataset. This reduced our dataset to 3,487,813 rows, and the rating distribution looked like the following:

We see here that there is a skew in our ratings towards positive ratings of 4 and 5 and acknowledge that this can lead to issues when predicting on new reviews because the prediction may be skewed towards ratings that have a higher frequency. In order to work with a manageable amount of data, we conducted most of our modeling on a sample of this dataset by extracting 10,000 rows through random sampling with replacement.
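
As a rough sketch, drawing 10,000 rows with replacement could look like this. The `reviews` list is a stand-in for the filtered review rows, and `sample_with_replacement` is a hypothetical helper.

```python
import random

def sample_with_replacement(rows, k, seed=0):
    """Draw k rows uniformly at random; the same row may be drawn more than once."""
    rng = random.Random(seed)
    return rng.choices(rows, k=k)

reviews = [f"review_{i}" for i in range(50_000)]  # stand-in for the filtered reviews
sample = sample_with_replacement(reviews, k=10_000)
# The number of distinct rows can be smaller than the sample size.
print(len(sample), len(set(sample)))
```

Because the draw is with replacement, some reviews can appear more than once in the sample, and the sample keeps the same skew toward 4 and 5 star ratings as the full dataset.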

Feature Extraction Models

For feature extraction of the text, we use two methods. The first model, the Bag of Words Model, is a representation of text that describes the occurrence of words within a document. It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The second model, the TF-IDF model, uses a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf). The terms with higher weight scores are considered to be more important.
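
To make the two representations concrete, here is a small worked sketch in plain Python on three made-up documents. It uses the textbook weighting tf × log(N / df), which is an assumption: the source does not state which TF-IDF variant was used, and libraries such as scikit-learn apply smoothed variants by default.

```python
import math
from collections import Counter

docs = [
    "great food great service",
    "rude service",
    "great pizza",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of Words: one count vector per document, word order discarded.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[w] for w in vocab])

# Document frequency: number of documents containing each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
n_docs = len(docs)

def tfidf_vector(doc):
    """tf = relative frequency in this document, idf = log(N / df)."""
    counts = Counter(doc)
    return [(counts[w] / len(doc)) * math.log(n_docs / df[w]) for w in vocab]

tfidf = [tfidf_vector(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "food" and "service" each occur once, but "food" gets the higher TF-IDF weight because it appears in fewer documents.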

Machine Learning Prediction Models

For the sentiment analysis prediction models, we decided to train three different models, namely (1) multinomial naive Bayes, (2) logistic regression with the multi-class option set to ‘multinomial’ to account for the multiple classes in both the individual star predictions and the sentiment predictions, and (3) linear SVM. For the linear SVM model, we trained on the feature vectors with a variety of regularization hyperparameter values to find the best classification F1-scores. We graph the F1-scores for each hyperparameter in the following section. Together, these algorithms gave us a diverse set of prediction methods that are well suited to text classification.
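
A minimal sketch of this setup, assuming scikit-learn (the source names the models but not the library). The toy texts, the candidate C values, and the weighted F1 averaging are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data (made up for illustration, not from the Yelp dataset).
texts = ["delicious food", "amazing service", "excellent pizza", "awesome place",
         "rude staff", "horrible food", "mediocre pizza", "poor service",
         "okay food", "average place", "fine service", "decent pizza"]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

models = {
    "Multinomial NB": MultinomialNB(),
    # With its default solver, recent scikit-learn fits a multinomial
    # model for multiclass targets, so no extra option is passed here.
    "Logistic": LogisticRegression(max_iter=1000),
}
# Sweep the SVM regularization strength C (candidate values are assumptions).
for c in (0.01, 0.1, 1.0, 10.0):
    models[f"Linear SVM (C={c})"] = LinearSVC(C=c)

scores = {}
for name, model in models.items():
    pipeline = make_pipeline(CountVectorizer(), model)
    pipeline.fit(texts, labels)
    # Scored on the training texts only to keep the sketch short;
    # a real run would score on the hold-out set.
    predictions = pipeline.predict(texts)
    scores[name] = f1_score(labels, predictions, average="weighted")

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the pipeline would give the TF-IDF variant of each run.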

Experimental Analysis

We ran the multinomial naive Bayes, logistic regression, and linear SVM models on the BoW and TF-IDF features of the training set and predicted on the hold-out set. To evaluate our models, we use the F1-score, the harmonic mean of precision and recall, which balances how many predicted positives are correct (precision) against how many actual positives are found (recall): F1 = 2 × (precision × recall) / (precision + recall).
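
The F1 calculation for a single class can be sketched in code as follows. The counts are made up for illustration, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true positive,
    false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

For a multiclass problem these quantities are typically computed per class and then averaged, so a reported F1 need not equal the harmonic mean of the reported precision and recall. Which averaging was used for the tables is not stated in the source.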

We ran our three machine learning algorithms in three configurations, each with different inputs and labels. In the first, we ran the models without text preprocessing and predicted the individual star ratings. In the second, we ran the models with text preprocessing and again predicted the individual star ratings. In the final one, we used the ‘sentiment’ column with preprocessing, since our overall goal was to predict the overarching sentiment of the given text. For the linear SVM, we tuned the hyperparameter C over a range of values to find the one producing the highest F1 score, which in this case was C = 0.1. The following tables show our precision, recall, and F1 scores for the different versions we tested:

Predicting individual stars (1-5) without text preprocessing

Bag of Words Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.575	0.591	0.580
Logistic	0.563	0.574	0.567
Linear SVM	0.525	0.534	0.528

TF-IDF Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.464	0.516	0.309
Logistic	0.569	0.603	0.572
Linear SVM	0.573	0.605	0.564

Predicting individual stars (1-5) with text preprocessing

Bag of Words Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.565	0.585	0.572
Logistic	0.554	0.568	0.559
Linear SVM	0.546	0.567	0.552

TF-IDF Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.497	0.535	0.339
Logistic	0.577	0.603	0.574
Linear SVM	0.541	0.564	0.549

Predicting sentiment (positive, neutral, negative) with text preprocessing

Bag of Words Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.806	0.810	0.807
Logistic	0.795	0.805	0.799
Linear SVM	0.790	0.807	0.796

TF-IDF Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.709	0.789	0.735
Logistic	0.795	0.805	0.799
Linear SVM	0.790	0.807	0.796

Final Accuracy Score of Bag of Words Model

Bag of Words Model
Model	Precision	Recall	F1 Score
Multinomial NB	0.796	0.807	0.801
Logistic	0.805	0.821	0.810
Linear SVM	0.801	0.821	0.807

Business Attributes Analysis Model

The Dataset

We work with the business JSON file from the Yelp Dataset. Out of all the columns, we use the ‘city’, ‘state’, ‘postal code’, ‘stars’, ‘review_count’, and ‘attributes’ fields. All fields other than ‘stars’ serve as x, while ‘stars’ serves as y, the target. We split the dataset into 70% training, 15% hold-out, and the remaining 15% test data.

Preprocessing

To preprocess the business data, we handled the following: (1) missing data, (2) businesses other than restaurants, (3) closed restaurants, and (4) restaurants outside the United States. We also created a new column called ‘binary_star’ that converts ratings into either positive (1) or negative (0). Ratings of 1 and 2 map to negative, and ratings of 4 and 5 map to positive.
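
A rough sketch of the restaurant filter and the ‘binary_star’ column, under assumptions: the field names `is_open`, `categories`, and `stars` are taken to match the Yelp business JSON, the records are made up, and ratings between 2 and 4 are left unmapped because the source does not say how they were handled. Missing-data handling and the United States filter are omitted.

```python
def keep_business(business):
    """Keep open businesses whose categories mention restaurants."""
    categories = business.get("categories") or ""
    return business.get("is_open") == 1 and "Restaurants" in categories

def binary_star(stars):
    """Map ratings up to 2 to 0 and ratings from 4 to 1; others stay unmapped."""
    if stars <= 2:
        return 0
    if stars >= 4:
        return 1
    return None  # not specified in the source

businesses = [  # made-up records for illustration
    {"name": "A", "is_open": 1, "categories": "Restaurants, Pizza", "stars": 4.5},
    {"name": "B", "is_open": 0, "categories": "Restaurants", "stars": 2.0},
    {"name": "C", "is_open": 1, "categories": "Hair Salons", "stars": 5.0},
    {"name": "D", "is_open": 1, "categories": "Thai, Restaurants", "stars": 3.0},
]
kept = [dict(b, binary_star=binary_star(b["stars"])) for b in businesses if keep_business(b)]
print([(b["name"], b["binary_star"]) for b in kept])
```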

Feature Extraction Models

We extracted all business features that could plausibly affect a restaurant’s average star rating. Afterwards, categorical features were converted to binary (yes/no) indicators.
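
One common way to turn a categorical feature into binary indicators is one-hot encoding, sketched below. Whether the project used exactly this encoding is an assumption, the attribute values are made up, and `one_hot` is a hypothetical helper.

```python
def one_hot(records, feature):
    """Expand a categorical feature into one 0/1 column per observed value."""
    values = sorted({r[feature] for r in records})
    return [
        {f"{feature}={v}": int(r[feature] == v) for v in values}
        for r in records
    ]

# Made-up attribute values for illustration.
restaurants = [{"WiFi": "free"}, {"WiFi": "no"}, {"WiFi": "paid"}, {"WiFi": "free"}]
for row in one_hot(restaurants, "WiFi"):
    print(row)
```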

Machine Learning Prediction Models

For the business factor analysis prediction models, we trained several regression models, including (1) random forest regression, (2) decision tree regression, and (3) linear regression. For regression models, the target output is a continuous variable. To measure accuracy, we used the MSE, MAE, and RMSE.

Experimental Analysis

We ran various regression models on the training set and predicted on the hold-out set. After running the models, we used the mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE) to evaluate the accuracy of our models:
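
The three error measures can be sketched in a few lines. The ratings below are made up for illustration, and `regression_errors` is a hypothetical helper name.

```python
import math

def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for paired true and predicted values."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

# Made-up star ratings for illustration.
y_true = [4.0, 3.5, 2.0, 5.0]
y_pred = [3.5, 3.5, 3.0, 4.5]
mae, mse, rmse = regression_errors(y_true, y_pred)
print(mae, mse, round(rmse, 4))  # 0.5 0.375 0.6124
```

Because RMSE is the square root of MSE, an MSE of 0.464 corresponds to an RMSE of about 0.681.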

Predicting Average Star Rating With Business Factors

Model	MAE	MSE	RMSE
Linear Regression	0.537	0.464	0.681
Logistic Regression	0.292	0.293	0.541
Decision Tree Regression	0.584	0.578	0.760
AdaBoost Regression	0.551	0.475	0.690
Random Forest Regression	0.557	0.511	0.715

Random Forest Regression

For the random forest regressor, we searched for the best parameters. The baseline model has an average error of 0.5526 stars and an accuracy of 81.93%. After fitting on the dev data, the average error decreased to 0.5216 stars and the accuracy improved to 82.66%. This is an improvement of 0.90%, which can be significant on a 5-star rating scale.

Decision Tree Regression

For the decision tree regressor, we searched for the best parameters. The baseline model has an average error of 0.5834 stars and an accuracy of 81.16%. After fitting on the dev data, the average error decreased to 0.5216 stars and the accuracy improved to 82.66%. This is an improvement of 1.86%, which can be significant on a 5-star rating scale.
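
The improvement percentages appear to be relative changes in accuracy. This reading is an assumption, since the source does not define how the improvement was computed. A quick arithmetic check with the quoted accuracies:

```python
def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return 100.0 * (after - before) / before

# Accuracy figures quoted for the two tree-based models.
forest_gain = relative_improvement(81.93, 82.66)  # random forest
tree_gain = relative_improvement(81.16, 82.66)    # decision tree
print(round(forest_gain, 2), round(tree_gain, 2))
```

With the rounded accuracies this gives roughly 0.89% and 1.85%. The small gap from the reported 0.90% and 1.86% may come from using unrounded accuracies in the original calculation.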

AdaBoost Regression

The initial prediction of AdaBoost on the training data is the following: [MAE: 0.5524, MSE: 0.4779, RMSE: 0.6913]. This improved to [MAE: 0.5465, MSE: 0.4629, RMSE: 0.6804].

Linear Regression

The initial prediction of linear regression on the training data is the following: [MAE: 0.5320, MSE: 0.4635, RMSE: 0.6808]. While the MAE did not change, the MSE improved to 0.4397 and the RMSE to 0.6631.

Discussion

Sentiment Analysis Model

From the experimental results, we can see that: (1) Of the two feature extraction models, the Bag of Words model generally performed better than the TF-IDF model. (2) There did not seem to be a significant difference between preprocessing the text and not preprocessing it, and in some cases preprocessing led to worse performance. After more research, this seems plausible, since some stop words convey sentiment and removing them can reduce the accuracy of the models. (3) Within the BoW runs, logistic regression with the multi-class option set to multinomial gave the best performance, with an F1 score of 0.810. (4) Some of the top words that contributed to a high rating were ['amaze' 'delicious' 'excellent' 'awesome' 'exactly']. On the other hand, some of the top words that contributed to a poor rating were ['mediocre' 'rude' 'overprice' 'horrible' 'poor']. These words suggest that a number of factors are at play, such as the quality of the food, the service, and the overall experience. This information is especially useful for restaurant owners because it shows them, in an easily digestible format, which factors contribute to high or low ratings. From there, they can adjust their dining options and experience accordingly.

Business Attributes Analysis Model

From the experimental results, we can see that: (1) Both the Random Forest Regression and Decision Tree models can improve substantially with grid search. (2) Random Forest Regression works best with 600 trees in the forest, a minimum of 10 samples required to split an internal node, a minimum of 2 samples required at a leaf node, the square root of the number of features considered when looking for the best split, and a maximum tree depth of 60. (3) The Decision Tree model performs best with a split criterion of mean squared error with Friedman’s improvement score, a maximum tree depth of 6, and a minimum of 2 samples required to split an internal node.
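
A minimal grid search sketch, assuming scikit-learn (the parameter descriptions match its estimators, but the source does not name the library). The data is synthetic, the grid is deliberately tiny so the example runs quickly, and attributing the Friedman criterion settings to the decision tree is an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the business features (not Yelp data).
X, y = make_regression(n_samples=200, n_features=8, noise=0.5, random_state=0)

# Small illustrative grid; the values searched in the project are not given.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6], "min_samples_split": [2, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # pick the setting with the lowest MAE
    cv=3,
)
search.fit(X, y)
print(search.best_params_)

# The reported settings, written as scikit-learn estimators (not fitted here).
reported_forest = RandomForestRegressor(
    n_estimators=600, min_samples_split=10, min_samples_leaf=2,
    max_features="sqrt", max_depth=60,
)
# Assumption: the Friedman criterion settings belong to the decision tree.
reported_tree = DecisionTreeRegressor(
    criterion="friedman_mse", max_depth=6, min_samples_split=2,
)
```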

Conclusion

In this project, we presented multiple machine learning methods that help business owners expand their current suite of resources and better understand their customer base through both reviews and business attributes. We created two models to this end. The first, the sentiment analysis model, ran multinomial naive Bayes, logistic regression, and linear SVM. The second, the business attributes model, ran linear regression, logistic regression, decision tree regression, and random forest regression. Overall, the set of business attributes we tested seems to be a reasonable predictor of a restaurant’s average star rating.