Binary Search Query Classifier with Ensemble Models
Esteban Ribero – Chicago, March 2020
Abstract
In this paper, I develop a custom binary classifier of search queries for the makeup category
using different Machine Learning techniques and models. An extensive exploration of shallow and
Deep Learning models was performed within a cross-validation framework to identify the top three
models, optimize them by tuning their hyperparameters, and finally create an ensemble of models
with a custom decision threshold that outperforms all other models. The final classifier achieves an
accuracy of 98.83% on a held-out test set, making it ready for production. The conclusions confirm some of
the common wisdom in Machine Learning regarding the size and quality of data, shallow vs Deep
Learning models, and ensembles vs individual models.
Introduction
Search data from search engines such as Google and Bing are a great source of information
about consumers’ needs and opportunities for content development and strategic media placement
for marketers. However, the search data that is available to marketers is sometimes limited to lists
of keywords with estimates of their search volume, cost per click, and competitive level. To expand
and complement these data, the Intent Lab – a research partnership between Performics (a leading
performance marketing agency) and Northwestern University’s Medill School of Journalism –
performs primary research with consumers to better understand their intent when using search
engines.
In a recent study (to be published), The Intent Lab asked more than 1700 consumers of the
makeup category to write a search query that they would use to research the category when
looking to refresh their makeup. Each respondent submitted a search query and was asked to
classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit”)
or a Task-query (used to learn more about how to accomplish a broad goal such as “find new
makeup look”). It turns out that knowing what type of query a searcher is performing provides clear
guidance into the type of content that should be served to those consumers. To expand the
learnings beyond the study and build a practical application to automatically label thousands of
search queries as Goal-oriented (Task-search) vs Item-oriented (Item-search), it is necessary to
develop and train a custom search-query classifier. This classifier can then be used to analyze the
publicly available data provided to marketers by the type of search query on an ongoing basis. The
purpose of this paper is to report on the development of such a classifier.
Literature review
Deep Learning has become a popular toolkit for Data Scientists trying to use the latest in
Machine Learning and AI (Sejnowski, 2018). Natural Language Processing (NLP) has been one of the
areas much influenced by these recent developments, although the field is vast and includes several
traditional approaches (Lane, Howard, and Hapke, 2019) that could be reframed nowadays as
shallow learning (Chollet, 2018). Classifying search queries into distinct buckets can be conceived as
a case of Supervised Learning, and the range of available models suited to the task is vast. Classifying queries
into Task-searches or Item-Searches is a binary classification problem and well suited for a variety of
techniques. It is often advised to try simpler, shallow learning models before moving on to more
complex deep learning models (Chollet, 2018). The following section describes some of the most
common models.
Logistic Regression is commonly used to estimate the probability that an instance belongs
to a particular class. The model computes a weighted sum of the input features (plus a bias term),
but instead of outputting the result directly like the Linear Regression model does, it outputs the
logistic of this result (Géron, 2019). The logistic is a sigmoid function whose S-shaped curve squashes the
input (the weighted sum of input features plus bias, in this case) into a value
between 0 and 1. Once this probability is estimated, a prediction can be made by choosing a
threshold and assigning the observation to either one class or the other. The threshold is by default
0.5 but this can be adjusted if needed. This sigmoid function is also often used in Deep Learning as
the final layer of a neural network binary classifier. A great feature of Logistic Regression is the
ability to identify the features that drive the classifier's decision; the coefficients of the
regression convey this information (Géron, 2019).
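As an illustration, the sketch below fits a Logistic Regression on TF-IDF features of a few made-up queries, reads out class probabilities, applies the default 0.5 threshold, and inspects the coefficients. The vectorization choice, the toy data, and the variable names are assumptions for the sake of the example, not the setup used later in this paper.

```python
# Hypothetical sketch: TF-IDF features + Logistic Regression for Item vs Task queries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

queries = ["makeup kit", "lipstick", "how to refresh my makeup", "find new makeup look"]
labels = [0, 0, 1, 1]  # 0 = Item-search, 1 = Task-search (made-up labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(queries)

clf = LogisticRegression()  # default L2 penalty, C=1.0
clf.fit(X, labels)

# Estimated probability of the positive class, then a default 0.5-threshold decision
proba = clf.predict_proba(vectorizer.transform(["best mascara"]))[:, 1]
prediction = (proba >= 0.5).astype(int)

# Coefficients indicate which terms push the decision toward each class
term_weights = dict(zip(vectorizer.get_feature_names_out(), clf.coef_[0]))
```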
Support Vector Machines (SVMs) are powerful ML models that are particularly well suited for
classifying complex small or medium-sized datasets (Géron, 2019). An SVM places a decision boundary
between the two classes and relies on the instances closest to that boundary, the "support" vectors,
to push the boundary as far away from each class as possible, producing a
"large margin classification" (Géron, 2019). Unfortunately, unlike the other binary classifiers discussed here, SVM
classifiers do not output class probabilities by default, which makes them hard to use as part of an ensemble
of models that averages the prediction probabilities at inference time to reach a pooled
classification decision (soft voting). However, they can be used effectively in ensembles if hard
voting is used instead.
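For instance, a hard-voting ensemble that includes an SVC could be sketched as follows in Scikit-Learn; the estimator choices and the commented-out data names are illustrative assumptions.

```python
# Hypothetical sketch: hard (majority) voting lets an SVC join an ensemble
# even though it does not expose class probabilities by default
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

hard_ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC())],
    voting="hard",  # majority class vote; "soft" voting would require predict_proba
)
# SVC(probability=True) would add Platt-scaled probabilities, at extra training cost,
# which is what soft voting needs.
# hard_ensemble.fit(X_train, y_train); hard_ensemble.predict(X_test)  # hypothetical data
```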
Random Forest is among the most popular ML algorithms. It is an ensemble of Decision
Trees, where each tree is usually trained on a subset of the samples selected at random, most often
with bootstrapping (sampling with replacement). The algorithm can also randomly choose the features available
to each tree. This introduces extra randomness when growing trees because, instead of searching
for the very best feature when splitting a node, it searches for the best feature among a random
subset of features (Géron, 2019). Random Forests have several hyperparameters that can be used to
fine-tune the model. The method provides probability estimates as well as specific class
predictions, and it also provides feature importances.
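A minimal sketch of a Random Forest on toy bag-of-words counts, showing the probability estimates and impurity-based feature importances mentioned above; the data and hyperparameter values are illustrative, not taken from this study.

```python
# Hypothetical sketch: a Random Forest with bootstrapped samples and random feature subsets
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])  # toy bag-of-words counts
y = np.array([0, 1, 0, 1])

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    random_state=42,
)
rf.fit(X, y)
class_probabilities = rf.predict_proba(X)  # probability estimates per class
importances = rf.feature_importances_      # impurity-based feature importances
```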
Gradient Boosting is another ensemble method that has become extremely popular among
Data Scientists (Chollet, 2018). Gradient Boosting works by sequentially adding predictors, usually
trees, to an ensemble, each one correcting its predecessor. The method tries to fit the new
predictor to the residual errors made by the previous predictor. Unlike Random Forest, where each
tree is grown independently of the others, Gradient Boosting trains trees sequentially. This makes the
method slower, but it tends to perform better. Fortunately, Tianqi Chen and Carlos Guestrin
(2016) developed an optimized system for gradient boosting that drastically improves
speed and scalability. The system is called Extreme Gradient Boosting (XGBoost) and is freely
available via many open-source packages.
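A minimal sketch using the XGBoost package's Scikit-Learn wrapper on toy data; the hyperparameter values shown are illustrative defaults and not the settings tuned later in this paper.

```python
# Hypothetical sketch using the xgboost package's Scikit-Learn wrapper
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])  # toy features
y = np.array([0, 1, 0, 1])

xgb = XGBClassifier(
    n_estimators=100,   # boosting rounds: trees are added sequentially
    max_depth=3,        # depth of each tree
    learning_rate=0.3,  # shrinkage applied to each new tree's contribution
)
xgb.fit(X, y)
task_probabilities = xgb.predict_proba(X)[:, 1]
```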
Deep Learning methods for text classification can be divided into two groups: simple Neural Nets
(NNs) that do not take word order into consideration (BOW models) and models that do. Simple NNs
are stacks of fully connected layers with a sigmoid classifier on top. Models that do take into
account the context and/or word order are: 1) 1D-CNNs, useful to extract ordered patterns in
sequences of words regardless of where they appear in the text, and 2) Recurrent NN (RNNs) that
treat the word inputs as time-based representations where the input in the present is combined
with all previous inputs from the past. There are several variations of RNNs that have been designed
over the years to overcome some obstacles like the exploding or vanishing gradient problem. The
most popular ones are the Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and
the Gated Recurrent Unit (GRU) (Cho et al. 2014). There are other types of Deep Learning
techniques for language processing and understanding called Transformers that rely on attention
mechanisms that help the models weigh the importance of the surrounding sequences. One of the
most popular is BERT and its variations (Devlin, Chang, Lee, Toutanova, 2018). As with many Deep
Learning problems, it is not always clear what is the best approach for a given problem, and so it is
often suggested to start with the simplest solution and gradually increase the complexity by trying
different approaches (Chollet, 2018).
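As a point of reference for the order-aware variants discussed above, the sketch below shows a minimal Keras binary text classifier with an embedding layer, an LSTM layer, and a sigmoid output; the layer sizes, optimizer, and loss are illustrative assumptions rather than settings taken from this study.

```python
# Hypothetical Keras sketch: embedding + LSTM + sigmoid binary classifier
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=100),  # learned word vectors (sizes are illustrative)
    LSTM(64),                                   # order-aware recurrent layer
    Dense(1, activation="sigmoid"),             # binary classifier on top
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=10, validation_split=0.1)  # hypothetical data
```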
In the following section, I describe the approach taken to explore these different
techniques to classify search queries performed by consumers of the makeup category and
eventually narrow down on the best solution for this task.
Method
Initial Data Set. The Intent Lab study described in the introduction asked more than 1700
consumers of the makeup category to write a search query that they would use to research the
category when looking to refresh their makeup. Each respondent submitted a search query and was
asked to classify the query as being an Item-query (one used to look for a specific item, such as
“makeup kit” ) or Task-query (used to learn more about how to accomplish a broad goal such as
“find new makeup look”). Respondents were also provided with two more options: the search
query is “Both” a task and item search or “I don’t know”. Many respondents used the same search
query, but they did not always agree as to what type of query it was. This means the initial data is
noisy. To resolve these discrepancies, I used majority voting to decide what type of query it was and
assigned the label with the majority of votes. When there was a tie between any choice and “Both”,
I selected “Both”, since it encompasses the other two options. When the tie was between Task and
Item, I chose Task, the more general option, unless it was obvious from the search query that the
individual was asking for a specific item. After cleaning for duplicates with disagreement,
irrelevant queries (“abc def ceb”), and removing “I don’t know”s, I ended up with n=954 queries. I
performed an initial exploration of methods with these data (reported in a separate report and
provided here in the appendix) and concluded that more data was needed.
Final Data Set. To simplify the task into a binary classification challenge, I reclassified the
queries with a “Both” label into either a Task or an Item search. Most of those queries ended up
being Task-searches, which are the more abstract of the two. Item-searches, by definition, mention a
specific item, such as “mascara”, “foundation”, or “lipstick”, so it was relatively easy to identify
them. I also looked at the ones labeled “I don’t know” and was able to correctly classify them into
one or the other class. This helped me expand the workable set of search queries to around 1200,
but most (71%) ended up being Task-searches, probably due to the way the question was asked,
which primed consumers to think about a broader goal rather than a specific item. The question
was something equivalent to: “what would you type in the search bar if you were going to use a
search engine to look for information to help you refresh your makeup look?”. To balance the data
set and expand it even further, I manually identified 1150 Item-searches and 650 Task-searches from
a list of >20,000 keywords downloaded from Google Keyword Planner. The resulting and final data
set is a perfectly balanced list of 1500 Item-searches and 1500 Task-searches for the makeup
category, for a total of n = 3000. I randomly split the data set into a Development set (for K-fold
cross-validation) and a Test set with an 80/20 split.
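A minimal sketch of such an 80/20 split with Scikit-Learn; the placeholder queries, labels, and random seed are illustrative assumptions standing in for the 3000 labeled queries.

```python
# Hypothetical sketch of the 80/20 Development/Test split
from sklearn.model_selection import train_test_split

queries = ["makeup kit", "mascara", "lipstick", "how to refresh my makeup", "find new makeup look"]
labels = [0, 0, 0, 1, 1]  # placeholders for the 3000 labeled queries

X_dev, X_test, y_dev, y_test = train_test_split(
    queries, labels,
    test_size=0.20,   # 80/20 Development/Test split
    random_state=42,  # the seed is an assumption; stratify=labels would preserve the 50/50 balance
)
```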
Models. I trained three sets of models:
A) The four “shallow learning” models described above: 1 Logistic Regression; 2 Random
Forest; 3 Extreme Gradient Boosting; and 4 Support Vector Classifier (SVC). I used the
default settings for these models as provided by the Python Scikit-Learn library.
B) Several Deep Learning models using word vectorization and different topologies/types
of layers: A basic NN with a single layer with 100 nodes using one-hot-encoding of the
input sequences (5 Baseline_NN_100_nodes). The same model using a trainable
embedding layer of shape (2000, 100) (6 Embedding_1_NN_100_nodes); there were
only 1556 words in the vocabulary, but I set the dimension to 2000 to avoid collisions
when hashing. A model with a simple RNN layer with 100 nodes and a pre-trained word
embedding using GloVe vectors of size 100 (7 Pre-trained_embedding
_simple_RNN_100). The same model with a learnable embedding layer (8
Embedding_simple_RNN_100). An LSTM layer with 100 nodes and the same learnable
embedding layer (9 Embedding_1_LSTM_layer_100). And a 1D-CNN layer of size 100
with a window size of 3 and Global Max Pooling before the classifier (10 Embedding_
1D_CNN_100_GobalMaxPooling); a sketch of this last topology appears after this list. I
used sequences of size 10 with padding for all embedding layers in the models above.
C) Two character-level Deep Learning models, both with an embedding layer of size 50
and 100, and padded sequences of size 40. One has a single 1D-CNN layer of size 32 with
a window of 10 characters and Max Pooling with 3 filters (11 Char_Embedding_1D_
CNN_32_10_MaxPooling_3). The other has two stacks of 1D-CNNs of size 100 with Max
Pooling with three filters in between and a final Global Max Pooling layer before the
classifier (12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling). This is the
most complex model of all.
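As referenced in item B above, here is a hypothetical Keras sketch of the word-level 1D-CNN topology (model 10): an embedding of shape (2000, 100) over padded sequences of length 10, a Conv1D layer with 100 filters and a window of 3, Global Max Pooling, and a sigmoid classifier. The activation, optimizer, and loss are assumptions not stated in the text.

```python
# Hypothetical Keras sketch of model 10: embedding + 1D-CNN (window 3) + Global Max Pooling
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

cnn_model = Sequential([
    Embedding(input_dim=2000, output_dim=100),               # word embedding of shape (2000, 100)
    Conv1D(filters=100, kernel_size=3, activation="relu"),   # 100 filters over word windows of 3
    GlobalMaxPooling1D(),                                    # keep the strongest response per filter
    Dense(1, activation="sigmoid"),                          # Task vs Item classifier
])
cnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Inputs are padded word-index sequences of length 10, as described above.
```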
Apart from the models above, I created a random classifier for reference; it is shown in Table
1 along with the other models. After training these models, I chose the three winning models
and optimized them by tweaking their hyperparameters. I trained 6 versions of the Logistic Regression
and 10 versions of the XGBoost model. I did not tweak the Baseline NN, given the extensive
exploration of different topologies and types of layers already performed for the Deep Learning
models.
Metrics. I used Kfold (4 folds) cross-validation with the Development data set and estimated
the average Train and Validation Accuracy across the folds. I also estimated the average ROC-AUC
score on the Validation sets, and I estimated the total cross-validation training time. I then retrained
the three winning models using all the available training data and evaluated them on the holdout test
data set using Accuracy, F1 Score, and ROC-AUC. I used a virtual machine with a GPU via
Google Colaboratory for all the experiments.
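A minimal sketch of how such a 4-fold cross-validation loop could be set up with Scikit-Learn; the pipeline, placeholder data, and random seed are assumptions, not the exact code used in this study.

```python
# Hypothetical sketch of 4-fold cross-validation reporting accuracy and ROC-AUC
from sklearn.model_selection import KFold, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dev_queries = ["makeup kit", "lipstick", "how to refresh my look", "find new makeup look"] * 10
dev_labels = [0, 0, 1, 1] * 10  # placeholders for the development set

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
cv = KFold(n_splits=4, shuffle=True, random_state=42)

scores = cross_validate(
    pipeline, dev_queries, dev_labels,
    cv=cv,
    scoring=["accuracy", "roc_auc"],
    return_train_score=True,
)
mean_train_acc = scores["train_accuracy"].mean()
mean_val_acc = scores["test_accuracy"].mean()
mean_val_auc = scores["test_roc_auc"].mean()
```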
Model ensembling. In a final exploration of the best models, I ensembled the predictions of
the top three and top two classifiers by averaging their prediction probabilities and identified the
optimal decision threshold to maximize the true-positive rate while keeping the false-positive rate as
low as possible.
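A minimal sketch of this ensembling and thresholding step: average the predicted Task-search probabilities of two models and apply the chosen cutoff. The probability arrays below are made-up placeholders.

```python
# Hypothetical sketch: soft-voting ensemble of two models plus a custom decision threshold
import numpy as np

# Made-up arrays of P(Task-search) on the test set from the two tuned models
p_logistic = np.array([0.95, 0.40, 0.81, 0.10])
p_xgboost = np.array([0.90, 0.35, 0.70, 0.20])

p_ensemble = (p_logistic + p_xgboost) / 2.0  # average the prediction probabilities

threshold = 0.78  # custom cutoff chosen from the ROC analysis
predictions = (p_ensemble >= threshold).astype(int)  # 1 = Task-search, 0 = Item-search
```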
Results
Table 1. Model comparison (K-fold cross-validation results)

| Model | Train accuracy | Validation accuracy | Validation ROC-AUC | Training time (seconds) |
| --- | --- | --- | --- | --- |
| Random_Classifier | 0.495 | 0.505 | 0.523 | 0.069 |
| 1 Logistic_Regression | 0.987 | 0.980 | 0.996 | 0.577 |
| 2 Random_Forest | 1.000 | 0.973 | 0.993 | 4.894 |
| 3 XGBoost | 0.984 | 0.983 | 0.995 | 24.493 |
| 4 SVC | 0.991 | 0.983 | 0.993 | 70.915 |
| 5 Baseline_NN_100_nodes | 0.997 | 0.976 | 0.995 | 13.157 |
| 6 Embedding_1_NN_100_nodes | 0.995 | 0.976 | 0.994 | 18.456 |
| 7 Pre-trained_embedding_simple_RNN_100 | 0.972 | 0.956 | 0.977 | 20.193 |
| 8 Embedding_simple_RNN_100 | 0.997 | 0.958 | 0.989 | 23.984 |
| 9 Embedding_1_LSTM_layer_100 | 0.998 | 0.960 | 0.987 | 39.607 |
| 10 Embedding_1D_CNN_100_GobalMaxPooling | 0.998 | 0.973 | 0.994 | 18.607 |
| 11 Char_Embedding_1D_CNN_32_10_MaxPooling_3 | 0.998 | 0.964 | 0.988 | 48.268 |
| 12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling | 0.996 | 0.965 | 0.990 | 77.049 |
As can be seen in Table 1, all models perform extremely well, surpassing 95% accuracy on
the validation set, and the ROC-AUC scores are all above 0.97. The top-performing models are
the Logistic Regression, with a ROC-AUC score of 0.996 and a validation accuracy of 0.980, and
the XGBoost, with a validation accuracy of 0.983 and a ROC-AUC of 0.995. The SVC follows these
models very closely but at a significantly higher computational cost. Of the Deep Learning models,
the simplest two NNs were the winners, with similar ROC-AUC scores but lower validation
accuracy. Although the most complex of all the models (model 12) achieves high performance, it did
not beat the simplest NN or even the worst-performing of the shallow learning models. This is
another case where less is more: the most traditional of all the classifiers (Logistic Regression)
won the top prize with the most cost-effective implementation. It took only 0.57 seconds to train.
Table 2. Hyperparameter tuning of the winning models (K-fold cross-validation results)
| Model | Train accuracy | Validation accuracy | Validation ROC-AUC | Training time (seconds) |
| --- | --- | --- | --- | --- |
| 1 Logistic_Regression | 0.9872 | 0.9804 | 0.9958 | 0.577 |
| 2 Logistic_Regression_C_0.5 | 0.9832 | 0.9788 | 0.9954 | 0.345 |
| 3 Logistic_Regression_C_1.5 | 0.9894 | 0.9825 | 0.9959 | 0.486 |
| 4 Logistic_Regression_C_2 | 0.9907 | 0.9829 | 0.9960 | 0.476 |
| 5 Logistic_Regression_C_5 | 0.9953 | 0.9838 | 0.9959 | 0.561 |
| 6 Logistic_Regression_elasticnet_l1_0.5 | 0.9853 | 0.9838 | 0.9960 | 9.496 |
| 1 XGBoost | 0.9842 | 0.9833 | 0.9948 | 24.493 |
| 2 XGBoost_lr_1 | 0.9874 | 0.9813 | 0.9926 | 20.157 |
| 3 XGBoost_lr_0.05 | 0.9697 | 0.9679 | 0.9924 | 19.854 |
| 4 XGBoost_max_depth_2 | 0.9782 | 0.9779 | 0.9938 | 14.717 |
| 5 XGBoost_max_depth_4 | 0.9851 | 0.9842 | 0.9951 | 24.363 |
| 6 XGBoost_max_depth_5 | 0.9854 | 0.9829 | 0.9956 | 29.120 |
| 7 XGBoost_max_depth_5_200_trees | 0.9864 | 0.9825 | 0.9948 | 58.037 |
| 8 XGBoost_max_depth_4_200_trees | 0.9861 | 0.9829 | 0.9951 | 48.427 |
| 9 XGBoost_max_depth_5_L2_0.5 | 0.9857 | 0.9838 | 0.9955 | 29.010 |
| 10 XGBoost_max_depth_5_L2_5 | 0.9833 | 0.9800 | 0.9953 | 28.690 |
Table 2 shows the results of the different variations of the top-performing models to
identify the optimal settings for some of the hyperparameters. Looking at the results, an interesting
finding emerges: up to a point, the models with less regularization are the ones performing best.
The Logistic Regression with L2 regularization and a C value of 5 (vs C = 1, the default) is the top
performer, followed by the XGBoost with a max_depth of 5 (deeper trees), which tends to fit the
training data better but generalize worse. Presumably, this indicates that the data are very homogeneous
and the test set ends up being very similar to the train set.
Table 3 shows the final results of the winning models and of the ensembles of the top 2 and top
3 models, with the standard 0.5 threshold as well as the optimized threshold. The winner among all
approaches appears to be an ensemble of the optimized Logistic Regression and the optimized
XGBoost with a custom decision threshold of 0.78: beyond this probability threshold, the search
query is classified as a Task-search. The accuracy of the final classifier is 98.83%, with an F1
Score of the same value. Figures 1 and 2 show the confusion matrix and the ROC curve for the
winning model(s).
Table 3. Final results on the test data set

| Model | Accuracy | F1 Score | ROC-AUC |
| --- | --- | --- | --- |
| Logistic_Regression_C_5 | 0.9800 | 0.9800 | 0.9966 |
| XGBoost_max_depth_5 | 0.9800 | 0.9800 | 0.9954 |
| Baseline_NN_100_nodes | 0.9767 | 0.9767 | 0.9954 |
| Ensemble top 3 | 0.9800 | 0.9800 | 0.9968 |
| Ensemble top 2 | 0.9817 | 0.9817 | 0.9962 |
| Ensemble top 2 (threshold = 0.65) | 0.9867 | 0.9867 | 0.9962 |
| Ensemble top 2 (threshold = 0.78) | 0.9883 | 0.9883 | 0.9962 |

Figures 1 and 2. Confusion Matrix and ROC curve for the Ensemble top 2 with a 0.78 threshold
Conclusion
An extensive exploration of Machine Learning models was used to develop a robust and
highly accurate classifier that can be put into production right away. This exploration of methods
and data shows a couple of interesting learnings and reinforces some of the common wisdom in
Machine Learning: 1) Quality and quantity of data are key. The difference between the first
exploration with the original data set (see the appendix for reference), where the models performed
only marginally better than a naïve classifier, and the performance on the cleaner and
enhanced data set used here is truly remarkable. 2) Shallow learning models often outperform Deep
Learning models, and one should not assume Deep Learning is always better. 3) More regularization
does not always perform best; it appears to depend on the situation, and the
default settings are often good enough. 4) An ensemble of good and diverse models outperforms
any single one. And 5) a custom decision threshold may maximize the performance of a binary
classifier and give some flexibility in the tradeoff between true positives and false positives.
References
Francois Chollet, 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. [ISBN-13: 978-
1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-
notebooks.git
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Available at arXiv:1810.04805v2
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. An Analysis of Deep Neural Network
Models for Practical Applications. Available at: arXiv:1605.07678v4 [cs.CV]
Sepp Hochreiter & Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation
Volume 9 | Issue 8 | November 15, 1997 p.1735-1780
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, Yoshua Bengio, 2014. "Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
Terrence J. Sejnowski, 2018. The Deep Learning Revolution. MIT Press
Aurélien Géron, 2019. Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow. 2nd
Edition. O’Reilly Media, Inc.
Tianqi Chen, Carlos Guestrin, 2016. XGBoost: A Scalable Tree Boosting System.
arXiv:1603.02754v3 [cs.LG]
Appendix – Previous report
Abstract
In this paper, I explore different NLP techniques to classify search queries made by
consumers on search engines such as Google and Bing when researching the makeup
category. I used one-hot-encoding, learnable and pre-trained word embeddings, as well as several
temporal-processing neural nets such as LSTMs, GRUs, and 1D-CNNs. None of the models appear to
significantly outperform a naïve classifier that always predicts the most frequent class. A discussion
of potential areas for further exploration is provided at the end.
Introduction
Natural Language Processing (NLP) and natural language understanding are a booming
research area and a field with lots of applications (Lane, Howard, Hapke, 2019). The ability to teach
machines to process text and use algorithms to understand and act on that understanding is
magical. In this paper, I will explore several Deep Learning techniques for a text classification task.
Literature review
Before using language, machines need to process it in a way that allows computation and
mathematical transformation. Vectorizing text is the first step. There are several techniques to
represent units of text (documents) for classification purposes. Lane, Howard, and Hapke in their
book Natural Language Processing in Action (2019) present the following common approaches:
One-hot-encoding, Term Frequency (TF), Term Frequency-Inverse Document
Frequency (TF-IDF), and word embeddings, among others. These can be divided into two broad types:
techniques that do not take into account the order (or context) of the words or characters, known
as Bag-of-Words (BOW) techniques, and techniques that do, such as word embeddings.
Bag-of-words representations tend to produce sparse, high-dimensional vectors, while word
embeddings produce dense, low-dimensional ones (Chollet, 2018). BOW vectors are hard-coded while
word embeddings are learned from data. There are several approaches to creating word
embeddings: they can be created while training the models for the classification task, or they can
be created separately using different techniques fitted on large data sets, such as the Wikipedia
corpus, Google News, Facebook content, etc., and are often made publicly available. The most
popular ones are Word2Vec, Doc2Vec, GloVe, and fastText (Lane, Howard, Hapke, 2019). More
recently BERT and variations have also become public (Devlin, Chang, Lee, Toutanova, 2018).
Representing the data as numerical vectors is only the beginning. Different Deep Learning
techniques that utilize the data to make predictions or to generate text have also been developed.
Some of the most common are 1D-CNNs and Recurrent Neural Networks (RNNs). Regular Deep
Neural Nets (DNNs) can also be used with text but they don’t fully leverage the order of the words
or sequences. 1D-CNNs are useful to extract ordered patterns in sequences of words regardless of
where they appear in the text. RNNs treat the word inputs as time-based representations where the
input in the present is combined with all previous inputs from the past. There are several variations
of RNNs that have been designed over the years to overcome some obstacles like the exploding or
vanishing gradient problem. The most popular ones are the Long Short Term Memory (LSTM)
(Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). As with
many Deep Learning problems, it is not always clear what is the best approach for a given problem,
and so it is often suggested to start with the simplest solution and gradually increase the complexity
by trying different approaches (Chollet, 2018).
In the following section, I describe the approach taken to explore these different
techniques to classify search queries performed by consumers of the makeup category into three
classes (Task-Oriented search queries, Item-Oriented search queries or Both Task-Item).
Method
Data. A study (to be published) from the Intent Lab – a research partnership between
Performics and Northwestern University’s Medill School of Journalism and Microsoft’s Bing – asked
more than 1700 consumers of the makeup category to write a search query they would use to
research the category when looking to refresh their makeup. Each respondent submitted a search
query and was asked to classify the query as being an Item-query (one used to look for a specific
item, such as “makeup kit” ) or Task-query (used to learn more about how to accomplish a broad
goal such as “find new makeup look”). Respondents were also provided with two more options: the
search query is “Both” a task and item search or “I don’t know”. Many respondents used the same
search query, but they did not always agree as to what type of query it was. This means the data is
noisy. To resolve these discrepancies, I used majority voting to decide what type of query it was and
assigned the label with the majority of votes. When there was a tie between any choice and “Both”,
I selected “Both”, since it encompasses the other two options. When the tie was between Task and
Item, I chose Task, the more general option, unless it was obvious from the search query that the
individual was asking for a specific item.
After cleaning for duplicates with disagreement, irrelevant queries (“abc def ceb”), and
removing “I don’t know”s, I ended up with n=954 queries. I split the data into Train (n=686),
Validation (n=77), and Test (n=191) data sets.
Sequence vectorizing. To test the effect of different types of representation, I trained two
models using one-hot matrices and a single hidden fully-connected layer (one with dropout of 0.5
and 32 nodes, the other without dropout and with 16 nodes), and 10 models (including two simple
models similar to the ones without embeddings) using an embedding layer of dimensions (1000,
100, 8). The size of the vocabulary is 606, but to avoid collisions I set the maximum to 1000. I used vectors
with padding of size 8. Most queries are of size 3, while the longest is of size 21; 8 was a somewhat
arbitrary choice, and I did not test different lengths in this set of experiments. The length of the
embedding vectors (100) was chosen to make it easy to use pre-trained GloVe word
embeddings.
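A minimal Keras sketch of this vectorization step (tokenize, cap the vocabulary at 1000, pad every query to length 8, and feed an embedding layer of output dimension 100); the example queries are placeholders.

```python
# Hypothetical Keras sketch of the vectorization step described above
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

queries = ["makeup kit", "find new makeup look"]  # placeholder queries

tokenizer = Tokenizer(num_words=1000)  # cap the index space at 1000 (the vocabulary has 606 words)
tokenizer.fit_on_texts(queries)
sequences = tokenizer.texts_to_sequences(queries)
padded = pad_sequences(sequences, maxlen=8)  # pad/truncate every query to 8 tokens

embedding = Embedding(input_dim=1000, output_dim=100)  # the (1000, 100) embedding over length-8 inputs
```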
Models. Apart from the two simple baseline DNN models (trained with one-hot vectors)
described above, I tested the following models: 1) a naïve_model that always predicts Task-search
(the most common label, 46%), used as a reference; 2) and 3) described above (One-hot-
baseline_NN_16_nodes and One-hot-baseline_NN_32_w_dropout); 4) Embedding_solo, a model
that uses only the embedding layer, trained as part of the task;
5) Embedding_layer_NN_32_w_dropout, the same as model 4 with an additional dense layer of size
32 and dropout (0.5), comparable to model 3; 6) Embedding_1D_CNN_MaxPooling, a model with a
trainable embedding layer, a 1D CNN with a window size of 3, and a MaxPooling layer;
7) Pre_trained_embedding_solo, the same as model 4 but with pre-trained embeddings from GloVe;
8) Pre_trained_embedding_simple_RNN, model 7 with an additional SimpleRNN layer of size 8;
9) Pre_trained_embedding_stacked(2)_RNN, model 7 with two SimpleRNN layers stacked on top;
10) Pre_trained_embedding_LSTM, model 7 with an additional LSTM layer of size 8;
11) Pre_trained_embedding_GRU, model 7 with an additional GRU layer;
12) Pre_trained_embedding_1D_CNN_MaxPooling, model 7 with a 1D CNN and MaxPooling
(equivalent to model 6 but with pre-trained embeddings); and 13) Pre-
trained_embedding_1D_CNN_GobalMaxPooling, the same as model 12 but with Global Max Pooling
instead of MaxPooling1D.
Metrics. I used the weighted F1 metric and top-1 accuracy on the holdout test data set to
compare the performance of the models, as well as the time it takes to train each model under the K-
fold cross-validation framework. A confusion matrix was produced for each model, and I also estimated
the average accuracy on the validation sets across the K folds (4 folds). After cross-validation, I
trained each model on the full training data set with a 90%/10% train/validation split to get a
sense of the training performance. I used a virtual machine with a GPU via Google Colaboratory for
all experiments.
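A minimal sketch of these evaluation metrics with Scikit-Learn on made-up three-class labels (Item, Task, Both); the label arrays are placeholders.

```python
# Hypothetical sketch of the evaluation metrics: weighted F1, top-1 accuracy, and a confusion matrix
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

y_true = [0, 1, 2, 1, 0, 2]  # 0 = Item, 1 = Task, 2 = Both (placeholders)
y_pred = [0, 1, 1, 1, 0, 2]

weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
top1_accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
```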
Results
Table 1. Model comparison

| Model | Weighted F1 score on test set | Accuracy on test set | Kfold-validation average accuracy | Training time (seconds) |
| --- | --- | --- | --- | --- |
| 1 Naïve_model | 0.29 | 0.46 | n/a | n/a |
| 2 One-hot-baseline_NN_16_nodes | 0.30 | 0.46 | 0.46 | 3.76 |
| 3 One-hot-baseline_NN_32_w_dropout | 0.31 | 0.43 | 0.46 | 3.84 |
| 4 Embedding_solo | 0.38 | 0.42 | 0.45 | 4.04 |
| 5 Embedding_layer_NN_32_w_dropout | 0.40 | 0.44 | 0.44 | 5.08 |
| 6 Embedding_1D_CNN_MaxPooling | 0.33 | 0.46 | 0.46 | 5.45 |
| 7 Pre-trained_embedding_solo | 0.47 | 0.49 | 0.45 | 3.01 |
| 8 Pre-trained_embedding_simple_RNN | 0.40 | 0.44 | 0.44 | 8.01 |
| 9 Pre-trained_embedding_stacked(2)_RNN | 0.34 | 0.42 | 0.44 | 10.42 |
| 10 Pre-trained_embedding_LSTM | 0.39 | 0.49 | 0.46 | 11.50 |
| 11 Pre-trained_embedding_GRU | 0.29 | 0.45 | 0.46 | 11.85 |
| 12 Pre-trained_embedding_1D_CNN_MaxPooling | 0.45 | 0.47 | 0.41 | 3.84 |
| 13 Pre-trained_embedding_1D_CNN_GobalMaxPooling | 0.46 | 0.49 | 0.48 | 4.45 |
As can be seen in Table 1, none of the models perform significantly better than the
naïve_model in terms of accuracy, which is discouraging. The best-performing model is a simple
pre-trained embedding layer with the classifier on top. The F1 score for this model, 0.47, is
significantly higher than the naïve model's 0.29, but still not good enough for real use on an unlabeled data set.
Although all the models trained very fast, even using cross-validation, given the small data set, it is
clear that the RNN, LSTM, and GRU models are considerably more computationally expensive and, in this
case, do not achieve higher performance. The models with one-hot encoding underperform most of
the ones with word embeddings on the F1 score. And using pre-trained word embeddings seems to
contribute positively as evidenced by model 7. Looking at the confusion matrix of model 7, it is clear
that the model confuses “Task-searches” with “Both” providing a window into potential areas for
further exploration.
Conclusion
The initial exploration of Deep Learning methods for classifying search queries into Task vs
Item-oriented queries illustrates the challenges of the task: 1) search queries are often a short
combination of keywords and often lack the necessary context to extract deep semantic meaning;
2) many contain typos or slight variations, making it harder for a machine to realize they are the
same queries; 3) the classes are ill-defined: a query that is “Both” is at the same time a “Task-
search”, probably sharing the same cues and type of language, making the classes hard to separate from one
another; and 4) the data set is too small to allow for good learning of custom embeddings, and although
pre-trained word embeddings seem to help, there is a need for more context to help the algorithm
learn valuable information.
This suggests a few potential areas for improvement: 1) make the task a binary
classification problem by either removing “Both” from the data set or making “Both” and “Task-
searches” equivalent; 2) try character-level encoding to provide richer representations of the data
and enable the power of temporal processing by RNNs and 1D-CNNs; 3) feature-engineer additional
variables, such as the search volume behind these queries (as a measure of popularity) and
journey-mapping of the search query (I have developed another algorithm that relies on an Intent-
ontology to classify these queries into different buckets corresponding to different stages in the
consumer journey); this use of external data outside of the language in the text might provide
additional and valuable context; 4) try shallow learning methods such as random forest and SVM
and consider combining them (ensembles of models) to make a concerted prediction; and 5) optimize
the hyperparameters of promising models after exploring the opportunities from points 1-4.
References
Francois Chollet, 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. [ISBN-13: 978-
1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-
notebooks.git
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Available at arXiv:1810.04805v2
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. An Analysis of Deep Neural Network
Models for Practical Applications. Available at: arXiv:1605.07678v4 [cs.CV]
Sepp Hochreiter & Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation
Volume 9 | Issue 8 | November 15, 1997 p.1735-1780
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, Yoshua Bengio (2014). "Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
More Related Content

DOC
PDF
FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS
PDF
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
PDF
Opinion mining on newspaper headlines using SVM and NLP
PDF
On the benefit of logic-based machine learning to learn pairwise comparisons
PDF
Controlling informative features for improved accuracy and faster predictions...
PDF
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
PDF
A survey of modified support vector machine using particle of swarm optimizat...
FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
Opinion mining on newspaper headlines using SVM and NLP
On the benefit of logic-based machine learning to learn pairwise comparisons
Controlling informative features for improved accuracy and faster predictions...
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
A survey of modified support vector machine using particle of swarm optimizat...

What's hot (19)

PDF
SUPPORT VECTOR MACHINE CLASSIFIER FOR SENTIMENT ANALYSIS OF FEEDBACK MARKETPL...
PDF
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
PDF
Sentiment Analysis Using Hybrid Approach: A Survey
PPTX
PhD defense
PDF
Multidirectional Product Support System for Decision Making In Textile Indust...
PDF
Streaming Analytics
PPTX
PhD Consortium ADBIS presetation.
PDF
Neural Network Based Context Sensitive Sentiment Analysis
PPTX
A Fuzzy Logic Intelligent Agent for Information Extraction
PDF
IRJET - Support Vector Machine versus Naive Bayes Classifier:A Juxtaposition ...
PDF
11.hybrid ga svm for efficient feature selection in e-mail classification
PDF
Hybrid ga svm for efficient feature selection in e-mail classification
PDF
Trading outlier detection machine learning approach
PDF
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
PDF
Query aware determinization of uncertain
PDF
IRJET- Text Document Clustering using K-Means Algorithm
DOCX
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
PDF
DeepSearch_Project_Report
DOC
DATA MINING.doc
SUPPORT VECTOR MACHINE CLASSIFIER FOR SENTIMENT ANALYSIS OF FEEDBACK MARKETPL...
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
Sentiment Analysis Using Hybrid Approach: A Survey
PhD defense
Multidirectional Product Support System for Decision Making In Textile Indust...
Streaming Analytics
PhD Consortium ADBIS presetation.
Neural Network Based Context Sensitive Sentiment Analysis
A Fuzzy Logic Intelligent Agent for Information Extraction
IRJET - Support Vector Machine versus Naive Bayes Classifier:A Juxtaposition ...
11.hybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classification
Trading outlier detection machine learning approach
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
Query aware determinization of uncertain
IRJET- Text Document Clustering using K-Means Algorithm
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
DeepSearch_Project_Report
DATA MINING.doc
Ad

Similar to Binary search query classifier (20)

PPT
powerpoint
PDF
50120140504015
PDF
Introduction to active learning
PPT
i i believe is is enviromntbelieve is is enviromnt7.ppt
PDF
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
PDF
Active Learning Literature Survey
PDF
Machine learning with in the python lecture for computer science
PDF
50120140503003 2
PPTX
Introduction to Machine Learning
PDF
Introduction Machine Learning Syllabus
PDF
XGBoost @ Fyber
PPT
c23_ml1.ppt
PDF
Choosing a Machine Learning technique to solve your need
PPT
PPT-3.ppt
PPT
MAchine learning
PPT
Machine Learning Machine Learnin Machine Learningg
PDF
Technical Area: Machine Learning and Pattern Recognition
PPTX
digital image processing - classification
PPTX
Interactive ML.pptx XAI by Radhika selvamani
powerpoint
50120140504015
Introduction to active learning
i i believe is is enviromntbelieve is is enviromnt7.ppt
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
Active Learning Literature Survey
Machine learning with in the python lecture for computer science
50120140503003 2
Introduction to Machine Learning
Introduction Machine Learning Syllabus
XGBoost @ Fyber
c23_ml1.ppt
Choosing a Machine Learning technique to solve your need
PPT-3.ppt
MAchine learning
Machine Learning Machine Learnin Machine Learningg
Technical Area: Machine Learning and Pattern Recognition
digital image processing - classification
Interactive ML.pptx XAI by Radhika selvamani
Ad

More from Esteban Ribero (8)

PDF
Conjoint analysis with mcmc
PDF
Campaign response modeling
PDF
Consumer Segmentation with Bayesian Statistics
PDF
Modeling Sexual Selection with Agent-Based Models
PDF
The Learning Lab
PDF
Brand Communications Modeling: Developing and Using Econometric Models in Adv...
PDF
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...
PDF
Is looking at consumers' brain the ultimate solution?
Conjoint analysis with mcmc
Campaign response modeling
Consumer Segmentation with Bayesian Statistics
Modeling Sexual Selection with Agent-Based Models
The Learning Lab
Brand Communications Modeling: Developing and Using Econometric Models in Adv...
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...
Is looking at consumers' brain the ultimate solution?

Recently uploaded (20)

PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Business Analytics and business intelligence.pdf
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
PDF
Mega Projects Data Mega Projects Data
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Database Infoormation System (DBIS).pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
1_Introduction to advance data techniques.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Business Analytics and business intelligence.pdf
Qualitative Qantitative and Mixed Methods.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Business Ppt On Nestle.pptx huunnnhhgfvu
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Business Acumen Training GuidePresentation.pptx
Mega Projects Data Mega Projects Data
Introduction-to-Cloud-ComputingFinal.pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
Introduction to Knowledge Engineering Part 1
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Reliability_Chapter_ presentation 1221.5784
Database Infoormation System (DBIS).pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
1_Introduction to advance data techniques.pptx

Binary search query classifier

  • 1. Binary Search Query Classifier with Ensemble Models Esteban Ribero – Chicago, March 2020 Abstract In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them tuning their hyperparameters, and finally creating an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production. The conclusions confirm some of the common wisdom in Machine Learning regarding size and quality of data, shallow vs Deep Learning and ensemble vs individual models. Introduction Search data from search engines such as Google and Bing are a great source of information about consumers’ needs and opportunities for content development and strategic media placement for marketers. However, the search data that is available to marketers is sometimes limited to lists of keywords with estimates of their search volume, cost per click, and competitive level. To expand and complement these data, the Intent Lab– a research partnership between Performics (a leading performance marketing agency) and Northwestern University’s Medill School of Journalism – performs primary research with consumers to better understand their intent when using search engines. In a recent study (to be published), The Intent Lab asked more than 1700 consumers of the makeup category to write a search query that they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit”) or a Task-query (used to learn more about how to accomplish a broad goal such as “find new makeup look”). It turns out that knowing what type of query a searcher is performing provides clear guidance into the type of content that should be served to those consumers. To expand the learnings beyond the study and build a practical application to automatically label thousands of
  • 2. search queries as Goal-oriented (Task-search) vs Item-oriented (Item-search), it is necessary to develop and train a custom search-query classifier. This classifier can then be used to analyze the publicly available data provided to marketers by the type of search query on an ongoing basis. The purpose of this paper is to report on the development of such classifier. Literature review Deep Learning has become a popular toolkit for Data Scientists trying to use the latest in Machine Learning and AI (Sejnowski, 2018). Natural Language Processing (NLP) has been one of the areas much influenced by these recent developments, although the field is vast and includes several traditional approaches (Lane, Howard, and Hapke, 2019) that could be reframed nowadays as shallow learning (Chollet, 2018). Classifying search queries into distinct buckets can be conceived as a case of Supervised Learning and the available models suited for the task is vast. Classifying queries into Task-searches or Item-Searches is a binary classification problem and well suited for a variety of techniques. It is often advised to try simpler, shallow learning models before moving on to more complex deep learning models (Chollet, 2018). The following section describes some of the most common models. Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class. The model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result (Géron, 2020). This is a typical sigmoid function that squashes the input (the weighted sum of input features plus bias in this case) into an S-shape form that outputs values between 0 and 1. Once this probability is estimated, a prediction can be made by identifying a threshold and assigning the observation to either one class or the other. The threshold is by default 0.5 but this can be adjusted if needed. This sigmoid function is also often used in Deep Learning as the final layer of a neural network binary classifier. A great feature of Logistic Regression is the ability to identify the important features leading to the classifier decision. The coefficients in the regression convey such information (Géron, 2020). Support Vector Machines are powerful ML models and are particularly well suited for classification of complex small or medium-sized datasets (Géron, 2020). It relies on two parallel
  • 3. vectors to a line that separates the classes in two. These “support” vectors make sure that the decision boundary that separates the classes is the one farther away from each class creating a “large margin classification” (Géron, 2020). Unfortunately, unlike the other binary classifiers, SVM classifiers do not output probabilities for each class making them hard to use as part of an ensemble of models that average the prediction probabilities at inference time to come up with a pooled classification decision (soft voting). However, they can be effectively used with ensembles if hard voting is used instead. Random Forrest is among the most popular ML algorithms. It is an ensemble of Decision Trees, where each tree is usually trained on a subset of the samples selected at random most often with bootstrapping (sampling with replacement). It can also randomly choose the features available to each tree. This introduces extra randomness when growing trees because instead of searching for the very best feature when splitting a node, it searches for the best features among a random subset of features (Géron, 2020). Radom Forrest have several hyperparameters that can be used to fine tune the model. The method provides estimates of probabilities as well as specific class prediction and also provides feature importance. Gradient Boosting is another ensemble method that has become extremely popular among Data Scientists (Chollet, 2019). Gradient Boosting works by sequentially adding predictors, usually trees, to an ensemble, each one correcting its predecessor. This method tries to fit the new predictor to the residual errors made by the previous predictor. Unlike Random Forrest where each tress is gown independent of the others, Gradient Boosting sequentially train trees. This makes the method slower but tends to perform better. Fortunately, in 2016 Tianqi Chen and Carlos Guestrin (2016) developed an optimized system to perform gradient boosting that drastically improves speed and scalability. The system is called Extreme Gradient Boosting (XGBoost) and is available free via many open source packages. Deep Learning Methods for text classification can be divided into two simple Neural Nets (NN) that do not take word order into consideration (BOW models) and those that do. Simple NN are simply stacks of fully connected layers with a sigmoid classifier on top. Models that do take into account the context and/or word order are: 1) 1D-CNNs, useful to extract ordered patterns in sequences of words regardless of where they appear in the text, and 2) Recurrent NN (RNNs) that
  • 4. treat the word inputs as time-based representations where the input in the present is combined with all previous inputs from the past. There are several variations of RNNs that have been designed over the years to overcome some obstacles like the exploding or vanishing gradient problem. The most popular ones are the Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). There are other types of Deep Learning techniques for language processing and understanding called Transformers that rely on attention mechanisms to help the models weight the importance of the surrounding sequences. One of the most popular is BERT and its variations (Devlin, Chang, Lee, Toutanova, 2018). As with many Deep Learning problems, it is not always clear what is the best approach for a given problem, and so it is often suggested to start with the simplest solution and gradually increase the complexity by trying different approaches (Chollet, 2018). In the following section, I describe the approach taken to explore these different techniques to classify search queries performed by consumers of the makeup category. And eventually narrow down on the best solution for this task. Method Initial Data Set. The Intent Lab study described in the introduction asked more than 1700 consumers of the makeup category to write a search query that they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit” ) or Task-query (used to learn more about how to accomplish a broad goal such as “find new makeup look”). Respondents were also provided with two more options: the search query is “Both” a task and item search or “I don’t know”. Many respondents used the same search query, but they did not always agree as to what type of query it was. This means the initial data is noisy. To solve these discrepancies, I use majority voting to decide what type of query it was and assigned the label with the majority of votes. When there was a tie between any choice and “Both” I selected both since it encompasses the other two options and when the tie was between Task or Item I chose Task since it is the more general one unless it was obvious from the search query that the individual was asking for a specific item. After cleaning for duplicates with disagreement, irrelevant queries (“abc def ceb”), and removing “I don’t know”s, I ended up with n=954 queries. I
  • 5. performed an initial exploration of methods with these data (reported in a separate report and provided here in the appendix) and concluded, that more data was needed. Final Data Set. To simplify the task into a binary classification challenge, I reclassified the queries with a “Both” label into either a Task or an Item search. Most of those queries ended up being Task-searches which are the more abstract of the two. Item searches, by definition, mention a specific item, such a “mascara”, “foundation”, “lipstick”, etc. so it was relatively easy to identify them. I also looked at the ones labeled “I don’t know” and was able to correctly classify them into one or the other class. This helped me expand the workable set of search queries to around 1200 but most (71%) ended up being Task-searches probably due to the way the question was asked which primed consumers to think about a broader goal more so than a specific item. The question was something equivalent to: “what would you type in the search bar if you were going to use a search engine to look for information to help you refresh your makeup look?”. To balance the data set and expand it even further I manually identified 1150 Item-searches and 650 Task-searches from a list of >20,000 keywords downloaded from Google Keyword Planner. The resulting and final data set is a perfectly balanced list of 1500 Item-searches and 1500 Task-searches for the makeup category for a total of n = 3000. I randomly split the data set into a Development set (for Kfold cross-validation) and a Test set with a 80/20 split. Models. I trained three sets of models: A) The four “shallow learning” models described above: 1 Logistic Regression; 2 Random Forrest; 3 Extreme Gradient Boosting; and 4 Support Vector Classifier (SVC). I used the default settings for these models provided by the Python Skit-Learn library. B) Several Deep Learning models using word vectorization and different topologies/types of layers: A basic NN with a single layer with 100 nodes using one-hot-encoding of the input sequences (5 Baseline_NN_100_nodes). The same model using a trainable embedding layer of shape (2000, 100) (6 Embedding_1_NN_100_nodes. There were only 1556 words in the vocabulary but set the dim to 2000 to avoid collisions when hashing. A model with a simple RNN layer with 100 nodes and a pre-trained word embedding using GloVe vectors of size 100 (7 Pre-trained_embedding _simple_RNN_100). The same model with a learnable embedding layer (8
  • 6. Embedding_simple_RNN_100). An LSTM layer with 100 nodes and the same learnable embedding layer (9 Embedding_1_LSTM_layer_100). And a 1D-CNN layer of size 100 with a window size 3 and Global Max Pooling before the classifier (10, Embedding_ 1D_CNN_100_GobalMaxPooling). I used sequences of size 10 with padding for all embedding layers in the models above. C) Two character-level Deep Learning models. Both with an embedding layer of size 50 and 100, and padded sequence of size 40. One has a single 1D-CNN layer of size 32 with a window of 10 characters and Max Pooling with 3 filters (11 Char_Embedding_1D_ CNN_32_10_MaxPooling_3). The other has two stacks of 1D-CNNs of size 100 with Max Polling with three filters in between and a final Global Max Pooling layer before the classifier. (12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling). This is the most complex model of all. Apart from the models above I created a random classifier for reference and shown in Table 1 along with the other models. After training the models above I chose the three winning models and optimized them tweaking the hyperparameters. I trained 6 versions of the Logistic Regression and 10 versions of the XGBoost model. I did not tweak the Baseline NN given the extensive exploration of different topologies and types of layers for the Deep Learning models performed already. Metrics. I used Kfold (4 folds) cross-validation with the Development data set and estimated the average Train and Validation Accuracy across the folds. I also estimated the average ROC-AUC score on the Validation sets and estimated the total cross-validation training time. I trained again the 3 winning models using all the available training data and evaluated them on the holdout test data set using Accuracy, F1 Score, and the ROC-AUC score. I used a virtual machine with GPU via Google Collaboratory for all the experiments. Model ensembling. In a final exploration of the best models, I ensembled the predictions of the top three and top 2 classifiers by averaging the prediction probabilities and identified the optimal threshold to maximize the true positive rate while keeping the false-negative rate to its minimum possible.
  • 7. Results Table 1. Model comparison As can be seen in Table 1, all models perform extremely well surpassing 95% accuracy on the validation set and the ROC_AUC scores are all above 0.97. The top-performing models are the Logistic Regression with a ROC-AUC score of 0.996 and a validation accuracy of 0.980 and the XGBoost with a validation accuracy of 0.983 and a ROC-AUC of 0.995. The SVC follows these models very closely but with a significantly higher computation cost. Of the Deep Learning models, the simplest two NN were the winners with similar ROC-AUC scores but with lower validation accuracy. Although the most complex of all the models (model 12) achieve high performance it did not beat the simplest NN or even the worst performing of the shallow learning models. This is another case where less is more and the most traditional of all the classifiers (Logistic Regression) won the top prize with the most cost-effective implementation. It only took 0.57 seconds to train. Table 2. Hyperparameter tuning of the winning models Model Train accuracy Validation accuracy Validation ROC-AUC Training time (seconds) Random_Classifier 0.495 0.505 0.523 0.069 1 Logistic_Regression 0.987 0.980 0.996 0.577 2 Random_Forrest 1.000 0.973 0.993 4.894 3 XGBoost 0.984 0.983 0.995 24.493 4 SVC 0.991 0.983 0.993 70.915 5 Baseline_NN_100_nodes 0.997 0.976 0.995 13.157 6 Embedding_1_NN_100_nodes 0.995 0.976 0.994 18.456 7 Pre-trained_embedding_simple_RNN_100 0.972 0.956 0.977 20.193 8 Embedding_simple_RNN_100 0.997 0.958 0.989 23.984 9 Embedding_1_LSTM_layer_100 0.998 0.960 0.987 39.607 10 Embedding_1D_CNN_100_GobalMaxPooling 0.998 0.973 0.994 18.607 11 Char_Embedding_1D_CNN_32_10_MaxPooling_3 0.998 0.964 0.988 48.268 12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling 0.996 0.965 0.990 77.049 Kfold-Cross Validation Results Model Train accuracy Validation accuracy Validation ROC-AUC Training time (seconds) 1 Logistic_Regression 0.9872 0.9804 0.9958 0.577 2 Logistic_Regression_C_0.5 0.9832 0.9788 0.9954 0.345 3 Logistic_Regression_C_1.5 0.9894 0.9825 0.9959 0.486 4 Logistic_Regression_C_2 0.9907 0.9829 0.9960 0.476 5 Logistic_Regression_C_5 0.9953 0.9838 0.9959 0.561 6 Logistic_Regression_elasticnet_l1_0.5 0.9853 0.9838 0.9960 9.496 1 XGBoost 0.9842 0.9833 0.9948 24.493 2 XGBoost_lr_1 0.9874 0.9813 0.9926 20.157 3 XGBoost_lr_0.05 0.9697 0.9679 0.9924 19.854 4 XGBoost_max_depth_2 0.9782 0.9779 0.9938 14.717 5 XGBoost_max_depth_4 0.9851 0.9842 0.9951 24.363 6 XGBoost_max_depth_5 0.9854 0.9829 0.9956 29.120 7 XGBoost_max_depth_5_200_trees 0.9864 0.9825 0.9948 58.037 8 XGBoost_max_depth_4_200_trees 0.9861 0.9829 0.9951 48.427 9 XGBoost_max_depth_5_L2_0.5 0.9857 0.9838 0.9955 29.010 10 XGBoost_max_depth_5_L2_5 0.9833 0.9800 0.9953 28.690 Kfold-Cross Validation Results
Table 2 shows the results of the different variations of the top-performing models, used to identify the optimal setting for some of the hyperparameters. Looking at the results, an interesting finding emerges: the models with less regularization (up to a point) are the ones performing best. The Logistic Regression with L2 regularization and C = 5 (vs. the default C = 1, i.e., a weaker penalty) is the top performer, followed by the XGBoost with a max_depth of 5, whose deeper trees fit the training data more closely at the usual risk of generalizing worse. Presumably, this indicates that the data are very homogeneous, so the held-out data end up being very similar to the training data.

Table 3 shows the final results for the winning models and for the ensembles of the top two and top three models, with the standard 0.5 threshold as well as optimized thresholds. The winner among all approaches is an ensemble of the optimized Logistic Regression and the optimized XGBoost with a custom decision threshold of 0.78: beyond this probability, a search query is classified as a Task-search. The final classifier reaches an accuracy of 98.83% and an F1 score of the same value. Figures 1 and 2 show the confusion matrix and the ROC curve for the winning model.

Table 3. Final results on the test data set

| Model | Accuracy | F1 Score | ROC-AUC |
|-------|----------|----------|---------|
| Logistic_Regression_C_5 | 0.9800 | 0.9800 | 0.9966 |
| XGBoost_max_depth_5 | 0.9800 | 0.9800 | 0.9954 |
| Baseline_NN_100_nodes | 0.9767 | 0.9767 | 0.9954 |
| Ensemble top 3 | 0.9800 | 0.9800 | 0.9968 |
| Ensemble top 2 | 0.9817 | 0.9817 | 0.9962 |
| Ensemble top 2 (threshold = 0.65) | 0.9867 | 0.9867 | 0.9962 |
| Ensemble top 2 (threshold = 0.78) | 0.9883 | 0.9883 | 0.9962 |

Figures 1 and 2. Confusion matrix and ROC curve for the top-2 ensemble with a 0.78 threshold.
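Putting the pieces together, here is a minimal, self-contained sketch of how the final classifier could be assembled from the tuned configurations in Table 2 and the 0.78 threshold in Table 3. The feature matrices and labels (X_train, y_train, X_test) are assumed to already exist; the exact preprocessing pipeline is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

THRESHOLD = 0.78  # custom decision threshold from Table 3

# Tuned configurations mirroring Table 2; other settings are left at their defaults.
clf_lr = LogisticRegression(C=5, max_iter=1000)
clf_xgb = XGBClassifier(max_depth=5)

# X_train, y_train, X_test are assumed to hold the already-vectorized queries.
# clf_lr.fit(X_train, y_train)
# clf_xgb.fit(X_train, y_train)
# proba = np.mean([clf_lr.predict_proba(X_test)[:, 1],
#                  clf_xgb.predict_proba(X_test)[:, 1]], axis=0)
# y_pred = (proba >= THRESHOLD).astype(int)  # 1 = Task-search
```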
Conclusion

An extensive exploration of Machine Learning models was used to develop a robust and highly accurate classifier that can be put into production right away. This exploration of methods and data yields several interesting learnings and reinforces some of the common wisdom in Machine Learning: 1) Quality and quantity of data are key: the difference between the first exploration with the original data set (see the appendix for reference), where the models performed only marginally better than a naïve classifier, and the performance on the cleaner and enhanced data set used here is truly remarkable. 2) Shallow learning models often outperform Deep Learning models, and one should not assume Deep Learning is always better. 3) More regularization does not always perform best; it depends on the situation, and the default settings are often good enough. 4) An ensemble of good and diverse models outperforms any single one. And 5) a custom decision threshold can maximize the performance of a binary classifier and give some flexibility in the trade-off between true positives and false positives.
References

Francois Chollet, 2018. Deep Learning with Python. Shelter Island, NY: Manning. [ISBN-13: 978-1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-notebooks.git

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, 2017. An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678v4 [cs.CV]

Sepp Hochreiter and Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation, 9(8), pp. 1735–1780.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs.CL]

Terrence J. Sejnowski, 2018. The Deep Learning Revolution. MIT Press.

Aurélien Géron, 2019. Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, 2nd Edition. O'Reilly Media, Inc.

Tianqi Chen and Carlos Guestrin, 2016. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754v3 [cs.LG]
Appendix – Previous report

Abstract

In this paper, I explore different NLP techniques to classify search queries made by consumers to search engines such as Google and Bing when researching the makeup category. I used one-hot encoding, learnable and pre-trained word embeddings, as well as several temporal-processing neural nets such as LSTMs, GRUs, and 1D-CNNs. None of the models appears to significantly outperform a naïve classifier that always predicts the most frequent class. A discussion of potential areas for further exploration is provided at the end.

Introduction

Natural Language Processing (NLP) and natural language understanding are a booming research area and a field with many applications (Lane, Howard, and Hapke, 2019). The ability to teach machines to process text and to use algorithms to understand and act on that understanding is magical. In this paper, I explore several Deep Learning techniques for a text classification task.

Literature review

Before using language, machines need to process it in a way that allows computation and mathematical transformation. Vectorizing text is the first step. There are several techniques to represent units of text (documents) for classification purposes. Lane, Howard, and Hapke, in their book Natural Language Processing in Action (2019), present the following common approaches: one-hot encoding, Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF-IDF), and word embeddings, among others. These can be divided into two broad types: techniques that do not take into account the order (or context) of the words or characters, known as Bag-of-Words (BOW) representations, and techniques that do, such as word embeddings. Bag-of-Words representations tend to produce sparse, high-dimensional vectors, while word embeddings produce dense, low-dimensional ones (Chollet, 2018). BOW vectors are hard-coded, while word embeddings are learned from data.
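As a small illustration of the Bag-of-Words side of this distinction, the sketch below builds binary one-hot and TF-IDF vectors for a few queries with scikit-learn; the example queries are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up example queries, purely for illustration.
queries = ["makeup kit", "find new makeup look", "best makeup brands"]

# Binary one-hot style BOW: 1 if the word appears in the query, 0 otherwise.
onehot = CountVectorizer(binary=True)
X_onehot = onehot.fit_transform(queries)   # sparse matrix, one row per query

# TF-IDF: down-weights words that appear in many queries (e.g., "makeup").
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(queries)

print(onehot.get_feature_names_out())      # vocabulary learned from the queries
print(X_onehot.toarray())
print(X_tfidf.toarray().round(2))
```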
There are several approaches to creating word embeddings: they can be learned while training the model for the classification task, or they can be created separately using different techniques fitted on large data sets such as the Wikipedia corpus, Google News, or Facebook content, and they are often made publicly available. The most popular ones are Word2Vec, Doc2Vec, GloVe, and fastText (Lane, Howard, and Hapke, 2019). More recently, BERT and its variations have also become public (Devlin, Chang, Lee, and Toutanova, 2018).

Representing the data as numerical vectors is only the beginning. Different Deep Learning techniques that use these representations to make predictions or to generate text have also been developed. Some of the most common are 1D-CNNs and Recurrent Neural Networks (RNNs). Regular Deep Neural Nets (DNNs) can also be used with text, but they do not fully leverage the order of the words or sequences. 1D-CNNs are useful for extracting ordered patterns in sequences of words regardless of where they appear in the text. RNNs treat the word inputs as time-based representations in which the input at the present step is combined with all previous inputs from the past. Several variations of RNNs have been designed over the years to overcome obstacles such as the exploding or vanishing gradient problem; the most popular are the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014). As with many Deep Learning problems, it is not always clear what the best approach is for a given problem, so it is often suggested to start with the simplest solution and gradually increase the complexity by trying different approaches (Chollet, 2018). In the following sections, I describe the approach taken to explore these different techniques to classify search queries performed by consumers of the makeup category into three classes (Task-oriented search queries, Item-oriented search queries, or Both Task and Item).
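To make the pre-trained-embedding idea concrete before describing the method, here is a minimal sketch of loading GloVe vectors into a frozen Keras Embedding layer. The file path, vocabulary cap, and dimensions are assumptions for illustration, not the exact setup used in these experiments.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 100   # matches the 100-dimensional GloVe vectors
MAX_WORDS = 1000  # assumed vocabulary cap

def load_glove(path="glove.6B.100d.txt"):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors

def build_embedding_matrix(word_index, glove):
    """Rows follow the tokenizer's word index; out-of-vocabulary words stay zero."""
    matrix = np.zeros((MAX_WORDS, EMBED_DIM))
    for word, i in word_index.items():
        if i < MAX_WORDS and word in glove:
            matrix[i] = glove[word]
    return matrix

def frozen_embedding_layer(matrix):
    """A non-trainable Embedding layer initialized with the pre-trained vectors."""
    return tf.keras.layers.Embedding(
        MAX_WORDS, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False)
```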
Method

Data. A study (to be published) from the Intent Lab – a research partnership between Performics, Northwestern University's Medill School of Journalism, and Microsoft's Bing – asked more than 1700 consumers of the makeup category to write a search query they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify it as an Item-query (one used to look for a specific item, such as "makeup kit") or a Task-query (one used to learn more about how to accomplish a broad goal, such as "find new makeup look"). Respondents were also given two more options: the search query is "Both" a task and an item search, or "I don't know". Many respondents used the same search query, but they did not always agree on what type of query it was, which means the data are noisy. To resolve these discrepancies, I used majority voting and assigned the label with the most votes. When there was a tie between any choice and "Both", I selected "Both", since it encompasses the other two options; when the tie was between Task and Item, I chose Task, since it is the more general one, unless it was obvious from the search query that the individual was asking for a specific item. After cleaning duplicates with disagreement, irrelevant queries ("abc def ceb"), and removing the "I don't know"s, I ended up with n = 954 queries. I split the data into Train (n = 686), Validation (n = 77), and Test (n = 191) sets.

Sequence vectorizing. To test the effect of different types of representation, I trained two models using one-hot matrices and a single hidden fully connected layer – one with dropout (0.5) and 32 nodes, the other without dropout and with 16 nodes – and 10 models (including two simple models similar to the ones without embeddings) using an embedding layer with a vocabulary of 1,000, an embedding dimension of 100, and an input length of 8. The vocabulary size is 606, but to avoid collisions I set the maximum to 1,000. I used padded vectors of length 8; most queries are of length 3, while the longest is of length 21. The choice of 8 was somewhat arbitrary, and I did not test different lengths in this set of experiments. The length of the embedding vectors (100) was chosen to make it easy to use pre-trained GloVe word embeddings.
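A minimal sketch of this vectorization step, assuming the Keras Tokenizer and pad_sequences utilities and the settings described above (vocabulary capped at 1,000, padded length 8); the example queries are invented for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 1000   # vocabulary cap to avoid index collisions
MAXLEN = 8         # padded sequence length used in these experiments

# Made-up example queries standing in for the survey data.
queries = ["makeup kit", "find new makeup look", "best drugstore foundation for dry skin"]

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(queries)                     # build the word index
sequences = tokenizer.texts_to_sequences(queries)   # words -> integer ids
X = pad_sequences(sequences, maxlen=MAXLEN)         # pad/truncate to length 8
print(X.shape)  # (3, 8)
```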
Models. Apart from the two simple baseline DNN models (trained with one-hot vectors) described above, I tested the following models: 1) a naïve_model that always predicts Task-search (the most common label, 46%), used as a reference; 2) and 3) the models described above (One-hot-baseline_NN_16_nodes and One-hot-baseline_NN_32_w_dropout); 4) Embedding_solo, a model that uses only the embedding layer, trained as part of the task; 5) Embedding_layer_NN_32_w_dropout, the same as model 4 with an additional dense layer of size 32 and dropout (0.5), comparable to model 3; 6) Embedding_1D_CNN_MaxPooling, a model with a trainable embedding layer, a 1D-CNN with a window size of 3, and a MaxPooling layer; 7) Pre_trained_embedding_solo, the same as model 4 but with pre-trained GloVe embeddings; 8) Pre_trained_embedding_simple_RNN, model 7 with an additional SimpleRNN layer of size 8; 9) Pre_trained_embedding_stacked(2)_RNN, model 7 with two SimpleRNN layers stacked on top; 10) Pre_trained_embedding_LSTM, model 7 with an additional LSTM layer of size 8; 11) Pre_trained_embedding_GRU, model 7 with an additional GRU layer; 12) Pre_trained_embedding_1D_CNN_MaxPooling, model 7 with a 1D-CNN and MaxPooling (equivalent to model 6 but with pre-trained embeddings); and 13) Pre-trained_embedding_1D_CNN_GobalMaxPooling, the same as model 12 but with Global Max Pooling instead of MaxPooling1D.

Metrics. I used the weighted F1 score and top-1 accuracy on the holdout test data set to compare the performance of the models, as well as the time it takes to train each model under the K-fold cross-validation framework. A confusion matrix was produced for each model, and I also estimated the average accuracy on the validation set across the K folds (4 folds). After cross-validation, I trained each model on the full training data set with a 90%/10% train/validation split to get a sense of the training performance. I used a virtual machine with a GPU via Google Colaboratory for all experiments.
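As a small, self-contained sketch of this evaluation (the labels below are invented, purely to show the scikit-learn calls for the metrics used):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels for the three classes
# (0 = Task, 1 = Item, 2 = Both), purely for illustration.
y_test = [0, 1, 2, 0, 1, 0, 2, 1]
y_pred = [0, 1, 0, 0, 2, 0, 2, 1]

print("accuracy:", accuracy_score(y_test, y_pred))
print("weighted F1:", round(f1_score(y_test, y_pred, average="weighted"), 3))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted
```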
Results

Table 1. Model comparison

| # | Model | Weighted F1 (test) | Accuracy (test) | K-fold average validation accuracy | Training time (s) |
|---|-------|--------------------|-----------------|------------------------------------|-------------------|
| 1 | Naïve_model | 0.29 | 0.46 | n/a | n/a |
| 2 | One-hot-baseline_NN_16_nodes | 0.30 | 0.46 | 0.46 | 3.76 |
| 3 | One-hot-baseline_NN_32_w_dropout | 0.31 | 0.43 | 0.46 | 3.84 |
| 4 | Embedding_solo | 0.38 | 0.42 | 0.45 | 4.04 |
| 5 | Embedding_layer_NN_32_w_dropout | 0.40 | 0.44 | 0.44 | 5.08 |
| 6 | Embedding_1D_CNN_MaxPooling | 0.33 | 0.46 | 0.46 | 5.45 |
| 7 | Pre-trained_embedding_solo | 0.47 | 0.49 | 0.45 | 3.01 |
| 8 | Pre-trained_embedding_simple_RNN | 0.40 | 0.44 | 0.44 | 8.01 |
| 9 | Pre-trained_embedding_stacked(2)_RNN | 0.34 | 0.42 | 0.44 | 10.42 |
| 10 | Pre-trained_embedding_LSTM | 0.39 | 0.49 | 0.46 | 11.50 |
| 11 | Pre-trained_embedding_GRU | 0.29 | 0.45 | 0.46 | 11.85 |
| 12 | Pre-trained_embedding_1D_CNN_MaxPooling | 0.45 | 0.47 | 0.41 | 3.84 |
| 13 | Pre-trained_embedding_1D_CNN_GobalMaxPooling | 0.46 | 0.49 | 0.48 | 4.45 |

As can be seen in Table 1, none of the models performs significantly better than the naïve_model in terms of accuracy, which is discouraging. The best-performing model is a simple pre-trained embedding layer with the classifier on top. The F1 score for this model, 0.47, is significantly higher than the naïve model's 0.29, but still not good enough for real use on an unlabeled data set. Although all the models trained very fast, even using cross-validation, given the small data set, it is clear that the RNN, LSTM, and GRU models are much more computationally expensive and, in this case, do not achieve higher performance. The models with one-hot encoding underperform most of those with word embeddings on the F1 score, and using pre-trained word embeddings seems to contribute positively, as evidenced by model 7. Looking at the confusion matrix of model 7, it is clear that the model confuses "Task-searches" with "Both", providing a window into potential areas for further exploration.

Conclusion

The initial exploration of Deep Learning methods for classifying search queries into Task- vs Item-oriented queries illustrates the challenges of the task: 1) search queries are often a short combination of keywords and often lack the necessary context for extracting deep semantic meaning; 2) many contain typos or slight variations, making it harder for a machine to realize they are the same query; 3) the classes are ill-defined: a query that is "Both" is at the same time a "Task-search", probably sharing the same cues and type of language, which makes the classes hard to separate from one another; and 4) the data set is too small to allow good learning of custom embeddings, and although pre-trained word embeddings seem to help, more context is needed for the algorithm to learn valuable information. This suggests several potential areas for improvement: 1) make the task a binary classification problem by either removing "Both" from the data set or treating "Both" and "Task-searches" as equivalent; 2) try character-level encoding to provide richer representations of the data and enable the power of temporal processing by RNNs and 1D-CNNs; 3) feature-engineer additional variables, such as the search volume behind these queries (as a measure of popularity) and a journey-mapping of the search query (I have developed another algorithm that relies on an intent ontology to classify these queries into buckets corresponding to different stages of the consumer journey); this use of external data beyond the language in the text might provide additional and valuable context;
4) try shallow learning methods such as Random Forest and SVM, and consider combining them (ensembles of models) to make a concerted prediction; and 5) optimize the hyperparameters of promising models after exploring the opportunities from points 1–4.

References

Francois Chollet, 2018. Deep Learning with Python. Shelter Island, NY: Manning. [ISBN-13: 978-1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-notebooks.git

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, 2017. An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678v4 [cs.CV]

Sepp Hochreiter and Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation, 9(8), pp. 1735–1780.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs.CL]