Binary Search Query Classifier with Ensemble Models
Esteban Ribero – Chicago, March 2020
Abstract
In this paper, I develop a custom binary classifier of search queries for the makeup category
using different Machine Learning techniques and models. An extensive exploration of shallow and
Deep Learning models was performed within a cross-validation framework to identify the top three
models, optimize them by tuning their hyperparameters, and finally create an ensemble of models
with a custom decision threshold that outperforms all other models. The final classifier achieves an
accuracy of 98.83% on a held-out test set, making it ready for production. The conclusions confirm some of
the common wisdom in Machine Learning regarding the size and quality of data, shallow vs Deep
Learning models, and ensembles vs individual models.
Introduction
Search data from search engines such as Google and Bing are a great source of information
about consumers’ needs and opportunities for content development and strategic media placement
for marketers. However, the search data that is available to marketers is sometimes limited to lists
of keywords with estimates of their search volume, cost per click, and competitive level. To expand
and complement these data, the Intent Lab – a research partnership between Performics (a leading
performance marketing agency) and Northwestern University’s Medill School of Journalism –
performs primary research with consumers to better understand their intent when using search
engines.
In a recent study (to be published), The Intent Lab asked more than 1700 consumers of the
makeup category to write a search query that they would use to research the category when
looking to refresh their makeup. Each respondent submitted a search query and was asked to
classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit”)
or a Task-query (used to learn more about how to accomplish a broad goal such as “find new
makeup look”). It turns out that knowing what type of query a searcher is performing provides clear
guidance into the type of content that should be served to those consumers. To expand the
learnings beyond the study and build a practical application to automatically label thousands of
search queries as Goal-oriented (Task-search) vs Item-oriented (Item-search), it is necessary to
develop and train a custom search-query classifier. This classifier can then be used to analyze the
publicly available data provided to marketers by the type of search query on an ongoing basis. The
purpose of this paper is to report on the development of such a classifier.
Literature review
Deep Learning has become a popular toolkit for Data Scientists trying to use the latest in
Machine Learning and AI (Sejnowski, 2018). Natural Language Processing (NLP) has been one of the
areas much influenced by these recent developments, although the field is vast and includes several
traditional approaches (Lane, Howard, and Hapke, 2019) that could be reframed nowadays as
shallow learning (Chollet, 2018). Classifying search queries into distinct buckets can be conceived as
a case of Supervised Learning, and the range of available models suited to the task is vast. Classifying queries
into Task-searches or Item-Searches is a binary classification problem and well suited for a variety of
techniques. It is often advised to try simpler, shallow learning models before moving on to more
complex deep learning models (Chollet, 2018). The following section describes some of the most
common models.
Logistic Regression is commonly used to estimate the probability that an instance belongs
to a particular class. The model computes a weighted sum of the input features (plus a bias term),
but instead of outputting the result directly like the Linear Regression model does, it outputs the
logistic of this result (Géron, 2019). The logistic is a sigmoid function whose S-shaped curve squashes the
input (the weighted sum of input features plus bias, in this case) into a value
between 0 and 1. Once this probability is estimated, a prediction can be made by choosing a
threshold and assigning the observation to either one class or the other. The threshold is by default
0.5 but this can be adjusted if needed. This sigmoid function is also often used in Deep Learning as
the final layer of a neural network binary classifier. A great feature of Logistic Regression is the
ability to identify the features that drive the classifier's decision; the coefficients of the
regression convey this information (Géron, 2019).
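As an illustration, the sketch below fits a Logistic Regression on TF-IDF features of a few made-up queries, reads out class probabilities, applies the default 0.5 threshold, and inspects the coefficients. The vectorization choice, the toy data, and the variable names are assumptions for the sake of the example, not the setup used later in this paper.

```python
# Hypothetical sketch: TF-IDF features + Logistic Regression for Item vs Task queries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

queries = ["makeup kit", "lipstick", "how to refresh my makeup", "find new makeup look"]
labels = [0, 0, 1, 1]  # 0 = Item-search, 1 = Task-search (made-up labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(queries)

clf = LogisticRegression()  # default L2 penalty, C=1.0
clf.fit(X, labels)

# Estimated probability of the positive class, then a default 0.5-threshold decision
proba = clf.predict_proba(vectorizer.transform(["best mascara"]))[:, 1]
prediction = (proba >= 0.5).astype(int)

# Coefficients indicate which terms push the decision toward each class
term_weights = dict(zip(vectorizer.get_feature_names_out(), clf.coef_[0]))
```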
Support Vector Machines (SVMs) are powerful ML models that are particularly well suited for
classifying complex small or medium-sized datasets (Géron, 2019). An SVM places a decision boundary
between the two classes and relies on the instances closest to that boundary, the "support" vectors,
to push the boundary as far away from each class as possible, producing a
"large margin classification" (Géron, 2019). Unfortunately, unlike the other binary classifiers discussed here, SVM
classifiers do not output class probabilities by default, which makes them hard to use as part of an ensemble
of models that averages the prediction probabilities at inference time to reach a pooled
classification decision (soft voting). However, they can be used effectively in ensembles if hard
voting is used instead.
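For instance, a hard-voting ensemble that includes an SVC could be sketched as follows in Scikit-Learn; the estimator choices and the commented-out data names are illustrative assumptions.

```python
# Hypothetical sketch: hard (majority) voting lets an SVC join an ensemble
# even though it does not expose class probabilities by default
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

hard_ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC())],
    voting="hard",  # majority class vote; "soft" voting would require predict_proba
)
# SVC(probability=True) would add Platt-scaled probabilities, at extra training cost,
# which is what soft voting needs.
# hard_ensemble.fit(X_train, y_train); hard_ensemble.predict(X_test)  # hypothetical data
```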
Random Forest is among the most popular ML algorithms. It is an ensemble of Decision
Trees, where each tree is usually trained on a subset of the samples selected at random, most often
with bootstrapping (sampling with replacement). The algorithm can also randomly choose the features available
to each tree. This introduces extra randomness when growing trees because, instead of searching
for the very best feature when splitting a node, it searches for the best feature among a random
subset of features (Géron, 2019). Random Forests have several hyperparameters that can be used to
fine-tune the model. The method provides probability estimates as well as specific class
predictions, and it also provides feature importances.
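A minimal sketch of a Random Forest on toy bag-of-words counts, showing the probability estimates and impurity-based feature importances mentioned above; the data and hyperparameter values are illustrative, not taken from this study.

```python
# Hypothetical sketch: a Random Forest with bootstrapped samples and random feature subsets
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])  # toy bag-of-words counts
y = np.array([0, 1, 0, 1])

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # random subset of features considered at each split
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    random_state=42,
)
rf.fit(X, y)
class_probabilities = rf.predict_proba(X)  # probability estimates per class
importances = rf.feature_importances_      # impurity-based feature importances
```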
Gradient Boosting is another ensemble method that has become extremely popular among
Data Scientists (Chollet, 2018). Gradient Boosting works by sequentially adding predictors, usually
trees, to an ensemble, each one correcting its predecessor. The method tries to fit the new
predictor to the residual errors made by the previous predictor. Unlike Random Forest, where each
tree is grown independently of the others, Gradient Boosting trains trees sequentially. This makes the
method slower, but it tends to perform better. Fortunately, Tianqi Chen and Carlos Guestrin
(2016) developed an optimized system for gradient boosting that drastically improves
speed and scalability. The system is called Extreme Gradient Boosting (XGBoost) and is freely
available via many open-source packages.
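A minimal sketch using the XGBoost package's Scikit-Learn wrapper on toy data; the hyperparameter values shown are illustrative defaults and not the settings tuned later in this paper.

```python
# Hypothetical sketch using the xgboost package's Scikit-Learn wrapper
import numpy as np
from xgboost import XGBClassifier

X = np.array([[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 1]])  # toy features
y = np.array([0, 1, 0, 1])

xgb = XGBClassifier(
    n_estimators=100,   # boosting rounds: trees are added sequentially
    max_depth=3,        # depth of each tree
    learning_rate=0.3,  # shrinkage applied to each new tree's contribution
)
xgb.fit(X, y)
task_probabilities = xgb.predict_proba(X)[:, 1]
```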
Deep Learning methods for text classification can be divided into two groups: simple Neural Nets
(NNs) that do not take word order into consideration (BOW models) and models that do. Simple NNs
are stacks of fully connected layers with a sigmoid classifier on top. Models that do take into
account the context and/or word order are: 1) 1D-CNNs, useful to extract ordered patterns in
sequences of words regardless of where they appear in the text, and 2) Recurrent NN (RNNs) that
treat the word inputs as time-based representations where the input in the present is combined
with all previous inputs from the past. There are several variations of RNNs that have been designed
over the years to overcome some obstacles like the exploding or vanishing gradient problem. The
most popular ones are the Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and
the Gated Recurrent Unit (GRU) (Cho et al. 2014). There are other types of Deep Learning
techniques for language processing and understanding called Transformers that rely on attention
mechanisms that help the models weigh the importance of the surrounding sequences. One of the
most popular is BERT and its variations (Devlin, Chang, Lee, Toutanova, 2018). As with many Deep
Learning problems, it is not always clear what is the best approach for a given problem, and so it is
often suggested to start with the simplest solution and gradually increase the complexity by trying
different approaches (Chollet, 2018).
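As a point of reference for the order-aware variants discussed above, the sketch below shows a minimal Keras binary text classifier with an embedding layer, an LSTM layer, and a sigmoid output; the layer sizes, optimizer, and loss are illustrative assumptions rather than settings taken from this study.

```python
# Hypothetical Keras sketch: embedding + LSTM + sigmoid binary classifier
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=5000, output_dim=100),  # learned word vectors (sizes are illustrative)
    LSTM(64),                                   # order-aware recurrent layer
    Dense(1, activation="sigmoid"),             # binary classifier on top
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_sequences, labels, epochs=10, validation_split=0.1)  # hypothetical data
```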
In the following section, I describe the approach taken to explore these different
techniques to classify search queries performed by consumers of the makeup category and
eventually narrow down on the best solution for this task.
Method
Initial Data Set. The Intent Lab study described in the introduction asked more than 1700
consumers of the makeup category to write a search query that they would use to research the
category when looking to refresh their makeup. Each respondent submitted a search query and was
asked to classify the query as being an Item-query (one used to look for a specific item, such as
“makeup kit” ) or Task-query (used to learn more about how to accomplish a broad goal such as
“find new makeup look”). Respondents were also provided with two more options: the search
query is “Both” a task and item search or “I don’t know”. Many respondents used the same search
query, but they did not always agree as to what type of query it was. This means the initial data is
noisy. To resolve these discrepancies, I used majority voting to decide what type of query it was and
assigned the label with the majority of votes. When there was a tie between any choice and “Both”,
I selected “Both”, since it encompasses the other two options. When the tie was between Task and
Item, I chose Task, the more general option, unless it was obvious from the search query that the
individual was asking for a specific item. After cleaning for duplicates with disagreement,
irrelevant queries (“abc def ceb”), and removing “I don’t know”s, I ended up with n=954 queries. I
performed an initial exploration of methods with these data (reported in a separate report and
provided here in the appendix) and concluded that more data was needed.
Final Data Set. To simplify the task into a binary classification challenge, I reclassified the
queries with a “Both” label into either a Task or an Item search. Most of those queries ended up
being Task-searches, which are the more abstract of the two. Item-searches, by definition, mention a
specific item, such as “mascara”, “foundation”, or “lipstick”, so it was relatively easy to identify
them. I also looked at the ones labeled “I don’t know” and was able to correctly classify them into
one or the other class. This helped me expand the workable set of search queries to around 1200,
but most (71%) ended up being Task-searches, probably due to the way the question was asked,
which primed consumers to think about a broader goal rather than a specific item. The question
was something equivalent to: “what would you type in the search bar if you were going to use a
search engine to look for information to help you refresh your makeup look?”. To balance the data
set and expand it even further, I manually identified 1150 Item-searches and 650 Task-searches from
a list of >20,000 keywords downloaded from Google Keyword Planner. The resulting and final data
set is a perfectly balanced list of 1500 Item-searches and 1500 Task-searches for the makeup
category, for a total of n = 3000. I randomly split the data set into a Development set (for K-fold
cross-validation) and a Test set with an 80/20 split.
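A minimal sketch of such an 80/20 split with Scikit-Learn; the placeholder queries, labels, and random seed are illustrative assumptions standing in for the 3000 labeled queries.

```python
# Hypothetical sketch of the 80/20 Development/Test split
from sklearn.model_selection import train_test_split

queries = ["makeup kit", "mascara", "lipstick", "how to refresh my makeup", "find new makeup look"]
labels = [0, 0, 0, 1, 1]  # placeholders for the 3000 labeled queries

X_dev, X_test, y_dev, y_test = train_test_split(
    queries, labels,
    test_size=0.20,   # 80/20 Development/Test split
    random_state=42,  # the seed is an assumption; stratify=labels would preserve the 50/50 balance
)
```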
Models. I trained three sets of models:
A) The four “shallow learning” models described above: 1 Logistic Regression; 2 Random
Forest; 3 Extreme Gradient Boosting; and 4 Support Vector Classifier (SVC). I used the
default settings for these models as provided by the Python Scikit-Learn library.
B) Several Deep Learning models using word vectorization and different topologies/types
of layers: A basic NN with a single layer with 100 nodes using one-hot-encoding of the
input sequences (5 Baseline_NN_100_nodes). The same model using a trainable
embedding layer of shape (2000, 100) (6 Embedding_1_NN_100_nodes); there were
only 1556 words in the vocabulary, but I set the dimension to 2000 to avoid collisions
when hashing. A model with a simple RNN layer with 100 nodes and a pre-trained word
embedding using GloVe vectors of size 100 (7 Pre-trained_embedding
_simple_RNN_100). The same model with a learnable embedding layer (8
Embedding_simple_RNN_100). An LSTM layer with 100 nodes and the same learnable
embedding layer (9 Embedding_1_LSTM_layer_100). And a 1D-CNN layer of size 100
with a window size of 3 and Global Max Pooling before the classifier (10 Embedding_
1D_CNN_100_GobalMaxPooling); a sketch of this last topology appears after this list. I
used sequences of size 10 with padding for all embedding layers in the models above.
C) Two character-level Deep Learning models, both with an embedding layer of size 50
and 100, and padded sequences of size 40. One has a single 1D-CNN layer of size 32 with
a window of 10 characters and Max Pooling with 3 filters (11 Char_Embedding_1D_
CNN_32_10_MaxPooling_3). The other has two stacks of 1D-CNNs of size 100 with Max
Pooling with three filters in between and a final Global Max Pooling layer before the
classifier (12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling). This is the
most complex model of all.
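As referenced in item B above, here is a hypothetical Keras sketch of the word-level 1D-CNN topology (model 10): an embedding of shape (2000, 100) over padded sequences of length 10, a Conv1D layer with 100 filters and a window of 3, Global Max Pooling, and a sigmoid classifier. The activation, optimizer, and loss are assumptions not stated in the text.

```python
# Hypothetical Keras sketch of model 10: embedding + 1D-CNN (window 3) + Global Max Pooling
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

cnn_model = Sequential([
    Embedding(input_dim=2000, output_dim=100),               # word embedding of shape (2000, 100)
    Conv1D(filters=100, kernel_size=3, activation="relu"),   # 100 filters over word windows of 3
    GlobalMaxPooling1D(),                                    # keep the strongest response per filter
    Dense(1, activation="sigmoid"),                          # Task vs Item classifier
])
cnn_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Inputs are padded word-index sequences of length 10, as described above.
```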
Apart from the models above, I created a random classifier for reference; it is shown in Table
1 along with the other models. After training these models, I chose the three winning models
and optimized them by tweaking their hyperparameters. I trained 6 versions of the Logistic Regression
and 10 versions of the XGBoost model. I did not tweak the Baseline NN, given the extensive
exploration of different topologies and types of layers already performed for the Deep Learning
models.
Metrics. I used Kfold (4 folds) cross-validation with the Development data set and estimated
the average Train and Validation Accuracy across the folds. I also estimated the average ROC-AUC
score on the Validation sets, and I estimated the total cross-validation training time. I then retrained
the three winning models using all the available training data and evaluated them on the holdout test
data set using Accuracy, F1 Score, and ROC-AUC. I used a virtual machine with a GPU via
Google Colaboratory for all the experiments.
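A minimal sketch of how such a 4-fold cross-validation loop could be set up with Scikit-Learn; the pipeline, placeholder data, and random seed are assumptions, not the exact code used in this study.

```python
# Hypothetical sketch of 4-fold cross-validation reporting accuracy and ROC-AUC
from sklearn.model_selection import KFold, cross_validate
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

dev_queries = ["makeup kit", "lipstick", "how to refresh my look", "find new makeup look"] * 10
dev_labels = [0, 0, 1, 1] * 10  # placeholders for the development set

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
cv = KFold(n_splits=4, shuffle=True, random_state=42)

scores = cross_validate(
    pipeline, dev_queries, dev_labels,
    cv=cv,
    scoring=["accuracy", "roc_auc"],
    return_train_score=True,
)
mean_train_acc = scores["train_accuracy"].mean()
mean_val_acc = scores["test_accuracy"].mean()
mean_val_auc = scores["test_roc_auc"].mean()
```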
Model ensembling. In a final exploration of the best models, I ensembled the predictions of
the top three and top two classifiers by averaging their prediction probabilities and identified the
optimal decision threshold to maximize the true-positive rate while keeping the false-positive rate as
low as possible.
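A minimal sketch of this ensembling and thresholding step: average the predicted Task-search probabilities of two models and apply the chosen cutoff. The probability arrays below are made-up placeholders.

```python
# Hypothetical sketch: soft-voting ensemble of two models plus a custom decision threshold
import numpy as np

# Made-up arrays of P(Task-search) on the test set from the two tuned models
p_logistic = np.array([0.95, 0.40, 0.81, 0.10])
p_xgboost = np.array([0.90, 0.35, 0.70, 0.20])

p_ensemble = (p_logistic + p_xgboost) / 2.0  # average the prediction probabilities

threshold = 0.78  # custom cutoff chosen from the ROC analysis
predictions = (p_ensemble >= threshold).astype(int)  # 1 = Task-search, 0 = Item-search
```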
Results
Table 1. Model comparison (K-fold cross-validation results)

| Model | Train accuracy | Validation accuracy | Validation ROC-AUC | Training time (seconds) |
| --- | --- | --- | --- | --- |
| Random_Classifier | 0.495 | 0.505 | 0.523 | 0.069 |
| 1 Logistic_Regression | 0.987 | 0.980 | 0.996 | 0.577 |
| 2 Random_Forest | 1.000 | 0.973 | 0.993 | 4.894 |
| 3 XGBoost | 0.984 | 0.983 | 0.995 | 24.493 |
| 4 SVC | 0.991 | 0.983 | 0.993 | 70.915 |
| 5 Baseline_NN_100_nodes | 0.997 | 0.976 | 0.995 | 13.157 |
| 6 Embedding_1_NN_100_nodes | 0.995 | 0.976 | 0.994 | 18.456 |
| 7 Pre-trained_embedding_simple_RNN_100 | 0.972 | 0.956 | 0.977 | 20.193 |
| 8 Embedding_simple_RNN_100 | 0.997 | 0.958 | 0.989 | 23.984 |
| 9 Embedding_1_LSTM_layer_100 | 0.998 | 0.960 | 0.987 | 39.607 |
| 10 Embedding_1D_CNN_100_GobalMaxPooling | 0.998 | 0.973 | 0.994 | 18.607 |
| 11 Char_Embedding_1D_CNN_32_10_MaxPooling_3 | 0.998 | 0.964 | 0.988 | 48.268 |
| 12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling | 0.996 | 0.965 | 0.990 | 77.049 |
As can be seen in Table 1, all models perform extremely well, surpassing 95% accuracy on
the validation set, and the ROC-AUC scores are all above 0.97. The top-performing models are
the Logistic Regression, with a ROC-AUC score of 0.996 and a validation accuracy of 0.980, and
the XGBoost, with a validation accuracy of 0.983 and a ROC-AUC of 0.995. The SVC follows these
models very closely but at a significantly higher computational cost. Of the Deep Learning models,
the simplest two NNs were the winners, with similar ROC-AUC scores but lower validation
accuracy. Although the most complex of all the models (model 12) achieves high performance, it did
not beat the simplest NN or even the worst-performing of the shallow learning models. This is
another case where less is more: the most traditional of all the classifiers (Logistic Regression)
won the top prize with the most cost-effective implementation. It took only 0.57 seconds to train.
Table 2. Hyperparameter tuning of the winning models (K-fold cross-validation results)
| Model | Train accuracy | Validation accuracy | Validation ROC-AUC | Training time (seconds) |
| --- | --- | --- | --- | --- |
| 1 Logistic_Regression | 0.9872 | 0.9804 | 0.9958 | 0.577 |
| 2 Logistic_Regression_C_0.5 | 0.9832 | 0.9788 | 0.9954 | 0.345 |
| 3 Logistic_Regression_C_1.5 | 0.9894 | 0.9825 | 0.9959 | 0.486 |
| 4 Logistic_Regression_C_2 | 0.9907 | 0.9829 | 0.9960 | 0.476 |
| 5 Logistic_Regression_C_5 | 0.9953 | 0.9838 | 0.9959 | 0.561 |
| 6 Logistic_Regression_elasticnet_l1_0.5 | 0.9853 | 0.9838 | 0.9960 | 9.496 |
| 1 XGBoost | 0.9842 | 0.9833 | 0.9948 | 24.493 |
| 2 XGBoost_lr_1 | 0.9874 | 0.9813 | 0.9926 | 20.157 |
| 3 XGBoost_lr_0.05 | 0.9697 | 0.9679 | 0.9924 | 19.854 |
| 4 XGBoost_max_depth_2 | 0.9782 | 0.9779 | 0.9938 | 14.717 |
| 5 XGBoost_max_depth_4 | 0.9851 | 0.9842 | 0.9951 | 24.363 |
| 6 XGBoost_max_depth_5 | 0.9854 | 0.9829 | 0.9956 | 29.120 |
| 7 XGBoost_max_depth_5_200_trees | 0.9864 | 0.9825 | 0.9948 | 58.037 |
| 8 XGBoost_max_depth_4_200_trees | 0.9861 | 0.9829 | 0.9951 | 48.427 |
| 9 XGBoost_max_depth_5_L2_0.5 | 0.9857 | 0.9838 | 0.9955 | 29.010 |
| 10 XGBoost_max_depth_5_L2_5 | 0.9833 | 0.9800 | 0.9953 | 28.690 |
Table 2 shows the results of the different variations of the top-performing models to
identify the optimal settings for some of the hyperparameters. Looking at the results, an interesting
finding emerges: up to a point, the models with less regularization are the ones performing best.
The Logistic Regression with L2 regularization and a C value of 5 (vs C = 1, the default) is the top
performer, followed by the XGBoost with a max_depth of 5 (deeper trees), which tends to fit the
training data better but generalize worse. Presumably, this indicates that the data are very homogeneous
and the test set ends up being very similar to the train set.
Table 3 shows the final results of the winning models and of the ensembles of the top 2 and top
3 models, with the standard 0.5 threshold as well as the optimized threshold. The winner among all
approaches appears to be an ensemble of the optimized Logistic Regression and the optimized
XGBoost with a custom decision threshold of 0.78: beyond this probability threshold, the search
query is classified as a Task-search. The accuracy of the final classifier is 98.83%, with an F1
Score of the same value. Figures 1 and 2 show the confusion matrix and the ROC curve for the
winning model(s).
Table 3. Final results on the test data set

| Model | Accuracy | F1 Score | ROC-AUC |
| --- | --- | --- | --- |
| Logistic_Regression_C_5 | 0.9800 | 0.9800 | 0.9966 |
| XGBoost_max_depth_5 | 0.9800 | 0.9800 | 0.9954 |
| Baseline_NN_100_nodes | 0.9767 | 0.9767 | 0.9954 |
| Ensemble top 3 | 0.9800 | 0.9800 | 0.9968 |
| Ensemble top 2 | 0.9817 | 0.9817 | 0.9962 |
| Ensemble top 2 (threshold = 0.65) | 0.9867 | 0.9867 | 0.9962 |
| Ensemble top 2 (threshold = 0.78) | 0.9883 | 0.9883 | 0.9962 |

Figures 1 and 2. Confusion Matrix and ROC curve for the Ensemble top 2 with a 0.78 threshold
Conclusion
An extensive exploration of Machine Learning models was used to develop a robust and
highly accurate classifier that can be put into production right away. This exploration of methods
and data shows a couple of interesting learnings and reinforces some of the common wisdom in
Machine Learning: 1) Quality and quantity of data are key. The difference between the first
exploration with the original data set (see the appendix for reference), where the models performed
only marginally better than a naïve classifier, and the performance on the cleaner and
enhanced data set used here is truly remarkable. 2) Shallow learning models often outperform Deep
Learning models, and one should not assume Deep Learning is always better. 3) More regularization
does not always perform best; it appears to depend on the situation, and the
default settings are often good enough. 4) An ensemble of good and diverse models outperforms
any single one. And 5) a custom decision threshold may maximize the performance of a binary
classifier and give some flexibility in the tradeoff between true positives and false positives.
References
Francois Chollet, 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. [ISBN-13: 978-
1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-
notebooks.git
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Available at arXiv:1810.04805v2
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. An Analysis of Deep Neural Network
Models for Practical Applications. Available at: arXiv:1605.07678v4 [cs.CV]
Sepp Hochreiter & Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation
Volume 9 | Issue 8 | November 15, 1997 p.1735-1780
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, Yoshua Bengio, 2014. "Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
Terrence J. Sejnowski, 2018. The Deep Learning Revolution. MIT Press
Aurélien Géron, 2019. Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow. 2nd
Edition. O’Reilly Media, Inc.
Tianqi Chen, Carlos Guestrin, 2016. XGBoost: A Scalable Tree Boosting System.
arXiv:1603.02754v3 [cs.LG]
Appendix – Previous report
Abstract
In this paper, I explore different NLP techniques to classify search queries made by
consumers on search engines such as Google and Bing when researching the makeup
category. I used one-hot-encoding, learnable and pre-trained word embeddings, as well as several
temporal-processing neural nets such as LSTMs, GRUs, and 1D-CNNs. None of the models appear to
significantly outperform a naïve classifier that always predicts the most frequent class. A discussion
of potential areas for further exploration is provided at the end.
Introduction
Natural Language Processing (NLP) and natural language understanding are a booming
research area and a field with lots of applications (Lane, Howard, Hapke, 2019). The ability to teach
machines to process text and use algorithms to understand and act on that understanding is
magical. In this paper, I will explore several Deep Learning techniques for a text classification task.
Literature review
Before using language, machines need to process it in a way that allows computation and
mathematical transformation. Vectorizing text is the first step. There are several techniques to
represent units of text (documents) for classification purposes. Lane, Howard, and Hapke in their
book Natural Language Processing in Action (2019) present the following common approaches:
One-hot-encoding, Term Frequency (TF), Term Frequency-Inverse Document
Frequency (TF-IDF), and word embeddings, among others. These can be divided into two broad types:
techniques that do not take into account the order (or context) of the words or characters, known
as Bag-of-Words (BOW) techniques, and techniques that do, such as word embeddings.
Bag-of-words representations tend to produce sparse, high-dimensional vectors, while word
embeddings produce dense, low-dimensional ones (Chollet, 2018). BOW vectors are hard-coded while
word embeddings are learned from data. There are several approaches to creating word
embeddings: they can be created while training the models for the classification task, or they can
be created separately using different techniques fitted on large data sets, such as the Wikipedia
corpus, Google News, Facebook content, etc., and are often made publicly available. The most
popular ones are Word2Vec, Doc2Vec, GloVe, and fastText (Lane, Howard, Hapke, 2019). More
recently BERT and variations have also become public (Devlin, Chang, Lee, Toutanova, 2018).
Representing the data as numerical vectors is only the beginning. Different Deep Learning
techniques that utilize the data to make predictions or to generate text have also been developed.
Some of the most common are 1D-CNNs and Recurrent Neural Networks (RNNs). Regular Deep
Neural Nets (DNNs) can also be used with text but they don’t fully leverage the order of the words
or sequences. 1D-CNNs are useful to extract ordered patterns in sequences of words regardless of
where they appear in the text. RNNs treat the word inputs as time-based representations where the
input in the present is combined with all previous inputs from the past. There are several variations
of RNNs that have been designed over the years to overcome some obstacles like the exploding or
vanishing gradient problem. The most popular ones are the Long Short Term Memory (LSTM)
(Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). As with
many Deep Learning problems, it is not always clear what is the best approach for a given problem,
and so it is often suggested to start with the simplest solution and gradually increase the complexity
by trying different approaches (Chollet, 2018).
In the following section, I describe the approach taken to explore these different
techniques to classify search queries performed by consumers of the makeup category into three
classes (Task-Oriented search queries, Item-Oriented search queries or Both Task-Item).
Method
Data. A study (to be published) from the Intent Lab – a research partnership between
Performics and Northwestern University’s Medill School of Journalism and Microsoft’s Bing – asked
more than 1700 consumers of the makeup category to write a search query they would use to
research the category when looking to refresh their makeup. Each respondent submitted a search
query and was asked to classify the query as being an Item-query (one used to look for a specific
item, such as “makeup kit” ) or Task-query (used to learn more about how to accomplish a broad
goal such as “find new makeup look”). Respondents were also provided with two more options: the
search query is “Both” a task and item search or “I don’t know”. Many respondents used the same
search query, but they did not always agree as to what type of query it was. This means the data is
noisy. To resolve these discrepancies, I used majority voting to decide what type of query it was and
assigned the label with the majority of votes. When there was a tie between any choice and “Both”,
I selected “Both”, since it encompasses the other two options. When the tie was between Task and
Item, I chose Task, the more general option, unless it was obvious from the search query that the
individual was asking for a specific item.
After cleaning for duplicates with disagreement, irrelevant queries (“abc def ceb”), and
removing “I don’t know”s, I ended up with n=954 queries. I split the data into Train (n=686),
Validation (n=77), and Test (n=191) data sets.
Sequence vectorizing. To test the effect of different types of representation, I trained two
models using one-hot matrices and a single hidden fully-connected layer (one with dropout of 0.5
and 32 nodes, the other without dropout and with 16 nodes), and 10 models (including two simple
models similar to the ones without embeddings) using an embedding layer of dimensions (1000,
100, 8). The size of the vocabulary is 606, but to avoid collisions I set the maximum to 1000. I used vectors
with padding of size 8. Most queries are of size 3, while the longest is of size 21; 8 was a somewhat
arbitrary choice, and I did not test different lengths in this set of experiments. The length of the
embedding vectors (100) was chosen to make it easy to use pre-trained GloVe word
embeddings.
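A minimal Keras sketch of this vectorization step (tokenize, cap the vocabulary at 1000, pad every query to length 8, and feed an embedding layer of output dimension 100); the example queries are placeholders.

```python
# Hypothetical Keras sketch of the vectorization step described above
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding

queries = ["makeup kit", "find new makeup look"]  # placeholder queries

tokenizer = Tokenizer(num_words=1000)  # cap the index space at 1000 (the vocabulary has 606 words)
tokenizer.fit_on_texts(queries)
sequences = tokenizer.texts_to_sequences(queries)
padded = pad_sequences(sequences, maxlen=8)  # pad/truncate every query to 8 tokens

embedding = Embedding(input_dim=1000, output_dim=100)  # the (1000, 100) embedding over length-8 inputs
```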
Models. Apart from the two simple baseline DNN models (trained with one-hot vectors)
described above, I tested the following models: 1) a naïve_model that always predicts Task-search
(the most common label, 46%), used as a reference; 2) and 3) described above (One-hot-
baseline_NN_16_nodes and One-hot-baseline_NN_32_w_dropout); 4) Embedding_solo, a model
that uses only the embedding layer, trained as part of the task;
5) Embedding_layer_NN_32_w_dropout, the same as model 4 with an additional dense layer of size
32 and dropout (0.5), comparable to model 3; 6) Embedding_1D_CNN_MaxPooling, a model with a
trainable embedding layer, a 1D CNN with a window size of 3, and a MaxPooling layer;
7) Pre_trained_embedding_solo, the same as model 4 but with pre-trained embeddings from GloVe;
8) Pre_trained_embedding_simple_RNN, model 7 with an additional SimpleRNN layer of size 8;
9) Pre_trained_embedding_stacked(2)_RNN, model 7 with two SimpleRNN layers stacked on top;
10) Pre_trained_embedding_LSTM, model 7 with an additional LSTM layer of size 8;
11) Pre_trained_embedding_GRU, model 7 with an additional GRU layer;
12) Pre_trained_embedding_1D_CNN_MaxPooling, model 7 with a 1D CNN and MaxPooling
(equivalent to model 6 but with pre-trained embeddings); and 13) Pre-
trained_embedding_1D_CNN_GobalMaxPooling, the same as model 12 but with Global Max Pooling
instead of MaxPooling1D.
Metrics. I used the weighted F1 metric and top-1 accuracy on the holdout test data set to
compare the performance of the models, as well as the time it takes to train each model under the K-
fold cross-validation framework. A confusion matrix was produced for each model, and I also estimated
the average accuracy on the validation sets across the K folds (4 folds). After cross-validation, I
trained each model on the full training data set with a 90%/10% train/validation split to get a
sense of the training performance. I used a virtual machine with a GPU via Google Colaboratory for
all experiments.
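A minimal sketch of these evaluation metrics with Scikit-Learn on made-up three-class labels (Item, Task, Both); the label arrays are placeholders.

```python
# Hypothetical sketch of the evaluation metrics: weighted F1, top-1 accuracy, and a confusion matrix
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix

y_true = [0, 1, 2, 1, 0, 2]  # 0 = Item, 1 = Task, 2 = Both (placeholders)
y_pred = [0, 1, 1, 1, 0, 2]

weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
top1_accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
```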
Results
Table 1. Model comparison

| Model | Weighted F1 score on test set | Accuracy on test set | Kfold-validation average accuracy | Training time (seconds) |
| --- | --- | --- | --- | --- |
| 1 Naïve_model | 0.29 | 0.46 | n/a | n/a |
| 2 One-hot-baseline_NN_16_nodes | 0.30 | 0.46 | 0.46 | 3.76 |
| 3 One-hot-baseline_NN_32_w_dropout | 0.31 | 0.43 | 0.46 | 3.84 |
| 4 Embedding_solo | 0.38 | 0.42 | 0.45 | 4.04 |
| 5 Embedding_layer_NN_32_w_dropout | 0.40 | 0.44 | 0.44 | 5.08 |
| 6 Embedding_1D_CNN_MaxPooling | 0.33 | 0.46 | 0.46 | 5.45 |
| 7 Pre-trained_embedding_solo | 0.47 | 0.49 | 0.45 | 3.01 |
| 8 Pre-trained_embedding_simple_RNN | 0.40 | 0.44 | 0.44 | 8.01 |
| 9 Pre-trained_embedding_stacked(2)_RNN | 0.34 | 0.42 | 0.44 | 10.42 |
| 10 Pre-trained_embedding_LSTM | 0.39 | 0.49 | 0.46 | 11.50 |
| 11 Pre-trained_embedding_GRU | 0.29 | 0.45 | 0.46 | 11.85 |
| 12 Pre-trained_embedding_1D_CNN_MaxPooling | 0.45 | 0.47 | 0.41 | 3.84 |
| 13 Pre-trained_embedding_1D_CNN_GobalMaxPooling | 0.46 | 0.49 | 0.48 | 4.45 |
As can be seen in Table 1, none of the models perform significantly better than the
naïve_model in terms of accuracy, which is discouraging. The best-performing model is a simple
pre-trained embedding layer with the classifier on top. The F1 score for this model, 0.47, is
significantly higher than the naïve model's 0.29, but still not good enough for real use on an unlabeled data set.
Although all the models trained very fast, even using cross-validation, given the small data set, it is
clear that the RNN, LSTM, and GRU models are considerably more computationally expensive and, in this
case, do not achieve higher performance. The models with one-hot encoding underperform most of
the ones with word embeddings on the F1 score. And using pre-trained word embeddings seems to
contribute positively as evidenced by model 7. Looking at the confusion matrix of model 7, it is clear
that the model confuses “Task-searches” with “Both” providing a window into potential areas for
further exploration.
Conclusion
The initial exploration of Deep Learning methods for classifying search queries into Task vs
Item-oriented queries illustrates the challenges of the task: 1) search queries are often a short
combination of keywords and often lack the necessary context to extract deep semantic meaning;
2) many contain typos or slight variations, making it harder for a machine to realize they are the
same queries; 3) the classes are ill-defined: a query that is “Both” is at the same time a “Task-
search”, probably sharing the same cues and type of language, making the classes hard to separate from one
another; and 4) the data set is too small to allow for good learning of custom embeddings, and although
pre-trained word embeddings seem to help, there is a need for more context to help the algorithm
learn valuable information.
This suggests a few potential areas for improvement: 1) make the task a binary
classification problem by either removing “Both” from the data set or making “Both” and “Task-
searches” equivalent; 2) try character-level encoding to provide richer representations of the data
and enable the power of temporal processing by RNNs and 1D-CNNs; 3) feature-engineer additional
variables, such as the search volume behind these queries (as a measure of popularity) and
journey-mapping of the search query (I have developed another algorithm that relies on an Intent-
ontology to classify these queries into different buckets corresponding to different stages in the
consumer journey); this use of external data outside of the language in the text might provide
additional and valuable context; 4) try shallow learning methods such as random forest and SVM
and consider combining them (ensembles of models) to make a concerted prediction; and 5) optimize
the hyperparameters of promising models after exploring the opportunities from points 1-4.
References
Francois Chollet, 2018. Deep Learning with Python. Shelter Island, N.Y.: Manning. [ISBN-13: 978-
1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-
notebooks.git
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. Available at arXiv:1810.04805v2
Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. An Analysis of Deep Neural Network
Models for Practical Applications. Available at: arXiv:1605.07678v4 [cs.CV]
Sepp Hochreiter & Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation
Volume 9 | Issue 8 | November 15, 1997 p.1735-1780
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger
Schwenk, Yoshua Bengio (2014). "Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation". arXiv:1406.1078 [cs.CL].
More Related Content

DOC
PDF
FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS
PDF
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
PDF
Opinion mining on newspaper headlines using SVM and NLP
PDF
On the benefit of logic-based machine learning to learn pairwise comparisons
PDF
Controlling informative features for improved accuracy and faster predictions...
PDF
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
PDF
A survey of modified support vector machine using particle of swarm optimizat...
FEATURE SELECTION AND CLASSIFICATION APPROACH FOR SENTIMENT ANALYSIS
A Survey Ondecision Tree Learning Algorithms for Knowledge Discovery
Opinion mining on newspaper headlines using SVM and NLP
On the benefit of logic-based machine learning to learn pairwise comparisons
Controlling informative features for improved accuracy and faster predictions...
USING NLP APPROACH FOR ANALYZING CUSTOMER REVIEWS
A survey of modified support vector machine using particle of swarm optimizat...

What's hot (19)

PDF
SUPPORT VECTOR MACHINE CLASSIFIER FOR SENTIMENT ANALYSIS OF FEEDBACK MARKETPL...
PDF
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
PDF
Sentiment Analysis Using Hybrid Approach: A Survey
PPTX
PhD defense
PDF
Multidirectional Product Support System for Decision Making In Textile Indust...
PDF
Streaming Analytics
PPTX
PhD Consortium ADBIS presetation.
PDF
Neural Network Based Context Sensitive Sentiment Analysis
PPTX
A Fuzzy Logic Intelligent Agent for Information Extraction
PDF
IRJET - Support Vector Machine versus Naive Bayes Classifier:A Juxtaposition ...
PDF
11.hybrid ga svm for efficient feature selection in e-mail classification
PDF
Hybrid ga svm for efficient feature selection in e-mail classification
PDF
Trading outlier detection machine learning approach
PDF
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
PDF
Query aware determinization of uncertain
PDF
IRJET- Text Document Clustering using K-Means Algorithm
DOCX
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
PDF
DeepSearch_Project_Report
DOC
DATA MINING.doc
SUPPORT VECTOR MACHINE CLASSIFIER FOR SENTIMENT ANALYSIS OF FEEDBACK MARKETPL...
TEXT SENTIMENTS FOR FORUMS HOTSPOT DETECTION
Sentiment Analysis Using Hybrid Approach: A Survey
PhD defense
Multidirectional Product Support System for Decision Making In Textile Indust...
Streaming Analytics
PhD Consortium ADBIS presetation.
Neural Network Based Context Sensitive Sentiment Analysis
A Fuzzy Logic Intelligent Agent for Information Extraction
IRJET - Support Vector Machine versus Naive Bayes Classifier:A Juxtaposition ...
11.hybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classification
Trading outlier detection machine learning approach
FAST FUZZY FEATURE CLUSTERING FOR TEXT CLASSIFICATION
Query aware determinization of uncertain
IRJET- Text Document Clustering using K-Means Algorithm
QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
DeepSearch_Project_Report
DATA MINING.doc
Ad

Similar to Binary search query classifier (20)

PPT
powerpoint
PDF
50120140504015
PDF
Introduction to active learning
PPT
i i believe is is enviromntbelieve is is enviromnt7.ppt
PDF
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
PDF
Active Learning Literature Survey
PDF
Machine learning with in the python lecture for computer science
PDF
50120140503003 2
PPTX
Introduction to Machine Learning
PDF
Introduction Machine Learning Syllabus
PDF
XGBoost @ Fyber
PPT
c23_ml1.ppt
PDF
Choosing a Machine Learning technique to solve your need
PPT
PPT-3.ppt
PPT
MAchine learning
PPT
Machine Learning Machine Learnin Machine Learningg
PDF
Technical Area: Machine Learning and Pattern Recognition
PPTX
digital image processing - classification
PPTX
Interactive ML.pptx XAI by Radhika selvamani
powerpoint
50120140504015
Introduction to active learning
i i believe is is enviromntbelieve is is enviromnt7.ppt
IRJET- Breast Cancer Relapse Prognosis by Classic and Modern Structures o...
Active Learning Literature Survey
Machine learning with in the python lecture for computer science
50120140503003 2
Introduction to Machine Learning
Introduction Machine Learning Syllabus
XGBoost @ Fyber
c23_ml1.ppt
Choosing a Machine Learning technique to solve your need
PPT-3.ppt
MAchine learning
Machine Learning Machine Learnin Machine Learningg
Technical Area: Machine Learning and Pattern Recognition
digital image processing - classification
Interactive ML.pptx XAI by Radhika selvamani
Ad

More from Esteban Ribero (8)

PDF
Conjoint analysis with mcmc
PDF
Campaign response modeling
PDF
Consumer Segmentation with Bayesian Statistics
PDF
Modeling Sexual Selection with Agent-Based Models
PDF
The Learning Lab
PDF
Brand Communications Modeling: Developing and Using Econometric Models in Adv...
PDF
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...
PDF
Is looking at consumers' brain the ultimate solution?
Conjoint analysis with mcmc
Campaign response modeling
Consumer Segmentation with Bayesian Statistics
Modeling Sexual Selection with Agent-Based Models
The Learning Lab
Brand Communications Modeling: Developing and Using Econometric Models in Adv...
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...
Is looking at consumers' brain the ultimate solution?

Recently uploaded (20)

PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PDF
Business Analytics and business intelligence.pdf
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PDF
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
PPTX
Business Acumen Training GuidePresentation.pptx
PDF
Mega Projects Data Mega Projects Data
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
PPTX
Introduction to Knowledge Engineering Part 1
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Database Infoormation System (DBIS).pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
1_Introduction to advance data techniques.pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
168300704-gasification-ppt.pdfhghhhsjsjhsuxush
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
Business Analytics and business intelligence.pdf
Qualitative Qantitative and Mixed Methods.pptx
Recruitment and Placement PPT.pdfbjfibjdfbjfobj
Business Ppt On Nestle.pptx huunnnhhgfvu
ALIMENTARY AND BILIARY CONDITIONS 3-1.pptx
Business Acumen Training GuidePresentation.pptx
Mega Projects Data Mega Projects Data
Introduction-to-Cloud-ComputingFinal.pptx
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
Introduction to Knowledge Engineering Part 1
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Reliability_Chapter_ presentation 1221.5784
Database Infoormation System (DBIS).pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
1_Introduction to advance data techniques.pptx

Binary search query classifier

  • 1. Binary Search Query Classifier with Ensemble Models Esteban Ribero – Chicago, March 2020 Abstract In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them tuning their hyperparameters, and finally creating an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production. The conclusions confirm some of the common wisdom in Machine Learning regarding size and quality of data, shallow vs Deep Learning and ensemble vs individual models. Introduction Search data from search engines such as Google and Bing are a great source of information about consumers’ needs and opportunities for content development and strategic media placement for marketers. However, the search data that is available to marketers is sometimes limited to lists of keywords with estimates of their search volume, cost per click, and competitive level. To expand and complement these data, the Intent Lab– a research partnership between Performics (a leading performance marketing agency) and Northwestern University’s Medill School of Journalism – performs primary research with consumers to better understand their intent when using search engines. In a recent study (to be published), The Intent Lab asked more than 1700 consumers of the makeup category to write a search query that they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit”) or a Task-query (used to learn more about how to accomplish a broad goal such as “find new makeup look”). It turns out that knowing what type of query a searcher is performing provides clear guidance into the type of content that should be served to those consumers. To expand the learnings beyond the study and build a practical application to automatically label thousands of
  • 2. search queries as Goal-oriented (Task-search) vs Item-oriented (Item-search), it is necessary to develop and train a custom search-query classifier. This classifier can then be used to analyze the publicly available data provided to marketers by the type of search query on an ongoing basis. The purpose of this paper is to report on the development of such classifier. Literature review Deep Learning has become a popular toolkit for Data Scientists trying to use the latest in Machine Learning and AI (Sejnowski, 2018). Natural Language Processing (NLP) has been one of the areas much influenced by these recent developments, although the field is vast and includes several traditional approaches (Lane, Howard, and Hapke, 2019) that could be reframed nowadays as shallow learning (Chollet, 2018). Classifying search queries into distinct buckets can be conceived as a case of Supervised Learning and the available models suited for the task is vast. Classifying queries into Task-searches or Item-Searches is a binary classification problem and well suited for a variety of techniques. It is often advised to try simpler, shallow learning models before moving on to more complex deep learning models (Chollet, 2018). The following section describes some of the most common models. Logistic Regression is commonly used to estimate the probability that an instance belongs to a particular class. The model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result (Géron, 2020). This is a typical sigmoid function that squashes the input (the weighted sum of input features plus bias in this case) into an S-shape form that outputs values between 0 and 1. Once this probability is estimated, a prediction can be made by identifying a threshold and assigning the observation to either one class or the other. The threshold is by default 0.5 but this can be adjusted if needed. This sigmoid function is also often used in Deep Learning as the final layer of a neural network binary classifier. A great feature of Logistic Regression is the ability to identify the important features leading to the classifier decision. The coefficients in the regression convey such information (Géron, 2020). Support Vector Machines are powerful ML models and are particularly well suited for classification of complex small or medium-sized datasets (Géron, 2020). It relies on two parallel
  • 3. vectors to a line that separates the classes in two. These “support” vectors make sure that the decision boundary that separates the classes is the one farther away from each class creating a “large margin classification” (Géron, 2020). Unfortunately, unlike the other binary classifiers, SVM classifiers do not output probabilities for each class making them hard to use as part of an ensemble of models that average the prediction probabilities at inference time to come up with a pooled classification decision (soft voting). However, they can be effectively used with ensembles if hard voting is used instead. Random Forrest is among the most popular ML algorithms. It is an ensemble of Decision Trees, where each tree is usually trained on a subset of the samples selected at random most often with bootstrapping (sampling with replacement). It can also randomly choose the features available to each tree. This introduces extra randomness when growing trees because instead of searching for the very best feature when splitting a node, it searches for the best features among a random subset of features (Géron, 2020). Radom Forrest have several hyperparameters that can be used to fine tune the model. The method provides estimates of probabilities as well as specific class prediction and also provides feature importance. Gradient Boosting is another ensemble method that has become extremely popular among Data Scientists (Chollet, 2019). Gradient Boosting works by sequentially adding predictors, usually trees, to an ensemble, each one correcting its predecessor. This method tries to fit the new predictor to the residual errors made by the previous predictor. Unlike Random Forrest where each tress is gown independent of the others, Gradient Boosting sequentially train trees. This makes the method slower but tends to perform better. Fortunately, in 2016 Tianqi Chen and Carlos Guestrin (2016) developed an optimized system to perform gradient boosting that drastically improves speed and scalability. The system is called Extreme Gradient Boosting (XGBoost) and is available free via many open source packages. Deep Learning Methods for text classification can be divided into two simple Neural Nets (NN) that do not take word order into consideration (BOW models) and those that do. Simple NN are simply stacks of fully connected layers with a sigmoid classifier on top. Models that do take into account the context and/or word order are: 1) 1D-CNNs, useful to extract ordered patterns in sequences of words regardless of where they appear in the text, and 2) Recurrent NN (RNNs) that
  • 4. treat the word inputs as time-based representations where the input in the present is combined with all previous inputs from the past. There are several variations of RNNs that have been designed over the years to overcome some obstacles like the exploding or vanishing gradient problem. The most popular ones are the Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). There are other types of Deep Learning techniques for language processing and understanding called Transformers that rely on attention mechanisms to help the models weight the importance of the surrounding sequences. One of the most popular is BERT and its variations (Devlin, Chang, Lee, Toutanova, 2018). As with many Deep Learning problems, it is not always clear what is the best approach for a given problem, and so it is often suggested to start with the simplest solution and gradually increase the complexity by trying different approaches (Chollet, 2018). In the following section, I describe the approach taken to explore these different techniques to classify search queries performed by consumers of the makeup category. And eventually narrow down on the best solution for this task. Method Initial Data Set. The Intent Lab study described in the introduction asked more than 1700 consumers of the makeup category to write a search query that they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify the query as being an Item-query (one used to look for a specific item, such as “makeup kit” ) or Task-query (used to learn more about how to accomplish a broad goal such as “find new makeup look”). Respondents were also provided with two more options: the search query is “Both” a task and item search or “I don’t know”. Many respondents used the same search query, but they did not always agree as to what type of query it was. This means the initial data is noisy. To solve these discrepancies, I use majority voting to decide what type of query it was and assigned the label with the majority of votes. When there was a tie between any choice and “Both” I selected both since it encompasses the other two options and when the tie was between Task or Item I chose Task since it is the more general one unless it was obvious from the search query that the individual was asking for a specific item. After cleaning for duplicates with disagreement, irrelevant queries (“abc def ceb”), and removing “I don’t know”s, I ended up with n=954 queries. I
  • 5. performed an initial exploration of methods with these data (reported in a separate report and provided here in the appendix) and concluded, that more data was needed. Final Data Set. To simplify the task into a binary classification challenge, I reclassified the queries with a “Both” label into either a Task or an Item search. Most of those queries ended up being Task-searches which are the more abstract of the two. Item searches, by definition, mention a specific item, such a “mascara”, “foundation”, “lipstick”, etc. so it was relatively easy to identify them. I also looked at the ones labeled “I don’t know” and was able to correctly classify them into one or the other class. This helped me expand the workable set of search queries to around 1200 but most (71%) ended up being Task-searches probably due to the way the question was asked which primed consumers to think about a broader goal more so than a specific item. The question was something equivalent to: “what would you type in the search bar if you were going to use a search engine to look for information to help you refresh your makeup look?”. To balance the data set and expand it even further I manually identified 1150 Item-searches and 650 Task-searches from a list of >20,000 keywords downloaded from Google Keyword Planner. The resulting and final data set is a perfectly balanced list of 1500 Item-searches and 1500 Task-searches for the makeup category for a total of n = 3000. I randomly split the data set into a Development set (for Kfold cross-validation) and a Test set with a 80/20 split. Models. I trained three sets of models: A) The four “shallow learning” models described above: 1 Logistic Regression; 2 Random Forrest; 3 Extreme Gradient Boosting; and 4 Support Vector Classifier (SVC). I used the default settings for these models provided by the Python Skit-Learn library. B) Several Deep Learning models using word vectorization and different topologies/types of layers: A basic NN with a single layer with 100 nodes using one-hot-encoding of the input sequences (5 Baseline_NN_100_nodes). The same model using a trainable embedding layer of shape (2000, 100) (6 Embedding_1_NN_100_nodes. There were only 1556 words in the vocabulary but set the dim to 2000 to avoid collisions when hashing. A model with a simple RNN layer with 100 nodes and a pre-trained word embedding using GloVe vectors of size 100 (7 Pre-trained_embedding _simple_RNN_100). The same model with a learnable embedding layer (8
  • 6. Embedding_simple_RNN_100). An LSTM layer with 100 nodes and the same learnable embedding layer (9 Embedding_1_LSTM_layer_100). And a 1D-CNN layer of size 100 with a window size 3 and Global Max Pooling before the classifier (10, Embedding_ 1D_CNN_100_GobalMaxPooling). I used sequences of size 10 with padding for all embedding layers in the models above. C) Two character-level Deep Learning models. Both with an embedding layer of size 50 and 100, and padded sequence of size 40. One has a single 1D-CNN layer of size 32 with a window of 10 characters and Max Pooling with 3 filters (11 Char_Embedding_1D_ CNN_32_10_MaxPooling_3). The other has two stacks of 1D-CNNs of size 100 with Max Polling with three filters in between and a final Global Max Pooling layer before the classifier. (12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling). This is the most complex model of all. Apart from the models above I created a random classifier for reference and shown in Table 1 along with the other models. After training the models above I chose the three winning models and optimized them tweaking the hyperparameters. I trained 6 versions of the Logistic Regression and 10 versions of the XGBoost model. I did not tweak the Baseline NN given the extensive exploration of different topologies and types of layers for the Deep Learning models performed already. Metrics. I used Kfold (4 folds) cross-validation with the Development data set and estimated the average Train and Validation Accuracy across the folds. I also estimated the average ROC-AUC score on the Validation sets and estimated the total cross-validation training time. I trained again the 3 winning models using all the available training data and evaluated them on the holdout test data set using Accuracy, F1 Score, and the ROC-AUC score. I used a virtual machine with GPU via Google Collaboratory for all the experiments. Model ensembling. In a final exploration of the best models, I ensembled the predictions of the top three and top 2 classifiers by averaging the prediction probabilities and identified the optimal threshold to maximize the true positive rate while keeping the false-negative rate to its minimum possible.
  • 7. Results Table 1. Model comparison As can be seen in Table 1, all models perform extremely well surpassing 95% accuracy on the validation set and the ROC_AUC scores are all above 0.97. The top-performing models are the Logistic Regression with a ROC-AUC score of 0.996 and a validation accuracy of 0.980 and the XGBoost with a validation accuracy of 0.983 and a ROC-AUC of 0.995. The SVC follows these models very closely but with a significantly higher computation cost. Of the Deep Learning models, the simplest two NN were the winners with similar ROC-AUC scores but with lower validation accuracy. Although the most complex of all the models (model 12) achieve high performance it did not beat the simplest NN or even the worst performing of the shallow learning models. This is another case where less is more and the most traditional of all the classifiers (Logistic Regression) won the top prize with the most cost-effective implementation. It only took 0.57 seconds to train. Table 2. Hyperparameter tuning of the winning models Model Train accuracy Validation accuracy Validation ROC-AUC Training time (seconds) Random_Classifier 0.495 0.505 0.523 0.069 1 Logistic_Regression 0.987 0.980 0.996 0.577 2 Random_Forrest 1.000 0.973 0.993 4.894 3 XGBoost 0.984 0.983 0.995 24.493 4 SVC 0.991 0.983 0.993 70.915 5 Baseline_NN_100_nodes 0.997 0.976 0.995 13.157 6 Embedding_1_NN_100_nodes 0.995 0.976 0.994 18.456 7 Pre-trained_embedding_simple_RNN_100 0.972 0.956 0.977 20.193 8 Embedding_simple_RNN_100 0.997 0.958 0.989 23.984 9 Embedding_1_LSTM_layer_100 0.998 0.960 0.987 39.607 10 Embedding_1D_CNN_100_GobalMaxPooling 0.998 0.973 0.994 18.607 11 Char_Embedding_1D_CNN_32_10_MaxPooling_3 0.998 0.964 0.988 48.268 12 Char-Embedding_2_1D-CNN_MaxPooling_GlobalMaxPooling 0.996 0.965 0.990 77.049 Kfold-Cross Validation Results Model Train accuracy Validation accuracy Validation ROC-AUC Training time (seconds) 1 Logistic_Regression 0.9872 0.9804 0.9958 0.577 2 Logistic_Regression_C_0.5 0.9832 0.9788 0.9954 0.345 3 Logistic_Regression_C_1.5 0.9894 0.9825 0.9959 0.486 4 Logistic_Regression_C_2 0.9907 0.9829 0.9960 0.476 5 Logistic_Regression_C_5 0.9953 0.9838 0.9959 0.561 6 Logistic_Regression_elasticnet_l1_0.5 0.9853 0.9838 0.9960 9.496 1 XGBoost 0.9842 0.9833 0.9948 24.493 2 XGBoost_lr_1 0.9874 0.9813 0.9926 20.157 3 XGBoost_lr_0.05 0.9697 0.9679 0.9924 19.854 4 XGBoost_max_depth_2 0.9782 0.9779 0.9938 14.717 5 XGBoost_max_depth_4 0.9851 0.9842 0.9951 24.363 6 XGBoost_max_depth_5 0.9854 0.9829 0.9956 29.120 7 XGBoost_max_depth_5_200_trees 0.9864 0.9825 0.9948 58.037 8 XGBoost_max_depth_4_200_trees 0.9861 0.9829 0.9951 48.427 9 XGBoost_max_depth_5_L2_0.5 0.9857 0.9838 0.9955 29.010 10 XGBoost_max_depth_5_L2_5 0.9833 0.9800 0.9953 28.690 Kfold-Cross Validation Results
Table 2 shows the results of the different variations of the top-performing models, used to identify the optimal setting for some of the hyperparameters. Looking at the results, an interesting finding emerges: the models with less regularization (up to a point) are the ones performing best. The Logistic Regression with L2 regularization and C = 5 (vs. the default C = 1, i.e., a weaker penalty) is the top performer, followed by the XGBoost with a max_depth of 5, whose deeper trees fit the training data more closely at the usual risk of generalizing worse. Presumably, this indicates that the data are very homogeneous, so the held-out data end up being very similar to the training data.

Table 3 shows the final results for the winning models and for the ensembles of the top two and top three models, with the standard 0.5 threshold as well as optimized thresholds. The winner among all approaches is an ensemble of the optimized Logistic Regression and the optimized XGBoost with a custom decision threshold of 0.78: beyond this probability, a search query is classified as a Task-search. The final classifier reaches an accuracy of 98.83% and an F1 score of the same value. Figures 1 and 2 show the confusion matrix and the ROC curve for the winning model.

Table 3. Final results on the test data set

| Model | Accuracy | F1 Score | ROC-AUC |
|-------|----------|----------|---------|
| Logistic_Regression_C_5 | 0.9800 | 0.9800 | 0.9966 |
| XGBoost_max_depth_5 | 0.9800 | 0.9800 | 0.9954 |
| Baseline_NN_100_nodes | 0.9767 | 0.9767 | 0.9954 |
| Ensemble top 3 | 0.9800 | 0.9800 | 0.9968 |
| Ensemble top 2 | 0.9817 | 0.9817 | 0.9962 |
| Ensemble top 2 (threshold = 0.65) | 0.9867 | 0.9867 | 0.9962 |
| Ensemble top 2 (threshold = 0.78) | 0.9883 | 0.9883 | 0.9962 |

Figures 1 and 2. Confusion matrix and ROC curve for the top-2 ensemble with a 0.78 threshold.
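Putting the pieces together, here is a minimal, self-contained sketch of how the final classifier could be assembled from the tuned configurations in Table 2 and the 0.78 threshold in Table 3. The feature matrices and labels (X_train, y_train, X_test) are assumed to already exist; the exact preprocessing pipeline is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

THRESHOLD = 0.78  # custom decision threshold from Table 3

# Tuned configurations mirroring Table 2; other settings are left at their defaults.
clf_lr = LogisticRegression(C=5, max_iter=1000)
clf_xgb = XGBClassifier(max_depth=5)

# X_train, y_train, X_test are assumed to hold the already-vectorized queries.
# clf_lr.fit(X_train, y_train)
# clf_xgb.fit(X_train, y_train)
# proba = np.mean([clf_lr.predict_proba(X_test)[:, 1],
#                  clf_xgb.predict_proba(X_test)[:, 1]], axis=0)
# y_pred = (proba >= THRESHOLD).astype(int)  # 1 = Task-search
```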
Conclusion

An extensive exploration of Machine Learning models was used to develop a robust and highly accurate classifier that can be put into production right away. This exploration of methods and data yields several interesting learnings and reinforces some of the common wisdom in Machine Learning: 1) Quality and quantity of data are key: the difference between the first exploration with the original data set (see the appendix for reference), where the models performed only marginally better than a naïve classifier, and the performance on the cleaner and enhanced data set used here is truly remarkable. 2) Shallow learning models often outperform Deep Learning models, and one should not assume Deep Learning is always better. 3) More regularization does not always perform best; it depends on the situation, and the default settings are often good enough. 4) An ensemble of good and diverse models outperforms any single one. And 5) a custom decision threshold can maximize the performance of a binary classifier and give some flexibility in the trade-off between true positives and false positives.
References

Francois Chollet, 2018. Deep Learning with Python. Shelter Island, NY: Manning. [ISBN-13: 978-1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-notebooks.git

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, 2017. An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678v4 [cs.CV]

Sepp Hochreiter and Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation, 9(8), pp. 1735–1780.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs.CL]

Terrence J. Sejnowski, 2018. The Deep Learning Revolution. MIT Press.

Aurélien Géron, 2019. Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, 2nd Edition. O'Reilly Media, Inc.

Tianqi Chen and Carlos Guestrin, 2016. XGBoost: A Scalable Tree Boosting System. arXiv:1603.02754v3 [cs.LG]
Appendix – Previous report

Abstract

In this paper, I explore different NLP techniques to classify search queries made by consumers to search engines such as Google and Bing when researching the makeup category. I used one-hot encoding, learnable and pre-trained word embeddings, as well as several temporal-processing neural nets such as LSTMs, GRUs, and 1D-CNNs. None of the models appears to significantly outperform a naïve classifier that always predicts the most frequent class. A discussion of potential areas for further exploration is provided at the end.

Introduction

Natural Language Processing (NLP) and natural language understanding are a booming research area and a field with many applications (Lane, Howard, and Hapke, 2019). The ability to teach machines to process text and to use algorithms to understand and act on that understanding is magical. In this paper, I explore several Deep Learning techniques for a text classification task.

Literature review

Before using language, machines need to process it in a way that allows computation and mathematical transformation. Vectorizing text is the first step. There are several techniques to represent units of text (documents) for classification purposes. Lane, Howard, and Hapke, in their book Natural Language Processing in Action (2019), present the following common approaches: one-hot encoding, Term Frequency (TF), Term Frequency–Inverse Document Frequency (TF-IDF), and word embeddings, among others. These can be divided into two broad types: techniques that do not take into account the order (or context) of the words or characters, known as Bag-of-Words (BOW) representations, and techniques that do, such as word embeddings. Bag-of-Words representations tend to produce sparse, high-dimensional vectors, while word embeddings produce dense, low-dimensional ones (Chollet, 2018). BOW vectors are hard-coded, while word embeddings are learned from data.
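As a small illustration of the Bag-of-Words side of this distinction, the sketch below builds binary one-hot and TF-IDF vectors for a few queries with scikit-learn; the example queries are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up example queries, purely for illustration.
queries = ["makeup kit", "find new makeup look", "best makeup brands"]

# Binary one-hot style BOW: 1 if the word appears in the query, 0 otherwise.
onehot = CountVectorizer(binary=True)
X_onehot = onehot.fit_transform(queries)   # sparse matrix, one row per query

# TF-IDF: down-weights words that appear in many queries (e.g., "makeup").
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(queries)

print(onehot.get_feature_names_out())      # vocabulary learned from the queries
print(X_onehot.toarray())
print(X_tfidf.toarray().round(2))
```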
There are several approaches to creating word embeddings: they can be learned while training the model for the classification task, or they can be created separately using different techniques fitted on large data sets such as the Wikipedia corpus, Google News, or Facebook content, and they are often made publicly available. The most popular ones are Word2Vec, Doc2Vec, GloVe, and fastText (Lane, Howard, and Hapke, 2019). More recently, BERT and its variations have also become public (Devlin, Chang, Lee, and Toutanova, 2018).

Representing the data as numerical vectors is only the beginning. Different Deep Learning techniques that use these representations to make predictions or to generate text have also been developed. Some of the most common are 1D-CNNs and Recurrent Neural Networks (RNNs). Regular Deep Neural Nets (DNNs) can also be used with text, but they do not fully leverage the order of the words or sequences. 1D-CNNs are useful for extracting ordered patterns in sequences of words regardless of where they appear in the text. RNNs treat the word inputs as time-based representations in which the input at the present step is combined with all previous inputs from the past. Several variations of RNNs have been designed over the years to overcome obstacles such as the exploding or vanishing gradient problem; the most popular are the Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the Gated Recurrent Unit (GRU) (Cho et al., 2014). As with many Deep Learning problems, it is not always clear what the best approach is for a given problem, so it is often suggested to start with the simplest solution and gradually increase the complexity by trying different approaches (Chollet, 2018). In the following sections, I describe the approach taken to explore these different techniques to classify search queries performed by consumers of the makeup category into three classes (Task-oriented search queries, Item-oriented search queries, or Both Task and Item).
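To make the pre-trained-embedding idea concrete before describing the method, here is a minimal sketch of loading GloVe vectors into a frozen Keras Embedding layer. The file path, vocabulary cap, and dimensions are assumptions for illustration, not the exact setup used in these experiments.

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 100   # matches the 100-dimensional GloVe vectors
MAX_WORDS = 1000  # assumed vocabulary cap

def load_glove(path="glove.6B.100d.txt"):
    """Parse a GloVe text file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors

def build_embedding_matrix(word_index, glove):
    """Rows follow the tokenizer's word index; out-of-vocabulary words stay zero."""
    matrix = np.zeros((MAX_WORDS, EMBED_DIM))
    for word, i in word_index.items():
        if i < MAX_WORDS and word in glove:
            matrix[i] = glove[word]
    return matrix

def frozen_embedding_layer(matrix):
    """A non-trainable Embedding layer initialized with the pre-trained vectors."""
    return tf.keras.layers.Embedding(
        MAX_WORDS, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(matrix),
        trainable=False)
```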
Method

Data. A study (to be published) from the Intent Lab – a research partnership between Performics, Northwestern University's Medill School of Journalism, and Microsoft's Bing – asked more than 1700 consumers of the makeup category to write a search query they would use to research the category when looking to refresh their makeup. Each respondent submitted a search query and was asked to classify it as an Item-query (one used to look for a specific item, such as "makeup kit") or a Task-query (one used to learn more about how to accomplish a broad goal, such as "find new makeup look"). Respondents were also given two more options: the search query is "Both" a task and an item search, or "I don't know". Many respondents used the same search query, but they did not always agree on what type of query it was, which means the data are noisy. To resolve these discrepancies, I used majority voting and assigned the label with the most votes. When there was a tie between any choice and "Both", I selected "Both", since it encompasses the other two options; when the tie was between Task and Item, I chose Task, since it is the more general one, unless it was obvious from the search query that the individual was asking for a specific item. After cleaning duplicates with disagreement, irrelevant queries ("abc def ceb"), and removing the "I don't know"s, I ended up with n = 954 queries. I split the data into Train (n = 686), Validation (n = 77), and Test (n = 191) sets.

Sequence vectorizing. To test the effect of different types of representation, I trained two models using one-hot matrices and a single hidden fully connected layer – one with dropout (0.5) and 32 nodes, the other without dropout and with 16 nodes – and 10 models (including two simple models similar to the ones without embeddings) using an embedding layer with a vocabulary of 1,000, an embedding dimension of 100, and an input length of 8. The vocabulary size is 606, but to avoid collisions I set the maximum to 1,000. I used padded vectors of length 8; most queries are of length 3, while the longest is of length 21. The choice of 8 was somewhat arbitrary, and I did not test different lengths in this set of experiments. The length of the embedding vectors (100) was chosen to make it easy to use pre-trained GloVe word embeddings.
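A minimal sketch of this vectorization step, assuming the Keras Tokenizer and pad_sequences utilities and the settings described above (vocabulary capped at 1,000, padded length 8); the example queries are invented for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 1000   # vocabulary cap to avoid index collisions
MAXLEN = 8         # padded sequence length used in these experiments

# Made-up example queries standing in for the survey data.
queries = ["makeup kit", "find new makeup look", "best drugstore foundation for dry skin"]

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(queries)                     # build the word index
sequences = tokenizer.texts_to_sequences(queries)   # words -> integer ids
X = pad_sequences(sequences, maxlen=MAXLEN)         # pad/truncate to length 8
print(X.shape)  # (3, 8)
```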
Models. Apart from the two simple baseline DNN models (trained with one-hot vectors) described above, I tested the following models: 1) a naïve_model that always predicts Task-search (the most common label, 46%), used as a reference; 2) and 3) the models described above (One-hot-baseline_NN_16_nodes and One-hot-baseline_NN_32_w_dropout); 4) Embedding_solo, a model that uses only the embedding layer, trained as part of the task; 5) Embedding_layer_NN_32_w_dropout, the same as model 4 with an additional dense layer of size 32 and dropout (0.5), comparable to model 3; 6) Embedding_1D_CNN_MaxPooling, a model with a trainable embedding layer, a 1D-CNN with a window size of 3, and a MaxPooling layer; 7) Pre_trained_embedding_solo, the same as model 4 but with pre-trained GloVe embeddings; 8) Pre_trained_embedding_simple_RNN, model 7 with an additional SimpleRNN layer of size 8; 9) Pre_trained_embedding_stacked(2)_RNN, model 7 with two SimpleRNN layers stacked on top; 10) Pre_trained_embedding_LSTM, model 7 with an additional LSTM layer of size 8; 11) Pre_trained_embedding_GRU, model 7 with an additional GRU layer; 12) Pre_trained_embedding_1D_CNN_MaxPooling, model 7 with a 1D-CNN and MaxPooling (equivalent to model 6 but with pre-trained embeddings); and 13) Pre-trained_embedding_1D_CNN_GobalMaxPooling, the same as model 12 but with Global Max Pooling instead of MaxPooling1D.

Metrics. I used the weighted F1 score and top-1 accuracy on the holdout test data set to compare the performance of the models, as well as the time it takes to train each model under the K-fold cross-validation framework. A confusion matrix was produced for each model, and I also estimated the average accuracy on the validation set across the K folds (4 folds). After cross-validation, I trained each model on the full training data set with a 90%/10% train/validation split to get a sense of the training performance. I used a virtual machine with a GPU via Google Colaboratory for all experiments.
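As a small, self-contained sketch of this evaluation (the labels below are invented, purely to show the scikit-learn calls for the metrics used):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true and predicted labels for the three classes
# (0 = Task, 1 = Item, 2 = Both), purely for illustration.
y_test = [0, 1, 2, 0, 1, 0, 2, 1]
y_pred = [0, 1, 0, 0, 2, 0, 2, 1]

print("accuracy:", accuracy_score(y_test, y_pred))
print("weighted F1:", round(f1_score(y_test, y_pred, average="weighted"), 3))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted
```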
Results

Table 1. Model comparison

| # | Model | Weighted F1 (test) | Accuracy (test) | K-fold average validation accuracy | Training time (s) |
|---|-------|--------------------|-----------------|------------------------------------|-------------------|
| 1 | Naïve_model | 0.29 | 0.46 | n/a | n/a |
| 2 | One-hot-baseline_NN_16_nodes | 0.30 | 0.46 | 0.46 | 3.76 |
| 3 | One-hot-baseline_NN_32_w_dropout | 0.31 | 0.43 | 0.46 | 3.84 |
| 4 | Embedding_solo | 0.38 | 0.42 | 0.45 | 4.04 |
| 5 | Embedding_layer_NN_32_w_dropout | 0.40 | 0.44 | 0.44 | 5.08 |
| 6 | Embedding_1D_CNN_MaxPooling | 0.33 | 0.46 | 0.46 | 5.45 |
| 7 | Pre-trained_embedding_solo | 0.47 | 0.49 | 0.45 | 3.01 |
| 8 | Pre-trained_embedding_simple_RNN | 0.40 | 0.44 | 0.44 | 8.01 |
| 9 | Pre-trained_embedding_stacked(2)_RNN | 0.34 | 0.42 | 0.44 | 10.42 |
| 10 | Pre-trained_embedding_LSTM | 0.39 | 0.49 | 0.46 | 11.50 |
| 11 | Pre-trained_embedding_GRU | 0.29 | 0.45 | 0.46 | 11.85 |
| 12 | Pre-trained_embedding_1D_CNN_MaxPooling | 0.45 | 0.47 | 0.41 | 3.84 |
| 13 | Pre-trained_embedding_1D_CNN_GobalMaxPooling | 0.46 | 0.49 | 0.48 | 4.45 |

As can be seen in Table 1, none of the models performs significantly better than the naïve_model in terms of accuracy, which is discouraging. The best-performing model is a simple pre-trained embedding layer with the classifier on top. The F1 score for this model, 0.47, is significantly higher than the naïve model's 0.29, but still not good enough for real use on an unlabeled data set. Although all the models trained very fast, even using cross-validation, given the small data set, it is clear that the RNN, LSTM, and GRU models are much more computationally expensive and, in this case, do not achieve higher performance. The models with one-hot encoding underperform most of those with word embeddings on the F1 score, and using pre-trained word embeddings seems to contribute positively, as evidenced by model 7. Looking at the confusion matrix of model 7, it is clear that the model confuses "Task-searches" with "Both", providing a window into potential areas for further exploration.

Conclusion

The initial exploration of Deep Learning methods for classifying search queries into Task- vs Item-oriented queries illustrates the challenges of the task: 1) search queries are often a short combination of keywords and often lack the necessary context for extracting deep semantic meaning; 2) many contain typos or slight variations, making it harder for a machine to realize they are the same query; 3) the classes are ill-defined: a query that is "Both" is at the same time a "Task-search", probably sharing the same cues and type of language, which makes the classes hard to separate from one another; and 4) the data set is too small to allow good learning of custom embeddings, and although pre-trained word embeddings seem to help, more context is needed for the algorithm to learn valuable information. This suggests several potential areas for improvement: 1) make the task a binary classification problem by either removing "Both" from the data set or treating "Both" and "Task-searches" as equivalent; 2) try character-level encoding to provide richer representations of the data and enable the power of temporal processing by RNNs and 1D-CNNs; 3) feature-engineer additional variables, such as the search volume behind these queries (as a measure of popularity) and a journey-mapping of the search query (I have developed another algorithm that relies on an intent ontology to classify these queries into buckets corresponding to different stages of the consumer journey); this use of external data beyond the language in the text might provide additional and valuable context;
4) try shallow learning methods such as Random Forest and SVM, and consider combining them (ensembles of models) to make a concerted prediction; and 5) optimize the hyperparameters of promising models after exploring the opportunities from points 1–4.

References

Francois Chollet, 2018. Deep Learning with Python. Shelter Island, NY: Manning. [ISBN-13: 978-1617294433] Source code available at: https://github.com/fchollet/deep-learning-with-python-notebooks.git

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805v2

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello, 2017. An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678v4 [cs.CV]

Sepp Hochreiter and Jürgen Schmidhuber, 1997. Long Short-Term Memory. Neural Computation, 9(8), pp. 1735–1780.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [cs.CL]