Predicting the Relevance of Search
Results for
E-Commerce Systems
Mohammed Zuhair Al-Taie
Joel Pinho Lucas
Siti Mariyam Shamsuddin
International Workshop Big Data Analytics 2015 “Multi Strategy Learning Analytics for Big Data”
Universiti Teknologi Malaysia, Kuala Lumpur, 17-18 Aug 2015
Study Published in
International Journal of Advances in Soft Computing and its Applications (IJASCA)
Introduction
► Search engines (e.g. Google.com, Yahoo.com, and
Bing.com) have become the dominant model of online
search
► Large businesses are able to hire the necessary skills to build
advanced search engines, while
♦ small online businesses still lack the ability to evaluate the
results of their search engines → losing the opportunity to
compete
► The purpose of this paper is to build an open-source model
that can:
♦ Measure the relevance of search results for online businesses
♦ Measure the accuracy of their underlying search algorithms
Data Pre-Processing
Data preprocessing usually consists of four steps
1. Data cleaning: a.k.a. data cleansing or scrubbing, which
aims at removing noise, filling in missing values and
correcting inconsistencies in the data
♦ Involves the identification and removal of outliers
2. Data integration: seeks to combine data from varied and
different sources into a coherent data storage
3. Data transformation: aims at converting data to a more
appropriate form.
♦ The goal is to have more efficient data mining operations
by making the data more understandable
4. Data reduction: aims at reducing the size of data while
minimizing any possible loss of information
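As an illustration only (not part of the original study), the four steps above might be sketched in Python with pandas; the data frame, column names and values here are hypothetical:

```python
import pandas as pd

# Hypothetical raw product data
df = pd.DataFrame({
    "query": ["led tv", "led tv", None, "bike"],
    "price": [199.0, 199.0, 25.0, 5000.0],
})

# 1. Cleaning: fill missing values; keep rows within 2 std devs of the mean
df["query"] = df["query"].fillna("")
df = df[(df["price"] - df["price"].mean()).abs() <= 2 * df["price"].std()]

# 2. Integration: merge in a second (hypothetical) source
other = pd.DataFrame({"query": ["bike"], "category": ["sports"]})
df = df.merge(other, on="query", how="left")

# 3. Transformation: normalize text to lower case
df["query"] = df["query"].str.lower()

# 4. Reduction: drop duplicate rows and keep only the needed columns
df = df[["query", "price"]].drop_duplicates()
```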
Text Retrieval Systems
► Text retrieval involves retrieving the documents that
contain particular keywords for a given query
♦ Given a string T = t1t2…tn and a particular keyword pattern
P = p1p2…pm, the goal is to verify whether P is a substring of
T
► Pattern matching can be applied in two ways:
♦ forward pattern matching, where the text and pattern are
matched in the forward direction; and
♦ backward pattern matching, where the matching process is
done from right to left.
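The two matching directions can be sketched with a naive Python implementation (an illustrative sketch, not code from the paper):

```python
def forward_match(text: str, pattern: str) -> bool:
    """Compare text and pattern left to right at each alignment."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return True
    return False


def backward_match(text: str, pattern: str) -> bool:
    """Compare characters right to left at each alignment
    (the direction used by Boyer-Moore-style algorithms)."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = m - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return True
    return False
```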
CrowdFlower Dataset
► CrowdFlower created its dataset with the help of its crowd
♦ The crowd rated a list of products from a handful of
e-commerce websites on a scale from 1 to 4
♦ “4” indicates that the product completely satisfied the search
query and “1” that the result did not match the search
query
► 261 search terms were generated by CrowdFlower
► The dataset incorporated six attributes: id, query text,
product title, product description, median relevance and
relevance variance of the query.
♦ The train file contains 10,158 rows in total
♦ The test file contains another portion of data with the same
attributes, except those related to query relevance
(median relevance and relevance variance); median relevance
is the label attribute to be predicted
CrowdFlower Dataset (cont.)
Data Cleaning and
Transformation
► The CrowdFlower dataset described in the last section is just
a collection of raw tuples related to text queries and product
descriptions, and it contains some noise
♦ We apply data preprocessing, which includes the removal of
undesired content, to improve data quality and make it ready
for mining and analysis
► We also need to pre-process the data in order to extract
value, and then transform it, in order to:
♦ Provide input for machine learning algorithms
► The first task to be accomplished is feature extraction
♦ We need to transform raw text search attributes into valuable
features
♦ To provide input for machine learning algorithms
Word Match Counting
► The dataset encompasses text attributes related to textual
searches. To extract features:
♦ The first method to extract features is counting how many
words, in each text search, match the product title and
product description
► In order to implement word match counting, we took
advantage of the facilities and expressiveness of the Python
language,
♦ Using the widely employed packages for data
manipulation, NumPy and SciPy
NumPy and SciPy
► NumPy is an extension to the
Python programming language,
adding support for large, multi-
dimensional arrays and matrices,
along with a large library of high-
level mathematical functions to
operate on these arrays (Wikipedia)
► SciPy (pronounced “Sigh Pie”) is
an open source Python library used
by scientists, analysts, and
engineers doing scientific
computing and technical computing
(Wikipedia)
Word Match Counting (cont.)
► The steps of word match counting were as follows:
1. First: we extracted word tokens from the text (i.e., splitting
the text by spaces)
2. Second: we removed non-alphanumeric characters from the
text.
♦ Words with just one or two characters (mostly articles and
prepositions) were also removed
3. Third: we transformed each list of words (from the search text
attributes) into an array
► NumPy methods were applied for intersecting the search
word vector with product title and product description
vectors, respectively.
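The steps above can be sketched in Python; the query and title strings here are hypothetical examples, not rows from the CrowdFlower data:

```python
import re
import numpy as np

def tokenize(text: str) -> np.ndarray:
    """Steps 1-2: split on spaces, strip non-alphanumeric
    characters, and drop words of one or two characters."""
    words = [re.sub(r"[^0-9a-z]", "", w) for w in text.lower().split()]
    return np.array([w for w in words if len(w) > 2])

# Hypothetical search query and product title (step 3: arrays)
search = tokenize("LED TV 40 inch")
title = tokenize("Samsung 40-inch LED Smart TV")

# Intersect the word vectors and count the matching words
matches = np.intersect1d(search, title)
feature1 = len(matches)
```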
Word Match counting with tf-idf
► At this point we simply counted how many words
remained in each intersection, where
♦ S={t1,t2,…,tn} is the search words (terms) vector,
D={t1,t2,…,tn} the product description words vector and
T={t1,t2,…,tn} the product title words vector, respectively:
Feature1 = count(T ∩ S)
Feature2 = count(D ∩ S)
► Since tf-idf weight-based algorithms are broadly
employed by e-commerce and publishing companies for
text retrieval and searching →
♦ We decided to make use of tf-idf to enhance the
expressiveness of the two features we acquired so far
TF-IDF
► Term Frequency-Inverse Document Frequency (TF-IDF)
produces a weight for each term fi in each document dj.
► It is basically a combination of two earlier methods: Term
Frequency (TF) and Inverse Document Frequency (IDF).
tf-idf is defined as in Eq. 1:
tf-idf(fi, dj) = tfij × idf(fi)
► In order to acquire word weights we used the
TfidfVectorizer class available in scikit-learn Python
package
♦ scikit-learn is a machine learning library for Python to serve
different purposes such as classification, regression and
clustering
Word Match counting with tf-idf
(cont.)
► Before calculating weights, stop words are removed: words
irrelevant for searching, such as articles, prepositions
and pronouns, are ignored
► The features we extracted now are much more expressive,
since they take into account weights of word terms,
♦ Higher weights are related to rare terms and lower weights
are related to highly frequent terms across all documents
► Feature values are now the sum of all intersection word
weights
♦ Instead of merely using strings to represent words in
vectors, we now use Python dictionaries to represent
words along with their weights, where words are the
keys and weights are the values
Word Match counting with tf-idf
(cont.)
► In this way, we have
♦ a search dictionary Sdict = {t1:w1, t2:w2, …, tn:wn},
♦ a product description words dictionary Ddict = {t1:w1, t2:w2,
…, tn:wn},
♦ a product title words dictionary
Tdict = {t1:w1, t2:w2, …, tn:wn}
♦ We may also represent, as vectors, the keys from each dictionary:
Skeys = {tk1, tk2, …, tkn}
► Thus, features are calculated as follows:
Feature1 = sum(values(Tkeys ∩ Skeys))
Feature2 = sum(values(Dkeys ∩ Skeys))
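A small Python sketch of the weighted feature computation; the term weights below are made-up illustrative values, not actual tf-idf weights from the dataset:

```python
# Hypothetical word -> tf-idf weight dictionaries
S_dict = {"led": 1.7, "tv": 1.3, "inch": 1.5}     # search query
T_dict = {"samsung": 1.9, "led": 1.7, "tv": 1.3}  # product title
D_dict = {"led": 1.7, "smart": 1.8, "tv": 1.3}    # product description

def weighted_match(doc: dict, search: dict) -> float:
    """Sum the weights of the terms common to both dictionaries."""
    return sum(doc[t] for t in doc.keys() & search.keys())

feature1 = weighted_match(T_dict, S_dict)  # led (1.7) + tv (1.3)
feature2 = weighted_match(D_dict, S_dict)  # led (1.7) + tv (1.3)
```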
Data Analysis and Prediction
► The main goal with CrowdFlower data is to predict the
relevance of search queries.
♦ The label attribute for this study will be the median relevance
of the search
► We first tried to predict the values of median relevance
for the CrowdFlower test set rows employing a Support
Vector Machine (SVM)
♦ Using the implementation from scikit-learn
► We also tried CrowdFlower data with a more sophisticated
algorithm: the Random Forest
♦ Random Forest is an ensemble learning method that is
widely employed in machine learning
♦ It is widely used within the scikit-learn user community
scikit-learn
• scikit-learn is an open source machine learning library for
Python
• It features various classification, regression and clustering
algorithms, including support vector machines, random
forests, gradient boosting, k-means and DBSCAN
• It is designed to interoperate with NumPy and SciPy.
Data Analysis and Prediction
(cont.)
► After acquiring the training data and storing it properly in a
data frame, learning in scikit-learn takes three
simple steps:
1. initializing the model,
2. fitting it to the training data, and
3. predicting new values
► We stored the train set provided by CrowdFlower, but
using our extracted features, in a data frame and provided it
as input to both algorithms.
► After initializing the models and fitting the training data, we
can predict the label attribute (search term median relevance)
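The three steps might look as follows in scikit-learn; the feature matrix and labels below are hypothetical stand-ins for the extracted features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hypothetical features: [title match weight, description match weight]
X_train = np.array([[3.0, 2.1], [0.0, 0.4], [1.7, 1.7], [0.0, 0.0]])
y_train = np.array([4, 1, 3, 1])  # median relevance labels (1-4)

# 1. Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=0)
# 2. Fit it to the training data
model.fit(X_train, y_train)
# 3. Predict new values
predictions = model.predict(np.array([[2.5, 2.0]]))

# The same three steps apply to the SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(np.array([[2.5, 2.0]]))
```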
Learning Models Benchmark
► We describe the resulting scores obtained by running the
four method combinations we applied:
♦ SVM with word match counting based features,
♦ SVM with tf-idf based features,
♦ Random Forest with word match counting based
features and,
♦ Random Forest with tf-idf based features
Learning Models Benchmark
(cont.)
►The best score was obtained with Random Forest with
tf-idf based features.
►We can also notice that Random Forest obtained a better
score than SVM and,
►likewise, that tf-idf obtained better results than word
match counting based features.
►However, the impact of applying tf-idf in
preprocessing was substantially higher than that of Random
Forest over SVM
                 Match Counting   tf-idf
SVM              0.51241          0.57654
Random Forest    0.53834          0.59211
Conclusion
► With our benchmark on CrowdFlower test set, we could
attest that Random Forest is an efficient machine-learning
algorithm.
► Random Forest showed better accuracy compared to a
simple SVM implementation. This, in part, is due to:
♦ The ensemble nature of Random Forest, which allows
multiple learning algorithms to be run
♦ Besides this fact, the nature of the dataset is another
reason for the SVM disadvantage, since such an algorithm is
likely to deliver poorer performance when the number of
features is much greater than the number of samples
Conclusion (cont.)
► Employing tf-idf for preprocessing features, combined with
Random Forest, is a powerful and effective approach for
predicting, and measuring, the relevance of text search in
e-commerce scenarios.
► Such an approach suits small e-commerce businesses
immersed in big data necessities.
► The use of the scikit-learn package, as well as other widely
used Python packages (NumPy and SciPy), saves substantial
development time
Conclusion (cont.)
► The preprocessing step is crucial for the whole data
analysis because:
1. It consumes most of the time and implementation effort
needed for the whole analysis
2. Preprocessing may be more critical for precision than the
machine-learning algorithm itself →
♦ SVM combined with tf-idf has shown a higher score than
Random Forest combined with word match counting
Thank you