Search Engine Using
Machine Learning and
NLP
How can we do searching?
We can search with classic algorithms and data structures such as binary
search, linear search, BSTs and tries. But applying them to free text is
not very useful: they only match exact words, the dataset is large, and
the length of each string adds to the cost.
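To illustrate the limitation, here is a minimal sketch of exact-match lookup with binary search. The function name `exact_match` and the sample titles are made up for this example; they are not taken from the slides.

```python
from bisect import bisect_left

def exact_match(sorted_titles, query):
    """Binary search: return the index of query in sorted_titles, or -1."""
    i = bisect_left(sorted_titles, query)
    if i < len(sorted_titles) and sorted_titles[i] == query:
        return i
    return -1

# Hypothetical sample data; binary search needs the list to be sorted.
titles = sorted([
    "how to use synchronization in java",
    "static variable in c",
    "view controller error in ios",
])

print(exact_match(titles, "static variable in c"))  # 1: the whole string matches
print(exact_match(titles, "synchronization"))       # -1: a partial match is not found
```

The second lookup fails even though one title contains the word, which is why the rest of the deck moves to vector-based similarity.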
Dataset:
• First we need some data to match our queries against. We use the
Stack Overflow data from Kaggle.
Data Visualization & Observations:
• We find that 30% of our data is duplicated, i.e. many rows are
repeated two or three times. Some rows are repeated as many as 6
times.
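A duplicate check of this kind can be sketched as follows. The slides do not say how the 30% figure was measured, so this sketch assumes "duplicated" means rows beyond the first copy of each distinct row; `duplicate_report` and `drop_duplicates` are hypothetical helper names.

```python
from collections import Counter

def duplicate_report(rows):
    """Summarise how many rows repeat an earlier row (assumed definition)."""
    counts = Counter(rows)
    total = len(rows)
    extra = sum(c - 1 for c in counts.values())  # copies beyond the first
    return {
        "total": total,
        "unique": len(counts),
        "duplicate_fraction": extra / total if total else 0.0,
        "max_repeats": max(counts.values(), default=0),
    }

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    return list(dict.fromkeys(rows))

# Toy rows standing in for question titles.
rows = ["a", "b", "a", "c", "a", "b"]
print(duplicate_report(rows))
print(drop_duplicates(rows))  # ['a', 'b', 'c']
```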
Searching without Machine Learning
Now we want to search for questions similar to our query:
• We pre-process the question titles to remove HTML, extra spaces and other noise.
We use the same pre-process function on the query too.
• We then vectorise the data. We use a TF-IDF vectorizer, as the BoW (bag-of-words)
vectorizer did not give better results.
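The pre-process function is not shown in the slide text. A minimal sketch of one possible version is below; the exact cleaning rules (lowercasing, and keeping `+` and `#` so that tokens like "c++" and "c#" survive) are assumptions, not the author's stated choices.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")            # HTML tags such as <p> or </b>
NON_WORD_RE = re.compile(r"[^a-z0-9+#]+")  # anything except letters, digits, + and #

def preprocess(text):
    """Strip HTML, lowercase, drop punctuation and collapse whitespace."""
    text = TAG_RE.sub(" ", text)     # remove tags
    text = html.unescape(text)       # turn &amp; into & and so on
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split())    # collapse runs of spaces

print(preprocess("<p>Static  variable in <b>C</b>?</p>"))  # static variable in c
print(preprocess("C++ &amp; C# threads"))                  # c++ c# threads
```

Applying the same function to both the titles and the query keeps their tokens comparable.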
Result without Machine Learning:
• For query = “Synchronization”, let’s see what our function returns:
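The search function itself is not included in the slide text. As a stand-in, here is a small pure-Python sketch of TF-IDF vectorisation with cosine-similarity ranking. It is not the vectorizer the author used; the names `fit_idf`, `tfidf`, `cosine` and `search`, the smoothed IDF formula, and the sample titles are all assumptions for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn a smoothed IDF weight per term from tokenised documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf(tokens, idf):
    """L2-normalised TF-IDF vector as a dict; unseen terms are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors equals their cosine."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def search(query, titles, top_k=3):
    """Return up to top_k titles ranked by similarity to the query."""
    docs = [t.lower().split() for t in titles]
    idf = fit_idf(docs)
    doc_vecs = [tfidf(d, idf) for d in docs]
    # The query is vectorised with the IDF learned from the titles.
    q_vec = tfidf(query.lower().split(), idf)
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [titles[i] for _, i in scored[:top_k]]

# Hypothetical titles, not rows from the Kaggle dataset.
titles = [
    "Synchronization in Java threads",
    "Static variable in C",
    "Thread synchronization with locks",
    "View controller error in iOS",
]
print(search("Synchronization", titles))
```

On this toy data the query matches only the two titles that contain the word, which hints at why a bare one-word query benefits from extra context such as a language tag.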
Searching Using Machine Learning
On Stack Overflow we usually include the programming language in
the query, for example: Static Variable in C, Synchronization in
Java, View Controller error in iOS.
Our dataset has tags. Can we use those tags to optimise our
queries?
Yes. But how? We can train a model that predicts the tag for a
given query, and then add that tag to the query.
Technique Used:
• We first simplify our tags by keeping only the first tag in each row of our
dataset.
• Then we convert the tags into numeric form.
• Then we perform TF-IDF vectorization and train our models: LR (Logistic
Regression) and SVM.
We observe that our LR model performed better with hyperparameter
tuning.
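The first two steps above can be sketched in a few lines. This assumes each row stores its tags as one space-separated string, which the slides do not confirm; `first_tag` and `encode_labels` are hypothetical helper names.

```python
def first_tag(tag_string):
    """Keep only the first tag; assumes tags are space-separated."""
    parts = tag_string.split()
    return parts[0] if parts else ""

def encode_labels(labels):
    """Map each distinct label to an integer id, in sorted label order."""
    classes = sorted(set(labels))
    to_id = {c: i for i, c in enumerate(classes)}
    return [to_id[label] for label in labels], classes

# Toy tag column standing in for the dataset's tags.
rows = ["java multithreading", "c static", "ios uiviewcontroller", "java"]
tags = [first_tag(r) for r in rows]
ids, classes = encode_labels(tags)
print(tags)     # ['java', 'c', 'ios', 'java']
print(ids)      # [2, 0, 1, 2]
print(classes)  # ['c', 'ios', 'java']
```

Keeping `classes` around lets a predicted id be turned back into a tag name later.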
Models Used
• We then trained our models again on all the data, using Logistic
Regression, SVM and Naïve Bayes.
• Then we add the predicted tag to our query using this
function.
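The function appears only as an image in the original slides. A hypothetical version could look like the sketch below, where `optimise_query` takes any callable that maps a query to a tag. The `toy_predict_tag` rule is a stand-in for the trained classifier, not the author's model.

```python
def optimise_query(query, predict_tag):
    """Append the predicted tag to the query unless it is already present."""
    tag = predict_tag(query)
    if not tag or tag.lower() in query.lower().split():
        return query
    return f"{query} {tag}"

def toy_predict_tag(query):
    """Stand-in for a trained model: a single hard-coded keyword rule."""
    return "java" if "synchronization" in query.lower() else ""

print(optimise_query("Synchronization", toy_predict_tag))          # Synchronization java
print(optimise_query("Synchronization in Java", toy_predict_tag))  # unchanged
```

Passing the predictor in as an argument keeps the query logic independent of which model is used.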
Precision and Recall:
Models Result:
Model                 Precision   Recall
Logistic Regression   0.52        0.40
SVM                   0.78        0.65
Naïve Bayes           0.79        0.71
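For reference, precision is TP / (TP + FP) and recall is TP / (TP + FN) for a given tag. The slides do not say how the per-tag scores were averaged into the single numbers above; the sketch below assumes a macro average (unweighted mean over tags) and uses made-up labels, so it does not reproduce the table.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def macro_average(y_true, y_pred):
    """Unweighted mean of per-label precision and recall (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [precision_recall(y_true, y_pred, label) for label in labels]
    n = len(labels)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n

# Toy labels, not results from the slides.
y_true = ["java", "java", "c", "c", "ios"]
y_pred = ["java", "c", "c", "c", "java"]
print(precision_recall(y_true, y_pred, "c"))
print(macro_average(y_true, y_pred))
```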
Output:
• We first optimise the query, then apply TF-IDF to it, since the
query vector must have the same shape as the vectors of our dataset.
Then we get the indices of the matching questions and display them.
Let’s try it again for query = “Synchronization”.
Search Engine UI:
Future Aspect:
• We could use w2v (word2vec), TF-IDF weighted w2v or other techniques to
vectorize; I could not do so because of limited computing
power.
• Use the full dataset.
• Build a better UI for the search engine.
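As a rough idea of what TF-IDF weighted w2v means, the sketch below averages word vectors with each occurrence weighted by the word's IDF. The two-dimensional vectors and IDF values are invented for illustration; real word2vec embeddings would come from a trained model, and `weighted_sentence_vector` is a hypothetical name.

```python
def weighted_sentence_vector(tokens, word_vectors, idf):
    """IDF-weighted average of word vectors; unknown tokens are skipped.

    Assumes word_vectors is non-empty and all vectors share one dimension.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in word_vectors and tok in idf:
            w = idf[tok]  # each occurrence adds its IDF, so repeats count more
            for k, x in enumerate(word_vectors[tok]):
                total[k] += w * x
            weight_sum += w
    return [x / weight_sum for x in total] if weight_sum else total

# Made-up 2-d embeddings and IDF weights.
word_vectors = {"java": [1.0, 0.0], "in": [0.0, 1.0]}
idf = {"java": 3.0, "in": 1.0}
print(weighted_sentence_vector(["java", "in"], word_vectors, idf))  # [0.75, 0.25]
```

The rarer word "java" dominates the sentence vector, which is the intended effect of the weighting.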
Thank You
