Building a Meta - Search Engine
Information Retrieval
CS60092
Mentor:
Suman Kalyan Maity
Project Members:
Ayan Chandra, CS, 16CS72P02
Sandeep Sharma, MI, 13MI31025
Ankita Saha, AT, 16AT72P01
Vineet Jain, ME, 15ME30044
Indrasekhar Sengupta, RJ, 16RJ72P01
Sudeshna Das, ET, 16ET91R01
GitHub: https://github.com/metasearchengine/metarank
Introduction
● A meta-search engine (MSE) is an aggregator search service that, given a query from the
user interface, uses data from a set of search engines to produce its own results.
● It takes a query from the user and sends it simultaneously to third-party search
engine APIs; once sufficient data has been received, its re-ranker orders and formats
the results and presents them to the user.
Objective
To build an experimental meta-search engine
Key areas :
● Meta-search infrastructure.
● Meta-ranking or rank aggregation.
System Module & Methodology
Infrastructure
Query Set
A set of 100 queries is
selected as the benchmark
query set.
1. Ten queries for distinct keywords search
2. Ten queries for phrase word search
3. Ten queries for appended keywords search
4. Ten queries for words related to named entities,
e.g. persons
5. Ten queries for keywords related to trending
topics
6. Ten queries for keywords related to news
7. Ten queries for keywords for video, specifically
YouTube
8. Ten queries for product search
9. Ten queries for rare search
10. Ten queries for keywords related to weather
Query Type | Query phrases/words
Appended Keyword | Java, Java programming, Java programming tutorial ...
Distinct Keyword | Cricket score, CM of UP, Latest Hollywood movies ...
Phrase word | The bewildered tourist, Knowing what i know now ...
Named entity | Sachin Tendulkar, Cormen, Coorg, Elon Musk ...
Trending keywords | IPL, Donald Trump, ISIS, Yogi Adityanath, Space-X ...
News | Indian News, Delhi MCD Election, Dalai Lama visit ...
Video keyword | Latest songs, DBMS Lectures, Latest Movie Trailers ...
Product | Phone charger, Earphones, Books, IPad, Watch ...
Rare keyword | Philanthropists, Anthropology, Serendipity, Gynecologist ...
Weather query | Today's weather, Weather on 1st January, Temperature ...
Query Pre - Processing
On-demand module
• Word limits in search engines
• Ensures that important words are not lost
• Module is triggered for large queries only (# of words > 10)
• Avoids unnecessary pre-processing
• Terminological noun phrase extraction using a large corpus
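The trigger condition above can be sketched in a few lines. This is only an illustration: the function name `needs_preprocessing` and the whitespace-based word count are assumptions, and only the threshold of 10 words comes from the slide.

```python
MAX_WORDS = 10  # threshold stated on the slide (# of words > 10)

def needs_preprocessing(query, max_words=MAX_WORDS):
    """Return True when the query is long enough to trigger keyphrase extraction.

    Words are counted by splitting on whitespace (an assumption; the slides do
    not say how words are counted).
    """
    return len(query.split()) > max_words

long_query = "where can i find a real example of a very long search engine query"
short_query = "Java programming tutorial"
print(needs_preprocessing(long_query), needs_preprocessing(short_query))
```

Short queries skip the module entirely, which is how unnecessary pre-processing is avoided.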
Algo: Keyphrase Extraction
Input: Query q
Output: Keyphrases
1. Perform POS tagging on query q
2. Extract terminological noun phrases by
using regular expression patterns
3. Filter noun phrases by using a large web
corpus
4. Return keyphrases
Assumption: If the length of q is above a certain
threshold, q is likely to consist of one or more
well-formed sentences.
● POS tagging: NLTK toolkit
● Regular expressions:
○ P1 = C*N
○ P2 = (C*NP)?(C*N)
○ P3 = A*N+
■ N = noun
■ P = preposition
■ A = adjective
■ C = A|N
Example
Query: where can i find a real example of a very long search engine query
POS tagging: where/WRB can/MD i/VB find/VB a/DT real/JJ example/NN of/IN a/DT very/RB long/JJ search/NN engine/NN query/NN
Regular expression filtering: real example, long search engine query
Web corpus filtering: real example, long search engine query
Output: real example long search engine query
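The regular expression step of the example can be sketched as follows. This is not the project's code: it assumes Penn Treebank tags (NN* as N, JJ* as A, IN as P), takes the already-tagged query from the slide as input instead of calling NLTK, and uses a hypothetical `keep` predicate as a stand-in for the web corpus filter. The single pattern used is P2, which also matches everything P1 and P3 match.

```python
import re

# P2 = (C*NP)?(C*N) with C = A|N. Every string matched by P1 = C*N or
# P3 = A*N+ is also matched by this pattern.
NP_PATTERN = re.compile(r"(?:[AN]*NP)?[AN]*N")

def tag_symbol(pos):
    """Map a Penn Treebank tag to N (noun), A (adjective), P (preposition) or x."""
    if pos.startswith("NN"):
        return "N"
    if pos.startswith("JJ"):
        return "A"
    if pos == "IN":
        return "P"
    return "x"

def extract_keyphrases(tagged, keep=lambda phrase: True):
    """Return noun phrases whose tag sequence matches NP_PATTERN.

    `keep` is a hypothetical stand-in for the web corpus filter; by default
    it accepts every candidate.
    """
    symbols = "".join(tag_symbol(pos) for _, pos in tagged)
    phrases = []
    for match in NP_PATTERN.finditer(symbols):
        # One symbol per token, so match offsets index directly into `tagged`.
        phrase = " ".join(word for word, _ in tagged[match.start():match.end()])
        if keep(phrase):
            phrases.append(phrase)
    return phrases

# Tagged query copied from the example slide.
tagged_query = [
    ("where", "WRB"), ("can", "MD"), ("i", "VB"), ("find", "VB"), ("a", "DT"),
    ("real", "JJ"), ("example", "NN"), ("of", "IN"), ("a", "DT"), ("very", "RB"),
    ("long", "JJ"), ("search", "NN"), ("engine", "NN"), ("query", "NN"),
]
keyphrases = extract_keyphrases(tagged_query)
print(keyphrases)  # ['real example', 'long search engine query']
```

Encoding each token as one character lets an ordinary string regex play the role of the tag patterns P1 to P3.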
Caching
• Issues:
○ Freshness of the results for an identical query
○ How long, and how many queries, to keep in the cache
• Benefits?
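One common way to address both issues is a cache with a time-to-live (freshness) and a size cap with least-recently-used eviction (how much to keep). The sketch below is an assumption about how such a cache could look, not the project's implementation; the class name `QueryCache` and the default limits are hypothetical.

```python
import time
from collections import OrderedDict

class QueryCache:
    """LRU cache whose entries expire after ttl_seconds (hypothetical design)."""

    def __init__(self, max_entries=1000, ttl_seconds=300.0, clock=time.monotonic):
        self._entries = OrderedDict()  # normalised query -> (stored_at, results)
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping

    @staticmethod
    def _key(query):
        # Treat queries differing only in case or outer whitespace as identical.
        return query.strip().lower()

    def get(self, query):
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # stale: force a fresh search
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return results

    def put(self, query, results):
        key = self._key(query)
        self._entries[key] = (self._clock(), results)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used

now = [0.0]
cache = QueryCache(max_entries=2, ttl_seconds=10, clock=lambda: now[0])
cache.put("IPL", ["result 1"])
fresh_hit = cache.get(" ipl ")
now[0] = 11.0
stale_hit = cache.get("IPL")
print(fresh_hit, stale_hit)
```

A short TTL would suit news or weather queries, where freshness matters most; the right values are a tuning decision the slides leave open.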
Pre-threading management
• Has the new query been tagged or identified as belonging to a particular topic or genre?
• The system is unable to receive a response from all the considered search engine APIs within a certain threshold time limit
Query limit and API key management
• A single API key allows 1000 free queries
• A pool of API keys is generated for each search engine API
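A key pool of this kind can be sketched as a per-key quota counter that moves on to the next key once the current one is used up. The class name `ApiKeyPool`, the key strings, and the exhaustion behaviour are assumptions; only the figure of 1000 free queries per key comes from the slide.

```python
class ApiKeyPool:
    """Hands out API keys for one search engine, tracking a per-key quota."""

    def __init__(self, keys, quota_per_key=1000):
        # Remaining free queries for each key, in the order the keys were given.
        self._remaining = {key: quota_per_key for key in keys}

    def acquire(self):
        """Return a key with quota left and charge one query against it."""
        for key, left in self._remaining.items():
            if left > 0:
                self._remaining[key] = left - 1
                return key
        raise RuntimeError("all API keys in the pool are exhausted")

# Hypothetical keys and a tiny quota so the rotation is easy to see.
pool = ApiKeyPool(["key-1", "key-2"], quota_per_key=2)
issued = [pool.acquire() for _ in range(4)]
print(issued)  # ['key-1', 'key-1', 'key-2', 'key-2']
```

One pool per search engine API keeps the quotas of different providers independent.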
Threading Module
Multithreading is the ability of a central processing unit (CPU) or a
single core in a multicore processor to execute multiple processes or
threads concurrently.
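For a meta-search engine, threads are mainly useful for waiting on several search engine APIs at once. The sketch below fans a query out with `concurrent.futures` and keeps only the responses that arrive within a time limit, as in the pre-threading condition on slow APIs. The engine functions are hypothetical stubs standing in for real API calls, and the timeout value is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def fan_out(query, engines, timeout_seconds):
    """Send `query` to every engine in parallel; return responses that arrive in time.

    `engines` maps an engine name to a callable taking the query string.
    """
    pool = ThreadPoolExecutor(max_workers=len(engines))
    futures = {pool.submit(fetch, query): name for name, fetch in engines.items()}
    done, _not_done = wait(futures, timeout=timeout_seconds)
    pool.shutdown(wait=False)  # do not block on engines that missed the deadline
    responses = {}
    for future in done:
        if future.exception() is None:  # skip engines that raised an error
            responses[futures[future]] = future.result()
    return responses

# Hypothetical stand-ins for real search engine API calls.
def fast_engine(query):
    return [f"fast result for {query}"]

def slow_engine(query):
    time.sleep(1.0)  # simulates an API that exceeds the time limit
    return [f"slow result for {query}"]

responses = fan_out("IPL", {"fast": fast_engine, "slow": slow_engine}, timeout_seconds=0.2)
print(responses)
```

The re-ranker then works with whichever result lists came back before the deadline.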
Meta rank module
1. For a query Q, the set of identical results
returned by the different search engine APIs is
re-ranked.
2. If most of the search engines rank result i
above result j, then result i is assumed to be
better than j.
3. The alpha-majority concept is a better approach
when there is a large number of search engines.
For an item pair $(i, j)$: $X = \{0, 1\}$ is the set of possible opinions, and $n_x$ is the number of rankings that give opinion $x \in X$.
Total number of rankings → $N$. $0 \le \alpha \le 0.5$, $0 \le \beta \le 1$.
Ranking $k$, with opinion $x(k)$, has disagreed with the alpha-majority iff both of the following conditions are satisfied:
1. $n_0 + n_1 \ge \lceil \beta N \rceil$ ... eq (1)
2. $n_{x(k)} < \alpha (n_0 + n_1)$ ... eq (2)
Weight assignment rule:
$W_l = 1 - \dfrac{\sum_{(i,j)} \delta_l(i,j)}{\binom{|S|}{2}}$ ... eq (3)
$W_l$ → fraction of item pairs for which an input ranking $R_l$ agrees with the alpha-majority; the sum runs over all item pairs $(i, j)$.
where $\delta_l(i,j)$ = 0, if $R_l$ does not disagree with the alpha-majority for $(i,j)$
= 1, if $R_l$ disagrees with the alpha-majority for $(i,j)$
= 0.5, if both $i$ and $j$ are not ranked by $R_l$
$|S|$ → the number of distinct items that appear in the input rankings.
The opinion of a ranker is incorrect if it fails to agree with a fraction alpha of the rankers that rank both
items. [Alpha Majority]
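The weight rule can be sketched in code as follows. Several points are assumptions rather than statements from the slides: opinion 1 means "the first item of the pair is ranked above the second" and 0 the reverse, a ranking has an opinion on a pair only if it ranks both items, and the 0.5 penalty is applied whenever it does not (the slide's wording could also be read as applying only when neither item is ranked). The function name and default parameter values are hypothetical.

```python
from itertools import combinations
from math import ceil, comb

def alpha_majority_weights(rankings, alpha=0.5, beta=0.5):
    """Return one weight per input ranking, following eq (1) to eq (3).

    Each ranking is a list of items, best first. At least two distinct
    items are required.
    """
    positions = [{item: rank for rank, item in enumerate(r)} for r in rankings]
    items = sorted(set().union(*rankings))
    n_rankings = len(rankings)
    n_pairs = comb(len(items), 2)  # |S| choose 2
    penalties = [0.0] * n_rankings  # running sum of delta for each ranking

    for i, j in combinations(items, 2):
        # Opinion of each ranking on (i, j): 1, 0, or None (assumed: no opinion).
        opinions = [
            (1 if pos[i] < pos[j] else 0) if i in pos and j in pos else None
            for pos in positions
        ]
        counts = {0: opinions.count(0), 1: opinions.count(1)}
        voters = counts[0] + counts[1]
        enough_voters = voters >= ceil(beta * n_rankings)  # eq (1)
        for l, opinion in enumerate(opinions):
            if opinion is None:
                penalties[l] += 0.5
            elif enough_voters and counts[opinion] < alpha * voters:  # eq (2)
                penalties[l] += 1.0

    return [1 - p / n_pairs for p in penalties]  # eq (3)

rankings = [["a", "b", "c"], ["a", "b", "c"], ["c", "b", "a"]]
weights = alpha_majority_weights(rankings)
print(weights)  # [1.0, 1.0, 0.0]
```

In this toy input the third ranking is outvoted on every pair, so its weight drops to 0, while the two agreeing rankings keep weight 1.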
Query & Response Log Analysis
Three phases:
1. Collection
2. Preparation
3. Analysis
• Collection: Query responses in JSON format
• Preparation:
a. Importing log data to NoSQL format
b. Cleaning
c. Log Format: JSON, CSV
d. Log Database: MongoDB
• Analysis:
a. Term level Analysis
b. Query level Analysis
c. Search Engine specific Analysis
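The three levels of analysis can be illustrated with simple frequency counts over JSON log records. The record schema (`query` and `engine` fields), the engine names, and the function name are assumptions made for this sketch; it reads JSON lines from memory rather than from MongoDB so that it stays self-contained.

```python
import json
from collections import Counter

def analyse_log(lines):
    """Count terms, whole queries, and engines in JSON-lines log records."""
    term_counts, query_counts, engine_counts = Counter(), Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # cleaning step: skip blank lines
        record = json.loads(line)
        query = record["query"].strip().lower()  # cleaning step: normalise case
        query_counts[query] += 1                 # query level
        term_counts.update(query.split())        # term level
        engine_counts[record["engine"]] += 1     # search engine specific
    return term_counts, query_counts, engine_counts

# Hypothetical log records; the field names are assumed.
sample_log = [
    '{"query": "Java programming", "engine": "engine_a"}',
    '{"query": "java programming tutorial", "engine": "engine_b"}',
    '{"query": "Java programming", "engine": "engine_b"}',
]
terms, queries, engines = analyse_log(sample_log)
print(terms.most_common(2), queries.most_common(1), engines.most_common(1))
```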
IR Evaluation
• Mean Average Precision
• Recall
• Precision-Recall Ratio
Query Type | Mean Average Precision | Recall | Precision/Recall Ratio
Appended Keyword | 3.67 | 6 | 0.61
Distinct Keyword | 4.49 | 7 | 0.64
Phrase word | 4.33 | 7 | 0.62
Named entity | 5.67 | 8 | 0.71
Trending keywords | 6.11 | 6 | 1.01
News | 6.69 | 6 | 1.12
Video keyword | 6.54 | 6 | 1.09
Product | 4.14 | 6 | 0.69
Rare keyword | 6.76 | 8 | 0.85
Weather query | 7.14 | 5 | 1.43
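For reference, the textbook definitions of these metrics can be sketched as below. This is not the project's evaluation code: textbook average precision and recall lie between 0 and 1, whereas the table uses a scale that the slides do not state, and the ratio column appears to be the MAP column divided by the Recall column. The relevance judgments in the usage lines are made up for illustration.

```python
def average_precision(relevance, total_relevant):
    """Average of precision@k over the ranks k that hold a relevant result.

    `relevance` lists True/False for each returned result, best rank first;
    `total_relevant` is the number of relevant results that exist for the query.
    """
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

def recall(relevance, total_relevant):
    """Fraction of all relevant results that were actually returned."""
    return sum(relevance) / total_relevant if total_relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (relevance, total_relevant) pairs."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)

# Made-up judgments for two queries.
runs = [([True, False, True], 2), ([False, True], 1)]
map_score = mean_average_precision(runs)
print(round(map_score, 4), recall(*runs[0]))
```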
[Figure: Recall vs MAP]
Metric for overall performance
1. Core technology, weightage: 5%
2. Scalability, weightage: 10%
3. Search time, weightage: 20%
4. Query functionality, weightage: 10%
5. Search relevance, weightage: 50%
Our rating as per the system: 4 + 7 + 12 + 7 + 38 = 68 out of 100
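The arithmetic of the rating can be checked with a short script. It assumes that each of the five scores is awarded out of its criterion's weightage (for example 4 out of 5 for core technology), which the slide does not state explicitly. Under that reading the scores sum to 68 as stated, while the five listed weightages add up to 95.

```python
# (criterion, weightage in %, score awarded), copied from the slide.
criteria = [
    ("Core technology", 5, 4),
    ("Scalability", 10, 7),
    ("Search time", 20, 12),
    ("Query functionality", 10, 7),
    ("Search relevance", 50, 38),
]

total_score = sum(score for _, _, score in criteria)
total_weight = sum(weight for _, weight, _ in criteria)
print(total_score, total_weight)  # 68 95
```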
User Interface
Query result view: example 1
Query result view: example 2
Thank You
References:
M.S. Desarkar, S. Sarkar, P. Mitra. Preference relations based unsupervised rank aggregation for meta-search. Expert Systems With Applications 49 (2016) 86-98.
C.D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S.J. Bethard, D. McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (2014) 55-60.

