Web Crawling-based Search Engine using Python
By Mudit Bansal, Sanya Goel, Atul Kr. Srivastava, Neha Arora
ASET, Amity University, Noida, UP
Motivation:
Difficulty in finding a good institution for education
Parents' or Students' Problems
• Parents are unable to find good schools and institutions for their children.
• Nearby preparatory schools are hard to track.
• The right amenities (basketball, football, horse riding, etc.) are hard to find.
• The result is compromised decisions.
Institutions' Problems
• Small schools and preparatory schools do not get recognized.
• They have a weak PR team, or none at all.
• They are unable to reach the public.
• Result: closure!
• A loss for all.
Idea to Solve the Problem
• A portal with an unbiased, ever-growing database of institutions that keeps track of:
  - Amenities
  - Location
  - Affiliations
  - Fee structure
Why Use Python?
• Shorter code
• Lighter on processing
• Easy multi-threading
• Can be the only language used to build a search engine
• Has ample modules for browser-like activity that make web scraping easy (for example, to skip robots.txt files that are coded with errors)
Example of Python Code
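As an illustration of the multi-threading point, here is a minimal sketch and not the presenters' own code. It assumes a hypothetical `fetch` function that reads from an in-memory `FAKE_PAGES` dictionary, so it runs without network access. A real crawler would download each page at that step.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the web so the sketch runs offline;
# a real crawler would download each page instead.
FAKE_PAGES = {
    "http://example.com/a": "<html>school A</html>",
    "http://example.com/b": "<html>school B</html>",
    "http://example.com/c": "<html>school C</html>",
}


def fetch(url):
    """Return (url, page length) for a single URL."""
    html = FAKE_PAGES.get(url, "")
    return url, len(html)


def fetch_all(urls, workers=4):
    """Fetch several URLs concurrently with a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


results = fetch_all(sorted(FAKE_PAGES))
for url, size in results:
    print(url, size)
```

Threads suit crawling because most of the time is spent waiting on the network rather than computing.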
Other Uses of This Algorithm?
• CV boost: using the algorithm to collect the job- or career-related email IDs of major companies to boost CVs.
• Converting e-commerce search results into an Excel sheet for easy comparison
• Easy tracking and comparison when buying real estate
Use of the Result Data?
• Data processing, including:
  - N-grams
  - Hash files
  - Hash tables (reduce storage requirements)
  - Keywords
  - Relevancy
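The processing steps listed above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the sample `docs`, the helper names, and the use of raw term counts as a relevancy score are all hypothetical and not taken from the project. Python's `dict` is a hash table, so it stands in for the hash-table keyword index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(docs):
    """Build a hash table mapping keyword -> {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Rank documents by summed keyword counts (a crude relevancy score)."""
    scores = defaultdict(int)
    for word in tokenize(query):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical scraped descriptions of two institutions.
docs = {
    "school_a": "Basketball and football grounds, football coaching",
    "school_b": "Horse riding and basketball",
}
index = build_index(docs)
print(ngrams(tokenize(docs["school_b"]), 2))
print(search(index, "basketball football"))
```

Each keyword is stored once and points to the documents that contain it, so a query only touches the entries for its own words.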
How a Search Engine Works
Basic Plan
Scraping Workflow
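A scraping workflow of this kind usually follows a fetch, parse, enqueue loop. The sketch below is an assumption about that general shape, not the project's actual pipeline. It uses the standard-library `html.parser` in place of BeautifulSoup4 to stay dependency-free, and a hypothetical in-memory `SITE` dictionary in place of real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory site standing in for real HTTP responses.
SITE = {
    "http://example.com/": '<a href="/schools">Schools</a> <a href="/about">About</a>',
    "http://example.com/schools": '<a href="/">Home</a> <a href="/schools/a">A</a>',
    "http://example.com/about": "No links here",
    "http://example.com/schools/a": '<a href="/schools">Back</a>',
}


def crawl(start, fetch, limit=100):
    """Breadth-first crawl: fetch a page, extract its links, queue new ones."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        order.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("http://example.com/", SITE.get)
print(visited)
```

The `seen` set is what keeps the crawler from revisiting a page when two pages link to each other.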
Amazon Web Services
• Free for the first year
• A wide variety of advanced services under one roof
• More computational power than other free services
Working Windows Server on AWS
Beautifulsoup4
Shortcomings
• Black holes for scraping bots
• Duplicate data
• Unethically coded sites
Bad robots.txt Example
This allows spiders to crawl the whole CMS of the website, resulting in duplicate data that can be reached through different URLs.
This completely disrupts the indexing of the website's data.
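To make the contrast concrete, the sketch below compares a fully permissive robots.txt with a more restrictive one, using Python's standard `urllib.robotparser`. Both file contents, the `/admin/` and `/print/` paths, and the `demo-bot` user agent are hypothetical. They are not the file shown in the original slide.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt):
    """Build a robots.txt parser from the file's text, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical "bad" file: an empty Disallow lets spiders crawl everything.
PERMISSIVE = """\
User-agent: *
Disallow:
"""

# Hypothetical stricter file that keeps spiders out of CMS-generated paths.
RESTRICTIVE = """\
User-agent: *
Disallow: /admin/
Disallow: /print/
"""

open_site = make_parser(PERMISSIVE)
guarded_site = make_parser(RESTRICTIVE)

# A printer-friendly copy of a page is a typical source of duplicate data.
duplicate_url = "http://example.com/print/page1"
open_allows = open_site.can_fetch("demo-bot", duplicate_url)
guarded_allows = guarded_site.can_fetch("demo-bot", duplicate_url)
guarded_allows_main = guarded_site.can_fetch("demo-bot", "http://example.com/page1")
print(open_allows, guarded_allows, guarded_allows_main)
```

With the permissive file, the crawler may fetch both the main page and its duplicate. With the restrictive file, only the main page is fetched.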
How We Overcame Them (Mostly)
• Single-site lock on link finders
• Dynamic duplicate sorters
• A range for indexing to detect scrap or trash data
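The three measures above might look roughly like the sketch below. It is an interpretation and not the presenters' implementation. It assumes that the "single-site lock" means keeping only links on the starting host, that duplicates are detected by hashing normalized page text, and that the "range" is a minimum and maximum text length. All names, sample pages, and thresholds are hypothetical.

```python
import hashlib
from urllib.parse import urlparse


def same_site(url, root):
    """Single-site lock: accept only URLs on the same host as the root."""
    return urlparse(url).netloc == urlparse(root).netloc


def content_key(text):
    """Hash the text after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_pages(pages, root, min_len=20, max_len=100_000):
    """Keep on-site pages of plausible length whose content is not a repeat."""
    seen = set()
    kept = []
    for url, text in pages:
        if not same_site(url, root):
            continue  # link leads off the locked site
        if not (min_len <= len(text) <= max_len):
            continue  # outside the assumed length range: likely trash
        key = content_key(text)
        if key in seen:
            continue  # same content already reached through another URL
        seen.add(key)
        kept.append(url)
    return kept


# Hypothetical crawl results as (url, extracted text) pairs.
pages = [
    ("http://example.com/schools/a", "Green Valley School: basketball, football"),
    ("http://example.com/index.php?id=a", "Green  Valley School: basketball, FOOTBALL"),
    ("http://other.org/ads", "Unrelated page on another site, long enough"),
    ("http://example.com/empty", "n/a"),
]
kept = filter_pages(pages, "http://example.com/")
print(kept)
```

Hashing the content rather than the URL is what catches the same page served under different addresses.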
Thank You
