CHALLENGES IN WEB CRAWLING
WEB CRAWLER
A web crawler (also known by other names such as ants, automatic
indexers, bots, web spiders, or web robots) is an automated
program, or script, that methodically scans or “crawls”
through web pages to create an index of the data it is set to
look for. This process is called web crawling or spidering.
CRAWLER
A crawler is a program that visits Web sites and reads their
pages and other information in order to create entries for
a search engine index. The major search engines on the
Web all have such a program, which is also known as a
"spider" or a "bot." Crawlers are typically programmed to
visit sites that have been submitted by their owners as new
or updated.
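To make this visit-and-read loop concrete, here is a minimal, illustrative sketch in Python using only the standard library; the seed URL, page limit, and error handling are simplifications, not how a production crawler is built.

```python
# A minimal, illustrative crawl loop: fetch a page, record its HTML, extract
# its links, and queue newly discovered links for later visits.
# Standard library only; error handling is deliberately simple.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, visiting at most max_pages pages."""
    frontier = deque([seed_url])
    visited = set()
    pages = {}  # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html

        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links

    return pages
```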
HOW A WEB CRAWLER WORKS
The World Wide Web is full of information. If you want to know
something, you can probably find the information online. But
how can you find the answer you want, when the web contains
trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the looking for us.
But how do search engines know where to look? How can
search engines recommend a few pages out of the trillions that
exist? The answer lies with web crawlers.
HOW A WEB CRAWLER WORKS
Crawlers scan web pages to see what words they contain,
and where those words are used. The crawler turns its
findings into a giant index. The index is basically a big list of
words and the web pages that feature them. So when you
ask a search engine for pages about hippos, the search
engine checks its index and gives you a list of pages that
mention hippos. Web crawlers scan the web regularly so
they always have an up-to-date index of the web.
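A minimal sketch of the index described above, mapping each word to the pages that mention it; the tokenization rule and the hippos example are illustrative assumptions.

```python
# Illustrative sketch of the index: map each word to the set of pages that
# mention it, so a one-word query becomes a single dictionary lookup.
import re
from collections import defaultdict


def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, word):
    """Which pages mention this word?"""
    return index.get(word.lower(), set())


# Example: asked about "hippos", the engine looks the word up in its index.
# pages = crawl("https://example.com")   # hypothetical seed URL
# print(search(build_index(pages), "hippos"))
```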
THE SEO IMPLICATIONS OF WEB CRAWLERS
Now that you know how a web crawler works, you can see
that the behavior of the web crawler has implications for
how you optimize your website.
For example, you can see that, if you sell parachutes, it’s
important that you write about parachutes on your website.
If you don’t write about parachutes, search engines will
never suggest your website to people searching for
parachutes.
THE SEO IMPLICATIONS OF WEB CRAWLERS
It’s also important to note that web crawlers don’t just pay attention to
what words they find – they also record where the words are found. So
the web crawler knows that a word contained in headings, metadata,
and the first few sentences is likely to be more important in the context
of the page, and that keywords in prime locations suggest that the page
is really ‘about’ those keywords.
So if you want search engines to know that parachutes are a big deal on
your website, mention them in your headings, metadata, and opening
sentences.
The fact that web crawlers regularly trawl the web to make sure their
index is up to date also suggests that having fresh content on your
website is a good thing too.
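The idea that words in headings, metadata, and opening sentences count for more can be illustrated with a toy scoring rule; the weights and the parachute page below are made-up assumptions for demonstration, not how any real search engine scores pages.

```python
# Toy scoring rule: keywords found in the title, headings, meta description,
# and opening text earn more than keywords found only in the body.
# The weights are arbitrary assumptions used to demonstrate the idea.
LOCATION_WEIGHTS = {
    "title": 5.0,
    "heading": 3.0,
    "meta_description": 2.0,
    "opening_text": 2.0,
    "body": 1.0,
}


def keyword_score(keyword, page_fields):
    """page_fields: dict mapping a location name to the text found there."""
    keyword = keyword.lower()
    score = 0.0
    for location, text in page_fields.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        score += weight * text.lower().count(keyword)
    return score


page = {  # hypothetical parachute shop page
    "title": "Parachutes for sale",
    "heading": "Why our parachutes are safe",
    "meta_description": "Buy lightweight parachutes online.",
    "opening_text": "Our parachutes are tested to the highest standard.",
    "body": "Free shipping on all orders over $100.",
}
print(keyword_score("parachutes", page))  # outscores a body-only mention
```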
SEARCH ENGINE INDEXES
Once the crawler has found information by crawling over the web, the
program builds the index. The index is essentially a big list of all the
words the crawler has found, as well as their location.
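Since the index stores locations as well as words, the earlier word-to-pages sketch can be extended to record where each occurrence appears; again, this is only an illustration.

```python
# Extension of the earlier sketch: for every word, store not just which pages
# contain it but also the word positions at which it occurs on each page.
import re
from collections import defaultdict


def build_positional_index(pages):
    """pages: dict of url -> page text. Returns word -> {url: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][url].append(position)
    return index
```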
CHALLENGES IN WEB CRAWLING
• Challenge I: Non-Uniform Structures
• Challenge II: Omnipresence of AJAX elements
• Challenge III: The “Real” Real-Time Latency
• Challenge IV: Who owns UGC?
CHALLENGE I: NON-UNIFORM STRUCTURES
Data formats and structures are inconsistent in the ever-evolving Web space,
and norms for how to build an Internet presence are non-existent.
The result?
A lack of uniformity across the vast, ever-changing terrain of the Internet.
The problem?
Collecting data in a machine-readable format becomes difficult, and the
problems grow with scale, especially when:
a) structured data is needed, and
b) a large number of details must be extracted against a specific schema from
multiple sources (a per-source normalization sketch follows below).
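One common way to cope with non-uniform structures is to write a small extraction rule per source and normalize everything into one shared schema. The sites, field names, and parsing rules below are hypothetical, purely to illustrate the pattern.

```python
# Hypothetical example of normalizing non-uniform sources into one schema:
# each source gets its own small extraction rule, and every record ends up
# with the same fields regardless of where it came from.
COMMON_SCHEMA = ("title", "price", "currency")


def extract_site_a(record):
    # Site A (hypothetical) exposes {"name": ..., "cost_usd": ...}
    return {"title": record["name"], "price": float(record["cost_usd"]), "currency": "USD"}


def extract_site_b(record):
    # Site B (hypothetical) exposes {"product_title": ..., "price": "89.50 EUR"}
    amount, currency = record["price"].split()
    return {"title": record["product_title"], "price": float(amount), "currency": currency}


EXTRACTORS = {"site_a": extract_site_a, "site_b": extract_site_b}


def normalize(source, record):
    """Map a source-specific record onto the common schema."""
    row = EXTRACTORS[source](record)
    return {field: row[field] for field in COMMON_SCHEMA}


print(normalize("site_b", {"product_title": "Tent", "price": "89.50 EUR"}))
```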
CHALLENGE II: OMNIPRESENCE OF AJAX ELEMENTS
AJAX and interactive web components make websites more user-friendly. But
not for crawlers!
The result?
Content is produced dynamically (on the fly) by the browser and is therefore
not visible to crawlers.
The problem?
To keep the content up to date, the crawler needs to be maintained manually
on a regular basis, so much so that even Google’s crawlers find it difficult to
extract information!
The solution?
Crawlers need to be refined in their approach to be more efficient and
scalable. One common refinement, rendering pages in a headless browser before
indexing, is sketched below.
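The sketch below shows that refinement: render an AJAX-heavy page in a headless browser so that script-generated content becomes visible before indexing. It assumes the Playwright package is installed; the URL is a placeholder.

```python
# Sketch: render an AJAX-heavy page in a headless browser so that content
# generated by JavaScript becomes visible before indexing.
# Assumes the Playwright package is installed; the URL below is a placeholder.
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX calls to settle
        html = page.content()  # HTML after scripts have run
        browser.close()
    return html


# html = fetch_rendered_html("https://example.com")  # placeholder URL
```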
CHALLENGE III: THE “REAL” REAL-TIME LATENCY
Acquiring datasets in real time is a huge problem! Real-time data is
critical in security and intelligence to predict, report, and enable
preemptive action against untoward incidents.
The problem?
The real difficulty lies in deciding what is and isn't important in real
time; one common approach, a priority-ordered crawl frontier, is sketched below.
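A common way to model that decision is a priority queue over the crawl frontier: each URL gets a score combining freshness and importance, and the crawler fetches the best-scoring item first. The scoring formula and URLs below are illustrative assumptions.

```python
# Illustrative sketch: the crawl frontier as a priority queue, so the crawler
# fetches the freshest, most important pages first. The scoring formula is an
# assumption made for demonstration, not a production policy.
import heapq
import time


class RealTimeFrontier:
    def __init__(self):
        self._heap = []

    def add(self, url, importance, published_ts):
        # Fresher and more important items get a lower (better) priority value.
        age_seconds = max(time.time() - published_ts, 1.0)
        priority = age_seconds / importance
        heapq.heappush(self._heap, (priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None


frontier = RealTimeFrontier()
frontier.add("https://example.com/breaking-alert", importance=10.0, published_ts=time.time() - 60)
frontier.add("https://example.com/archive-page", importance=1.0, published_ts=time.time() - 86400)
print(frontier.next_url())  # the fresher, more important URL comes out first
```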
CHALLENGE IV: WHO OWNS UGC?
User-Generated Content (UGC) is claimed as proprietary by giants
like Craigslist and Yelp and is usually out of bounds for commercial
crawlers.
The result?
Only 2-3% of sites disallow bots today. Others believe in data
democratization, but they may yet follow suit and shut off access to the
data gold mine!
The problem?
Sites police web scraping and reject bots; a crawler that respects these
policies checks robots.txt before fetching, as sketched below.
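Sites that disallow bots usually publish that policy in robots.txt, so a crawler that respects ownership claims consults it before fetching. A minimal standard-library sketch, with a placeholder user-agent and URL:

```python
# Sketch: consult a site's robots.txt before crawling, so pages whose owners
# have disallowed bots are skipped. Standard library only; the user-agent
# string and URL are placeholders.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(url, user_agent="example-crawler"):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    robots = RobotFileParser()
    robots.set_url(urljoin(url, "/robots.txt"))
    robots.read()  # download and parse robots.txt
    return robots.can_fetch(user_agent, url)


# if allowed_to_fetch("https://example.com/listings"):  # placeholder URL
#     ...fetch and index the page...
```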
THANK YOU!
