CHALLENGES IN WEB CRAWLING
WEB CRAWLER
A web crawler (also known by other names such as ants, automatic
indexers, bots, web spiders, or web robots) is an automated
program, or script, that methodically scans or “crawls”
through web pages to create an index of the data it is set to
look for. This process is called web crawling or spidering.
CRAWLER
A crawler is a program that visits Web sites and reads their
pages and other information in order to create entries for
a search engine index. The major search engines on the
Web all have such a program, which is also known as a
"spider" or a "bot." Crawlers are typically programmed to
visit sites that have been submitted by their owners as new
or updated.
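To make this visit-and-read loop concrete, here is a minimal, illustrative sketch in Python using only the standard library; the seed URL, page limit, and error handling are simplifications, not how a production crawler is built.

```python
# A minimal, illustrative crawl loop: fetch a page, record its HTML, extract
# its links, and queue newly discovered links for later visits.
# Standard library only; error handling is deliberately simple.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, visiting at most max_pages pages."""
    frontier = deque([seed_url])
    visited = set()
    pages = {}  # url -> raw HTML

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html

        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative links

    return pages
```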
HOW A WEB CRAWLER WORKS
The World Wide Web is full of information. If you want to know
something, you can probably find the information online. But
how can you find the answer you want, when the web contains
trillions of pages? How do you know where to look?
Fortunately, we have search engines to do the looking for us.
But how do search engines know where to look? How can
search engines recommend a few pages out of the trillions that
exist? The answer lies with web crawlers.
HOW A WEB CRAWLER WORKS
Crawlers scan web pages to see what words they contain,
and where those words are used. The crawler turns its
findings into a giant index. The index is basically a big list of
words and the web pages that feature them. So when you
ask a search engine for pages about hippos, the search
engine checks its index and gives you a list of pages that
mention hippos. Web crawlers scan the web regularly so
they always have an up-to-date index of the web.
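A minimal sketch of the index described above, mapping each word to the pages that mention it; the tokenization rule and the hippos example are illustrative assumptions.

```python
# Illustrative sketch of the index: map each word to the set of pages that
# mention it, so a one-word query becomes a single dictionary lookup.
import re
from collections import defaultdict


def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, word):
    """Which pages mention this word?"""
    return index.get(word.lower(), set())


# Example: asked about "hippos", the engine looks the word up in its index.
# pages = crawl("https://example.com")   # hypothetical seed URL
# print(search(build_index(pages), "hippos"))
```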
THE SEO IMPLICATIONS OF WEB CRAWLERS
Now that you know how a web crawler works, you can see
that the behavior of the web crawler has implications for
how you optimize your website.
For example, you can see that, if you sell parachutes, it’s
important that you write about parachutes on your website.
If you don’t write about parachutes, search engines will
never suggest your website to people searching for
parachutes.
THE SEO IMPLICATIONS OF WEB CRAWLERS
It’s also important to note that web crawlers don’t just pay attention to
what words they find – they also record where the words are found. So
the web crawler knows that a word contained in headings, metadata,
and the first few sentences is likely to be more important in the context
of the page, and that keywords in prime locations suggest that the page
is really ‘about’ those keywords.
So if you want search engines to know that parachutes are a big deal on
your website, mention them in your headings, metadata, and opening
sentences.
The fact that web crawlers regularly trawl the web to make sure their
index is up to date also suggests that having fresh content on your
website is a good thing too.
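The idea that words in headings, metadata, and opening sentences count for more can be illustrated with a toy scoring rule; the weights and the parachute page below are made-up assumptions for demonstration, not how any real search engine scores pages.

```python
# Toy scoring rule: keywords found in the title, headings, meta description,
# and opening text earn more than keywords found only in the body.
# The weights are arbitrary assumptions used to demonstrate the idea.
LOCATION_WEIGHTS = {
    "title": 5.0,
    "heading": 3.0,
    "meta_description": 2.0,
    "opening_text": 2.0,
    "body": 1.0,
}


def keyword_score(keyword, page_fields):
    """page_fields: dict mapping a location name to the text found there."""
    keyword = keyword.lower()
    score = 0.0
    for location, text in page_fields.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        score += weight * text.lower().count(keyword)
    return score


page = {  # hypothetical parachute shop page
    "title": "Parachutes for sale",
    "heading": "Why our parachutes are safe",
    "meta_description": "Buy lightweight parachutes online.",
    "opening_text": "Our parachutes are tested to the highest standard.",
    "body": "Free shipping on all orders over $100.",
}
print(keyword_score("parachutes", page))  # outscores a body-only mention
```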
SEARCH ENGINE INDEXES
Once the crawler has found information by crawling over the web, the
program builds the index. The index is essentially a big list of all the
words the crawler has found, as well as their location.
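Since the index stores locations as well as words, the earlier word-to-pages sketch can be extended to record where each occurrence appears; again, this is only an illustration.

```python
# Extension of the earlier sketch: for every word, store not just which pages
# contain it but also the word positions at which it occurs on each page.
import re
from collections import defaultdict


def build_positional_index(pages):
    """pages: dict of url -> page text. Returns word -> {url: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][url].append(position)
    return index
```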
CHALLENGES IN WEB CRAWLING
• Challenge I: Non-Uniform Structures
• Challenge II: Omnipresence of AJAX elements
• Challenge III: The “Real” Real-Time Latency
• Challenge IV: Who owns UGC?
CHALLENGE I: NON-UNIFORM STRUCTURES
Data formats and structures are inconsistent in the ever-evolving Web space,
and norms for how to build an Internet presence are non-existent.
The result?
A lack of uniformity across the vast, ever-changing terrain of the Internet.
The problem?
Collecting data in a machine-readable format becomes difficult, and the
problems grow with scale, especially when:
a) structured data is needed, and
b) a large number of details must be extracted against a specific schema from
multiple sources (a per-source normalization sketch follows below).
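One common way to cope with non-uniform structures is to write a small extraction rule per source and normalize everything into one shared schema. The sites, field names, and parsing rules below are hypothetical, purely to illustrate the pattern.

```python
# Hypothetical example of normalizing non-uniform sources into one schema:
# each source gets its own small extraction rule, and every record ends up
# with the same fields regardless of where it came from.
COMMON_SCHEMA = ("title", "price", "currency")


def extract_site_a(record):
    # Site A (hypothetical) exposes {"name": ..., "cost_usd": ...}
    return {"title": record["name"], "price": float(record["cost_usd"]), "currency": "USD"}


def extract_site_b(record):
    # Site B (hypothetical) exposes {"product_title": ..., "price": "89.50 EUR"}
    amount, currency = record["price"].split()
    return {"title": record["product_title"], "price": float(amount), "currency": currency}


EXTRACTORS = {"site_a": extract_site_a, "site_b": extract_site_b}


def normalize(source, record):
    """Map a source-specific record onto the common schema."""
    row = EXTRACTORS[source](record)
    return {field: row[field] for field in COMMON_SCHEMA}


print(normalize("site_b", {"product_title": "Tent", "price": "89.50 EUR"}))
```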
CHALLENGE II: OMNIPRESENCE OF AJAX ELEMENTS
AJAX and interactive web components make websites more user-friendly. But
not for crawlers!
The result?
Content is produced dynamically (on the fly) by the browser and is therefore
not visible to crawlers.
The problem?
To keep the content up to date, the crawler needs to be maintained manually
on a regular basis, so much so that even Google’s crawlers find it difficult to
extract information!
The solution?
Crawlers need to be refined in their approach to be more efficient and
scalable. One common refinement, rendering pages in a headless browser before
indexing, is sketched below.
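The sketch below shows that refinement: render an AJAX-heavy page in a headless browser so that script-generated content becomes visible before indexing. It assumes the Playwright package is installed; the URL is a placeholder.

```python
# Sketch: render an AJAX-heavy page in a headless browser so that content
# generated by JavaScript becomes visible before indexing.
# Assumes the Playwright package is installed; the URL below is a placeholder.
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for AJAX calls to settle
        html = page.content()  # HTML after scripts have run
        browser.close()
    return html


# html = fetch_rendered_html("https://example.com")  # placeholder URL
```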
CHALLENGE III: THE “REAL” REAL-TIME LATENCY
Acquiring datasets in real time is a huge problem! Real-time data is
critical in security and intelligence to predict, report, and enable
preemptive action against untoward incidents.
The problem?
The real difficulty lies in deciding what is and isn't important in real
time; one common approach, a priority-ordered crawl frontier, is sketched below.
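A common way to model that decision is a priority queue over the crawl frontier: each URL gets a score combining freshness and importance, and the crawler fetches the best-scoring item first. The scoring formula and URLs below are illustrative assumptions.

```python
# Illustrative sketch: the crawl frontier as a priority queue, so the crawler
# fetches the freshest, most important pages first. The scoring formula is an
# assumption made for demonstration, not a production policy.
import heapq
import time


class RealTimeFrontier:
    def __init__(self):
        self._heap = []

    def add(self, url, importance, published_ts):
        # Fresher and more important items get a lower (better) priority value.
        age_seconds = max(time.time() - published_ts, 1.0)
        priority = age_seconds / importance
        heapq.heappush(self._heap, (priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None


frontier = RealTimeFrontier()
frontier.add("https://example.com/breaking-alert", importance=10.0, published_ts=time.time() - 60)
frontier.add("https://example.com/archive-page", importance=1.0, published_ts=time.time() - 86400)
print(frontier.next_url())  # the fresher, more important URL comes out first
```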
CHALLENGE IV: WHO OWNS UGC?
User-Generated Content (UGC) is claimed as proprietary by giants
like Craigslist and Yelp and is usually out of bounds for commercial
crawlers.
The result?
Only 2-3% of sites disallow bots today. Others believe in data
democratization, but they may yet follow suit and shut off access to the
data gold mine!
The problem?
Sites police web scraping and reject bots; a crawler that respects these
policies checks robots.txt before fetching, as sketched below.
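Sites that disallow bots usually publish that policy in robots.txt, so a crawler that respects ownership claims consults it before fetching. A minimal standard-library sketch, with a placeholder user-agent and URL:

```python
# Sketch: consult a site's robots.txt before crawling, so pages whose owners
# have disallowed bots are skipped. Standard library only; the user-agent
# string and URL are placeholders.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(url, user_agent="example-crawler"):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    robots = RobotFileParser()
    robots.set_url(urljoin(url, "/robots.txt"))
    robots.read()  # download and parse robots.txt
    return robots.can_fetch(user_agent, url)


# if allowed_to_fetch("https://example.com/listings"):  # placeholder URL
#     ...fetch and index the page...
```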
THANK YOU!
