This document describes a focused-crawling method for retrieving web pages that contain structured data. An online classifier trained on URL features predicts which pages are likely to contain structured data, and a bandit-based selection strategy balances exploration of new candidates against exploitation of those already known to be productive. In experiments, the adaptive approach retrieves 26% more relevant pages than static classification, and 66% more when the crawl is focused on a specific objective. Decaying the bandit's randomness over time improves results further. The method retrieved hundreds of millions of structured-data pages from billions of web pages.
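
The interplay of the online classifier and the bandit described above can be illustrated with a minimal Python sketch. It is not the method's actual implementation: the logistic-regression learner over URL tokens, the epsilon-greedy policy with a `eps0 / (1 + decay * t)` schedule, the use of hosts as bandit arms, and all names (`OnlineUrlClassifier`, `DecayingEpsilonGreedy`, `crawl`, `fetch_label`) are assumptions chosen for illustration.

```python
import math
import random
import re


class OnlineUrlClassifier:
    """Logistic regression over URL tokens, updated one example at a time.
    (Assumed learner; the summary only says 'online classifier on URL features'.)"""

    def __init__(self, lr=0.1):
        self.w = {}   # token -> weight
        self.lr = lr

    def _features(self, url):
        # Split the URL on non-alphanumeric characters; each distinct token is a feature.
        return {t for t in re.split(r"[^a-z0-9]+", url.lower()) if t}

    def predict(self, url):
        z = sum(self.w.get(t, 0.0) for t in self._features(url))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, url, label):
        # One stochastic-gradient step on the log loss.
        err = label - self.predict(url)
        for t in self._features(url):
            self.w[t] = self.w.get(t, 0.0) + self.lr * err


class DecayingEpsilonGreedy:
    """Epsilon-greedy bandit whose randomness shrinks over time.
    (Assumed policy and decay schedule.)"""

    def __init__(self, eps0=0.5, decay=0.01, seed=0):
        self.eps0, self.decay = eps0, decay
        self.rng = random.Random(seed)
        self.pulls, self.reward, self.t = {}, {}, 0

    def epsilon(self):
        return self.eps0 / (1.0 + self.decay * self.t)

    def select(self, arms):
        self.t += 1
        if self.rng.random() < self.epsilon():
            return self.rng.choice(arms)  # explore: pick a random arm
        # exploit: pick the arm with the highest mean observed reward
        return max(arms, key=lambda a: self.reward.get(a, 0.0) / max(self.pulls.get(a, 0), 1))

    def update(self, arm, r):
        self.pulls[arm] = self.pulls.get(arm, 0) + 1
        self.reward[arm] = self.reward.get(arm, 0.0) + r


def crawl(frontier, fetch_label, steps, clf, bandit):
    """frontier: dict host -> list of URLs (hosts act as bandit arms, an assumption).
    fetch_label(url) -> 1 if the fetched page contains structured data, else 0."""
    hits = 0
    for _ in range(steps):
        arms = sorted(h for h, urls in frontier.items() if urls)
        if not arms:
            break
        host = bandit.select(arms)
        url = max(frontier[host], key=clf.predict)  # best-scored URL within the arm
        frontier[host].remove(url)
        label = fetch_label(url)
        clf.update(url, label)      # the classifier adapts during the crawl
        bandit.update(host, label)  # the bandit learns which arm pays off
        hits += label
    return hits
```

In this sketch, `fetch_label` stands in for downloading a page and checking it for structured-data markup. Setting `decay=0` gives a fixed exploration rate, which makes it easy to compare against the decaying variant.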