This document summarizes previous work on content extraction from web pages and proposes a new approach. It reviews existing methods based on entropy analysis, DOM trees, clustering, and ratios of text, links, and tags. The proposed approach combines the word-to-leaf ratio with the text-to-link and link-to-text ratios to identify informative nodes in the DOM tree. It then uses the weights and relative positions of those nodes to select the most informative content. The method is to be tested on different types of websites and compared with existing approaches.
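As a rough illustration of how such ratios can be computed over a DOM tree, the sketch below uses only the Python standard library. The definitions are assumptions made for this example, not the document's own formulas: the word-to-leaf ratio is taken as the number of words in a subtree divided by its number of leaf elements, and the link share is the fraction of those words that sit inside `<a>` elements. The names `Node`, `TreeBuilder`, `parse`, `stats`, `score`, and `find_by_id` are hypothetical, and the way the two measures are multiplied together is one simple choice among many.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """Minimal DOM node: tag, attributes, children, and its own word count."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_words = 0  # words in text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_words += len(data.split())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def stats(node, in_link=False):
    """Return (words, link_words, leaves) for the subtree rooted at node."""
    in_link = in_link or node.tag == "a"
    words = node.own_words
    link_words = node.own_words if in_link else 0
    leaves = 0 if node.children else 1  # a childless element is a leaf
    for child in node.children:
        w, lw, lv = stats(child, in_link)
        words += w
        link_words += lw
        leaves += lv
    return words, link_words, leaves


def score(node):
    """Assumed combination: word-to-leaf ratio scaled by the non-link share."""
    words, link_words, leaves = stats(node)
    word_to_leaf = words / leaves
    link_share = link_words / words if words else 1.0
    return word_to_leaf * (1.0 - link_share)


def find_by_id(node, element_id):
    """Depth-first search for the first element with the given id."""
    if node.attrs.get("id") == element_id:
        return node
    for child in node.children:
        found = find_by_id(child, element_id)
        if found is not None:
            return found
    return None


if __name__ == "__main__":
    page = ('<body><div id="nav"><a href="/">Home</a> '
            '<a href="/about">About</a></div>'
            '<div id="main"><p>Content extraction keeps the informative '
            'text of a page.</p><p>Navigation links are mostly noise.</p>'
            '</div></body>')
    tree = parse(page)
    for element_id in ("nav", "main"):
        print(element_id, score(find_by_id(tree, element_id)))
```

Under these assumed definitions, a navigation block made up entirely of links scores zero, while a block of plain paragraphs scores in proportion to how much text each of its leaves carries. A fuller implementation would also need the node weighting and relative-position step that the summary mentions but does not specify.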