This document discusses two machine learning techniques, one supervised and one unsupervised, for automatically marking up natural heritage literature so that it becomes structured and machine-readable. It describes a prototype application that uses these techniques to convert free-text documents to XML in both batch and online modes. The supervised technique is trained on manually annotated examples, while the unsupervised technique derives structure solely from regularities in the text, without annotated examples. The document compares the performance of the two techniques on a real corpus.