The document discusses several key topics in natural language processing and computational linguistics:
1. It defines the basic units of language like words, tokens, types and texts.
2. It describes techniques for extracting text from sources such as files, web pages and corpora, and for preprocessing that text by removing HTML tags and normalizing whitespace.
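3. It discusses empirical observations about word frequencies: Zipf's Law, under which a word's frequency is roughly inversely proportional to its frequency rank (so a small number of words occur very frequently while most words occur rarely), and Heaps' Law, under which vocabulary size grows sublinearly with the length of the text.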
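
As a rough illustration of the topics above, the following minimal Python sketch strips HTML tags, normalizes whitespace, splits the text into tokens, counts types, and ranks words by frequency. It is not taken from the document itself: the helper names (`strip_html`, `normalize_whitespace`, `tokenize`, `vocabulary_growth`), the sample string, and the deliberately simple regex tokenizer are all assumptions made for illustration, and it uses only the Python standard library.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and discards the tags.

    Note: text inside <script> or <style> elements would also be kept;
    a real pipeline would likely filter those out.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(raw):
    """Return the text content of an HTML string, without tags."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.chunks)


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Assumed tokenizer: lowercase, then keep runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_growth(tokens):
    """Number of distinct types seen after each token (the quantity
    whose growth Heaps' Law describes)."""
    seen = set()
    sizes = []
    for tok in tokens:
        seen.add(tok)
        sizes.append(len(seen))
    return sizes


# Made-up sample input, far too small to show any statistical law.
raw = "<html><body><p>The cat sat on   the mat.</p>\n<p>The dog sat too.</p></body></html>"

text = normalize_whitespace(strip_html(raw))
tokens = tokenize(text)            # every running word occurrence
types = set(tokens)                # distinct word forms
ranked = Counter(tokens).most_common()  # (word, count), most frequent first

print(text)
print(len(tokens), "tokens,", len(types), "types")
for rank, (word, count) in enumerate(ranked, start=1):
    print(rank, word, count)
print(vocabulary_growth(tokens))
```

On a corpus of realistic size, plotting count against rank on log-log axes is the usual way to inspect the Zipf-style pattern, and plotting the output of `vocabulary_growth` against token position is the usual way to inspect vocabulary growth; the toy sample here only shows the mechanics.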