This document introduces core information retrieval concepts: document ingestion, tokenization, terms, normalization, stemming, and lemmatization. It discusses how documents are parsed and tokenized, including the complications that arise from different languages and document formats. Terms are the normalized tokens that get indexed; the document examines issues with stop words, numbers, case folding, and language-specific normalization. Stemming and lemmatization are introduced as ways of reducing the inflectional variants of a word to a common base form.
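To make the pipeline concrete, here is a minimal sketch in Python of the tokenize → normalize → stem steps. Everything in it is an illustrative assumption rather than taken from the document: the regex tokenizer, the tiny stop-word list, and the toy suffix-stripping `stem` function (a real system would use a full stemmer such as Porter's, or a dictionary-based lemmatizer).

```python
import re

# Illustrative stop-word sample; production lists are larger and language-specific.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}


def tokenize(text: str) -> list[str]:
    """Split raw text into tokens on non-alphanumeric boundaries."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens: list[str]) -> list[str]:
    """Case-fold tokens and drop stop words, yielding candidate index terms."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def stem(term: str) -> str:
    """Toy suffix-stripping stemmer (illustrative only; not Porter's algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


if __name__ == "__main__":
    doc = "The cats were chasing mice in the garden"
    terms = [stem(t) for t in normalize(tokenize(doc))]
    print(terms)  # ['cat', 'were', 'chas', 'mice', 'garden']
```

Note how the crude stemmer conflates "chasing" to the non-word "chas" and leaves "mice" untouched; a lemmatizer with a dictionary would instead map "mice" to "mouse", which is exactly the stemming-versus-lemmatization trade-off the document describes.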