- Documents are represented as vectors in a vector space, with one dimension per term. A training set consists of labelled documents that correspond to labelled points in this vector space.
- Classification methods include Rocchio classification, which divides the space into regions centered on class centroids, and k-nearest neighbors (kNN) classification, which assigns a document the majority class among its k closest training examples without explicitly defining decision boundaries.
- Common text classification approaches include prototype-based classification, which represents each class by the centroid of its training examples and assigns a new document to the class of the closest centroid.
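
The two methods summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`fit_centroids`, `rocchio_predict`, `knn_predict`), the two-dimensional toy vectors, and the class labels are all hypothetical, and plain Euclidean distance is assumed as the closeness measure (cosine similarity on length-normalized vectors is a common alternative).

```python
import math
from collections import Counter, defaultdict


def fit_centroids(vectors, labels):
    """Return one centroid per class: the mean of that class's training vectors."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


def rocchio_predict(x, centroids):
    """Assign x to the class whose centroid is nearest (Euclidean distance assumed)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


def knn_predict(x, vectors, labels, k=3):
    """Assign x the majority label among its k nearest training vectors."""
    ranked = sorted(zip(vectors, labels), key=lambda pair: math.dist(x, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: each document is a vector over two terms.
train_vectors = [[3.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 2.0]]
train_labels = ["sports", "sports", "politics", "politics"]

centroids = fit_centroids(train_vectors, train_labels)
query = [2.5, 0.5]
rocchio_label = rocchio_predict(query, centroids)
knn_label = knn_predict(query, train_vectors, train_labels, k=3)
```

Note the difference in what each method stores: the centroid classifier keeps only one vector per class after training, whereas kNN keeps the whole training set and does its work at prediction time.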
Related topics: