Bag-of-words models represent documents or images as histograms of word occurrences; for images, the "words" are quantized local features called visual words. The three key steps are:
1) Feature detection and representation, where local features (e.g., SIFT descriptors computed at interest points or on a dense grid) are extracted from images and represented as vectors.
2) Codebook formation, where vector quantization (typically k-means clustering) groups the local features into clusters whose centers serve as visual words.
3) Image representation, where each image is represented as a histogram of visual word frequencies; a minimal end-to-end sketch of all three steps follows this list.
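The sketch below is one illustrative way to implement the three steps, assuming OpenCV's SIFT implementation and scikit-learn's MiniBatchKMeans; the function names, vocabulary size, and other parameters are assumptions for the example, not details from the text.

```python
# Minimal bag-of-visual-words pipeline sketch (assumptions: OpenCV >= 4.4
# for SIFT, scikit-learn for k-means; parameter values are illustrative).
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def extract_descriptors(images):
    """Step 1: detect local features and describe each as a 128-d SIFT vector."""
    sift = cv2.SIFT_create()
    per_image = []
    for img in images:  # each img: a grayscale image as a NumPy array
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return per_image

def build_codebook(per_image_descriptors, n_words=200, seed=0):
    """Step 2: vector-quantize the pooled descriptors into n_words visual words."""
    all_desc = np.vstack(per_image_descriptors)
    kmeans = MiniBatchKMeans(n_clusters=n_words, random_state=seed)
    kmeans.fit(all_desc)
    return kmeans

def bow_histogram(descriptors, kmeans):
    """Step 3: represent one image as a normalized histogram of visual words."""
    n_words = kmeans.n_clusters
    if len(descriptors) == 0:
        return np.zeros(n_words)
    words = kmeans.predict(descriptors)          # assign each descriptor to a word
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / hist.sum()                     # frequency histogram
```

A typical usage would be to call extract_descriptors on the training images, build_codebook once on all resulting descriptors, and then map every image (training and test) through bow_histogram to get its fixed-length representation.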
On top of this representation, generative models such as pLSA and LDA can learn the relationships between visual words and object categories from a training set (see the topic-model sketch below). Discriminative methods, such as SVMs with pyramid match kernels, can also be trained for object recognition directly on the bag-of-words histograms (see the kernel SVM sketch below).
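As an assumed illustration of the generative route, the sketch below fits scikit-learn's LatentDirichletAllocation to a matrix of visual-word counts; the placeholder data and number of topics are invented for the example and do not reproduce the exact pLSA/LDA models used in the literature.

```python
# Hedged sketch: LDA topic model over visual-word count histograms
# (rows = images, columns = visual words). All data here is synthetic.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Placeholder corpus: 100 "images", each a count vector over 200 visual words.
word_counts = rng.integers(0, 5, size=(100, 200))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(word_counts)

# Per-image topic mixtures; category models can be built on top of these
# latent topics rather than on raw word counts.
topic_mixtures = lda.transform(word_counts)
print(topic_mixtures.shape)  # (100, 10)
```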
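For the discriminative route, the sketch below trains an SVM on a precomputed histogram-intersection kernel over bag-of-words histograms. Histogram intersection is the single-level building block that the pyramid match kernel combines across multiple histogram resolutions, so this is a simplified stand-in rather than a full pyramid match implementation; all data and parameters are placeholders.

```python
# Hedged sketch: SVM with a precomputed histogram-intersection kernel
# over bag-of-words histograms (single-level, not the full pyramid match).
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) for histogram rows of A and B."""
    return np.array([[np.minimum(a, b).sum() for b in B] for a in A])

# Placeholder data: normalized BoW histograms with two synthetic classes.
rng = np.random.default_rng(0)
X_train = rng.random((40, 200)); X_train /= X_train.sum(axis=1, keepdims=True)
y_train = np.array([0] * 20 + [1] * 20)
X_test = rng.random((10, 200)); X_test /= X_test.sum(axis=1, keepdims=True)

clf = SVC(kernel="precomputed")
clf.fit(intersection_kernel(X_train, X_train), y_train)   # Gram matrix on train
pred = clf.predict(intersection_kernel(X_test, X_train))  # test-vs-train kernel
```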