The document discusses adding column-level statistics to Hive to improve query optimization. The statistics it outlines are minimum and maximum values, average and maximum lengths, the number of distinct values, the number of null values, and histograms. They would be computed implicitly during data loads or explicitly via ANALYZE, and stored in the metastore. Histograms and distinct-value counts are costly to compute exactly, in both memory and computation time, and the document notes that probabilistic algorithms could help reduce that cost.
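
To make the memory concern around distinct values concrete, here is a minimal Python sketch of one probabilistic approach, a K-minimum-values (KMV) estimator. It is an illustrative assumption, not necessarily the algorithm the document or Hive uses; the names `estimate_distinct`, `_hash01`, and the parameter `k` are hypothetical. The idea is to hash each value to a number in [0, 1), keep only the `k` smallest distinct hashes, and estimate the distinct count from how small the k-th hash is, so memory stays O(k) regardless of column size.

```python
import hashlib
import heapq


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(values, k=256):
    """Estimate the number of distinct non-null values using O(k) memory."""
    heap = []        # negated hashes, so heap[0] is the largest hash kept
    members = set()  # hashes currently held in the heap
    for v in values:
        if v is None:          # nulls are counted separately, not as a value
            continue
        h = _hash01(v)
        if h in members:       # duplicate of a value already tracked
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)       # fewer than k distinct values: count is exact
    # Standard KMV estimate: (k - 1) divided by the k-th smallest hash.
    return int((k - 1) / -heap[0])
```

For example, `estimate_distinct(column_values, k=1024)` keeps at most 1024 hashes in memory; a larger `k` trades memory for a tighter estimate, and columns with fewer than `k` distinct values are counted exactly.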