This document discusses optimizing discrete wavelet transforms for CPU performance. It covers loop fusion, removal of loop prologs and epilogs, effective use of the CPU cache, SIMD vectorization, and parallelization. Benchmark results show that these optimizations can achieve up to an 11x speedup over the separable diagonal implementation for a 10-megapixel image on an Intel Core 2 Quad CPU. Areas identified for future work include merging multiple levels and transforms.