From the course: Big Data Analytics with Hadoop and Apache Spark
Bucketing
- [Instructor] As seen in the previous video, partitioning is only optimal when a given attribute has a small set of unique values. What if we need to partition on a key with a large number of values without proliferating the number of directories? Bucketing is the answer. Bucketing works similarly to partitioning, but instead of using the value of the attribute directly, it uses a hash function to convert the value into a hash key. Values that have the same hash key end up in the same bucket, or subdirectory. The number of unique buckets can be controlled and limited, and hashing also keeps the distribution of values across buckets roughly even. This makes bucketing ideal for attributes with a large number of unique values, like order number or transaction ID. Choose buckets for attributes that have a large number of unique values and that are most frequently used in query filters. Experiment with multiple bucket counts to find the optimal read-write performance for your specific use case. In the next video, I…
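As a rough illustration of the idea described above, here is a minimal PySpark sketch of writing a bucketed table. The input path, table name, column name, and the choice of 16 buckets are all illustrative assumptions, not values from the course.

```python
from pyspark.sql import SparkSession

# Hive support is needed so the bucketed table can be saved as a managed table.
spark = (SparkSession.builder
         .appName("BucketingExample")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical input dataset with a high-cardinality key such as order_id.
orders_df = spark.read.parquet("/data/orders")

# Hash-bucket the data on order_id into 16 buckets; rows with the same
# hash of order_id land in the same bucket file. bucketBy requires
# saveAsTable rather than a plain save to a path.
(orders_df.write
    .format("parquet")
    .mode("overwrite")
    .bucketBy(16, "order_id")
    .sortBy("order_id")
    .saveAsTable("orders_bucketed"))
```

Note that Spark's bucketing metadata lives in the table catalog, which is why the sketch writes with saveAsTable; queries that filter or join on order_id can then take advantage of the bucket layout.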