From the course: Big Data Analytics with Hadoop and Apache Spark

Bucketing

- [Instructor] As seen in the previous video, partitioning is only optimal when a given attribute has a small set of unique values. What if we need to partition on a key with a large number of values without proliferating the number of directories? Bucketing is the answer. Bucketing works similarly to partitioning, but instead of using the value of the attribute directly, it uses a hash function to convert the value into a hash key. Values that have the same hash key end up in the same bucket or subdirectory. The number of unique buckets can be controlled and limited. Hashing also helps distribute values evenly across all buckets. It's ideal for attributes that have a large number of unique values, like order number or transaction ID. Choose bucketing for attributes that have a large number of unique values and that are most frequently used in query filters. Experiment with multiple bucket counts to find optimal read-write performance for the specific use case. In the next video, I…
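
To make the hashing idea concrete, here is a minimal sketch in plain Python, not taken from the course. It assumes a hypothetical list of order IDs and uses CRC32 as a stand-in hash; a real engine such as Spark or Hive uses its own hash function, so actual bucket assignments will differ. The names `bucket_for`, `order_ids`, and `NUM_BUCKETS` are illustrative only.

```python
import zlib
from collections import Counter


def bucket_for(value: str, num_buckets: int) -> int:
    """Map a value to a bucket index: stable hash, then modulo the bucket count.

    CRC32 is only a stand-in here (an assumption for illustration);
    real engines use their own hash functions.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical high-cardinality key: 10,000 distinct order IDs.
order_ids = [f"ORD-{i:06d}" for i in range(10_000)]

# The bucket count is fixed up front, no matter how many distinct keys exist.
NUM_BUCKETS = 8

# Count how many order IDs land in each bucket.
counts = Counter(bucket_for(order_id, NUM_BUCKETS) for order_id in order_ids)

for bucket in sorted(counts):
    print(f"bucket {bucket}: {counts[bucket]} rows")
```

Partitioning on this key would create 10,000 directories, whereas bucketing caps the output at `NUM_BUCKETS` groups. Because the same value always hashes to the same bucket, a filter on a single order ID only needs to look in one bucket. In Spark itself, bucketed output is typically written with `DataFrameWriter.bucketBy(numBuckets, col)` followed by `saveAsTable`, and changing `NUM_BUCKETS` in the sketch mirrors the experiment with bucket counts suggested above.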
