From the course: Big Data Analytics with Hadoop and Apache Spark
Bucketing
- [Instructor] As seen in the previous video, partitioning is only optimal when a given attribute has a small set of unique values. What if we need to partition on a key with a large number of values without proliferating the number of directories? Bucketing is the answer. Bucketing works similarly to partitioning, but instead of using the value of the attribute directly, it uses a hash function to convert the value into a hash key. Values that have the same hash key end up in the same bucket, or subdirectory. The number of unique buckets can be controlled and limited, and hashing also keeps the distribution of values across buckets roughly even. This makes bucketing ideal for attributes with a large number of unique values, like order number or transaction ID. Choose buckets for attributes that have a large number of unique values and that are most frequently used in query filters. Experiment with multiple bucket counts to find the optimal read-write performance for your specific use case. In the next video, I…
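As a rough illustration of the idea described above, here is a minimal PySpark sketch of writing a bucketed table. The input path, table name, column name, and the choice of 16 buckets are all illustrative assumptions, not values from the course.

```python
from pyspark.sql import SparkSession

# Hive support is needed so the bucketed table can be saved as a managed table.
spark = (SparkSession.builder
         .appName("BucketingExample")
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical input dataset with a high-cardinality key such as order_id.
orders_df = spark.read.parquet("/data/orders")

# Hash-bucket the data on order_id into 16 buckets; rows with the same
# hash of order_id land in the same bucket file. bucketBy requires
# saveAsTable rather than a plain save to a path.
(orders_df.write
    .format("parquet")
    .mode("overwrite")
    .bucketBy(16, "order_id")
    .sortBy("order_id")
    .saveAsTable("orders_bucketed"))
```

Note that Spark's bucketing metadata lives in the table catalog, which is why the sketch writes with saveAsTable; queries that filter or join on order_id can then take advantage of the bucket layout.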