Sachin D N
3 min read · Sep 7, 2024

Understanding Spark’s BucketBy Clause

1️⃣ What is the bucketBy Clause?
When a column has a large number of distinct values (high cardinality), bucketing is a better choice than partitioning. With bucketing, the number of buckets and the bucketing column(s) must be defined upfront and passed as parameters to the `bucketBy` clause.

Bucketing helps in 2 ways:

1. Skipping irrelevant data: When a query filters on the bucketing column, only the relevant buckets (those that can contain matching data) are scanned. This reduces the amount of data that needs to be processed, thereby improving query performance.

2. Join Optimizations: When joining two bucketed tables on the bucketing column, and both tables use the same number of buckets, Spark can perform a bucketed join, which is more efficient than a shuffle join. A bucketed join does not require shuffling the data, because rows with the same bucketing column value are guaranteed to reside in the corresponding bucket on both sides.

2️⃣ Usage of bucketBy
The `bucketBy` clause is used when writing a DataFrame or Dataset out as a table. The number of buckets and the column(s) to bucket by are specified as parameters. The data is then divided into buckets based on a hash of the bucketing column’s values, and each bucket is stored as one or more files in the underlying file system.

3️⃣ Performance Gains
Running a query that filters on the bucketed column of a bucketed table results in significant performance gains, as only the bucket containing the relevant data needs to be scanned to get the desired results. This is known as bucket pruning.

4️⃣ Key Points:
- Based on a hash function, records are routed to different buckets/files.
- To save bucketed data, a managed Spark table has to be created (via `saveAsTable`).
- A combination of Partitioning followed by Bucketing is possible, giving two-level filtering.
- However, Bucketing followed by Partitioning is not possible: partitioning produces folders while bucketing produces files, and while files can live inside a folder, a folder cannot live inside a file.

5️⃣ Storing and Retrieving Data Using Bucketing Approach
Consider an example with a fixed number of buckets = 4. Records are assigned to buckets using a hash function combined with modulo: the hash of the bucketing column’s value modulo 4 determines which bucket a record goes into.
When retrieving data, the same hash function is applied to the lookup key, so only that one bucket needs to be read.
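The store-and-retrieve idea can be sketched in plain Python. Note this is illustrative only: Spark actually uses a Murmur3 hash of the column value modulo the bucket count, whereas the sketch below uses the integer key itself as its own "hash":

```python
NUM_BUCKETS = 4

def bucket_for(key: int) -> int:
    """Assign a record to a bucket: hash the key, then take modulo.
    (Identity stands in for the hash here, purely for illustration.)"""
    return key % NUM_BUCKETS

# Storing: route each record into its bucket.
records = [10, 11, 12, 13, 14]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for r in records:
    buckets[bucket_for(r)].append(r)

# Retrieving: apply the same function to the lookup key,
# so only one bucket is scanned instead of all records.
def lookup(key: int):
    return [r for r in buckets[bucket_for(key)] if r == key]
```

This is exactly why reads on the bucketing column are cheap: the writer and the reader agree on the hash function, so the reader knows which bucket to open without scanning the rest.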

6️⃣ Limitations of bucketBy
Bucketing in Spark requires the number of buckets to be defined upfront, and it can’t be altered later, potentially causing inefficiency if the data distribution changes. It necessitates creating a managed Spark table, which may not always be suitable. Moreover, it’s less effective on columns with few unique values, as that can lead to empty buckets or skewed data processing.

#ApacheSpark #BigData #DataPartitioning #SparkOptimization #DataEngineering #SparkPerformance #bucketBy
