Hive Bucketing in Apache Spark

Slides

https://github.com/keypointt/reading/blob/master/spark/2017_spark_spring_SF_FaceBook_Hive_Spark.pdf

note

1. Why bucketing

[Figure: bucketing use case]

Pain point: when the student table is joined with multiple other tables on the same id, it has to be shuffled for every join (written to and read from disk each time), which incurs high disk I/O and network cost.

Solution: pre-compute the shuffle once, when the student table is created, by bucketing it on the join key.

Generalization: bucketing applies when many join operations use the same key. If the same key is used over and over again, the table does not need to be shuffled every time.

Result: less CPU, less latency.
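
To make the idea concrete, here is a minimal pure-Python sketch of a bucketed join. It is not Spark or Hive code: the bucket count, the modulo hash, and the names `write_bucketed` and `bucketed_join` are all assumptions made for illustration. In Spark the corresponding write-side step is `DataFrameWriter.bucketBy`. The point is only that once each table is grouped by the same hash of the join key at creation time, every later join can match bucket `i` against bucket `i` locally, with no redistribution of rows.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen for illustration


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Toy hash for integer keys; real engines use their own hash functions.
    return key % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group rows by bucket once, at table-creation time (key is column 0)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_id(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left, right, num_buckets=NUM_BUCKETS):
    """Join two tables bucketed on the same key with the same bucket count.

    Matching keys always land in the same bucket number, so each bucket pair
    is joined locally and no row has to be moved between buckets.
    """
    result = []
    for b in range(num_buckets):
        # Index the right-hand bucket by key for a local hash join.
        right_by_key = defaultdict(list)
        for row in right.get(b, []):
            right_by_key[row[0]].append(row)
        for row in left.get(b, []):
            for match in right_by_key.get(row[0], []):
                result.append(row + match[1:])
    return result


# The student table is bucketed once and then reused for several joins.
students = write_bucketed([(1, "Ann"), (2, "Bob"), (5, "Eve")])
grades = write_bucketed([(1, "A"), (5, "B"), (7, "C")])
clubs = write_bucketed([(2, "chess"), (5, "robotics")])

student_grades = bucketed_join(students, grades)
student_clubs = bucketed_join(students, clubs)
```

The bucketing cost for `students` is paid once in `write_bucketed`, while both joins reuse the same layout. This only works when both sides use the same key, hash function, and bucket count.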

2. Bucketing: hive-vs-spark

[Figure: bucketing, Hive vs Spark]

Master Ticket: Hive bucketing support

https://issues.apache.org/jira/browse/SPARK-19256

Reference:

https://www.youtube.com/watch?v=6BD-Vv-ViBw&feature=youtu.be