From the course: Complete Guide to Databricks for Data Engineering

Broadcast join

- [Instructor] Broadcast join is a popular way of joining two tables in PySpark. Let's see what a broadcast join is and how we can use it. A broadcast join is a type of join where one of the DataFrames is broadcast to all worker nodes. What does that mean? Remember that a DataFrame has multiple partitions, right? Normally, when we join two DataFrames, the data of both has to be moved around. In a broadcast join, rather than moving both DataFrames, we take one DataFrame and transfer, or broadcast, its data to all the worker nodes. The idea is that this improves performance, because the other DataFrame does not have to be moved. The DataFrame you broadcast is the smaller of the two you are joining. For example, let's say I'm joining two DataFrames: one contains a hundred rows, and the other contains 1 million rows. In that case, I'm going to take the smaller DataFrame that is in…
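
To make the idea concrete, here is a minimal sketch in plain Python (not Spark itself) that simulates what a broadcast join does: the small table is copied to every partition of the large table, and each partition is then joined locally. The function name `broadcast_join`, the sample tables, and the column names are all made up for illustration and are not from the course.

```python
from collections import defaultdict


def broadcast_join(large_partitions, small_rows, key):
    """Simulate an inner broadcast join over a partitioned 'large' table."""
    # Build a lookup table once from the small side (the data to broadcast).
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Stand-in for the copy of the small table shipped to this worker.
        local_copy = dict(lookup)
        joined = []
        for row in partition:
            # Rows of the large table are joined in place; they never move.
            for match in local_copy.get(row[key], []):
                joined.append({**row, **match})
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical small table (the one worth broadcasting).
countries = [
    {"country_id": 1, "country": "India"},
    {"country_id": 2, "country": "Canada"},
]

# Hypothetical large table, already split into two partitions.
orders_partitions = [
    [{"order_id": 10, "country_id": 1}, {"order_id": 11, "country_id": 2}],
    [{"order_id": 12, "country_id": 1}, {"order_id": 13, "country_id": 3}],
]

result = broadcast_join(orders_partitions, countries, "country_id")
for index, partition in enumerate(result):
    print(index, partition)
```

In PySpark itself, the same intent is usually expressed with the `broadcast` hint from `pyspark.sql.functions`, for example `large_df.join(broadcast(small_df), "country_id")`, where `large_df`, `small_df`, and the join column are placeholder names.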
