Partition Wise Joins
By Jean-Pierre Dijcks on Jun 02, 2010
Some say they are the holy grail of parallel computing and PWJ is the basis for a shared nothing system and the only join method that is available on a shared nothing system (yes this is oversimplified!). The magic in Oracle is of course that is one of many ways to join data. And yes, this is the old flexibility vs. simplicity discussion all over, so I won't go there... the point is that what you must do in a shared nothing system, you can do in Oracle with the same speed and methods.
A partition wise join is a join between (for simplicity) two tables that are partitioned on the same column with the same partitioning scheme. In shared nothing this is effectively hard partitioning locating data on a specific node / storage combo. In Oracle is is logical partitioning.
If you now join the two tables on that partitioned column you can break up the join in smaller joins exactly along the partitions in the data. Since they are partitioned (grouped) into the same buckets, all values required to do the join live in the equivalent bucket on either sides. No need to talk to anyone else, no need to redistribute data to anyone else... in short, the optimal join method for parallel processing of two large data sets.
PWJ's in Oracle
Since we do not hard partition the data across nodes in Oracle we use the Partitioning option to the database to create the buckets, then set the Degree of Parallelism (or run Auto DOP - see here) and get our PWJs. The main questions always asked are:
- How many partitions should I create?
- What should my DOP be?
In a shared nothing system the answer is of course, as many partitions as there are nodes which will be your DOP. In Oracle we do want you to look at the workload and concurrency, and once you know that to understand the following rules of thumb.
Within Oracle we have more ways of joining of data, so it is important to understand some of the PWJ ideas and what it means if you have an uneven distribution across processes.
Assume we have a simple scenario where we partition the data on a hash key resulting in 4 hash partitions (H1 -H4). We have 2 parallel processes that have been tasked with reading these partitions (P1 - P2). The work is evenly divided assuming the partitions are the same size and we can scan this in time t1 as shown below.
Now assume that we have changed the system and have a 5th partition but still have our 2 workers P1 and P2. The time it takes is actually 50% more assuming the 5th partition has the same size as the original H1 - H4 partitions.
In other words to scan these 5 partitions, the time t2 it takes is not 1/5th more expensive, it is a lot more expensive and some other join plans may now start to look exciting to the optimizer. Just to post the disclaimer, it is not as simple as I state it here, but you get the idea on how much more expensive this plan may now look...
Based on this little example there are a few rules of thumb to follow to get the partition wise joins.
First, choose a DOP that is a factor of two (2). So always choose something like 2, 4, 8, 16, 32 and so on...
Second, choose a number of partitions that is larger or equal to 2* DOP.
Third, make sure the number of partitions is divisible through 2 without orphans. This is also known as an even number...
Fourth, choose a stable partition count strategy, which is typically hash, which can be a sub partitioning strategy rather than the main strategy (range - hash is a popular one).
Fifth, make sure you do this on the join key between the two large tables you want to join (and this should be the obvious one...).
Translating this into an example:
DOP = 8 (determined based on concurrency or by using Auto DOP with a cap due to concurrency) says that the number of partitions >= 16.
Number of hash (sub) partitions = 32, which gives each process four partitions to work on. This number is somewhat arbitrary and depends on your data and system. In this case my main reasoning is that if you get more room on the box you can easily move the DOP for the query to 16 without repartitioning... and of course it makes for no leftovers on the table...
And yes, we recommend up-to-date statistics. And before you start complaining, do read this post on a cool way to do stats in 11.