Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Partition Wise Joins

Jean-Pierre Dijcks
Master Product Manager

Some say they are the holy grail of parallel computing and PWJ is the basis for a shared nothing system and the only join method that is available on a shared nothing system (yes this is oversimplified!). The magic in Oracle is of course that is one of many ways to join data. And yes, this is the old flexibility vs. simplicity discussion all over, so I won't go there... the point is that what you must do in a shared nothing system, you can do in Oracle with the same speed and methods.

The Theory

A partition wise join is a join between (for simplicity) two tables that are partitioned on the same column with the same partitioning scheme. In shared nothing this is effectively hard partitioning locating data on a specific node / storage combo. In Oracle is is logical partitioning.

If you now join the two tables on that partitioned column you can break up the join in smaller joins exactly along the partitions in the data. Since they are partitioned (grouped) into the same buckets, all values required to do the join live in the equivalent bucket on either sides. No need to talk to anyone else, no need to redistribute data to anyone else... in short, the optimal join method for parallel processing of two large data sets.

PWJ's in Oracle

Since we do not hard partition the data across nodes in Oracle we use the Partitioning option to the database to create the buckets, then set the Degree of Parallelism (or run Auto DOP - see here) and get our PWJs. The main questions always asked are:

  1. How many partitions should I create?
  2. What should my DOP be?

In a shared nothing system the answer is of course, as many partitions as there are nodes which will be your DOP. In Oracle we do want you to look at the workload and concurrency, and once you know that to understand the following rules of thumb.

Within Oracle we have more ways of joining of data, so it is important to understand some of the PWJ ideas and what it means if you have an uneven distribution across processes.

Assume we have a simple scenario where we partition the data on a hash key resulting in 4 hash partitions (H1 -H4). We have 2 parallel processes that have been tasked with reading these partitions (P1 - P2). The work is evenly divided assuming the partitions are the same size and we can scan this in time t1 as shown below.

Scanning 4 partitions in Time t1 with 2 processes

Now assume that we have changed the system and have a 5th partition but still have our 2 workers P1 and P2. The time it takes is actually 50% more assuming the 5th partition has the same size as the original H1 - H4 partitions.


In other words to scan these 5 partitions, the time t2 it takes is not 1/5th more expensive, it is a lot more expensive and some other join plans may now start to look exciting to the optimizer. Just to post the disclaimer, it is not as simple as I state it here, but you get the idea on how much more expensive this plan may now look...

Based on this little example there are a few rules of thumb to follow to get the partition wise joins.

First, choose a DOP that is a factor of two (2). So always choose something like 2, 4, 8, 16, 32 and so on...

Second, choose a number of partitions that is larger or equal to 2* DOP.

Third, make sure the number of partitions is divisible through 2 without orphans. This is also known as an even number...

Fourth, choose a stable partition count strategy, which is typically hash, which can be a sub partitioning strategy rather than the main strategy (range - hash is a popular one).

Fifth, make sure you do this on the join key between the two large tables you want to join (and this should be the obvious one...).

Translating this into an example:

DOP = 8 (determined based on concurrency or by using Auto DOP with a cap due to concurrency) says that the number of partitions >= 16.

Number of hash (sub) partitions = 32, which gives each process four partitions to work on. This number is somewhat arbitrary and depends on your data and system. In this case my main reasoning is that if you get more room on the box you can easily move the DOP for the query to 16 without repartitioning... and of course it makes for no leftovers on the table...

And yes, we recommend up-to-date statistics. And before you start complaining, do read this post on a cool way to do stats in 11.

Join the discussion

Comments ( 2 )
  • Amir riaz Tuesday, October 26, 2010
    Nice Article,
    But dont you think, oracle partition is too old for the huge data warehouses of modern times. The latest trends is to use storage index or zone maps. In oracle exadata storage index oracle uses storage indexes at runtime to eliminate extents. in Zonemap database uses zonemap to eliminate extents at Plan creation time and selects the partitions on which data resides. Zonemap technique is highly impressive since it reduce the IO size at plan creation time using the max/min values of zonemap and then from remaining extents it calculates the DoP. Similar is the case with joining.
  • jean-pierre.dijcks Wednesday, October 27, 2010
    Hmmm. I'd say don't discount one technique because something else is around as well. In Oracle, assuming we are on Exadata and have storage indexes, partitioning will help reduce what to scan and a storage index may reduce the scan even further. So they combine into a 1-2 punch mechanism on Oracle (not on necessarily on other systems).
    Partitioning (in the PWJ case) also allows any parallel system to do efficient joins. Any parallel system because every MPP platform (and Oracle by leveraging hash partitions) hashes the data and then tries to do co-located joins. On for example Teradata you hash the data over the nodes and do a partition wise join based on that distribution. In Oracle it is a logical concept getting to the same point (PWJ).
    But that is only the query side. In VLDW's maintenance operations can be a lot "cheaper" especially when we can apply them on a per partition basis. This because a partition is a standalone object. You can also vary compression ratios across partitions.
    And you may choose to do partition exchange loading. This gives you the ability to "publish" data as a snapshot by exchanging a table with a partition. As this is just dictionary stuff it takes a very, very small amount of time to publish a coherent data set to the end users.
    So, yes I agree that a zone map (storage index) can be very beneficial at query times, having partitioning on top of that gives a lot of benefits in many areas and should be something you look for in a DW system.
    Hope this makes sense.
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.