Workload Management – A Simple (but real) Example

We talked a bit about workload management in my previous post, and here I want to use Enterprise Manager, Database Resource Manager (DBRM) and Parallel Statement Queuing on an Exadata machine (V1 hardware, V2 software and an 11.2.0.2 database) to show some of it in action.

Context

In DBRM I have created two plans, one called BATCH_PLAN and one called DAYTIME_PLAN. The BATCH_PLAN favors long running queries from the user RETAIL: it gives the consumer group RT_ANALYTICS (which is linked to RETAIL - again, for simplicity I use a 1:1 mapping to a user) 70% of the CPU capacity and gives the consumer group RT_CRITICAL 30% of the CPU capacity. RT_CRITICAL is linked to a single user called... RT_CRITICAL, which will run our highly critical queries for this environment.
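For those who prefer scripts over the EM screens, here is a minimal sketch of that setup using DBMS_RESOURCE_MANAGER. The group and user names (RT_ANALYTICS, RT_CRITICAL, RETAIL) follow the scenario above; everything else is illustrative.

BEGIN
  DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

  -- Consumer groups for the batch/analytical work and for the critical queries
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'RT_ANALYTICS',
    comment        => 'Long running batch/analytical queries (user RETAIL)');
  DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP(
    consumer_group => 'RT_CRITICAL',
    comment        => 'Highly critical queries (user RT_CRITICAL)');

  -- Simple 1:1 mapping from database user to consumer group
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
    attribute      => DBMS_RESOURCE_MANAGER.ORACLE_USER,
    value          => 'RETAIL',
    consumer_group => 'RT_ANALYTICS');
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING(
    attribute      => DBMS_RESOURCE_MANAGER.ORACLE_USER,
    value          => 'RT_CRITICAL',
    consumer_group => 'RT_CRITICAL');

  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

Each user also needs the privilege to run in its group, which you grant with DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP.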

The idea is that we are running a heavy workload with long running queries (batch type of stuff) within user RETAIL.

[Screenshot: plan_settings]

During the time window where we are running these batch-like queries, we have the BATCH_PLAN enabled as is shown above. This means that with the current settings our critical queries run by RT_CRITICAL are squeezed to only 30% of CPU capacity. We have other throttles in place, but more about that later. First we will have a look at some of the query profiles and their effect on our system for individual queries.
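For reference, switching the active plan outside of EM is a single statement (shown here for the batch window; the scheduler-based approach is covered further down):

-- Activate the batch window plan manually
ALTER SYSTEM SET resource_manager_plan = 'BATCH_PLAN';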

Running a Single Large Query

To show some of the characteristics for an individual query we run it with the BATCH_PLAN enabled and with nothing else running on the system.

[Screenshot: sqlmon_1query_progress]

We can see a couple of things. First of all, the query is capped at DOP 64, running across 8 nodes of the database machine. Currently the query is in progress, and clicking on it will show the execution plan in action. This allows you to monitor long running queries in EM. The tool we are using here is called SQL Monitor.
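Everything SQL Monitor shows in EM is also available from the database itself; for example, a text version of the report can be pulled with DBMS_SQLTUNE (the sql_id below is a placeholder):

-- Text version of the SQL Monitor report for one statement
SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(
         sql_id       => '&sql_id',   -- placeholder: the statement to report on
         type         => 'TEXT',
         report_level => 'ALL') AS report
FROM   dual;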

SQL Monitor shows us the progress and the plan, of which we see a part here:

[Screenshot: sqlmon_1query_plan_offload]

As you can see, this is a real query doing real work. You can also see that for some of the data/IO steps we collect information on the Exadata efficiency. Here we achieve 63% offload efficiency, which means that rather than moving 126GB to the compute nodes we only move about 46GB.
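If you want to see this number outside of EM, a rough approximation can be computed from the Exadata-specific columns in V$SQL; note that IO_INTERCONNECT_BYTES includes all interconnect traffic for the statement, so treat the percentage as an estimate:

-- Approximate offload efficiency per statement (Exadata only)
SELECT sql_id,
       io_cell_offload_eligible_bytes / 1024 / 1024 / 1024 AS eligible_gb,
       io_interconnect_bytes          / 1024 / 1024 / 1024 AS interconnect_gb,
       ROUND(100 * (1 - io_interconnect_bytes /
                        NULLIF(io_cell_offload_eligible_bytes, 0)), 1)
         AS offload_efficiency_pct
FROM   v$sql
WHERE  io_cell_offload_eligible_bytes > 0
ORDER  BY eligible_gb DESC;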

We actually read that data at the peak capacity of my V1 hardware, which is 14GB/sec, as shown here (note that a V2 machine will push about 21GB per second):

[Screenshot: sqlmon_1query_throughput]

You can see in the top graph (yellow line) that we read at 14GB/sec. What is also interesting is the memory build-up underneath: while we are reading at peak levels we are building constructs in memory and slowly increasing PGA usage. Once the IO is mostly done, PGA usage ramps up further and the work shifts into the PGA.

Critical Queries

Now that we have an idea about our batch queries, let's look at some characteristics of the critical queries. We have 5 of them in this scenario, and if we run them all by themselves they complete in around 5 seconds or so on average. Note that nothing else is running on the system, so even with BATCH_PLAN active these 5 queries can use all CPU resources. This is because DBRM allows a group to use more than its allocation as long as the other groups are not active on the system.
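You can watch how DBRM actually divides the CPU over the groups in V$RSRC_CONSUMER_GROUP; the column list below is a sketch, so check the view in your version for the exact names:

-- CPU consumed and CPU waits per consumer group under the active plan
SELECT name,
       active_sessions,
       consumed_cpu_time,   -- CPU used by the group
       cpu_wait_time        -- time the group spent waiting for CPU under DBRM
FROM   v$rsrc_consumer_group
ORDER  BY consumed_cpu_time DESC;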

[Screenshot: sqlmon_5query_progress]

The Goal of the Case Study

Now, the entire idea is to load up the machine with batch queries and show the effect of this load on the critical queries. With the batch plan, the critical queries will be squeezed for resources and show a large variation in runtime. We will then switch to the daytime_plan, and will see that the critical queries run as if nothing else is going on on the system.

To understand what is going on at the system and plan level, let's have a look at the actual plans we will be using here.

BATCH_PLAN

The BATCH_PLAN is the plan that favors the analytical or large batch queries. Here we can see (in order from top to bottom) the settings (in edit mode) for CPU, Parallelism and Thresholds:

[Screenshot: batch_plan_general]

[Screenshot: batch_plan_parallel]

[Screenshot: batch_plan_thresholds]

As you can see, you can create and modify these plans within Enterprise Manager, under the Server tab in DB Control. We have set the CPU allocation for RT_ANALYTICS at 70%, capped its maximum DOP at 64, and given it 70% of the available parallel processes set in parallel_servers_target. This means its statements start queuing at 70% of 8 * 128 parallel processes.
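The same plan can be created in PL/SQL. The sketch below mirrors the settings described above (70% CPU and DOP 64 for RT_ANALYTICS, queuing at 70% of parallel_servers_target, 30% CPU for RT_CRITICAL); the OTHER_GROUPS directive at level 2 is added by me because every plan must cover it, and the parameter names follow the 11.2.0.2 DBMS_RESOURCE_MANAGER API, so adjust for your version:

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'BATCH_PLAN',
    comment => 'Favors long running analytical/batch queries');

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan                       => 'BATCH_PLAN',
    group_or_subplan           => 'RT_ANALYTICS',
    comment                    => 'Batch queries from user RETAIL',
    mgmt_p1                    => 70,     -- 70% of CPU at level 1
    parallel_degree_limit_p1   => 64,     -- DOP cap
    parallel_target_percentage => 70);    -- queue at 70% of parallel_servers_target

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'BATCH_PLAN',
    group_or_subplan => 'RT_CRITICAL',
    comment          => 'Critical queries',
    mgmt_p1          => 30);

  -- Everything else falls into OTHER_GROUPS (illustrative, at level 2)
  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'BATCH_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'All remaining sessions',
    mgmt_p2          => 100);

  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/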

In this plan we have not set any thresholds, again for simplicity. If you look at the Thresholds tab, it is important to understand that the Use Estimate? column signifies that the optimizer's estimated execution time is used to decide what to do. In other words, before the query even executes you can act on a threshold and, for example, move a statement from RT_CRITICAL into RT_ANALYTICS.

DAYTIME_PLAN

The following is the overview of the DAYTIME_PLAN. If it needs any editing, you press Edit and get the tabbed view of the plan, as we saw earlier in the BATCH_PLAN screenshots.

[Screenshot: daytime_plan]

As we can see, in the DAYTIME_PLAN we did set thresholds, but these are based on actual execution times (we left the Use Estimate? box unchecked!).

Within the DAYTIME_PLAN we reversed the CPU settings and now give only 30% of CPU resources to the RT_ANALYTICS group. RT_ANALYTICS is also capped at a much lower DOP (16 instead of 64) and will start queuing at 30% of the total number in parallel_servers_target.
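A corresponding sketch of the DAYTIME_PLAN: the CPU, DOP and queuing numbers follow the description above, while the 15-minute switch_time and the CANCEL_SQL action are purely illustrative values for a threshold on actual (not estimated) execution time:

BEGIN
  DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

  DBMS_RESOURCE_MANAGER.CREATE_PLAN(
    plan    => 'DAYTIME_PLAN',
    comment => 'Protects critical daytime queries');

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan                       => 'DAYTIME_PLAN',
    group_or_subplan           => 'RT_ANALYTICS',
    comment                    => 'Throttled batch work during the day',
    mgmt_p1                    => 30,
    parallel_degree_limit_p1   => 16,
    parallel_target_percentage => 30,
    switch_group               => 'CANCEL_SQL',   -- illustrative action when the threshold is hit
    switch_time                => 900,            -- illustrative threshold on actual runtime (seconds)
    switch_estimate            => FALSE);         -- act on real, not estimated, execution time

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'RT_CRITICAL',
    comment          => 'Critical queries get the bulk of the CPU',
    mgmt_p1          => 70);

  DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'DAYTIME_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    comment          => 'All remaining sessions',
    mgmt_p2          => 100);

  DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();
  DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/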

Running the Batch

With the BATCH_PLAN active, we first start a load on the system. The system is kept loaded with 10 to 12 high-octane queries like the one we looked at earlier running by itself.

[Screenshot: sqlmon_load_and_queue]

SQL Monitor above shows what is going on. We have five queries running at DOP 64 (with 128 processes each). We see that subsequent statements are being queued to ensure we do not completely overload the system and use up all parallel processes.

This statement queuing is an important workload management tool as it stops statements from being downgraded to serial (which happens once we run out of parallel processes).
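Queued statements are easy to spot outside of EM as well; they show up in (G)V$SQL_MONITOR with a QUEUED status, and the waiting sessions sit on the resmgr:pq queued wait event:

-- Parallel statements currently waiting in the statement queue
SELECT sql_id,
       username,
       px_servers_requested,
       sql_exec_start,
       status
FROM   gv$sql_monitor
WHERE  status = 'QUEUED'
ORDER  BY sql_exec_start;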

With the system loaded up, we start our critical queries. As expected, they struggle to get enough resources. They are not queued in this case and get the appropriate DOP, but they are starved for CPU (and IO).

[Screenshot: sqlmon_load_and_queue_plus_prio_running_and_complete]

If you look closely above, you see that one RT_CRITICAL query completed in 1.3 minutes, one completed in 31 seconds and a third is still running. The remaining queries eventually complete in around 25 seconds or so. This is of course a far cry from the 5-second average.

Switching the Plan

Next we switch the plan to the DAYTIME_PLAN. In our case we do this via the UI, but in reality this would be scheduled, which is of course supported via the database scheduler.
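A sketch of how that scheduling could look with DBMS_SCHEDULER windows, each carrying its resource plan; the window names, start times and durations here are illustrative, not taken from the demo:

BEGIN
  -- Daytime window: protect the critical queries
  DBMS_SCHEDULER.CREATE_WINDOW(
    window_name     => 'DAYTIME_WINDOW',
    resource_plan   => 'DAYTIME_PLAN',
    repeat_interval => 'FREQ=DAILY;BYHOUR=6;BYMINUTE=0;BYSECOND=0',
    duration        => NUMTODSINTERVAL(16, 'HOUR'),
    comments        => 'Daytime: protect critical queries');

  -- Night window: favor the batch load
  DBMS_SCHEDULER.CREATE_WINDOW(
    window_name     => 'BATCH_WINDOW',
    resource_plan   => 'BATCH_PLAN',
    repeat_interval => 'FREQ=DAILY;BYHOUR=22;BYMINUTE=0;BYSECOND=0',
    duration        => NUMTODSINTERVAL(8, 'HOUR'),
    comments        => 'Night: favor the batch load');
END;
/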

[Screenshot: change_plans]

A Better Run

After we switch the plan we can see a couple of things. First of all we see the new DOP cap for any statement within RT_ANALYTICS. The cap is now set to 16:

[Screenshot: sqlmon_capping_dops]

As you can see, we still have statements that completed under the old plan with DOP 64, but the statements that started after we changed the plan now show a cap of 16.
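A quick way to verify the caps and the runtimes from the database itself (a sketch; remember that a DOP of 64 shows up as 128 allocated parallel servers):

-- One row per statement execution (query coordinator rows only)
SELECT sql_id,
       username,
       status,
       px_servers_requested,
       px_servers_allocated,
       ROUND(elapsed_time / 1e6, 1) AS elapsed_sec
FROM   gv$sql_monitor
WHERE  username IN ('RETAIL', 'RT_CRITICAL')
AND    px_server# IS NULL
ORDER  BY sql_exec_start DESC;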

If we now run the critical statements we will see that they run very near their original runtimes of 5 seconds:

[Screenshot: sqlmon_capping_dops_with_critical]

Note that I cropped the screenshot to show the queuing, the DOP caps and the critical runtimes.

Conclusion

While this scenario is relatively simple, you can quickly see the power of the functionality. We protect statements from downgrades to serial through queuing, we ensure the important workloads get the appropriate resource allocation when it matters (here we did it with time windows), and we see how a heavy load can be managed so that critical work still gets done.

By using the parallelism features in 11.2.0.2, Database Resource Manager and SQL Monitor (which is part of the EM Tuning Pack!) you can manage your mixed workloads the way you want to manage them.


Credits for the main demo framework go to Oracle's Real World Performance Group!

Comments:

Hi, This is a really interesting article, thank you. I have been trying to understand more about DBRM, but in my testing (see here) couldn't get it to honour the CPU allocations of a plan once there was a skewed number of sessions in one consumer group over another. My understanding was that once CPU was constrained DBRM would enforce the allocations defined within a resource plan, regardless of the number of sessions in each consumer group. Is this not correct?

Posted by rnm1978 on October 04, 2010 at 10:50 PM PDT #

Hmmm, just to be honest here, my current testing was geared towards parallel statement management and workload management with simple CPU directives, so I have not seen the issue you are describing. I would assume that your understanding is correct and it should work regardless of the number of sessions in a consumer group. In the case study we ran we tested with 12 - 15 sessions in a consumer group running at DOP 64 (with a set of them queued), and I did not notice too many issues on my setup. However, I found the posting on your blog and will have a chat with the development team to see what they think. JP

Posted by jean-pierre.dijcks on October 05, 2010 at 02:04 AM PDT #
