I thought it might be fun for you to share about some of the technology underneath your own site. Are you using Oracle9i? Are our questions and your answers stored as CLOBs? Are you using Oracle Text? Separate servers? Oracle9i Application Server (Oracle9iAS)?
As of the writing of this column, asktom.oracle.com uses a single small server running Oracle9i Database Release 2 for the database, Oracle Text as the search engine, and Oracle9iAS's mod_plsql module for database access. The actual "software" that is the Ask Tom site started as a homegrown development environment that let us put together PL/SQL applications fast. This internal effort became so successful that it is actually one of Oracle Database 10g's coolest new database features. To that end, I've asked Sergio Leunissen—one of the key people working on this software—to tell us a little about it.
Leunissen: Tom is right. That homegrown development has become HTML DB, one of the most exciting new features in Oracle Database 10g. HTML DB offers sophisticated Web application development capabilities right out of the box.
Here's how it works: Contained entirely in the database, HTML DB is made up of a metadata repository and an engine for the rendering and processing of pages. When you're building your application, the definitions of pages, regions, templates, buttons, and fields are stored in the metadata repository. When an application runs, every page and all the content in it is rendered in real time from the metadata repository.
If you've developed Web applications using the PL/SQL Web toolkit and mod_plsql, you'll love how much time HTML DB saves you. Because the engine and the metadata repository reside inside the database, you have full access to SQL and PL/SQL. Displaying data from the database on the Web is simply a matter of writing a SQL query. The rendering engine takes care of the rest. In fact, if your SQL skills are a tad rusty, you can even rely on wizards to walk you through the steps of building a report based on one or more database tables.
Similarly, inserting, updating, or deleting data in the database is a breeze. Whenever a page is submitted, the processing engine can run data manipulation language (DML), such as an INSERT statement that you define. The INSERT statement will generally include user input from forms, which may be referenced using bind variable syntax. For example, say you wanted to capture people's first names, last names, and phone numbers, and store this information in the database. Assuming you already had a table to hold this information, you'd start by adding fields, referred to as items, to a page. Items have unique names so that their values can later be retrieved from session state. After you add a button to the page to submit the information, all you need is a process to insert the data into your target table. Assuming you named your items P1_FIRST_NAME, P1_LAST_NAME , and P1_PHONE , your process could be implemented as the following insert statement:
insert into T (first_name, last_name, phone_number) Values (:P1_FIRST_NAME, :P1_LAST_NAME, :P1_PHONE);
Note that the user input collection in form fields is retrieved from session state using bind variable syntax.
The HTML DB engine and the metadata repository provide a powerful and complete framework for building HTML user interfaces on top of the Oracle database. In fact, every screen in HTML DB is built using HTML DB!
Now you know how HTML DB works. Here's a little bit more on its key components: SQL Workshop, Data Workshop, and Application Builder.
SQL Workshop. The SQL Workshop allows you to interact with your database as you would with SQL*Plus with added features, such as database object browsing, query by example, and data dictionary browsing.
Data Workshop. With the Data Workshop, you can load data into and extract data from the database. My personal favorite feature allows you to take a spreadsheet and turn it into a database table in a few seconds. All you need to do is cut and paste your data from a spreadsheet and step through a wizard. Once you've turned your spreadsheet into a database table, you're only minutes away from building a Web-based report that can be accessed by anyone with a browser.
Application Builder. The centerpiece of HTML DB, the Application Builder is loaded with wizards to help build pages containing reports, forms, or charts. Pages are easily connected using branches.
There are nearly 75 predefined widgets you can use when you assemble your application, from basic HTML form elements to more advanced pop-up lists of values, pop-up calendars that select lists that automatically submit a page, and so on. Want to make sure data entered into forms is correct? Not a problem. You can add validations with mere mouse clicks. You can even use the regular expression syntax new in Oracle Database 10g to perform validations on data elements such as phone numbers, dates, or social security numbers.
The Application Builder's reporting capabilities are also very well-rounded. Common reporting user interface patterns such as pagination (the ability to page through a result set), user-controlled sorting on several columns, and hyperlinking to other reports or forms can all be added declaratively. That is, no coding is required, just a few mouse clicks.
Here are a few more features that will save time developing Web applications:
Session and state management. Each user running an HTML DB application always has a session, and no coding for this is required. Every form element you place on a page using HTML DB automatically has associated session state, managed in the database. Inside your application, you can reference any session state variable used on any page using familiar bind variable syntax.
User interface control. Look-and-feel is separated from application logic and data access through HTML DB's extensive use of templates. Templates are used to control the look-and-feel of pages, reports, buttons, tabs, menus, field labels, and so on. This means you can quickly start building an application without having to worry about look-and-feel right away. Once you are comfortable with the way your application works, you can design the HTML templates. Or, you can simply choose from a gallery of template themes available in HTML DB.
Flexible authentication and authorization schemes. If you decide your application requires people to log in with a username and password, you can either choose one of the prebuilt authentication schemes from the HTML DB gallery or build a custom scheme to suit your needs. Ready-made authentication schemes allow your application to use, among others, an HTML DB user repository, an LDAP repository, or Oracle9iAS single sign-on to manage user credentials.
Last, HTML DB natively supports team-based, "multitenant" hosted development. This means a single Oracle database can support multiple development projects, each being developed by teams of people using only a Web browser.
Currently we use the following structure for our ETL:
Cursor for Delta table Loop insert into dimension tables values ( cursor variables ) insert into xref tables insert into fact tables end loop commit;
The reason for this structure is to log all the errors in the process. Currently it's taking more than 16 hours for 140,000 records. It completes in about one minute for 300 records in a delta table. These INSERT statements are variables. Should this take that long? Is there anything fundamentally wrong with the procedure? Would committing within the loop (which you strongly oppose) speed the process?
I'll do this question as a LIFO (last in, first out) queue and answer in reverse. Moving the commit into the cursor FOR loop would accomplish three things for you:
Your code would run more slowly.
You would generate more redo.
You would generate more undo.
Now, on the plus side: There actually isn't one. Committing in the cursor FOR loop would be a really bad idea.
As for the "Is there anything fundamentally wrong with the procedure" question, the answer is yes. As coded, it does not employ any bulk operations whatsoever. At the least, it should be using bulk collects and FORALL inserts. The technique of bulk processing can be used to make things go many times faster. Taken to its logical extreme, compacting the code into a single SQL statement can make it many orders of magnitude faster. Here I'll compare row-by-row processing, bulk processing, and processing within a single SQL statement. You'll see the increase in performance you might expect.
Starting with the example tables T and T2 , I first want to load data from T into T2 . I've set it up so that 24 rows will get rejected. Additionally, I sorted the data by object name to make it so the errors happen randomly throughout the load:
create table t as select * from all_objects order by object_name; create table t2 as select * from all_objects where 1=0; alter table t2 add constraint check_object_id check ( object_id <= 31900 );
I'll start with the "row at a time" process similar to what you have now. It will "read a row, write a row" over and over. It is the least efficient approach to this problem:
declare l_cnt number := 0; begin for x in ( select * from t ) loop begin insert into t2 values X; exception when others then dbms_output.put_line(sqlcode); l_cnt := l_cnt+1; end; commit; end loop; dbms_output.put_line ( l_cnt || ' total errors' ); end;
When I ran this in SQL*Plus, the output included the following:
... 24 total errors PL/SQL procedure successfully completed. Elapsed: 00:00:58.69
That is your method: with a COMMIT added in the cursor FOR loop. Simply moving that COMMIT outside the loop and rerunning, the output would be:
... 24 total errors PL/SQL procedure successfully completed. Elapsed: 00:00:15.68
That shows the impact committing in the loop would have for you. Now continuing on to the bulk processing, I'll look at an Oracle9i Database Release 2-specific example. (Refer to asktom.oracle.com for how one might do this in Oracle8i and earlier.) It uses the new BULK_EXCEPTIONS clause and the ability to INSERT/UPDATE from a record. The following code bulk-collects 100 rows at a time and then bulk-inserts them 100 at a time while catching all errors nicely. The idea is that you would be performing some processing after you fetched but before you inserted here. You'll see in a moment what the code should look like if this processing step is not needed:
DECLARE type array is table of t%rowtype index by binary_integer; data array; errors NUMBER; dml_errors EXCEPTION; l_cnt number := 0; PRAGMA exception_init (dml_errors, -24381); cursor c is select * from t; BEGIN open c; loop fetch c BULK COLLECT INTO data LIMIT 100; /* some processing */ begin FORALL i IN 1 .. data.count SAVE EXCEPTIONS insert into t2 values data(i); EXCEPTION WHEN dml_errors THEN errors := SQL%BULK_EXCEPTIONS.COUNT; l_cnt := l_cnt + errors; FOR i IN 1..errors LOOP dbms_output.put_line ('Error occurred during iteration ' || SQL%BULK_EXCEPTIONS(i).ERROR_INDEX || ' Oracle error is ' || SQL%BULK_EXCEPTIONS(i).ERROR_CODE ); end loop; end; exit when c%notfound; END LOOP; close c; dbms_output.put_line ( l_cnt || ' total errors' ); end;
The output from that was as follows:
... 24 total errors PL/SQL procedure successfully completed. Elapsed: 00:00:01.45
Now, that is night-and-day different, performancewise. Processing 100 rows at a time versus 1 row at a time makes a huge difference. I did say that if you don't need that processing step in there, you could use an even better method. Simply roll your "validation" into the SELECT itself. Here, since I know that records where object_id > 31900 will fail, I'll simply skip them. The resulting insert is the following:
insert into t2 select * from t where object_id <= 31900; 29895 rows created. Elapsed: 00:00:00.52
This shows where you could be performancewise, if you can put your ETL validation into SQL itself!
How can I convince my developers there is a better way to do the following?
select max (weight) weight from client.pet_weight where pet_id = :pet_id and setup_date = (select max (setup_date) from client.pet_weight where pet_id = :pet_id)
The answer is as always, find the better way, benchmark it, and the numbers will speak for themselves. Here I will show a query that can perform much better. Not only will it go faster; it will use significantly fewer logical I/O resources (and that is one of the goals when tuning queries). In order to write the better query, what I did was to phrase your problem as a problem statement in English rather than SQL. That way I can tune a question, not tune a query. The problem you have is "for a given pet_id , find the weight for the maximum setup_date " or "for a given pet_id , sort by setup_date from big to small and take the first weight you find." Once I had that, coming up with an alternate query was pretty easy. I'll start by setting up a simple table to benchmark with:
create table t ( weight int, pet_id int, setup_date date ); insert into t ( pet_id, weight, setup_date ) select mod(rownum,1000), dbms_random.random, sysdate-1000 + dbms_random.value(0,1000) from all_objects; create index t_idx on t( pet_id, setup_date, weight ); analyze table t compute statistics for table for all indexes for all indexed columns;
Now, on my system, the table T has about 31,000 rows in it. I started testing by benchmarking your query against my table. I did this inside of a loop that runs from 1 to 1,000, getting the weight for each pet_id I had just inserted. The results of that were:
select max(weight) from t where pet_id = :b1 and setup_date = (select max(setup_date) from t where pet_id = :b1 ); call cnt cpu query rows _____ _____ _____ _____ _____ Parse 1 0.00 0 0 Execute 1000 0.42 2000 0 Fetch 1000 0.06 2005 1000 _____ _____ _____ _____ _____ total 2001 0.48 4005 1000 Rows Row Source Operation _____ __________ __________ 100 SORT AGGREGATE 1000 INDEX RANGE SCAN T_IDX
So, it took about half a CPU second and a little over 4,000 logical I/Os to perform that query with 1,000 different inputs. The query I came up with achieves the same answer but uses an inline view that sorts the data along with ROWNUM to get the results. What I should be able to achieve with this query is to remove the correlated subquery that caused me to hit the data twice as I did above, where I first hit the data to get the max setup_date for a pet_id and then hit it again to get the weight for that pet_id/max(setup_date) . This inline view query had the following performance characteristics:
select weight from (select weight from t where pet_id = :b1 order by pet_id DESC, setup_date DESC, weight DESC ) where rownum = 1; call cnt cpu query rows _____ _____ _____ _____ _____ Parse 1 0.00 0 0 Execute 1000 0.21 0 0 Fetch 1000 0.06 2000 1000 _____ _____ _____ _____ _____ total 2001 0.27 2000 1000 Rows Row Source Operation _____ __________ __________ 1000 COUNT STOPKEY 1000 VIEW 1000 INDEX RANGE SCAN DESCENDING T_IDX
That returns the same results in about half the CPU time with half the logical I/Os. That's information that should require no convincing on your part. All you need to do is ask your developers "would you like to get the answer with half the work in half the time?" It should be a quick meeting.
Now, if they cannot use ORDER BY in a subquery, just declare a cursor that looks like the following and open/fetch 1 row and close it:
select weight from t where pet_id = :b1 order by pet_id DESC, setup_date DESC, weight DESC;
Does the cost-based optimizer (CBO) consider all blocks up to the high-water mark (HWM) when calculating the cost of a full table scan?
Yes, it has to. Since all the blocks up to the HWM are going to be in the full scan, it must consider that in the cost.
Consider the following:
SQL> create table t pctfree 90 pctused 5 2 as 3 select * from all_objects; SQL> analyze table t compute statistics; Table analyzed. SQL> set autotrace traceonly explain; SQL> select * from t; Execution Plan ________________________________________ SELECT STATEMENT Optimizer=CHOOSE (Cost=384 Card=29272 Bytes=2488120) TABLE ACCESS (FULL) OF 'T' (Cost=384 Card=29272 Bytes=2488120) SQL> set autotrace off SQL> delete from t; 29272 rows deleted. SQL> commit; Commit complete. SQL> analyze table t compute statistics; Table analyzed. SQL> set autotrace traceonly explain; SQL> select * from t; Execution Plan ________________________________________ SELECT STATEMENT Optimizer=CHOOSE (Cost=384 Card=1 Bytes=128) TABLE ACCESS (FULL) OF 'T' (Cost=384 Card=1 Bytes=128)
Notice that the Cost value of the full scan remains constant at 384, regardless of the number of rows in the table. That is because when Oracle does a full scan, it must read each and every block that is below the HWM. In the second execution plan, notice that the Card (short for cardinality—the expected number of rows in the result set) value is significantly lower. The optimizer knows it will find only about one row, but it will have to scan just as many blocks as it did to find 29,272 rows. Let's truncate this table, lowering the HWM, and see what happens:
SQL> truncate table t; Table truncated. SQL> analyze table t compute statistics; Table analyzed. SQL> set autotrace traceonly explain SQL> select * from t; Execution Plan ________________________________________ SELECT STATEMENT Optimizer=CHOOSE (Cost=2 Card=1 Bytes=128) TABLE ACCESS (FULL) OF 'T' (Cost=2 Card=1 Bytes=128)
With the same table and the same number of rows but after the truncate lowered the HWM, the cost of a full scan is significantly lower (Cost=2).
Photography by Aaron Burson, Unsplash