
Celebrating the joy and power of Oracle SQL with the Oracle Developer Advocate team

Recent Posts

Announcing the Winners of the 2018 Oracle Dev Gym Championship for Database Design

On Thursday 14 February, 37 plucky contestants took part in the annual Database Design Championships over on Oracle Dev Gym. This competition tested the players' knowledge of Oracle Database 18c in a series of five quizzes. These covered temporary tables, external tables, partitioning, identity columns, and JSON in the database. (Photo: Ryan McGuire, Gratisography)

The results are now in, so it's time to announce the winners!

1. Iudith Mentzel of Israel (86.5% correct)
2. Sartograph of Germany (82.7% correct)
3. Chad Lee of the United States (80.8% correct)

These three were head-and-shoulders above the rest, being the only players to clear eighty percent. And huge kudos to Iudith for taking the top spot for the second year in a row!

And, once again, a massive thanks to Elic, our quiz-reviewer supreme. His keen eye ensured that all the choices were accurate.

Think you know more about Oracle Database? Find out in the weekly tournaments on Oracle Dev Gym. These cover SQL, PL/SQL, and Database Design. Or if you're looking for something a little different, we also offer weekly ranked quizzes on Java and Logic. We hope you can join us!

The complete rankings are:

Rank  Name                       Total Time  % Correct  Total Score
   1  mentzel.iudith (5)         33 m        87%        6066
   2  Sartograph (3)             31 m        83%        5825
   3  Chad Lee (5)               29 m        81%        5730
   4  NickL (4)                  31 m        79%        5625
   5  li_bao (5)                 24 m        79%        5601
   6  Andrey Zaytsev (5)         34 m        79%        5513
   7  JeroenR (4)                30 m        75%        5276
   8  Karel_Prech (4)            34 m        75%        5212
   9  siimkask (5)               15 m        71%        5139
  10  Maxim Borunov (5)          33 m        73%        5116
  11  PZOL (4)                   34 m        71%        5012
  12  JustinCave (5)             26 m        71%        4995
  13  Jan Šerák (5)              31 m        71%        4974
  14  seanm95 (5)                20 m        69%        4919
  15  Sandra99 (5)               33 m        69%        4864
  16  Chase Mei (5)              22 m        67%        4860
  17  Aleksei Davletiarov (3)    23 m        69%        4858
  18  NielsHecker (5)            32 m        67%        4821
  19  Michal P. (5)              33 m        67%        4665
  20  mcelaya (3)                30 m        65%        4579
  21  Joaquin_Gonzalez (5)       24 m        65%        4550
  22  Hertha Rettinger (2)       20 m        63%        4469
  23  Rytis Budreika (5)         08 m        62%        4365
  24  Talebian (2)               21 m        62%        4365
  25  Ivan Blanarik (5)          20 m        62%        4317
  26  Mike Tessier (2)           24 m        60%        4200
  27  whab@tele2.at (3)          18 m        60%        4125
  28  msonkoly (3)               34 m        60%        4112
  29  RalfK (2)                  24 m        60%        4051
  30  richdellheim (2)           33 m        58%        3967
  31  umir (4)                   34 m        58%        3910
  32  MarkusId (1)               27 m        56%        3890
  33  Sachi (3)                  17 m        54%        3881
  34  Stelios Vlasopoulos (5)    34 m        58%        3860
  35  Cor van Berkel (2)         34 m        54%        3710
  36  Köteles Zsolt (1)          30 m        50%        3230
  37  swesley_perth (3)          03 m        17%        1087


How to Store, Query, and Create JSON Documents in Oracle Database

JavaScript Object Notation (JSON) is a lightweight data transfer format. It's the de facto standard for document exchange. So it's likely you'll want to send and receive JSON documents from and to your database. And store them in your tables. Oracle Database has a huge amount of functionality that makes this easy.

In this post you'll learn how to:

- Store JSON in a database table
- Query a table for rows storing particular values within a JSON document
- Create indexes to find these rows efficiently
- Generate JSON from relational rows and columns

Let's get started!

How to Store JSON in Your Database

JSON documents are just text. So to store them, all you need is a column that holds text. Before we get into the details, a bit of a rant. The JSON homepage describes it as:

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

Notice how data-interchange appears twice? And storage appears, well... never! That's because JSON is primarily for transferring data between systems. Personally I think traditional relational tables should be the default choice for data storage. Taking JSON input and storing it as-is in your database should be the last resort. But I know that you may just want to dump JSON in your database and be done with it. So let's review how.

As JSON is text, the obvious choice for storing it is a VARCHAR2 for small documents, i.e. those under 4,000 bytes. Or 32,767 bytes if you have enabled extended data types. But if you're storing bigger documents, we recommend going for BLOB. Why not the text-based CLOB, you ask? This is because BLOB stores the document as-is, avoiding character set conversions. It also means the documents may use less storage. The downside of this is you need to do data conversions when you save or retrieve JSON in BLOBs. Luckily, as we'll see later, this is easy to do.

So to create a table that will store "large" JSON documents, run:

create table departments_json (
  department_id   integer not null primary key,
  department_data blob    not null
);

Regardless of which data type you use, you'll want to verify that you're storing real JSON. And not any old junk. To do this, add the IS JSON check constraint, like so:

alter table departments_json add constraint dept_data_json
  check ( department_data is json );

Now you've got the table, it's time to start adding documents!

How to Insert JSON in the Table

You can now save a JSON document with a standard insert. If you've used a character data type, just whack it in.
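For example — a minimal sketch, assuming a hypothetical VARCHAR2-based variant of the table (the names here are made up for illustration, not from the post):

-- Small documents fit in a VARCHAR2 column; the IS JSON check still applies
create table departments_json_small (
  department_id   integer not null primary key,
  department_data varchar2(4000) not null
                  check ( department_data is json )
);

-- Plain text, so a plain insert works; no UTL_RAW conversion needed
insert into departments_json_small values (
  110, '{ "department": "Accounting" }'
);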
If you've gone with BLOB, add text-to-binary conversion:

insert into departments_json values (
  110,
  utl_raw.cast_to_raw ( '{
    "department": "Accounting",
    "employees": [
      { "name": "Higgins, Shelley", "job": "Accounting Manager", "hireDate": "2002-06-07T00:00:00" },
      { "name": "Gietz, William", "job": "Public Accountant", "hireDate": "2002-06-07T00:00:00" }
    ]
  }' ));

If you've added the IS JSON constraint, the database will also reject anything that's not real JSON:

insert into departments_json values (
  100, utl_raw.cast_to_raw ( 'Random junk' )
);

ORA-02290: check constraint (CHRIS.DEPT_DATA_JSON) violated

Great. But what if you want to edit part of a document saved in your table?

How to Update JSON with SQL

Let's say there's a big corporate re-org, merging the Finance and Accounting departments. Now imaginatively titled "Finance and Accounting". So you need to change the stored department name. Pre-19c, to change part of a JSON document, you had to replace the whole thing. Leading to a big, clunky update like this:

update departments_json
set    department_data = utl_raw.cast_to_raw ( '{
  "department": "Finance and Accounting",
  "employees": [
    { "name": "Higgins, Shelley", "job": "Accounting Manager", "hireDate": "2002-06-07T00:00:00" },
    { "name": "Gietz, William", "job": "Public Accountant", "hireDate": "2002-06-07T00:00:00" }
  ]
}' )
where  department_id = 110;

Oracle Database 19c introduces a new option: JSON patching with JSON_mergepatch. This only replaces the relevant section of the document. Just specify the attribute you want to update and its new value. So instead of having to write the whole document, you can change just the department's value like this:

update departments_json
set    department_data = json_mergepatch (
         department_data,
         '{ "department" : "Finance and Accounting" }'
       )
where  department_id = 110;

Of course, as part of this re-org you'll probably need to merge in the Finance employees too. The employees attribute is an array. To patch an array, you have to replace the whole thing. So you'll have an update like:

update departments_json
set    department_data = json_mergepatch (
         department_data,
         '{ "employees" : [
            { "name" : "Gietz, William", "job" : "Public Accountant", "hireDate" : "2002-06-07T00:00:00" },
            { "name" : "Higgins, Shelley", "job" : "Accounting Manager", "hireDate" : "2002-06-07T00:00:00" },
            { "name" : "Chen, John", "job" : "Accountant", "hireDate" : "2005-09-28T00:00:00" },
            { "name" : "Greenberg, Nancy", "job" : "Finance Manager", "hireDate" : "2002-08-17T00:00:00" },
            { "name" : "Urman, Jose Manuel", "job" : "Accountant", "hireDate" : "2006-03-07T00:00:00" }
         ] }'
       )
where  department_id = 110;

So this is still clunky if you're working with large arrays. But better than having to replace the whole document!

How to Query JSON with SQL

So now you have JSON in your database. If you want to manipulate it, you can select the whole document into your app. And process it there. But that's a lot of data to transfer if you're only interested in a couple of attributes. It's better to fetch the parts you want from the table. Luckily Oracle Database offers many ways to get only the values you're interested in.

Simple Dot-Notation Access

Provided you added the IS JSON constraint, you can access attributes by:

- Selecting the column
- Adding the JSON path to the value you want

So to get the department names, list out the path to it:

select d.department_data.department from departments_json d;

Note you need to prefix the columns with the table name or alias.
Or you'll get an "ORA-00904: "DEPARTMENT_DATA"."DEPARTMENT": invalid identifier" error. You can also use this to get array elements. To do this, state the index of the element you want (remembering JSON arrays are zero-indexed), then its attribute. For example, to find the name of the first employee in department 110, use: select d.department_data.employees[0].name from departments_json d where department_id = 110; EMPLOYEES Gietz, William Or perhaps you want the names of all the employees in a department. To do this, get the attributes at position * (asterisk). And you'll get an array back: select d.department_data.employees[*].name from departments_json d where department_id = 110; EMPLOYEES ["Gietz, William","Higgins, Shelley"] Remember you need the IS JSON constraint to do this. So what if you forgot to add it? In 18c you can get around this using the TREAT function with the AS JSON clause. Wrap this around a column in a subquery or view. And you can use dot-notation again! with j_data as ( select treat ( d.department_data as json ) as department_data from departments_json d where department_id = 110 ) select j.department_data.department from j_data j where department_data is json; DEPARTMENT Accounting Note TREAT tells the database to consider the contents to be a JSON document. Key point being consider. If the contents aren't valid JSON, you won't get an error! And dot-notation access will fail silently, returning null instead. This highlights a key limitation of dot-notation access. You have no control over the formatting and error handling. Everything comes back as a string (VARCHAR2(4000)). And it suppresses errors. Returning null if there's an issue accessing the element. Which is a problem if you've got messy data. If you want more control, you can use the JSON query functions JSON_value and JSON_query. Return a Single Value with JSON_value Say you want to return Shelly's hire date. And return the value as a real, proper date. Using dot-notation, the value is a VARCHAR2. So you need to to_date it afterwards. But with JSON_value, all you need to do is specify the returning value as a date (or timestamp)! select json_value ( department_data, '$.employees[1].hireDate' returning date ) hire_date from departments_json d Note this only works if you format datetime values in your document to ISO 8601 standards. If you've used your own, custom format mask this won't work! If you have chosen your own date format, by default the above will return NULL for the hire date. Because the database suppresses errors. Which can make tracking down problems in your JSON hard. For example, if you search for a non-existent attribute you get nothing back: select json_value ( department_data, '$.nonExistentAttribute' ) not_here from departments_json d where department_id = 110; NOT_HERE <null> This is because the error clause defaults to NULL ON ERROR. So how do you get an exception instead? Set the error clause to ERROR ON ERROR: select json_value ( department_data, '$.nonExistentAttribute' error on error ) not_here from departments_json d; ORA-40462: JSON_VALUE evaluated to no value JSON_value is great if you're getting a single value. I.e. not an array or object. But what if you want to return all the details for an employee? Or all the employees in a department? To do that, you're going to need JSON_query. Return a Document or Array with JSON_query The process for using JSON_query is like JSON_value. State the column holding JSON and the path you want to extract. 
For example, if you want to return an array of the employees for a department, you can write: select json_query ( department_data, '$.employees[*]' returning varchar2 pretty with wrapper ) employees from departments_json d where department_id = 110; EMPLOYEES [ { "name" : "Gietz, William", "job" : "Public Accountant", "hireDate" : "2002-06-07T00:00:00" }, { "name" : "Higgins, Shelley", "job" : "Accounting Manager", "hireDate" : "2002-06-07T00:00:00" } ] Note the with wrapper clause. You need to supply this if the path matches more than one value. Otherwise the call will fail silently, returning null due to the default NULL ON ERROR clause. You can also use this to return one of the attributes of a nested object array as an array. For example, you can return just the names of each employee: select json_query ( department_data format json, '$.employees[*].name' returning varchar2 pretty with wrapper ) employee_names from departments_json d where department_id = 110; EMPLOYEE_NAMES [ "Gietz, William", "Higgins, Shelley" ] This function is great if you want the result as JSON. But what if you want to display the document as traditional rows-and-columns? Say you want a report, showing the details of each employee as separate rows? Enter JSON_table. Convert JSON to Relational with JSON_table The JSON_table function enables you to transform a JSON array into rows using SQL. This makes it easier to view the data in reports. Or join it with relational tables in your database. Or take a document and save it to traditional tables! To use JSON_table, pass the document as the first argument. Then list out the columns you want to project in the columns clause. To convert an array to rows, you need to make it a nested path. And state the attributes you want to expose in another columns clause. So to turn the department documents to a row per employee, you need to: Have an initial columns clause. This includes any department-level attributes you want to display. Within this define a nested path, returning the employee array (employees[*]). In this nesting, have another columns clause. This lists the employee attributes you want to return. Which gives a query like: select j.* from departments_json d, json_table ( d.department_data, '$' columns ( department path '$.department', nested path '$.employees[*]' columns ( name path '$.name', job path '$.job' ) ) ) j where d.department_id = 110; DEPARTMENT NAME JOB Accounting Gietz, William Public Accountant Accounting Higgins, Shelley Accounting Manager Neat. But kinda clunky. You're repeating yourself a lot. The column names match the attribute names. Making mistakes likely. So in 18c we've simplified the syntax. If the names of the columns you're projecting match the attribute names in the document, all you need to do is list the attributes! So you can simplify the above to: select j.* from departments_json d, json_table ( d.department_data, '$' columns ( department, nested employees[*] columns ( name, job ) ) ) j where d.department_id = 110; DEPARTMENT NAME JOB Accounting Gietz, William Public Accountant Accounting Higgins, Shelley Accounting Manager You can mix-and-match simplified and extended notation. So if you want to rename attributes, or include extra formatting you can. For example, this adds employee's hire date to the results. In the process it renames the column. 
And specifies it as a DATE: select j.* from departments_json d, json_table ( d.department_data, '$' columns ( department, nested employees[*] columns ( name, job, hire_date date path '$.hireDate' ) ) ) j where d.department_id = 110; DEPARTMENT NAME JOB HIRE_DATE Accounting Gietz, William Public Accountant 07-JUN-2002 Accounting Higgins, Shelley Accounting Manager 07-JUN-2002 So far we've talked about using these functions to return parts of your document in the query. But you can also use JSON_value and JSON_object in your where clause. So you can find rows which store given values. Which brings an important question: How do you do this efficiently? How to Search JSON Documents in Your Database JSON documents can be huge. Reading all these in a table storing billions of rows is going to take a while. To do this quickly, you're going to need an index. Luckily Oracle Database has a few options to speed up your JSON search SQL. JSON Function-based Indexes If you know which attributes you're searching on in advance, you can create function-based indexes for your queries. For example, say you want to allow staff to search by department name and get the corresponding document back. So if you have a query like this: select * from departments_json where json_value ( department_data, '$.department' ) = :dept_name; To make it fast, create an index with the exact-same function you use in your where clause, like so: create index dept_department_name_i on departments_json ( json_value ( department_data, '$.department' ) ); Be aware that this can only index one value per document. So you can't index a whole array. And an attribute could flip between a single value and an array between documents. So this could lead to unexpected results. To prevent this problem, and ensure index attributes are always scalars, add the ERROR ON ERROR clause. And NULL ON EMPTY to avoid errors for missing attributes. So the full form of the above index is really: create index dept_department_name_i on departments_json ( json_value ( department_data, '$.department' error on error null on empty ) ); Whatever you do, ensure the clauses you use in your where clause match those you used in the index! So function-based indexes are fine if you know what you're looking for. But what if you want to support ad-hoc queries? JSON Search Index for Fast Ad-Hoc SQL Before Oracle Database 12.2 you could create an Oracle Text index over JSON. With this in place, the database could use the index for any JSON function query. But this was kinda messy. So in 12.2 we simplified the syntax. With a JSON search index. To use this, first add the IS JSON constraint to the column. Then create a SEARCH index with the FOR JSON clause: create search index dept_json_i on departments_json ( department_data ) for json; This creates an Oracle Text index behind the scenes. So now, whatever attribute you're inspecting, you can find the associated rows. Using JSON_textcontains, you can find any document which has a given value anywhere in the path you provide. For example, to find documents which have the value "Public" anywhere in them, use: select * from departments_json d where json_textcontains ( department_data, '$', 'Public' ); Note this has to be the exact word. The query above won't return documents that include "Publication" unless the value "Public" is also in the same document. The database can also use the search index if you use JSON_value in your where clause. But it indexes the whole document for every row. 
Which means the index will be bigger than its function-based counterpart. And less efficient. For example, let's load up the table with JSON documents for all the departments with employees in the HR schema. Then compare how the indexes perform when searching for the accounting department: select * from departments_json d where json_value ( department_data, '$.department' ) = 'Accounting'; When using the targeted, function-based index, you'll get a plan like: ------------------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | ------------------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 2 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| DEPARTMENTS_JSON | 1 | 1 | 1 |00:00:00.01 | 2 | |* 2 | INDEX RANGE SCAN | DEPT_DEPARTMENT_NAME_I | 1 | 1 | 1 |00:00:00.01 | 1 | ------------------------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("D"."SYS_NC00003$"='Accounting') But the search index uses the following plan: ------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | ------------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.01 | 28 | 4 | |* 1 | TABLE ACCESS BY INDEX ROWID| DEPARTMENTS_JSON | 1 | 1 | 1 |00:00:00.01 | 28 | 4 | |* 2 | DOMAIN INDEX | DEPT_JSON_I | 1 | | 1 |00:00:00.01 | 27 | 4 | ------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(JSON_VALUE("DEPARTMENT_DATA" FORMAT JSON , '$.department' RETURNING VARCHAR2(4000) NULL ON ERROR)='Accounting') 2 - access("CTXSYS"."CONTAINS"("D"."DEPARTMENT_DATA",'{Accounting} INPATH (/department)')>0) The key thing to note here is the buffers column. This is how many consistent gets each step does. The lower the number, the better. The search index uses 28 consistent gets. Compared to 2 for the function-based index. That's an order of magnitude worse! So if you know which queries you'll be running, e.g. if you're coding them in the app, use a function-based index. Reserve search indexes for cases where you need to support ad-hoc queries. Of course, you can create a generic search index and (many!) function-based indexes on the same column. So you can have targeted indexes for critical queries. And a fall back for rarely-executed general searches. So we can use the JSON functions to extract values from a document. And efficiently find documents with given values. But they rely on you knowing the structure of your JSON. You need good documentation and standards to know what the attributes are. If your processes are… not the best, your documentation may be out-of-date. Or non-existent! So you need to inspect every single row to find the attribute names. (Something you get for free if you used proper relational tables…) Luckily you can find which rows (if any) have a given attribute. With JSON_exists. 
Search for Attributes with JSON_exists

Say you want to know which documents have an employee with a hire date. Do that with JSON_exists by listing the path you're looking for. Like so:

select department_id from departments_json d
where  json_exists ( department_data, '$.employees.hireDate' );

DEPARTMENT_ID
110

Or check if you've saved anyone's salary:

select department_id from departments_json d
where  json_exists ( department_data, '$.employees.salary' );

no rows selected

Using this you can easily see which documents have which attributes. Of course, this could still lead to a lot of trial-and-error to figure out what is in your documents. Fortunately Oracle Database has a neat way to expose a document's attributes. The JSON Data Guide.

Convert Documents to Relational with JSON Data Guide

The JSON Data Guide allows you to add columns to a table, returning values from a JSON document. Under the hood, this creates virtual columns calling JSON_value to get the values out. To add these columns, you need to:

- Be on Oracle Database 12.2 or higher
- Create a JSON search index with the DATAGUIDE ON option
- Create the columns by calling DBMS_JSON.add_virtual_columns

This will then trawl through your documents, finding the attributes and adding columns to your table. Like so:

alter index dept_json_i rebuild parameters ( 'dataguide on' );

exec dbms_json.add_virtual_columns ( 'departments_json', 'department_data' );

desc departments_json;

Name                        Null?     Type
DEPARTMENT_ID               NOT NULL  NUMBER(38)
DEPARTMENT_DATA             NOT NULL  BLOB
DEPARTMENT_DATA$department            VARCHAR2(16)

As you can see, this added a column for the department. Which has a cryptic column$attribute name. Which is case-sensitive. Meaning you have to use double quotes to access it:

select "DEPARTMENT_DATA$department" department_name
from   departments_json
where  department_id = 110;

It'd be better if you could have properly named columns. The good news is, you can! DBMS_JSON includes a rename_column procedure. This maps a given attribute path and data type to a new name. After setting this, you need to regenerate the columns. Do this by calling DBMS_JSON.add_virtual_columns again. This drops and recreates the columns:

begin
  dbms_json.rename_column(
    'departments_json', 'department_data',
    '$.department', dbms_json.type_string, 'DEPARTMENT_NAME'
  );
  dbms_json.add_virtual_columns ( 'departments_json', 'department_data' );
end;
/

desc departments_json

Name             Null?     Type
DEPARTMENT_ID    NOT NULL  NUMBER(38)
DEPARTMENT_DATA  NOT NULL  BLOB
DEPARTMENT_NAME            VARCHAR2(16)

This is great. But what about that employees array? How come there are no columns for that? Well it's a one-to-many problem. There's no clear way to return all the elements of an array in a single row. Fortunately, you can also use the JSON Data Guide to create a view. This uses JSON_table under the covers to turn an array to rows-and-columns:

begin
  dbms_json.create_view (
    'department_employees', 'departments_json', 'department_data',
    dbms_json.get_index_dataguide (
      'departments_json', 'department_data', dbms_json.format_hierarchical
    )
  );
end;
/

select * from department_employees where department_id = 110;

DEPARTMENT_ID  DEPARTMENT_DATA$job  DEPARTMENT_DATA$name  DEPARTMENT_DATA$hireDate  DEPARTMENT_NAME
110            Public Accountant    Gietz, William        2002-06-07T00:00:00       Accounting
110            Accounting Manager   Higgins, Shelley      2002-06-07T00:00:00       Accounting

Simple! So you've added your virtual columns or created views. But what happens if someone adds a document with a new attribute? How do you keep the virtual columns in sync?
You could call DBMS_JSON.add_virtual_columns periodically. Or come up with some DDL-trigger based solution. Or you could let the database do it for you automagically! If you set the search index's parameters to DATAGUIDE ON CHANGE add_vc, the database will keep the columns in sync for you: alter index dept_json_i rebuild parameters ( 'DATAGUIDE ON CHANGE add_vc' ); Bear in mind this will happen whenever the index updates. This could be handy when you're prototyping in development. But I would advise against this in production; you could end up running lots of DDL against your tables! But what if you want to know when a new document has new attributes? And let your developers know something's changed? Well, you can write your own, custom change procedure. This could inspect the current virtual columns. Then see if they match those in the Data Guide. And send a notification to the relevant people if they don't. How to Generate JSON with SQL So far we've discussed storing JSON documents as-is in your database. But what if you're doing the good thing and storing your data relationally? And you want to return the rows as JSON? Converting query results to JSON is a non-trivial task. Many people have built libraries to help you with this. So Oracle Database 12.2 we added new functions to do this natively: json_array json_arrayagg json_object json_objectagg How you use these is pretty self-explanatory. The array functions return the arguments as an array (surrounded by square brackets). The object functions return the attribute/value pairs as an object (surrounded by curly braces). The non-AGG versions return a document for each row in the input. Whereas the AGG varieties can combine many rows into one document or array. According to your GROUP BY. So if you want to create the document we started with, from the standard HR schema, you can do this: select json_object ( 'department' value d.department_name, 'employees' value json_arrayagg ( json_object ( 'name' value last_name || ', ' || first_name, 'job' value job_title, 'hireDate' value hire_date ) ) ) from hr.departments d join hr.employees e on d.department_id = e.department_id join hr.jobs j on e.job_id = j.job_id where d.department_id = 110 group by d.department_name; To understand what's going on, it's best to work from the inside out. First, we create a JSON object for each employee. json_object ( 'name' value last_name || ', ' || first_name, 'job' value job_title, 'hireDate' value hire_date ) Then combine these into an array for each department: json_arrayagg ( json_object ( 'name' value last_name || ', ' || first_name, 'job' value job_title, 'hireDate' value hire_date ) ) Finally we add the department attributes, returning the finished document: json_object ( 'department' value d.department_name, 'employees' value json_arrayagg ( json_object ( 'name' value last_name || ', ' || first_name, 'job' value job_title, 'hireDate' value hire_date ) ) ) Note that on the first release these functions had limited data type support: just VARCHAR2, NUMBER, and DATE. In 18c we've finished off the implementation. So they support almost all the data types in Oracle Database. But like JSON_table, these can be tricky to work with. Converting a row to a JSON object is tedious if you have many columns. So in 19c, we've made it easier! 
Pass * to JSON_object and it'll generate an object from all the columns returned by your query: select json_object ( * ) jdoc from hr.departments where department_id = 110; JDOC { "DEPARTMENT_ID" : 110, "DEPARTMENT_NAME" : "Accounting", "MANAGER_ID" : 205, "LOCATION_ID" : 1700 } So these functions make it easy to convert your data to JSON. And it's simple to use JSON_table to receive documents and store the data relationally. If you do this, it's inevitable that at some point you'll want to generate back out the same document you stored in the first place. Which brings the important question: How do you ensure the JSON you received and generated are the same? How to Compare JSON Documents using SQL At first glance this seems like a simple problem. JSON is just text. So you can just check if one document equals the other. Right? Well, according to the JSON standard: Insignificant whitespace is irrelevant Insignificant attribute order is irrelevant So when comparing documents, first you need to strip out all the whitespace surrounding attributes and their values. Tricky, but doable. But comparing documents which have their attributes in a different order?! Well that's another matter. One way to do this, is to convert the documents to their relational form. Then you can do set difference operations in SQL to find any mismatches: with doc1_rows as ( select t.* from departments_json, json_table ( department_data columns department, nested employees[*] columns ( name , job ) ) t where department_id = 110 ), doc2_rows as ( select * from json_table ( '{ "employees" : [ { "name" : "Gietz, William", "job" : "Public Accountant", "hireDate" : "2002-06-07T00:00:00" }, { "hireDate" : "2002-06-07T00:00:00", "name" : "Higgins, Shelley", "job" : "Accounting Manager" } ], "department" : "Accounting" }' columns department, nested employees[*] columns ( name , job ) ) ), all_rows as ( select d.*, -1 tab from doc1_rows d union all select d.*, 1 tab from doc2_rows d ) select department, name, job from all_rows group by department, name, job having sum ( tab ) <> 0; no rows selected Yuck. So we've addressed this in 18c. There's a new condition, JSON_equal. This takes two documents as input. And returns true if they're the same, false if they're different. According to JSON standard rules. So you can replace all the code above with: select case when json_equal ( department_data, '{"employees" : [ { "name" : "Gietz, William", "job" : "Public Accountant", "hireDate" : "2002-06-07T00:00:00" }, { "hireDate" : "2002-06-07T00:00:00", "name" : "Higgins, Shelley", "job" : "Accounting Manager" } ], "department" : "Accounting" }' ) then 'EQUAL' else 'DIFFERENT' end matching from departments_json where department_id = 110; MATCHING EQUAL Nice! Note this doesn't tell you which parts are different. Just that they are. So if you need to show which attributes are different, you'll need to go down the relational shredding approach. One final thing before we finish. I promised it would be easy to convert JSON stored as BLOB to text. How? How to Pretty Print JSON Simple. The JSON_query function includes a PRETTY clause. Include this at the end and it'll return the document as pretty-printed text: select json_query ( department_data, '$' pretty ) from departments_json; Or, from 19c, you can use the new JSON_serialize function. This allows you to convert JSON from and to binary by specifying the return type as VARCHAR2, CLOB or BLOB. 
And it includes a PRETTY clause :) So you can get a nicely formatted document like so: select json_serialize ( department_data returning varchar2 pretty ) from departments_json; So that was a quick tour of JSON in Oracle Database. So far we've concentrated on working with JSON in SQL. But if you're building PL/SQL APIs, chances are you'll want to work with JSON there too. The good news is we've created some object types to do just that. If you want to try out the example in this post, you can find them all in this Live SQL script. Or if you want to get into the details of how this works or any of the other options discussed above, read the JSON Developer's Guide. So now we'd like to know: Are you using JSON in your database? How? Any other options you'd like to see to make this easier to work with? Let us know in the comments!
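(As a quick taster of the PL/SQL object types mentioned above — a minimal sketch using JSON_OBJECT_T and JSON_ARRAY_T; see the JSON Developer's Guide for the full API:)

declare
  doc  json_object_t;
  emps json_array_t;
begin
  -- Parse a JSON string into an object type, then read values from it
  doc := json_object_t.parse (
    '{ "department": "Accounting", "employees": [ { "name": "Gietz, William" } ] }'
  );
  dbms_output.put_line ( doc.get_string ( 'department' ) );

  -- Arrays come back as JSON_ARRAY_T
  emps := doc.get_array ( 'employees' );
  dbms_output.put_line ( 'Employees: ' || emps.get_size () );
end;
/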



How to Create an Execution Plan

When you’re trying to get SQL to run faster there’s one thing that’s vital: an execution plan. In this post we’ll investigate four ways to create these:

- Autotrace
- SQL Monitor
- TKPROF
- DBMS_XPlan

But before we begin let's answer the question:

What is an Execution Plan?

The plan for a SQL statement is a set of instructions. This tells the database how to access the data and join it together. Plans come in two varieties:

- Explain
- Execution

These sound similar. But they’re different. To understand why, consider this example. You’re making a car journey. You plan it beforehand. This is the route you expect to take. But just as you’re going to leave, you hear on the news that there’s been an accident on your chosen route. This will make it much slower. So you go a different way. There are two routes here. The one you expected to take and the one you actually took. After you arrive you wonder whether you could have completed the journey faster. To figure this out, you need to look at where you went. Not where you planned to go.

An explain plan predicts how Oracle will process your query. An execution plan describes the steps it actually took. Just as in the driving example above, Oracle may use a different route than the one it predicted. As Tom Kyte discusses, there are several reasons this could happen. So to diagnose your queries, you need to know what Oracle did. Not what it guessed it might do!

Side note: "Explain plan" sounds a lot like "execution plan". This can lead to confusion. Many times when I ask people to supply an execution plan they provide one of the explain variety instead. I think it would be better if we gave these very different names. For example, calling execution plans “described paths” or similar. This would help alert people to the fact these are different things (though they look similar) and reduce the chance of confusion.

Top tip: If someone asks you for “the plan” they’re usually looking for one of the execution variety. I.e. what really happened.

So with that cleared up, how do you go about gathering them?

Autotrace

First up, autotrace. This is a freely available tool. You can enable it in SQL*Plus using the “set autotrace” command. With this on, after each query Oracle displays its output. Below this is the plan along with its performance stats.

If your query returns hundreds or thousands of rows you have to wait for Oracle to display them all. This can be a pain. Luckily you can suppress the results with the trace[only] option. You can also choose to show just the plan or the stats. For example, to hide the query output and show just the stats, use:

set autotrace trace stat

To include the output and the plan, enter:

set autotrace on exp

Once you're finished you can switch it off with:

set autotrace off

While using SQL*Plus is better than nothing, it has some limitations:

- It doesn’t show the stats (output rows, buffer gets, etc.) for each step in the plan
- It uses explain plans, so what you see may not be what Oracle did!

So while it’s useful for a quick finger in the air, there are better versions available. Such as:

Autotrace in SQL Developer

This is a step up over the SQL*Plus version. It displays the stats for every step of the plan. So for each operation you can see metrics such as:

- How many buffer gets it used
- How long it took to run
- How many disk reads and writes it did

This makes it much easier to see which points are consuming the most resources.
To see how to configure and run it, check out this video: Besides the step-by-step breakdown, autotrace in SQL Developer has more advantages over SQL*Plus. These include: You can expand and collapse parts of the plan. This makes it much easier to see what’s going on You can compare execution plans. A fantastic tool when you’re trying to find the difference between two large plans. This is a big step up over the SQL*Plus version. So from now on when I refer to autotrace I mean the SQL Developer version. A quick word of warning. By default SQL Developer only fetches the number of rows according to the "SQL array fetch size" parameter. For queries that return a large number of rows this could miss a big chunk of the work. Ensure you set it to "Fetch all rows". Often when you're tuning queries you want to save the plan for later reference or to share with others. To do this, right click the plan and you'll get an option to "Export HTML": Save this and you'll get a file like this: OPERATION OBJECT_NAME CARDINALITY LAST_CR_BUFFER_GETS LAST_OUTPUT_ROWS SELECT STATEMENT           SORT   1 35 1     TABLE ACCESS TARGET 1 35 3       Filter Predicates         T.SORT_CODE=:B1   TABLE ACCESS SOURCE 10 8 10 This all sounds great. But there is one big drawback to autotrace: you have to wait for the query to finish! If your SQL takes minutes or hours to complete waiting for this is a pain. It would be awesome if you could see query progress in real time. Which brings us to the next method: SQL Monitor The SQL Monitor levels up autotrace. It provides similar operation-level stats. But with an added bonus. You can view the progress of execution plans in real time! Got a troublesome full tablescan? You can watch while Oracle churns away at it. Instead of having to wait for the query to finish, you can see the plan immediately. This makes identifying bottlenecks easy. Even better, unlike autotrace which you need to run manually, Oracle will capture the plan for you automatically. So how do you view the plans? They're available in SQL Developer by going to Tools -> Monitor SQL… Or you can view them in the Performance tab of Enterprise Manager: Clicking on the SQL ID for your statement brings up the full details: As with SQL Developer, you can expand and collapse the sections of your plan. For more details on reading these reports see this post on monitoring parallel execution. For those of you who prefer text, you can output the plans using SQL. To do so, use the following query: select dbms_sqltune.report_sql_monitor( sql_id => '4htx5uyx0gxxx', type => 'TEXT', report_level => 'ALL' ) as report from dual; You'll need to replace SQL ID parameter with the ID of your query. You can find this with the following statement: select sql_id, sql_text from v$sql where sql_text like '%some text from your query%' and sql_text not like '%not this%'; If there are many matches, check the SQL text to see which one is your statement. Note SQL_text is limited to 1,000 characters. So for really large statements, you may need to include the SQL_fulltext column too! If it’s so great, why isn’t this my favourite approach? Well, there are a couple of caveats: Not all queries appear by default. Oracle only captures those lasting longer than five seconds or those running in parallel. You need to have Diagnostics and Tuning packs licenses to use it The first limitation is easy to get around. 
One way to do this is by adding the monitor hint:

select /*+ monitor */… from …

For licensing you’ll need to speak to your local sales rep ;) Assuming you are licensed, the SQL Monitor is a fantastic way to do early troubleshooting on those hour-long queries. Often you can spot which parts of the plan are doing the most damage within a few minutes. This enables early diagnosis. You can plan new approaches without waiting for the statement to finish!

Both of these methods are great. But they work on a single statement at a time. What if you want to analyze the performance of several pieces of SQL in a transaction? Enter:

TKPROF

TKPROF is a command line utility that analyzes trace files and turns them into readable form. It gives you the execution stats for all the SQL in the file. Which raises the question: How do I generate a trace file?

There are several methods. The easiest is to set SQL tracing on. The command to do this is:

alter session set sql_trace = true;

Oracle will capture all statements you execute after this in the trace file. To stop this, either disconnect or turn tracing off with:

alter session set sql_trace = false;

This method is easy but limited. For example, it only traces your session. A more powerful method is to call DBMS_monitor.session_trace_enable. This has five parameters:

- Session_id
- Serial_num
- Waits
- Binds
- Plan_stat

Pass in the relevant session_id and serial_num to trace another session. If you leave these null, Oracle will trace your current session. Setting waits and binds to true includes information about these in the file. To stop tracing, call DBMS_monitor.session_trace_disable. As with the enable procedure, pass the relevant session_id and serial_num. Or leave blank if tracing your current session. So to generate a trace file for your current session, including waits and bind details, do the following:

exec DBMS_monitor.session_trace_enable ( null, null, true, true );

***your code here***

exec DBMS_monitor.session_trace_disable;

DBMS_monitor also includes procedures to trace all statements:

- Across the database
- For given clients
- Specific combinations of service name, module and action

Note that tracing adds overhead. So avoid enabling it for the whole database. And remember to disable it when you've finished! Once you've traced your code you need to get the file so you can analyze it.

How to get the trace file

“But where can I find the trace file?” I hear you ask. It lives on the database server. This means you need access to it (or help from someone who does!). You also need to know where it is and what it’s called! Connor explains how to find them in this video:

Getting the files can be fiddly, especially if you don’t have access to the server. You can get around this by configuring a file reader enabling you to query their contents (and thus save them to your local machine). With the trace file in hand you can parse it with TKPROF. The basic syntax for this is:

tkprof <trace_file_name> <output_file_name>

For example:

tkprof ORCL_ora_27883.trc trace.log

This parses the contents of ORCL_ora_27883.trc into trace.log. You're now ready to start analyzing the queries!

Quick note: TKPROF also includes an explain option. This will show you explain, not execution plans. Use this with caution.

Compared to autotrace and the SQL Monitor this is a lot of work. So you may be wondering: is this extra hassle worth it? Well, TKPROF has some advantages over these:

- It includes all the SQL statements run between you starting and stopping tracing. This includes recursive SQL, i.e. statements inside triggers, functions, etc.
- It breaks down the execution time into parse, execution and fetch times

The first benefit is great if you have a slow transaction that includes several statements. You can sort them in the output file from slowest to fastest. This helps you spot which takes the longest. To do this, use the sort option:

tkprof <trace_file_name> <output_file_name> sort=prsela,exeela,fchela

This is particularly useful for spotting fast queries that you execute many times. This could be because you're doing row-by-row processing. Or you have a SQL statement with many calls to PL/SQL which itself calls SQL (something you should generally avoid). Or trigger logic you weren't aware of. Though, as Jonathan Lewis notes in the comments, be aware that a SQL statement could have a different plan for each execution. TKPROF will only report one of these.

The parse, execution and fetch breakdown helps you spot issues unrelated to the plan itself. For example, in an overloaded system parse times can be higher. If a significant fraction of a statement’s runtime is parsing then you should start looking for issues other than the plan itself. So TKPROF enables you to see information not visible in the previous two tools. Sometimes the extra hassle is worth it!

DBMS_XPlan

The final method we’ll look at uses the package DBMS_XPlan. This includes several display functions. These are pipelined, so you can query them like normal tables using the table() operator. Using these, you can get the performance stats for each step of the plan: actual rows, time taken, etc. To see this information, you need to increase Oracle’s stats level. You can do this by:

- Setting statistics_level to all in the session (or database!) using: alter session set statistics_level = all;
- Adding the /*+ gather_plan_statistics */ hint to your query

Once you’ve done one of these and run the statement to completion you can get your plan! Call display_cursor to get the plan immediately after executing your SQL (taking care to ensure you’ve set serveroutput off):

select * from table(dbms_xplan.display_cursor);

This fetches the plan for the last statement in your session. This is good but misses many of the juicy details. To include these, use the format parameter. This enables you to include or exclude parts of the plan. Some of my favourite options are:

- ALLSTATS - Shorthand for IOSTATS MEMSTATS. This includes details such as how many rows and consistent gets each step used.
- LAST - Only display the stats for the last execution. Otherwise this defaults to giving you the information for every execution.
- PARTITION - Include the Pstart and Pstop columns. This tells you which partitions Oracle accessed. Only needed if you're querying partitioned tables!
- PARALLEL - Again, only relevant for parallel statements!
- NOTE - Includes the note section. This provides extra information, such as whether Oracle used features like dynamic statistics, adaptive plans, etc.

If you want to exclude parts of the plan, just prefix the format with the minus sign '-'. You can also use this to get the plans for previous statements. You just need to find the SQL ID and pass it in:

select * from table(dbms_xplan.display_cursor('sql_id', null, 'ALLSTATS LAST'));

The plan must still be in the cursor cache to use this. Head to LiveSQL for an example of how to do this.
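Putting those pieces together, a typical flow looks something like this. It's only a sketch — the query is a stand-in and the format options are just the ones discussed above:

-- 1. Collect the extra run-time stats (or add the gather_plan_statistics hint instead)
alter session set statistics_level = all;

-- 2. Run the statement you want to analyze to completion
select /*+ gather_plan_statistics */ count(*)
from   hr.employees e
join   hr.departments d on d.department_id = e.department_id;

-- 3. Fetch the execution plan for the last statement in this session
set serveroutput off
select * from table ( dbms_xplan.display_cursor ( format => 'ALLSTATS LAST' ) );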
If you're using the Automatic Workload Repository, you can get older plans using display_AWR:

select * from table(dbms_xplan.display_awr('sql_id', null, null, 'ALLSTATS LAST'));

This is all well and good. But the output is static text. You can’t expand and collapse the plan like you can in SQL Developer. This can make large plans tricky to read. So why would you choose this over SQL Developer? An obvious answer is because this is part of a scripted process. You need pure SQL for these! Another is if you want to post the plan somewhere (*cough* Ask Tom *cough*) to get tuning help from others. The simple text format makes it great for sharing.

Summary

We’ve looked at four methods of creating execution plans. These all come with their own pros and cons. In the end which you use is largely a matter of personal taste. For short running queries, the ability to expand, collapse and compare execution plans makes autotrace in SQL Developer the winner for me. For longer queries – where licensed – I’d go for the SQL Monitor. But according to my recent Twitter poll, DBMS_XPlan is the most popular among my followers.

There are many other ways to generate Oracle execution plans. Notably SQLTXPLAIN. This is a powerful tool that also includes lots of supporting details, such as the current table statistics. This is most useful if you need help from other people (e.g. support!). This ensures they have all the information they need to diagnose the problem.

The most important thing is to ensure that you are getting execution plans, not explain plans. If the output is missing details such as actual rows and time taken then there's a good chance you've got the wrong type. At the very least, you're missing vital information to help you tune your SQL.

How about you? Which do you like best? Are there any other tools you use I haven’t mentioned? I’ve heard there are tools from other vendors for creating execution plans too ;) Do any of these have significant benefits? Let us know in the comments!

UPDATED: 19 July 2017: added Jonathan Lewis' comment about multiple plans, link to LiveSQL demo for DBMS_XPlan and fixed some formatting
UPDATED: 25 Apr 2019; updating code formatting


Why You Can Get ORA-00942 Errors with Flashback Query

I've recently been playing with one of my favorite Oracle features: Flashback. More specifically, Flashback Data Archive. (Photo: lightning last year by oompa123, CC BY 3.0)

This extends the power of flashback query. Normally you can only use this to query as far back as your undo allows. With Flashback Data Archive (FBA), Oracle keeps a permanent store of a table's history (up to the retention period). So using this you can run "as of" queries over a larger period. This enables you to see what was in a table days, months or even years ago! You can also view all the changes to a table between two times. Enabling FBA is a quick and easy way to audit data changes.

Even better, unlike standard flashback query, you can go back to a time before you ran DDL against the table. And the best part? As of 12c it's FREE!*

But I found myself hitting a strange error. After creating the archive and adding a table to it, I was getting "ORA-00942: table or view does not exist" errors:

SQL> select * from products as of timestamp sysdate - interval '1' minute;
select * from products as of timestamp sysdate - interval '1' minute
*
ERROR at line 1:
ORA-00942: table or view does not exist

Strange. The table itself existed. So what's going wrong?

How Oracle enables flashback archive on a table

When you add a table to a flashback archive, Oracle creates extra tables in the background for you. It uses these to construct historic versions of the data. You can find these tables with the following query:

select * from sys.dba_flashback_archive_tables;

TABLE_NAME  OWNER_NAME  FLASHBACK_ARCHIVE_NA  ARCHIVE_TABLE_NAME   STATUS
----------  ----------  --------------------  -------------------  -------
PRODUCTS    CHRIS       FB_ARCHIVE            SYS_FBA_HIST_104694  ENABLED

The sys_fba_hist_* table is where the past data sits. Let's look at that table:

SQL> select * from sys_fba_hist_104694;
select * from sys_fba_hist_104694
*
ERROR at line 1:
ORA-00942: table or view does not exist

What the… the history table isn't there!? Of course, it's not my table that doesn't exist; it's the archive table that doesn't exist! So why does that matter? When you run a flashback query on an FBA-enabled table, Oracle may also query the flashback tables. You can see these by looking at the execution plan for a flashback query.

So why did I get an error? When you place a table in a flashback archive, Oracle does this in a separate process. Because of this, if it fails for any reason, you don't receive an exception!

How to fix it

In my case, I had forgotten to give my user a quota on the FBA tablespace. So the background process was unable to create the tables. So when I ran:

select * from products as of timestamp sysdate - interval '1' minute;

Oracle was looking for sys_fba* tables that didn't exist!

To avoid this, ensure the table owner has a quota allocated on the FBA tablespace. Also verify that there's space available in it. Once the user is set up correctly, disable and re-enable flashback archive on your tables.

Creating the history tables may fail for other reasons. So if you plan on using FBA, double check it's working before you rely on it. If you managed to get yourself in this mess, you can get out of it by disassociating and reassociating the table with the FBA. See MOS note 1330817.1 if you need further details about this. Or if you're stuck, contact support!

If you'd like to have a go yourself, then you can use this script.

* This is also back ported to 11.2.0.4. Applies to Basic Flashback Archive only.
Optimization requires EE license and the Advanced Compression Option.
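For reference, the fix described above boils down to something like this. A sketch only — the tablespace name fba_ts is an assumption; the table, owner, and archive names are the ones from the example:

-- Give the table owner a quota on the tablespace backing the flashback archive
alter user chris quota unlimited on fba_ts;

-- Then disable and re-enable archiving on the table so Oracle retries creating
-- the SYS_FBA_* tables (note: disabling the archive discards any history captured so far)
alter table products no flashback archive;
alter table products flashback archive fb_archive;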



Finding the Previous Row in a Different Group Using SQL

Imagine you have a table with a series of transactions. These are classified into credits or debits:

TRANS_ID  TRANS_DATETIME     TRANS_TYPE  AMOUNT
       6  01-JAN-2015 00:00  CREDIT          40
       7  01-JAN-2015 00:00  CREDIT          90
       8  01-JAN-2015 00:00  DEBIT           80
       5  02-JAN-2015 00:00  CREDIT         200
       4  03-JAN-2015 00:00  DEBIT           99
       3  04-JAN-2015 00:00  CREDIT          50
       2  05-JAN-2015 00:00  DEBIT          200
       1  06-JAN-2015 00:00  DEBIT          100

Your boss has asked you to do the following: "I want a report showing the ID of the previous transaction of the other type, according to the transaction date. Now!" So for credits, you need the previous debit trans_id and vice-versa. In the case of ties (transactions at the same time), simply take the highest transaction id. So the output you want looks like:

TRANS_ID  TRANS_DATETIME     TRANS_TYPE  AMOUNT  PREV_ID_OTHER_TYPE
       6  01-JAN-2015 00:00  CREDIT          40
       7  01-JAN-2015 00:00  CREDIT          90
       8  01-JAN-2015 00:00  DEBIT           80                   7
       5  02-JAN-2015 00:00  CREDIT         200                   8
       4  03-JAN-2015 00:00  DEBIT           99                   5
       3  04-JAN-2015 00:00  CREDIT          50                   4
       2  05-JAN-2015 00:00  DEBIT          200                   3
       1  06-JAN-2015 00:00  DEBIT          100                   3

You need to write the SQL to do this. How? Let's look at three different approaches:

- Standard analytics
- The model clause
- Pattern matching

Analytics

First, let's analyze the problem. You need to find values from previous rows. The obvious way to do this is with lag. Using this as-is doesn't do what you need though. A basic order by trans_datetime returns the value from the immediately preceding row. You need to skip back N rows. The second parameter for lag is the offset. You can use this to provide the value for N. But the value for N is variable. You don't know how many credits (or debits) there are in the current series. Calculating the value for N is tricky. Using partition by to group the results by trans_type doesn't help either. This just returns the previous ID of the same type. So how can you do it?

You need a two-step process. First, check the type of the current transaction. If it's credit, then you want the previous row that was a debit (and the reverse). You can use a case expression to do this. Next you need to find the trans_id value for this. Lag's out, so what can you use? The last_value function enables you to look backwards in your results. This returns the final value according to the sorting and windowing you specify. You're looking for the last transaction of the other type. This looks like it fits.

Now put this all together. If your current transaction is credit, you want the last_value of debit transactions (and vice-versa). You can do this as follows:

case
  when trans_type = 'CREDIT' then
    last_value(
      case when trans_type = 'DEBIT' then trans_id end
    ) over( order by trans_datetime, trans_id )
  else
    last_value(
      case when trans_type = 'CREDIT' then trans_id end
    ) over( order by trans_datetime, trans_id )
end

By default last_value considers all values, including null. It also has a windowing clause of:

range between unbounded preceding and current row

This isn't what you want. The case expression inside the last_value maps the current row to null. This is the final value in the ordered set (so far). So all the values for this are null. To overcome this, you need to skip back to the first non-null entry. You do this by adding the "ignore nulls" clause.
The complete query is then: select t.*, case when trans_type = 'CREDIT' then last_value( case when trans_type = 'DEBIT' then trans_id end ) ignore nulls over( order by trans_datetime, trans_id ) else last_value( case when trans_type = 'CREDIT' then trans_id end ) ignore nulls over( order by trans_datetime, trans_id ) end prev_id_other_trans_type from transactions t order by trans_datetime, trans_id; This does the job. Yet it takes a bit of thinking to understand the SQL. Let's look at other ways to see if they are easier to understand. Model Clause Ny Nordisk mode, Catwalk by Benjamin Suomela (CC-BY-2.5-DK)   No, not that kind of model! In 10g, Oracle introduced the model clause. This enables you to create columns using complex formulas. This gives spreadsheet-like abilities in SQL! You can use it here to do the logic above. To do this, you need to supply three parts for the model. Dimension by states how you're going to uniquely identify each cell. Measures lists the columns you're returning. Rules tells Oracle how to create the values for the previous ID. Dimension By You need to supply this to tell Oracle how to identify each row and column in the results. i.e. a unique identifier. Row_number(), ordered by transaction_datetime, trans_id does this. To help with switching credits to debits and the reverse, you need a bit more. So also put trans_type in the dimension by, giving: dimension by ( trans_type tp, row_number() over (order by trans_datetime, trans_id) rn ) Measures Here you just need all the columns in the table and the column we're creating, prev_id_other_type. You need to give this an initial value. It's a number, so zero works fine: measures ( trans_datetime, trans_type, trans_id, 0 prev_id_other_type ) Rules This is where the fun starts. You want to assign the last debit to credits and vice-versa. First, you need to state which cells you're providing values for. This is where the columns you defined in the dimension by come in. To aid understanding, you can do this in two separate rules. One for CREDIT and one for DEBIT. So you have two assignment targets: prev_id_other_type['CREDIT', any] = <expression>, prev_id_other_type['DEBIT', any] = <expression> The any keyword means you want to calculate the values for all the rows. Now you need to figure out what goes in the expression. In words, this is: "The most recent prior trans_id of the other type. If there are two transactions at the same time, take the one with the greatest trans_id." You need the greatest trans_id, which is a clue to use max(). Next you need to state which rows you want the get this max from. For credits, this is the debits. You want to consider all these previous rows. You can do this with the slightly cryptic expression: rn <= cv() CV() returns the current value of the dimension. This is rn. So you're saying look at all rows where the row number is less than or equal to the current. Remember that you defined rn as the order to look at rows (trans_datetime, trans_id). So gives you: prev_id_other_type['CREDIT', any] = max(trans_id) ['DEBIT', rn <= cv()] You can do a similar thing for debits by switching CREDIT and DEBIT in the above. But this isn't quite right. Transaction dates and ids are in different orders. So the previous row ordered by trans_id isn't necessarily the previous row ordered by trans_datetime. To overcome this, you can use the last analytic function. This enables you to return the trans_id of the rows ordered by date. 
You use it as follows: max(trans_id) keep ( dense_rank last order by trans_datetime, trans_id ) This gives a complete solution of: select * from transactions t model dimension by ( trans_type tp, row_number() over (order by trans_datetime, trans_id) rn ) measures ( trans_datetime, trans_type, trans_id, 0 prev_id_other_type ) rules ( prev_id_other_type['CREDIT', any] = max(trans_id) keep ( dense_rank last order by trans_datetime, trans_id ) ['DEBIT', rn <= cv()], prev_id_other_type['DEBIT', any] = max(trans_id) keep ( dense_rank last order by trans_datetime, trans_id ) ['CREDIT', rn <= cv()] ) order by trans_datetime, trans_id; Again, this works. While I think it makes it more obvious you're returning credit ids for debits and vice-versa, exactly what this does is still unclear. So let's look at the final method, pattern matching. Pattern Matching (Match_Recognize) Puzzle Krypt by Muns (CC BY-SA-2.0)   12c brought the match_recognize clause. This enables you to look for patterns in your data using regular expressions. As with the model clause you need to supply several parts. Let's take a bottom-up approach to using this. First, you need to define the variables. You want one for each transaction type. This is as easy as stating which trans_type they are equal to: define c as trans_type = 'CREDIT', d as trans_type = 'DEBIT' Next, you need to define the pattern you're looking for. You need any sequence of one or more occurrences of either variable. Here you need a regular expression. The | symbol is the choice operator. This matches the expression either side of it. So you want ( c | d ). You want to match at least one of these. So place the + symbol after it. This gives the following pattern: pattern( ( c | d )+ ) You need to return all the rows in the table. So specify all rows per match To finish, state which order you need to process the rows and how to calculate the prev_id_other_type column. As before, the order is simply trans_datetime, trans_id. As with the model clause, you use the measures clause to define the new column. To "switch over" the transaction types, use the classifier function. This gives the name of the variable that matches the current row. So if a row's trans_type = 'CREDIT', this is C and its D for debits. To get the previous trans_id of the other type, simply decode classifier. Use this to return the trans_id of the other variable: decode(classifier(), 'D', c.trans_id, d.trans_id) This works because if the current row is a debit, c.trans_id keeps the value from the last credit transaction. So your final pattern matching SQL is: select * from transactions match_recognize ( order by trans_datetime, trans_id measures decode(classifier(), 'D', c.trans_id, d.trans_id) prev_id_other_trans_type all rows per match pattern( (c|d)+ ) define c as trans_type = 'CREDIT', d as trans_type = 'DEBIT' ) order by trans_datetime, trans_id; Phew! That was hard work. You now have three different methods to choose from. How do you decide which to use? All need a bit of thinking to understand. Picking one over the other based on clarity comes down to personal preferences. In all cases, good commenting is essential. Let's take an objective approach. Which is fastest? Performance All three methods only access the table once. So it comes down to how efficiently each technique can process the data. To test this, I've loaded the table with 10,000 rows. I ran it on an Oracle Cloud Database (Extreme Performance Service) on version 12.1.0.2. Let's look at the times for three executions of each.   
Analytics Run 1: Elapsed: 00:00:15.86 Run 2: Elapsed: 00:00:13.10 Run 3: Elapsed: 00:00:12.84 Mean: 13.93 seconds Model Run 1: Elapsed: 00:00:38.73 Run 2: Elapsed: 00:00:35.80 Run 3: Elapsed: 00:00:40.88 Mean: 38.47 seconds Match Recognize Run 1: Elapsed: 00:00:12.13 Run 2: Elapsed: 00:00:12.30 Run 3: Elapsed: 00:00:12.21 Mean: 12.21 seconds The model clause is the clear loser here. It takes around three times longer than the other two approaches. Pattern matching has a slight edge on standard analytics. So if you're on 12c, you may want to go for this method. Want to try this out for yourself? Have a go on LiveSQL. What do you think? Which approach do you like best? Are there any better ways to do this? Please let us know in the comments! Thanks to Stew Ashton and Rajeshwaran Jeyabal for providing the pattern matching and analytic solutions in this Ask Tom thread.
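If you'd like to run these three approaches yourself, you'll need a transactions table and some data. Here's a minimal sketch of one way to build it. The table definition and the random test data are my own assumptions, not the exact setup used for the timings above, so expect different numbers on your system:

create table transactions (
  trans_id       integer      not null primary key,
  trans_datetime date         not null,
  trans_type     varchar2(6)  not null check ( trans_type in ('CREDIT', 'DEBIT') ),
  amount         number(10,2) not null
);

-- 10,000 rows of random credits and debits spread over a year
insert into transactions ( trans_id, trans_datetime, trans_type, amount )
  select rownum,
         date'2015-01-01' + dbms_random.value(0, 365),
         case when dbms_random.value < 0.5 then 'CREDIT' else 'DEBIT' end,
         round(dbms_random.value(1, 500), 2)
  from   dual
  connect by level <= 10000;

commit;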

Creating Multiple Tables in a Single Transaction

The following quiz is taken from the Oracle Dev Gym. Can you guess which choices are correct? The Question Your boss has asked you to create a release script for the tables shown in this schema diagram:   You should create these tables in the PLCH_APP_OWNER user. Which of the following choices will create the tables shown above? You can assume that you are connected to the database as the PLCH_APP_OWNER user for this question. The Choices Choice 1 create table plch_app_owner.plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null ) ; alter table plch_app_owner.plch_orders add constraint plch_orders_pk primary key ( order_id ) ; create table plch_app_owner.plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null ) ; alter table plch_app_owner.plch_order_items add constraint plch_order_items_pk primary key ( order_id, item_number ) ; alter table plch_app_owner.plch_order_items add constraint plch_orit_plch_orders_fk foreign key ( order_id ) references plch_orders ( order_id ) ; Choice 2 create schema authorization plch_app_owner create table plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null, constraint plch_orders_pk primary key ( order_id ) ) create table plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null, constraint plch_order_items_pk primary key ( order_id, item_number ) , constraint plch_orit_plch_orders_fk foreign key ( order_id ) references plch_orders ( order_id ) ) ; Choice 3 create schema create table plch_app_owner.plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null, constraint plch_orders_pk primary key ( order_id ) ) create table plch_app_owner.plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null, constraint plch_order_items_pk primary key ( order_id, item_number ) , constraint plch_orit_plch_orders_fk foreign key ( order_id ) references plch_orders ( order_id ) ) ; Choice 4 create schema authorization plch_app_owner create table plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null, constraint plch_orders_pk primary key ( order_id ) ) ; create schema authorization plch_app_owner create table plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null, constraint plch_order_items_pk primary key ( order_id, item_number ) , constraint plch_orit_plch_orders_fk foreign key ( order_id ) references plch_orders ( order_id ) ) ; Choice 5 create schema authorization plch_app_owner create table plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null ) alter table plch_orders add constraint plch_orders_pk primary key ( order_id ) create table plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null ) alter table plch_order_items add constraint plch_order_items_pk primary key ( order_id, item_number ) alter table plch_order_items add constraint plch_orit_plch_orders_fk foreign key ( order_id ) references 
plch_orders ( order_id ) ; Answer Oracle issues an implicit commit before and after every DDL statement. Consequently you can't undo DDL by issuing a rollback command. If you have more than one DDL command in a process (e.g. a release script) and one fails, then you must explicitly undo the completed steps to roll back the entire process. Using the "CREATE SCHEMA" command you can combine multiple DDL steps into a single statement. As a result, the failure of a single part causes all commands within the "CREATE SCHEMA" to be rolled back. CREATE SCHEMA is limited to creating tables, views and issuing grants. If you wish to alter tables, "create or replace" a view or create other objects (e.g. indexes), these must be issued as separate statements. If you want to create constraints on tables within a "create schema", you must use inline constraints. Oracle will automatically resolve foreign key dependencies within a create schema command. This means it doesn't matter which order the tables appear in the command. Provided the end result of the schema creation is consistent, all the tables will be created. For example, choice 2 could be rewritten as: create schema authorization plch_app_owner create table plch_order_items ( order_id integer not null , item_number integer not null , product_id integer not null , quantity integer not null , unit_price number (10,2) not null, constraint plch_order_items_pk primary key ( order_id, item_number ) , constraint plch_orit_plch_orders_fk foreign key ( order_id ) references plch_orders ( order_id ) ) create table plch_orders ( order_id integer not null , order_date date not null , customer_id integer not null, constraint plch_orders_pk primary key ( order_id ) ) ; The command creates the tables, despite the child table (PLCH_ORDER_ITEMS) being listed first. Finally, you must be connected to the user listed in the AUTHORIZATION clause for CREATE SCHEMA to work. This doesn't create the database user - this must already exist. Not sure which choices are correct? Click here to take the quiz on the Oracle Dev Gym and find out! How did you get on? Do you like quizzes such as this? If so, then head over to the Oracle Dev Gym to get your fill of quizzes on SQL, PL/SQL and Database Design. There are new quizzes every week for you to test your skills and learn! UPDATED: 19 Dec 2017 changed links to point to Oracle Dev Gym instead of PL/SQL Challenge
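If you want to see the all-or-nothing behaviour for yourself, a sketch like the one below should do it (the table names here are invented for illustration). The foreign key in the second table deliberately references a column that doesn't exist, so the whole statement fails and neither table is created:

create schema authorization plch_app_owner
  create table plch_parent (
    parent_id integer not null,
    constraint plch_parent_pk primary key ( parent_id )
  )
  create table plch_child (
    child_id  integer not null,
    parent_id integer not null,
    constraint plch_child_pk primary key ( child_id ),
    constraint plch_child_parent_fk foreign key ( parent_id )
      references plch_parent ( no_such_column )
  )
;

-- confirm that neither table exists afterwards
select table_name
from   user_tables
where  table_name in ( 'PLCH_PARENT', 'PLCH_CHILD' );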

UKOUG Tech15 Sessions with the Oracle Developer Advocates

UKOUG Tech15, Britain's largest Oracle technology conference, is just one week away! The SQL Developer Advocates, Connor McDonald and Chris Saxon are both presenting. There are five chances for you to join us and expand your SQL skills. Our sessions are: Super Sunday 6 Dec 12c Features for Developers 15:00 with Connor McDonald With the banner headline multitenancy features in Oracle Database 12c, lost in the mix has been the incredible list of new features for developers. Hundreds of improvements and new facilities have arrived in Oracle Database 12c. This session redresses the balance and excites developers with all the new Oracle Database 12c goodies Monday 7 Dec Supercharge Your SQL Skillset 15:10 with Connor McDonald Cloud and virtualisation services now take care of many of the day to day concerns for organisations - keeping infrastructure up to date, ensuring the integrity of backup/recovery, resource capacity and high availability. Finally IT professionals can focus on the true art of software development - writing quality SQL to provide data to upstream services. This session covers some of the SQL techniques available in the Oracle database, to allow developers to supercharge their SQL skillset. Tuesday 8 Dec Oracle Flashback 11:20 with Connor McDonald With Flashback Data Archive now available as part of the core product, it's worth exploring this and all of the other flashback facilities available in the database engine. Flashback options are rarely considered in day to day practice, because of blinkered view that it is only for when an "accident" occurs. However, there are countless other value use cases that will be covered in this session. All of the variations: - flashback table - flashback drop - flashback query - flashback database - flashback transaction - flashback data archive will all be covered. Wednesday 9 Dec SQL Magic - Tricks to Demonstrate SQL Features You've Probably Never Used 9:00 with Chris Saxon In this fun session you’ll see SQL versions of classic magic tricks. These include the cup and balls trick, where rows inserted into one table magically reappear in another, a card trick where a randomly selected “card” (row) from a “deck” (table) floats to the top of the deck and the magically changing row trick where the value returned by a query changes just by issuing a select statement! Unlike real magicians however, I’ll show you how the tricks are done. This will enable you to perform your own SQL “magic” and apply these principles when building Oracle database applications. This session is intended for developers, DBAs and data modelers looking to expand their SQL skills, highlighting features that are new to 12c as well as little used existing options. Finding All the Red M&Ms: A Story of Indexes and Full Table Scans 14:30 with Chris Saxon "Why isn't my query using an index?" is one of the most common questions beginners have when tuning SQL. This talk explores the factors that influence the optimizer's decision behind this question. It does so by comparing fetching rows from a database table to finding all the red M&Ms across a number of bags of sweets. It looks at two methods for getting the chocolate and then compares these to index lookups and full table scans in execution plans. It demonstrates how the optimizer analyzes the work necessary for each method, discussing the point at which a full table scan becomes more efficient than an index range scan. It goes on by discussing how the physical ordering of data in the tables affects these calculations. 
This talk is intended for developers and DBAs new to query optimization who want to understand the principles behind the cost-based optimizer. By the end of the talk they will be able to explain the factors that influence the original question and understand the terms block, multiblock read and clustering factor. And yes, there will be chocolate! There's still time to register. If you're not registered you can do so here. We hope you can join us to share the magic of SQL.

Analytical SQL

Calculate Stock Market Returns Using SQL II

In the previous post we looked at stock market investment strategies. It looked like we'd hit on a winning method: buy $100 of shares every time the S&P 500 falls 2% or more during a day. This provided the best return on our money. As hinted, there are problems with this method though. Let's find out what these are and see if we can mitigate them. We'll use SQL magic to do this, including a look at the model clause.   First, let's review how feasible the 2% buy rule is. Our previous check for this was basic. How many times did this happen between the start of 2005 and the end of 2014?   We were working to a monthly budget. With one $100 transfer to our broker a month, we could buy 120 times. Our analysis of daily 2% falls stopped when we found this happened 109 times.   This is too simple. To be able to buy every time the rule triggers, the drops need to be evenly spread.   Is this really the case?   Let's check again. This time we'll break it down by year.   To do this, use trunc again to convert each date to the first of the year. Then group by this and count the occurrences: select trunc(quote_date, 'y'), count(*) from sp_quotes where ((close-open)/open) < -0.02 group by trunc(quote_date, 'y') order by trunc(quote_date, 'y'); TRUNC(QUOTE_DATE,'Y') COUNT(*) --------------------- ---------- 01 Jan 2007 11 01 Jan 2008 38 01 Jan 2009 22 01 Jan 2010 9 01 Jan 2011 20 01 Jan 2012 3 01 Jan 2013 2 01 Jan 2014 4 8 rows selected That's not even at all!   There were no two percent daily falls in either 2005 or 2006. And over a third of these happened in 2008!   This leads to a couple of problems.   The first is psychological. If we had started this method at the beginning of 2005 it would have been a long wait before we bought anything.   How long exactly?   To find this we need the first day there was a 2% fall. This is the minimum quote_date. So we can find it with: select min(quote_date) from sp_quotes where ((close-open)/open) < -0.02; MIN(QUOTE_DATE) --------------- 27 Feb 2007 Nearly 26 months. It's difficult to keep faith in a strategy that buys nothing! It's too tempting to give up and do something else.   The next three years had the opposite problem: too many falls. The market hit a 10 plus year low in early 2009. Not many people would be confident to keep buying in this situation.   The second problem is practical. If there's a time when the total number of possible purchases is greater than the number of transfers in, we're over budget. Depending on our means, we may be unable to buy every time the alert triggers.   We would have been over budget several times in this period. Over a third of the possible purchases were in 2008 alone. By the end of 2011 we would have spent $1,600 more than our transfers in. Possibly more if we had spent some of the allotted money in 2005-6 instead of transferring it to a broker.   If your budget is fixed, that's sixteen buying chances missed.   How does this affect the gains?   Let's compare two scenarios. In both we'll transfer $100 at the start of each month to a trading account. If there's a 2% daily fall we buy. If we have no money left however, we buy nothing.   In the first case let's spend everything we have saved since the last purchase. In the second we'll only spend $100.   How do we write the SQL to calculate the returns?   Let's start with the transfers in. There were 120 months in this period. So we need that many deposits. Let's use the connect by level row generation technique to create this number of rows.   
For each of these we want to return 100 (our transfer amount) and the first of the month. We can use the add_months(<date>, <months>) function to supply the dates. This does what is says on the tin: adds the given number of months to the date you pass and returns the result.   Our starting date is 1 Jan 2005. We can use rownum minus one to provide the number of months to add. This gives us: select transfer_date, amount from ( select add_months(date'2005-01-01', rownum-1) transfer_date, 100 amount from dual connect by level <= 120 ) Now we need to figure out how many shares we plan to buy and on which dates.   For the "spend everything" method, this is the sum of the money we have saved. This is the total of all the transfers in after the last purchase up to the current date.   To find this, we can start with the query in the previous post to find the two percent drops: select * from ( select quote_date, lead(open) over (order by quote_date) purchase_price, 100/lead(open) over (order by quote_date) shares, (close-open)/open day_return, 100 purchase from sp_quotes ) where day_return < -0.02; We don't know how much we'll be spending each time. So we need to remove the shares and amount columns. To get these values we want to find all the transfers in between the current and previous falls. To make this easier, let's add the previous purchase date to this query.   We can do this in two passes. First we identify the days we want to buy. This is the day after each fall. So we need the next trading day. To get this, copy the lead() function from purchase_price. Just replace open with quote_date in lead().   The next pass finds the previous purchase date. We want to look backwards in our results to do this. The analytic lag() helps here. This is the opposite of lead(). It reverses down the result set instead of looking forward.   There's a small problem here though. For the first transaction, there's no prior purchase date. So lag() returns null.   To overcome this, we can make use of the optional parameters for lag(). The second parameter is the offset. This is how many rows to look back. We want to keep this at the default of one. The third is more interesting. This is the default value it returns if it doesn't find a row. For example, the first row in a result set has no previous. By providing 1 Jan 2005 for this we solve this issue.   Our query to find days we want to buy is then: select purchase_date, purchase_price, day_return, lag(purchase_date, 1, date'2005-01-01') over (order by quote_date) prev_purchase from ( select quote_date, lead(open) over (order by quote_date) purchase_price, lead(quote_date) over (order by quote_date) purchase_date, (close-open)/open day_return from sp_quotes ) where day_return < -0.02 We now have our two queries. One to calculate our transfers into the broker account. The other to find all the dates we plan to buy. To ease readability, let's place both of these in CTEs again. We'll call them transfer_dates and purchase_dates respectively.   Let's assume that transfers to the broker account arrive before the purchase. Therefore the total we spend is the sum of all deposits made after the previous (potential) purchase up to and including the current purchase date: select purchase_date, sum(amount) purchase, sum(amount) / max(purchase_price) shares from transfer_dates join purchase_dates on transfer_date <= purchase_date and transfer_date > prev_purchase group by purchase_date All that remains now is to calculate the value of our holdings and return. 
We can reuse the final_close query and formulas from the previous post to do this. The finished query is: with final_close as ( select close from sp_quotes where quote_date = date'2014-12-31' ), transfer_dates as ( select transfer_date, amount from ( select add_months(date'2005-01-01', rownum-1) transfer_date, 100 amount from dual connect by level <= 120 ) ), purchase_dates as ( select purchase_date, purchase_price, day_return, lag(purchase_date, 1, date'2005-01-01') over (order by quote_date) prev_purchase from ( select quote_date, lead(open) over (order by quote_date) purchase_price, lead(quote_date) over (order by quote_date) purchase_date, (close-open)/open day_return from sp_quotes ) where day_return < -0.02 ) select round( ((( sum(shares)*max(close) ) - sum(purchase) )/ sum(purchase) ) * 100, 2 ) pct_gain, to_char(sum(shares)*max(close), 'fm$999,999.00') value, sum(purchase) total_spend from ( select purchase_date, sum(amount) purchase, sum(amount) / max(purchase_price) shares, prev_purchase from transfer_dates join purchase_dates on transfer_date <= purchase_date and transfer_date > prev_purchase group by purchase_date , prev_purchase ) s, final_close; How much money does this make us? PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 56.72 $18,335.92 11700 Hmmm. That's the worst return so far!   Why is this?   Our first purchase was on 27 Feb 2007. At this point the index was near an all-time high. We spent $2,600 at this price. This was also our largest transaction. This means we spent the most at the market's peak. This is the opposite of what we want!   To make matters worse, we missed many buying chances near the bottom in 2008. There were months with multiple large daily falls. We could only buy after the first of these each month. So we weren't able to buy low. Even when we did, it was for relatively small amounts.   Clearly this method doesn't work. We're saving while the market is climbing. This can lead to us spending a lot when there's a tiny fall near the peak. Let's look at the second case. Here we'll only purchase $100 when the market falls. This avoids the "spend lots at the peak" issue. Our budget is still fixed. If our total purchases will be greater than our total deposits, we do nothing. To figure out whether we have any cash available, we need to keep a running total of the balance. For each transfer in date we add 100. When we buy we subtract 100 - unless this brings our balance below zero. If this happens, we do nothing.   For example, in the following table we have two transfers in (1 Oct and Nov). There's four possible purchase dates (3, 25 & 31 Oct and 1 Nov). We can't buy on 25 and 31 Oct though, because this will leave us with a negative balance. So we do nothing and the account stays at zero. DATE AMOUNT BALANCE ----------- ------ ------- 01 Oct 2011 100 200 03 Oct 2011 -100 100 25 Oct 2011 -100 0 31 Oct 2011 -100 0 01 Nov 2011 100 100 01 Nov 2011 -100 0 Calculating the balance like this using a spreadsheet is easy. Simply add the amount to the previous balance. Then return the greatest of this and zero. If date, amount and balance are the columns A, B and C respectively and row 1 is the transfer in on 1 Oct, the formula for cell C2 (bolded) is: =greatest(C1 + B2, 0) The general version of this for field C<N> is: =greatest(C<N-1> + B<N>, 0) At first glance this is hard to do in SQL. Oracle has something that mimics spreadsheet calculations however. The model clause.   
To use it, we want a table that looks like the one above. This has a series dates with positive amounts for deposits and negative for purchases. We can then calculate the balance using the model clause.   Our transfer_dates and purchase_dates queries are good starting points. Our purchase amounts are now fixed. So we can add an amount column to purchase_dates. This has the value minus one hundred.   We no longer want to join these queries however. We want to merge them instead. The set operator union is perfect for this: select transfer_date quote_date, amount, null purchase_price from transfer_dates union all select quote_date, amount, purchase_price from purchase_dates This gives the dates and the amounts. Now comes the fun part - using model to calculate the cash available!   First, we need to define how we're going to reference rows. We do this with the dimension by clause.   An easy way to do this is to assign each row a number. We can do this with another analytic function row_number(). We pass to this the order in which we want Oracle to assign the values.   We're working through our rows by date, from oldest to newest. Therefore we're ordering by date.   Next we need to consider what happens if a deposit and buy happen on the same day. We're assuming the monies arrive before the purchase. So we need to order by amount from highest to lowest.   This gives us: dimension by ( row_number() over (order by quote_date, amount desc) rn ) Next is the measures clause. This lists the columns we want our query to return. These are the date, amount, balance and purchase price. Just list these like so: measures ( quote_date, amount, 0 balance, purchase_price ) Note balance doesn't exist in our original tables. We need to define how Oracle assigns values to it. We do this in the rules clause.   To figure out what we need to put here, first let's restate our formula. For each row, the balance is the balance of the previous row plus the current amount. If this is less than zero, return zero: balance [ current row ] = greatest ( balance [ previous row ] + amount [ current row ] , 0 ) We want this rule to apply to all rows. The any wildcard enables this. So balance [current row] becomes balance [any].   Next we need to find previous balance and current amount. The cv() (current value) function helps here. Specifying column[ cv() ] returns the value of the column for the current row. So the current amount is amount [ cv() ].   You can also use this function to access values from previous rows. Do this by subtracting the number of rows to look back from cv(). We need the previous balance. So this is balance [ cv() - 1 ].   So our rules formula is: balance[ any ] = greatest ( balance[ cv()-1 ] + amount[ cv() ], 0 ) We're not quite finished. The first row has no previous balance. Therefore balance [ cv() - 1 ] is null. Adding anything to null returns null. So this will return null for all the balances!   To get around this, we need to return zero when the balance is null. We can use nvl() for this: balance[any] = greatest ( nvl( balance[cv()-1], 0 ) + amount[cv()], 0 ) We now have our running total. Next we need to figure out how many shares we're going to buy. This is the transaction amount divided by the share price.   We have the prices in the purchase_dates query. And we know we're spending $100. We can't simply divide 100 by these prices however. This gives us shares even if we have no money left!   We could calculate the number bought when the balance is greater than zero. 
This isn't right either though. Our balance formula works on the assumption the transaction happened. If we have $100 left a purchase brings our cash to zero.   What we want is to set the purchase price to null if buying would leave us owing money.   We can do this using the model clause again. It can overwrite values for existing columns. We want to do this for any row where the balance would be less than zero.   We do this by checking whether the balance calculation above (without the greatest function) is negative. If it is, set the price to null. Otherwise return it: purchase_price[any] = case when nvl(balance[cv()-1], 0) + amount[cv()] < 0 then null else purchase_price[cv()] end Finally, a bit of tidying up.   We've modeled purchases as negative amounts. Therefore our shares purchased and total spend will also be negative. We want these values to be positive.   An easy way to do this is return the absolute amounts. We can also do this in the model clause. Just set the amount to the abs() of its current value: amount[ any ] = abs( amount[ cv() ] ) The full model clause is then: model dimension by (row_number() over (order by quote_date, amount desc) rn) measures (quote_date, 0 balance, amount, purchase_price) rules ( balance[any] = greatest ( nvl( balance[cv()-1] , 0) + amount[cv()], 0 ), purchase_price[any] = case when nvl(balance[cv()-1], 0) + amount[cv()] < 0 then null else purchase_price[cv()] end, amount[any] = abs(amount[cv()]) ) So our complete query for the "buy $100 if the market falls 2% or more and we have money available" rule is: with final_close as ( select close from sp_quotes where quote_date = date'2014-12-31' ), transfer_dates as ( select transfer_date, amount from ( select add_months(date'2005-01-01', rownum-1) transfer_date, 100 amount from dual connect by level <= 120 ) ), purchase_dates as ( select quote_date, purchase_price, day_return, lag(quote_date, 1, date'2005-01-01') over (order by quote_date) prev_purchase, -100 amount from ( select quote_date, lead(open) over (order by quote_date) purchase_price, (close-open)/open day_return from sp_quotes ) where day_return < -0.02 ), dts as ( select transfer_date quote_date, amount, null purchase_price from transfer_dates union all select quote_date, amount, purchase_price from purchase_dates ) select round( ((( sum(shares)*max(close) ) - sum(purchase) )/ sum(purchase) ) * 100, 2 ) pct_gain, to_char(sum(shares)*max(close), 'fm$999,999.00') value, sum(purchase) total_spend from ( select quote_date, amount/purchase_price shares, case when purchase_price is not null then amount end purchase from ( select * from dts model dimension by (row_number() over (order by quote_date, amount desc) rn ) measures (quote_date, 0 balance, amount, purchase_price) rules ( balance[any] = greatest ( nvl(balance[cv()-1], 0) + amount[cv()], 0 ), purchase_price[any] = case when nvl(balance[cv()-1], 0) + amount[cv()] < 0 then null else purchase_price[cv()] end, amount[any] = abs(amount[cv()]) ) ) ), final_close; PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 83.15 $17,033.03 9300 The second best return for regular purchases. We've only spent three quarters of our available cash though. So we may be leaving money on the table.   We could fiddle with the percentage that triggers our purchase rule. Lowering it means we'll buy more often. We can tweak the trigger point until we find one that gives the best final value. 
We risk falling for the Texas sharpshooter fallacy however, making our model fit the data.   Instead, let's take a step back. We've only looked at a ten-year slice of prices. Let's increase our time horizon and ask:   How often have two percent daily drops happened in the S&P's history?   In doing so we'll find a bigger flaw in the 2% fall rule.   The current S&P 500 was founded on 4 March 1957. Let's load sp_quotes with all the prices from then until 31 Dec 2014. You can download these here.   That's nearly 58 years. If the past ten years were representative we would expect 600-700 daily falls of 2% or more.   Is this the case? select count(*) from sp_quotes where ((close-open)/open) < -0.02; COUNT(*) ---------- 291 Nowhere near! The past ten years are less than 20% of the full time period. These accounted for well over a third of the large daily falls however. The recent recession really has been the worst in a lifetime, at least in terms of stock market fluctuations.   Still, it's worth having a quick check how much money we would make. Just update the transfer_dates query to generate 694 rows, starting at 1 March 1957 and the default value for purchase_dates.prev_purchase to this as well. Do so and we find: PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 525.83 $182,116.99 29100 How does this compare to the simple "buy $100 when the market opens each month" method? PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 1261.8 $945,088.77 69400 Ouch. We'd nearly be millionaires with the dumb method. We're not even a fifth of the way there with the 2% approach.   We did fall for the Texas sharpshooter fallacy. We picked a technique and our data just happened to fit it.   There are a few lessons here: When looking for patterns, ensure your sample is representative of the full data set Ensure you have two datasets. One to come up with a hypothesis. Another to test it! Who needs spreadsheets when you have the model clause? SQL is the perfect tool for helping you analyze your data. It can only work with the data you provide it though. In the era of Big Data this mistake is common. It's important to validate your dataset before you draw any conclusions from it!   What do you think? Is there any hope for saving the 2% strategy? Are there any methods better than simply buying a fixed amount each month? How else could we calculate the gains?   I'd love to hear from you. Let us know in the comments!
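A quick way to act on the second lesson above (keep one dataset to form a hypothesis and another to test it) is to split sp_quotes into two non-overlapping slices: hunt for a rule in one, then see whether it still holds in the other. This is just a sketch and the cut-off date below is an arbitrary choice for illustration:

with training as (
  select * from sp_quotes where quote_date <  date'1990-01-01'
), testing as (
  select * from sp_quotes where quote_date >= date'1990-01-01'
)
select ( select count(*) from training where ((close-open)/open) < -0.02 ) training_falls,
       ( select count(*) from testing  where ((close-open)/open) < -0.02 ) testing_falls
from   dual;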

Finding the Longest Common Starting Substring Using SQL

A poster asked an interesting question on Ask Tom today. This amounted to: How can I return a distinct list of the longest common sequences of characters at the start of a set of strings? For example, given the following strings: emcpower1, emcpower2, emcpower3, sda, sdb, sdc. The longest sequences where there is at least one other row that starts the same are "emcpower" and "sd". So the output should be: emcpower and sd. How can you solve this using SQL? To figure out how to do this, it's useful to break it down. Here's a basic algorithm: For a given row, check whether any other rows start with the same character. If there are, check whether any match on the first two characters. As long as there's at least one other matching row, repeat this process. Each time add one more character to the comparison. Stop when there are no matches or you've reached the end of the string. A simple implementation is to loop through all the strings. For each of these, loop up to the number of characters within it, running the process above. The pseudo code to do this is: for strs in 1 .. number_of_strings loop for inx in 1 .. length(string(strs)) loop check_string := substr(full_string, 1, inx); if exists another row where check_string = substr(other_string, 1, inx) then continue; else exit; end if; end loop; end loop; SQL is a set-based language however. There is no loop operation. So how do you do it? First put the rows above in a table: create table strings as select 'emcpower1' str from dual union all select 'emcpower2' from dual union all select 'emcpower3' from dual union all select 'sda' from dual union all select 'sdb' from dual union all select 'sdc' from dual; Next create a set of all the strings you want to compare. These are all the leading substrings. For example, emcpower becomes: e, em, emc, emcp, emcpo, emcpow, emcpowe, emcpower To do this you can use a row generator. This should return as many rows as the length of the longest string. You can subtract one from this if you know the strings are unique. select rownum r from dual connect by level <= (select max(length(str)) from strings); Cross join this with the table of strings: with rws as ( select rownum r from dual connect by level <= (select max(length(str)) from strings)) select * from strings s1 cross join rws; This lists each string N times, where N is the length of the longest string. You only need to repeat each string up to its length however. So you should add a where clause, limiting the number to the length of each individual string. Using this list, create the set of all the starting substrings. Do this with substr(string, 1, n): with rws as ( select rownum r from dual connect by level <= (select max(length(str)) from strings)) select str, substr(str, 1, r) subs from strings s1 cross join rws where r < length(str); With this, you can check whether there's another row with the same starting characters. Exists comes in handy here. First you must ensure you're not comparing a row to itself! Next see if any other rows have the same beginning: with rws as ( select rownum r from dual connect by level <= (select max(length(str)) from strings)) select str, substr(str, 1, r) subs from strings s1 cross join rws where r < length(str) and exists ( select * from strings s2 where s1.str <> s2.str and substr(s1.str, 1, r) = substr(s2.str, 1, r) ); This returns a list of all the common strings. We just want the longest of these for each row.
To do this, group by string and return the max substring: with rws as ( select rownum r from dual connect by level <= (select max(length(str)) from strings)) select str, max(subs) s from ( select str, substr(str, 1, r) subs from strings s1 cross join rws where r < length(str) and exists ( select * from strings s2 where s1.str <> s2.str and substr(s1.str, 1, r) = substr(s2.str, 1, r) ) ) group by str;
STR       S
--------- --------
sda       sd
sdc       sd
sdb       sd
emcpower1 emcpower
emcpower2 emcpower
emcpower3 emcpower
All that now remains is to return the distinct list of the substrings: with rws as ( select rownum r from dual connect by level <= (select max(length(str)) from strings)) select distinct s from ( select str, max(subs) s from ( select str, substr(str, 1, r) subs from strings s1 cross join rws where r < length(str) and exists ( select * from strings s2 where s1.str <> s2.str and substr(s1.str, 1, r) = substr(s2.str, 1, r) ) ) group by str );
S
--------
emcpower
sd
There's a possible problem with this solution however. It returns all the common starting strings. If we add emcdisk and samba to the original set, the query now returns:
S
--------
emc
emcpower
s
sd
This is what the original poster wanted. This may not be what you need if you're doing this though. You need to work with the data consumers to know what they want. Should it be the output above, the two shortest where there's a match (i.e. emc and s), or the original output? If you're looking to solve a similar problem then make sure you've handled the edge cases! Can you think of other ways to solve this problem in SQL? Are there better ways to write the query? What about the general version of this problem: find the longest common substrings anywhere within the strings. Can you provide a solution for that? If you have other approaches let us know in the comments. No access to a database? No problem. You can view and run these scripts in your browser using LiveSQL.
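To reproduce the extended emcdisk and samba example above, you can add the two extra rows and re-run the final query. A quick sketch:

insert into strings values ('emcdisk');
insert into strings values ('samba');
commit;

-- re-running the distinct query above now returns emc, emcpower, s and sd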

Analytical SQL

Calculate Stock Market Returns Using SQL

What's the best strategy for investing in the stock market? Simple - buy low and sell high. There's a big catch to this method. We don't know in advance what the highs and lows are! Only by looking back can we know the best times to buy and sell.   So which approaches can we use? And, more importantly, how much money can we make by following them?   In this post we'll use SQL to calculate stock market returns for different purchasing strategies. The purpose is to show how to write the queries. We'll see if we can learn anything about investing in the process.   Before we start, let's make some assumptions. I doubt people's ability to beat the market by selecting stocks. So we'll simulate returns by buying a (hypothetical) S&P 500 index tracker. To keep it simple, this charges no fees and matches the index exactly (if only such a fund existed!).   Let's assume we have $100/month available to invest over ten years. This gives $12,000 in total. We'll only purchase shares in this period (no selling).   Unfortunately my crystal ball is (still) broken. So I'll use quotes for the past ten years (1 Jan 2005 - 31 Dec 2014) for comparisons.   If you want to play along, create the following table: create table sp_quotes ( quote_date date primary key, open number(12,6), high number(12,6), low number(12,6), close number(12,6), volume number(12,0), adj_close number(12,6) ); Then load it with daily quotes for the S&P 500 (^GSPC) from 1 Jan 2005 - 31 Dec 2014. You can download these here.   Finally, I'm going to calculate returns using the total return formula: (End Value - Initial Value) / Initial Value To benchmark strategies let's first calculate the returns for two scenarios: Invest all $12,000 when the market first opens in January 2005. Make no further transactions. Make one purchase of $1,200 of shares each year. Do this on the lowest point for the year. First up, what's the return if we had invested all our money when the market opened on 3 Jan 2005 (it was closed on the first and second) until it closed on 31 Dec 2014?   We can find this by getting the opening price on 3 Jan 2005 and the closing price at the end of 2014. I know that the S&P closed much higher than it opened in this period. So we can plug in the maximum closing price as the end value and the minimum opening price as the initial value in the formula above.   This gives the following query: select round((max(close) - min(open)) / min(open) * 100, 2) pct_gain, to_char((1 + (max(close) - min(open)) / min(open)) * 12000, 'FM$999,999.00') total from sp_quotes where quote_date in (date'2005-01-03', date'2014-12-31'); Note - this query only gives the correct answer if we know that the market rose between the two dates we provide!   How much money did we make? PCT_GAIN TOTAL ---------- ------------ 69.89 $20,386.49 A gain just short of $8,400. A decent return considering this includes the crash of 2008-09 and one of the worst recessions ever.   This strategy works on the assumption we have a (large) lump sum to invest. For most of us this isn't the case. We can only drip-feed money in. So we want a regular investment strategy.   Let's look at the returns from a (hypothetical) "timing the market" method.   If we're lucky enough to buy shares at their cheapest each year we're "buying low". Let's calculate our gains if we make a single $1,200 purchase each year when the S&P is at its lowest that year.   To do this we need to group all the dates within a year together. We can easily do this using the function trunc(<date>, 'y'). 
This converts the supplied date to the 1 Jan in the same year. To find the cheapest price, group by this function and return the minimum low: select trunc(quote_date, 'y') y, min(low) from sp_quotes group by trunc(quote_date, 'y') order by trunc(quote_date, 'y'); Next we need to calculate how many shares this purchases. Do this by dividing $1,200 by each min(low). Summing this up gives our total holding.   To find the final value of our shares, just multiply our total shares by the closing price on the last day.   We can find the final closing price with the following query: select close from sp_quotes where quote_date = date'2014-12-31'; I'm building the full query using the with clause (aka subquery factoring or common table expressions (CTEs)). This enables us to define "views" on-the-fly. We can then reference them the main query. This can improve the maintenance and readability of SQL. Each of the two queries above will go in their own CTE.   Putting it all together gives us: with final_close as ( select close from sp_quotes where quote_date = date'2014-12-31' ), share_purchases as ( select trunc(quote_date, 'y') y, 1200 / min(low) shares, 1200 purchase from sp_quotes group by trunc(quote_date, 'y') ) select round( ((( sum(shares)*max(close) )- sum(purchase) )/ sum(purchase) ) * 100, 2 ) pct_gain, to_char(sum(shares)*max(close), 'fm$999,999.00') value, sum(purchase) total_spend from share_purchases, final_close; PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 90.93 $22,911.25 12000 Much better than the simple buy-and-hold approach!   Unfortunately, we can only know the cheapest price in a year once it's over. By this time it's too late to buy shares at that price!   These two strategies give us reference points. Now let's a take a look at simple passive investing methods we can actually follow. First, buying $100 at the start of each month. Using this approach, we'll make our purchase when the market opens on the first trading day of each month.   To find the first day of each month we can use trunc again. If we pass 'mm' instead of 'y', it converts dates to the first of the corresponding month. So to return the quotes on these days, join the quote date to the result of month truncing, like so: where quote_date = trunc(quote_date, 'mm') Again, we need to calculate how many shares we buy. This is 100 divided by the opening price. To confirm how much we're spending, I've also included a purchase column fixed at 100. This gives the following query. select quote_date, 100 / open shares, 100 purchase from sp_quotes where quote_date = trunc(quote_date, 'mm') At this point we can realize a benefit of using the with clause. The query to calculate returns for monthly purchases is similar to the one for yearly buying. We just need to change how many shares we are buying when. To do this, we just replace share_purchases with the query above.   To validate our expenditure, I've also add the sum of the purchase column. 
This gives: with final_close as ( select close from sp_quotes where quote_date = date'2014-12-31' ), share_purchases as ( select quote_date, 100 / open shares, 100 purchase from sp_quotes where quote_date = trunc(quote_date, 'mm') ) select round( ((( sum(shares)*max(close) )- sum(purchase) )/ sum(purchase) ) * 100, 2 ) pct_gain, to_char(sum(shares)*max(close), 'fm$999,999.00') value, sum(purchase) total_spend from share_purchases, final_close; PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 59.57 $12,286.57 7700 Hmmm. We've only spent $7,700. Why is that?   The index is closed at weekends and public holidays. We need to find the first day the market was open each month. This is the minimum date within the month, not the first. We can find the opening days grouping by trunc(quote_date, 'mm') and finding the min(quote_date): select min(quote_date) from sp_quotes group by trunc(quote_date, 'mm') To get the result we want, replace the where clause for share_purchases with this subquery: with final_close as ( select close from sp_quotes where quote_date = date'2014-12-31' ), share_purchases as ( select quote_date, 100 / open shares, 100 purchase from sp_quotes where quote_date in (select min(quote_date) from sp_quotes group by trunc(quote_date, 'mm')) ) select round( ((( sum(shares)*max(close) )- sum(purchase) )/ sum(purchase) ) * 100, 2 ) pct_gain, to_char(sum(shares)*max(close), 'fm$999,999.00') value, sum(purchase) total_spend from share_purchases, final_close; PCT_GAIN VALUE TOTAL_SPEND ---------- ------------ --------------------------------------- 59.01 $19,081.15 12000 Now we've spent our whole budget. Our gains are notably worse than both methods we described previously however. It's ten percentage points worse than buy-and-hold (69.89) and thirty worse than timing the market (90.93).   Are there any other simple, regular investment approaches we could try?   We can't know at the time if the price on a given day is the lowest the market will be that week, month or year. We can identify local minimums though. If the market closes lower than it opens it's a daily low. If this fall is great enough it may also be a weekly or even monthly low.   What if we create a rule to buy the day after the index falls 2% or more?   Sticking with the total return formula, we can identify these days using the following equation: ((close-open)/open) < -0.02 First, a sanity check for how viable this approach is. We've assumed 120 rounds of $100 purchases. So we should check how often there is a two percent drop in a day. If there's significantly more than 120, we can't afford it. Notably less and we're leaving too much money on the table.   To find the number of days, just count how many rows meet the formula: select count(*) from sp_quotes where ((close-open)/open) < -0.02; COUNT(*) ---------- 109 Eleven off. Close enough.   We can use the previous query to identify the days where there's a 2% or more drop. Next we need to find how much we would pay following this rule.   To find the daily fall, we have to wait until the market closes. Therefore we'll make our purchases the day after the S&P drops. This means we need to find the opening price for the next day (tomorrow's opening can be different from today's close).   We can use analytic functions to do this. Lead() enables us to fetch values from the next row in the results. We pass this the column we want values for. We want the opening price, so we're looking for lead(open). 
Next we need to tell Oracle which order to traverse the results. We do this with an over() clause. This takes an order by clause including the column we want to sort our data by. Here this is quote_date, giving: lead(open) over (order by quote_date) We can then add this to our select statement.   It's tempting to replace count(*) in our earlier query with this function and the other values we want like so: select quote_date, lead(open) over (order by quote_date) purchase_price, 100/lead(open) over (order by quote_date) shares, (close-open)/open day_return, 100 purchase from sp_quotes where (close-open)/open < -0.02 This gives incorrect results however.   The where clause is processed before analytics. So in the query above, lead() only applies to rows where there's a 2% or more fall instead of all the rows. This means lead returns the opening price on the next day there's a big drop, not the day immediately after a fall!   This isn't what we want. We need the opening price on the next trading day. So we need to place the query above inside an inline view. The where clause goes outside this view: select * from ( select quote_date, lead(open) over (order by quote_date) purchase_price, 100/lead(open) over (order by quote_date) shares, (close-open)/open day_return, 100 purchase from sp_quotes ) where day_return < -0.02; As before, we can now replace share_purchases in the full query. With this in place it's time to find our gains: PCT_GAIN VALUE TOTAL_SPEND -------- ------------ --------------------------------------- 91.27 $20,848.08 10900 More than 91%! This is the best we've done so far. It works with information we have available and we have $1,100 left over :)   This looks like a winning strategy. It's easy to do. Potentially we could automate the whole process. There's no guarantee it'll work in the future. Based on this analysis it's tempting to try however.   Before you rush out to do this we need to analyze this further however. There are big problems with this approach. We'll look at these in the next post. You can read this here.   *** Disclaimer. This post is intended to explore and understand use cases for SQL. It is not financial advice. Do not base your financial decisions on the musings of a SQL advocate! ***
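For completeness, here is how the assembled query for the 2% rule could look, with the share_purchases CTE swapped for the inline-view query above. This is a sketch built purely from the pieces already shown:

with final_close as (
  select close from sp_quotes
  where  quote_date = date'2014-12-31'
), share_purchases as (
  select * from (
    select quote_date,
           lead(open) over (order by quote_date) purchase_price,
           100/lead(open) over (order by quote_date) shares,
           (close-open)/open day_return,
           100 purchase
    from   sp_quotes
  )
  where  day_return < -0.02
)
select round( ((( sum(shares)*max(close) ) - sum(purchase) ) / sum(purchase) ) * 100, 2 ) pct_gain,
       to_char(sum(shares)*max(close), 'fm$999,999.00') value,
       sum(purchase) total_spend
from   share_purchases, final_close;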

What Is SQL Injection and How to Stop It

Data breaches are a common occurrence. As described in The Securing Account Details story, SQL injection is a simple way to access data from exposed sites. How easy is SQL injection and how bad can it be? In this post we'll take a look at how it's possible. We'll see how easy it is to access information from a database that is vulnerable to SQL injection. We'll finish up by showing how you can prevent it. Let's start with a simple HTML form. It accepts an email address. It passes this to a Java servlet. This looks up subscription details for the email. The form is: <body> <form action='HelloInjection'> <input type='text' name='emailAddress' value=''/> <input type='submit' value='Submit'/></form></body> The following Java code processes the values: protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { PrintWriter out = response.getWriter(); Connection conn = null; try { Class.forName ("oracle.jdbc.driver.OracleDriver"); String url = "jdbc:oracle:thin:@//db_server:1521/service_name"; String userName = "XXXX"; String password = "XXXX"; conn = DriverManager.getConnection(url, userName, password); String email = request.getParameter("emailAddress"); Statement st = conn.createStatement(); String query= "select * from email_subscriptions where email_address = '" + email + "'"; out.println("Query : " + query + "<br/>"); ResultSet res = st.executeQuery(query); out.println("Results:" + "<br/>"); while (res.next()) { String s = res.getString("email_address"); out.println(s + "<br/>"); } } catch (Exception e) { e.printStackTrace(); } } The key part of this is the query: String query= "select * from email_subscriptions where email_address = '" + email + "'"; This takes whatever the user inputs and concatenates it to the string. For example, if we submit chris.saxon@oracle.com, the query becomes: select * from email_subscriptions where email_address = 'chris.saxon@oracle.com' What's so bad about this? We can supply whatever we want to go between the quotes. This enables us to add more conditions to the where clause. To do this, just close the quotes. Then add our additional criteria. For example, if we submit the following: ' or 1='1 This query becomes: select * from email_subscriptions where email_address = '' or 1='1' The expression 1='1' is always true. True or anything is true. Therefore this query returns all the email addresses stored in the table! The code loops through all the results, displaying them all. This means anyone can get their hands on the table's contents. If it contains sensitive or private data we're in trouble. But, you may say, our application only uses string concatenation for queries against public or non-sensitive data. For example, lookup tables such as countries and currencies. It can't be that bad, can it? Yes it can. Remember that you can union queries together. With this simple operation, we can view the contents of any table the database user has access to! What if the attacker doesn't know the names of our tables? That's not a problem. Just submit the following: ' union all select table_name from all_tables where 1 = '1 And the query becomes (formatted for clarity): select * from email_subscriptions where email_address = '' union all select table_name from all_tables where 1 = '1 Hmmm. This returns all the tables the user has access to. Using a similar query, we can also find all the columns they can access.
Armed with this information a hacker could union the original query with any other table, potentially leaving your whole database exposed. So how can we stop this? Simple. Use bind variables. To do this in the Java code above, change the Statement to a PreparedStatement. Then modify the query string to have a question mark (?) instead of concatenating the email address into it. Finally set the value of the variable to the supplied email address. Putting this all together gives: String query= "select * from email_subscriptions where email_address = ?"; PreparedStatement st = conn.prepareStatement(query); st.setString(1, email); ResultSet res = st.executeQuery(); Now, regardless of what someone types in the form, the query will always be: select * from email_subscriptions where email_address = ? Hacking attempts such as ' or 1='1 now return no results. Going further ensure your application user and table owner are different database users. Then follow the principle of least privilege to ensure that application users only have access to the tables they need. This limits what hackers can see if they do manage to find a loophole. Even better, build PL/SQL APIs to fetch and modify data. Grant application users permissions to just these APIs. With no direct access to tables, you've further decreased the risk of people accessing something they shouldn't. Above all: to ensure your data is safe, use bind variables! Image from Togo picture gallery maintained by Database Center for Life Science (DBCLS) CC-BY-3.0
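As a rough illustration of the PL/SQL API approach, here's a minimal sketch of a lookup function for the email_subscriptions table used above (the function name and return style are just for illustration). The application account would be granted execute on this function only, with no direct select privilege on the table, and the email value is always passed as a bind variable:

create or replace function get_subscription (
  p_email email_subscriptions.email_address%type
) return sys_refcursor
as
  l_result sys_refcursor;
begin
  -- p_email is bound into the query; it is never concatenated into the SQL text
  open l_result for
    select email_address
    from   email_subscriptions
    where  email_address = p_email;
  return l_result;
end get_subscription;
/

With this in place, input such as ' or 1='1 is treated as an (unknown) email address rather than extra SQL.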

Last week on AskTom

It's been an interesting week on AskTom. Here are a couple of questions that seem like odd scenarios, but ultimately (as is often the case) have simple and straightforward explanations. 1) Different object counts after import Neil posted a question where he had performed an export in version 9i of the database and imported it into 11g as part of an upgrade process. What was interesting was that as part of this exercise, the number of indexes in the target system had increased. It was as though the import had auto-magically created more objects in the database. How could this be? After some further exploration, the solution became apparent. There were Text indexes in the schema. When a Text index is created, several other objects are created to support the usage of the index. You can see these tables in your schema, prefixed with DR$. More accurately, you can query USER_TABLES with SECONDARY = 'Y' to see those tables that Oracle created on your behalf (see the query below). The internal structure of these DR$ tables changed between 9i and 11g. Hence during import, when the import process encountered the task: create index MY_INDEX on MY_TABLE ( MY_COL ) indextype is ctxsys.context the number of secondary objects created was different. See the original post here. 2) Predicates in ANSI joins Giuseppe presented two simple tables and asked about ANSI outer joins. SQL> select * from A; COL_A ---------- 1 2 3 4 4 rows selected. SQL> select * from B; COL_B ---------- 1 2 3 5 4 rows selected. Giuseppe asked why what he thought was a standard outer join was not returning the unmatched rows: SQL> select a.col_a, b.col_b 2 from a left outer join b 3 on a.col_a=b.col_b 4 where b.col_b <> 5; COL_A COL_B ---------- ---------- 1 1 2 2 3 3 3 rows selected. With ANSI syntax there is a subtle difference in the way you specify your predicates. They are either "part" of the join, or applied "after" the join. What Giuseppe probably wanted was: SQL> select a.col_a, b.col_b 2 from a left outer join b 3 on a.col_a=b.col_b 4 and b.col_b <> 5; COL_A COL_B ---------- ---------- 1 1 2 2 3 3 4 4 rows selected. See the original post here.
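If you want to see these secondary objects in your own schema, a query along these lines should do it (the DR$ tables Oracle Text creates are flagged with SECONDARY = 'Y'):

select table_name
from   user_tables
where  secondary = 'Y';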

#AskSQL Twitter Q&A Recap

Last Wednesday Connor and myself hosted a Twitter Q&A with the hashtag #AskSQL. In case you missed it, here's some of the highlights: Ask Tom @OracleDatabase Here's a question to kick the #AskSQL Q&A off. You recently took over AskTom.oracle.com - how's that been going? @chrisrsaxon - .@OracleDatabase Great! It's an amazing honour to be answering on Ask Tom. @connor_mc_d - #asksql Its challenging knowing that its Tom we're taking over from, but so far its been great fun and very interesting @chrisrsaxon - .@connor_mc_d I still find it hard to believe I'm answering there. Ask Tom was where I learned about Oracle back in 2003... :) #asksql @connor_mc_d - #asksql Tom started helping me (&veveryone) way back in the late 90's on comp.databases.oracle .server. So its been at least 20yrs of help @OracleDatabase Are there best practice tips for submitting to AskTom.oracle.com? #AskSQL @chrisrsaxon - .@OracleDatabase Funny you should ask :) I have a post today discussing what makes a good/bad q blogs.oracle.com/sql/entry/ask_… #asksql @connor_mc_d - .@OracleDatabase Priority #1 - if possible, give us a test case, ie, a set of scripts we can run to replicate the problem #asksql @chrisrsaxon - .@connor_mc_d @OracleDatabase This is key. Without it we can spend ages trying to re-create your issue without success #asksql @connor_mc_d - .@OracleDatabase Tom used to often refer people to this picture goo.gl/8AONm4 But basically its "help us to help you" #asksql Performance @OracleDBDev What is the most common mistake in SQL that degrades performance? #asksql @connor_mc_d - @OracleDBDev Expressions in predicates that potentially disable index usage is one we see all the time #asksql @connor_mc_d - @OracleDBDev And a particular example of that is when datatype mismatches in SQL, eg where STRING_COL = 123 #AskSQL @PrasannaPeshkar Can you please tell important performance tip for faster SQL queries. #asksql @chrisrsaxon - .@PrasannaPeshkar Learn how to read and understand execution plans. These show you what your query is actually doing... #asksql @chrisrsaxon - .@PrasannaPeshkar ... and therefore what you need to do to fix it #asksql @connor_mc_d - .@PrasannaPeshkar Google for gather_plan_statistics hint. Its an awesome way to see *where* to focus #asksql @chrisrsaxon - .@connor_mc_d @PrasannaPeshkar Or use #sqldev ;) blogs.oracle.com/sql/entry/quer… #asksql Model Clause @galobalda In your opinion, what are the use cases where the model clause shines when compared to other sql constructs? #asksql @chrisrsaxon - .@galobalda I think of it as like having spreadsheet abilities in SQL, enabling you to get vals from other cols and rows #asksql @connor_mc_d - #asksql 1/2 MODEL clause is perhaps over complex in that it can do *so* much, but I think its best use is as getting data back into the db @connor_mc_d - #asksql 2/2 since you can take data in 1000's of spreadsheets, get it back into the db where it belongs, and use MODEL to mimic the formulas @connor_mc_d - #asksql You see lots of "quirky" uses of MODEL (eg solving sudoku) but that doesnt mean there are not still genuine use cases @connor_mc_d - #asksql or as Rob did, you can use the MODEL clause for art work :-) goo.gl/u1KNQm @chrisrsaxon - .@galobalda I also used it in a recent AskTom q about transposing rows and cols asktom.oracle.com/pls/apex/f?p=1… #asksql Case vs Decode @PrasannaPeshkar CASE or IF or DECODE ? What's your opinion on this ? #asksql @chrisrsaxon - .@PrasannaPeshkar Well if doesn't work in SQL. 
Generally I prefer CASE - it's easier to understand #asksql @connor_mc_d - .@PrasannaPeshkar I use decode a lot, but prefer CASE. More understandable to most devs (eg non oracle), and seamless SQL => PLSQL #asksql Favourite Topics @thatjeffsmith . @connor_mc_d @chrisrsaxon if 'binds' was @OracleAskTom's goto answer or favorite topic, what are yours? #asksql @connor_mc_d - @thatjeffsmith for me, I love solutions with analytics - although I think 12c pattern matching may take over from that as I learn it #AskSql @chrisrsaxon - .@thatjeffsmith @connor_mc_d agree with Connor on analytics. I really like temporal validity in 12c #asksql 12c @OracleDatabase Top 3 favorite #DB12c features? #AskSQL #shamelessplug @connor_mc_d - @OracleDatabase 1) in-memory 2) better defaults 3) pattern matching #asksql @connor_mc_d - @OracleDatabase People have been waiting for "col default sequence.nextval" since 1992 :-) Now we have it #asksql @chrisrsaxon - .@OracleDatabase I've already said temp validity :) For the other two I think In-Memory is great and the EBR enhancements #asksql @connor_mc_d - @chrisrsaxon @OracleDatabase Agreed. EBR is a hidden gem that people so take a fresh look at in 12c #asksql The Future of SQL @galobalda What SQL enhancements would you like to see in future Oracle versions? #asksql @chrisrsaxon - .@galobalda Better support for cross-row and cross-table constraints. #asksql @chrisrsaxon - .@galobalda e.g. extending temporal validity so you can ensure date ranges don't overlap #asksql @connor_mc_d - @chrisrsaxon @galobalda Agreed. The concept of declarative assertions would be huge #asksql @connor_mc_d - .@galobalda Not particular features as such, but SQL as a means to query any and all data (relational, big data etc etc). #asksql If you want to see all the tweets, just search for #AskSQL. Thanks to everyone who took part, it was great chatting with you! If you missed out you can wait for our next session, put your questions to the panel at OOW15 or use the comments form on this blog. We look forward to hearing from you.
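To expand on the gather_plan_statistics tip above, the usual recipe is to add the hint to your query, run it, then ask dbms_xplan for the actual row counts from the cursor cache. A quick sketch, using a made-up orders table:

select /*+ gather_plan_statistics */ count(*)
from   orders
where  status = 'ACTIVE';

select *
from   table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));

Comparing the estimated and actual rows for each plan step shows where the optimizer's guesses are off and where to focus your tuning.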

Ask Tom Questions: the Good, the Bad and the Ugly

It's one week since we launched the new version of Ask Tom! It's been an exciting time. It's clear that even with the rise of various forums since Ask Tom started, this is still the go-to site for many people. So far we've received 81 questions and a similar number of follow-ups. These have covered a whole range of topics including SQL manipulation, character sets, smart flash cache and careers using Oracle technology. People are certainly putting Connor and myself to the test! The quality of the questions has also varied from the great to the, well, not so great. In this post we'll look at some examples of good and bad questions people have posted. Before we start, it's worth restating the value of good quality questions. If you're going to ask a something, it's important that you put some effort in to making it high standard. Why? For your benefit! Without a clear explanation of your problem and the output you expect the answer we give may not be useful to you. So it's in your interests to make your question as clear as possible. Including examples helps. So what makes a question good or bad? Let's look at some examples: The Good Here's a good example asking how to create a SQL query: "Suppose we have a table that has two columns: a start and stop date. What would be the best way to list the year||months between the two dates in a single row, using a sub-query to select from this table and preferably using a "CONNECT BY" clause instead of the all_objects table? I have tried using the all_objects table however the cost of the query was extremely high (besides the fact I was not fully successful in getting the complete result set I had wished for). ie: CREATE TABLE TEST1 AS (SELECT SYSDATE AS START_DATE, SYSDATE AS STOP_DATE FROM DUAL WHERE 1 = 2); INSERT INTO TEST1 (START_DATE, STOP_DATE) VALUES (to_date('01032015','MMDDYYYY'),to_date('01152015','MMDDYYYY') ); INSERT INTO TEST1 (START_DATE, STOP_DATE) VALUES (to_date('04092015','MMDDYYYY'),to_date('06122015','MMDDYYYY') ); INSERT INTO TEST1 (START_DATE, STOP_DATE) VALUES (to_date('10152015','MMDDYYYY'),to_date('03182016','MMDDYYYY') ); In such a case, the output should be: RESULTSET 201501 201504 201505 201506 201510 201511 201512 201601 201602 201603"   The question includes all the statements we need to run to create the test case. It also shows the desired output. It's clear what we need to do. We can spend all our time figuring out and writing the SQL query for the answer.   Note that there's still some room for improvement with this question. It's not clear what should happen if different rows have overlapping ranges. Therefore our answer may be correct for the data shown, but mishandle overlaps in some of the asker's scenarios. It's easy to make mistakes when working with time intervals. James Allen defined 13 relations between them. Please be as precise as possible regarding how you want these relations handled. This helps ensure our solution works in all situations for you.   The Bad Now we've seen a good question, what do the bad ones look like? We've had several short questions. For example: "ORACLE error 8103 in FDPSTP Cause: FDPSTP failed due to ORA-08103: object no longer exists" That's it, just an error message! This means we need to do a lot of guessing to figure out what response the asker is looking for. An explanation of the error? How to fix it? We have to guess. In general questions that are only a couple of sentences long have too little information for us to give a meaningful answer. Help us help you! 
Please spend a couple of extra minutes fully defining your problem. Including an actual question is a good start :) Here's another one that's difficult to answer: "I am not able to understand one thing, I have two test databases, and separate hardware. It is a Server class installation and Enterprise addition, with all default optimizer settings. In one database say db1, when I am running dbms sqltune advisory it says a better execution plan exists and suggests me to accept the new plan, and when I am excepting the new plan, the response time gets reduced to half. In the second database the advisory says no better plan is there for the same query. and the costs are different in both the databases. Could you please help." It's common for Oracle to generate a different execution plan for a query when you run it in a new environment. Many, many factors that influence the optimizer. As this question stands we have few details to go on. Therefore any answer we give is guessing at the cause. Performance related questions are common. Lack of detail in these questions is also common! For us to answer them we need as much information as possible. At a minimum ensure you've included: The query Its execution plan Details about the tables involved, including any indexes and constraints on them (ideally in the form of create table statements) Stats on the relevant columns (number of rows, distinct values, etc.) Including these details goes a long way to ensuring that we are able to help you! The Ugly Unfortunately, the question area width is currently limited. This means that execution plans are displayed wrapped. This can make them hard to read. For example, here's part of one posted: | 0 | SELECT STATEMENT | | 33 | 1023 | 148K (1)| 00:00:06 | | | | | 1 | PX COORDINATOR | | | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10001 | 33 | 1023 | 148K (1)| 00:00:06 | Q1,01 | P->S | QC (RAND) | | 3 | HASH GROUP BY | | 33 | 1023 | 148K (1)| 00:00:06 | Q1,01 | PCWP | | | 4 | PX RECEIVE | | 33 | 1023 | 148K (1)| 00:00:06 | Q1,01 | PCWP | | | 5 | PX SEND HASH | :TQ10000 | 33 | 1023 | 148K (1)| 00:00:06 | Q1,00 | P->P | HASH | Some people have reached out to us asking for a fix. We're finding this annoying too! We've got this enhancement logged. On this note, we're collecting ideas for improvements to the site. If you have any enhancements you'd like to see, then head to the Ask Tom Feedback page on OTN. Here you can log new proposals and vote on submissions others have made. We'd love to hear your ideas. We'll review these periodically and use them to guide future developments on Ask Tom. If you have any general thoughts on the revised site and new answers team, please let us know in the comments!
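Coming back to the "Good" example above, here's one possible answer. It's a sketch only, and it uses a recursive WITH clause (available from 11.2) rather than the CONNECT BY the asker suggested, since that sidesteps the usual tricks needed to generate rows per source row:

with months (start_date, stop_date, mth) as (
  select start_date, stop_date, trunc(start_date, 'mm') from test1
  union all
  select start_date, stop_date, add_months(mth, 1)
  from   months
  where  add_months(mth, 1) <= stop_date
)
select to_char(mth, 'yyyymm') resultset
from   months
order  by 1;

As written it would return duplicate months if the date ranges overlap, which brings us back to the point above: the asker would need to tell us how overlaps should be handled.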

SQL Experts Panel at OpenWorld

Got SQL questions? Get them logged here, then come along to our SQL Panel at Oracle OpenWorld! We are holding a panel session exclusively dedicated to the amazing SQL language: SQL for Functionality, Performance and Maintainability, and its future. We often think of SQL solely in terms of how to tune it. And whilst it is true that SQL performance is a key factor, SQL is also so much more than that. SQL is at the heart of every enterprise running on Oracle Database, so it is critical that our application's SQL statements don't just run fast, but are also functionally correct, maintainable, and robust enough to handle the continuing evolution of our applications and database systems. And we're here to help. Join a panel of SQL experts in this session for a discussion (and lots of Q&A) on not just how to optimize your SQL statements, but also what features of SQL will help you, how to write robust, maintainable SQL, and the panel's thoughts on the future of SQL in the ever-changing data landscape. We've picked a panel with a huge breadth of knowledge and expertise across the whole portfolio of power that SQL offers. The panel will consist of: John Clarke, Software Development Director, from the renowned Real World Performance Group; Keith Laker, Senior Principal Product Manager, who can answer your questions on the incredible analytical SQL functionality; Mohamed Zait, Optimizer Development Manager and Architect, who can assist you with your optimizer questions; and Connor McDonald, former Ace Director, now in Oracle's SQL Advocate team, and Tim Hall, Ace Director, both of whom come to the panel to share their many years of direct Oracle customer experiences. Don't miss your chance to pose your questions to the panel. You will be able to write your question during the session itself (cards will be provided), but the best chance of getting your question in front of the panel is to post it right here in the blog as a comment. The panel will try to cover as many questions as we can in the time available. The session will be in Moscone South Room 306, Monday October 26th at 4pm, but in the meantime - don't be shy; the best questions relate to specific topics of interest to you, the audience. As long as they relate to SQL functionality, performance, features or the future of SQL, we'll gladly share our thoughts and knowledge to answer as many of your questions as we can and help you succeed. And just like there is more than one way to write a SQL statement, there might be more than one opinion on your question - so don't be surprised if our experts don't always agree on their answers. There is no wrong or right here. We love a discussion! It educates us all. See you at Oracle OpenWorld.

Query Optimization

Improve SQL Query Performance by Using Bind Variables

The Securing Account Details story in the previous blog post highlighted two problems with SQL that uses literals instead of bind variables: decreased performance and security holes. In this post we'll investigate the impact on execution times in more detail. Why do bind variables matter for performance? Before Oracle runs a SQL statement it checks it's valid and determines how to access the tables and join them together. This is called parsing. The optimizer has the task of figuring out which table access and join methods to use. This produces an execution plan. When Oracle sends a statement to the optimizer to do this it's called a hard parse. If a plan already exists for a query, Oracle doesn't need to go through the optimization process again. It can reuse the existing plan. This is referred to as soft parsing. Oracle considers even minor formatting differences between two SQL queries to be different statements. For example, Oracle sees: select * from orders where order_id = :ord and SELECT * FROM orders WHERE order_id = :ord as different statements, even though it's obvious (to us) that these are equivalent. Therefore it hard parses the first execution of both. Where do bind variables come into this? When you're querying a table by primary key the execution plan will invariably be: Perform a unique scan of the primary key index Access the corresponding row in the table using the rowid found in the index This applies whatever value you're supplying for the primary key. For example, we would expect both of the following queries to use the plan outlined above (assuming order_id is the primary key): select * from orders where order_id = 1234; select * from orders where order_id = 9876; The texts of these two are different however. Therefore both require a hard parse. This applies even if Oracle has already generated the plan for one of them. This is a bit like opening your favorite mapping application to find your route to work just because you're using a different car. In the vast majority of cases this is wasted effort. The brand and model of car has no impact on the direction you take. To avoid the extra hard parse, we need to replace the numbers in the statements above with a bind variable. This gives a single statement: select * from orders where order_id = :ord; If we run this query twice in succession with different values for :ord there's one hard parse and one soft parse. What's the impact of hard parsing? To answer this, let's look at some numbers. Let's take a basic orders table: create table orders ( order_id integer not null primary key , customer_id integer not null , order_datetime date not null , status varchar2(10) not null ); We'll query this by primary key 50,000 times. Each query will compare order_id to a different value. We'll take two approaches. One will build the SQL string dynamically, so each query is different. The other will use bind variables. Both will use execute immediate. /* No bind variables */ declare order_rec orders%rowtype; begin for i in 1 .. 50000 loop begin execute immediate 'select * from orders where order_id = ' || i into order_rec; exception when no_data_found then null; end; end loop; end; / PL/SQL procedure successfully completed. Elapsed: 00:01:33.94 /* Bind variables */ declare order_rec orders%rowtype; begin for i in 1 .. 50000 loop begin execute immediate 'select * from orders where order_id = :i' into order_rec using i; exception when no_data_found then null; end; end loop; end; / PL/SQL procedure successfully completed. 
Elapsed: 00:00:03.49 Ninety seconds down to less than four! The overhead for each individual execution is small (around additional 2ms). When we're executing queries thousands of times this adds up. Hard parsing is expensive relative to soft parsing. How does cursor_sharing = force affect this? The behavior described above relies on us setting the parameter cursor_sharing to exact. This is the default, so it's likely your databases have this setting. We can change this mode of operation by setting cursor_sharing to force. When we do this, Oracle strips out all the literals from the query and replaces them with bind variables in the optimization phase. For example, the previous two order queries now both become: select * from orders where order_id = :"SYS_B_0" In effect Oracle has transformed the query into one that does use bind variables. Problems with cursor_sharing = force Given that changing cursor_sharing is easy to do, it's tempting to say we should just set it to force. This introduces some other issues however. Let's say we have a process to find all the open orders for a customer. The query is: select * from orders where status = 'ACTIVE' and customer_id = :customer; The status will always be 'ACTIVE' for this query. Yet Oracle still changes this query to become: select * from orders where status = :"SYS_B_0" and customer_id = :customer; Why is this an issue? We may have a similar query elsewhere in the application to find all the completed orders for a customer, such as: select * from orders where status = 'COMPLETE' and customer_id = :customer; Oracle will also transform this to the bound version shown above. Instead of two distinct queries we've ended up with one. To see why this could be an issue, let's return to the car analogy. You don't want to change your route just because you're using a different car. It's likely you would want to take a different path if using another form of transport however (e.g. bike, walking, bus, …). If you don't know which type of transport you're using, how do you select which route to take? Oracle addressed this issue with bind variable peeking (introduced in 9i). This enables the optimizer to look the actual values passed when parsing a query and produce a plan suited to them. When the database first started peeking it only did this check on the first execution of a query. Subsequent executions of it would use this first plan. This applied even if there were more efficient plans for the values passed. For example, imagine you get a new job. No matter which transport method you use, you're forced to take the same route you took to get there on your first day. If your first journey was on foot, it's likely you'll take paths which are difficult (or impossible!) to traverse in a car! To reduce this problem, Oracle introduced Adaptive Cursor Sharing in 11g. This enabled the database to use multiple plans for the same query. It's able to identify when supplying different values for bind variables should result in different plans. In terms of your journey to work, you no longer need to force your car down a footpath because you walked to work on your first day :) These features reduce problems associated with forced cursor sharing merging queries that are genuinely different. They're not entirely gone however. The more information you can provide to Oracle, the more likely it is to choose the optimal execution plan. Converting a fixed literal to a bind variable removes some information. 
This increases the chance that the optimizer produces a sub-optimal query plan. This isn't the only issue with forced cursor sharing however. It still performs worse than using bind variables. Let's see what happens when we run the previous 50,000 order id lookups without bind variables but with forced cursor sharing: alter session set cursor_sharing = force; Session altered. Elapsed: 00:00:00.00 declare order_rec orders%rowtype; begin for i in 1 .. 50000 loop begin execute immediate 'select * from orders where order_id = ' || i into order_rec; exception when no_data_found then null; end; end loop; end; / PL/SQL procedure successfully completed. Elapsed: 00:00:08.92 It's a lot faster than the default (exact), but still noticeably slower than using bind variables. Setting cursor_sharing to force may be useful as a quick fix if you have a poorly performing application that doesn't use binds. It remains slower than binding properly though, and may adversely affect queries with genuine literals. The best long-term solution is to write your queries to use bind variables. The advantage of using PL/SQL To finish off it's worth exploring a final way of implementing the query. In the story Dave was writing SQL queries directly in the middle tier. An alternative method is to place all your SQL in PL/SQL stored procedures. The application then just calls these modules. Let's repeat the test above, this time with the SQL statement written directly in the PL/SQL block: declare order_rec orders%rowtype; begin for i in 1 .. 50000 loop begin select * into order_rec from orders where order_id = i; exception when no_data_found then null; end; end loop; end; / PL/SQL procedure successfully completed. Elapsed: 00:00:03.48 As you can see this gives similar performance to the bound version of execute immediate. Why is this? Provided you're writing static SQL in PL/SQL blocks (i.e. not as part of an execute immediate or dbms_sql statement) you're already using bind variables! Additionally, Oracle doesn't replace genuine literals in PL/SQL when cursor_sharing = force (MOS note 285447.1). This makes it more likely Oracle will choose the optimal execution plan. Conclusion As we've seen, not using bind variables causes Oracle to do more work when determining how to execute queries. Instead of coming up with an execution plan once and reusing it, the optimizer regenerates the plan for each execution with different input values. This isn't the biggest issue with not using binds however. There's a more significant problem lurking: security (or lack thereof!) due to SQL injection. We'll look at this in the next post.
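If you want to see the difference for yourself, one way is to peek in the cursor cache after running the tests above. A rough sketch (you need select access on v$sql): with exact cursor sharing and literals you'll find thousands of distinct cursors, each parsed once; with binds (or force) you'll find a single cursor with many executions:

select sql_text, parse_calls, executions, loads
from   v$sql
where  upper(sql_text) like 'SELECT * FROM ORDERS WHERE ORDER_ID =%'
order  by sql_text;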

Securing Account Details: A Story About SQL Injection and Bind Variables

It was an ordinary Thursday afternoon. Dave had just returned to his desk after lunch. Wearily he opened his email. *** ATTENTION! TOP PRIORITY *** Following the recent leaking of account details from high profile websites, we've conducted a pen test. This has revealed holes in our security. Everyone must make all efforts to investigating and resolving this issue. Regards,The CEO Dave turned to his boss. "What happened?" he asked her. "The security consultants provided the CEO with a file containing the personal details for all people using our system" she replied. "Needless to say, he's not happy about it". Dave pondered. How could this have happened? He cast his mind back… The past two years were a blur. Fresh out of university, Dave joined the startup Acme Co as a developer. The job sounded exciting - building a new social media platform. His first task was creating the profile pages. The learning curve had been steep. Dave enjoyed the challenge though. He was a full stack developer, working on the UI, middle tier and Oracle database. He loved the variety and the chance to build a wide skill set. Although his knowledge was wide it wasn't particularly deep. There were times he wished he had a better understanding of all the technologies he used. SQL in particular was a grey area. When he started his comprehension of it was a couple of dimly remembered lectures from his second year at university. So he picked it up as he went along. He wanted to learn to really get to grips with SQL and understand it in detail. The pace of change and sheer volume of other things he needed to know for his job meant he hadn't dedicated as much time as he would have liked. His mind was drawn to the first production SQL he had written. The data to populate the profile page was stored in the accounts table. To fetch this he had written a simple select. This returned all the details for a given account by its primary key. He recalled struggling to figure out how to pass account_ids to the query. In the end he settled for building the query as a concatenated string, appending the account_id like so: String sql = "select * from accounts where account_id = " + accountID; Statement stmt = connection.createStatement();ResultSet result = stmt.executeQuery(sql); He remembered the excitement he felt when, a couple of weeks later, this SQL and all the other code to generate the profile page were in production. It felt great to produce something tangible. At the time Acme was tiny. The first year had seen fantastic growth. Within a year there were several million active users. With this growth came performance issues. The developers were constantly firefighting to keep the site running effectively. At first everything worked well. When the load grew to millions of SQL queries every day it reached crisis point. The database servers were at full capacity. They had already scaled out the database. It was becoming clear that throwing hardware at the problem wasn't a viable solution. Acme drafted in a database consultant to investigate. He noticed that a lot of the SQL wasn't using bind variables. The account selection query was a key culprit. He recommended reviewing all database access code ensure it was using bind variables. By this time the codebase was hundreds of thousands of lines long. At the time management was desperately seeking another round of funding from investors. They needed the money to keep the company running. To reassure investors they wanted a fast resolution to the performance issues. 
Management pressed the consultant for an easier, quicker solution. He said you could change the system parameter cursor_sharing to force. This would trick Oracle into thinking all literals in statements were bind variables. Changing the code was still the best solution. Not only would this make the queries faster, it would protect you from something called "SQL injection". Desperate to get performance under control, management insisted on going with the quickest solution. The cursor_sharing parameter was set in production. Almost immediately there was a noticeable improvement in response times. Modifying the code was quickly forgotten about. Remembering the consultant's advice, Dave had always intended to go back and modify his account access SQL. The constant pressure to develop new features meant this was always at the bottom of his to-do list. Until now. Dave scanned the application logs. There were some bizarre page requests like: /account/?account_id=1-pow(1,1)/account/?account_id=1-power(1,1)/account/?account_id=1 or 1=1/account/?account_id=1 union select * from user_tables; Fortunately Dave had just attended an IOUG conference. One of the presenters had demonstrated how queries that didn't use bind variable are vulnerable to SQL injection. This enables hackers to run arbitrary SQL statements and gain access to information that they should not be able to see. His mind immediately jumped to his accounts query. Could this be the source of the leak? He opened up the profile page in the development environment and tried different SQL injection strings. Adding a simple "or 1=1" enabled him to return all the account details. Clearly this was a likely candidate for the breach. He was annoyed with himself. "I should have fixed the code earlier," he thought to himself. He quickly loaded the database access code and searched for his accounts query. There it was, just as he'd written it two years ago. He now knew what to do. He had to change the query to use bind variables. Working fervently he updated the code to use a prepared statement: String sql = "select * from accounts where account_id = ?";PreparedStatement stmt = connection.prepareStatement(sql);stmt.setInt(1, accountID);ResultSet result = stmt.executeQuery(); After deploying the changes he repeated the SQL injection attempts. All passed. No matter what he tried only the details of the current account were returned. Success! The change was fast tracked into production. A few hours later the new code was live. Dave stayed at work to see the release. He wanted to be sure that his changes had worked. Once it was live he repeated the SQL injection attempts in production. As expected, his attempts all came back negative. Dave smiled to himself. While running the tests he noticed something else. The profile page was loading quicker. A glance at the production performance dashboard revealed that database CPU usage was also down. Not only had using bind variables secured the database, it had improved performance! It was a hard lesson. Dave took it to heart though. If you care about security and performance use bind variables!

PL/SQL

Optimizing the PL/SQL Challenge VIII: The Importance of Testing Equivalent Queries

In part seven of the PL/SQL Challenge optimization series we improved performance by extracting a query out of PL/SQL and pasting it directly into the main statement. This was a scalar subquery to find the most recent date a player visited each quiz. Simplifying, our query now looked like this: SELECT (SELECT created_on FROM (SELECT qv.created_on FROM qdb_quiz_visits qv WHERE qv.user_id = :user_id_in AND qv.quiz_id = q.quiz_id ORDER BY qv.created_on DESC) WHERE ROWNUM = 1), q.* FROM qdb_quizzes q Two levels of nested selects look clumsy. Is there a more elegant way to write this? The most recent visit is the one with the maximum date. So all we need to do is get the max(created_on) for each quiz and user. This gives a query like: SELECT (SELECT MAX(qv.created_on) FROM qdb_quiz_visits qv WHERE qv.user_id = :user_id_in AND qv.quiz_id = q.quiz_id), q.* FROM qdb_quizzes q This is less code and easier to read (in my opinion). Our goal was to get the query to run faster though. Therefore, the important question here is: What impact does this have on performance? Let's compare the execution plans. For the original, order by query: And the max technique: 240,000 buffer gets down to 120,000. Significantly less work! What's going on? Both queries use (the same) index. They also both return just one row. Surely they should do the same amount of work? Let's compare some of the other performance stats. The buffer pins stats are almost the exact opposite of each other. When we used the max method almost all the buffers were pinned. Using order by none of them were. What does that mean? When Oracle acquires a buffer (memory area) it needs to use various latches (locks). After doing this it can pin the buffer. Once a buffer is pinned it can access the data stored there without having to go through the latching process. If Oracle reads the same data multiple times in a query this reduces the amount of work it needs to do. With the index range scan min/max, the optimizer chose to pin the index entries in memory. Therefore it didn't need to reacquire the same index block multiple times. When using the index range scan descending no blocks were pinned, so there was a lot of repeated work to reacquire the same data. The two methods are semantically the same (they give identical output). In theory they should perform the same too. Yet using max() did 50% less work. If you're trying to improve the performance of a query, it's worth asking yourself if you can rewrite it. To find the optimal approach it's important that you test out different methods on your data to discover which is best. There are even more ways we could rewrite this subquery. At the time we optimized the query the PL/SQL Challenge was on 11g. On 12c there's another possibility. We can use fetch first syntax to find the greatest created_on value like so: SELECT (SELECT qv.created_on FROM qdb_quiz_visits qv WHERE qv.user_id = :user_id_in AND qv.quiz_id = q.quiz_id ORDER BY qv.created_on DESC FETCH FIRST 1 ROWS ONLY), q.* FROM qdb_quizzes q; How does its performance compare? This is more or less the same as using max(). Choosing between these two methods mostly a matter of style and personal preference. Or is it? Before asserting this we should check how the max() query performs on 12c. Let's look at the autotrace stats: A huge reduction in work! Oracle's transformed the query to do something vastly more efficient - a hash join of the two tables. Further proof (if you needed any) that calling SQL from PL/SQL from SQL is a bad idea. 
Had we simply changed this query in the PL/SQL function the optimizer would not be able to do this. Ensure you're allowing Oracle to do whatever it can to improve performance! What do you think? Which is your favorite? Are there any other methods I've missed? Let us know in the comments!  
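If you'd like to repeat this comparison on your own data, the simplest harness is autotrace in SQL*Plus or SQLcl. Roughly like this, using whatever user id makes sense for you (1 here is just a placeholder), then running each variant in turn and comparing the consistent gets reported:

set autotrace traceonly statistics

variable user_id_in number
exec :user_id_in := 1

SELECT (SELECT MAX(qv.created_on)
        FROM   qdb_quiz_visits qv
        WHERE  qv.user_id = :user_id_in
        AND    qv.quiz_id = q.quiz_id), q.*
FROM   qdb_quizzes q;

Then swap in the order by/rownum and fetch first versions of the subquery and compare the figures.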

PL/SQL

The Problem with SQL Calling PL/SQL Calling SQL

In the previous post we refactored a SQL query on the PL/SQL Challenge so it no longer called PL/SQL functions. This improved performance by removing context switches. The reduction in query execution time was ten seconds. The observant among you will have noticed that the statement processed over 60,000 rows. That's a saving of less than 200 microseconds per row. That's a tiny overhead for individual rows. At this point you may be saying: "Big deal. I'm calling the function in a top-N query. Worst case it returns 10 rows, increasing runtime by two milliseconds. The overhead is trivial. I prefer to keep my SQL queries in PL/SQL functions because it aids code reuse".   Let's put performance to one side for the minute. If you have common queries, why not put them in PL/SQL and call these functions in your select statements? This makes future maintenance easier if you need to change them.   Because there's an important caveat with SQL that calls PL/SQL that calls SQL. You may end up with results you didn't expect.   Let's look at an example. Your application has a typical users table with a user in it: create table app_users ( user_id integer not null primary key, given_name varchar2(100) not null, family_name varchar2(100) not null ); insert into app_users values (1, 'Chris', 'Saxon'); commit; You've stored components of the name separately. To ensure you display full names consistently across the application you've built a function, get_full_name. This returns the formatted name based on user_id. The query to do this is: select given_name || ' ' || family_name from app_users usr where usr.user_id = :user_id; So far so good. If we call this from SQL there's a problem however.   The following query should return the same value for name_fn and name_sql: select get_full_name ( user_id ) name_fn, given_name || ' ' || family_name name_sql from app_users; When I execute it however, I see the following: NAME_FN NAME_SQL ------------- ----------- Harvey Saxon Chris Saxon They've returned different values!! How did this happen?!   I haven't done anything to modify the results in the function. This is an expected consequence of calling PL/SQL functions that execute queries within a SQL statement. You can (and will) see this effect happen in your environments.   Why?   The reason is due to Oracle's read consistency model. Each SQL statement gets a read consistent view of the data at the time it starts executing. The query within the function begins at some point later than the main statement. This means that the statement inside the function can have a different view of the data than the SQL query that called it. It can see changes committed between the time the parent query started and its own inception.   In this case I issued the following in a separate session while the query above was executing: update app_users set given_name = 'Harvey' where user_id = 1; commit; To ensure the main query and the function returned different results I had to cheat slightly. I'm not quick enough to run the query in one session and the update in another. So I also placed a ten second sleep in get_full_name before the query. This ensured I had enough time to commit the change before the query inside the function began.   This is a key point. If the query that calls PL/SQL executes quickly it's difficult to reproduce the anomaly above. As a result you're unlikely to spot this kind of error in testing. This leads to bug reports that occur "randomly" in production you're not able to reproduce. 
The impact of these bugs can be somewhere between a mild annoyance and total catastrophe. If you're unfamiliar with the read consistency model tracking down and resolving them can be difficult.   Fortunately the solution is simple:   Copy all SQL out of PL/SQL and paste it directly into the calling statement(s).   Does this mean that there's no way to have reusable queries that are both correct and performant?   Of course not!   There is a better way. With views.   For example, you can create the following view: create or replace view app_users_formatted as select user_id, given_name || ' ' || family_name full_name from app_users; Substituting this for the PL/SQL in our original query we get: select (select full_name from app_users_formatted apuf where apus.user_id = apuf.user_id ) name_view, given_name || ' ' || family_name name_sql from app_users apus; Name_sql and Name_view are now part of the same statement. This means they are read consistent to the same point in time. No matter how long the query takes to execute both will return the same value.   Of course, for simple formatting issues such as this, you can create a virtual column with the result you want: alter table app_users add (full_name as (given_name || ' ' || family_name)); Conclusion There are two things to watch for when calling PL/SQL from SQL: The performance overhead of context switches Potentially incorrect results if the PL/SQL itself calls SQL You can avoid both of these by extracting the SQL query of PL/SQL and placing it directly in the calling statement. If the query you take out is one you want to reuse, save it in a view. You then reference the view instead of the PL/SQL function in your SQL. Using views instead of PL/SQL also enables the optimizer to include the SQL in query transformations. This can lead to more efficient execution plans. The performance gains from the new plan can be even greater than the time saved on context switches. Code Listing -- note: requires execute privileges on dbms_lock. create table app_users ( user_id integer not null primary key, given_name varchar2(100) not null, family_name varchar2(100) not null ); insert into app_users values (1, 'Chris', 'Saxon'); commit; create or replace function get_full_name ( user_id app_users.user_id%type ) return varchar2 as full_name varchar2(200); begin dbms_lock.sleep(10); select given_name || ' ' || family_name into full_name from app_users usr where usr.user_id = get_full_name.user_id; return full_name; end get_full_name; / -- in session 1: select get_full_name ( user_id ) name_fn, given_name || ' ' || family_name name_sql from app_users; -- in session 2, while query is still executing in session 1: update app_users set given_name = 'Harvey' where user_id = 1; commit;

PL/SQL

Optimizing the PL/SQL Challenge VII: Removing Function Calls

Waaaay back in the first entry in this series we tuned one of the queries on the PL/SQL Challenge. The index we created improved its performance significantly. This missed the bigger picture however. The query was embedded in a PL/SQL function. The main SQL statement on the Library page called this function. We could still make noticeable performance gains by extracting the query out of PL/SQL and placing it directly in the main select. Why? Due to context switching. SQL and PL/SQL use different runtime engines. Therefore every time you call a PL/SQL function from SQL (or vice versa) there's overhead due to this switch. We can get rid of this overhead by taking the queries out of PL/SQL and placing them directly into the main select. In this situation we had two functions called in the same case expression: CASE WHEN qdb_thread_mgr.last_commented_on (pq.question_id) > qdb_player_mgr.most_recent_visit ( :ai_user_id, pq.quiz_id) Let's see what these queries are. Qdb_thread_mgr held: SELECT MAX (t.changed_on) INTO l_return FROM qdb_threads t, qdb_question_threads qt WHERE qt.question_id = question_id_in AND ( t.parent_thread_id = qt.thread_id OR t.thread_id = qt.thread_id) AND t.approved_by_user_id IS NOT NULL; And qdb_player_mgr contained: CURSOR visits_cur IS SELECT qv.created_on FROM qdb_quiz_visits qv WHERE qv.user_id = user_id_in AND qv.quiz_id = quiz_id_in ORDER BY qv.created_on DESC; With the queries in hand, there are two things we need to do:   Move the queries into case statement and Check what difference this makes to performance   Moving the qdb_thread_mgr query is straightforward. We can copy it directly into the case statement. All we need to do is replace the parameter (question_id_in) with the column passed to the function: CASE WHEN (SELECT MAX (t.changed_on) FROM qdb_threads t, qdb_question_threads qt WHERE qt.question_id = pq.question_id AND (t.parent_thread_id = qt.thread_id OR t.thread_id = qt.thread_id) AND t.approved_by_user_id IS NOT NULL) > qdb_player_mgr.most_recent_visit ( :ai_user_id, pq.quiz_id )   We can't paste the in the second query as-is however. It can (and does) return multiple rows. To use it in the query we need to modify it so it can return at most one row. If we don't then it's possible we'll receive "single-row subquery returns more than one row" exceptions. How can we do this? First, we must determine what the query is trying to achieve. As the function name (most_recent_visit) implies, we're trying to find the date a person last accessed a quiz. The statement already sorts the output by created date descending. So all we need to do is stop it processing once it's returned the first result. To do this we just need to place the SQL in an inline view and stop it at the first row using rownum: SELECT created_on FROM ( SELECT qv.created_on FROM qdb_quiz_visits qv WHERE qv.user_id = user_id_in AND qv.quiz_id = quiz_id_in ORDER BY qv.created_on DESC ) WHERE ROWNUM = 1 This gives a rewritten case expression like so: CASE WHEN (SELECT MAX (t.changed_on) FROM qdb_threads t, qdb_question_threads qt WHERE qt.question_id = pq.question_id AND (t.parent_thread_id = qt.thread_id OR t.thread_id = qt.thread_id) AND t.approved_by_user_id IS NOT NULL) > (SELECT created_on FROM (SELECT qv.created_on FROM qdb_quiz_visits qv WHERE qv.user_id = :ai_user_id AND qv.quiz_id = pq.quiz_id ORDER BY qv.created_on DESC) WHERE ROWNUM = 1)   Now we've rewritten the query it's time to see whether our changes have made any difference. 
The execution plans show the following: Original approach with functions New SQL only method The plans are identical below the highlighted HASH JOIN. The SQL only query also includes two complicated looking SORT steps. With just this to go on it's tempting to say that the function approach must be better. After all, the queries do the same amount of work (buffer gets) up to the HASH JOIN. We need to add the work done by the SORT on top of this to the SQL only query. Surely this means the SQL approach is worse? Not so fast. Compare the total number of consistent gets each query does: The function is nearly three times worse! Where has all this work gone? Unfortunately autotrace doesn't give the full picture. It doesn't display the execution plans for any statements buried in PL/SQL. To view these we need to run a SQL trace. When we inspect the trace output for the statement with functions, we find familiar looking queries: SELECT MAX (T.CHANGED_ON) FROM QDB_THREADS T, QDB_QUESTION_THREADS QT WHERE QT.QUESTION_ID = :B1 AND ( T.PARENT_THREAD_ID = QT.THREAD_ID OR T.THREAD_ID = QT.THREAD_ID) AND T.APPROVED_BY_USER_ID IS NOT NULL And: SELECT QV.CREATED_ON FROM QDB_QUIZ_VISITS QV WHERE QV.USER_ID = :B2 AND QV.QUIZ_ID = :B1 ORDER BY QV.CREATED_ON DESC The trace also gives us the performance statistics and execution plans for the two queries. The first query shows the following: call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 60239 3.35 3.43 0 0 0 0 Fetch 60239 10.27 11.33 169 357596 0 60239 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 120479 13.63 14.77 169 357596 0 60239 Rows (1st) Row Source Operation ---------- --------------------------------------------------- 1 SORT AGGREGATE 0 NESTED LOOPS 0 NESTED LOOPS 0 TABLE ACCESS BY INDEX ROWID QDB_QUESTION_THREADS 0 INDEX RANGE SCAN I_FK_QDB_QUSTTHRD_QUEST 0 BITMAP CONVERSION TO ROWIDS 0 BITMAP OR 0 BITMAP CONVERSION FROM ROWIDS 0 INDEX RANGE SCAN I_FK_QDB_THREADS_PRNT 0 BITMAP CONVERSION FROM ROWIDS 0 INDEX RANGE SCAN QDB_THREADS 0 TABLE ACCESS BY INDEX ROWID QDB_THREADS The second: call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 17702 0.51 0.58 0 0 0 0 Fetch 17702 0.43 0.70 768 53109 0 682 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 35405 0.94 1.28 768 53109 0 682 Rows (1st) Row Source Operation ---------- --------------------------------------------------- 1 INDEX RANGE SCAN DESCENDING QDB_QUVI_USER_QUIZ_I Those execution plans look familiar! They're identical to the sub-plans under the SORT steps in the main SQL only query. The work hasn't gone away. We just couldn't see it in the autotrace report. Summing the total of the two elapsed time columns these two queries are taking around 16 seconds (14.77 + 1.28) to execute. This is over ten seconds slower than the SQL only query. The two SORT steps for this take a little over five seconds to execute (see the elapsed time column in the plan above). Clearly context switching is adding significant overhead in this case. In Summary Calling PL/SQL from SQL leads to context switches. This leads to runtime overheads while Oracle changes between the two environments. This can slow queries noticeably, particularly when the PL/SQL functions execute SQL themselves. Spotting this overhead isn't always easy. 
Autotrace is a great tool. It only shows you the statistics for the top-level statement however. Plans for additional queries executed (e.g. in functions, triggers, etc.) aren't displayed. To view all the statements executed when you run a query, use SQL trace.

To identify this issue with autotrace, pay attention to the total execution time for the query and the elapsed time it reports (last_elapsed_time). If there's a lot of missing time, it's likely that there are other statements being executed. Another clue that Oracle's doing a lot of additional work is the "recursive calls" metric. This measures how many additional SQL statements Oracle runs.

What was this value for the original query? 101,002 Ouch!

And the updated version? 29 Much better :)
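If you want to check these numbers on your own statements, here's a sketch of one way to do it. The dbms_monitor calls, the statistics views and tkprof are standard; the trace file name below is made up and the exact steps will vary with your environment:

  -- check the session's recursive call count before and after running the query
  select sn.name, ms.value
  from   v$mystat ms
  join   v$statname sn on sn.statistic# = ms.statistic#
  where  sn.name = 'recursive calls';

  -- trace the current session, run the statement under investigation, then stop tracing
  exec dbms_monitor.session_trace_enable ( waits => true, binds => false )
  -- ... run your query here ...
  exec dbms_monitor.session_trace_disable

Then format the raw trace file (found in the database's diagnostic trace directory) with tkprof, for example: tkprof orcl_ora_12345.trc report.txt sys=no sort=exeela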


ORA-54033 and the Hidden Virtual Column Mystery

A colleague recently asked me a question: "I'm modifying the data type of a column. When doing so I get the following error:

  ORA-54033: column to be modified is used in a virtual column expression

But there are no virtual columns defined on the table! What on earth's going on?!"

This was exceptionally confusing. Looking at the table definition we couldn't see any virtual columns defined:

  create table tab (
    x integer,
    y date,
    z varchar2(30)
  );

Sure enough, when we tried to change the data type of y we got the exception:

  alter table tab modify (y timestamp);

  ORA-54033: column to be modified is used in a virtual column expression

How could this be? Perhaps there was a column defined that we couldn't see. Querying user_tab_cols revealed something interesting:

  select column_name, data_default, hidden_column
  from   user_tab_cols
  where  table_name = 'TAB';

  COLUMN_NAME                    DATA_DEFAULT                 HID
  ------------------------------ ---------------------------- ---
  SYS_STUYPW88OE302TFVBNC6$MMQXE SYS_OP_COMBINED_HASH("X","Y" YES
  Z                                                           NO
  Y                                                           NO
  X                                                           NO

The SYS_... column isn't in the table DDL! Where does it come from? And what's SYS_OP_COMBINED_HASH all about? Has someone been mucking around with the database?

The SYS_ prefix is a sign that the column is system generated. So something's happened that's caused Oracle to create this on our behalf. SYS_OP_COMBINED_HASH is an undocumented feature. The name implies Oracle is merging the arguments together to form a hash. Is there a feature where we want to capture information about a group of columns?

Indeed there is - extended statistics! This feature enables Oracle to calculate statistics on a group of columns. It uses this information to improve row estimates. This is useful when there's a correlation between the values of two (or more) columns in a table. Someone had created extended stats on this table for (x, y).

Now we've identified the problem, how do we get around it? Simple: drop and recreate the extended stats:

  exec dbms_stats.drop_extended_stats(user, 'tab', '(x, y)');

  alter table tab modify (y timestamp);

  select dbms_stats.create_extended_stats(user, 'tab', '(x, y)') from dual;

  DBMS_STATS.CREATE_EXTENDED_STATS(USER,'TAB','(X,Y)')
  --------------------------------------------------------------------
  SYS_STUYPW88OE302TFVBNC6$MMQXE

Success!

Extended stats are a great way to improve the optimizer's row estimates. If you need to create these, I recommend you also do the following:

Apply the extended stats to all environments.
Put a comment on the columns explaining what you've done, e.g.

  comment on column tab.x is 'part of extended stats. To modify data type drop and recreate stats';

These actions will help prevent future developers getting stuck tracking down the cause of "missing" virtual columns!
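If you hit ORA-54033 and suspect a hidden column group, you can confirm where it came from before dropping anything. A sketch, using the TAB table from the example above - user_stat_extensions records any extended statistics defined on a table:

  select extension_name, extension, creator, droppable
  from   user_stat_extensions
  where  table_name = 'TAB';

If the column group shows up here, the drop/recreate steps above apply. If it doesn't, the hidden virtual column has another source - for example a function-based index.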


Implementing a Mutually Exclusive Relationship in the Database

In the Protecting People's Privacy story we faced a problem. We needed to ensure that we were not sending emails to or storing personal details about people who had opted out. This is a mutually exclusive relationship. To ensure we met this requirement we needed to do two things:

Enforce the mutually exclusive relationship using database constraints.
Exclude opted-out email addresses from all queries generating email lists.

There is no simple answer for how to do this. In the story we looked at four different methods of implementing the requirements:

Store all of the details in one big table. Use a check constraint to validate that the appropriate columns are null when the opt out flag is set.
Split the demographic and opt out information into two tables. Police the fact that an email could only appear in one table via a check constraint on a materialized view (MV).
Split the demographic and opt out information into two tables with a master table storing all the email addresses. Ensure that the child tables have no overlapping entries either by:
  Placing a trigger on the demographic and opt out tables to maintain the master list.
  Adding an email type column to all tables and using referential integrity to ensure the tables have no duplicate entries.

Each of these approaches comes with its own advantages and disadvantages. Let's look at them.

One table with a check constraint

This is the most obvious solution. It's simple to implement - just add an opted out column to the demographic table. Then create a check constraint to ensure that the other columns are null when this is set.

This approach has a number of drawbacks. The biggest of these is ensuring that all queries have the correct where clause. Without this you can accidentally generate mailing lists including opted out emails. There are some steps you can take to mitigate this.

If you're on Oracle Enterprise Edition you can use Virtual Private Database. This can add the where clause automatically, based on a policy. This ensures queries only return opted-in or opted-out emails as appropriate. Correctly implementing this can be complex.

An easier and more widely available option is to create two views. One with the demographic columns where the opt out flag isn't set. The other showing emails where the opt out is set. Provided no one has direct access to the underlying table this can work well. Developers must still take care to ensure they use the views rather than the base table.

This method also has system maintenance implications. You must update the check constraint when adding new columns to the table. Developers can easily overlook this without good documentation and governance. Provided you have good development processes this technique can work.

Two tables supported by a MV

The major advantage of storing the emails separately is people are unlikely to generate a list including opted-in and opted-out email addresses. To do this you have to union the two tables together. This requires conscious effort.

The downside is Oracle doesn't have a statement to create a unique key across two tables. Therefore you must implement your own solution to validate the rule. One way to do this is to create a MV. This uses a query checking that the join of the two tables returns no rows. Provided you set it to refresh on commit, a check constraint on the MV can enforce the relationship.

This solution comes with a number of caveats however. The key issue for people with highly concurrent environments is performance.
Introducing the MV adds overhead when committing transactions including the tables used in the MV. Oracle serializes this refresh process - i.e. only one session at a time can update the MV. This can become a serious bottleneck in applications where many users are triggering refreshes at the same time.

Laurent Schneider raised another important issue. Fast refreshable materialized views are brittle. There are a number of actions, such as truncating dependent tables, which can make them unusable. Oracle doesn’t validate constraints on MVs when this happens. This is a silent failure. Therefore you can have unexpected situations where the constraint is not enforced.

The final three considerations regard the maintainability of a system using MVs. Oracle only validates MV constraints on refresh. If you have transactions with many DML operations it may not be clear which statement led to the exception. This makes it harder to debug the application. Constraints on MVs also increase the learning curve for people new to the system. The logic applying the check is effectively separated from the dependent tables. This can confuse developers as to the cause of an exception. The upside is you can add new columns to the demographic and opt out tables without needing to update the constraints.

Provided your rate of concurrent transactions is low and you've guarded against actions that invalidate MVs this approach can work well. If either of these is not the case then you're better off using a different technique.

Two tables supported by a master table

This method has two sub implementations. One relies on triggers to maintain the master table. The second adds an email type column to all tables. The type columns enable you to enforce the rule with foreign keys, check and unique constraints.

As with the MV approach, the clear benefit of these is reduced risk that people will accidentally generate a list including opted in and opted out addresses. The presence of the master table does reintroduce this risk however - particularly in the trigger only solution where the master table has no context for the email addresses. The best way to mitigate this is to ensure that no end users or applications have select privileges on the master table.

Both of these approaches perform better than using a MV to enforce the constraint. This is particularly noticeable in highly concurrent environments. They also allow you to add new columns to tables without having to modify any constraints. The same applies if you need to add more tables to the distributed key (provided you use a lookup table on the master table instead of a check constraint if using types!).

Using triggers

The main drawback of relying on triggers is they don't validate existing data. For new development this isn't an issue. If you're applying this technique to an existing system you must manually validate the data once the triggers are in place.

Another downside regards system maintainability. It's not clear just by looking at the triggers that they're used to enforce the distributed key. To overcome this ensure you include descriptive comments in the triggers explaining why they exist.

Using email type

This is a convoluted technique to implement.
To do so, you need to:

Add a type column to all the tables
Create an extra unique constraint on the master table over (email, type)
Ensure that the type column in the child tables is a constant value
Place foreign keys on the child tables back to the master using (email, type)

This leads to several "redundant" columns and constraints. On 11g and higher you can reduce storage overhead by using virtual columns for the type columns on the child tables. If you're on 12c, you can also make these invisible so they are automatically excluded from insert statements, select stars and so on. This can make the system harder to learn however. Whether this possible saving in coding effort offsets the learning curve is a decision you'll have to make!

The upside is you can infer the rule just by looking at the table constraints (particularly when you're familiar with the pattern). It's also easy to add attributes common to both demographics and opt outs to the master table, e.g. insert date. This method is also a way to implement entity subtype relationships. If you'd like to test these techniques out for yourself use this script.

You may be thinking that the email example is a niche problem and these implementations aren't relevant to you. Mutually exclusive relationships come in many forms. It's worth keeping these in your toolbox. Here are some examples:

A competition is open to everyone except employees.
A person can't serve time in a prison and be employed as a guard at the same time.
A bank account can only be one of saving, checking, loan, etc.
A social media "event" can be a like, follow, share, …

If you want scripts to try these approaches out yourself to see how they compare, use these scripts:

One big table
Two tables with MV
Two tables with master and triggers
Two tables with master and type columns
Insert tests for performance of two table implementations

I also want to say thanks to SalimWer, whose question on OTN provided the initial idea for the story, and Karthick_Arp and Stew Ashton for their answers that are the basis of some solutions. Any other link to real people or events is purely coincidental!

Overall, using a master table with type columns is my favourite approach. Which method do you like best to implement a mutually exclusive relationship in Oracle? Are there any other or better ways to implement the requirements? Let us know in the comments!
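(For reference, here's a minimal sketch of the first approach above - one table, a check constraint, and a pair of views. The table and column names are hypothetical, not taken from the scripts linked in the post:)

  alter table email_contacts add constraint emco_opt_out_c
    check ( opt_out_datetime is null
            or ( gender is null and birth_date is null and postal_address_id is null ) );

  -- queries building mailing lists should use this view: opted-in addresses only
  create or replace view active_emails_v as
    select email_address, gender, birth_date, postal_address_id
    from   email_contacts
    where  opt_out_datetime is null;

  -- opted-out addresses only
  create or replace view opted_out_emails_v as
    select email_address, opt_out_datetime
    from   email_contacts
    where  opt_out_datetime is not null;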


Protecting People's Privacy: Implementing Cross Table Constraints Part II

(In the previous episode you used a MV to implement a cross table constraint. The ensuing JI waits were affecting performance too badly for this to be a viable approach however. You're looking for another way to implement the constraint.) You're beginning to doubt the effectiveness of splitting the table. You know from experience that implementing trigger based solutions to enforce constraints are difficult. Dividing the table is the most robust solution you can think of to ensure no one accidentally includes opted out emails in queries however. Wearily you head back to your question on the Oracle forum and scroll through the answers. You spot another approach. Create master list of email addresses. Then place foreign keys from the demographic and opt out tables to this. Provided any inserts into email_demographics or email_opt_outs add the email to the master list first, you'll get a unique constraint violation. You can ensure this happens using before insert triggers. You test this out: create table email_master ( email_address constraint email_master_pk primary key) as select email_address from email_demographics union all select email_address from email_opt_outs; alter table email_demographics add constraint emde_email_address_fk foreign key (email_address) references email_master (email_address); alter table email_opt_outs add constraint emoo_email_address_fk foreign key (email_address) references email_master (email_address); create or replace trigger email_demographics_biudt before insert or delete or update of email_address on email_demographics for each row begin case when inserting then insert into email_master (email_address) values (:new.email_address); when deleting then delete from email_master emma where emma.email_address = :old.email_address; when updating then update email_master emma set emma.email_address = :new.email_address where emma.email_address = :old.email_address; end case; end email_demographics_biudt; / create or replace trigger email_opt_outs_biudt before insert or delete or update of email_address on email_opt_outs for each row begin case when inserting then insert into email_master (email_address) values (:new.email_address); when deleting then delete from email_master emma where emma.email_address = :old.email_address; when updating then update email_master emma set emma.email_address = :new.email_address where emma.email_address = :old.email_address; end case; end email_opt_outs_biudt; / insert into email_opt_outs values ( 'chris.saxon@oracle.com', sysdate);1 row inserted. insert into email_demographics values ( 'chris.saxon@oracle.com', 1, 'M', date'2000-01-01', 1); SQL Error: ORA-00001: unique constraint (CHRIS.EMAIL_MASTER_PK) violated It works! You worry that someone may get hold of the master email list. Without context they could easily start spamming people again. You add an email_type column to email_master. alter table email_master add ( email_type varchar2(30) constraint emma_type_c check ( email_type in ('DEMOGRAPHIC', 'OPTED_OUT') )); Thinking further, you realise that this enables you to enforce the mutually exclusive constraint declaratively. You can add email_type to the demographic and opt out tables as well and create a unique constraint over (email_address, email_type) in email_master. It's then possible to create a foreign key on (email_address, email_type) from the child tables to the master. 
Provided you ensure that email_demographics can only have email_type = 'DEMOGRAPHIC' and email_opt_outs have email_type = 'OPTED_OUT', it won't be possible to store an email address in a table that doesn't match the type given in the master. You're using Oracle 12c, so you decide to use virtual columns for type in email_demographics and email_opt_outs. This will save storage space and ensure no one can change their values to anything else. You also choose to make these columns invisible to reduce the coding you'll need to do: alter table email_master add constraint email_master_u unique (email_address, email_type); alter table email_demographics add ( email_type varchar2(30) invisible as ('DEMOGRAPHIC'));alter table email_opt_outs add ( email_type varchar2(30) invisible as ('OPTED_OUT'));alter table email_demographics add constraint emde_email_address_fk foreign key (email_address, email_type) references email_master (email_address, email_type); alter table email_opt_outs add constraint emoo_email_address_fk foreign key (email_address, email_type) references email_master (email_address, email_type); You'd love to make all modifications to these tables come through an API to manage DML on the master table. You're sure that marketing insert new email addresses directly to the database occasionally however. Getting them to use the API will be a battle. Time is also of the essence. Refactoring the application will take longer than you'd like. You settle for modifying the triggers you used earlier to include the email_type as appropriate. With the triggers updated you need to test the file imports. You cross your fingers set them running. It's nearly lunchtime. You quickly glance at your email then go to get some food. After lunch you gingerly check the status of the loads. The address import had already completed! This is looking good. Critically it's also rejected the addresses you added to email_opt_outs. You let out a sigh of relief. Confidence building, you set about finishing the code changes to support the table split. Throughout the afternoon you can't resist checking the status of the remaining imports. Each hour another completes, as expected. At 4:55pm the last one finishes. You just need check the mutually exclusive relationship held. You pause a moment before executing the verification query. "Could this really have worked?" you think. You click execute: select count(*) cfrom email_demographics demojoin email_opt_outs emooon demo.email_address = emoo.email_address; COUNT(*)---------- 0 Yes! The demographic imports have rejected all the email addresses you added to the opt out table. Excited, you work late finishing the code changes. The next day you get in, double check your work and prepare everything for release. On Friday the changes are released to production without a hitch. You're sure the solution will work. Over the weekend you can't help second guessing yourself. "Did I update the opt out application correctly? Was there something I missed?" you keep thinking to yourself. First thing next Monday you arrive back at work. Everything's quiet. This is a good sign. You double check that all the imports have worked. All good. Now for the opt outs. Two were received over the weekend. They were correctly removed from the demographics table. Success! No one bothers you all morning. At lunchtime you bump into Andy. "Did the changes work?" you ask. "Oh. Yeah." he replies. "Thanks" he mutters before sauntering off. At least he's not shouting at me anymore you think. 
A few weeks later the CEO invites you into his office. Nervously you go in. "Thanks for all your hard work!" he beams at you. "The fixes you made to the opt out feature were vital to keeping the company running. Keep up the good work and there might be a promotion in this for you someday! Now, get back to work." Looks like the pay rise will have to wait for another day. It felt good to be appreciated though. Smiling you head back to your desk and open up the support tickets. "Invalid countries stored in addresses". Sigh. Looks like there are more constraints for you to add... The end.


Protecting People's Privacy: Implementing Cross Table Constraints Part I

(In the previous episode you resolved issues with the opt out feature of your company's email marketing database. It wasn't working correctly; Acme was still sending emails to and recording personal details about people who had opted out. You created a check constraint to prevent the storage of demographics for people who opted out and modified email list generating queries to exclude opted out addresses. The story continues...) You roll back into work next Monday. It’s 8:15am. Bill and Louise from marketing are already loitering by your desk. "This can't be good" you think and amble towards them. Andy spots you. “The changes haven’t worked!” he roars. “Opted out people are still receiving emails!” “How?” you splutter. “They’re still appearing in my Excel list” he retorts. Grrr. You forgot the marketing team has direct access to the production database. They’d built Excel macros to fetch email addresses from it. Unsurprisingly these queries don’t include the critical “where opt_out_datetime is null” check. You start to tell him how to modify his macros. He waves you off. “You’re the developer” he says. “You fix it”. He departs with Louise, leaving you to ponder solutions. Revoking production access was off the cards. This was a battle you and the rest of IT frequently had with marketing. They insisted that without live access their productivity would suffer. This debate had gone all the way to the top. Unfortunately senior management had sided with marketing. You could modify their macros. Eventually someone is bound to miss the vital where clause again though, leaving you back where you started. If you could apply logic in the database to ensure that their queries only returned valid email addresses it would avoid this issue. If the demographics table only stores legitimate addresses this would be in place by defintion. Then it hits you: put demographic details and opt out requests in separate tables. If you can validate that an email address is only stored in one table or the other, then all queries against email_demographics will return usable email addresses. You can easily create the opt out table and remove them from the demographics table: create table email_opt_outs ( email_address primary key, opt_out_datetime not null) as select email_address, opt_out_datetime from email_demographics where opt_out_datetime is not null;delete email_demographicswhere opt_out_datetime is not null; This undoes last week's effort implementing the check constraint however. You need new validation to ensure that you can only store an email address in either email_demographics or email_opt_outs. The question is how? Oracle doesn’t have any statements that support cross table constraints. Stuck, you turn to the Oracle community for help. You ask a question on the Oracle SQL forum: “How can I implement a mutually exclusive relationship across two tables?” The solutions start pouring in. One catches your eye. The following query checks whether there are any email addresses that appear in both tables: select count(*) cfrom demographics demojoin email_opt_outs emooon demo.email_address = emoo.email_address; This should always return zero. You can validate this by using the query in a materialized view (MV) with a check constraint applied. As long as the MV is set to refresh on commit, whenever someone saves a change to the demographic or opt out tables Oracle will ensure this condition holds. This will prevent anyone storing an email address in both tables. 
You get to work creating the materialized view: create materialized view distributed_emails_mv refresh fast on commitas select count(*) c from demographics demo join email_opt_outs emoo on demo.email_address = emoo.email_address;ORA-23413: table "PLCH_EMAIL_OPT_OUTS" does not have a materialized view log Doh! Just creating the MV isn’t enough. To make it fast refreshable on commit you need materialized view logs on the tables too: create materialized view log on email_demographics with rowid, sequence (email_address) including new values; create materialized view log on email_opt_outs with rowid, sequence (email_address) including new values; create materialized view distributed_emails_mv refresh fast on commitas select count(*) c from email_demographics demo join email_opt_outs emoo on demo.email_address = emoo.email_address;alter materialized view distributed_emails_mv add constraint diem_no_duplicates_c check (c = 0); With the MV is in place you do a quick test: insert into email_demographics (email_address, postal_address_id, gender, birth_date, salary_range_id)values ('chris.saxon@oracle.com', 1, 'M', date'2000-01-01', 1);insert into email_opt_outs (email_address, opt_out_datetime)values ('chris.saxon@oracle.com', sysdate);commit;SQL Error: ORA-12008: error in materialized view refresh pathORA-02290: check constraint (PLCH_DIEM_NO_DUPLICATES_C) violated Success! This looks promising. You set up some test data, schedule the file imports to run and head for coffee. Each of these normally take around an hour to complete. With five loads queued if you're really lucky they'll finish just before you leave work. You spend the rest of the day on the code changes to support the table split. At 5:45pm you check the status of the imports. They're still running. “I guess this'll have to wait until morning” you think to yourself, save your changes and head home. The next day you come in and eagerly check the status of the file loads. Still going. This isn't good. You quickly do some checks. Less than 60% complete. This is taking too long. At this rate they'll take over a day to complete. Unacceptably long. "What’s going on?" you think. You contact Jill, your DBA. "Could you get any stats on my imports or see if there's any unusual database activity last night?" you ask. She sends over a statspack report. Top of the list of timed events is "enq: JI - contention". "This wasn't there yesterday" she informs you. "enq: JI - contention? What on earth is that?" you ponder. A quick Google reveals that this wait event is related to MV refreshes. Only one process can refresh an MV at a time. If concurrent sessions trigger a refresh the "enq: JI - contention" wait event appears. Clearly there's a lot of refreshes taking place. You open up the address import package and search for "commit". 27 occurrences. All buried within 7,452 lines of spaghetti, row-by-row processing code. The other import packages are in a similar state. Every change to every email address is causing an MV refresh. No wonder performance is so badly affected. All the packages are in desperate need of a rewrite. Converting them to use set processing has been on your to-do list for some time. Doing so will reduce the number of commits, decrease the JI waits and improve overall runtime. Implementing these changes is several weeks' worth of work however. You're going to need to deploy a solution faster than that. It looks like using an MV to support the cross table constraint is out. 
"There must be another way to implement the distributed key" you think. To be continued…


Protecting People's Privacy: The Power of Constraints

It’s Monday morning. You arrive at Acme, the small email marketing company where you work. You pour yourself a cup of lukewarm coffee from the communal coffee pot and head to your desk. You hear voices coming from the cavernous meeting room at the far end of the office. They sound angry. You ignore them and check your email. A few minutes later your boss bursts out of the meeting room and strides over to your desk. “The opt out feature isn’t working correctly!” he announces. “Several people have lodged complaints that they’re still receiving emails, despite opting out.” Worse, one has lodged a Data Protection Act (DPA) request to receive all the personal information we hold on them. It turns out we’re still storing all their demographic data, even though they’ve opted out. The fines for breaking the DPA and sending spam emails can run into the thousands. These alone could be enough to cripple the company. If those didn’t then the bad PR and subsequent client loss certainly would. Your boss drones on while you cast your mind back. The opt out feature was hastily implemented a few months ago by your co-worker, Jeff. It was built to help comply with anti-spam and data protection laws. Anyone could submit their email address via an online form. In return the company promised not to send them any more marketing emails and store no personal data associated with the email address. You’d had a heated debate with Jeff regarding the implementation. You’d recommended that he added database constraints making it impossible to store demographic information about people who opted out. He insisted that this was “business logic” that “belonged in the application”. With your own deadlines to meet you left him to it. “Sort this out!” your boss demands before storming off back to the meeting room. You sigh, open SQL Developer and connect to the database. Email addresses and demographic data are all stored in one giant table: create table email_demographics ( email_address varchar2(320) not null, postal_address_id number, gender varchar2(1), birth_date date, salary_range_id number, opt_out_datetime date); First you set about the DPA issues. You check how many rows store demographic data where the opt out date is set: select count(*)from email_demographicswhere opt_out_datetime is not nulland (gender is not null or postal_address_id is not null or salary_range_id is not null or birth_date is not null); COUNT(*)---------- 127 “We should have implemented database constraints” you mutter to yourself. Clearly the opt out feature isn’t preventing the collection of personal data. The question is how? You start testing the application. No matter what you try opting out always deletes a person’s personal data. You’re stuck. It’s 10:30am. Your boss asked for an update at 11. You need to provide details of how you’re going to fix this and timescales. While you're contemplating an Outlook notification pops up. Weekly address data import failed. Of course! Addresses and other personal details are loaded directly into the database via PL/SQL file imports. You quickly check the source code. Sure enough, this always set address details without verifying that the opt out date is null. With ten minutes left before your 11am update you brainstorm solutions. You need to verify that all the demographic columns are null when the opt out date is set. This is easily done using coalesce. This returns the first non-null argument. 
If you pass all the personal data columns to it you just need to ensure it returns null when the opt out datetime has a value. The following check constraint will do the trick: alter table email_demographics add constraint emde_opt_out_c check ( case when opt_out_datetime is not null then coalesce ( gender, to_char(postal_address_id), to_char(salary_range_id), to_char(birth_date) ) end is null); A combination of this constraint and modifications to the file imports will deal with the DPA issues. The next step is preventing spam. All the queries generating mail lists need to have the condition “and opt_out_datetime is null” added. You briefly consider using Virtual Private Database to do this automatically. Unfortunately you’re on Oracle Standard Edition so aren’t licensed to use it. You’ll have to dig through all the code manually. A quick scan of the source reveals 67 references to email_demographics. You do some back of the envelope calculations. Two weeks’ work you think to yourself. You check the time. 11:02. Solution in hand, you leap from your desk to inform your boss and head to the meeting room. When you arrive your boss is waiting along with all the company executives. You inform them of your findings. “Great” says the CEO. “But we need it by Friday”. Disgruntled, you head back to your desk, put your earphones on and start coding. After three days of non-stop coding, at 10pm on Wednesday you check-in all your changes. All the tests are passing. File imports no longer set demographic details against email addresses that have opted out. It’s looking like this will work. On Thursday evening the changes are released to production. You leave early on Friday, head home and relax. To be continued…
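(If you'd like to convince yourself the constraint above behaves as intended, here's a quick sanity test - the email addresses are made up:)

  -- allowed: opted out, with no personal data stored
  insert into email_demographics ( email_address, opt_out_datetime )
  values ( 'someone@example.com', sysdate );

  -- rejected: opted out, but a gender is supplied
  insert into email_demographics ( email_address, gender, opt_out_datetime )
  values ( 'someone.else@example.com', 'F', sysdate );
  -- fails with ORA-02290: check constraint (EMDE_OPT_OUT_C) violated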


Query Optimization

Fetching a Row Plus N Either Side Performance Analysis

In a previous article, we looked methods for fetching a row from a table along with a number of rows either side of it. In the comments, Stew Ashton offered a solution using the match_recognize clause introduced in 12c. How do the numbers stack up on the performance of these different methods? Let's find out! To test this, I created a one million row table (with this script). For the tests I looked for one row along with the five either side of it (:lname := 'Aaaq', :lvl := 5). Below is each query along with its elapsed time and execution plans with selected statistics (actual rows, buffer gets and elapsed time). First up the lag() and lead() approach: with boundaries as ( select prevr, nextr from ( select person_id, last_name, lead(person_id,:lvl,0) over (order by person_id) as nextr, lag(person_id,:lvl,0) over (order by person_id) as prevr from test_people ) where last_name like :lname ) select /*+ gather_plan_statistics */emp.person_id, emp.last_name from test_people emp join boundaries b on emp.person_id between b.prevr and b.nextr order by emp.person_id; Elapsed: 00:00:12.72 OPERATION A_ROWS BGETS ETIME ------------------------------------------------------- ---------- -------- ----------------- SELECT STATEMENT SORT_ORDER BY 11 14,720 12,483,154 MERGE JOIN 11 14,720 12,303,574 SORT_JOIN 1,000,000 7,360 1,331,089 TABLE ACCESS_FULL TEST_PEOPLE 1,000,000 7,360 882,359 FILTER 11 7,360 9,038,280 SORT_JOIN 971,628 7,360 5,210,613 VIEW 1 7,360 1,852,479 WINDOW_SORT 1,000,000 7,360 2,644,408 TABLE ACCESS_FULL TEST_PEOPLE 1,000,000 7,360 829,814 Ouch. That's taking twelve seconds to execute and features two full table scans! Clearly this isn't going to be a workable solution unless you're certain the table will only ever store a few hundred rows. What about match_recognize? select /*+ gather_plan_statistics */person_id, last_name from test_people match_recognize ( order by person_id measures match_number() match_number all rows per match pattern(a{0, 5} b a{0, 5}) define b as last_name like :lname ); Elapsed: 00:00:04.78 OPERATION A_ROWS BGETS ETIME ------------------------------------------------------- ---------- -------- ----------------- SELECT STATEMENT VIEW 11 7,360 334,040 MATCH RECOGNIZE_SORT 11 7,360 334,028 TABLE ACCESS_FULL TEST_PEOPLE 1,000,000 7,360 939,029 That's significantly quicker - just less than five seconds. This is still doing a full table scan, so as the data grows this is going to get slower. I think this is the most straightforward query to read (at least, once you've got to grips with match_recognize!) and may be good enough for ad-hoc analysis. If this is part of an application you'll want something better. 
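(A quick note on methodology: the actual rows, buffer gets and elapsed times shown in these plans come from row source statistics. One way to get similar figures yourself - a sketch - is to run the statement with the gather_plan_statistics hint, as above, then display the plan for the last cursor executed in your session:)

  select *
  from   table ( dbms_xplan.display_cursor ( format => 'ALLSTATS LAST' ) );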
Next, let's take a look at the solution min/max solution: with boundaries (prevr, nextr, lvl) as ( select nvl((select max(e.person_id) from test_people e where e.person_id < emp.person_id), emp.person_id) prevr, nvl((select min(e.person_id) from test_people e where e.person_id > emp.person_id), emp.person_id) nextr, 1 lvl from test_people emp where last_name like :lname union all select nvl((select max(person_id) from test_people e where e.person_id < prevr), prevr ) prevr, nvl((select min(person_id) from test_people e where e.person_id > nextr), nextr ) nextr, lvl+1 lvl from boundaries where lvl+1 <= :lvl ) select /*+ gather_plan_statistics */e.person_id, e.last_name from test_people e join boundaries b on e.person_id between b.prevr and b.nextr and b.lvl = :lvl order by e.person_id; Elapsed: 00:00:00.25 OPERATION A_ROWS BGETS ETIME ------------------------------------------------------- -------- -------- ------------ SELECT STATEMENT SORT_ORDER BY 11 34 550 NESTED LOOPS 11 34 570 NESTED LOOPS 11 33 478 VIEW 1 30 437 UNION ALL (RECURSIVE WITH)_BREADTH FIRST 5 30 518 SORT_AGGREGATE 1 3 32 FIRST ROW 1 3 17 INDEX_RANGE SCAN (MIN/MAX) PERSON_PK 1 3 12 SORT_AGGREGATE 1 3 20 FIRST ROW 1 3 10 INDEX_RANGE SCAN (MIN/MAX) PERSON_PK 1 3 6 TABLE ACCESS_BY INDEX ROWID BATCHED TEST_PEOPLE 1 4 54 INDEX_RANGE SCAN PEOP_NAME_I 1 3 39 SORT_AGGREGATE 4 10 70 FIRST ROW 4 10 46 INDEX_RANGE SCAN (MIN/MAX) PERSON_PK 4 10 26 SORT_AGGREGATE 4 10 62 FIRST ROW 4 10 36 INDEX_RANGE SCAN (MIN/MAX) PERSON_PK 4 10 19 RECURSIVE WITH PUMP 4 0 22 INDEX_RANGE SCAN PERSON_PK 11 3 26 TABLE ACCESS_BY INDEX ROWID TEST_PEOPLE 11 1 33 Nice! Execution time is well under one second - fast enough for most purposes. Can we do any better? My latest colleague came up with another approach. At a high level: Find the row of interest Walk along the (primary key) index downwards until our boundary Walk along the (primary key) index upwards until our boundary Like the min/max method, this uses subquery factoring: with the_row_i_need as ( select person_id from test_people where last_name like :lname ), lookdown as ( select person_id from ( select person_id from ( select /*+ index_desc(e) */ e.person_id from test_people e where e.person_id < ( select r.person_id from the_row_i_need r ) order by e.person_id desc ) where rownum <= :lvl order by person_id asc ) where rownum = 1 ), lookup as ( select person_id from ( select person_id from ( select /*+ index_asc(e) */ e.person_id from test_people e where e.person_id > ( select r.person_id from the_row_i_need r ) order by e.person_id asc ) where rownum <= :lvl order by person_id desc ) where rownum = 1 ) select /*+ gather_plan_statistics */ e.* from test_people e, lookup, lookdown where e.person_id >= lookdown.person_id and e.person_id <= lookup.person_id; Elapsed: 00:00:00.35 OPERATION A_ROWS BGETS ETIME ---------------------------------------------------------- ---------- -------- ------------ SELECT STATEMENT TEMP TABLE TRANSFORMATION 11 24 19,086 LOAD AS SELECT 0 4 17,533 TABLE ACCESS_BY INDEX ROWID BATCHED TEST_PEOPLE 1 4 80 INDEX_RANGE SCAN PEOP_NAME_I 1 3 52 NESTED LOOPS 11 20 1,057 NESTED LOOPS 11 18 1,291 NESTED LOOPS 1 14 1,025 VIEW 1 8 789 COUNT_STOPKEY 1 8 776 VIEW 1 8 758 SORT_ORDER BY STOPKEY 1 8 746 COUNT_STOPKEY 5 8 735 VIEW 5 8 709 INDEX_RANGE SCAN DESCENDING PERSON_PK 5 8 682 VIEW 1 5 608 TABLE ACCESS_FULL SYS_TEMP_0FD9D66DB_1374337 1 5 589 VIEW 1 6 212 COUNT_STOPKEY 1 6 198 VIEW 1 6 177 SORT_ORDER BY STOPKEY 1 6 161 COUNT_STOPKEY 5 6 162 VIEW 5 6 131 INDEX_RANGE SCAN PERSON_PK 5 
6 103 VIEW 1 3 66 TABLE ACCESS_FULL SYS_TEMP_0FD9D66DB_1374337 1 3 51 INDEX_RANGE SCAN PERSON_PK 11 4 226 TABLE ACCESS_BY INDEX ROWID TEST_PEOPLE 11 2 90 The execution time is marginally higher than the min/max approach. It uses fewer buffer gets however. It also only accesses the person_pk index twice instead of four times. If we ask for more rows each side, this solution should scale better. Unfortunately, there's a catch with this approach. So far I've assumed that names are unique. Clearly this isn't the case. What happens if the search matches multiple names? ORA-01427: single-row subquery returns more than one row Uh-oh. The look up and down approach is only guaranteed to work if searching on unique columns. Looks like the min/max approach is the most efficient general case solution. What do you think? Which method do you like best? Can you find a faster or more elegant approach? If so, please let us know in the comments!
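(One possible mitigation, offered as an untested sketch rather than part of the original comparison: force the driving row to be unique before walking the index, for example by defining the_row_i_need as below. You then only get the neighbourhood of the first matching row, which may or may not be acceptable for your use case:)

  -- replaces the_row_i_need in the look up/down query above;
  -- min() guarantees at most one row, so the scalar subqueries can't raise ORA-01427
  select min ( person_id ) person_id
  from   test_people
  where  last_name like :lname;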


PL/SQL

Optimizing the PL/SQL Challenge VI: The Problem with Group By Subqueries

So far in this series we've made substantial improvements to the performance of the Library page query on the PL/SQL Challenge. This was taking several minutes to execute; our efforts have reduced this to a few seconds. These efforts were all directed at the main body of the query. There are also a number of subqueries in the select clause. Oracle is still doing a lot of work processing these, as we can see from the following autotrace summary: The SORT (AGGREGATE) steps are all queries nested within the select clause of the statement. The FILTER is the main body of the SQL. We can see that the first of these subqueries is doing double the amount of work as the main query! To figure out why, I expanded this step. When doing so, something caught my eye (columns are, from left to right, last CR buffer gets, last output rows, last starts): That full index scan is processing over 300,000 rows! No wonder the buffer gets are so high. Now that we've identified where to direct our efforts, the next questions are why is this happening and what can we do about it? Notice that there are no filter or access predicates listed on this operation. This means Oracle's processing all the rows in the index. Twenty times. This explains why this step is returning so many rows. We're getting somewhere. But why is Oracle fully scanning the whole index? The presence of VIEW -> SORT steps above the index scan are a clue to what's going on here. These steps mean that Oracle is fully executing all operations under the VIEW before joining these results back to the rest of the query. This usually happens because there's an inline view (subquery) in the statement that it can't merge into the main query. This can happen with "complex" views that include operations such as group by, distinct, outer joins and other functions that aren't basic joins. Let's take a look at our query to see if we can find any of these. The object listed at the top of the autotrace output, qdb_correct_comp_events_v is a view. Its definition is: with chc as ( select qs1.question_id, count (*) choice_count from qdb_questions qs1, qdb_mc_options mco1 where qs1.question_id = mco1.question_id group by qs1.question_id ), per_quiz as ( select qz.comp_event_id , qz.quiz_id , qr.user_id , case when qt.only_one_choice_correct = 'N' then sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) else case when sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) < chc.choice_count then 0 else sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) end end correct_answers from qdb_quizzes qz , qdb_quiz_results qr , qdb_mc_options mco , qdb_result_choices rc , qdb_questions qs , qdb_question_types qt , chc where qz.quiz_id = qr.quiz_id and qz.question_id = mco.question_id and qr.result_id = rc.result_id and mco.mc_option_id = rc.mc_option_id and qz.question_id = qs.question_id and qs.question_type_id = qt.question_type_id and chc.question_id = qs.question_id and nvl (qr.saved_by_user, 'Y') = 'Y' group by qz.comp_event_id , qz.quiz_id , qr.user_id , qt.only_one_choice_correct , chc.choice_count ) select eva.comp_event_id , eva.user_id , sum ( per_quiz.correct_answers ) correct_answers from qdb_compev_answers eva, per_quiz where eva.comp_event_id = per_quiz.comp_event_id and eva.user_id = per_quiz.user_id group by eva.comp_event_id, eva.user_id ; Aha! The chc subquery includes a group by and count. The fully scanned index, qdb_mc_options_idx, is on the qdb_mc_options table in this subquery. Looks like we've found our culprit. 
Now we've identified this, what can we do about it? First up, let's analyze what's going on this query. It has two subqueries defined in the with clause: chc and per_quiz. Per_quiz returns a row for each quiz. Chc counts how many options there are to choose from for each question. We use this for "only one correct" quizzes to check whether the number of correct answers a player has given matches the number of possible answers for the question. If they do, they get 100%, otherwise they score 0. Unfortunately, Oracle's not been able pass the question_id into the chc query. In effect, it's executing the following query once for each question marked as "only one correct": select mco1.question_id, count (*) choice_count from qdb_mc_options mco1 group by mco1.question_id Notice there's no where clause. Therefore Oracle is counting how many options there are for every quiz, not just the one we're interested in. Let me repeat that: This processes all the rows in an index on qdb_mc_options for every question that is marked as only one correct. This isn't going to scale well. Relatively few questions have this property. There only needs to be a handful in the results for this to impact query performance however. To fix this, let's check what we're trying to achieve. We want to count how many options there are for a specific question - i.e. join the query above on question_id. Fortunately there's a simple way to do this. We can remove the query in chc from the with clause and place it directly in the select. Then we can join it to the current quiz, like so: with per_quiz as ( select qz.comp_event_id , qz.quiz_id , qr.user_id , case when qt.only_one_choice_correct = 'N' then sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) else case when sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) < (select count (*) choice_count from qdb_mc_options mco1 where qs.question_id = mco1.question_id ) then 0 else sum (case when mco.is_correct = rc.user_choice then 1 else 0 end ) end end correct_answers from qdb_quizzes qz , qdb_quiz_results qr , qdb_mc_options mco , qdb_result_choices rc , qdb_questions qs , qdb_question_types qt where qz.quiz_id = qr.quiz_id and qz.question_id = mco.question_id and qr.result_id = rc.result_id and mco.mc_option_id = rc.mc_option_id and qz.question_id = qs.question_id and qs.question_type_id = qt.question_type_id and nvl (qr.saved_by_user, 'Y') = 'Y' group by qz.comp_event_id, qz.quiz_id, qr.user_id, qt.only_one_choice_correct, qs.question_id ) select eva.comp_event_id , eva.user_id , sum ( per_quiz.correct_answers ) correct_answers from qdb_compev_answers eva, per_quiz where eva.comp_event_id = per_quiz.comp_event_id and eva.user_id = per_quiz.user_id group by eva.comp_event_id, eva.user_id ; We're now explicitly asking to count the number of choices for a given question. Oracle is able to apply the question_id filter to the qdb_mc_options table. As a result it should only process the handful of rows for each question, instead of the whole table! In theory this looks like a huge saving. How does it bear out in practice? Wow, a massive decrease in work! Those 16,000 buffer gets have dropped down to under 500. The 15,000 plus gets doing the index scan are almost entirely eliminated. Notice also that an extra SORT (AGGREGATE) has appeared (there's four instead of the original three). Expanding this step out we see:   The object listed is the index on qdb_mc_options (question_id). 
Oracle's manipulated the SQL so it's only executing this subquery once, rather than for each row. Let's recap. Subquery factoring (the "with" clause) is a neat way of saving queries you can reference later in the same statement. This can help make them easier to understand. The risk is if these subqueries include group by statements or other complex operations then Oracle may have to process all the underlying data - not just the rows that you're interested in. In some circumstances Oracle can access just the rows you need via complex view merging. Unfortunately it didn't in this case. To figure out if this case is affecting a query, look for VIEW steps in the execution plan. If these are processing unexpectedly large numbers of rows you may need to rewrite your query to improve its performance.
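(To see the shape of the problem and the fix outside the PL/SQL Challenge schema, here's a miniature, hypothetical example. The grouped subquery in the first statement may be evaluated over the whole orders table if Oracle can't merge the view; the correlated scalar subquery in the second only counts orders for the customers the query returns. Note the two aren't exactly equivalent - the join version drops customers with no orders, while the scalar subquery returns zero for them:)

  -- hypothetical tables
  create table sales_customers ( customer_id integer primary key, customer_name varchar2(30) );
  create table sales_orders ( order_id integer primary key, customer_id integer not null references sales_customers );

  -- risky shape: the group by in the with clause may aggregate every order
  with order_counts as (
    select customer_id, count (*) order_count
    from   sales_orders
    group  by customer_id
  )
  select c.customer_name, oc.order_count
  from   sales_customers c
  join   order_counts oc on oc.customer_id = c.customer_id
  where  c.customer_name = :name;

  -- usually safer here: count only the rows for each returned customer
  select c.customer_name,
         ( select count (*)
           from   sales_orders o
           where  o.customer_id = c.customer_id ) order_count
  from   sales_customers c
  where  c.customer_name = :name;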


Analytical SQL

Fetching a Row Plus N Rows Either Side in a Single SQL Statement

A colleague recently posed me the following question: I want to find a row matching a string and the N (non-matching) rows either side of it, ordered by primary key. How can I do this using SQL? It's an interesting problem. Let's look at how you do in this in a single, efficient select statement. When constructing SQL queries it's always beneficial to have example data along with the expected output. In this case, given the following rows: ID STRING -- ------ 1 abc 2 def 3 ghi 4 this 5 jkl 6 mno 7 pqr When a user searches for the string "this" and N = 2, the query should return the rows with IDs 2-6. How can you do this? A first it's tempting to look at analytic functions. Lag() and lead() return values from the rows N positions before or after the current row. Using the classic employees table and N = 1, you could try a query like: select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr, lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees where last_name = 'Johnson'; Nextr and prevr will return the employee IDs either side of Johnson, right? Wrong. Analytic functions are processed after the where clause. The condition "last_name = 'Johnson'" returns one row. Oracle then applies lag and lead on just this one row - not all the rows in the underlying table. Therefore instead of getting the employee_IDs next to Johnson's, these functions return zero: EMPLOYEE_ID LAST_NAME NEXTR PREVR ----------- --------- ----- ----- 179 Johnson 0 0 Hmmm. To get around this, you need to execute these functions before the where clause. You can do this by creating an in-line view including them. The where clause then goes outside the subquery, like so: select employee_id, last_name, prevr, nextr from ( select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr , lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees ) where last_name = 'Johnson'; EMPLOYEE_ID LAST_NAME NEXTR PREVR ----------- --------- ----- ----- 179 Johnson 178 180 Now you have the previous and next employee_IDs. With these values, you can join back to the original table to find all the IDs between them: with boundaries as ( select prevr, nextr from ( select employee_id, last_name, lead(employee_id,1,0) over (order by employee_id) as nextr, lag(employee_id,1,0) over (order by employee_id) as prevr from hr.employees ) where last_name = 'Johnson' ) select emp.employee_id, emp.last_name from hr.employees emp join boundaries b on emp.employee_id between b.prevr and b.nextr order by emp.employee_id; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 178 Grant 179 Johnson 180 Taylor Success! To change the number of rows returned you just need to adjust the second parameter to lag() and lead(). You can convert these to bind variables. This enables users to select how many rows they want at runtime. Now you have a working query it's worth checking its performance. Here's where this query falls down. An index on last_name isn't going to help. Oracle has to scan all the rows in the employees table first, before applying the last name check (otherwise lag() and lead() could give us incorrect results). This isn't going to scale well for tables with millions or billions of rows. What else could you try? There's the naive approach. Just add and subtract the number of rows wanted from the target employee_ID. 
Then find all the rows between these values: with target as ( select employee_id from hr.employees where last_name = 'Johnson' ) select emp.employee_id, emp.last_name from hr.employees emp where (select employee_id from target) between emp.employee_id-1 and emp.employee_id+1; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 178 Grant 179 Johnson 180 Taylor Looks like it works! It gives the expected output. And it can use an index on employees.last_name. Time to roll this into production, right? Unfortunately, no. This example works because there are no gaps between employee_IDs. In general this isn't guaranteed (what if you can delete rows?). This method may return too few rows either side of the target. It also only works on gap-free integer columns. It doesn't work on a general version the problem: finding the rows ordered by something other than ID (e.g. insert date). Let's look at the problem again, restating it slightly. You want to find the upper and lower IDs, N rows away from a given row. Then find all the rows between them these IDs. Starting with the ID range, Oracle will be able to fetch the rows using a nice efficient index range scan on the primary key (assuming small N). So the real question is: How can you find the upper and lower bounds efficiently? Enter the min/max index range scan. Oracle is able to find the maximum or minimum value stored in an indexed column efficiently. This is because indexes are ordered structures, so it can just look at the first or last value in the index. This principle applies even if you're looking for a value in the middle of an index - provided you know the value immediately before (or after) it! A query in the form: select max(employee_id) from employees where employee_id < :value; can still do an efficient min/max index range scan. Oracle is able to search the index for the value provided. From there it goes to the previous value - the maximum less than the search value. This fetches the employee ID before the variable - even if there are gaps. To find the next ID, change the max() to a min() and flip the less than to a greater than. You can exploit this feature to efficiently find the upper and lower IDs either side of our target. Plugging the previous query in to a basic select from employees, along with one to find the next row, gives the following: select (select max(employee_id) from hr.employees e where e.employee_id < emp.employee_id) prevr, (select min(employee_id) from hr.employees e where e.employee_id > emp.employee_id) nextr from hr.employees emp where last_name = 'Johnson'; PREVR NEXTR ---------- ---------- 178 180 This is all very well and good. You've only returned the first value either side of the target however. This only works if N = 1. How can you get the rest when N is greater? Thinking programmatically, you could re-run this query N times. At each execution, use the maximum/minimum employee_IDs found in the previous pass - i.e. recursively call the query with the new boundaries. At this point you may be tempted to reach for your favourite programming language to build a recursive function including this query. There's no need however - you can do it all in SQL! This is possible using recursive query subfactoring. To use this you need to place the query in a with clause, define your base case, the recursive case and the exit condition(s). The previous query gives your base case. You need to add another column to this to count how many times you've recursed, starting with one. 
The recursive case starts with the same query. This time however, you increment the recursion count and add a check that this value is less than or equal to N (the terminating condition). Union-alling these two together inside a with clause gives: with boundaries (prevr, nextr, lvl) as ( select (select max(e.employee_id) from hr.employees e where e.employee_id < emp.employee_id) prevr, (select min(e.employee_id) from hr.employees e where e.employee_id > emp.employee_id) nextr, 1 lvl from hr.employees emp where last_name = 'Johnson' union all select (select max(employee_id) from hr.employees e where e.employee_id < prevr) prevr, (select min(employee_id) from hr.employees e where e.employee_id > nextr) nextr, lvl+1 lvl from boundaries where lvl+1 <= :lvl ) select * from boundaries; -- for lvl/N = 2 PREVR NEXTR LVL ---------- ---------- ---------- 178 180 1 177 181 2 Now you have the upper and lower bounds. To get the rows asked for, just join employees to the above where the employee_ID is between prevr and nextr. Before you do so, there is some tidying up steps needed. To prevent duplicating rows in the output, include a check that the computed recursion count (lvl) equals N. You also need to be wary of searching for rows near the start or end of the employee IDs. If recursion will take you beyond the max or min IDs, then the subquery will return null. This will cause the whole output to be empty. To prevent this, nvl() the results of these queries with employee_ID in the base case and prevr and nextr in the recursive case. Finally, use a bind variable for the last_name check! Putting this all together gives the following query: with boundaries (prevr, nextr, lvl) as ( select nvl((select max(e.employee_id) from hr.employees e where e.employee_id < emp.employee_id), emp.employee_id) prevr, nvl((select min(e.employee_id) from hr.employees e where e.employee_id > emp.employee_id), emp.employee_id) nextr, 1 lvl from hr.employees emp where last_name = :lname union all select nvl((select max(employee_id) from hr.employees e where e.employee_id < prevr), prevr ) prevr, nvl((select min(employee_id) from hr.employees e where e.employee_id > nextr), nextr ) nextr, lvl+1 lvl from boundaries where lvl+1 <= :lvl ) select e.employee_id, e.last_name from hr.employees e join boundaries b on e.employee_id between b.prevr and b.nextr and b.lvl = :lvl order by e.employee_id; EMPLOYEE_ID LAST_NAME ----------- ------------------------- 177 Livingston 178 Grant 179 Johnson 180 Taylor 181 Fleaur A nice, efficient way to find the N rows either side of a target! Note: I have to thank Sayan Malakshinov for providing the inspiration for the final solution for this, which came from his query to efficiently find the top-N results within multiple categories.


PL/SQL

Optimizing the PL/SQL Challenge V: The Danger of Too Many Indexes

In part four of this series, we saw how the Library page query from the PL/SQL Challenge was using a crazy execution plan. Not only did it full scan three indexes from the same table, it used a fourth to access the table itself. We solved this by rewriting the query and creating a new index. It's worth taking a step back to look at all the indexes to see what impact they have on the query.

Here's the original, slow, execution plan:

The three highlighted index full scans each return a large number of rows. A lot of work, especially considering that these are all on qdb_comp_events and Oracle later accesses this via its primary key index. What happens if we remove these three indexes (jes2, i_qdb_comp_events_priv and i_qdb_comp_events_startdt)? Let's take a look at the execution plan:

Wow, that's a big difference. Not only have we lost the full scans of the indexes we've removed, the four other full index scans are also gone! The execution plan looks similar to the one we ended up with at the end of the previous post. This plan isn't as efficient (~15,000 gets for this vs. ~6,000 for the optimized version in part 4), so it's still better sticking with the previous approach. Neither plan uses these indexes however, so it's worth asking: why don't we just get rid of them?

Looking at just this query it's tempting to say yes. Before going ahead with such a change we need to look at the system as a whole. There may be other statements that rely on one or more of these indexes. It's no good getting dramatic performance improvements in one query if we shift the problem to another one. Let's look at three approaches you can use to help figure out whether Oracle needs an index you're planning to drop.

Inspect the code

A good starting point is to check the code base for references to the table whose indexes we plan to remove. Once we've found all the statements we can check each to see whether they require the index or not. There are a few obvious downsides to this method. The first is the effort in analyzing the queries. If it's a core table it may be referenced in tens or hundreds of SQL statements. The time it'll take to inspect each individual query could make this exercise infeasible. Secondly, if users are able to write ad-hoc queries against the database then there may still be business critical statements that require the index. Your code analysis isn't going to identify these. Finally - and most decisively - the source code isn't always available! If it's a third party application it's likely you won't be able to inspect the code yourself.

Inspect current query execution plans

If code analysis proves impossible or infeasible, we can check the actual plans on the production system. The following query identifies current execution plans that use a given index (note the data dictionary stores object names in upper case by default):

select * from v$sql_plan
where  object_name = 'INDEX_TO_DROP';

The downside to this approach is the execution plan isn't guaranteed to be present. Plans are cached on a least recently used basis. If the query is infrequently executed, e.g. an overnight batch job, then it's unlikely to be present if we check at lunchtime. Therefore we need to sample this at several points over a period of time. Fortunately the AWR (automatic workload repository) does this sampling for us, so we can look there for historic execution plans. There are still a couple of caveats to this approach however. Most importantly, we need the diagnostics pack license to use it.
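For example, here's a sketch of the sort of AWR lookup you could run to find historic statements whose plans used the index (the owner name is a placeholder for your schema):

select distinct sql_id, plan_hash_value
from   dba_hist_sql_plan
where  object_owner = 'APP_OWNER'
and    object_name  = 'INDEX_TO_DROP';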
Even if we have it, the next issue is that not all statements are saved in AWR, just the heaviest consumers. So it's still possible we could miss important queries that don't put sufficient load on the system (relative to other queries).

Turn on index monitoring

The final method I'll discuss is index monitoring. This allows Oracle to record whether or not any statements have used an index. To enable this, issue the following statement:

alter index index_to_drop monitoring usage;

If a statement uses the index after we do so, then this is recorded in v$object_usage. We can check this with this query:

select * from v$object_usage
where  index_name = 'INDEX_TO_DROP';

This approach is also susceptible to the infrequently executed query issue. Depending on the nature of the target system it's likely we'll need to let it run with monitoring enabled for several days or weeks to get confidence that an index isn't used (often!).

The major drawback to all these methods is it's still possible to miss infrequently executed, critical queries for which an index is vital. For example, our index may only be used in month-end or year-end reports. Removing it may make the difference between these reports completing in minutes and taking several days to finish. If our analysis doesn't include these times it's possible to miss these key processes.

Ideally we'd combine the approaches above, checking the code while also monitoring production plans for a period of several days or weeks. If a query is performing so badly it's negatively impacting the business then this may be longer than we can wait. In these situations we may have to take action quickly. This pressure may force us into creating a new index without removing the unnecessary ones. This introduces technical debt. The increased index count makes it possible this or other queries will select inappropriate access paths. Also there's extra overhead maintaining the indexes when executing DML and increased storage requirements. Be aware that tuning one statement in isolation can lead to issues in other areas.
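One small addition the post doesn't cover: once you've finished gathering evidence, you can switch monitoring back off with:

alter index index_to_drop nomonitoring usage;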


PL/SQL

Optimizing the PL/SQL Challenge IV: More OR Condition Woes

In the previous post in the PL/SQL Challenge optimization series, we'd made huge performance gains simply by removing optional bind parameter clauses. The main body of the query is still doing a significant amount of work however and taking several seconds to execute. Fortunately there are simple changes we can make to improve its performance. Let's take a look.

Reviewing our trusty autotrace report we can see the following:

The red boxes highlight key issues in the execution plan. The three index scans are all accessing over 86,000 rows. The index lookups against qdb_comp_events are done over 21,000 times. There's something more insidious that's not apparent from looking at this plan though: all three indexes are also on the qdb_comp_events table! Essentially the optimizer has combined three indexes from one table. It's then filtered these results along with data from other tables. Several steps later it goes back and accesses the original table via its primary key. Ouch! Clearly something's going wrong here. Let's take a look at the join conditions for the comp events table:

and ce.competition_id = c.competition_id
and ce.comp_event_id = qz.comp_event_id
and ce.hide_quizzes = 'N'
and ( ce.approval_needed = 'N'
      or ( ce.approval_needed = 'Y' and ce.approved_by_user_id is not null))
and ( ce.end_date <= sysdate
      or ( qt.text = 'Free-Form Text' and ce.start_date <= sysdate)
      or ( trunc (ce.end_date) = trunc (sysdate) and c.frequency = 'D')
      or ( ce.start_date <= sysdate
           and trunc (ce.end_date) = trunc (next_day (sysdate, 'SAT') - 1)
           and c.frequency = 'W')
      or ( ce.start_date <= sysdate
           and trunc (ce.end_date) = trunc (add_months (sysdate, 1), 'MON') - 1
           and c.frequency = 'M'))

Hmmm, another thorny set of OR conditions. No wonder the optimizer is struggling! Is there anything we can do to simplify or remove these?

Let's tackle the approval_needed clause first. Currently there are no indexes including these columns. We could create an index on (approval_needed, approved_by_user_id). It's then possible Oracle could do an OR expansion using this index. The optimizer isn't guaranteed to select such an expansion however. Even if it does, we'll be accessing the table or its index multiple times. This is likely to be less efficient than a single AND condition (with supporting index). Can we rewrite this OR clause as a single condition?

Let's review what it's actually doing. In English, we're saying: if the event requires approval (approval_needed = 'Y'), then ensure that someone has done so (approved_by_user_id is not null). Otherwise we don't care whether approved_by_user_id has a value or not. This gives us a clue to how we can rewrite this. Firstly, we can check if approval_needed equals 'N'. If it does, then map to any non-null value. Otherwise use the value for approved_by_user_id. How can we do this? Using decode() (or case), like so:

and decode(ce.approval_needed, 'N', 'X', 'Y', ce.approved_by_user_id) is not null

We can replace the previous OR clause in our query with the above and create a (function based) index including this expression. The optimizer can then use this index without having to use OR expansions! Applying just this condition to qdb_comp_events returns all except a handful of rows however. Therefore an index that only includes this expression is unlikely to be useful. Let's keep looking at the query to see if there are any other improvements we can make.
The second set of OR expressions are complex, referencing the start_date and end_date columns alternately along with other tables. This means there's no easy way to rewrite it to enable index use. We can still make some changes here to improve the indexability of the query however. Look carefully at the subexpressions:

and ( ce.end_date <= sysdate
      or ( qt.text = 'Free-Form Text' and ce.start_date <= sysdate)
      or ( trunc (ce.end_date) = trunc (sysdate) and c.frequency = 'D')
      or ( ce.start_date <= sysdate
           and trunc (ce.end_date) = trunc (next_day (sysdate, 'SAT') - 1)
           and c.frequency = 'W')
      or ( ce.start_date <= sysdate
           and trunc (ce.end_date) = trunc (add_months (sysdate, 1), 'MON') - 1
           and c.frequency = 'M'))

Most of these check whether start_date <= sysdate. The two that don't check end_date instead: one requires end_date <= sysdate, the other trunc(end_date) = trunc(sysdate). Checking the data we can see that, as we'd expect, start_date <= end_date for all rows. Therefore we can infer that everything returned by the query will have a start_date <= sysdate. This enables us to add this predicate outside the OR branches. Doing so permits Oracle to use an index including start_date without fancy OR expansions or other transformations.

Finally, hiding quizzes is deprecated functionality. All the rows have the value 'N' for the hide_quizzes column. Therefore we can remove the hide_quizzes = 'N' clause without affecting the results. Putting this all together we can rewrite the where clause above as:

and ce.competition_id = c.competition_id
and ce.comp_event_id = qz.comp_event_id
and decode(ce.approval_needed, 'N', 'X', 'Y', ce.approved_by_user_id) is not null
and ce.start_date <= sysdate
and ( <big or branch trimmed for brevity> );

Having simplified the query, it's time to re-evaluate the indexes to see if we can create one that better suits our query. We now have checks against the following columns or expressions on qdb_comp_events:

competition_id
comp_event_id
start_date
decode(ce.approval_needed, 'N', 'X', 'Y', ce.approved_by_user_id)

Comp_event_id is the primary key for the table. Therefore including this in the index is unlikely to give much benefit. Of the three remaining conditions, competition_id is the only one with an equality (=) check against it, so this should be the leading column in the index. This leaves us with two candidate indexes:

competition_id, decode(approval_needed, 'N', 'X', 'Y', approved_by_user_id), start_date
competition_id, start_date, decode(approval_needed, 'N', 'X', 'Y', approved_by_user_id)

Let's try both out and see what impact they have on performance. First up, the index with start_date listed last:

Hmm, we're still accessing qdb_comp_events in the same way. There are some other (unshown) differences in the execution plan. These have only led to a slight drop in the work the query does though. Let's try the second ordering:

Aha! Now we're getting somewhere! The explain plan looks completely different. Gone are those crazy index fast full scans. Oracle's using the index we've just created, qdb_coev_comp_start_apprv_i, to return just 1,744 rows from the comp events table instead of over twenty thousand. Additionally the total gets performed by the query is down from ninety thousand to just under six thousand, an order of magnitude less work!

What have we learned here? When you have OR conditions in your query it's worth taking a moment to ask yourself the following questions:

Can you move any of the checks outside of the OR into the main body of the query?
Can the OR be rewritten using a function such as case or decode?

Moving checks outside the OR is particularly beneficial when the branches check different columns on a table. If you can find a condition to apply against a single column outside of the OR expressions it increases the chances that Oracle will use an index (against that column). To do this you need good knowledge of the data and logical deduction skills or you may find your updated query is wrong! Changing an OR condition into a function requires similar logical reasoning skills. Additionally you'll have to create an index including an exact copy of the new expression for the optimizer to use an index access path.

In some cases you may need to make a trade-off. Changing your query and creating a targeted, more efficient function-based index may be the best solution for your query. Sticking with the original OR can be better however if a plain index including the columns involved will be usable by many other queries in your system.

We've made great improvements to the query so far. There's still more to come however, so stay tuned for more posts in this series!
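For reference, the winning index above (qdb_coev_comp_start_apprv_i) would be created with DDL along these lines - this is my reconstruction from the column list discussed, not the original statement:

create index qdb_coev_comp_start_apprv_i on qdb_comp_events (
  competition_id,
  start_date,
  decode(approval_needed, 'N', 'X', 'Y', approved_by_user_id)
);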


Partitioning

Obfuscated SQL Contest: Stelios Vlasopoulos' Winning Entry

To wrap up this series, let's take a look at Stelios' winning entry in the Obfuscated SQL Contest! Vastly simplifying, this solution generates the latitude and longitude co-ordinates of each location. These are then passed as parameters in a HTTP request to http://ws.geonames.org/, in the process using more hex conversions in one statement than I've seen in my entire career! This call returns an XML document including the time zone name for each place. Stelios then uses multiple XML-to-string-and-back manipulation steps to extract these names, get the current time at each point and return the requested output. Let's break it down.

First up, the query uses subquery factoring (with clause) to generate five rows listing the place abbreviations required in the question and their co-ordinates. Unsurprisingly, this process isn't straightforward. It starts with the following query, which exclusive ORs two raws together to generate one long string:

select utl_raw.cast_to_varchar2(
         utl_raw.bit_xor(
           hextoraw('5B4F5D504E38313239393233373930 3737363831323937373637393738353135313033353730303031313 6373733373836393837343037383530393130373339363832383536 3737323733343138383138333230383736323331373736383639373 83339373432303433313034393931353331'),
           hextoraw(trim(to_char(470188516729,'XXXXXXXXXX')))
         )
       ) a
from   dual

It then generates five rows using the "group by cube()" method. This returns eight rows. Instead of just verifying the rownum is less than or equal to five, there's an excellently unnecessary square root calculation thrown in:

select rownum rn
from   (select 1 num from dual group by cube (1,2,3))
where  rownum<=sqrt(25)

Each row is assigned a value by chopping up the big string created earlier into 23 character long sections using a regular expression. The statement then inserts commas into these strings using the G (group separator) format mask:

select to_char(
         regexp_substr (x.a,'\d{23}',1,rn),
         'FM999999G99999999G999999999',
         'NLS_NUMERIC_CHARACTERS=.,'
       ) z
from   (select '666578129923790776812977 6797851510357000116773786987407850910739682 8567727341881832087623177686978397420431049 91531' a, rownum rn
        from   dual connect by level <= 5) x

This outputs the following:

Z
--------------------------
666578,12992379,077681297
767978,51510357,000116773
786987,40785091,073968285
677273,41881832,087623177
686978,39742043,104991531

The three numbers within each list eventually become the place abbreviation, latitude and longitude. First the values are extracted into three columns using regular expressions to find the nth occurrence of "[^,]+" within the strings. This matches runs of characters that aren't commas - in this case the three sets of numbers. Instead of just passing in the occurrence to these regular expressions, Stelios introduces a completely unnecessary "connect by level" operation. This enables him to use the level pseudocolumn along with redundant calculations to generate the numbers 1, 2 and 3. The second two columns are then "divided" by one million by multiplying them by the sixth negative power of 10 (0.000001)!
select regexp_substr(z, '[^,]+', 1, 3 * level - 2) a,
       regexp_substr(z, '[^,]+', 1, 3 * level - 1) * power(1e1,-6) b,
       regexp_substr(z, '[^,]+' , 1, 3 *level) *
         ( case when mod(3 * level, 3)=0 and rownum > 1 then -1 else 1 end ) * power(1e1,-6) c
from (
  select to_char(
           regexp_substr (x.a,'\d{23}',1,rn),
           'FM999999G99999999G999999999',
           'NLS_NUMERIC_CHARACTERS=.,'
         ) z
  from (
    select '666578129923790776812977679785151035700011 677378698740785091073968285677273418818320876231776869783 9742043104991531' a, rownum rn
    from   dual connect by level <= 5
  ) x
)
connect by level <= ( length(regexp_replace(z,'[^,]+')) + 3 ) / 3

This completes the first subquery! At this point we have the following output:

A          B          C
------ ---------- -----------
666578  12.992379   77.681297
767978  51.510357   -0.116773
786987  40.785091  -73.968285
677273  41.881832  -87.623177
686978  39.742043 -104.991531

The data are starting to take shape. A contains the ASCII values of the place abbreviations and, as you can verify with a quick Google search, B and C are the latitude and longitude of them. These are passed into another subquery block (d2) which converts column A into the place names (CT) and columns B and C back into hex strings (X and Y), employing the little-used model clause! This once again calls upon regular expressions to generate the column CT from A by passing values to the chr() function. This returns a character based on its numeric ASCII value.

select ct, x, y
from d1
model
  partition by (rownum rn)
  dimension by (0 dim)
  measures (cast(a as varchar2(65)) as ct, a, cast(b as varchar2(30)) as x, b, cast(c as varchar2(30)) as y, c )
  rules upsert iterate (3) (
    ct[0]= (case when iteration_number = 0 then null else ct[0] end ) || chr(regexp_substr(a[0],'\d{2}',1,iteration_number+1)),
    x[0]=rawtohex(to_char(b[0])),
    y[0]=rawtohex(to_char(c[0]))
  )
order by rn

Understanding the model clause is a blog post unto itself, so if you'd like to get to grips with it read this document. Let's put this all together, refactoring the with clause as the output of the above, and start looking at the body of the statement:

with d2 as (
  select 'BAN' ct, '31322E393932333739' x, '37372E363831323937' y from dual union all
  select 'LON' ct, '35312E353130333537' x, '2D2E313136373733' y from dual union all
  select 'NEW' ct, '34302E373835303931' x, '2D37332E393638323835' y from dual union all
  select 'CHI' ct, '34312E383831383332' x, '2D38372E363233313737' y from dual union all
  select 'DEN' ct, '33392E373432303433' x, '2D3130342E393931353331' y from dual
)
select d2.ct,
       httpuritype(
         utl_raw.cast_to_varchar2(
           hextoraw('687474703A2F2F687474703A2F2F777 32E67656F6E616D65732E6F72672F74696D657A6F6E653F6C61743D' || d2.x||'266C6E673D'||d2.y||'26757365726E616D653D73766C61323939')
         )
       ).getxml() req
from d2

The call to UTL_Raw.Cast_To_Varchar2() buried in the HTTPURIType arguments returns (for one row):

http://http://ws.geonames.org/timezone?lat=12.992379&lng=77.681297&username=svla299

An ingenious way to avoid needing to execute "set define off" so the statement doesn't identify the ampersands in the URL as substitution variables! These requests return XML documents including the time zone for each point. The query fetches these using HTTPURIType - an Oracle supplied data type. This takes a HTTP URL, accesses it and returns the content in the format specified in the get* function. In this case GeoNames returns an XML document, so Stelios uses getXML().
This returns XML such as:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<geonames>
  <timezone tzversion="tzdata2015a">
    <countryCode>IN</countryCode>
    <countryName>India</countryName>
    <lat>12.992379</lat>
    <lng>77.681297</lng>
    <timezoneId>Asia/Kolkata</timezoneId>
    <dstOffset>5.5</dstOffset>
    <gmtOffset>5.5</gmtOffset>
    <rawOffset>5.5</rawOffset>
    <time>2015-05-18 15:53</time>
    <sunrise>2015-05-18 05:53</sunrise>
    <sunset>2015-05-18 18:38</sunset>
  </timezone>
</geonames>

The query gets the time zone from these by extracting the value in /geonames/timezone/timezoneId from these documents. Of course, this path is encrypted as a hex string :) At this point the SQL is effectively (with just one XML document for brevity):

select p.column_value.extract('//text()').getstringval() as tz
from (select xmltype('<?xml version="1.0" encoding="UTF-8"?>
<geonames>
  <timezone tzversion="tzdata2015a">
    <countryCode>IN</countryCode>
    <countryName>India</countryName>
    <lat>12.992379</lat>
    <lng>77.681297</lng>
    <timezoneId>Asia/Kolkata</timezoneId>
    <dstOffset>5.5</dstOffset>
    <gmtOffset>5.5</gmtOffset>
    <rawOffset>5.5</rawOffset>
    <time>2015-05-18 15:53</time>
    <sunrise>2015-05-18 05:53</sunrise>
    <sunset>2015-05-18 18:38</sunset>
  </timezone>
</geonames>' ) req
      from dual ) z,
     table(
       xmlSequence(
         z.req.extract(
           utl_raw.cast_to_varchar2(
             hextoraw('2F67656F6E616D65732F74696D657A6F6E652F74696D657A6F6E654964')
           )
         )
       )
     ) p

This returns the <timezoneId> elements as XMLType columns using the now deprecated XMLSequence function. This returns a collection of XMLType, which is unnested using the table() clause. It then gets the name of the time zone using the extract() method on XMLType. With the time zones identified, getting the time for each of these is simply a matter of converting current_timestamp to each zone and formatting the output:

to_char(current_timestamp at time zone tz,'DD-MON-YYYY HH24:MI:SS')

Giving:

NEW_TIME
--------------------------
BAN - 18-MAY-2015 19:02:08
LON - 18-MAY-2015 14:32:08
NEW - 18-MAY-2015 09:32:08
CHI - 18-MAY-2015 08:32:08
DEN - 18-MAY-2015 07:32:08

The final step aggregates these five rows into one using more XML manipulation! From this point the query is "simply":

with d2 as (
  select 'BAN - 18-MAY-2015 19:09:10' new_time from dual union all
  select 'LON - 18-MAY-2015 14:39:10' new_time from dual union all
  select 'NEW - 18-MAY-2015 09:39:10' new_time from dual union all
  select 'CHI - 18-MAY-2015 08:39:10' new_time from dual union all
  select 'DEN - 18-MAY-2015 07:39:10' new_time from dual
)
select rtrim(
         xmltransform(
           sys_xmlagg(sys_xmlgen(new_time)),
           xmltype (
             '<?xml version="1.0"?><xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="/ROWSET/NEW_TIME">
      <xsl:value-of select="text()"/>;
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>'
           )
         ).getstringval(),
         '; '
       ) timezones
from d2

This first converts the times and places back into XML (again!) using the deprecated Sys_XMLGen. The rows are then merged into a single document with Sys_XMLAgg(). In a final flourish, the actual values are pulled out by applying the stylesheet below to the document:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="/ROWSET/NEW_TIME">
      <xsl:value-of select="text()"/>;
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Phew - a true masterpiece of convoluted, gnarly, unreadable SQL!
With a boatload of unnecessary conversions, HTTP calls, use of nearly every XML function going and the model clause, I'm sure you'll agree this is a worthy winner :) Thanks again to Stelios and everyone else who took part in this competition. I look forward to seeing what ingenious methods of cryptic code players can come up with in the next contest!


Obfuscated SQL Contest: Sean Stuber's Entry

Continuing the analysis of the Obfuscated SQL Contest entries, this time we're taking a look at Sean Stuber's query. This is another cunning statement that hides the actual query. This one uses a new feature in 12c, a PL/SQL function defined on-the-fly in a with clause, to convert nonsense into real SQL. The actual query has some impressive tricks, including an elaborate comma-separated-string-to-rows converter and getting the current time via a HTTP request!

Note: while writing this post I noticed that Sean's entry uses UTC for London time which we'd overlooked in the judging process. UTC doesn't take account of daylight savings changes, so we've had to downgrade Sean from runner up to honourable mention.

First we'll look at how Sean converted gibberish into SQL. Simplifying, the driving query is:

with function s return varchar2 is
  s varchar2(1000);
begin
  execute immediate translate (
    encrypted_string,
    q'{"%'()*+,-./123456789:;<=>ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghiklmnopqrstuvwxy|}',
    q'{|yxwvutsrqponmlkihgfedcba_^][ZYXWVUTSRQPONMLKJIHGFEDCBA>=<;:987654321/.-,+*)('%"}'
  ) into s;
  return s;
end;
select s from dual;

This converts an encrypted string into SQL using translate(). This function takes a string and converts each character listed in the second argument to the corresponding character in the third. The q' mechanism is necessary to enable the single quote character to appear in the list without needing additional quotes. Therefore everything listed in the first q'{}' section above is converted to the equivalent character listed in the second, i.e. " becomes |, % becomes y, ' becomes x and so on. Viewing the result of the translate, we see the following (formatted for attempted clarity ;):

with
/* Requires 12c to implement function in WITH clause. Must be run as SYS or SYSTEM in order to avoid needing a network ACL defined. This makes no calls to SYSDATE, SYSTIMESTAMP, CURRENT_DATE or CURRENT_TIMESTAMP to retrieve the time Instead the current time is pulled from the US Naval Observatory's atomic clock */
function o2i(p in varchar2) return interval day to second as
  r interval day to second;
  s integer := case when p like '-%' then -1 else 1 end;
begin
  r := numtodsinterval(s * to_number(substr(p, 2, 2)), 'hour');
  r := r + numtodsinterval(s * to_number(substr(p, 5, 2)), 'minute');
  return r;
end;
a as ( select 'Bangalore,London,New York,Chicago,Denver' a from dual),
b as ( select to_timestamp_tz(substr(regexp_substr(dbms_lob.substr(x), '<BR>.* UTC'), 5), 'Mon. dd, hh24:mi:ss TZR') now from ( select httpuritype('http://tycho.usno.navy.mil/cgi-bin/timer.pl') .getclob() x from dual ))
select replace(a, ',', '; ')
from ( select rtrim( xmlquery( '//x/text()' passing xmlagg(xmlelement("x", b || ' ' || c || ',') order by a desc) returning content ), ',') a from ( select c a, b, to_char(c, '"-" DD-MON-YYYY hh24:mi:ss') c from ( select b, cast((select now from b) + c as timestamp) c from ( select a, b, o2i(tz_offset(c)) c from ( select case length(c) when 9 then 'Asia' || '/' || 'Kolkata' when 8 then 'America' || '/' || replace(c, ' ', '_') when 7 then 'US' || '/' || 'Central' when 6 then nvl(decode(substr(b, 2, 1), 'O', 'UTC'), 'US' || '/' || 'Mountain') end c, a, b from ( select a, substr(b, 1, 3) b, b c from ( select upper(b) b, c a, initcap(b) c from ( select c, a, b from ( select a, b c, regexp_substr(a, '[^,]+', 1, b) b from a full outer join (select to_number(column_value) b from xmltable((select '1' || ' to ' || ((select length( replace( regexp_replace( a, '[^,]', 'x'), 'x', null ) ) from a ) + 1) from dual ) ) ) c on b < length(a) ) ) ) ) ) ) ) ))

Aha! Now we have something that resembles a "normal" select statement! Execute immediate in the PL/SQL function is the method that actually runs this statement. The output of this dynamic SQL is the string of locations and times asked for in the question. So how does the real query work? Let's take a look.

At its core are two queries defined in the with clause. The first simply selects out the locations we're interested in:

select 'Bangalore,London,New York,Chicago,Denver' a
from   dual

The second gets the current time by making a HTTP request to the US Naval Observatory clock!

select to_timestamp_tz(
         substr(
           regexp_substr( dbms_lob.substr(x), '<BR>.* UTC' ),
           5
         ),
         'Mon. dd, hh24:mi:ss TZR'
       ) now
from ( select httpuritype('http://tycho.usno.navy.mil/cgi-bin/timer.pl').getclob() x from dual)

The HTTPURIType request returns a HTML document. This contains (amongst other things) the current time in UTC. With nifty regular expressions and substr operations this is extracted and converted into a timestamp with time zone. At this point it gets tricky ;)

The string of place names (Bangalore,London,New York,Chicago,Denver) is converted to the number four. Not content with something obvious such as counting the number of commas, Sean uses regular expressions to first convert this to a string of "x"s (xxxxxxxxx,xxxxxx,xxxxxxxx,xxxxxxx,xxxxxx). These are then stripped out using replace to leave four commas. These are then "counted" by finding the length of this string!

select length(
         replace(
           regexp_replace( 'Bangalore,London,New York,Chicago,Denver' , '[^,]', 'x'),
           'x',
           null
         )
       )
from dual

The value four is then passed into an XMLTable expression and one added to it. This exploits the XMLTable('1 to <N>') function which creates rows with the values in the range specified. Simplifying, the query is now:

select to_number(column_value) b
from   xmltable((select '1' || ' to ' || ((select 4 from dual ) + 1 ) from dual ) )

This gives us rows numbered one to five. These are then full outer joined back to the string of locations! Each row is assigned a location by using a regular expression to find each text string.
To this point the query is:

select regexp_substr(a, '[^,]+', 1, b) b
from ( select 'Bangalore,London,New York,Chicago,Denver' a from dual)
full outer join ( select to_number(column_value) b from xmltable('1 to 5')) c
on b < length(a)

Which returns:

B
---------
Bangalore
London
New York
Chicago
Denver

The most hideously convoluted way to split elements of a comma separated list into rows I've ever seen! The output of this is passed through a couple more layers of nested selects. These convert the initcap versions of the place names into the three characters necessary for the output (Bangalore => BAN, London => LON, etc.) with some renaming of the columns thrown in for good measure. After this we have the following:

A          B   C
---------- --- ------------
         1 BAN BANGALORE
         2 LON LONDON
         3 NEW NEW YORK
         4 CHI CHICAGO
         5 DEN DENVER

The query then supplies time zones for each location by checking the length of the location in a case statement. Denver and London are the same length, so it performs some extra jiggery-pokery to choose between these based on whether or not the second character is 'O':

case length(c)
  when 9 then 'Asia' || '/' || 'Kolkata'
  when 8 then 'America' || '/' || replace(c, ' ', '_')
  when 7 then 'US' || '/' || 'Central'
  when 6 then nvl( decode(substr(b, 2, 1), 'O', 'UTC'), 'US' || '/' || 'Mountain' )
end

These time zone names are then converted into a time offset from UTC using the TZ_Offset() function. At this point the PL/SQL function defined in the with clause - o2i - comes into play. It takes these time zone strings and converts them to day to second intervals, taking care to ensure that the locations west of the Greenwich Meridian have negative intervals.

function o2i(p in varchar2) return interval day to second as
  r interval day to second;
  s integer := case when p like '-%' then -1 else 1 end;
begin
  r := numtodsinterval(s * to_number(substr(p, 2, 2)), 'hour');
  r := r + numtodsinterval(s * to_number(substr(p, 5, 2)), 'minute');
  return r;
end;

Generating the time for each location is simply a matter of adding these intervals to the current time (as returned by the naval clock). At this point we have the following data:

15/05/2015 16.22:08 BAN - 15-MAY-2015 16:22:08
15/05/2015 10.52:08 LON - 15-MAY-2015 10:52:08
15/05/2015 06.52:08 NEW - 15-MAY-2015 06:52:08
15/05/2015 05.52:08 CHI - 15-MAY-2015 05:52:08
15/05/2015 04.52:08 DEN - 15-MAY-2015 04:52:08

The final step is to convert these rows into a single row and column. Sean does this with XML manipulation. Simplifying, we now have:

with rws as (
  select '15/05/2015 16.22:08' a, 'BAN' b, '- 15-MAY-2015 16:22:08' c from dual union all
  select '15/05/2015 10.52:08' a, 'LON' b, '- 15-MAY-2015 10:52:08' c from dual union all
  select '15/05/2015 06.52:08' a, 'NEW' b, '- 15-MAY-2015 06:52:08' c from dual union all
  select '15/05/2015 05.52:08' a, 'CHI' b, '- 15-MAY-2015 05:52:08' c from dual union all
  select '15/05/2015 04.52:08' a, 'DEN' b, '- 15-MAY-2015 04:52:08' c from dual
)
select rtrim(
         xmlquery(
           '//x/text()'
           passing xmlagg( xmlelement("x", b || ' ' || c || ',') order by a desc )
           returning content
         ),
         ','
       ) a
from rws;

The XMLElement call creates a fragment wrapped in "x" tags (e.g. "<x>BAN - 15-MAY-2015 16:22:08,</x>"). The XMLAgg() function then combines these elements into one document, collapsing our five rows down into one. The <x> tags are removed by using XMLQuery, which extracts the value from these elements. Finally the query removes the trailing comma and converts the rest into semicolons to ensure the output matches the requested format!
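As an aside, had readability rather than obfuscation been the goal, this whole row-to-string step could be a single LISTAGG - a quick sketch of my own against a cut-down version of the rws data above:

with rws as (
  select '15/05/2015 16.22:08' a, 'BAN' b, '- 15-MAY-2015 16:22:08' c from dual union all
  select '15/05/2015 10.52:08' a, 'LON' b, '- 15-MAY-2015 10:52:08' c from dual
)
select listagg(b || ' ' || c, '; ') within group (order by a desc) s
from   rws;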
Nice work Sean! This is the only solution that didn't rely on in-built functions to get the time. Combined with hiding the real SQL, this featured many of the aspects we were looking for - little used and misused features. In the next post we'll analyze the winning solution from Stelios Vlasopoulos.


Obfuscated SQL Contest: James Su's Entry

As promised in the announcement of the winners for the Obfuscated SQL Contest we're dissecting the queries of the winners to make sense from the madness! In this post we'll look at James Su's entry. The judges all loved the elegantly simple complexity of this statement. At its core is a query that just selects a blank line made up of tabs and spaces along with the tab character:

SELECT ' '||' '||' ' S, chr(9) b FROM DUAL

How could this possibly enable us to return the times? Via cunning string manipulation!

Firstly the query generates 396 rows using "connect by level". In a cool flourish, the number of rows is determined using RegExp_Count to find the number of tabs hidden in the string and dividing this by three. Each row is given a numeric value, assigned using gnarly instr manipulation of the line of spaces and tabs. There are three instr sections:

(INSTR(S,b,1,3*LEVEL-2)- DECODE(LEVEL,1,0, INSTR(S,b,1,3*LEVEL-3))-1)*25 + --i25
(INSTR(S,b,1,3*LEVEL-1)-INSTR(S,b,1,3*LEVEL-2)-1)*5 +                      --i5
INSTR(S,b,1,3*LEVEL)-INSTR(S,b,1,3*LEVEL-1)-1 + 32                         --i32

The i25 section returns multiples of 25 - 0, 25, 50 or 75, i5 returns multiples of five up to 20 - 0, 5, 10, 15, 20 - and i32 returns numbers in the range 32-36. It then adds these values together. This is then passed into the chr() function, which returns the character corresponding to the ascii value provided. The result of this is rows containing letters. These are then pivoted to a single row using the ListAgg() operator, which gives the following (formatted for clarity):

SELECT 'BAN - ' || TO_CHAR(A AT TIME ZONE 'ASIA/CALCUTTA', B) || '; ' ||
       'LON - ' || TO_CHAR(A AT TIME ZONE 'EUROPE/LONDON', B) || '; ' ||
       'NEW - ' || TO_CHAR(A AT TIME ZONE 'AMERICA/NEW_YORK', B) || '; ' ||
       'CHI - ' || TO_CHAR(A AT TIME ZONE 'AMERICA/CHICAGO', B) || '; ' ||
       'DEN - ' || TO_CHAR(A AT TIME ZONE 'AMERICA/DENVER', B) S
FROM (SELECT SYSTIMESTAMP A, 'DD-MON-YYYY HH24:MI:SS' B FROM DUAL)

Now we can see what's going on! James has created a query that selects the current time and a date format, then passes these to to_char to get the appropriate output. This SQL statement is executed using the DBMS_XMLGen package. The GetXMLType function accepts a SQL query and returns an XMLType document. The document contains the result of the query. In this case we have (formatted for clarity again):

<ROWSET>
  <ROW>
    <S>BAN - 12-MAY-2015 14:30:50; LON - 12-MAY-2015 10:00:50; NEW - 12-MAY-2015 05:00:50; CHI - 12-MAY-2015 04:00:50; DEN - 12-MAY-2015 03:00:50</S>
  </ROW>
</ROWSET>

From here getting the desired output is simply a matter of using substr to return the times from within the <S> element. Overall a fantastic solution that scored highly on the "how on earth...?!" factor. The judges felt that the select statement passed into DBMS_XMLGen is a little too simple to take the top spot however. Great work James! In the next post we'll look at Sean Stuber's entry.
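If you've not come across DBMS_XMLGen before, here's a minimal sketch of the building block James relies on (the query and column alias are my own, not from his entry):

select dbms_xmlgen.getxmltype('select sysdate today from dual') x
from   dual;

This returns a <ROWSET>/<ROW> document holding the query's results, just like the one shown above.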


Analytical SQL

Obfuscated SQL Contest Winners!

Thanks to everyone who entered the Obfuscated SQL Contest! This ran on the PL/SQL Challenge from March - April. It asked players to produce ingenious, creative, quirky and downright weird SQL to return the current time in the timezones Bangalore, London, New York, Chicago and Denver. There were 53 entries, many of which provoked serious head scratching amongst the judges to determine how these queries actually work! After much deliberating in an exceptionally close-run contest the judges have chosen the following entries as winners:

Overall winner: Stelios Vlasopoulos
Runner up: James Su
Honourable mention: Sean Stuber, Anton Scheffer, Laurent Schneider, TonyC

These entries all fully got into the spirit of the contest, producing ingenious and convoluted SQL. Let’s look at these in more detail to see why:

Stelios

A beast of a query that uses a huge number of features, abuses HTTP calls and has convoluted expressions “just because”! The features used by this query include:

Model Clause
Numerous unnecessary string, number and hex conversions
With clause
Multiple regular expressions
A HTTP call to get the timezone
XML parsing
Row generation using "connect by level" and "group by cube"
Unnecessary calculations
Cross join

Sean

An ingenious solution with an encrypted SQL string that's decrypted using inline PL/SQL in SQL. Key feature (mis)use:

HTTP call to get the time
Inline PL/SQL
Execute immediate to convert "junk" to real SQL
XML parsing
Full outer join

James

An elegantly complex solution that hides the actual SQL statement in a string of spaces!

A string of spaces and tabs is converted into an actual select statement via string manipulation
The actual query is executed and converted to XML using DBMS_XMLGEN.getxmltype

All three are brilliant examples of obscure SQL. Congratulations to Stelios, Sean and James! You can view these entries at the bottom of this post. If you'd like to view all the entries head here. Stay tuned for blog posts dedicated to each of the winners' entries, dissecting them in detail to explain how they work.

In addition to these three, there are several other high quality entries worthy of mention.

Anton Scheffer

A truly convoluted select statement that includes a bucket list of features. In addition to many already mentioned it also includes:

Analytic functions
MULTISET operators
LNNVL
DBMS_DEBUG_VC2COLL to convert a list to an array
Calls to DBMS_RANDOM!

With all these features this would be a strong contender for the title. Unfortunately this doesn't always produce the correct output, so lost out for not fully meeting the requirements.

Laurent Schneider

Another devious query using a stack of features. Those not already listed above include:

JSON manipulation
Rows generated using SYS.ODCIVARCHAR2LIST
Different methods of timezone naming
A recursive inline PL/SQL function

This also failed to meet the requirements. The time zones for Chicago and Denver aren't right, so the output is wrong for these places. With some small tweaking this would also be in the running for the top spots.

TonyC

Uses common table expressions (WITH clause) combined with mind-bending string manipulation to extract the dates. Key features not highlighted in other solutions:

Reverses a string with the timezones listed
Calls WM_CONCAT to convert rows into a string

Thanks again to everyone who took part and helped make this an enjoyable contest. We'll run more of these in future, so stay tuned!

NOTE 15 May 2015: Sean's entry was originally selected as a runner up.
Upon further analysis the judges noticed that it uses UTC for London time, meaning it doesn't take account of daylight savings. Consequently this entry was downgraded to honourable mention. Here are the winning statements: Stelios' SQL with d1 as ( select regexp_substr(z,'[^,]+',1,3*level-2) a, regexp_substr(z,'[^,]+',1,3*level-1)*power(1e1,-6) b, regexp_substr(z,'[^,]+',1,3*level)*(case when mod(3*level,3)=0 and rownum > 1 then -1 else 1 end)* power(1e1,-6) c from ( select to_char(regexp_substr (x.a,'\d{23}',1,rn),'FM999999G99999999G999999999', 'NLS_NUMERIC_CHARACTERS=.,') z from ( select utl_raw.cast_to_varchar2(utl_raw.bit_xor(hextoraw('5B4F5D50 4E3831323939323337393037373638313239373736373937383531353130333 537303030313136373733373836393837343037383530393130373339363832 383536373732373334313838313833323038373632333137373638363937383 339373432303433313034393931353331'),hextoraw(trim(to_char(470 188516729,'XXXXXXXXXX'))))) a from dual ) x cross join (select rownum rn from (select 1 num from dual group by cube (1,2,3)) where rownum<=sqrt(25))) y connect by level <= (length(regexp_replace(z,'[^,]+')) + 3)/3 ), d2 as (select ct, x, y from d1 model partition by (rownum rn) dimension by (0 dim) measures (cast( a as varchar2(65)) as ct, a, cast(b as varchar2(30)) as x, b, cast(c as varchar2(30)) as y, c) rules upsert iterate (3) ( ct[0]=(case when iteration_number = 0 then null else ct[0] end )|| chr(regexp_substr(a[0],'\d{2}',1,iteration_number+1)), x[0]=rawtohex(to_char(b[0])),y[0]=rawtohex(to_char(c[0])) ) order by rn ) select rtrim(xmltransform (sys_xmlagg(sys_xmlgen(new_time)), xmltype ( '<?xml version="1.0"?><xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="/"> <xsl:for-each select="/ROWSET/NEW_TIME"> <xsl:value-of select="text()"/>; </xsl:for-each> </xsl:template> </xsl:stylesheet>' ) ).getstringval(),'; ') timezones from ( select ct ||' - '||to_char(current_timestamp at time zone tz,'DD-MON-YYYY HH24:MI:SS') new_time from ( select z.ct, p.column_value.extract('//text()').getstringval() as tz from (select d2.ct, httpuritype(utl_raw.cast_to_varchar2(hextoraw('68747 4703A2F2F687474703A2F2F77732E67656F6E616D65732E6F72672F74696D657 A6F6E653F6C61743D' ||d2.x||'266C6E673D'||d2.y||'26757365726E616 D653D73766C61323939'))).getxml() req from d2 ) z, table(xmlSequence(z.req.extract(utl_raw.cast_to_varchar2 (hextoraw('2F67656F6E616D65732F74696D657A6F6E652F74696D65 7A6F6E654964'))))) p ) ) / Sean's SQL --Requires 12c to implement function in WITH clause. --Must be run as SYS or SYSTEM and database server must have access to internet --The internet call is safe, but please do reverse engineer the text to evaluate the inner workings. 
WITH FUNCTION s RETURN VARCHAR2 IS s VARCHAR2(1000); BEGIN EXECUTE IMMEDIATE TRANSLATE( q'{HVKW pu M:.*6-:, on< +1 63/4:3:2+ 9*2<+612 62 HVKW <4>*,:q R*,+ =: -*2 >, LFL 1- LFLKZR 62 1-;:- +1 >)16; 2::;628 > 2:+(1-5 _]S ;:962:;q K76, 3>5:, 21 <>44, +1 LFL[_KZs LFLKVRZLK_ROs ]JMMZQKA[_KZ 1- ]JMMZQKAKVRZLK_RO +1 -:+-6:): +7: +63: V2,+:>; +7: <*--:2+ +63: 6, /*44:; 9-13 +7: JL Q>)>4 P=,:-)>+1-%x, >+136< <41<5 up YJQ]KVPQ 1n6w/ VQ I_M]W_Mnv MZKJMQ VQKZMI_S [_F KP LZ]PQ[ _L - VQKZMI_S [_F KP LZ]PQ[d , VQKZXZM eb ]_LZ HWZQ / SVTZ xryx KWZQ ro ZSLZ o ZQ[d ^ZXVQ - eb QJRKP[LVQKZMI_Sw, u KPAQJR^ZMwLJ^LKMw/s ns nvvs x71*-xvd - eb - t QJRKP[LVQKZMI_Sw, u KPAQJR^ZMwLJ^LKMw/s ks nvvs x362*+:xvd MZKJMQ -d ZQ[d > _L wLZSZ]K x^>28>41-:sS12;12sQ:( F1-5s]76<>81s[:2):-x > YMPR [J_Svs = _L wLZSZ]K KPAKVRZLK_ROAKEwLJ^LKMwMZXZGOALJ^LKMw[^RLASP^qLJ^LKMw'vs xc^Maqu JK]xvs kvs xR12q ;;s 77nle36e,, KEMxv 21( YMPR wLZSZ]K 7++/*-6+%/:wx7++/epp+%<71q*,21q2>)%q364p<86r=62p+63:-q/4xvq8:+<41=wv ' YMPR [J_Svv LZSZ]K MZOS_]Zw>s xsxs xd xv YMPR wLZSZ]K MKMVRw GRSNJZMFwxpp'p+:'+wvx O_LLVQX GRS_XXwGRSZSZRZQKw|'|s = "" x x "" < "" xsxv PM[ZM ^F > [ZL]v MZKJMQVQX ]PQKZQKvs xsxv > YMPR wLZSZ]K < >s =s KPA]W_Mw<s x|r| [[rRPQrFFFF 77nle36e,,xv < YMPR wLZSZ]K =s ]_LKwwLZSZ]K 21( YMPR =v t < _L KVRZLK_ROv < YMPR wLZSZ]K >s =s 1n6wKEAPYYLZKw<vv < YMPR wLZSZ]K ]_LZ SZQXKWw<v HWZQ f KWZQ x_,6>x "" xpx "" xT145>+>x HWZQ g KWZQ x_3:-6<>x "" xpx "" MZOS_]Zw<s x xs xAxv HWZQ h KWZQ xJLx "" xpx "" x]:2+->4x HWZQ i KWZQ QISw[Z]P[ZwLJ^LKMw=s ns ovs xPxs xJK]xvs xJLx "" xpx "" xR1*2+>62xv ZQ[ <s >s = YMPR wLZSZ]K >s LJ^LKMw=s os mv =s = < YMPR wLZSZ]K JOOZMw=v =s < >s VQVK]_Ow=v < YMPR wLZSZ]K <s >s = YMPR wLZSZ]K >s = <s MZXZGOALJ^LKMw>s xDBsCtxs os =v = YMPR > YJSS PJKZM UPVQ wLZSZ]K KPAQJR^ZMw]PSJRQAI_SJZv = YMPR GRSK_^SZw wLZSZ]K xox "" x +1 x "" w wLZSZ]K SZQXKWw MZOS_]Zw MZXZGOAMZOS_]Zw >s xDBsCxs x'xvs x'xs QJSSvv YMPR >v t ov YMPR [J_Svvv < PQ = c SZQXKWw>vvvvvvvvvv}', q'{"%'()*+,-./123456789:;<=>ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_abcdefghiklmnopqrstuvwxy|}', q'{|yxwvutsrqponmlkihgfedcba_^][ZYXWVUTSRQPONMLKJIHGFEDCBA>=<;:987654321/.-,+*)('%"}' ) INTO s; RETURN s; END; SELECT s FROM dual James' SQL (note: ensure spaces and tabs are preserved or this solution won't work) SELECT SUBSTR(s,INSTR(s,'<S>')+3,INSTR(s,'</S>')-INSTR(s,'<S>')-3) FROM ( SELECT DBMS_XMLGEN.getxmltype(LISTAGG(CHR((INSTR(S,b,1,3*LEVEL-2)-DECODE(LEVEL,1,0,INSTR(S,b,1,3*LEVEL-3))-1)*25 +(INSTR(S,b,1,3*LEVEL-1)-INSTR(S,b,1,3*LEVEL-2)-1)*5 +INSTR(S,b,1,3*LEVEL)-INSTR(S,b,1,3*LEVEL-1)-1 +32)) WITHIN GROUP(ORDER BY LEVEL))S FROM ( SELECT ' ' ||' ' ||' ' S,chr(9) b FROM DUAL ) CONNECT BY LEVEL<=REGEXP_COUNT(S,b)/3 );


Optional Search Parameters: Getting Rid of Those Pesky OR Conditions

In the PL/SQL Challenge optimization series, we saw how using OR conditions of the form:

(:bind_var is null or :bind_var = column)

caused the optimizer headaches. Simply removing these from the query led to huge performance gains. This was because these conditions caused the optimizer to have terrible row estimates. Consequently access methods and join orders were "wrong". There are many circumstances where you’re not able to remove optional search parameters. I neglected to say what you should do in these situations. Let's take a look at some alternatives.

The first method is to simply rewrite the OR condition as:

column = nvl(:bind_var, column)

NVL is a function that returns the second parameter if the first is null. Therefore this is saying: return true if the column is equal to the bind variable value. If the bind variable is null, check the column is equal to itself (which is always true provided the column has no nulls). What happens when we plug this into a query? Let's start with just one optional bind variable:

select * from qdb_questions
where  question_id = nvl(:q_id, question_id);

Autotrace gives us the following output for this:

A concatenation with two filters?! What's going on here? Oracle has effectively rewritten this query as two statements:

select * from qdb_questions
where  question_id = :q_id;

and:

select * from qdb_questions
where  question_id = question_id
and    :q_id is null;

It's then combined the output of these two approaches (via a concatenation). This enables it to use a nice efficient index when a bind value is supplied and a full table scan when one isn't. Neat! What happens when we extend the NVL approach to more optional parameters?

select * from qdb_questions
where  author_id = nvl(:a_id, author_id)
and    minimum_version_id = nvl(:m_id, minimum_version_id)
and    question_id = nvl(:q_id, question_id);

This gives us eight possible combinations for whether the bind variables are null or not. Ideally Oracle will create all these possible combinations for us, giving us eight filter steps in the plan. What does autotrace show us?

Hmmm, there are only two filters. Oracle doesn't create all the combinations for us. For two of the binds we're back to the original problem with row estimates. We could create all the options manually, writing out each combination for whether a bind is null or not then gluing these together with union all. Doing so gives a query that looks like this (going with two variables for brevity):

select * from qdb_questions
where  author_id = :a_id
and    question_id = :q_id
union all
select * from qdb_questions
where  :a_id is null
and    question_id = :q_id
union all
select * from qdb_questions
where  author_id = :a_id
and    :q_id is null
union all
select * from qdb_questions
where  :a_id is null
and    :q_id is null;

The idea here is Oracle will look at whether or not the bind variables are null and then only process data in the one matching block. Does this bear out in practice? Let's execute the query passing in an author id:

Not exactly. Notice that the two highlighted index range scans both processed 75 rows. Oracle only did the check to see whether or not :Q_ID is null after accessing the data. The plan has done more work than strictly necessary. Each additional variable we add potentially doubles the number of unnecessary steps where we still process data. In this case the effect was small. We're only accessing one table though.
In a real system it's likely that we'll join several tables together (the query on the PL/SQL Challenge uses nine!), making the impact of partly executing an irrelevant subquery greater. Our attempt to outsmart the optimizer hasn't worked. In the process we've created a query that's harder to write and difficult to understand at a glance. If UNION ALL is out, what can we do?

Dynamic SQL! This is where you build up your SQL statement on the fly, based on whether or not input parameters are null. The pseudo code for this looks like (there's a fuller, runnable sketch at the end of this post):

sql := 'select * from table where 1=1';

if var_1 is not null then
  sql := sql || ' and column1 = :var_1';
end if;

if var_2 is not null then
  sql := sql || ' and column2 = :var_2';
end if;

The end result of this is you’ll have a different SQL statement for each combination of parameters instead of one generic query. There are several benefits to this approach. Not only are we going to avoid the issues described above, it makes slow performing combinations easy to spot using the average execution time for the queries. It also makes it easier to build workarounds for these slow options when others perform well. Therefore if you're building a search screen with optional parameters I recommend you use dynamic SQL. You can find more details on how to do this here and here.

It's worth noting that using dynamic SQL can lead to memory issues. Each individual statement is stored in the library cache when it's executed. Therefore having a large number of queries could cause you to run out of space. If you start hitting library cache issues you may need to make some tradeoffs between dynamic SQL and the other methods described above. Personally I would start by writing dynamic SQL unless there's already memory issues in production.

UPDATE 10 Dec 2018: Added caveat for nulls when checking if a column equals itself.
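To flesh out that pseudo code, here's one way you could write it in PL/SQL. This is a sketch of my own (the function wrapper and the always-true else branches are not from the original post); keeping a placeholder for every parameter in each generated statement means the using clause can always bind both values:

create or replace function search_questions (
  p_author_id   integer,
  p_question_id integer
) return sys_refcursor as
  l_sql varchar2(4000) := 'select * from qdb_questions where 1=1';
  l_cur sys_refcursor;
begin
  -- add a real filter when the parameter is supplied; otherwise add an
  -- always-true predicate that still references the bind, so the bind
  -- count stays constant without confusing the optimizer
  if p_author_id is not null then
    l_sql := l_sql || ' and author_id = :a_id';
  else
    l_sql := l_sql || ' and (1=1 or :a_id is null)';
  end if;

  if p_question_id is not null then
    l_sql := l_sql || ' and question_id = :q_id';
  else
    l_sql := l_sql || ' and (1=1 or :q_id is null)';
  end if;

  open l_cur for l_sql using p_author_id, p_question_id;
  return l_cur;
end search_questions;
/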


Tricks, Explosions and SQL! The Magic of SQL Channel Launches on YouTube

It's said that any sufficiently advanced technology is indistinguishable from magic. Despite its age, the Oracle database contains a range of advanced features and options. Many of these have magical abilities enabling you to build high performing, scalable applications with less code. As with all magic these features can appear mystifying - until you know how it's done! I'm here to help you work with SQL so you can manipulate data easily and create fantastic applications your users love.

With this in mind I've launched a new channel on YouTube: "The Magic of SQL"! Here you'll find cool tricks you can do when using SQL on Oracle. Unlike real magicians however I'll share the secrets with you so you can perform your own SQL magic. :)

The first video, "The Column Update Mystery", is now available here. This is a very different style to traditional Oracle (or other technology) videos. I'd love to hear your thoughts on the concept. Did it excite you? Do you understand the purpose of the technology? Most importantly, as I create more of these - will you watch them? Please share your opinions in the comments below or - if you prefer - you can contact me directly at chris.saxon@oracle.com. If you enjoyed it please share it with your fellow developers, DBAs and architects.

There are many more of these SQL magic trick videos in the pipeline, so subscribe to the channel so you never miss a trick! ;) Thanks for reading. If you've not watched the Column Update Mystery yet, head here to bring a little SQL magic into your day!


PL/SQL

Optimizing the PL/SQL Challenge III: Too Many AND Clauses Spoil the Plan

In the previous article in this series we were looking at the performance of the Library page query on the PL/SQL Challenge. We identified that the main culprits were two table accesses. Together these contributed to nearly 90% of the work (buffer gets) performed by the query. Having identified this the next question is how do we fix it? There's a surprisingly simple change that results in huge performance gains. If you're a fan of adding more clauses to a query to improve performance the answer may surprise you.

Let's look at the autotrace output again:

The numbers in the columns from left to right are:

(Estimated) Cardinality - how many rows Oracle expects to return from each step
Last CR Buffer Gets - how many consistent gets were needed to execute each step (note for joins this includes the gets performed by all steps feeding into the join)
Last Output Rows - how many rows were actually returned from each part of the plan
Last Starts - how many times the section of the plan was executed

There are a couple of important items to note here:

The optimizer's estimates are way out - it's expecting one row for most steps of the plan. All steps in this part of the plan return at least 100 rows and many return several thousand.
All the joins are nested loops.

In a nested loop every row in the first rowset is compared to every row in the second rowset. Nested loops are efficient when joining small numbers of rows. As they compare every row in the first set to every row in the second they don't scale as the number of rows joined grows. This is the reason the two table accesses we identified earlier are so inefficient. As we can see from the last starts column, both of these steps are executed over 100,000 times. We're averaging just over two gets per execution (divide the total buffer gets by the number of executions), so each individual execution is efficient - we're just doing a huge number of them!

Oracle has chosen nested loops because it's expecting to fetch a small number of rows at each point in the plan. So the real question we need to ask is: how can we get the optimizer's estimates to be closer to the actual number of rows returned?

To improve the accuracy of these figures, we should go to the first point in the execution plan where the estimated and actual rows diverge. We don't need to look far - this is happening at the very first step! The first operation is a full table scan. It's expecting this to return just one row, when in fact this operation returns 3,358 rows. To see how Oracle could get this so wrong, let's take a look at this in more detail:

In the query these conditions are:

and ( v.minimum_version_id = :p650_minimum_version_id
      or :p650_minimum_version_id is null)
and ( :p650_question_id is null
      or v.question_id = :p650_question_id)
and ( author_id = :p650_author_id
      or :p650_author_id is null)

We have three bind variables that may or may not have a value set. When they're null we want to ignore them. When they're not null we want to filter the table based on their values. So each OR clause either returns all the rows in the table (when the bind variable is null) or the fraction of the table that matches the restriction (1/number of distinct values in the column). Unfortunately the optimizer doesn't see it this way. It looks at the :bind_var is null predicate and guesses! Oracle thinks each of the is null checks will return 5% of the rows in the table.
It then combines this 5% guess with the percentage that would be returned when the bind variable equals the target column, using standard (non-mutually exclusive) probability logic. Therefore for each bind variable the optimizer is either significantly underestimating how many rows the OR branch will return (when it's null) or overestimating (when it's not). Having estimated a percentage of the rows for each OR, Oracle then multiplies the three fractions together. As multiplying percentages less than 100 results in a smaller number, doing so reduces the estimated rows. The end result in this case is it expects to return just 0.02% of the rows in the table - less than one row!

What can we do about this? At this stage it's useful to ask what the purpose of each bind variable is. These all had null values in the instances of I/O errors players reported. So we're in the situation where Oracle is significantly underestimating the number of rows.

QUESTION_ID is the primary key of the QDB_QUESTIONS table. Entering this returns the one quiz the player asked for. In this situation the player knows exactly which quiz they want. Rather than showing them the results page and making them click again to access the quiz, a better solution would be to take them directly to the quiz page when supplying a question id. Therefore we can remove this from the query and give players a better experience (fewer clicks to get to their chosen question).

MINIMUM_VERSION_ID is a filter allowing players to specify the minimum release of Oracle that quizzes will run on (e.g. 11g or higher). While potentially useful, we could try removing this option from the report. If there's significant benefit to removing it then it may be better to disable this functionality or come up with an alternative if players have a strong need for it.

AUTHOR_ID selects all the quizzes written by a given person. This value isn't actually selectable from the Library page itself. A URL is shown at the top of the Library page enabling players to see all the quizzes for a given person. A better solution here would be to remove this parameter from the Library page. We could then create a separate page listing individual players' quizzes, excluding the other input parameters.

Therefore we can try removing all three of these conditions from the query. What happens when we do this? Wow! The filter operation has dropped from ~460,000 buffer gets to ~55,000!

Let's take stock. All we've done so far is remove three OR conditions from the query! This led to an ~88% reduction in consistent gets (work) done by the query and a 95%+ reduction in runtime on my local VM. The reason this has happened is that Oracle's estimates for the rows returned by various steps in the plan are closer to the actual number of rows. As a result it's chosen a more appropriate plan.

To recap: when Oracle sees (:bind_var is null or :bind_var = column) predicates, it's essentially guessing how many rows the filter will return. In many cases this guess is badly wrong. For each AND clause on a table, the optimizer multiplies together all the (expected) fractions of the rowset returned for each clause. This will often result in it estimating a significantly smaller number of rows returned than is actually the case. Key takeaway: avoid (:bind_var is null or :bind_var = column) conditions!

We've seen huge gains already from trivial optimizations. The filter operation is still consuming ~55,000 buffer gets to return just 104 rows, suggesting there are still significant improvements we can make.
That first sort operation that does ~16,500 gets for just 25 rows isn't looking too good either. We'll address these in a future blog post.
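For reference, here is a minimal sketch of the predicate shape discussed above, using a hypothetical table and bind names rather than the actual PL/SQL Challenge schema. The fix described in this post was simply to remove these branches; rewriting the statement without the "is null" escape hatches (for example by building the query dynamically) is a common alternative, but that's not what we did here.

-- Hypothetical illustration of the anti-pattern discussed above.
-- Each OR branch either matches every row (bind is null) or a small
-- fraction of rows (bind = column). The optimizer guesses a fixed
-- percentage for each "is null" check and multiplies the fractions
-- together, so the combined estimate ends up far too low.
select *
from   quizzes q
where  ( :min_version_id is null or q.minimum_version_id = :min_version_id )
and    ( :question_id    is null or q.question_id        = :question_id )
and    ( :author_id      is null or q.author_id          = :author_id );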

In the previous article in this series we were looking at the performance of the Library page query on the PL/SQL Challenge. We identified that the main culprits were two table accesses. Together...

PL/SQL

Optimizing the PL/SQL Challenge II: How to Figure Out the Root of the Problem

In the previous article in this series, we looked at optimizing some function calls on the Library page for the PL/SQL Challenge. This improved the performance of the function calls so we could re-enable functionality to display whether there are new comments on a quiz in the report. It was just scratching the surface of the issue however. Many users were still reporting IO errors when using the Library report. To get meaningful improvements we needed to dig deeper. The report is generated from one (large) SQL statement. We needed to attack the main query itself.

It's at this point that I like to turn to autotrace to see what's going on. Why autotrace over an explain plan? A couple of reasons. Firstly, an explain plan is exactly that - a plan. Oracle may choose a different path when executing a query than is shown in the explain. Autotrace shows the actual execution plan it used. Secondly, we can see how many rows each step returned, how this compares to the estimated rows and how many buffer gets (work) each step required. You can get similar functionality by using SQL trace. This requires you to access files on the server, whereas you can run autotrace from your client - a key benefit for many of us who don't have access to the physical server. SQL trace does have advantages over autotrace (displaying wait events is a key one). For a first pass at tuning the query, autotrace is good enough.

So I ran the query in SQL*Plus with autotrace enabled (set autotrace trace) and got the following: Ouch! A 93 step execution plan! With so much going on, where do we start looking? This is where I find SQL Developer's autotrace functionality exceptionally useful. You can expand or collapse sections of the tree to quickly see which are returning the most rows and consuming the most buffer gets. Collapsing down the top level operations, we see the following: At this point it's clear where we need to direct our efforts: the FILTER operation consumes over 460,000 buffer gets to return just 107 rows. That's over 4,000 reads per row! Let's expand this section to see what's going on: Aha! We have two table accesses that require over 200,000 buffer gets each! This makes up the bulk of our ~460,000 gets for the filter. These steps are clearly doing the most work in the query, so we should direct our tuning efforts to figuring out how to reduce or eliminate these operations.

Now we've identified where to direct our efforts, it's time to start looking at the query in more detail to see how we can improve it. Stay tuned for a future blog post explaining how!
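If you'd like to reproduce this kind of analysis yourself, here's a rough sketch of the commands involved. The autotrace settings match what's described above; the DBMS_XPLAN call at the end is a standard alternative we didn't use in this post, and the query it profiles is just a placeholder.

-- SQL*Plus: show the plan and execution statistics without
-- printing all the rows the query returns.
set autotrace traceonly
set timing on
-- (run the Library page query here)

-- Alternative (not used in this post): turn autotrace off, run the
-- statement with the gather_plan_statistics hint, then report
-- estimated vs actual rows and buffer gets per plan step for the
-- last execution.
set autotrace off
select /*+ gather_plan_statistics */ count(*)
from   some_table;   -- placeholder query

select *
from   table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));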

In the previous article in this series, we looked at optimizing some function calls on the Library page for the PL/SQL Challenge. This improved the performance of function calls so we could re-enable...

Big Data SQL

Reflections on Collaborate 15

Last week I was in Las Vegas for Collaborate 15. Here's a summary of my experiences from a week in Sin City.

I gave two talks at the event - a quick tip comparing indexes to M&Ms and an hour-long session about using edition-based redefinition with PL/SQL. These seemed to go over well, with the audience having many questions at the end, which I take as a good sign (if you'd like the slides for these you can find them here). Bryn Llewellyn - product manager for EBR - was in the audience for the EBR talk, which came in handy for the Q&A as he was able to answer some things I didn't know (e.g. the maximum number of editions in a database is 2,000 for 11g. It's been increased for 12c).

I was also part of the panel for the codetalk session Steven arranged before he had to pull out. This proved to be better than expected. We ripped apart some gnarly code, discussing ways it could be improved (head here to see the code in question). I was joined by fellow advocates Dan and Blaine along with Jared Still - a consultant at Pythian. While there was a small audience - due, I suspect, to a combination of it starting at 8am, Steven no longer attending and it only appearing on the schedule late on - those attending did get involved, despite the oversized room. We were in one of the large ballrooms; had it been full I think people at the back would have struggled to take part. It's a format that shows promise and I look forward to doing more of these in the future.

The biggest issue for me was deciding which sessions to attend. Collaborate is in effect three conferences rolled into one (IOUG, OAUG and Quest), so in any given slot there were up to 50 concurrent talks! Even narrowing this to the IOUG presentations still normally left over 10 sessions to choose from. With so much choice, I found the session planner hard to navigate. Others commented on this too, so this looks like one area the conference could improve next year. Despite the size, many of the sessions I attended were only about half full. I'm not sure whether this is a result of having too many concurrent sessions or just me choosing less popular talks - apparently some, such as the one on OpenLDAP, were standing room only!

I managed to battle my way through the planner though and find some great talks. My favourite was Dan McClary's, where he explained how Oracle's Big Data SQL product works. This is an area I've not played with yet, so I found it very useful for getting up to speed. It's certainly a feature I plan on spending more time with.

I have to say thanks to the folks at Solar Winds for inviting me to their shindig on Monday night. This proved to be a great chance to meet new people over a few drinks. :)

The trip was also a great chance to meet the newest member of the Developer Advocate team - Blaine Carter. He's focusing on open source development and helping show that we here at Oracle are happy to work with open source products. He's got some fantastic ideas, so keep a lookout for content from him over the coming year.

Overall I had a great time, learning about Big Data SQL and some 12c features I'm not fully up to speed on, as well as meeting some great people and catching up with my colleagues. Viva, Las Vegas!

Last week I was in Las Vegas for Collaborate 15. Here's a summary of my experiences from a week in Sin City. I gave two talks at the event - a quick tip comparing indexes to M&Ms and an hour long...

Converting Common Table Expressions from SQL Server to Oracle

A colleague recently posed the following question: how do I convert the following SQL Server query to Oracle?

with cte as (
  select dateadd(mm, -1, convert(date, getdate())) as dt
  union all
  select dateadd(d,1,cte.dt) from cte where dateadd(d,1,cte.dt) <= getdate())
select * from cte;

This is a great question. It's a trivial query that demonstrates the difficulty of obtaining "database independence". Most aspects of it need updating for it to work in Oracle. Let's break it down.

The first (and most important!) question is: what is this query doing? My colleague stated this is used to dynamically add dates to a query - i.e. it's a row generator, in this case for dates. Digging further, we can see that it uses common table expressions (aka CTEs) to generate the values. getDate() returns the current date. dateAdd adds the interval of time specified in the first parameter to a date. "mm" is month, so we're subtracting one month from the current date. This gives us our base condition: today's date, one month ago. So on 1st Feb this returns 1st Jan. The recursive part of the query is adding one day ("d") for each row in the set, up to the current date. We now have the goal: write a query that generates all the dates from one month ago until today.

Now that we've established purpose, we can start digging into the detail. The first issue is the initial select doesn't have a from clause. While this is valid in SQL Server, it's not in Oracle. You always have to specify a from clause. We only want the query to return one row, so we can put dual to good use:

with cte as (
  select dateadd(mm, -1, convert(date, getdate())) as dt from dual
  union all
  select dateadd(d,1,cte.dt) from cte where dateadd(d,1,cte.dt) <= getdate())
select * from cte;

The next step is to fetch the current date. We need to replace those getDate() calls with sysdate. The SQL Server version also has a convert(date, getDate()) function, which returns the date with no time component. The equivalent in Oracle is trunc(). Plugging these in gives:

with cte as (
  select dateadd(mm, -1, trunc(sysdate)) as dt from dual
  union all
  select dateadd(d,1,cte.dt) from cte where dateadd(d,1,cte.dt) <= sysdate)
select * from cte;

We're getting closer. The next step is the date arithmetic. Oracle doesn't have a dateAdd function. Instead it goes with the (much easier IMO) approach of allowing you to add days to dates directly with standard arithmetic, e.g. date + days. The dateAdd function differs when it comes to months though. It doesn't add or subtract 30 days, it changes the date to the equivalent day in the target month. For example, subtracting one month from 1 March gives 1 February. At the end of the month it goes to the last day of the previous month however - e.g. from 31 March it will go to 28 February, or 29 in leap years. To get this functionality in Oracle we use add_months(date, number). Now we have:

with cte as (
  select add_months(trunc(sysdate), -1) as dt from dual
  union all
  select cte.dt+1 from cte where cte.dt+1 < sysdate)
select * from cte;

We're getting close, but executing this gives the following error:

ORA-32039: recursive WITH clause must have column alias list

In Oracle, you need to specify the columns that form the recursive table in the table name.
Putting it all together gives us the following query: with cte (dt) as ( select add_months(trunc(sysdate), -1) as dt from dual union all select cte.dt+1 from cte where cte.dt+1 < sysdate) select * from cte; To port our simple 7 line query from SQL Server to Oracle we've had to make 8 changes to the query! So much for database independence!
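If you want to see the month-end behaviour of add_months for yourself, a quick throwaway query like this (dates chosen purely for illustration) shows it snapping to the end of the shorter month:

-- add_months keeps the day-of-month where it can, but returns the
-- last day of the target month when that day doesn't exist there.
select add_months(date '2015-03-01', -1) as from_first,  -- 01-FEB-2015
       add_months(date '2015-03-31', -1) as from_last    -- 28-FEB-2015
from   dual;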

A colleague recently posed the following question: How do I convert the following SQL Server query to Oracle? with cte as ( select dateadd(mm, -1, convert(date, getdate())) as dt union all select...

Head to Vegas to see the Oracle Developer Advocates speak

That's right, next week Chris and Dan will be presenting at Collaborate 15 in Las Vegas. The slots for our sessions are:

Mon. Apr. 13, 12:00 - 12:30 pm
Finding All the Red M&Ms: A Story of Indexes and Full Table Scans - Chris Saxon
"Why isn't my query using an index?" is one of the most common questions beginners have when tuning SQL. This talk explores the factors that influence the optimizer's decision behind this question. It does so by comparing fetching rows from a database table to finding all the red M&Ms across a number of bags of sweets. It then compares this to index lookups and full table scans in explain plans, giving an understanding of how to answer this question. It finishes with some thought experiments, showing how the way M&Ms are placed in bags and the relative size of bags to sweets affect the effectiveness of indexes. And yes, there will be chocolate ;)

Wed. Apr. 15, 4:30 - 5:30 pm
Maintain Application Uptime When Deploying PL/SQL Using Oracle's Edition Based Redefinition - Chris Saxon
Releasing code changes to PL/SQL in heavily executed packages on databases with high uptime requirements brings many challenges. It's not possible to gain the DDL lock to compile a package while it's being executed, so how do you do so without downtime? In 11gR2, Oracle introduced edition-based redefinition (EBR). This allows you to deploy multiple versions of PL/SQL objects in the same database. Using EBR, you can release PL/SQL easily without downtime. This talk discusses earlier approaches to tackling this problem and shows how EBR is superior. It steps through the process of creating and using editions. It also investigates some of the downsides to using EBR and techniques you can use to mitigate these.

Thu. Apr. 16, 12:15 - 1:15 pm
An Introduction to JavaScript Based Web Applications Using the Oracle Database - Dan McGhan
JavaScript MVC frameworks, such as AngularJS, are an increasingly popular choice for creating sophisticated web applications. Middle tiers for such applications are most commonly written using Node.js, which provides developers with many tools for operations such as creating RESTful URLs, handling user sessions, and connecting to databases. Oracle offers several technologies, some quite new, that greatly assist in the development of these types of applications, including the Oracle Node.js Driver, Oracle REST Data Services, new JSON capabilities in the SQL engine and more. In this session attendees will learn about these new technologies and how they can be leveraged to help create modern web applications with more power and flexibility than ever before.

We'd love it if you're able to join us for these sessions and look forward to meeting you! If you're not able to make it to Vegas, you can still see Chris' M&Ms talk online through the Plug-In to Vegas Forum Virtual Conference in the database performance track.

That's right, next week Chris and Dan will be presenting at Collaborate 15 in Las Vegas. The slots for our session are: Mon. Apr. 13 12:00 - 12:30 pm Finding All the Red M&Ms: A Story of Indexes and...

Mind the gaps

In our last few posts, we looked at solving FizzBuzz in SQL using a row generator, and the performance of several row generators. In this post, we'll take a look at another application for row generators: filling in "gaps".

Here's the CONNECT BY row generator we used to solve FizzBuzz:

select level from dual connect by level <= 100;

This generates 100 rows of the numbers 1..100. But row generators aren't limited to numbers. By using date arithmetic, you can generate dates:

select date '2015-01-01' + level -1 as gendate
from dual
connect by date '2015-01-01' + level -1 < date '2015-02-01';

GENDATE
---------
01-JAN-15
02-JAN-15
03-JAN-15
04-JAN-15
05-JAN-15
...
29-JAN-15
30-JAN-15
31-JAN-15

31 rows selected.

Or by using ASCII codes, you can generate characters:

select chr(65 + level - 1) as letter from dual connect by level <= 26;

LETT
----
A
B
C
D
...
X
Y
Z

26 rows selected.

Row generators are particularly useful for filling in "gaps" in the data. What do I mean by "gaps"? Well, here's an example from the OE.ORDERS table.

Name          Null?    Type
------------- -------- ----------------------------
ORDER_ID      NOT NULL NUMBER(12)
ORDER_DATE    NOT NULL TIMESTAMP(6) WITH LOCAL TIME ZONE
ORDER_MODE             VARCHAR2(8)
CUSTOMER_ID   NOT NULL NUMBER(6)
ORDER_STATUS           NUMBER(2)
ORDER_TOTAL            NUMBER(8,2)
SALES_REP_ID           NUMBER(6)
PROMOTION_ID           NUMBER(6)

Suppose that we want to know how many orders are recorded for each day of June, 2007. We can use trunc(order_date) to get the count of all orders on each day for which there were ANY orders:

--How many orders on each day in June 2007?
select trunc(order_date) as orderdate, count(*) as ordercount
from oe.orders
where order_date between date '2007-06-01' and date '2007-06-30'
group by trunc(order_date)
order by 1;

ORDERDATE ORDERCOUNT
--------- ----------
04-JUN-07          1
07-JUN-07          2
13-JUN-07          1
18-JUN-07          1
27-JUN-07          1
29-JUN-07          4

6 rows selected.

This gives the number of orders for each day that had any orders. But it doesn't give a zero for the days on which there were no orders. There's no data in the OE.ORDERS table for these days, so nothing to sum or count. This is the kind of "gap" that row generators excel at filling. In the following query, we use CONNECT BY to generate one row for each day in June 2007, then outer join that to the previous OE query.

select gendate, nvl(ordercount,0)
from (
  select trunc(order_date) as orderdate, count(*) as ordercount
  from oe.orders
  where order_date between date '2007-06-01' and date '2007-06-30'
  group by trunc(order_date)
) jun_orders, (
  select date '2007-06-01' + level -1 as gendate
  from dual
  connect by date '2007-06-01' + level -1 <= date '2007-06-30'
) jun_days
where gendate = orderdate(+)
order by gendate;

GENDATE   NVL(ORDERCOUNT,0)
--------- -----------------
01-JUN-07                 0
02-JUN-07                 0
03-JUN-07                 0
04-JUN-07                 1
05-JUN-07                 0
06-JUN-07                 0
07-JUN-07                 2
...
28-JUN-07                 0
29-JUN-07                 4
30-JUN-07                 0

30 rows selected.

This resultset has the required total of 0 for each day in June that didn't have any orders.
And of course, we can do the same thing in ANSI compliant SQL with recursive subquery factoring (aka CTEs): with jun_days (gendate) as ( select date '2007-06-01' as gendate from dual union all select gendate + 1 from jun_days where gendate < date '2007-06-30' ), jun_orders as ( select trunc(order_date) as orderdate, count(*) as ordercount from oe.orders where order_date between date '2007-06-01' and date '2007-06-30' group by trunc(order_date) ) select gendate, nvl(ordercount,0) from jun_days left outer join jun_orders on ( gendate = orderdate ) order by gendate; Today's image is by Arz, licensed under CC-BY-SA-3.0, via Wikimedia Commons

In our last few posts, we looked at solving FizzBuzz in SQL using a row generator, and the performance of several row generators. In this post, we'll take a look at another application for row...

Follow-up to "Row generators, part 2"

My last post generated many excellent comments, here and on Twitter. As several people pointed out in the comments, using CONNECT BY to generate a large number of rows is memory intensive, and may fail with errors like "ORA-30009: Not enough memory for CONNECT BY operation" or "ORA-04030: out of process memory".

One improvement that gets around this is to use a Cartesian join on two or more smaller CONNECT BY sources, as Tanel Poder suggests in this 2008 blog post:

set timing on
set autot traceonly
with r1 as (select level from dual connect by level <= 100),
     r2 as (select level from dual connect by level <= 100),
     r3 as (select level from dual connect by level <= 1000)
select rownum from r1, r2, r3;

10000000 rows selected.
Elapsed: 00:00:06.03

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
0 consistent gets
0 physical reads
0 redo size
69281273 bytes sent via SQL*Net to client
22513 bytes received via SQL*Net from client
2001 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
10000000 rows processed

Jonathan Lewis suggested increasing arraysize in SQL*Plus to decrease the time spent in SQL*Net traffic. The default is 15; the valid range is 1 to 5000:

set arraysize 1
/
10000000 rows selected.
Elapsed: 00:09:56.88

set arraysize 15
/
10000000 rows selected.
Elapsed: 00:01:27.09

set arraysize 5000
/
10000000 rows selected.
Elapsed: 00:00:06.23

Sayan Malakshinov suggested this variation, which uses a hash join instead of the outermost merge join cartesian:

set autot traceonly
set arraysize 5000
with r1 as (select 1 as x from dual connect by level <= 100),
     r2 as (select 1 as x from dual connect by level <= 100),
     r3 as (select 1 as x from dual connect by level <= 1000)
select rownum from r1, r2, r3
where r1.x = r2.x and r1.x=r3.x and r1.x=r2.x;

10000000 rows selected.
Elapsed: 00:00:06.07

Finally, for a comprehensive treatment of row generators, including pipelined functions and the MODEL clause, several people pointed to this article by Adrian Billington.

My last post generated many excellent comments, here and on Twitter. As several people pointed out in the comments, using CONNECT BY to generate a large number of rows is memory intensive, and may fail...

Row generators, part 2

In our last post, we solved FizzBuzz in SQL using a row generator, and looked at several different ways to generate rows in SQL. In this post, we'll do a performance comparison between row generators. Then in our next post, we'll look at some more things to do with row generators.

Setup

Before generating 1000000 rows several different ways, turn timing on and set autotrace to traceonly, so that we get timings and we don't have 1000000 rows printed to the screen.

set autotrace traceonly
set timing on

Now the comparison...

CONNECT BY

select level from dual connect by level <= 1000000

1000000 rows selected.
Elapsed: 00:00:09.08

Recursive Subquery

WITH natural (n) as (
  select 1 as n from dual
  union all
  select n+1 from natural where n<1000000
)
select * from natural;

1000000 rows selected.
Elapsed: 00:00:16.54

XMLTABLE

select * from xmltable ('1 to 1000000');

1 hour later... ^C
748505 rows selected.
Elapsed: 01:04:14.49

Let's go smaller... 100,000 rows.

select * from xmltable ('1 to 100000');

100000 rows selected.
Elapsed: 00:20:00.95

CUBE

Here we need to figure out how many arguments to give CUBE. 2^20 = 1,048,576 so we need 20:

select rownum from (
  select 1 from dual
  group by cube( 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 )
) where rownum < 1000000;

1 hour later...

ERROR at line 2:
ORA-00028: your session has been killed
ORA-00028: your session has been killed
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT

Elapsed: 01:00:14.12

... Enough said.

The standings:

Row generator    1,000,000 rows      100,000 rows        ANSI compliant?  Recommended?
CONNECT BY       00:00:09.08         00:00:00.87         No               Yes
Recursive WITH   00:00:16.54         00:00:01.51         Yes              Yes
XMLTABLE         More than 1 hour    00:20:00.95         Yes              No
CUBE             More than 1 hour    More than 1 hour    Yes              No

Conclusion

CONNECT BY is slightly faster than recursive subquery factoring, but the latter is ANSI-SQL compliant. Using XMLTABLE and CUBE as row generators is not recommended!

P.S. Post edited to add: thanks to all the great comments, I've created a follow-up post here.

In our last post, we solved FizzBuzz in SQL using a row generator, and looked at several different ways to generate rows in SQL. In this post, we'll do a performance comparison between row...

FizzBuzz

FizzBuzz is a classic programming problem / interview question. In this post we'll solve FizzBuzz in SQL using a row generator.

The FizzBuzz problem: write a program that prints the integers from 1 to 100. For multiples of 3 print "Fizz" instead of the number, for multiples of 5 print "Buzz" instead of the number, and for numbers which are multiples of both 3 and 5, print "FizzBuzz".

If I have a table my_numbers with the numbers from 1 to 100 in it, then this is trivial to do in SQL:

select case when mod(the_number,15)=0 then 'FizzBuzz'
            when mod(the_number,3)=0 then 'Fizz'
            when mod(the_number,5)=0 then 'Buzz'
            else to_char(the_number)
       end as "FizzBuzz!"
from my_numbers
order by the_number;

FizzBuzz!
----------------------------------------
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
...
97
98
Fizz
Buzz

100 rows selected.

But, how to get that table of the natural numbers from 1-100? Well, I could create a table and then populate it using PL/SQL; or I could do this:

--100 rows
select 1 as num from dual union all
select 2 as num from dual union all
select 3 as num from dual union all
select 4 as num from dual union all
select 5 as num from dual union all
select 6 as num from dual union all
select 7 as num from dual union all
select 8 as num from dual union all
select 9 as num from dual union all
...
select 100 as num from dual;

but that's not only tedious but completely unnecessary. There are several much better ways to create row sources in SQL. The simplest, and most frequently used, row generator uses CONNECT BY LEVEL:

select level from dual connect by level <= 100;

What's going on here? CONNECT BY is part of the hierarchical query clause. LEVEL is a pseudocolumn available when using CONNECT BY. And we can use this directly to solve FizzBuzz:

select case when mod(level,15)=0 then 'FizzBuzz'
            when mod(level,3)=0 then 'Fizz'
            when mod(level,5)=0 then 'Buzz'
            else to_char(level)
       end as "FizzBuzz!"
from dual
connect by level <= 100;

In 11.2 and above, Oracle supports recursive common table expressions, a.k.a. recursive subquery factoring. So we can rewrite the row generator using recursive subquery factoring, and then rewrite the FizzBuzz solution as:

WITH natural (n) as (
  select 1 as n from dual
  union all
  select n+1 from natural where n<100
)
select case when mod(n,15)=0 then 'FizzBuzz'
            when mod(n,3)=0 then 'Fizz'
            when mod(n,5)=0 then 'Buzz'
            else to_char(n)
       end as "FizzBuzz!"
from natural;

Unlike CONNECT BY, recursive subquery factoring is ANSI compliant. So the above gives us an ANSI-compliant solution to FizzBuzz.

Here's another, even simpler, way to get our 100 rows of numbers 1..100:

select * from xmltable ('1 to 100');

and also...

select rownum from (
  select 1 from dual group by cube( 1, 1, 1, 1, 1, 1, 1 )
) where rownum < 101;

Confused? Wonder what's going on here? Group by cube gives you every aggregation. In the case of "select 1 from dual", every aggregation is 1. Try this:

--2 rows
select 1 from dual group by cube (1);
--4 rows
select 1 from dual group by cube (1,1);
--8 rows
select 1 from dual group by cube (1,1,1);

so the "group by cube" is creating 2^n rows for you, where "n" is the number of columns listed in the CUBE.

Now, we've looked at several ways to generate a table of N natural numbers from 1...N. Which is better / faster? In the next post we'll compare execution times, statistics and explain plans for the above row generators. Then, we'll look at a few more things you can do with row generators.

FizzBuzz is a classic programming problem / interview question. In this post we'll solve FizzBuzz in SQL using a row generator. The FizzBuzz problem: Write a program that prints the integers from 1 to...

PL/SQL

Optimizing the PL/SQL Challenge: Using Covering Indexes to Reduce IO errors on the Library Page

Over on the PL/SQL Challenge we've been experiencing some performance issues. The main culprit is on the Library page, where you can view past quizzes. The "Commentary" column displays whether players have added any new comments (objections or discussions) to a quiz since a person last visited it and, if so, when. The problem? Including this information resulted in an "ORA-02395: exceeded call limit on IO usage". Essentially our queries were doing too much work, causing too many IO calls. More confusingly, this happened somewhat randomly - i.e. not all people were reporting this error.

Digging through the code revealed the following SQL, which finds when a player last visited a quiz:

CURSOR visits_cur IS
  SELECT qv.created_on
  FROM qdb_quiz_visits qv
  WHERE qv.user_id = user_id_in
  AND qv.quiz_id = quiz_id_in
  ORDER BY qv.created_on DESC;

A quick look at the explain plan showed the following: Looking at this, we can see that the query accesses the table using an index on QDB_QUIZ_VISITS (QUIZ_ID). Oracle expects this to return 112 rows (the cardinality column). It then filters the results based on the USER_ID and then sorts the results. 112 table accesses doesn't seem too bad. Perhaps the optimizer is underestimating how many rows the query will return. Let's check:

select avg(c) from (
  select qv.user_id, count(*) c
  from qdb_quiz_visits qv
  where qv.user_id is not null
  group by qv.user_id
);

AVG(C)
----------
74.2388726

It's actually overestimating! Perhaps something else is going on. How many visits have the top 5 players, ranked by total number of visits, made?

select user_id, c from (
  select qv.user_id, count(*) c
  from qdb_quiz_visits qv
  where qv.user_id is not null
  group by qv.user_id
  order by count(*) desc
) where rownum <= 5;

USER_ID             C
---------- ----------
11966            7772
10808            6026
4294             5678
487              4737
3763             4606

Some people have visited the PL/SQL Challenge a lot! This means in the worst case our original explain plan will access the table 7,772 times before applying the filter on the quizzes themselves. Considering we only want the most recent row, that's a lot of work. The offending query is buried in a function called in the select, so this process will happen for every row returned by the resultset. No wonder we were hitting IO errors!

This distribution also helps explain why this issue was only reported by a few users. For the average quiz visits per person to be ~74, most players must have a small number of quiz visits. For low-use players, there simply aren't that many rows for this query to process when displaying the Library page.

Can we do anything to help this situation? Yes we can! We can create an index on QDB_QUIZ_VISITS (QUIZ_ID, USER_ID, CREATED_ON). This enables Oracle to execute the whole query without accessing the table. Further, because indexes are ordered data structures, we can avoid the SORT operation. Building this index gives the following plan: This removes two whole steps from the execution plan. This also means we only access the rows that match on the QUIZ_ID and the USER_ID. Let's see how this changes our worst case scenario:

select * from (
  select qv.user_id, quiz_id, count(*)
  from qdb_quiz_visits qv
  where user_id is not null
  group by qv.user_id, quiz_id
  order by count(*) desc
) where rownum <= 5;

USER_ID    QUIZ_ID      COUNT(*)
---------- ---------- ----------
12376            2123         86
1552             1841         66
2270           596142         64
702              2030         61
1307           141525         53

A much more reasonable 86 rows. This optimization addresses just one section of a much larger query.
Some of the reports still generate IO errors, so further tuning is needed; details will appear in future blog posts.

Summary: in this case we improved performance by creating a "covering index". This is one where all the columns the query needs are present in the index. Not only has this saved us a table access and a sort operation, we've brought the worst case for the total number of rows to process down from 7,772 to just 86 - two orders of magnitude less!
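For reference, the covering index described above looks something like this (the index name is my own invention; only the column list matters):

-- Covering index: the cursor only references quiz_id, user_id and
-- created_on, so Oracle can answer it from the index alone, and the
-- index ordering removes the need for a separate sort.
create index qdb_quiz_visits_cov_i
  on qdb_quiz_visits ( quiz_id, user_id, created_on );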

Over on the PL/SQL Challenge we've been experiencing some performance issues. The main culprit is on the Library page, where you can view past quizzes. The "Commentary" column displays whether players...

select 'Hello world!' from dual;

Welcome to the blog of Oracle's Developer Advocates for SQL! We are Natalka Roshak and Chris Saxon and we're here to share the joy of SQL. Follow our blog here or follow our daily SQL tips on Twitter at @sqldaily.

About Chris

Chris has built up expertise in SQL over the past ten years working as an Oracle developer, DBA and architect. During this time he's developed a passion for interacting with data using SQL and helping others do the same. Combining this with a love of games and quizzes, Chris started the design quiz on the PL/SQL Challenge in 2013. He's looking forward to helping spread the word about SQL so that people get the most out of their data. Chris blogs at www.sqlfail.com, a site where he shows how people fail to make appropriate use of Oracle in their SQL and PL/SQL statements, along with SQL brainteasers and other topics that interest him. You can get in touch with Chris via email at chris.saxon@oracle.com or follow him on Twitter @chrisrsaxon.

About Natalka

Natalka Roshak is an Oracle Developer Advocate for SQL. She's been using SQL for half her life, still has a lot to learn, and loves sharing what she learns. Natalka started out as a DBA/Developer working in SQL, Perl and PL/SQL. She spent several years at Harvard, where she wrote a data reconciliation program that saved the university thousands of man-hours and over $500,000 in corrected payroll errors, before joining Oracle as a member of the RAC support team. In addition to writing for OraFAQ and DevX, she's written over a hundred My Oracle Support articles. Natalka tweets at @NatalkaRoshak and blogs at rdbms-insight.com.

Welcome to the blog of Oracle's Developer Advocates for SQL! We are Natalka Roshak and Chris Saxon and we're here to share the joy of SQL. Follow our blog here or follow our daily SQL tips on Twitter...