Subscribe

Share

Database, SQL and PL/SQL

Better Tools for Better Data

New functions in Oracle Database 12c Release 2 solve data validation challenges.

By Connor McDonald

March/April 2017

In the next few columns, I’ll spend some time looking at new features in Oracle Database 12c Release 2. These features come from the “12 Things About Oracle Database 12c” presentation series that Chris Saxon and I, the Ask Tom team, gave at Oracle OpenWorld 2016 in San Francisco. (See asktom.oracle.com, under the Resources tab). In this article, I’ll take a look at a suite of Oracle Database 12c Release 2 improvements for validating data that make loading or querying data from unreliable datasources easier.

Background

At the UK Oracle User Group conference (December 2016), during my presentation on Oracle Database 12c Release 2 features, I projected a slide containing the acronym QTFWC. I could see many attendees in the audience nodding, acknowledging that this was the next acronym to be digested in an information technology industry already filled to the brim with acronyms.

The acronym was fictional; there’s no such product or feature as QTFWC. It was merely a humorous reflection on a real issue that faces all developers dealing with data whose quality and correctness are unknown: producing Queries That Finish Without Crashing.

In times gone by, data that wasn’t in your database was in someone else’s database. But datasources now vary widely, and equally variable are the quality and correctness of that data. Developers tasked with querying, cleansing, and loading this data face a chicken-and-egg problem when it comes to datatypes: the only way to know if a string can be converted to, say, a valid number is to attempt the conversion. But to attempt the conversion is to also run the risk of the conversion’s being invalid and having the entire query crash.

Consider a load process that must load data from a table called STAGING_SALES, representing sales data staged from several external sources. The task seems simple enough: create an INSERT statement to transfer the raw data into the target table ANNUAL_SALES.

SQL>  insert into ANNUAL_SALES
  2   select *
  3   from   STAGING_SALES;

A developer may patiently watch this statement execute for minutes, or possibly hours, eagerly waiting for its completion, only to see the following:

Elapsed: 06:12:34.00
ERROR at line 1:
ORA-01847: day of month must be between 1 and last day of month

The error suggests a problem with a DATE conversion, but there are no clues as to what data caused it. Adding to the developer’s frustration is that the data conversion error probably occurred some three hours into the total elapsed time of six hours, with the last three hours representing the effort of undoing all the changes applied so far. (See the “Instead of Waiting” sidebar for information on how to determine whether a data manipulation language [DML] statement has commenced a rollback before it completes.)

An examination of some of the data in the STAGING_TABLE reveals the conversion issue. A column named CREATED_DATE contains string data, some of which cannot be correctly converted to a date in the equivalent ANNUAL_SALES table.

SQL> select CREATED_DATE
  2  from   STAGING_SALES;

CREATED_DATE
————————————————————
01-FEB-2016
12-MAR-2012
54-AUG-2013
09-SEP-2014
23-OCT-2012
...

I am trivializing the true troubleshooting effort here. Analyzing source data for correctness, especially if the source data is millions of rows over dozens of columns, is an arduous task.

Historical Solutions

In the past, intercepting an error in data conversion typically required a PL/SQL function to act as a wrapper around standard facilities. In the insert-into-ANNUAL_SALES example, to check for a valid date in the CREATED_DATE column and prevent the statement from failing, I first create a date_checker PL/SQL wrapper function:

SQL> create or replace
  2  function date_checker(p_str varchar2) return date is
  3    l_dte date;
  4  begin
  5    l_dte := to_date(p_str,'dd-mon-yyyy');
  6    return l_dte;
  7  exception
  8    when others then return null;
  9  end;
 10  /
Function created.

The date_checker function returns the source data as a DATE datatype if the conversion can be performed and returns NULL otherwise. In Oracle Database 12c Release 1, the PL/SQL code can be folded directly into the SQL statement itself to avoid cluttering the data dictionary. For more information on using PL/SQL functions within a WITH clause, refer to the documentation.

SQL> insert /*+ with_plsql */ into ANNUAL_SALES
  2  with
  3    function date_checker(p_str varchar2) return date is
  4      dte date;
  5    begin
  6      dte := to_date(p_str,'dd-mon-yyyy');
  7      return dte;
  8    exception
  9      when others then return null;
 10    end;
 11  select date_checker(created_date) valid_date, ...
 12  from   staging_sales;

Although this solves the data conversion problem, there is increased complexity in the SQL code as well as the performance overhead of calling a PL/SQL function potentially millions of times.

More Validation Control with Oracle Database 12c Release 2

Oracle Database 12c Release 2 adds attempt-to-convert-and-catch-errors functionality natively to the database via the new VALIDATE_CONVERSION function and the existing CAST and TO_datatype suite of conversion functions.

Returning to the data loading example, here’s how the new VALIDATE_CONVERSION can be used in the INSERT statement:

SQL> insert into ANNUAL_SALES
  2  select to_date(created_date, 'dd-mon-yyyy'), ...
  3  from   STAGING_SALES
  4  where  validate_conversion(
  5           created_date as date, 'dd-mon-yyyy'
  6         ) = 1;

Instead of returning an error because of string data that could not be converted to a DATE datatype, the new VALIDATE_CONVERSION predicate in the WHERE clause picks up data only where a conversion of the CREATED_DATE column to a DATE datatype with the supplied format ‘dd-mon-yyyy’ mask is successful. Success is indicated by a return value of 1. (If the conversion would not have succeded, the function returns 0.) Because only the rows that could be converted are returned, I can now apply the TO_DATE function in the SELECT portion of the INSERT with the assurance that it will not cause the statement to fail.

Using VALIDATE_CONVERSION as a predicate ensures that the statement will not crash, but it also keeps complete rows from the source data from being loaded into the target table. What about other requirements for row handling and values? What if you must replace erroneous data with a default value but retain the row, so that before and after row counts are consistent in the load process?

To address that requirement, the TO_datatype conversion functions have been extended in Oracle Database 12c Release 2 to optionally return a default value if the data conversion fails.

The SALES_AMT column in the STAGING_SALES table also contains string data, but that data should be loaded as number values into the ANNUAL_SALES table. A sample of the SALES_AMT data shows that one of the rows has an erroneous comma, which would typically cause an error for the standard TO_NUMBER function call.

SQL> select SALES_AMT,
  2  from   STAGING_SALES;

SALES_AMT
——————————
120000    
172125    
128000    
125,000
99500 
...

By using the new extended syntax for TO_NUMBER, however, you can nominate a default value for use whenever the TO_NUMBER function fails:

SQL> select SALES_AMT,
  2         TO_NUMBER(SALES_AMT
  3                 DEFAULT -1 ON CONVERSION ERROR) conv_sale
  4  from   STAGING_SALES;

SALES_AMT  CONV_SALE
—————————— ——————————
120000         120000
172125         172125
128000         128000
125,000            -1
99500           99500
...

The other TO_datatype functions support the same extended functionality. Similarly, the CAST function supports the same extension for casting from one datatype to another.

Summary

With Oracle Database 12c Release 2, extensions to data conversion functions and the new VALIDATE_CONVERSION function make data validation via SQL a breeze for database developers. SQL statements can self-validate data without the need for additional PL/SQL wrappers or nondatabase code to guard against conversion errors.

 
Instead of Waiting
 

If you are running a large DML statement (INSERT, UPDATE, DELETE) and it fails during execution, the statement must roll back (or undo) all the changes completed from the commencement of the statement up to the point of failure. Hence a statement that gives the appearance of still executing may in fact already have failed, with control not yet returned to the requesting program or user because the undo phase is still in progress. However, you can detect whether a statement has already failed by examining the V$TRANSACTION view. The USED_UREC column in that view provides the number of undo records consumed by this transaction. Note that this is not necessarily going to be the same as the number of rows updated, because a single row change might change several indexes, large object (LOB) contents, and so on, each of which will have its own undo information. However, you can observe the direction of change in this column to determine the state of a transaction. For example, in one session, I create a table with approximately 100,000 records and then delete them.

Session 1
SQL> create table t as select * from dba_objects;
Table created.

SQL> delete from t;
104074 rows deleted.

SQL> select sys_context('USERENV','SID') from dual;
SYS_CONTEXT('USERENV','SID')
———————————————————————————————————————————————————————
590

In this example, the delete has already completed, but even if it were still executing, I could monitor the progress of the active transaction from another session by joining V$SESSION to V$TRANSACTION.

Session 2
SQL> select t.used_urec
  2  from   v$transaction t,
  3         v$session s
  4  where  s.sid  = 590   -- from above
  5  and    t.addr = s.taddr;

 USED_UREC
——————————
    104074

If that DELETE statement were still executing and then failed, the changes would commence rolling back. I can simulate that by issuing an explicit rollback command and rerunning my transaction monitor query while the rollback takes place.

Session 1
SQL> rollback;
[executing...]
Session 2
Running the query against V$TRANSACTION a few times in quick succession shows the progress of the rollback operation in Session 1.
SQL> select t.used_urec
  2  from   v$transaction t,
  3         v$session s
  4  where  s.sid = 590
  5  and    t.addr = s.taddr;

 USED_UREC
——————————
     94872

SQL> /

 USED_UREC
——————————
     32897

SQL> /
no rows selected

The last result—“no rows selected”—indicates that the rollback is complete, and that there is no longer an active transaction in Session 1.

Next Steps

LEARN more about Oracle Database 12c Release 2.

DOWNLOAD Oracle Database 12c Release 2.

 

Photography by Scott Webb, Unsplash