Database, SQL and PL/SQL

Excellent Extensions

New features in Oracle Database 12c Release 2 make external tables even more flexible.

By Connor McDonald

July/August 2017

In the next few columns, I’ll spend some time looking at new features in Oracle Database 12c Release 2. These features come from the “12 Things About Oracle Database 12c” presentation series that Chris Saxon and I (the AskTom team) gave at Oracle OpenWorld 2016 in San Francisco, California. (You can find the slides for these presentations on asktom.oracle.com, under the Resources tab.) In this article, I’ll take a look at some enhancements to external tables in Oracle Database 12c Release 2.

A Polyglot World

The dictionary definition for polyglot is “a person who knows and is able to use several languages.” This naturally crept over into the IT software world, where the term polyglot programming refers to the use of multiple programming languages. More recently the term polyglot persistence has come to refer to the use of multiple database technologies to best meet the needs of various business requirements.

For example, to meet the stringent data integrity needs of financial transaction processing applications, Oracle Database is a natural technology fit, but for simpler key/value pair requirements, a business may opt for Oracle NoSQL Database. The fundamental unit of storage in many NoSQL databases is a document, and rather than using multiple relational structures, a NoSQL database stores data in much larger chunks, typically in documents of a proprietary format or an open standard such as JavaScript Object Notation (JSON). It is already possible to query data from, for example, a Hadoop Distributed File System (HDFS) datastore by creating and using an external table definition in Oracle Database:

SQL> create table CUSTOMER_RAWDATA
  2   (customer_number number,
  3    customer_name   varchar2(50),
  4    postal_code     char(5)
  5   )
  6  organization external
  7    (type oracle_hdfs
  8     default directory TEMP
  9     access parameters
 10     (
 11      com.oracle.bigdata.cluster   = hadoop_clust
 12      com.oracle.bigdata.rowformat = delimited fields terminated by ','
 13     )
 14     location('hdfs/p1a.dat',
 15              'hdfs/p1b.dat',
 16              'hdfs/p2.dat',
 17              'hdfs/p3.dat'
 18             )
 19    );

But using this external table definition may lose the benefits of the parallelism that large NoSQL clusters afford. If the information for customer 150, for example, is contained only in the document p2.dat, a query to the external table will not have that insight, and all documents will have to be scanned. Similarly, if the data was natively partitioned in the source system, for example, via Hive partitions, the knowledge of those partitions will not be conveyed to the corresponding external table definition in the Oracle Database instance.

Partitioned External Tables with Oracle Database 12c Release 2

Oracle Database 12c Release 2 removes this previous “missing insight” limitation by introducing partitioned external tables to allow a partitioning definition to be included within the external table definition:

SQL> create table CUSTOMER_RAWDATA
  2   (customer_number number,
  3    customer_name   varchar2(50),
  4    postal_code     char(5)
  5   )
  6  organization external
  7    (type oracle_hdfs
  8     default directory TEMP
  9     access parameters
 10     (
 11      com.oracle.bigdata.cluster   = hadoop_clust
 12      com.oracle.bigdata.rowformat = delimited fields terminated by ','
 13     )
 14    )
 15  partition by range(customer_number)
 16  (
 17    partition p1 values less than (100) location('hdfs/p1a.dat', 'hdfs/p1b.dat'),
 18    partition p2 values less than (200) location('hdfs/p2.dat'),
 19    partition p3 values less than (300) location('hdfs/p3.dat')
 20  );

Partitioned external tables are not solely for administrators who need to access nonrelational datasources. They can also be used by relational database developers for accessing local relational datasources to achieve performance benefits, because all the usual optimizations for partitions, such as partition elimination, are available for partitioned external tables. In the following example, I have two datasets of employee data, one file for each region—Australia (AU) and New Zealand (NZ).

File: emp_au.dat
"AU",7369,"SMITH","CLERK",7902,17/DEC/80,800,,20
"AU",7499,"ALLEN","SALESMAN",7698,20/FEB/81,1600,300,30
"AU",7521,"WARD","SALESMAN",7698,22/FEB/81,1250,500,30
"AU",7566,"JONES","MANAGER",7839,02/APR/81,2975,,20
...
File: emp_nz.dat
"NZ",8369,"SMITH","CLERK",7902,17/DEC/80,800,,20
"NZ",8499,"ALLEN","SALESMAN",7698,20/FEB/81,1600,300,30
"NZ",8521,"WARD","SALESMAN",7698,22/FEB/81,1250,500,30
"NZ",8566,"JONES","MANAGER",7839,02/APR/81,2975,,20
...

To obtain the set of all employees in both regions, I can define a partitioned external table by using the conventional list partition syntax:

SQL> create table ext_emp
  2  ( region   varchar2(2),
  3    empno    number(4),
  4    ename    varchar2(10),
  5    job      varchar2(9),
  6    mgr      number(4),
  7    hiredate date,
  8    sal      number(7,2),
  9    comm     number(7,2),
 10    deptno   number(2)
 11  )
 12  organization external
 13  ( type oracle_loader
 14    default directory temp
 15    access parameters
 16    ( records delimited by newline
 17      badfile 'emp.bad'
 18      logfile 'emp.log'
 19      fields terminated by "," optionally enclosed by '"' ldrtrim
 20      reject rows with all null fields
 21      ( region,
 22        empno,
 23        ename,
 24        job,
 25        mgr,
 26        hiredate,
 27        sal,
 28        comm,
 29        deptno
 30      )
 31    )
 32  )
 33  partition by list(region)
 34  (
 35    partition p_au values ('AU') location('emp_au.dat'),
 36    partition p_nz values ('NZ') location('emp_nz.dat')
 37  ) reject limit unlimited;

If I defined the external table as nonpartitioned with two files defined in a single location definition, all queries would need to scan both files. But with a partitioned external table definition, partition elimination enables access to each file in isolation if the predicates allow it. For example, a query to just the employees in region AU yields the following execution plan, showing that only the first partition, P_AU, was required to satisfy the query.

SQL> set autotrace traceonly explain
SQL> select * from ext_emp where region = 'AU';
Execution Plan
——————————————————————————————————————————————————————————
Plan hash value: 2018056040
—————————————————————————————————————————————————————————————————————————————————————
|Id|Operation                   |Name   |Rows|Bytes|Cost(%CPU)|Time    |Pstart|Pstop|
—————————————————————————————————————————————————————————————————————————————————————
| 0|SELECT STATEMENT            |       |  82| 7380|  29   (0)|00:00:01|      |     |
| 1| PARTITION LIST SINGLE      |       |  82| 7380|  29   (0)|00:00:01|    1 |    1|
| 2|  EXTERNAL TABLE ACCESS FULL|EXT_EMP|  82| 7380|  29   (0)|00:00:01|    1 |    1|
—————————————————————————————————————————————————————————————————————————————————————

Note that the database does not validate the external data to ensure that it aligns with the partition definition. For example, I added an invalid row (REGION = 'XX') to the emp_au.dat file. When I then query for REGION = 'XX', I get no rows returned, because the partition definition, and hence the optimizer's partition elimination, treats that data as impossible. However, scanning all the data by using a nonpartition-key predicate reveals that the data is present in the file. It is your responsibility to ensure that the file data maps correctly to the partition definition.

SQL> select * from ext_emp where region = 'XX';
no rows selected

SQL> select * from ext_emp where empno = 1934;

RE      EMPNO ENAME    JOB          MGR HIREDATE       SAL    COMM    DEPTNO
—— —————————— ———————— ——————— ———————— ————————— ———————— ——————— —————————
XX       1934 MILLER   CLERK       7782 23-JAN-82     1300                10
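Because the database performs no such validation, a periodic sanity check can be worthwhile. One sketch of such a check is to scan a single partition and group by the partition key; any value other than the one the partition is defined for indicates a misplaced row in that partition's file:

```sql
-- Sketch: check partition P_AU for rows whose REGION value does not
-- belong there. Any region other than 'AU' in the output indicates a
-- misplaced row in emp_au.dat.
select region, count(*)
from   ext_emp partition (p_au)
group  by region;
```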

Process Control

As long as software applications have existed, process controls on the management, deployment, and maintenance of those applications have also existed. Entire practice management frameworks such as the Information Technology Infrastructure Library (ITIL), which have established a suite of standards and procedures for any part of the application development lifecycle, are now commonplace in most organizations. As part of current standards and procedures, structural changes to objects via data definition language (DDL) are much more tightly controlled. But unlike the DDL for a standard Oracle Database table, the success of a DDL statement defining an external table is not a guarantee that the definition is valid. As such, there are pros and cons to such tight controls over DDL deployment for external tables.

In this example, I’ve created an external table to read from a daily extract of employee data:

SQL> create table ext_emp (
  2   empno     number(4),
  3   ename     varchar2(10),
  4   job       varchar2(9),
  5   mgr       number(4),
  6   hiredate  date,
  7   sal       number(7,2),
  8   comm      number(7,2),
  9   deptno    number(2)
 10  )
 11  organization external
 12  ( type oracle_loader
 13    default directory TMP
 14    access parameters
 15    ( records delimited by newline
 16      fields terminated by ','
 17      missing field values are null
 18      ( empno,ename,job,mgr,hiredate,sal,comm,deptno )
 19    )
 20    location ('emp-2016-10-01.dat'));
Table created.

Although this definition may appear valid, given that no error was returned, the true test comes when I attempt to query the external table, which is when the column mappings and parsing of the external file take place. In my case, something is awry:

SQL> select * from ext_emp;
select * from ext_emp
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-30653: reject limit reached

I have forgotten to add the REJECT LIMIT clause to my table definition, and hence even a single invalid row in the source file immediately terminates my SQL query. I probably should have taken advantage of the SQL*Loader facility in Oracle Database that generates external table DDL from a control file, via the EXTERNAL_TABLE=GENERATE_ONLY clause, which would have added the REJECT LIMIT clause for me automatically. (See “On Better Loading and Fact-Checking” for an example of this facility.)

So my problem is easily resolved; I just need to add a clause to my table DDL. But in a world of strict process control, DDL changes are typically seen as high-risk changes, requiring the full gamut of validations and checks before approval. This is not a criticism of such process controls; they are critical for avoiding the aforementioned scenario of small errors leading to catastrophic consequences. But there should be recognition that not all DDL presents the same risk.
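The fix itself does not require re-creating the table; for an external table, REJECT LIMIT is one of the attributes that ALTER TABLE can change directly:

```sql
-- One-line DDL fix: allow any number of rejected rows.
alter table ext_emp reject limit unlimited;
```

Small as it is, this is still a DDL statement, and therefore still subject to whatever change approval process governs DDL.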

Another external table DDL risk example is the filename attribute. In my example above, the format of the filename emp-2016-10-01.dat suggests that it may need to be adjusted in the external table definition every day, as a new employee file is generated with a fresh date stamp in its name. But demanding a complete suite of checks and balances for such a small DDL change would be overkill.
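That daily change amounts to a single ALTER TABLE statement (the filename here is illustrative):

```sql
-- Hypothetical daily maintenance: point the external table at the
-- latest extract file. The filename is illustrative only.
alter table ext_emp location ('emp-2016-10-02.dat');
```

Trivial though it is, each such statement is DDL, and under strict process control it would need the same approvals as any other structural change.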

More Flexibility with Oracle Database 12c Release 2

Oracle Database 12c Release 2 recognizes that some modifications to external table definitions are a lower-risk exercise than more-typical DDL changes to tables.

In Oracle Database 12c Release 2, without any DDL changes to the table, some external table parameters can be modified directly within the execution of a query against that external table. To solve the problem of my missing REJECT LIMIT clause, for example, I can add an EXTERNAL MODIFY clause directly to my SELECT statement:

SQL> select * from ext_emp
  2    external modify ( reject limit unlimited );

     EMPNO ENAME      JOB              MGR HIREDATE         SAL
—————————— —————————— ————————— —————————— ————————— ——————————
      7499 ALLEN      SALESMAN        7698 20-FEB-81       1600
      7521 WARD       SALESMAN        7698 22-FEB-81       1250
      7566 JONES      MANAGER         7839 02-APR-81       2975

Similarly, if the LOCATION filename changes from day to day, I can override the filename each time I query the external table:

SQL> select * from ext_emp
  2    external modify ( location ('emp-2016-10-02.dat') );

     EMPNO ENAME      JOB              MGR HIREDATE         SAL
—————————— —————————— ————————— —————————— ————————— ——————————
      7902 FORD       ANALYST         7566 03-DEC-81       3000
      7934 MILLER     CLERK           7782 23-JAN-82       1300
      7566 JONES      MANAGER         7839 02-APR-81       2975

The EXTERNAL MODIFY clause permits modification of the external table definition without requiring DDL changes—and the likely resultant process control implications. The EXTERNAL MODIFY clause allows the modification or override of the following clauses only:

  • DEFAULT DIRECTORY
  • LOCATION
  • ACCESS PARAMETERS (BADFILE, LOGFILE, DISCARDFILE)
  • REJECT LIMIT
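Several of these overrides can also be combined in a single query. A sketch, in which the directory, file, and badfile names are illustrative only:

```sql
-- Sketch: combine several EXTERNAL MODIFY overrides in one query.
-- TMP, the data filename, and the badfile name are illustrative.
select *
from   ext_emp
       external modify
       ( default directory TMP
         location ('emp-2016-10-03.dat')
         access parameters (badfile 'emp_alt.bad')
         reject limit unlimited
       );
```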

Summary

With Oracle Database 12c Release 2, extensions to external table functionality provide more-flexible data access across the datasources that are becoming commonplace in a polyglot persistence landscape. Partitioned external tables deliver a more direct mapping to the partitioning mechanism in those nonrelational datasources as well as realizing conventional partitioning performance benefits via partition elimination. Also, changing some of the defining attributes of external tables dynamically at query time removes the need for convoluted process controls that can impede developer productivity.

Next Steps

LEARN more about Oracle Database 12c Release 2.

DOWNLOAD Oracle Database 12c Release 2.

TRY Oracle Database Cloud services.

 

Photography by Dmitri Popov, Unsplash