
An Oracle blog about Oracle Text Index

  • March 23, 2016

Loading documents and other file data into the Oracle Database

Roger Ford
Product Manager

Introduction

With recent versions of Oracle, there is more and more emphasis on handling unstructured text, as well as the traditional fully structured data such as numbers, dates and short text fields.

Here I want to consider various methods of getting "documents" into the Oracle database.

None of these methods are completely trivial, and the choice of which to use will depend on various factors.

What are the characteristics of a "document"? In general, I'm referring to text which may be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to 32767 bytes. The document most likely originates from a file on a file system, and it may be in a character set which differs from the database character set. In many cases, documents will have metadata (such as author and creation date) that we want to load at the same time.

We should distinguish between four types of data:

  • Metadata. This is short textual, numeric or date data which describes the main document. For example: document id, author, date created.
  • Semi-structured Data. This is textual strings in a specific format, such as JSON or XML. We may be able to make assertions about such data, such as "will always start with '<'", or "does not contain any newline characters".
  • Full Text.  This is text of arbitrary length. It can and usually will contain all sorts of characters including newlines and quotes. It may consist of multibyte characters in UTF-8, UTF-16 or other character sets.
  • Binary Data. Arbitrary length data which is not character encoded. May contain any bytes or combinations of bytes including nulls and end-of-file markers.

 Loading a mixture of these types of data can be problematic. If we want to load metadata and full text from a single file, then we need some method of determining where the full text starts and finishes. The easiest way to do that is to enclose the full text in some sort of quotes. But then what if the full text contains those quote characters?  So it's usually easier to arrange for each full text document to exist in a separate file, referenced from the main loading method.

It would be impossible to load multiple binary data records from a single file, unless the binary data was first encoded using a scheme such as BASE64, since by its nature you can never specify a delimiter string that could not appear in the binary data.
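The Base64 workaround mentioned above is easy to sketch (Python used purely for illustration; the sample payloads are arbitrary):

```python
import base64

# Two arbitrary binary payloads, including bytes that would break any
# delimiter-based format (nulls, newlines, quotes, EOF markers).
blobs = [b"\x00\x01\nbinary\x00data", b'"more"\r\nbytes\x1a']

# Base64 output never contains a newline, so one-record-per-line is safe.
encoded = [base64.b64encode(b).decode("ascii") for b in blobs]
record_file = "\n".join(encoded)

# Decoding each line restores the original bytes exactly.
decoded = [base64.b64decode(line) for line in record_file.split("\n")]
assert decoded == blobs
```

The cost, of course, is roughly a 33% size increase and an extra decode step after loading.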

For these examples, we're going to consider three use-cases:

  1.  A set of JSON documents. These documents reside in a single file, one per line.
  2.  A set of files, which contain one document per file. Each file/document has metadata associated with it in a "summary" file.
  3.  As above, but one of the files contains multibyte (UTF-8) data.

Client / Server considerations - where is your source data?

If your client and server are the same machine, then any of these methods should work. However, if you are connecting to your database from a remote client over SQL*Net, you might first decide whether you want to load files from the client (the machine from which you connect) or the server (the machine where the database resides).
 

Load Method        Data on Client    Data on Server
-----------        --------------    --------------
SQL*Plus           Yes               No
SQL*Loader         Yes               No
External Tables    No                Yes
BFILEs             No                Yes

 

Method 1: Using SQL*Plus Scripts

Let's say we have a table with several short columns and a CLOB column:

CREATE TABLE my_table 
  (id        NUMBER,
   author    VARCHAR2(30),
   created   DATE,
   text      CLOB
);

If I just want to do some "quick and dirty" loading of this table, I can just create a SQL script file and run that in SQL*Plus. For example:

INSERT INTO my_table VALUES ( 1, 'Roger', SYSDATE, 'the quick brown fox jumps over the lazy dog' );
INSERT INTO my_table VALUES ( 2, 'John', SYSDATE, 'the angry alligator ate the antelope');
COMMIT;

Simple and quick, and fairly easy to generate from source files using a language such as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:

  • Any single quotes in the text need to be doubled up:
    INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'my best friend''s brother said ''hi''');
  • Any ampersand ("&") characters in the script will cause problems unless you've put SET DEFINE OFF somewhere in the script
     
  • An input line to SQL*Plus cannot be longer than 2499 characters. If you want to include more than this you'll need to concatenate strings:
    INSERT INTO my_table VALUES ( 4, 'Roger', SYSDATE, 'this is a long line that continues over more than one line'
       || ' in the source code file');
  • Even then, that concatenation method will only allow you to assemble up to 4000 characters. To continue longer than that would require you to create temporary lobs, and use DBMS_LOB.APPEND for each chunk.
     
  • It's slower than most other methods which will be an issue if you're loading large quantities of data.

Let's take our use case 1 - the JSON documents. We could cut and paste from the file into a hand-coded script, but it's probably better to do it using a text-processing language such as Perl.  The attached file jsonToSql.pl replaces any single quotes with two single quotes, and breaks each line into chunks no longer than 2400 characters (allowing some space for syntactic overheads).
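The Perl script itself is attached rather than listed here, but the transformation is simple enough to sketch. Here's a rough Python equivalent (the chunk size, and the target table jsondocs with a single jsontext column, are assumptions rather than details taken from the attached script):

```python
MAX_RAW = 1200  # doubling quotes can double a chunk's size, which still
                # keeps each output line comfortably under SQL*Plus's 2499 limit

def to_insert(line):
    """Turn one JSON document (one line of input) into an INSERT statement."""
    text = line.rstrip("\n")
    # Chunk the raw text first, then escape each chunk, so that a
    # doubled quote ('') is never split across two string literals.
    chunks = [text[i:i + MAX_RAW] for i in range(0, len(text), MAX_RAW)] or [""]
    literals = ["'" + c.replace("'", "''") + "'" for c in chunks]
    return "INSERT INTO jsondocs VALUES ( " + "\n  || ".join(literals) + " );"

print(to_insert('{"name": "O\'Brien"}'))
# INSERT INTO jsondocs VALUES ( '{"name": "O''Brien"}' );
```

Lines longer than the chunk size come out as several string literals joined with ||, one per line.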

Assuming we have perl installed and available, we can call it from the command line (Windows or Unix) using

perl jsonToSql.pl jsonsource.txt > json.sql

And we can then load the resulting file json.sql (after creating the necessary table - see above) using

sqlplus user/pass @json.sql 

SQL*Plus isn't very good for loading from separate files, so we'll leave scenarios two and three out for this method.

Multibyte Considerations

SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get correct (and viewable!) output you will probably want to use SQL Developer to run the scripts.

Method 2: Using SQL*Loader

SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use, but the vast number of options can make it intimidating for the uninitiated.

SQL*Loader loads files from the client.  It has no access to files on the server.

It is normal to provide SQL*Loader with two inputs: A data file containing the data to be loaded to the database, and a control file which specifies how the data file is laid out, and which columns the data should be written to (it's possible to include the data inside the control file, but we'll stick with separate files for this example).

Use case 1:  JSON documents in a single file

For this use case we need a simple single-column table to hold one JSON document per row (the name jsondocs matches the control file below):

CREATE TABLE jsondocs ( jsontext CLOB );

This is fairly straightforward.  We create a control file called jsonloader.ctl which consists of

LOAD DATA
INFILE 'jsonsource.txt'
  INTO TABLE jsondocs
  FIELDS
   ( jsontext CHAR(32000) )

 

Then we load it using the command:

sqlldr roger/roger control=jsonloader.ctl log=loader.log bad=loader.bad

The "log" and "bad" arguments aren't required, but can be most useful for debugging.

Note that we've limited the length of the jsontext field to 32000 characters, but you can increase that size if you need to.
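Since any over-long record ends up in the bad file, it can be worth scanning the data file for offenders before running the load. A quick Python check (the 32000 limit mirrors the control file above; the sample data is arbitrary):

```python
LIMIT = 32000  # matches the CHAR(32000) declaration in the control file

def overlong_lines(lines, limit=LIMIT):
    """Return (line_number, length) for every line longer than the limit."""
    return [(n, len(line.rstrip("\n")))
            for n, line in enumerate(lines, start=1)
            if len(line.rstrip("\n")) > limit]

sample = ['{"ok": true}\n', "x" * 40000 + "\n"]
print(overlong_lines(sample))   # [(2, 40000)]
```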

Use case 2: Loading from separate files

A quick reminder about our second scenario. We have a table which we'll define as:

CREATE TABLE my_table 
  (id        NUMBER,
   author    VARCHAR2(30),
   created   DATE,
   text      CLOB
);

and into this table we want to load the files file1.txt and file2.txt

file1.txt:

The quick brown fox jumps over the lazy dog

file2.txt:

The angry alligator ate the antelope.

These files only have one line in each, but they could just as easily have multiple lines per file. One file is loaded into each database record.

If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the filenames for those files must be included in the data file. We use the "FILLER" syntax to specify that the filename isn't to be loaded into any column, and then the LOBFILE syntax to load the text into a column.

So here's our control file. We can see that it's similar to the previous control file but the "FIELDS" section is more complicated, to allow for multiple metadata fields, and the reference to the external file to be loaded.

loader.ctl:

LOAD DATA
INFILE 'loader.txt'
  INTO TABLE my_table
  FIELDS TERMINATED BY ','
  ( id         CHAR(10),
    author     CHAR(30),
    created    DATE "YYYY-MM-DD" ":created",
    fname      FILLER CHAR(80),
    text       LOBFILE(fname) TERMINATED BY EOF
  )

"fname" is an arbitrary label, we could have used "fred" and it would have worked exactly the same. It just needs to be the same on the two lines where it is used.

Note that the control file includes the name of our data file, loader.txt. loader.txt contains the metadata associated with each file, and the filename.

loader.txt:
1,John Smith,2015-04-29,file1.txt
2,Pete Jones,2013-01-31,file2.txt

So we're loading two files, along with the metadata for each.
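For more than a handful of files, loader.txt is better generated than typed. A hedged Python sketch that builds one line per document, taking the created date from each file's modification time (the directory path, author value and .txt filter are placeholder assumptions):

```python
import os
from datetime import date

def make_loader_lines(doc_dir, author):
    """Build 'id,author,YYYY-MM-DD,filename' lines for SQL*Loader."""
    lines = []
    doc_id = 0
    for fname in sorted(os.listdir(doc_dir)):
        if not fname.endswith(".txt"):
            continue
        doc_id += 1
        mtime = os.path.getmtime(os.path.join(doc_dir, fname))
        created = date.fromtimestamp(mtime).isoformat()  # YYYY-MM-DD
        # No space after any comma: SQL*Loader takes spaces literally.
        lines.append("%d,%s,%s,%s" % (doc_id, author, created, fname))
    return lines

# e.g. open("loader.txt", "w").write(
#          "\n".join(make_loader_lines("docs", "Roger")) + "\n")
```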

Now we can run SQL*Loader from the command line as follows:

sqlldr roger/roger control=loader.ctl log=loader.log bad=loader.bad

(again, the "log" and "bad" arguments aren't required, but can be most useful for debugging).

Then we can use SQL*Plus or SQL Developer to examine our loaded table:
 

SQL> select * from my_table;
    ID AUTHOR        CREATED   TEXT
---------- ---------------- --------- ------------------------------------------------
     1 John Smith        29-APR-15 The quick brown fox jumps over the lazy dog
     2 Pete Jones        31-JAN-13 The angry alligator ate the antelope.

 

Common mistakes:

Do not be tempted to add spacing to format the loader.txt file nicely. If you use

1, John Smith, 2015-04-29, file1.txt

Then not only will the author be loaded with a space in front of "John", but you'll see an obscure error:

SQL*Loader-509: System error: The device does not recognize the command.
SQL*Loader-502: unable to open data file ' file1.txt' for field TEXT table 
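A defensive check for this pitfall is cheap. This sketch flags any field with leading or trailing whitespace before the file ever reaches SQL*Loader (assuming the comma-separated layout above):

```python
def whitespace_fields(data_lines):
    """Return (line_no, field_no) for fields with leading/trailing spaces."""
    bad = []
    for line_no, line in enumerate(data_lines, start=1):
        for field_no, field in enumerate(line.rstrip("\n").split(","), start=1):
            if field != field.strip():
                bad.append((line_no, field_no))
    return bad

print(whitespace_fields(["1,John Smith,2015-04-29,file1.txt\n",
                         "2, Pete Jones,2013-01-31, file2.txt\n"]))
# [(2, 2), (2, 4)] -- fields 2 and 4 of the second line are flagged
```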

Multi-byte considerations

So what if your file contains multi-byte characters? Perhaps some Chinese text mixed up with English?
You will simply need to specify the character set of the file in the LOBFILE directive. For example, I have a file encoded in UTF-8, so I modify the control file to read:

LOAD DATA
INFILE 'loader.txt'
  INTO TABLE my_table
  FIELDS TERMINATED BY ','
  ( id         CHAR(10),
    author     CHAR(30),
    created    DATE "YYYY-MM-DD" ":created",
    fname      FILLER CHAR(80),
    text       LOBFILE(fname CHARACTERSET UTF8) TERMINATED BY EOF
  )

Loading Binary Files with SQL*Loader

The above method (without the CHARACTERSET addition) will work exactly the same if our column is a binary BLOB column rather than a character CLOB. There is no difference in the loading mechanism - all that changes is that the database does not attempt any kind of character set conversion on the data as it is loaded into the database.

 

Method 3: Using BFILES and PL/SQL

A BFILE is a non-standard datatype which provides read-only access to data held on an external file system.

NOTE: This method will only work with files on the server. BFILES have no access to files on the client machine.

It's possible to work directly with BFILEs, but it's perhaps more common to use them as a kind of temporary staging area to allow you to read data into database columns.

BFILEs require a DIRECTORY - which is an internal database pointer to a folder/directory on the file system. To create a directory you must have CREATE ANY DIRECTORY privilege - or you can have a DBA create the directory and grant you access to it. This provides a layer of protection to prevent users from reading or writing files on the database server.

The supplied package DBMS_LOB provides a number of utility functions, including LOADFROMFILE.

On the face of it, this might seem all we need, but in fact there's some preparation work we need to do.

LOADFROMFILE needs a valid LOB locator to write into, and the source BFILE must be open for reading. So we need to prepare a lob locator, open the BFILE, call LOADFROMFILE, then close the BFILE afterwards.

The following example shows a procedure which writes to our previously-created my_table.

Firstly it creates a BFILE from the supplied directory and filename, then it inserts the metadata into my_table, fetching back the lob locator for the newly created row. It then uses LOADFROMFILE to write into that lob:

CREATE SEQUENCE loadtest_seq;
CREATE TABLE my_table(id NUMBER, author VARCHAR2(100), load_date DATE, text CLOB);
CREATE OR REPLACE PROCEDURE LOAD_A_FILE
   ( dir_name   VARCHAR2,
     file_name  VARCHAR2,
     author     VARCHAR2
   ) IS
  l_bfile   BFILE;
  l_clob    CLOB;
BEGIN
   l_bfile := BFILENAME(dir_name, file_name);
   IF (dbms_lob.fileexists(l_bfile) = 1) THEN
      dbms_output.put_line('File Exists');
      INSERT INTO my_table T
        VALUES (loadtest_seq.NEXTVAL, 
                author, 
                sysdate, 
               EMPTY_CLOB()
      ) RETURN text INTO l_clob;
      dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
      dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength(l_bfile) );
      dbms_lob.fileclose( l_bfile );
      COMMIT;
   ELSE 
     dbms_output.put_line('File does not exist');
   END IF;   
END;
/

To use this procedure, we might do:

CREATE OR REPLACE DIRECTORY my_directory AS '/scratch/raford/Projects/loadingFiles';
-- remember that needs CREATE ANY DIRECTORY priv.

EXECUTE load_a_file( 'MY_DIRECTORY', 'file1.txt', 'Roger');


Note that in this example we're using SYSDATE for the creation date, which varies slightly from previous examples. We could of course include create_date as another argument to our procedure.

 

 

Common Mistakes:

 

Remember that the file must be found on the database server. You cannot use this method to load a file which is located on a local machine if you are using SQL*Plus or SQL Developer remotely.

The directory name is an identifier, similar to a username, table name or column name. As such, if you specify it without double quotes when creating it, it will be automatically upper-cased. But BFILENAME expects the directory name to be passed in as a string, so you must specify it in upper case if the directory was originally created without quotes.

No character set conversion is applied by LOADFROMFILE. This generally means the above procedure will not work if your data is not simple single-byte characters. Often you'll find the text is inexplicably truncated, because of the difference between the number of characters and the number of bytes. See the next section for a solution.
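The bytes-versus-characters mismatch behind that truncation is easy to demonstrate (Python used purely for illustration; the sample string is arbitrary):

```python
s = "日本語テキスト"           # 7 characters of Japanese text
raw = s.encode("utf-8")

print(len(s))    # 7  -- the character count
print(len(raw))  # 21 -- each of these characters occupies 3 bytes in UTF-8

# LOADFROMFILE copies a byte count with no conversion, so treating the
# BFILE's byte length as a character count misplaces the end of the
# data -- hence the apparent truncation for multibyte text.
```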

 

Multibyte Considerations

 

If we want to load multibyte character data, we must forget DBMS_LOB.LOADFROMFILE and copy the contents of the BFILE into the CLOB using a more "manual" technique. This involves reading chunks into a RAW buffer, and then using UTL_RAW.CAST_TO_VARCHAR2 to allow us to write the RAW data into the CLOB value:

CREATE OR REPLACE PROCEDURE LOAD_MULTIBYTE_FILE
   ( dir_name   VARCHAR2,
     file_name  VARCHAR2,
     author     VARCHAR2
   ) IS
  l_bfile   BFILE;
  l_clob    CLOB;
  buf       RAW(32767);
  vc        VARCHAR2(32767);
  maxsize   INTEGER := 8192;    -- a char can take up to 4 bytes, 
                                -- so this is the maximum safe length in chars
  amt       INTEGER;
  amtvc     INTEGER;
  v_offset  INTEGER := 1;
BEGIN
   l_bfile := bfilename(dir_name, file_name);
   IF (dbms_lob.fileexists(l_bfile) = 1) THEN
      dbms_output.put_line('File Exists');
      INSERT INTO my_table T
        VALUES (loadtest_seq.NEXTVAL, 
                author, 
                sysdate, 
                EMPTY_CLOB()
      ) RETURN text INTO l_clob;
      amt := dbms_lob.getlength( l_bfile );
      dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY );
      WHILE amt > 0 LOOP
        IF amt > maxsize THEN
           amt := maxsize;
        END IF;
        dbms_lob.read( l_bfile,amt, v_offset, buf );
        vc := utl_raw.cast_to_varchar2(buf);
        amtvc := LENGTH(vc);
        dbms_lob.writeappend( l_clob, amtvc, vc );
        v_offset := v_offset + amt;
        amt := dbms_lob.getlength( l_bfile ) - v_offset + 1;
      END LOOP;
      dbms_lob.fileclose( l_bfile );
      COMMIT;
   ELSE 
     dbms_output.put_line('File does not exist');
   END IF;   
END;
/

 

Method 4: Using External Tables

External tables appear to the database much like normal tables, but their data resides on the file system. The layout of the external files is described using a variation on SQL*Loader syntax.

However, SQL*Loader loads from the client file system, whereas external tables expect their data files to exist on the server file system.

External tables are particularly useful for loading large numbers of records from a single file, so each record appears in its own row in the table.

CREATE TABLE my_ext_table ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB )
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      ( id         CHAR(10),
        author     CHAR(30),
        created    DATE "YYYY-MM-DD",
        text       CHAR(131071)
      )
    )
    LOCATION (
      MY_DIRECTORY:'external.dmp'
    )
);

And here's our data file external.dmp:

1,Roger,2015-08-08,The quick brown fox jumps over the lazy dog
2,John,2012-01-01,"The angry alligator, acting alone, ate the antelope" 

Note the second line requires quotes around the full text string, as otherwise the field would be terminated at the comma.  Of course we meet the usual problem that the full text string may contain quotes. We can counter that by doubling them:

2,John,2012-01-01,"The angry alligator, acting ""alone"", ate the antelope"
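This doubling convention is the same one Python's csv module applies by default, so a correctly escaped data file can be generated rather than hand-edited (the rows here just echo the examples above):

```python
import csv
import io

rows = [
    [1, "Roger", "2015-08-08", "The quick brown fox jumps over the lazy dog"],
    [2, "John", "2012-01-01",
     'The angry alligator, acting "alone", ate the antelope'],
]

buf = io.StringIO()
# QUOTE_MINIMAL quotes only fields containing the delimiter or the
# quote character, and doubles any embedded double quotes.
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
print(buf.getvalue(), end="")
```

The first row comes out unquoted; the second is quoted, with the embedded quotes doubled.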

 

But we still have the issue that the full text might contain the record delimiter character (0x'0A', or newline). So as with SQL*Loader, we can specify a separate file for each document. External tables do not support the FILLER directive, so instead we have to use a COLUMN TRANSFORMS line to specify that our fname field contains the name of our external file:

DROP TABLE my_ext_table2;
CREATE TABLE my_ext_table2 ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB )
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MY_DIRECTORY
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY 0x'0A'
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
      ( id         CHAR(10),
        author     CHAR(30),
        created    DATE "YYYY-MM-DD",
        fname      char(100)
      )
    COLUMN TRANSFORMS (text FROM LOBFILE(fname) FROM (MY_DIRECTORY) )
    )
    LOCATION (
      'loader.txt'
    )
);
select * from my_ext_table2; 

Note: the "FROM (MY_DIRECTORY)" clause is not necessary here since we've already specified that as the default directory.
However if the loader.txt file and the individual clob files were in different locations, it would be needed.

Once your data is in an external table, you can either leave it there and query it like any other table (external tables are read-only, so you can SELECT from them but not run DML against them), or you can use the external table as a "staging" table to get the data into a normal table. You can simply do

CREATE TABLE normaltable AS SELECT * FROM externaltable;

 

to create a new normal table. And of course you can similarly do:

INSERT INTO normaltable SELECT * FROM externaltable;

Conclusion

There are several ways to load documents into LOB columns in Oracle database tables. The choice of method will depend on the nature and location of your data, the performance required for loading, and perhaps the loading method you're already most familiar with.

The author welcomes corrections and comments:  roger.ford@oracle.com

Comments ( 6 )
  • Yuan Monday, July 11, 2016

    In Method 2, Use Case 2, is there a way to grab the file date from the OS rather than provide it in the data file?


  • Roger Ford Monday, July 11, 2016

    That's a good question. I think the answer is "not directly". If you want to do that you'll probably need to generate the "loader.txt" file programmatically, using a language which can get the file creation date from the OS and insert it into the file. I'd use Perl for that, but no doubt there are many other (probably better) options.


  • guest Tuesday, November 1, 2016

    I have a set of files which represent 1 images (there are millions of 'pages') how can I load them so that the QBE tool will retrieve them as a single document and page through them? Currently for a single document stored as 8 images we get 8 'documents' returned and we can not page through them we have to select each page/document separately.


  • mik Friday, June 9, 2017
    INSERT INTO normaltable AS SELECT * FROM externaltable;

    should be

    INSERT INTO normaltable SELECT * FROM externaltable;

    AS keyword is forbidden here.
  • Pranita Wednesday, October 25, 2017
    INSERT INTO normaltable AS SELECT * FROM externaltable;

    should be

    INSERT INTO normaltable SELECT * FROM externaltable;

    AS keyword is forbidden here
  • Richard Wednesday, March 14, 2018
    Thanks - the article was very helpful, but I had to change the following to get my file to load in the correct character set as utl_raw assumes the source file is in the database character set. Including this in case someone else faces the same issue. You can find out the encoding of the source file by opening it in Firefox, then you need to find the equivalent oracle character set.

    --vc := utl_raw.cast_to_varchar2(buf);
    vc := UTL_I18N.RAW_TO_CHAR(buf,'AL16UTF16LE');