By Roger Ford-Oracle on Mar 23, 2016
With recent versions of Oracle, there is more and more emphasis on handling unstructured text, as well as the traditional fully structured data such as numbers, dates and short text fields.
Here I want to consider various methods of getting "documents" into the Oracle database.
None of these methods are completely trivial, and the choice of which to use will depend on various factors.
What are the characteristics of a "document"? In general, I'm referring to text which may be too long to fit into a PL/SQL VARCHAR2 variable. These variables are limited to 32767 bytes. The document most likely originates from a file on a file system, and it may be in a character set which differs from the database character set. In many cases, documents will have metadata (such as author and creation date) that we want to load at the same time.
We should distinguish between four types of data:
- Metadata. This is short textual, numeric or date data which describes the main document. For example: document id, author, date created.
- Semi-structured Data. This is textual strings in a specific format, such as JSON or XML. We may be able to make assertions about such data, such as "will always start with '<'", or "does not contain any newline characters".
- Full Text. This is text of arbitrary length. It can and usually will contain all sorts of characters including newlines and quotes. It may consist of multibyte characters in UTF-8 or UCS16 (or other) character sets.
- Binary Data. Arbitrary length data which is not character encoded. May contain any bytes or combinations of bytes including nulls and end-of-file markers.
Loading a mixture of these types of data can be problematic. If we want to load metadata and full text from a single file, then we need some method of determining where the full text starts and finishes. The easiest way to do that is to enclose the full text in some sort of quotes. But then what if the full text contains those quote characters? So it's usually easier to arrange for each full text document to exist in a separate file, referenced from the main loading method.
It would be impossible to load multiple binary data records from a single file, unless the binary data was first encoded using a scheme such as BASE64, since by its nature you can never specify a delimiter string that could not appear in the binary data.
For these examples, we're going to consider three use-cases:
- A set of JSON documents. These documents reside in a single file, one per line.
- A set of files, which contain one document per file. Each file/document has metadata associated with it in a "summary" file.
- As above, but one of the files contains multibyte (UTF-8) data.
Client / Server considerations - where is your source data?
If your client and server are the same machine, then any of these methods should work. However, if you are connecting to your database from a remote client over SQL*Net, you might first decide whether you want to load files from the client (the machine from which you connect) or the server (the machine where the database resides).
|Load Method||Data on Client||Data on Server|
Method 1: Using SQL*Plus Scripts
Let's consider we have a table with a several short columns and a CLOB column:
CREATE TABLE my_table (id NUMBER, author VARCHAR2(30), created DATE, text CLOB );
If I just want to do some "quick and dirty" loading of this table, I can just create a SQL script file and run that in SQL*Plus. For example:
INSERT INTO my_table VALUES ( 1, 'Roger', SYSDATE, 'the quick brown fox jumps over the lazy dog' ); INSERT INTO my_table VALUES ( 2, 'John', SYSDATE, 'the angry aligator ate the antelope'); COMMIT;
Simple and quick, and fairly easy to generate from source files using a language such as Perl or Python. But do this with any quantity of data and you'll rapidly hit limitations:
- Any single quotes in the text need to be doubled up:
INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'my best friend''s brother said ''hi''');
- Any ampersand ("&") characters in the script will cause problems unless you've put SET DEFINE OFF somewhere in the script
- An input line to SQL*Plus cannot be longer than 2499 characters. If you want to include more than this you'll need to concatenate strings:
INSERT INTO my_table VALUES ( 3, 'Roger', SYSDATE, 'this is a long line that continues over more than one line' || ' in the source code file');
- Even then, that concatenation method will only allow you to assemble up to 4000 characters. To continue longer than that would require you to create temporary lobs, and use DBMS_LOB.APPEND for each chunk.
- It's slower than most other methods which will be an issue if you're loading large quantities of data.
Let's take our usecase 1 - the JSON documents. We could cut-and-paste from the file into a hand-coded script, but it's probably better to do it using a text-processing language such as Perl. The attached file jsonToSql.pl replaces any single quotes with two single quotes, and breaks each line into chunks no longer than 2400 characters (allowing some space for syntactic overheads.
Assuming we have perl installed and available, we can call it from the command line (Windows or Unix) using
perl jsonToSql.pl jsonsource.txt > json.sql
And we can then load the resulting file json.sql (after creating the necessary table - see above) using
sqlplus user/pass @json.sql
SQL*Plus isn't very good for loading from separate files, so we'll leave scenarios two and three out for this method.
SQL*Plus isn't very good at dealing with multibyte / UTF-8 characters. To get useful correct (and viewable!) output you will probably want to use SQL Developer to run the scripts.
Method 2: Using SQL*Loader
SQL*Loader is a utility from the very early days of Oracle. It's reasonably simple to use, but the vast number of options can make it intimidating for the uninitiated.
SQL*Loader loads files from the client. It has no access to files on the server.
It is normal to provide SQL*Loader with two inputs: A data file containing the data to be loaded to the database, and a control file which specifies how the data file is laid out, and which columns the data should be written to (it's possible to include the data inside the control file, but we'll stick with separate files for this example).
Use case 1: JSON documents in a single file
LOAD DATA INFILE 'jsonsource.txt' INTO TABLE jsondocs FIELDS ( jsontext CHAR(32000) )
Then we load it using the command:
sqlldr roger/roger control=jsonloader.ctl log=loader.log bad=loader.bad
The "log" and "bad" arguments aren't required, but can be most useful for debugging.
Note that we've limited the length of the jsontext field to 32000 characters, but you can increase that size if you need to.
Use case 2: Loading from separate files
A quick reminder about our second scenario. We have a table which we'll define as:
CREATE TABLE my_table (id NUMBER, author VARCHAR2(30), created DATE, text CLOB );
The quick brown fox jumps over the lazy dog
The angry alligator ate the antelope.
These files only have one line in each, but they could just as easily have multiple lines per file. One file is loaded into each database record.
If we want to use SQL*Loader to load CLOB or BLOB data from separate files, then the filenames for those files must be included in the data file. We use the "FILLER" syntax to specify that the filename isn't to be loaded into any column, and then the LOBFILE syntax to load the text into a column.
So here's our control file. We can see that it's similar to the previous control file but the "FIELDS" section is more complicated, to allow for multiple metadata fields, and the reference to the external file to be loaded.
LOAD DATA INFILE 'loader.txt' INTO TABLE my_table FIELDS TERMINATED BY ',' ( id CHAR(10), author CHAR(30), created DATE "YYYY-MM-DD" ":created", fname FILLER CHAR(80), text LOBFILE(fname) TERMINATED BY EOF )
"fname" is an arbitrary label, we could have used "fred" and it would have worked exactly the same. It just needs to be the same on the two lines where it is used.
Note that the control file includes the name of our data file, loader.txt. loader.txt contains the metadata associated with each file, and the filename.
So we're loading two files, along with the metadata for each.
Now we can run SQL*Loader from the command line as follows:
sqlldr roger/roger control=loader.ctl log=loader.log bad=loader.bad
(again, the "log" and "bad" arguments aren't required, but can be most useful for debugging).
Then we can use SQL*Plus or SQL Developer to examine our loaded table:
SQL> select * from my_table; ID AUTHOR CREATED TEXT ---------- ---------------- --------- ------------------------------------------------ 1 John Smith 29-APR-15 The quick brown fox jumps over the lazy dog 2 Pete Jones 31-JAN-13 The angry alligator ate the antelope.
Do not be tempted to add spacing to format the loader.txt file nicely. If you use
1, John Smith, 2015-04-29, file1.txt
Then not only will the author be loaded with a space in front of "John", but you'll see an obscure error:
SQL*Loader-509: System error: The device does not recognize the command. SQL*Loader-502: unable to open data file ' file1.txt' for field TEXT table
So what if your file contains multi-byte characters? Perhaps some Chinese text mixed up with English?
You will simply need to specify the character set of the file in the LOBFILE directive. For example, I have a file encoded in UTF-8, so I modify the control file to read:
INFILE 'loader.txt' INTO TABLE my_table FIELDS TERMINATED BY ',' ( id CHAR(10), author CHAR(30), created DATE "YYYY-MM-DD" ":created", fname FILLER CHAR(80), text LOBFILE(fname CHARACTERSET UTF8) TERMINATED BY EOF )
Loading Binary Files with SQL*Loader
The above method (without the CHARACTERSET addition) will work exactly the same if our column is a binary BLOB column rather than a character CLOB. There is no difference in the loading mechanism - all that changes is that the database does not attempt to do any kind of character set conversation on the data as it is loaded into the database.
Method 3: Using BFILES and PL/SQL
A BFILE is a non-standard datatype which provides read-only access to data held on an external file system.
NOTE: This method will only work with files on the server. BFILES have no access to files on the client machine.
It's possible to work directly with BFILEs, but it's perhaps more common to use them as a kind of temporary staging area to allow you to read data into database columns.
BFILEs require a DIRECTORY - which is an internal database pointer to a folder/directory on the file system. To create a directory you must have CREATE ANY DIRECTORY privilege - or you can have a DBA create the directory and grant you access to it. This provides a layer of protection to prevent users from reading or writing files on the database server.
There is a provided package DBMS_LOB which provides a number of utility functions including LOADFROMFILE.
On the face of it, this might seem all we need, but in fact there's some preparation work we need to do.
LOADFROMFILE will only load into an open clob. So we need to prepare a lob locator, open it, call LOADFROMFILE, then close it afterwards.
The following example shows a procedure which writes to our previously-created my_table.
Firstly it creates a BFILE from the supplied directory and filename, then it inserts the metadata into my_table, fetching back the lob locator for the newly created row. It then uses LOADFROMFILE to write into that lob:
CREATE SEQUENCE loadtest_seq; CREATE TABLE my_table(id NUMBER, author VARCHAR2(100), load_date DATE, text CLOB); CREATE OR REPLACE PROCEDURE LOAD_A_FILE ( dir_name VARCHAR2, file_name VARCHAR2, author VARCHAR2 ) IS l_bfile BFILE; l_clob CLOB; BEGIN l_bfile := BFILENAME(dir_name, file_name); IF (dbms_lob.fileexists(l_bfile) = 1) THEN dbms_output.put_line('File Exists'); INSERT INTO my_table T VALUES (loadtest_seq.NEXTVAL, author, sysdate, EMPTY_CLOB() ) RETURN text INTO l_clob; L_BFILE := bfilename(dir_name, file_name); dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY ); dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength(l_bfile) ); dbms_lob.fileclose( l_bfile ); COMMIT; ELSE dbms_output.put_line('File does not exist'); END IF; END; /
To use this procedure, we might do:
CREATE OR REPLACE DIRECTORY my_directory AS '/scratch/raford/Projects/loadingFiles'; -- remember that needs CREATE ANY DIRECTORY priv.Note that in this date we're using SYSDATE for the creation date, which varies slightly from previous examples. We could of course include create_date as another argument to our procedure.
EXECUTE load_a_file( 'MY_DIRECTORY', 'file1.txt', 'Roger');
Remember that the file must be found on the database server. You cannot use this method to load a file which is located on a local machine if you are using SQL*Plus or SQL Developer remotely.
The directory name is an identifier, similar to a username, table name or column name. As such, if you specify it without double quotes, it will be automatically upper-cased. But LOADCLOBFROMFILE expects the directory name to be passed in as a string, so you must specify it in upper-case if it was originally created without quotes.
No character set conversion is applied by LOADCLOBFROMFILE. This generally means the above procedure will not work if your data is not simple single-byte characters. Often you'll find the text is inexplicably truncated, becauuse of the difference between the number of characters and the number of bytes. See the next section for a solution.
If we want to load multibyte character data, we must forget DBMS_LOB.LOADFROMFILE and copy the contents of the BFILE into the CLOB using a more "manual" technique. This involves reading chunks into a RAW buffer, and then using UTL_RAW.CAST_TO_VARCHAR2 to allow us to write the RAW data into the CLOB value:
CREATE OR REPLACE PROCEDURE LOAD_MULTIBYTE_FILE ( dir_name VARCHAR2, file_name VARCHAR2, author VARCHAR2 ) IS l_bfile BFILE; l_clob CLOB; buf RAW(32767); vc VARCHAR2(32767); maxsize INTEGER := 8192; -- a char can take up to 4 bytes, -- so this is the maximum safe length in chars amt INTEGER; amtvc INTEGER; v_offset INTEGER := 1; BEGIN l_bfile := bfilename(dir_name, file_name); IF (dbms_lob.fileexists(l_bfile) = 1) THEN dbms_output.put_line('File Exists'); INSERT INTO my_table T VALUES (loadtest_seq.NEXTVAL, author, sysdate, EMPTY_CLOB() ) RETURN text INTO l_clob; l_bfile := BFILENAME(dir_name, file_name); amt := dbms_lob.getlength( l_bfile ); dbms_lob.fileopen( l_bfile, dbms_lob.FILE_READONLY ); WHILE amt > 0 LOOP IF amt > maxsize THEN amt := maxsize; END IF; dbms_lob.read( l_bfile,amt, v_offset, buf ); vc := utl_raw.cast_to_varchar2(buf); amtvc := LENGTH(vc); dbms_lob.writeappend( l_clob, amtvc, vc ); v_offset := v_offset + amt; amt := dbms_lob.getlength( l_bfile ) - v_offset + 1; END LOOP; dbms_lob.fileclose( l_bfile ); COMMIT; ELSE dbms_output.put_line('File does not exist'); END IF; END; /
Method 4: Using External Tables
External tables appear to the database much like normal tables, but their data resides on the file system. The layout of the external files is described using a variation on SQL*Loader syntax.
However, SQL*Loader loads from the client file system, whereas external tables expect their data files to exist on the server file system.
External tables are particularly useful for loading large numbers of records from a single file, so each record appears in its own row in the table.
CREATE TABLE my_ext_table ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY MY_DIRECTORY ACCESS PARAMETERS ( RECORDS DELIMITED BY 0x'0A' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ( id CHAR(10), author CHAR(30), created DATE "YYYY-MM-DD", text CHAR(131071) ) ) LOCATION ( MY_DIRECTORY:'external.dmp' ) );
And here's our data file external.dmp:
1,Roger,2015-08-08,The quick brown fox jumps over the lazy dog 2,John,2012-01-01,"The angry aligator, acting alone, ate the antelope"
Note the second line requires quotes around the full text string, as otherwise the field would be terminated at the comma. Of course we meet the usual problem that the full text string may contain quotes. We can counter that by doubling them:
2,John,2012-01-01,"The angry aligator, acting ""alone"", ate the antelope"
But we still have the issue that the full text might contain the record delimiter character (0x'0A', or newline). So as with SQL*Loader, we can use specify a separate file for each document. External tables do not support the FILLER directive, so instead we have to use a COLUMN TRANSFORMS line to specify that our fname field contains the name of our external file:
DROP TABLE my_ext_table2; CREATE TABLE my_ext_table2 ( id NUMBER, author VARCHAR2(30), created DATE, text CLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY MY_DIRECTORY ACCESS PARAMETERS ( RECORDS DELIMITED BY 0x'0A' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ( id CHAR(10), author CHAR(30), created DATE "YYYY-MM-DD", fname char(100) ) COLUMN TRANSFORMS (text FROM LOBFILE(fname) FROM (MY_DIRECTORY) ) ) LOCATION ( 'loader.txt' ) ); select * from my_ext_table2;
Note: the "FROM (MY_DIRECTORY)" clause is not necessary here since we've already specified that as the default directory.
However if the loader.txt file and the individual clob files were in different locations, it would be needed.
Once your data is in an external table, you can either leave it there and perform normal table operations (DML and most DDL) on the external table, or you can use the external table as a "staging" table to get the data into a normal table. You can simply do
CREATE TABLE normaltable AS SELECT * FROM externaltable;
to create a new normal table. And of course you can similarly do:
INSERT INTO normaltable AS SELECT * FROM externaltable;
There are several ways to load documents into LOB columns in Oracle database tables. The choice of method will depend on the nature and location of your data, the performance required for loading, and perhaps the loading method you're already most familiar with.
The author welcomes corrections and comments: firstname.lastname@example.org