The leading edge of scripting languages and Oracle Database brought to you by the Data Access Development team

Efficient and Scalable Batch Statement Execution in Python cx_Oracle

Christopher Jones
Senior Principal Product Manager

Today's guest post is by Oracle's Anthony Tuininga, creator and lead maintainer of cx_Oracle, the extremely popular Oracle Database interface for Python.

Introduction

This article shows how batch statement execution in the Python cx_Oracle interface for Oracle Database can significantly improve performance and make working with large data sets easy.

In many cx_Oracle applications, executing SQL and PL/SQL statements using the method cursor.execute() is perfect. But if you intend to execute the same statement repeatedly for a large set of data, your application can incur significant overhead, particularly if the database is on a remote network. The method cursor.executemany() gives you the ability to reduce network transfer costs and database load, and can significantly outperform repeated calls to cursor.execute().
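
All of the examples below assume that a connection and cursor have already been created. A minimal sketch of that setup (the credentials and connect string are placeholders to adjust for your environment):

import cx_Oracle

# placeholder credentials and connect string -- adjust for your environment
connection = cx_Oracle.connect("user", "password", "dbhost.example.com/orclpdb")
cursor = connection.cursor()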

SQL

To help demonstrate batch execution, the following tables and data will be used:

create table ParentTable (
    ParentId              number(9) not null,
    Description           varchar2(60) not null,
    constraint ParentTable_pk primary key (ParentId)
);

create table ChildTable (
    ChildId               number(9) not null,
    ParentId              number(9) not null,
    Description           varchar2(60) not null,
    constraint ChildTable_pk primary key (ChildId),
    constraint ChildTable_fk foreign key (ParentId)
            references ParentTable
);

insert into ParentTable values (10, 'Parent 10');
insert into ParentTable values (20, 'Parent 20');
insert into ParentTable values (30, 'Parent 30');
insert into ParentTable values (40, 'Parent 40');
insert into ParentTable values (50, 'Parent 50');

insert into ChildTable values (1001, 10, 'Child A of Parent 10');
insert into ChildTable values (1002, 20, 'Child A of Parent 20');
insert into ChildTable values (1003, 20, 'Child B of Parent 20');
insert into ChildTable values (1004, 20, 'Child C of Parent 20');
insert into ChildTable values (1005, 30, 'Child A of Parent 30');
insert into ChildTable values (1006, 30, 'Child B of Parent 30');
insert into ChildTable values (1007, 40, 'Child A of Parent 40');
insert into ChildTable values (1008, 40, 'Child B of Parent 40');
insert into ChildTable values (1009, 40, 'Child C of Parent 40');
insert into ChildTable values (1010, 40, 'Child D of Parent 40');
insert into ChildTable values (1011, 40, 'Child E of Parent 40');
insert into ChildTable values (1012, 50, 'Child A of Parent 50');
insert into ChildTable values (1013, 50, 'Child B of Parent 50');
insert into ChildTable values (1014, 50, 'Child C of Parent 50');
insert into ChildTable values (1015, 50, 'Child D of Parent 50');

commit;

Simple Execution

To insert a number of rows into the parent table, the following naive Python script could be used:

data = [
    (60, "Parent 60"),
    (70, "Parent 70"),
    (80, "Parent 80"),
    (90, "Parent 90"),
    (100, "Parent 100")
]

for row in data:
    cursor.execute("""
            insert into ParentTable (ParentId, Description)
            values (:1, :2)""", row)

This works as expected and five rows are inserted into the table. Each execution, however, requires a "round-trip" to the database: the client (here, the Python script) makes a request to the database, and the database sends back its response. In this case, five round-trips are required. As the number of executions increases, the cost increases linearly with the average round-trip cost, which depends on the configuration of the network between the client and the database server, as well as on the capability of the database server.

Batch Execution

Performing the same inserts using executemany() would be done as follows:

data = [
    (60, "Parent 60"),
    (70, "Parent 70"),
    (80, "Parent 80"),
    (90, "Parent 90"),
    (100, "Parent 100")
]

cursor.executemany("""
        insert into ParentTable (ParentId, Description)
        values (:1, :2)""", data)

In this case there is only one round-trip to the database, not five. In fact, no matter how many rows are processed at the same time there will always be just one round-trip. As the number of rows processed increases, the performance advantage of cursor.executemany() skyrockets. For example, on my machine inserting 1,000 rows into the same table in a database on the local network using cursor.execute() takes 410 ms, whereas using cursor.executemany() requires only 20 ms. Increasing the number to 10,000 rows requires 4,000 ms for cursor.execute() but only 60 ms for cursor.executemany()!
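
Timings like these are straightforward to reproduce. The following is a minimal sketch of such a comparison, assuming the connection and cursor created earlier; the generated ParentId values are arbitrary, and each test is rolled back so both start from the same state:

import time

sql = "insert into ParentTable (ParentId, Description) values (:1, :2)"
rows = [(i, "Parent %d" % i) for i in range(1000, 2000)]   # 1,000 sample rows

start = time.time()
for row in rows:
    cursor.execute(sql, row)
print("execute() loop:", time.time() - start, "seconds")
connection.rollback()                  # restore state before the next test

start = time.time()
cursor.executemany(sql, rows)
print("executemany():", time.time() - start, "seconds")
connection.rollback()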

For really huge data sets there may be external buffer or network limits to how many rows can be processed at one time. These limits are based on both the number of rows being processed as well as the "size" of each row that is being processed. The sweet spot can be found by tuning your application. Repeated calls to executemany() are still better than repeated calls to execute().
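
One way to stay within such limits, sketched below, is to pass fixed-size slices of the data to executemany(); the batch size of 10,000 is just an arbitrary starting point for tuning:

batch_size = 10000                     # tune for your row size and network
sql = "insert into ParentTable (ParentId, Description) values (:1, :2)"

# all_rows is assumed to hold the complete list of bind tuples
for start in range(0, len(all_rows), batch_size):
    cursor.executemany(sql, all_rows[start:start + batch_size])
connection.commit()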

As mentioned earlier, execution of PL/SQL statements is also possible. Here is a brief example demonstrating how to do so. Note that the output bind variable is appended to the first row of data only: cx_Oracle takes its bind information from the first row, and the one variable receives the output for all of the rows:

data = [[2], [6], [4]]

var = cursor.var(str, arraysize = len(data))
data[0].append(var)                               # OUT bind variable ':2'
cursor.executemany("""
        declare
            t_Num number := :1;
            t_OutValue varchar2(100);
        begin
            for i in 1..t_Num loop
                t_OutValue := t_OutValue || 'X';
            end loop;
            :2 := t_OutValue;
        end;""", data)

print("Result:", var.values)

This results in the following output:

Result: ['XX', 'XXXXXX', 'XXXX']

Using executemany()

With the significant performance advantages of batch execution, it might seem obvious to use cursor.executemany() whenever possible. Let's look at some of the other features of executemany() that are useful for common data handling scenarios.

Scenario 1: Getting Affected Row Counts

One scenario that may arise is the need to determine how many rows are affected by each row of data that is passed to cursor.executemany(). Consider this example:

for parentId in (10, 20, 30):
    cursor.execute("delete from ChildTable where ParentId = :1",
            [parentId])
    print("Rows deleted for parent id", parentId, "are",
            cursor.rowcount)

This results in the following output:

Rows deleted for parent id 10 are 1
Rows deleted for parent id 20 are 3
Rows deleted for parent id 30 are 2

Since each delete is performed independently, determining how many rows are affected by each delete is easy to do. But what happens if we use cursor.executemany() in order to improve performance as in the following rewrite?

data = [[10], [20], [30]]

cursor.executemany("delete from ChildTable where ParentId = :1", data)
print("Rows deleted:", cursor.rowcount)

This results in the following output:

Rows deleted: 6

You'll note this is the sum of all of the rows that were deleted in the prior example, but the information on how many rows were deleted for each parent id is missing. Fortunately, that can be determined by enabling the Array DML Row Counts feature, available in Oracle Database 12.1 and higher:

data = [[10], [20], [30]]

cursor.executemany("delete from ChildTable where ParentId = :1", data,
        arraydmlrowcounts = True)
for ix, rowsDeleted in enumerate(cursor.getarraydmlrowcounts()):
    print("Rows deleted for parent id", data[ix][0], "are",
            rowsDeleted)

This results in the same output as was shown for the simple cursor.execute():

Rows deleted for parent id 10 are 1
Rows deleted for parent id 20 are 3
Rows deleted for parent id 30 are 2

Scenario 2: Handling Bad Data

Another scenario is handling bad data. When processing large amounts of data some of that data may not fit the constraints imposed by the database. Using cursor.execute() such processing may look like this:

data = [
    (1016, 10, 'Child B of Parent 10'),
    (1017, 10, 'Child C of Parent 10'),
    (1018, 20, 'Child D of Parent 20'),
    (1018, 20, 'Child D of Parent 20'),       # duplicate key
    (1019, 30, 'Child C of Parent 30'),
    (1020, 30, 'Child D of Parent 40'),
    (1021, 600, 'Child A of Parent 600'),     # parent does not exist
    (1022, 40, 'Child F of Parent 40'),
]

for ix, row in enumerate(data):
    try:
        cursor.execute("""
                insert into ChildTable
                (ChildId, ParentId, Description)
                values (:1, :2, :3)""", row)
    except cx_Oracle.DatabaseError as e:
        print("Row", ix, "has error", e)

This results in the following output:

Row 3 has error ORA-00001: unique constraint (EMDEMO.CHILDTABLE_PK) violated
Row 6 has error ORA-02291: integrity constraint (EMDEMO.CHILDTABLE_FK) violated - parent key not found

If you make use of cursor.executemany(), however, execution stops at the first error that is encountered:

data = [
    (1016, 10, 'Child B of Parent 10'),
    (1017, 10, 'Child C of Parent 10'),
    (1018, 20, 'Child D of Parent 20'),
    (1018, 20, 'Child D of Parent 20'),       # duplicate key
    (1019, 30, 'Child C of Parent 30'),
    (1020, 30, 'Child D of Parent 40'),
    (1021, 600, 'Child A of Parent 600'),     # parent does not exist
    (1022, 40, 'Child F of Parent 40'),
]

try:
    cursor.executemany("""
            insert into ChildTable (ChildId, ParentId, Description)
            values (:1, :2, :3)""", data)
except cx_Oracle.DatabaseError as e:
    errorObj, = e.args
    print("Row", cursor.rowcount, "has error", errorObj.message)

This results in the following output:

Row 3 has error ORA-00001: unique constraint (EMDEMO.CHILDTABLE_PK) violated

Fortunately there is an option to help here as well, using the Batch Errors feature available in Oracle Database 12.1 and higher. This can be seen using the following code:

data = [
    (1016, 10, 'Child B of Parent 10'),
    (1017, 10, 'Child C of Parent 10'),
    (1018, 20, 'Child D of Parent 20'),
    (1018, 20, 'Child D of Parent 20'),       # duplicate key
    (1019, 30, 'Child C of Parent 30'),
    (1020, 30, 'Child D of Parent 40'),
    (1021, 600, 'Child A of Parent 600'),     # parent does not exist
    (1022, 40, 'Child F of Parent 40'),
]

cursor.executemany("""
        insert into ChildTable (ChildId, ParentId, Description)
        values (:1, :2, :3)""", data, batcherrors = True)
for errorObj in cursor.getbatcherrors():
    print("Row", errorObj.offset, "has error", errorObj.message)

This results in the following output, which is identical to the example that used cursor.execute():

Row 3 has error ORA-00001: unique constraint (EMDEMO.CHILDTABLE_PK) violated
Row 6 has error ORA-02291: integrity constraint (EMDEMO.CHILDTABLE_FK) violated - parent key not found

In both the execute() and executemany() cases, the rows that were inserted successfully are part of an open transaction, which will need to be either committed or rolled back with connection.commit() or connection.rollback(), depending on the needs of your application. Note that if you use autocommit mode, the commit takes place only when no errors are returned; otherwise, a transaction is left open and will need to be explicitly committed or rolled back.
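
One common pattern, sketched below using the data list from the previous example, is to commit only when the whole batch is clean and otherwise report the failures and roll back; whether to keep a partial batch instead is an application decision:

cursor.executemany("""
        insert into ChildTable (ChildId, ParentId, Description)
        values (:1, :2, :3)""", data, batcherrors = True)
errors = cursor.getbatcherrors()
if errors:
    for errorObj in errors:
        print("Row", errorObj.offset, "has error", errorObj.message)
    connection.rollback()       # discard the rows that did insert
else:
    connection.commit()         # every row inserted cleanly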

Scenario 3: DML RETURNING Statements

The third scenario that I will consider is that of DML RETURNING statements. These statements let you bundle a DML statement (INSERT, UPDATE or DELETE) with a query that returns data, all in a single execution. With cursor.execute() this is done easily enough using something like the following code:

childIdVar = cursor.var(int)
cursor.setinputsizes(None, childIdVar)
for parentId in (10, 20, 30):
    cursor.execute("""
            delete from ChildTable
            where ParentId = :1
            returning ChildId into :2""", [parentId])
    print("Child ids deleted for parent id", parentId, "are",
            childIdVar.values)

This produces the following output:

Child ids deleted for parent id 10 are [1001]
Child ids deleted for parent id 20 are [1002, 1003, 1004]
Child ids deleted for parent id 30 are [1005, 1006]

Support for DML RETURNING with cursor.executemany() was introduced in cx_Oracle 6.3. Because DML RETURNING was previously supported only by execute(), the attribute "dml_ret_array_val" on the cx_Oracle.__future__ object must be set to True to allow multiple values to be returned by executemany(); calling executemany() without setting it results in an error. Finally, the variable created to accept the returned values must have an array size large enough to hold them: one array of output data is returned for each input record provided.

The following code shows the new executemany() support in cx_Oracle 6.3:

cx_Oracle.__future__.dml_ret_array_val = True

data = [[10], [20], [30]]
childIdVar = cursor.var(int, arraysize = len(data))
cursor.setinputsizes(None, childIdVar)
cursor.executemany("""
        delete from ChildTable
        where ParentId = :1
        returning ChildId into :2""", data)
for ix, inputRow in enumerate(data):
    print("Child ids deleted for parent id", inputRow[0], "are",
            childIdVar.getvalue(ix))

This results in the same output as was seen with cursor.execute():

Child ids deleted for parent id 10 are [1001]
Child ids deleted for parent id 20 are [1002, 1003, 1004]
Child ids deleted for parent id 30 are [1005, 1006]

Note that setting "dml_ret_array_val" to True also affects execute(): arrays are returned for each bind record. In a future cx_Oracle 7 release, this will become the only behavior available.
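
For instance, under this setting the single-statement execute() case reads its returned values with getvalue() instead of the values attribute. A minimal sketch, reusing the tables from earlier:

cx_Oracle.__future__.dml_ret_array_val = True

childIdVar = cursor.var(int)
cursor.setinputsizes(None, childIdVar)
cursor.execute("""
        delete from ChildTable
        where ParentId = :1
        returning ChildId into :2""", [20])
print("Child ids deleted:", childIdVar.getvalue())   # e.g. [1002, 1003, 1004]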

Scenario 4: Variable Data Lengths

When multiple rows of data are being processed there is the possibility that the data is not uniform in type and size. cx_Oracle makes some effort to accommodate such differences. For example, type determination is deferred until a value that is not None is found in the data. If all values in a particular column are None, then cx_Oracle assumes the type is a string and has a length of 1. cx_Oracle will also adjust the size of the buffers used to store strings and bytes when a longer value is encountered in the data. These sorts of operations, however, will incur overhead as cx_Oracle has to reallocate memory and copy all of the data that has been processed thus far. To eliminate this overhead, the method cursor.setinputsizes() should be used to tell cx_Oracle about the type and size of the data that is going to be used. For example:

data = [
    (110, "Parent 110"),
    (2000, "Parent 2000"),
    (30000, "Parent 30000"),
    (400000, "Parent 400000"),
    (5000000, "Parent 5000000")
]

cursor.setinputsizes(None, 20)
cursor.executemany("""
        insert into ParentTable (ParentId, Description)
        values (:1, :2)""", data)

In this example, without the call to cursor.setinputsizes(), cx_Oracle would perform five allocations of increasing size as it discovered each new, larger string. The value 20, however, tells cx_Oracle that the maximum size of the strings that will be processed is 20 characters. Since cx_Oracle allocates memory for each row based on this value it is best not to oversize it. Note that if the type and size are uniform (like they are for the first column in the data being inserted), the type does not need to be specified and None can be provided, indicating that the default type (in this case cx_Oracle.NUMBER) should be used.
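
cursor.setinputsizes() is also the way to override the deferred type determination described above when a column contains only None values. A minimal sketch, using a hypothetical table DataTable with a nullable numeric column:

# hypothetical table: create table DataTable (Id number, Amount number)
data = [(1, None), (2, None), (3, None)]

# without the hint, the all-None Amount column would be bound as a
# one-character string; declare the bind type as a number instead
cursor.setinputsizes(None, cx_Oracle.NUMBER)
cursor.executemany("insert into DataTable (Id, Amount) values (:1, :2)", data)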

Conclusion

As can be seen by the preceding examples, cursor.executemany() lets you manage data easily and enjoy high performance at the same time!

Join the discussion

Comments (5)
  • Jeff Dawson Monday, April 30, 2018
    For batch inserts, have you tested the performance of executemany() vs. passing nested record types to a PL/SQL procedure that performs a FORALL insert? My assumption is that the FORALL would outperform executemany(), but I have not set up a test case to prove it.
  • Christopher Jones Monday, April 30, 2018
    Any FORALL solution has to do that extra interface call from PL/SQL to SQL, which executemany() doesn't have to do.
  • Ronny Meissner Tuesday, July 3, 2018
    The Python DB API 2.0 allows the parameters of executemany() to be a generator, which would be a huge advantage when inserting large datasets. cx_Oracle seems not to support this and states that it requires the parameters to be a "list" (or tuple). Are there any plans to change this, or any recommendation on how to insert large datasets without having to produce the whole list upfront?
  • Christopher Jones Tuesday, July 3, 2018
    @Ronny please open an issue at https://github.com/oracle/python-cx_Oracle/issues which is a better place for discussion and for enhancements to be tracked.

    Since the complete data set needs to be sent to the DB in one operation, a generator may only be a programming convenience, rather than providing any memory or time savings.
  • Ronny Meissner Tuesday, July 3, 2018
    @Christopher,
    thanks for your prompt reply, I will try and log an issue in GitHub.
    Not sure I would agree on the programming convenience. In fact I played around with SQLite3 (included in Python), where the API supports a generator: where I'm getting a "DPI-1001: out of memory" error in cx_Oracle after exceeding a GB, I only got some 370MB for the generator-based executemany in sqlite3. In general it is recommended to use generators when the datasets get huge, which is the case with my use case here as well (generating the complete list upfront is expensive).

    But still, thanks for the recommendation.