It's the bane of data loads everywhere. Having 99% of a process complete, only for it to fail on the last few rows:
insert into target_table
select * from massive_table
where  last_row_is_a_duplicate = 'Y';

ORA-00001: unique constraint (...) violated
This typically aborts the whole process. Rolling everything back.
Meaning you need to clean your data. Then run the whole load again.
Aaaaaargghhh!
This is particularly problematic if the load takes hours to complete.
Fortunately Oracle Database has several methods you can use to skip duplicate rows and stop this happening. In this post you'll see how you can do this by:

Filtering out existing values with a not exists subquery
Adding the ignore_row_on_dupkey_index hint to the insert
Capturing the failing rows with DML error logging
Making the unique constraint deferrable
But first, a quick recap on what primary keys and unique constraints are and why you have them.
Being able to locate one specific row in a table is a fundamental feature of SQL. In fact it's a requirement of first normal form.
Take the username for accounts in your application. To ensure someone has entered the correct password and pull up their details on login you need to find the row for that username. And be sure each username appears at most once in the accounts table.
Declaring the username unique enforces this second property. And raises an error if anyone tries to insert a duplicate name:
create table accounts (
  username   varchar2(10)  not null unique,
  given_name varchar2(512) not null
);

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Chris' );

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Christopher' );

ORA-00001: unique constraint (CHRIS.SYS_C0018134) violated
This guarantees that each username appears at most once.
A primary key is a special case of a unique constraint. It has these extra properties:

Its columns must be not null
A table can have only one primary key

So how do you choose between using a primary key or unique constraint?
Some tables can have many unique identifiers. A common reason for this is surrogate keys – sequence assigned or GUID values that have no meaning outside of the application. For example you may add an account_id column to the accounts table:
create table accounts (
  account_id integer generated as identity not null primary key,
  username   varchar2(10)  not null unique,
  given_name varchar2(512) not null
);
The table can only have one primary key. So at least one of account_id and username must be a unique constraint. But which should be the primary key?
The accounts table is likely to have many child tables that refer to a particular row in it. Such as orders, invoices, and payments. To ensure these child rows point to a valid account, create a foreign key from the child table to the parent.
The columns foreign keys point to should be immutable. Unchanging. This avoids problems with having to cascade updates of key values from parent to children. For example, if someone wants to change their username.
Both primary keys and unique constraints can be the target of foreign keys. But because you can only have one primary key, by convention foreign keys point to this constraint.
So by making account_id the primary key, you're telling other developers:
"This is the immutable value. Use this as the target for foreign keys."
Creating unique constraints improves data quality. Usually when someone tries to insert an existing value – such as creating a new account with an existing username – you want to stop this happening.
But sometimes you may want to skip this error. Say when loading keys that already exist.
If the source data are likely to contain key values that are already in the table, it's best to avoid loading these completely.
Do this with a not exists subquery:
insert into accounts ( username, given_name )
select username, given_name
from   accounts_stage acst
where  not exists (
  select * from accounts acct
  where  acct.username = acst.username
);
This will only load usernames from accounts_stage missing from accounts.
This only works with insert ... select. There's no query in insert ... values, so this approach won't work for single-row inserts.
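If you do need to protect a single-row insert, one workaround is to rewrite the values clause as a select from dual. A minimal sketch, reusing the accounts table above:

insert into accounts ( username, given_name )
select 'chrissaxon', 'Chris'
from   dual
where  not exists (
  select * from accounts acct
  where  acct.username = 'chrissaxon'
);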
And what if the source contains duplicates? How do you skip the extra rows there?
The simplest method is to add a hint to the query. Added in 11.2, the ignore_row_on_dupkey_index hint silently ignores duplicate values:
insert /*+ ignore_row_on_dupkey_index ( acct ( username ) ) */
into   accounts acct ( username, given_name )
select username, given_name
from   accounts_stage;
So if accounts_stage has duplicate usernames, this will add one of the rows and bypass the other.
This also works when using the values version of insert:
insert /*+ ignore_row_on_dupkey_index ( acct ( username ) ) */
into   accounts acct ( username, given_name )
values ( 'chrissaxon', 'Chris' );
To use the hint, put the table name or alias followed by either the columns of the unique constraint (in parentheses) or the name of the index that enforces it.
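For example, the index form of the hint looks like this – the index name accounts_username_u is an assumption; yours is probably system generated, so check user_indexes first:

-- assumes a unique index named accounts_username_u on accounts ( username )
insert /*+ ignore_row_on_dupkey_index ( acct, accounts_username_u ) */
into   accounts acct ( username, given_name )
values ( 'chrissaxon', 'Chris' );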
Note you can only ignore one constraint in the table. Not an issue if you want to skip extra usernames and have auto-generated primary keys. But a limitation if you have many unique keys on a table.
While adding a hint is easy, it has another major drawback: the insert skips the duplicate rows silently, leaving no record of which values it bypassed or how many. So if you want to know if there were any duplicates, you need to post-process the data to check.
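For example, one quick post-load check is to look for usernames that appear more than once in the staging table – a sketch assuming the accounts_stage table from earlier:

select   username, count (*)
from     accounts_stage
group by username
having   count (*) > 1;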
If the source contains identical rows - those where every column has the same value - this may not bother you. Of course, in this case, you're better off adding distinct in the subquery:
insert into accounts ( username, given_name )
select distinct username, given_name
from   accounts_stage;
But if it's only usernames that clash, it's likely you want to know. And find out which rows you skipped. To do this you need to keep track of which rows failed.
DML error logging enables insert, update, and delete statements to store exceptions in another table. Allowing the statement to complete without error. Connor discusses it in this video.
To use it, first you must create the error logging table with dbms_errlog.create_error_log. By default this creates a table named err$_<tablename>.
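If you'd rather pick the error table's name yourself, create_error_log accepts it as a second parameter. A sketch – accounts_err is an assumed name:

begin
  dbms_errlog.create_error_log (
    dml_table_name     => 'accounts',
    err_log_table_name => 'accounts_err'  -- assumed name; choose your own
  );
end;
/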
Then use the log errors clause in your SQL:
exec dbms_errlog.create_error_log ( 'accounts' );

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Chris' )
log errors into err$_accounts
reject limit unlimited;

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Christopher' )
log errors into err$_accounts
reject limit unlimited;

insert into accounts ( username, given_name )
select username, given_name from accounts_stage
log errors into err$_accounts
reject limit unlimited;
The reject limit defaults to zero. Meaning you'll still get an exception if there are any duplicates. Setting this to unlimited allows you to ignore all ORA-00001 errors and still have the insert succeed!
Of course, if there are lots of errors in the load, this may be a sign of bigger problems. And you want to stop processing. So if you want to abort the load if there's more than e.g. 100 failing rows, set reject limit 100.
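For example, a capped version of the staged load above might look like this – if more than 100 rows fail, the whole statement errors and rolls back:

insert into accounts ( username, given_name )
select username, given_name from accounts_stage
log errors into err$_accounts
reject limit 100;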
After the load is complete, you'll need to query the error logging table to see if there were any issues.
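The error table stores the failing rows alongside columns describing each error, so a check can be as simple as:

select ora_err_number$, ora_err_mesg$, username
from   err$_accounts;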
This could be a problem if there are lots of loads running at the same time. How do you know which process the errors belong to?
To combat this problem, add a tag to the load. Then use this to fish out the offending data after:
insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Christopher' )
log errors into err$_accounts ('load name')
reject limit unlimited;

select * from err$_accounts
where  ora_err_tag$ = 'load name';
All the techniques so far assume that you can safely ignore duplications. At least while running the load transaction.
But sometimes you may need to load all copies. Then post-process the data to keep the "correct" one.
The final method allows you to do this - provided you can do the validation in the same transaction!
You can delay the point the database validates a constraint until commit by declaring it as deferrable:
create table accounts (
  username   varchar2(10)  not null unique deferrable initially deferred,
  given_name varchar2(512) not null
);

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Chris' );

insert into accounts ( username, given_name )
values ( 'chrissaxon', 'Christopher' );
The key difference between this and the previous methods is the target table may contain duplicates during the transaction. The database only raises ORA-00001 when you commit. At which point it rolls back all the rows!
commit;

ORA-02091: transaction rolled back
ORA-00001: unique constraint (CHRIS.SYS_C0018128) violated

select count (*) from accounts;

  COUNT(*)
         0
So before you end the transaction, you need to find and delete any duplicate rows.
This can be useful if you have a multi-pass load where you run many insert statements. And you can only determine which rows to keep at the end.
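For example, before committing you could keep one arbitrary row per username and remove the rest. A sketch using rowid to pick the survivor – swap in whatever rule decides the "correct" row for your load:

delete from accounts acct
where  rowid > (
  select min ( dup.rowid )   -- keep the row with the lowest rowid for each username
  from   accounts dup
  where  dup.username = acct.username
);

commit;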
Note deferrable keys create a non-unique index to police the constraint. This makes primary key lookups less efficient. It also breaks the ignore_row_on_dupkey_index hint, as this requires a unique index:
insert /*+ ignore_row_on_dupkey_index ( acct ( username ) ) */
into   accounts acct ( username, given_name )
values ( 'chrissaxon', 'Chris' );

ORA-38913: Index specified in the index hint is invalid
HT to Richard Foote for suggesting deferrable constraints and pointing out the hint limitation.
As always, there is an element of subjectivity to "best". And it depends on your goals. To help you decide, ask yourself: how clean is the source data? And do you need to know which rows were skipped?
If the source data are clean, but you may be re-inserting existing values, then changing your SQL is the way to go. The most straightforward is to add a not exists clause. Or if you want to overwrite the non-key columns with the latest data, use merge.
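A sketch of the merge option, reusing the accounts_stage table from earlier – note the source must not contain duplicate usernames itself, or the merge raises ORA-30926:

merge into accounts acct
using accounts_stage acst
on    ( acct.username = acst.username )
when matched then
  update set acct.given_name = acst.given_name   -- overwrite non-key columns
when not matched then
  insert ( username, given_name )
  values ( acst.username, acst.given_name );     -- add brand-new usernames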
But when you're loading messy data, it's possible you'll have more problems than just ORA-00001 errors. Such as trying to put null into not null columns. In which case, DML error logging is the way to go, as this will capture all the issues.
For the most part, the ignore_row_on_dupkey_index hint should be a last resort. Handy if you need a quick fix. But definitely a kludgy workaround!
Similarly, deferrable primary key or unique constraints are probably a sign that you need to redesign your process. Also you can't make an existing constraint deferrable. You need to drop and recreate the constraint – and almost certainly its index too – to do this. Making it impractical to add this trick to most existing tables. But in cases where you have to load copies, then clean them after it may be the only viable method.
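For completeness, the drop-and-recreate might look something like this – the constraint name is an assumption, so look yours up in user_constraints first:

alter table accounts drop constraint accounts_username_u;  -- also drops its auto-created index

alter table accounts add constraint accounts_username_u
  unique ( username ) deferrable initially deferred;       -- policed by a non-unique index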
Have you tried any of the other solutions suggested in this post (subquery to avoid dups or using log errors)? How do they compare?