There are some things in life it's great to have spare copy of. Your keys, your kidneys and your backups to name a few.
The dictionary definition of this is:
So you're looking for two (or more) rows that are copies of each other. For example, the two rows below hold identical information:
TITLE UK_RELEASE_DATE LENGTH_IN_MINUTES BBFC_RATING ------- ----------------- ----------------- ----------- Frozen 06-DEC-2013 00:00 102 PG Frozen 06-DEC-2013 00:00 102 PG
But often you need to find copies on a subset of the columns. For example, it's common for tables to have an ID column. This means the rows are no longer exact copies. But all the "real world" information is repeated:
FILM_ID TITLE UK_RELEASE_DATE LENGTH_IN_MINUTES RATING ---------- ---------- ----------------- ----------------- ------ 1 Frozen 06-DEC-2013 00:00 102 PG 2 Frozen 06-DEC-2013 00:00 102 PG
Or maybe there are only one or two column values that are copies. It is common for websites to use your email address as your username. So if you have more than one email address in your user accounts table, how do you know who is who?
USER_ID EMAIL_ADDRESS FIRST_NAME LAST_NAME ------- ------------- ---------- --------- 1 firstname.lastname@example.org Dave Badger 2 email@example.com Sarah Fox
So your first step to finding the duplicates is defining which columns form a repeating group.
In the first films example above the rows are exact duplicates. So you may be tempted to say "all columns". But what if you insert another row like this:
TITLE UK_RELEASE_DATE LENGTH_IN_MINUTES BBFC_RATING ---------- ----------------- ----------------- ---------- Frozen 06-DEC-2013 00:00 90 U
So when looking for copies here, you want to use:
There are many ways you can find copies. Using group by is one of the easiest. To do this, list all the columns you identified in the previous step in the select and group by clauses. You can then count how many times each combination appears with count(*):
select title, uk_release_date, count(*) from films group by title, uk_release_date
This returns you a single row for each combination. This includes the rows without duplicates. To return just the copied values you need to filter the results. This is those where the count is greater than one. You can do this with a having clause, like so:
select title, uk_release_date, count(*) from films group by title, uk_release_date having count(*) > 1;
If you want to display the all the rows you need another step. Query the table again. Filter it by checking where the rows are in the results of the above query:
select * from films where (title, uk_release_date) in ( select title, uk_release_date from films group by title, uk_release_date having count(*) > 1 )
So far so good. You've had to list the table twice though. So it's likely you'll have two full scans of the table, as this execution plan shows:
For large tables, this could take a long time. It would be better if you could reference the table once.
select f.*, count(*) over (partition by title, uk_release_date) ct from films f;
So what difference does this make?
Unlike group by, analytics preserve your result set. So you can still see all the rows in the table. This means you only need to list the table once.
You still need to filter the results though. Currently it includes the unduplicated values. You can't use analytic functions in the where or having clauses:
select f.* from films f where count(*) over (partition by title, uk_release_date) > 1; ORA-00934: group function is not allowed here
So you need to do this in an inline view:
select * from ( select f.*, count(*) over (partition by title, uk_release_date) ct from films f ) where ct > 1
with film_counts as ( select f.*, count(*) over (partition by title, uk_release_date) ct from films f ) select * from film_counts where ct > 1
With a list of the duplicates in hand, it's time for the next step!
Now you've identified the copies, you often want to delete the extra rows. For simplicity, I'm going to assume that either the rows are exact copies or you don't care which you remove.
If there is only a handful, you could do this by hand. But this is unworkable if there are a large number of duplicates. It's better to build a single statement which removes all the unwanted copies in one go.
To do this, you must first decide which rows you want to keep. For example, you might want to preserve the oldest row. To do this you'll need another column in the table (e.g. insert date, id, etc.) that is not one of the copied values.
Assuming you have this, there are two ways you can build your delete. If your driving column is unique for each group, but may have duplicates elsewhere in the table, you'll need a correlated delete. Insert dates fall into this category.
Correlated means you're joining the table you're deleting from in a subquery. You need to do this on your duplicate column group. Take the minimum value for your insert date:
delete films f where insert_date not in ( select min(insert_date) from films s where f.title = s.title and f.uk_release_date = s.uk_release_date )
An uncorrelated delete is similar to the previous example. The difference you don't need the joins. The subquery here is like the original group by example you made to find the copies. Then add the id or other unique column as appropriate. For example:
select min(film_id) from films group by title, uk_release_date
This gives you the value of the first entry for each combination. This includes any unduplicated groups. So you want to get rid of all the rows not in this set. Do this by deleting those not in this subquery:
delete films where film_id not in ( select min(film_id) from films group by title, uk_release_date )
All rows in Oracle have a rowid. This is a physical locator. That is, it states where on disk Oracle stores the row. This unique to each row. So you can use this value to identify and remove copies. To do this, replace min() with min(rowid) in the uncorrelated delete:
delete films where rowid not in ( select min(rowid) from films group by title, uk_release_date )
You've gone to all the effort of removing the duplicates. You're feeling good. But a week later a new ticket comes in:
"Please remove duplicate customer accounts".
All that hard work for nothing! OK, you've got your delete ready, so you can run it again.
After a while you'll likely become bored of this. So you create a job to remove the duplicates. But this doesn't solve the problem. There will still be times between job runs where the table could have duplicates.
The answer's easy: create a unique constraint. Include in this all the columns that form the duplicate group you identified at the start:
alter table films add constraint film_u unique (title, uk_release_date)
With this is in place, any attempts to add copies will throw an error:
insert into films values ( 3, 'Frozen', date'2013-12-06', 100, 'U', sysdate ); SQL Error: ORA-00001: unique constraint (CHRIS.FILM_U) violated
Alternatively, you can ignore these. As of 11g Release 2 you can change your inserts to use the IGNORE_ROW_ON_DUPKEY_INDEX hint. Or there's a way to get around this with triggers. Be warned though, both of these solutions cause inserts to fail silently. So the data your users entered "disappears". People tend to be unhappy when this happens!
It would be nice if there was a way to ignore the existing rows, while also ensuring that there are no new duplicates.
Fortunately there is!
This creates the constraint instantly. It doesn't work as you would expect with unique constraints though:
alter table films add constraint film_u unique (title, uk_release_date) novalidate SQL Error: ORA-02299: cannot validate (CHRIS.FILM_U) - duplicate keys found
To overcome this, you need to use a non-unique index. You can create this beforehand as normal. Or you can define it right in the constraint definition:
alter table films add constraint film_u unique (title, uk_release_date) using index ( create index film_i on films (title, uk_release_date) ) novalidate
To ensure there are no duplicates, you need to validate the constraint.
This opens up another way to find the copies. When you validate it, you can log any violating rows to an error table. You do this with the exceptions into clause.
To use you this you need an exceptions table. You can do this with the supplied script $ORACLE_HOME/rdbms/admin/utlexcpt.sql. Or the following create table:
create table exceptions ( row_id rowid, owner varchar2(30), table_name varchar2(30), constraint varchar2(30) )
With this in place you can run:
alter table films modify constraint film_u validate exceptions into exceptions
Note that this doesn't remove the failing rows. You still need to do that yourself. But it does find them for you. You can use this information to filter the rows you're deleting:
delete films where rowid in ( select row_id from exceptions where table_name = 'FILMS' and constraint = 'FILM_U' and owner = sys_context('userenv', 'current_user') ) and rowid not in ( select min(rowid) from films where rowid in ( select row_id from exceptions where table_name = 'FILMS' and constraint = 'FILM_U' and owner = sys_context('userenv', 'current_user') ) group by title, uk_release_date having count(*) > 1 )
If the number of duplicates relative to the size of the table is small, this approach may make your delete faster. This is because it can swap a full scan of your table for rowid lookups:
If your table has hundreds of millions of rows or more, this could massively reduce how long the delete takes.
Whichever approach you take, it's a good idea to create an unvalidated unique constraint first. If you don't, new people may enter new duplicates while your delete is running. Which means you may need to run the delete again. And again. And again...
Adding the constraint prevents you from chasing your tail. You can then clean the data at your leisure.
Do note that having non-unique index supporting a unique constraint could lead to performance issues. So once you've fixed your data you may want change the index to be unique. But you can't change the uniqueness of an existing index. You need to create a new one.
Before Oracle Database 12c you could do this by adding a constant to the index of the unvalidated constraint, like so:
alter table films add constraint film_u_temp unique (title, uk_release_date) using index ( create index film_i on films (title, uk_release_date, 'a') ) novalidate;
Then add the unique index once the data are clean:
create unique index film_ui on films (title, uk_release_date);
Fortunately, as of 12c you can create two (or more) indexes on the same columns - provided only one is visible! So there's no need for the constant nonsense.
Want to try this out for yourself?
Head over to LiveSQL to run the scripts in this post.
Which method do you like best? Are there any other ways you could find and remove duplicates?
Let us know in the comments!
UPDATED: 21 July 2017, fixed explanation about switching normal index for unique (comment from guest on 31 Aug 16) and formatting fixes.