X

Celebrating the joy and power of Oracle SQL with the Oracle Developer Advocate team

  • October 28, 2015

Finding the Longest Common Starting Substring Using SQL

Chris Saxon
Developer Advocate
A poster asked an interesting question on Ask Tom today. This amounted to:

How can I return a distinct list of the longest common sequences of characters at the start of a set of strings?

For example, given the following strings:

emcpower1
emcpower2
emcpower3
sda
sdb
sdc

The longest sequences where there is at least one other row that starts the same are "emcpower" and "sd". So the output should be:

emcpower
sd

How can you solve this using SQL?

To figure out how to do this, it's useful to break it down. Here's a basic algorithm:
  • For a given row, check whether any other rows start with the same character
  • If there are, check there's any that match on the first two characters
  • As long as there's at least one other matching row, repeat this process. Each time add one more character to the comparison.
  • Stop when there's no matches or you've reached the end of the string
A simple implementation is to loop through all the strings. For each of these, loop up to the number of characters within it, running the process above. The pseudo code to do this is:
for strs in 1 .. number_of_strings loop
for inx in 1 .. length(string(strs)) loop

check_string := substr(full_string, 1, inx);

if exists another row where check_string := substr(other_string, 1, inx);
continue;
else
exit;
end if;

end loop;
end loop;
SQL is a set based language however. There is no loop operation.

So how do you do it?

First put the rows above in a table:

create table strings as
select 'emcpower1' str from dual
union all
select 'emcpower2' from dual
union all
select 'emcpower3' from dual
union all
select 'sda' from dual
union all
select 'sdb' from dual
union all
select 'sdc' from dual;
Next create a set of all the strings you want to compare. These are all the leading substrings. For example, emcpower becomes:

e, em, emc, emcp, emcpo, emcpow, emcpowe, emcpower

To do this you can use a row generator. This should return as many rows as the length of the longest string. You can subtract one from this if you know the strings are unique.
  select rownum r from dual
connect by level <= (select max(length(str)) from strings);

Cross join this with the table of strings:

with rws as (
select rownum r from dual
connect by level <= (select max(length(str)) from strings)
)
select *
from strings s1
cross join rws;

This lists each string N times, where N is the length of the longest string. You only need to repeat each string up to its length however. So you should add a where clause, limiting the number to the length of each individual string. Using this list, create the set of all the starting substrings. Do this with substr(string, 1, n):

with rws as (
select rownum r from dual
connect by level <= (select max(length(str)) from strings)
)
select str,
substr(str, 1, r) subs
from strings s1
cross join rws
where r < length(str);

With this, you can check whether there's another row with the same starting characters.

Exists comes in handy here. First you must ensure you're not comparing a row to itself! Next see if any other rows have the same beginning:

with rws as (
select rownum r from dual
connect by level <= (select max(length(str)) from strings)
)
select str,
substr(str, 1, r) subs
from strings s1
cross join rws
where r < length(str)
and exists (
select * from strings s2
where s1.str <> s2.str
and substr(s1.str, 1, r) = substr(s2.str, 1, r)
);

This returns a list of all the common strings. We just want the longest of these for each row. To do this, group by string and return the max substring:

with rws as (
select rownum r from dual
connect by level <= (select max(length(str)) from strings)
)
select str, max(subs) s
from (
select str,
substr(str, 1, r) subs
from strings s1
cross join rws
where r < length(str)
and exists (
select * from strings s2
where s1.str <> s2.str
and substr(s1.str, 1, r) = substr(s2.str, 1, r)
)
)
group by str;

STR S
--------- ---------
sda sd
sdc sd
sdb sd
emcpower1 emcpower
emcpower2 emcpower
emcpower3 emcpower

All that now remains is to return the distinct list of the substrings:

with rws as (
select rownum r from dual
connect by level <= (select max(length(str)) from strings)
)
select distinct s
from (
select str, max(subs) s
from (
select str,
substr(str, 1, r) subs
from strings s1
cross join rws
where r < length(str)
and exists (
select * from strings s2
where s1.str <> s2.str
and substr(s1.str, 1, r) = substr(s2.str, 1, r)
)
)
group by str
);

S
---------
emcpower
sd
There's a possible problem with this solution however. It returns all the common starting strings. If we add emcdisk and samba to the original set, the query now returns:
S       
---------
emc
emcpower
s
sd
This is what the original poster wanted. This may not be what you need if you doing this though. You need to work with the data consumers to know what they want. Should it be the output above, the two shortest where there's a match, i.e.:

emc
s

Or the original output? If you're looking to solve a similar problem then make sure you've handled the edge cases!

Can you think of other ways to solve this problem in SQL? Are there better ways to write the query?

What about the general version of this problem: find the longest common substrings anywhere within the strings. Can you provide a solution for that?

If you have other approaches let us know in the comments

No access to a database? No problem. You can view and run these scripts in your browser using LiveSQL.

Join the discussion

Comments ( 8 )
  • Timo Raitalaakso Friday, October 30, 2015

    with strings as(
    select 'emcpower1' str from dual
    union all
    select 'emcpower2' from dual
    union all
    select 'emcpower3' from dual
    union all
    select 'samba' from dual
    union all
    select 'emcdisk' from dual
    union all
    select 'sda' from dual
    union all
    select 'sdb' from dual
    union all
    select 'sdc' from dual)
    , cte (str, ret, lvl) as
    (select str,substr(str,1,1),1
    from strings
    union all
    select c.str,substr(c.str,1,c.lvl+1),c.lvl+1
    from cte c, strings s
    where substr(c.str,1,c.lvl+1)=substr(s.str,1,c.lvl+1)
    and c.str!=s.str
    ), anl as (
    select ret,lvl,max(lvl)over(partition by str) mx
    from cte)
    select distinct ret
    from anl
    where lvl=mx
    ;

  • Chris Saxon Friday, October 30, 2015

    Good stuff, nice use of a recursive CTE Timo.

  • Stew Ashton Sunday, November 1, 2015

    The solution posted above is what Jonathan Lewis calls a "brontosaurus": it starts with few rows, generates lots of intermediate rows, then throws most of them away to get a small result set.

    The number of rows generated is a bit less that the number of strings multiplied by the length of the longest string, although many of those rows may get filtered immediately.

    To lessen the number of rows generated, you could use the length of each string instead of the length of the longest string:

    with substrs as(
    select str, substr(str,1,column_value) subs
    from strings, table(cast(multiset(
    select level from dual
    connect by level <= length(str)
    ) as sys.odcinumberlist))
    )
    select distinct max(subs) subs
    from substrs a join substrs b using(subs)
    where a.str < b.str
    group by a.str;

    As you know, the "table(cast(multiset(" syntax can be replace by the LATERAL clause starting in version 12c.

    Best regards, Stew Ashton

  • Stew Ashton Sunday, November 1, 2015

    To "find the longest common substrings anywhere within the strings", I thought it might be best to use PL/SQL to do as little work as possible. First I get all possible substrings from the first row Oracle gives me, then I sort them with the longest substring first. After that I check each substring until I find one that fits into every row, and that is my answer.

    declare
    l_num number;
    l_sub_string strings.str%type;
    cursor c_cur is
    with firstrow as (
    select str, length(str) strlen
    from strings where rownum = 1
    )
    , starts as (
    select str, strlen, level start_pos
    from firstrow
    connect by level <= strlen
    )
    , boundaries as (
    select a.str, a.start_pos, b.start_pos end_pos
    from starts a, starts b
    where a.start_pos + b.start_pos <= a.strlen + 1
    )
    select substr(str, start_pos, end_pos) sub_string
    from boundaries
    group by substr(str, start_pos, end_pos)
    order by length(substr(str, start_pos, end_pos)) desc, min(start_pos);
    begin
    for rec in c_cur loop
    l_sub_string := rec.sub_string;
    select 1 into l_num
    from dual
    where exists (
    select null from strings where instr(str, l_sub_string) = 0
    );
    end loop;
    exception when no_data_found then
    dbms_output.put_line(l_sub_string);
    end;
    /

  • guest Monday, November 2, 2015

    Hi

    By checking

    ls -l /dev/disk/by-id/*|sed s,.*/,,

    I'd rather simply use

    select distinct regexp_replace(c,'\d.*|.$') from t

    Cheers
    Laurent

  • Chris Saxon Monday, November 2, 2015

    Good point on the CTE solution Stew and great work on the "longest substring anywhere". Can you come with any other solutions to this?

    Nice solution Laurent. Regexs can work well if you know your data conforms to a certain format.

  • Tegiri Nenashi Monday, November 2, 2015

    The question begs for an existence of string function that returns longest common prefix of two strings. Without such built in function one can leverage regular expressions:

    select s1.str s1str, s2.str s2str,
    REGEXP_SUBSTR( s1.str||'!'||s2.str, '([^!]*)([^!]*)!\1' )
    from strings s1, strings s2
    where s1.str less than s2.STR;

    S1STR S2STR REGEXP_SUBSTR(S1.ST
    --------- --------- -------------------
    emcpower1 emcpower2 emcpower1!emcpower
    emcpower1 emcpower3 emcpower1!emcpower
    emcpower1 sda emcpower1!
    emcpower1 sdb emcpower1!
    emcpower1 sdc emcpower1!
    emcpower2 emcpower3 emcpower2!emcpower
    emcpower2 sda emcpower2!
    emcpower2 sdb emcpower2!
    emcpower2 sdc emcpower2!
    emcpower3 sda emcpower3!
    emcpower3 sdb emcpower3!
    emcpower3 sdc emcpower3!
    sda sdb sda!sd
    sda sdc sda!sd
    sdb sdc sdb!sd

    Then, finding longest substrings after question mark symbol is easy.

  • Chris Saxon Tuesday, November 3, 2015

    Nicely done Tegiri, the power of regexps strikes again!

    Can you find a way to use these to find the longest common substring starting at any location?

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.