tokenizing newline delimited data with regexp_substr() in sqlplus

In my previous post, I showed comma delimited values(csv) example.

SQL: Delimited list of values as in-line view. csv on memory (INOUE Katsumi @ Tokyo)

select regexp_substr('KING,TURNER,BLAH,!ERROR!,0,JAMES','[^,]+',1,level) as token from dual connect by level <= 256

But if you copy and paste from Excel like below,

It will be 'newline' delimited data.  Below is sqlplus session showing regexp_substr() can handle it.
I tried regex '\r' and '\n' but neither worked.

SQL> l
  1  select regexp_substr('
  2  King
  3  Kochhar
  4  De Haan
  5  Hunold
  6  Ernst
  7  Austin
  8  Pataballa
  9  Lorentz
 10  Greenberg
 11  Faviet
 12  Chen
 13  ','[^
 14* ]+',1,level) from dual connect by level <= 12
SQL> /

REGEXP_SUBSTR('KINGKOCHHARDEHAAN
--------------------------------
King
Kochhar
De Haan
Hunold
Ernst
Austin
Pataballa
Lorentz
Greenberg
Faviet
Chen

12 rows selected.
Comments:

There is a parameter 'm' to regexp functions that tell regexp to take line breaks into consideration. That way the regexp anchors ^ (start) and $ (end) work on each line rather than the entire text.

Like this using ^.+$ to get non-empty lines:

SQL> select regexp_substr('
2 King
3 Kochhar
4 De Haan
5 Hunold
6 Ernst
7 Austin
8 Pataballa
9 Lorentz
10 Greenberg
11 Faviet
12 Chen
13 ','^.+$',1,level,'m') token
14 from dual connect by level <= 11
15 /

TOKEN
--------------------
King
Kochhar
De Haan
Hunold
Ernst
Austin
Pataballa
Lorentz
Greenberg
Faviet
Chen

11 rows selected.

Or like this using ^.*$ to include the two empty lines:

SQL> select regexp_substr('
2 King
3 Kochhar
4 De Haan
5 Hunold
6 Ernst
7 Austin
8 Pataballa
9 Lorentz
10 Greenberg
11 Faviet
12 Chen
13 ','^.*$',1,level,'m') token
14 from dual connect by level <= 13
15 /

TOKEN
--------------------

King
Kochhar
De Haan
Hunold
Ernst
Austin
Pataballa
Lorentz
Greenberg
Faviet
Chen

13 rows selected.

An advantage of using 'm' is it makes it easy to for example take those on the newline separated list who begin with F, G, K or L:

SQL> select regexp_substr('
2 King
3 Kochhar
4 De Haan
5 Hunold
6 Ernst
7 Austin
8 Pataballa
9 Lorentz
10 Greenberg
11 Faviet
12 Chen
13 ','^[FGKL].*$',1,level,'m') token
14 from dual connect by level <= 5
15 /

TOKEN
--------------------
King
Kochhar
Lorentz
Greenberg
Faviet

But the above examples have hardcoded how many levels for the connect by. That can be fixed as well, for example like:

SQL> select regexp_substr(txt,regexp,1,level,'m') token
2 from (
3 select
4 '
5 King
6 Kochhar
7 De Haan
8 Hunold
9 Ernst
10 Austin
11 Pataballa
12 Lorentz
13 Greenberg
14 Faviet
15 Chen
16 '
17 txt,
18 '^[FGKL].*$' regexp
19 from dual
20 )
21 connect by level <= regexp_count(txt,regexp,1,'m')
22 /

TOKEN
--------------------
King
Kochhar
Lorentz
Greenberg
Faviet

Now just change regexp expression to get all that contain a lowercase 'a':

SQL> select regexp_substr(txt,regexp,1,level,'m') token
2 from (
3 select
4 '
5 King
6 Kochhar
7 De Haan
8 Hunold
9 Ernst
10 Austin
11 Pataballa
12 Lorentz
13 Greenberg
14 Faviet
15 Chen
16 '
17 txt,
18 '^.*a.*$' regexp
19 from dual
20 )
21 connect by level <= regexp_count(txt,regexp,1,'m')
22 /

TOKEN
--------------------
Kochhar
De Haan
Pataballa
Faviet

Hope this can be useful :-)

Posted by Kim Berg Hansen on April 16, 2013 at 05:35 PM JST #

Thanks for the detailed explanation!
I'm trying to grasp the idea.
And sorry for my ignorance. I posted new entry
without using 'm'.

Posted by owner on April 18, 2013 at 09:45 AM JST #

Don't be sorry - I just stumbled upon your blog entry on OraNA and thought "hey, here is an opportunity for me to teach something - maybe other people will read that blog and learn" ;-)

Anyway, you could probably also do it without 'm' parameter. You had the right idea with \r and \n, except you need to use the CHR function and the ascii values of carriage return and newline. Build up your regexp string something like:

'[^' || chr(13) || chr(10) || ']+'

Which would match "one or more characters that are not ascii values 13 or 10".

Glad to be of assistance ;-)

Posted by Kim Berg Hansen on April 18, 2013 at 06:06 PM JST #

2 Kim Berg Hansen

interesting moment using '^.+$' mask for 10 and 11 versions

Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0
Connected as scott

SQL>
SQL> select regexp_substr('
2 2 King
3 3 Kochhar
4 4 De Haan
5 5 Hunold
6 6 Ernst
7 7 Austin
8 8 Pataballa
9 9 Lorentz
10 10 Greenberg
11 11 Faviet
12 12 Chen
13 ','^.+$',1,level,'m') token
14 from dual connect by level <= 11
15 /

TOKEN
--------------------------------------------------------------------------------
3 Kochhar
4 De Haan
5 Hunold
6 Ernst
7 Austin
8 Pataballa
9 Lorentz
10 Greenberg
11 Faviet
12 Chen

11 rows selected

SQL> -- where row with "2 King" value ?
SQL> -- one row is null

and for 11

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> select regexp_substr('
2 2 King
3 3 Kochhar
4 4 De Haan
5 5 Hunold
6 6 Ernst
7 7 Austin
8 8 Pataballa
9 9 Lorentz
10 10 Greenberg
11 11 Faviet
12 12 Chen
13 ','^.+$',1,level,'m') token
14 from dual connect by level <= 11
15 /

TOKEN
-----------------------------------------------------------------------------

2 King
3 Kochhar
4 De Haan
5 Hunold
6 Ernst
7 Austin
8 Pataballa
9 Lorentz
10 Greenberg
11 Faviet
12 Chen

11 rows selected.

SQL>

Posted by AlexAnd on May 13, 2013 at 07:54 PM JST #

Post a Comment:
  • HTML Syntax: NOT allowed
About

The views expressed on this blog are my own and do not necessarily reflect the views of Oracle.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today