07 May 2015

Splitting a comma delimited string the RegExp way, Part Three

The article read most often on this blog is called "Splitting a Comma Delimited String, the RegExp way".
On this blog there are two articles about this technique, one is about splitting up a single string into multiple rows and the other is about multiple strings into multiple rows.
Links to both articles are included at the bottom of this article.
It seems like there is a need for functionality like that frequently. And just to add to those two articles on the subject, here is a third one combining the first two articles.

Recently I was asked for help in a comment on how to go about and split up a string like the following

      'ABC/FDF,RET/YRT,UYT/ERT'
   
The expected outcome would be
      ABC
      FDF
      RET
      YRT
      UYT
      ERT
   

As you can see the input string consists of two different delimiters, namely a comma and a forward slash (/).
To split this string up, you will need both techniques from the other articles.

Let's start with a variable containing the input string.

      SQL> var input varchar2(150)
      SQL> 
      SQL> exec :input := 'ABC/FDF,RET/YRT,UYT/ERT'

      PL/SQL procedure successfully completed.
   

The first step is to split the string up using the first method, split up the string using the comma as a delimiter.

SQL> select regexp_substr (:input, '[^,]+',1, rownum) str
  2    from dual
  3   connect by level <= regexp_count (:input, '[^,]+')
  4  ;

STR
-------
ABC/FDF
RET/YRT
UYT/ERT
   
This will leave us with three records each consisting of a string that needs further splitting up, but this time with the forward slash as the delimiter.

Using these rows as the input in the next phase, use the technique described in the second article.
By introducing Subquery Factoring (lines 1-5), create a named query "commas"

   SQL> with commas
     2  as
     3  (select regexp_substr (:input, '[^,]+',1, rownum) str
     4    from dual
     5   connect by level <= regexp_count (:input, '[^,]+'))
     6   select regexp_substr (str, '[^\/]+', 1, rn) split
     7   from commas
     8   cross
     9   join (select rownum rn
    10       from (select max (regexp_count(rtrim (str, '/')||'/', '\/')) mx
    11        from commas
    12     )
    13    connect by level <= mx
    14    )
    15   where regexp_substr (str, '[^\/]+', 1, rn) is not null
    16   ;

   SPLIT
   ----------------------------------
   ABC
   FDF
   RET
   YRT
   UYT
   ERT
The forward slash has special meaning with regular expressions it needs to be escaped using a backslash.
You can see this on lines 6, 10, and 15.
What is interesting, or at least I find interesting, is the use of the RTRIM on line 10.
Each value per line is not completely delimited by the forward slashes, the trailing one is missing. Just to concatenate one to each line would be to easy, what if there is a trailing slash?
The RTRIM removes the trailing slash and concatenates one at the end, making sure that the string is split up at the right place.

Links

4 comments:

  1. Nice combination of your methods, Alex :-)

    An alternative if you can have multiple valid delimiters and they are all "equal" could be to translate all delimiters into the same (for example comma). Supposing we could have these valid delimiters: , / | ;

    exec :input := 'ABC/FDF,RET|YRT,UYT;ERT'

    select
    regexp_substr(translate(:input, '/|;', ',,,'), '[^,]+', 1, rownum) str
    from dual
    connect by level <= regexp_count(translate(:input, '/|;', ',,,'), '[^,]+')
    ;

    But it does require that there is no special meaning like 'comma is "row" delimiter and slash is "column" delimiter' ;-)

    ReplyDelete
  2. Hi Alex,

    Thanks for this article. If you want to keep the special meaning of the delimiters,
    there is an alternative using "collection unnesting".

    Pre-12c version:

    with first_split as (
    select rownum rn,
    regexp_substr (:input, '[^,]+',1, rownum) first_str
    from dual
    connect by level <= regexp_count (:input, '[^,]+')
    )
    select rn, column_value from first_split,
    table(cast(multiset(
    select regexp_substr (first_str, '[^\/]+',1, rownum)
    from dual
    connect by level <= regexp_count (first_str, '[^\/]+')
    ) as sys.odcivarchar2list));

    12 version using the LATERAL clause:

    with first_split as (
    select rownum rn,
    regexp_substr (:input, '[^,]+',1, rownum) first_str
    from dual
    connect by level <= regexp_count (:input, '[^,]+')
    )
    select rn, str from first_split,
    lateral(
    select regexp_substr (first_str, '[^|/]+',1, rownum) str
    from dual
    connect by level <= regexp_count (first_str, '[^|/]+')
    );

    ReplyDelete
    Replies
    1. Love it! I've been looking for a use case for the LATERAL clause and this is a great one. Thanks!

      Delete