Saturday 21 August 2010

Just another implementation of string splitting function in PL/SQL

Here are some of my implementation of string splitting function in PL/SQL.
It is something developers need often, and still can't believe there isn't a function that does something like this in oracle 10g...???

Some developers parse delimited text using substr(text,X,Y) for every token they need to use in their code…

My advice is: Don’t ever do this!!! Use some available tokenizer function!

I found several implementations on the internet, but those functions either does not support delimiter with length more than 1, either it does not return tokens that are NULL, or does not return table of varchar2 index by pls_integer ( what I needed), etc.

So I wrote it myself…

Anyway, here is my first implementation, written entirelly in PL/SQL:

CREATE OR REPLACE FUNCTION splitString(str VARCHAR2,delimit VARCHAR2) RETURN dbms_sql.Varchar2_Table IS
  
  inputString VARCHAR2(32000);
  delimiter VARCHAR2(32);                         

  delimiters dbms_sql.Number_Table;
  tokens dbms_sql.Varchar2_Table;

  array_index PLS_INTEGER := 1;
  pos PLS_INTEGER := 1;
  delimiter_pos PLS_INTEGER;

  delimiter_length PLS_INTEGER;
  inputString_length PLS_INTEGER;

BEGIN

  IF str IS NULL THEN
  RAISE_APPLICATION_ERROR(-20000,'str cannot be null!');
  END IF;
  
  IF delimit IS NULL OR delimit = '' THEN
  RAISE_APPLICATION_ERROR(-20000,'delimiter cannot be null or empty string!');
  END IF;

  inputString := str;
  delimiter := delimit;

  delimiter_length := length(delimiter);
  inputString_length := length(inputString);

  -- Prepare inputText for parsing
  IF substr(inputString,inputString_length - delimiter_length + 1,delimiter_length) != delimiter THEN
      inputString := inputString || delimiter;
  END IF;

  IF substr(inputString,1,delimiter_length) = delimiter THEN
      inputString := substr(inputString,delimiter_length + 1,inputString_length);
  END IF;
  ---------------------


  -- find delimiter occurrences in inputText
   LOOP
      delimiter_pos := instr(inputString,delimiter,pos);
      EXIT WHEN delimiter_pos = 0;
      delimiters(array_index) := delimiter_pos;
      array_index := array_index + 1;
      pos := delimiter_pos + delimiter_length;
   END LOOP;
  --------------------
  
  -- reset temp variables
  array_index := 1;
  pos := 1;
  ---------------------

  -- iterate through delimiters and create tokens
  FOR j IN delimiters.first..delimiters.last LOOP
      tokens(array_index) := substr(inputString,pos,delimiters(j)-pos);
      array_index := array_index + 1;
      pos := delimiters(j)+ delimiter_length;
  END LOOP;
  --------------------

  RETURN tokens;

END;
/
Function’s parameters are string that needs to be tokenized, and a delimiter ( that can have length greater than 1).

Yoy could enchance it by setting


TOKENS     DBMS_SQL.VARCHAR2_TABLE;

to

TYPE t IS TABLE OF varchar2(32000) INDEX BY PLS_INTEGER;
TOKENS t;

if  you are dealing with tokens of length > 2000.
Also, these are case-sensitive functions. You can make it case-insensitive very easy...

Dejan  gave me an idea to write it using SQL.
So I did it, because his implementation using SQL to parse a string is cool:



CREATE OR REPLACE FUNCTION SPLITSTRING_USING_SQL(STR VARCHAR2, DELIMIT VARCHAR2)
  RETURN DBMS_SQL.VARCHAR2_TABLE IS

  INPUTSTRING    VARCHAR2(32000);
  INPUTDELIMITER VARCHAR2(32);

  TOKENS DBMS_SQL.VARCHAR2_TABLE;

  DELIMITER_LENGTH   NUMBER;
  INPUTSTRING_LENGTH NUMBER;

BEGIN

  IF STR IS NULL THEN
    RAISE_APPLICATION_ERROR(-20000, 'str cannot be null!');
  END IF;

  IF DELIMIT IS NULL OR DELIMIT = '' THEN
    RAISE_APPLICATION_ERROR(-20000,
                            'delimiter cannot be null or empty string!');
  END IF;

  INPUTSTRING    := STR;
  INPUTDELIMITER := DELIMIT;

  DELIMITER_LENGTH   := LENGTH(INPUTDELIMITER);
  INPUTSTRING_LENGTH := LENGTH(INPUTSTRING);

  -- Prepare inputText for parsing
  IF SUBSTR(INPUTSTRING,
            INPUTSTRING_LENGTH - DELIMITER_LENGTH + 1,
            DELIMITER_LENGTH) != INPUTDELIMITER THEN
    INPUTSTRING := INPUTSTRING || INPUTDELIMITER;
  END IF;

  IF SUBSTR(INPUTSTRING, 1, DELIMITER_LENGTH) != INPUTDELIMITER THEN
    INPUTSTRING := INPUTDELIMITER || INPUTSTRING;
  END IF;
  
  SELECT SUBSTR(INPUTTEXT,
                INDEKS + DELIMITER_LENGTH,
                NAREDNI - DELIMITER_LENGTH - INDEKS) TOKEN BULK COLLECT
    INTO TOKENS
    FROM (SELECT INDEKS,
                 LEAD(INDEKS) OVER(ORDER BY INDEKS) NAREDNI,
                 INPUTTEXT
            FROM (SELECT LEVEL INDEKS,
                         INPUTSTRING INPUTTEXT,
                         INPUTDELIMITER DELIMITER,
                         (CASE
                           WHEN SUBSTR(INPUTSTRING, LEVEL, DELIMITER_LENGTH) =
                                INPUTDELIMITER THEN
                            1
                           ELSE
                            0
                         END) IS_DELIMITER
                    FROM DUAL
                  CONNECT BY LEVEL <= LENGTH(INPUTSTRING))
           WHERE IS_DELIMITER = 1)
   WHERE NAREDNI IS NOT NULL;

  RETURN TOKENS;

END;
/
It’s basically very similar to his function.
The only difference is that unlike Dejan’s function, this one returns NULL tokens too…

You can test theese two functions easy:


declare 
  
  input VARCHAR2(32000) := 'one***two***three***four***five***six***seven***eight***nine***ten';
  delimiter varchar2(10) := '***';
  result dbms_sql.Varchar2_Table;
  
begin

  result := splitString(input,delimiter);
  --result := splitString_using_sql(input,delimiter);
  
  for i in result.first..result.last loop
      dbms_output.put_line(result(i));
  end loop;
                        
  
end;





Conclusion:
There are CERTAINLLY better ways to write this function, and these are just an example how this could be done.

This is one possible approach, to return an array of tokens. For me , probably the better way would be to create an Object that does something like StringTokenizer Java class. Not to return an array, but to provide a way to iterate through tokens using it’s hasMoreTokens() and nextToken() methods.
I think this is better approach if you are dealing with very large input strings and tokens. If you are dealing with small ones, then it’s OK to use one of my funtions I provided here, or any other similar to this one, that you can find on the internet…

If you intend to use one of these, I suggest  on using first one, because It is much faster than second one.

I didn’t do any optimizations ( which is obvious  ) because I was pleased with speed. Of course, you are welcome to do it yourself

as a matter of fact, I did some speed testing with input strings whose length is ~ 30000, and the first one was (much) faster than many  functions I found on the internet…