10 October 2019

OGB Appreciation Day: APEX_DATA_PARSER: Flexible, Powerful, Awesome (#ThanksOGB #OrclAPEX)

For this years OGB Appreciation Day I wanted to highlight the very awesome APEX_DATA_PARSER.

At my current project one of the requirements is that CSV-files are to be uploaded by the user and the application has to figure out what type of file it is and process accordingly.
So the challenge with this is:

  • Which type of file is it?
  • What delimiter was used?

Uploading the file is straight forward enough with Application Express (APEX) and out of scope of this blog. The files are stored in a APEX Collection before they are processed.

The original procedure to determine what type of file it was and how to process it was quite cumbersome.
The file is parsed and the first line was extracted. The string which effectively was the header line of the CSV file was compared with a predefined string and based on this comparison it was determined what type of file it was.
The problem was that sometimes the header line would look like this:

      DATE_,PAYMENT_METHOD,CURRENCY,AMOUNT,STATUS,ID,DESCRIPTION
         
and sometimes it would like this:
      "DATE_","PAYMENT_METHOD","CURRENCY","AMOUNT","STATUS","ID","DESCRIPTION"
         
or maybe even sometime like this:
      "DATE_";"PAYMENT_METHOD";"CURRENCY";"AMOUNT";"STATUS";"ID";"DESCRIPTION"
         
The header line could be enclosed in quotes, or not.. The delimiter was a comma, or not...
I've simplified the example, the real header lines where a lot longer.
You can imagine that parsing the rest of the file was quite tricky, as the file content could have commas as delimiter or semi-colons or ...
Because of all the variations the procedure was quite lengthy and error prone.

Simplifying with APEX_DATA_PARSER

With APEX_DATA_PARSER things got a lot cleaner, not to mention a lot faster. The example below will only give you the gist of the procedure.

First the different header lines where declared as a record type of datatype APEX_T_VARCHAR2.

         l_file1_header apex_t_varchar2 := apex_t_varchar2 ('DATE_'
                                                           ,'PAYMENT_METHOD'
                                                           ,'CURRENCY'
                                                           ,'AMOUNT'
                                                           ,'STATUS'
                                                           ,'DESCRIPTION'
                                                           );
         l_file2_headers apex_t_varchar2 := apex_t_varchar2 ('ID'
                                                            ,'REF'
                                                            ,'ORDER_'
                                                            ,'STATUS'
                                                            ,'LIB'
                                                            );
         
Second we need to get the header line of the file that was uploaded through the APEX application.
         select column_name
           bulk collect
           into l_file_header
           from table (apex_data_parser.get_columns
                         (apex_data_parser.discover (p_content   => r_files.file_content
                                                    ,p_file_name => r_files.filename
                                                    )));
         
The APEX_DATA_PARSER offers a function to discover information about the file. The function return a file-profile and contains information about the file-encoding, file-type, how many rows it parsed, what type of delimited is used in case of CSV, etc. etc.. This function is aptly named DISCOVER.
With the APEX_DATA_PARSER.GET_COLUMNS function uses the file profile created by the DISCOVER-function to obtain information about the content: Which columns are in the file, what might the datatype be, what is the format mask that was used, in which order are the columns.
With the simple SELECT statement above we know the header columns in the uploaded file.

We're almost there..

Now that the header columns are known, we can determine how to proceed from here with a straightforward CASE statement.
Why a CASE statement you might ask. It gives a easy opportunity to raise an exception in case an unknown file is uploaded... i.e. it will raise a CASE_NOT_FOUND exception.
         case l_file_header
         when l_file1_header
         then
            process_file1;
         when l_file2_header
         then
            process_file2;
         end case;
         
The uploaded files are moved from the APEX collection to a regular heap table called ALL_UPLOADED_FILES (omitted from this blogpost).

To process the content of the file we can now do an INSERT-SELECT and drop the content into the destination table:
         insert into destination_tbl
            (date_
            ,payment_method
            ,currency
            ,amount
            ,status
            ,description
            )
         select to_date (col001, 'yyyy-mm-dd hh24:mi:ss')
               ,col002
               ,col003
               ,col005
               ,col006
               ,col007
           from all_uploaded_files fle
               ,table (apex_data_parser.parse (p_content   => fle.csv
                                              ,p_file_name => fle.csv_filename
                                              ,p_skip_rows => 1 -- skip the header line
                                              ))      
         

On a sidenote: we're using DropZone, so the file(s) are in an APEX collection which is read with the following query:
         select c001    as filename
               ,c002    as mime_type
               ,d001    as date_created
               ,n001    as file_id
               ,blob001 as file_content
           from apex_collections
          where collection_name = 'DROPZONE_UPLOAD'
         

A big thank you to the APEX team for implementing this awesome functionality! Thanks ODC!

12 March 2019

The elegant NUMTODSINTERVAL to transform numbers

There is a database table with three columns into which data from an external source is loaded.
The data is loaded as a CSV file and contains three columns to represent the date, hour and minutes. The CSV file is loaded into a staging table with three columns:

  • one for the date, datatype: DATE
  • one for the hours, datatype NUMBER
  • one for the minutes, datatype NUMBER

Loading the information in the destination table, there is only one column to hold the data. These three columns should be transformed into a "real" date, meaning a DATE datatype as we know it.

One way is to concatenate all the columns together and apply the TO_DATE function with the appropriate format mask. With this method you need to take care of leading zeros for the hours and minutes.
This method would result in something like the following:

to_date( to_char( e.csv_date, 'DD-MM-YYYY') || ' ' || to_char( lpad( e.hour, 2, '0')) || ':' || to_char( lpad( e.minute, 2, '0')), 'DD-MM-YYYY HH24:MI') as clock_time

As you can imagine, it was only because of exceptions before realizing that the hours and minutes should be left padded with zeros.
There is another, easier, way to transform the three columns into a single DATE column:

e.csv_date + numtodsinterval (e.hour, 'HOUR') + numtodsinterval (e.minute, 'MINUTE') as clock_time
Using the NUMTODSINTERVAL function for the hours and the minutes makes the transformation trivial. No need to worry about leading zeros (or lack thereof).

NUMTODSINTERVAL is a great little function that makes code so much more elegant.