01 February 2023

Parse CSV-file in different Character Set with APEX_DATA_PARSER

When there are special characters in a CSV-file, and the file is in a different characterset than which you expect, parsing it can give unexpected results.
From different suppliers we get CSV-file, which need to be parsed and stored in the database. This is done by using APEX_DATA_PARSER.
APEX_DATA_PARSER is very flexible and allows several file-formats: JSON, XML, CSV and even native Excel XLSX.
The files are uploaded to the database (in the background) and processed. And this works like a charm,... until... one of the suppliers provided a CSV-file in a different character set, with special characters.

Below is an example of this file:

This file is uploaded to a database table with the following structure:

  Name         Null? Type          
------------ ----- ------------- 
MIME_TYPE          VARCHAR2(255) 
FILENAME           VARCHAR2(400) 
BLOB_CONTENT       BLOB  
  
The following query is used to parse the file and extract the data:
SQL> select p.col001
  2        ,p.col002
  3        ,p.col003
  4    from a
  5   cross
  6    apply apex_data_parser.parse (p_content => a.blob_content
  7                                 ,p_file_name => a.filename
  8                                 ) p
  9* /

COL001                     COL002     COL003     
__________________________ __________ __________ 
R�sselsheim am Main        3.000      624.760    
Gnutz                      1.000      139.490    
�ach-Palenberg             1.000      139.490    
Syke                       1.000      242.600    
H�kirchen-Siegertsbrunn    1.000      136.980    
Eberhardzell               1.000      233.810    
heinsberg                  1.000      67.840     
Hohndorf                   3.000      304.800    
Nidderau                   3.000      385.890    
N�n-Hardenberg             1.000      239.680    
Kl��1.000                  185.000               
K                          2.000      237.880    
W�rselen                   1.000      196.500    
Gl�ckstadt                 1.000      136.980    
Dessau-Ro�au               1.000      5.900      

15 rows selected. 
As you can see in the results above, the special characters have been removed. When a word in the first column ends in a special character, the parsing of the rest of the line is also messed up, there is a NULL in Column 3 while there should be a value.
Of course this has everything to do with the characterset of the uploaded file.

With the following command in a Terminal-window on Mac, it is possible to get information about the file, including the characterset:

  file -I name of the file
  
In my case that would yield the following results:
file -I charset.csv
charset.csv: text/plain; charset=iso-8859-1
  
This means that the file is in the characterset: ISO-8859-1

To find out which characterset this maps to in the database, the following query can help:

  SQL> select value
  2    from v$nls_valid_values
  3   where parameter  ='CHARACTERSET'
  4     and value like '%8859%'
  5* /

VALUE             
_________________ 
WE8ISO8859P1      
EE8ISO8859P2      
SE8ISO8859P3      
NEE8ISO8859P4     
CL8ISO8859P5      
AR8ISO8859P6      
EL8ISO8859P7      
IW8ISO8859P8      
WE8ISO8859P9      
NE8ISO8859P10     
WE8ISO8859P15     
BLT8ISO8859P13    
CEL8ISO8859P14    
AZ8ISO8859P9E     

14 rows selected.
  
From this list WE8ISO8859P1 is selected and passed in into the APEX_DATA_PARSER function:

SQL> select p.col001
  2        ,p.col002
  3        ,p.col003
  4    from all_csvs a
  5   cross
  6    apply apex_data_parser.parse (p_content => a.blob_content
  7                                 ,p_file_name => a.filename
  8                                 ,p_file_charset => 'WE8ISO8859P1'
  9                                 ) p
 10* /

COL001                        COL002    COL003     
_____________________________ _________ __________ 
Rüsselsheim am Main           3.000     624.760    
Gnutz                         1.000     139.490    
Übach-Palenberg               1.000     139.490    
Syke                          1.000     242.600    
Höhenkirchen-Siegertsbrunn    1.000     136.980    
Eberhardzell                  1.000     233.810    
heinsberg                     1.000     67.840     
Hohndorf                      3.000     304.800    
Nidderau                      3.000     385.890    
Nörten-Hardenberg             1.000     239.680    
Klüß                          1.000     185.000    
Köln                          2.000     237.880    
Würselen                      1.000     196.500    
Glückstadt                    1.000     136.980    
Dessau-Roßlau                 1.000     5.900      

15 rows selected.

And now all is well.

No comments:

Post a Comment