When there are special characters in a CSV-file, and the file is in a different characterset than which you expect, parsing it can give unexpected results.
From different suppliers we get CSV-file, which need to be parsed and stored in the database. This is done by using APEX_DATA_PARSER.
APEX_DATA_PARSER is very flexible and allows several file-formats: JSON, XML, CSV and even native Excel XLSX.
The files are uploaded to the database (in the background) and processed. And this works like a charm,... until... one of the suppliers provided a CSV-file in a different character set, with special characters.
Below is an example of this file:
This file is uploaded to a database table with the following structure:
Name Null? Type
------------ ----- -------------
MIME_TYPE VARCHAR2(255)
FILENAME VARCHAR2(400)
BLOB_CONTENT BLOB
The following query is used to parse the file and extract the data:
SQL> select p.col001
2 ,p.col002
3 ,p.col003
4 from a
5 cross
6 apply apex_data_parser.parse (p_content => a.blob_content
7 ,p_file_name => a.filename
8 ) p
9* /
COL001 COL002 COL003
__________________________ __________ __________
R�sselsheim am Main 3.000 624.760
Gnutz 1.000 139.490
�ach-Palenberg 1.000 139.490
Syke 1.000 242.600
H�kirchen-Siegertsbrunn 1.000 136.980
Eberhardzell 1.000 233.810
heinsberg 1.000 67.840
Hohndorf 3.000 304.800
Nidderau 3.000 385.890
N�n-Hardenberg 1.000 239.680
Kl��1.000 185.000
K 2.000 237.880
W�rselen 1.000 196.500
Gl�ckstadt 1.000 136.980
Dessau-Ro�au 1.000 5.900
15 rows selected.
As you can see in the results above, the special characters have been removed. When a word in the first column ends in a special character, the parsing of the rest of the line is also messed up, there is a NULL in Column 3 while there should be a value.
Of course this has everything to do with the characterset of the uploaded file.
With the following command in a Terminal-window on Mac, it is possible to get information about the file, including the characterset:
file -I name of the file
In my case that would yield the following results:
file -I charset.csv
charset.csv: text/plain; charset=iso-8859-1
This means that the file is in the characterset: ISO-8859-1
To find out which characterset this maps to in the database, the following query can help:
SQL> select value
2 from v$nls_valid_values
3 where parameter ='CHARACTERSET'
4 and value like '%8859%'
5* /
VALUE
_________________
WE8ISO8859P1
EE8ISO8859P2
SE8ISO8859P3
NEE8ISO8859P4
CL8ISO8859P5
AR8ISO8859P6
EL8ISO8859P7
IW8ISO8859P8
WE8ISO8859P9
NE8ISO8859P10
WE8ISO8859P15
BLT8ISO8859P13
CEL8ISO8859P14
AZ8ISO8859P9E
14 rows selected.
From this list WE8ISO8859P1 is selected and passed in into the APEX_DATA_PARSER function:
SQL> select p.col001
2 ,p.col002
3 ,p.col003
4 from all_csvs a
5 cross
6 apply apex_data_parser.parse (p_content => a.blob_content
7 ,p_file_name => a.filename
8 ,p_file_charset => 'WE8ISO8859P1'
9 ) p
10* /
COL001 COL002 COL003
_____________________________ _________ __________
Rüsselsheim am Main 3.000 624.760
Gnutz 1.000 139.490
Übach-Palenberg 1.000 139.490
Syke 1.000 242.600
Höhenkirchen-Siegertsbrunn 1.000 136.980
Eberhardzell 1.000 233.810
heinsberg 1.000 67.840
Hohndorf 3.000 304.800
Nidderau 3.000 385.890
Nörten-Hardenberg 1.000 239.680
Klüß 1.000 185.000
Köln 2.000 237.880
Würselen 1.000 196.500
Glückstadt 1.000 136.980
Dessau-Roßlau 1.000 5.900
15 rows selected.
And now all is well.