When a CSV file contains special characters and is encoded in a different character set than you expect, parsing it can give unexpected results.
From different suppliers we receive CSV files, which need to be parsed and stored in the database. This is done using APEX_DATA_PARSER.
APEX_DATA_PARSER is very flexible and supports several file formats: JSON, XML, CSV and even native Excel XLSX.
The files are uploaded to the database (in the background) and processed. This works like a charm... until one of the suppliers provided a CSV file in a different character set, with special characters.
Below is an example of this file: a CSV with German place names and figures, where some of the names contain special characters such as ü, ö and ß.
This file is uploaded to a database table with the following structure:
Name          Null?  Type
------------- ------ --------------
MIME_TYPE            VARCHAR2(255)
FILENAME             VARCHAR2(400)
BLOB_CONTENT         BLOB

The following query is used to parse the file and extract the data:
SQL> select p.col001
  2        ,p.col002
  3        ,p.col003
  4    from all_csvs a
  5   cross
  6   apply apex_data_parser.parse (p_content   => a.blob_content
  7                                ,p_file_name => a.filename
  8                                ) p
  9* /

COL001                       COL002     COL003
__________________________ __________ __________
R�sselsheim am Main             3.000    624.760
Gnutz                           1.000    139.490
�ach-Palenberg                  1.000    139.490
Syke                            1.000    242.600
H�kirchen-Siegertsbrunn         1.000    136.980
Eberhardzell                    1.000    233.810
heinsberg                       1.000     67.840
Hohndorf                        3.000    304.800
Nidderau                        3.000    385.890
N�n-Hardenberg                  1.000    239.680
Kl��1.000                     185.000
K                               2.000    237.880
W�rselen                        1.000    196.500
Gl�ckstadt                      1.000    136.980
Dessau-Ro�au                    1.000      5.900

15 rows selected.

As you can see in the results above, the special characters have been replaced or removed. And when a word in the first column ends in a special character, the parsing of the rest of the line is also messed up: that row has a NULL in column 3 while there should be a value.
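The mangling itself is easy to reproduce outside the database. The file's bytes are ISO-8859-1, but they are being decoded as UTF-8 (the usual database character set is AL32UTF8), and a byte like 0xFC (ü in ISO-8859-1) is simply not valid UTF-8. A minimal Python sketch of the effect (the database decoder may drop bytes where Python substitutes a replacement character, so the output is not byte-for-byte identical to the query result above):

```python
# "Rüsselsheim" encoded in ISO-8859-1: the ü becomes the single byte 0xFC.
raw = "Rüsselsheim am Main".encode("iso-8859-1")

# 0xFC is not a valid start byte in UTF-8, so a UTF-8 decoder
# substitutes the replacement character U+FFFD for it.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)  # R�sselsheim am Main
```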
Of course, this has everything to do with the character set of the uploaded file.
With the following command in a Terminal window on a Mac, it is possible to get information about the file, including its character set:
file -I <name of the file>

In my case that would yield the following results:
file -I charset.csv
charset.csv: text/plain; charset=iso-8859-1

This means that the file is in the character set ISO-8859-1.
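When no terminal is at hand, a very rough stand-in for `file -I` can be sketched in a few lines: try a strict UTF-8 decode first and fall back to ISO-8859-1. Note that every byte sequence is valid ISO-8859-1, so the fallback is a guess of last resort rather than a certain answer (`file` itself relies on heuristics too). The function name here is made up for the example:

```python
def guess_charset(raw: bytes) -> str:
    """Crude charset sniff: UTF-8 if the bytes decode cleanly, else ISO-8859-1."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte is valid ISO-8859-1, so this is only a fallback guess.
        return "iso-8859-1"

print(guess_charset("Rüsselsheim".encode("iso-8859-1")))  # iso-8859-1
print(guess_charset("Rüsselsheim".encode("utf-8")))       # utf-8
```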
To find out which character set this maps to in the database, the following query can help:
SQL> select value
  2    from v$nls_valid_values
  3   where parameter = 'CHARACTERSET'
  4     and value like '%8859%'
  5* /

VALUE
_________________
WE8ISO8859P1
EE8ISO8859P2
SE8ISO8859P3
NEE8ISO8859P4
CL8ISO8859P5
AR8ISO8859P6
EL8ISO8859P7
IW8ISO8859P8
WE8ISO8859P9
NE8ISO8859P10
WE8ISO8859P15
BLT8ISO8859P13
CEL8ISO8859P14
AZ8ISO8859P9E

14 rows selected.

From this list, WE8ISO8859P1 is selected and passed into the APEX_DATA_PARSER function:
SQL> select p.col001
  2        ,p.col002
  3        ,p.col003
  4    from all_csvs a
  5   cross
  6   apply apex_data_parser.parse (p_content      => a.blob_content
  7                                ,p_file_name    => a.filename
  8                                ,p_file_charset => 'WE8ISO8859P1'
  9                                ) p
 10* /

COL001                           COL002     COL003
_____________________________ _________ __________
Rüsselsheim am Main               3.000    624.760
Gnutz                             1.000    139.490
Übach-Palenberg                   1.000    139.490
Syke                              1.000    242.600
Höhenkirchen-Siegertsbrunn        1.000    136.980
Eberhardzell                      1.000    233.810
heinsberg                         1.000     67.840
Hohndorf                          3.000    304.800
Nidderau                          3.000    385.890
Nörten-Hardenberg                 1.000    239.680
Klüß                              1.000    185.000
Köln                              2.000    237.880
Würselen                          1.000    196.500
Glückstadt                        1.000    136.980
Dessau-Roßlau                     1.000      5.900

15 rows selected.
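The same fix works anywhere outside the database, too: decode the bytes with the character set they were actually written in. A quick Python check with one of the names from the output (the bytes are spelled out by hand here):

```python
# Bytes of "Klüß" as they appear in an ISO-8859-1 encoded file.
raw = b"Kl\xfc\xdf"

# Decoding with the correct character set recovers the name intact...
assert raw.decode("iso-8859-1") == "Klüß"

# ...while a strict UTF-8 decode of the same bytes fails outright.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```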
And now all is well.