
Super-Easy CSV, XLSX, JSON or XML parsing: About the APEX_DATA_PARSER package

Carsten Czarski
Consulting Member of Technical Staff

by Andreea Dobre, Oracle Romania in Bucharest,
and Carsten Czarski, Oracle Application Express Development team.

One of the new features of Application Express 19.1 is the new Data Loading wizard, which has already been introduced in the Quick and Easy Data Loading with APEX 19.1 blog posting. But there is more than the new wizard in SQL Workshop: the new APEX_DATA_PARSER PL/SQL package provides a nice and easy developer interface to parse files.

The APEX_DATA_PARSER package supports all file types which are supported by SQL Workshop data loading, that is, delimited files (CSV, tab-delimited), JSON, XML and XLSX. The parser is implemented as a table function, so the developer accesses parser results like the rows of a table. The parser can thus be used wherever a table can: with an INSERT ... SELECT statement, rows can be inserted directly into a table; existing tables can be joined in order to look up values; and SQL and PL/SQL functions can be applied to process and refine the raw parsed data.
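As a quick illustration of this "use it like a table" idea, here is a minimal sketch which joins the parser output to a lookup table and applies a SQL function to one of the parsed columns. The DEPT table, the column positions and the PX_FILE page item are assumptions made for this example only.

-- Sketch only: assumes a DEPT table (DEPTNO, DNAME) and that COL008 of the
-- uploaded CSV contains the department name; PX_FILE is a hypothetical page item.
select p.line_number,
       upper( p.col002 ) as ename,   -- refine raw data with SQL functions
       d.deptno                      -- look up values by joining existing tables
  from apex_application_temp_files f,
       table( apex_data_parser.parse(
                  p_content   => f.blob_content,
                  p_file_name => f.filename ) ) p,
       dept d
 where f.name  = :PX_FILE
   and d.dname = p.col008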

This blog posting will walk you through the APEX_DATA_PARSER PL/SQL package as follows:

  • Create an application within a workspace on APEX 19.1 (for now, use the Early Adopter instance: apexea.oracle.com)
  • In that application, we'll create a page to upload an XML, XLSX, JSON or CSV file, parse it and display the data as a Classic Report.

The most important function in the APEX_DATA_PARSER package is the PARSE table function. It allows you not only to parse the file contents, but also to detect the headers and display them as columns within a report. The PARSE function accepts the file to be parsed as a BLOB; other parameters control the parser's behavior. The function supports parsing up to 300 columns.

function parse(
    p_content                      in blob,
    p_file_name                    in varchar2     default null,
    p_file_profile                 in clob         default null,
    --
    p_detect_data_types            in varchar2     default 'Y',
    p_decimal_char                 in varchar2     default null,
    --
    p_xlsx_sheet_name              in varchar2     default null,
    --
    p_row_selector                 in varchar2     default null,
    --
    p_csv_row_delimiter            in varchar2     default LF,
    p_csv_col_delimiter            in varchar2     default null,
    p_csv_enclosed                 in varchar2     default '"',
    --
    p_skip_rows                    in pls_integer  default 0,
    --
    p_nullif                       in varchar2     default null,
    --
    p_file_charset                 in varchar2     default 'AL32UTF8',
    p_max_rows                     in number       default null,
    --
    p_store_profile_to_collection  in varchar2     default null ) return apex_t_parser_table pipelined;

Below is a short description of the parameters.

  • P_CONTENT -  the file content to be parsed as a BLOB

  • P_FILE_NAME - the name of the file; this is used to derive the file type. 

    • .xml extensions will be treated as XML files

    • .json, .geojson or .js extensions will be treated as JSON files

    • .csv or .txt extensions will be treated as delimited files

    • .xlsx extensions will be treated as Excel Workbooks (only the XLSX format is supported)

  • P_CSV_COL_DELIMITER - use a specific CSV column delimiter. If omitted, the function will detect either comma (,), semi-colon (;), hash (#), pipe (|) or the tab character, based on the contents of the first row.

  • P_CSV_ENCLOSED  - override the default enclosure character for CSV parsing. Note that the parser treats this value as "optionally enclosed", so it will also handle non-enclosed values correctly.

  • P_CSV_ROW_DELIMITER - override the default row delimiter for CSV parsing. By default, this is the newline character. You won't need to adjust this parameter in most cases.

  • P_ROW_SELECTOR - for JSON and XML files. Pointer to the array / list of rows within the JSON or XML file. If omitted, the function will:

    • for XML files: use "/*/*" (first tag under the root tag) as the row selector

    • for JSON files: look for JSON arrays within the file and use the first array found.

  • P_XLSX_SHEET_NAME - for XLSX files. The name of the worksheet to parse. If omitted, the function will use the first worksheet found.

  • P_FILE_CHARSET - use a specific file encoding. If omitted, files will be treated as Unicode (AL32UTF8).

  • P_DETECT_DATA_TYPES - whether to discover column data types. Defaults to Yes.

  • P_DECIMAL_CHAR - use this decimal character when trying to detect NUMBER data types. If not specified, the function will auto-detect the decimal character.

  • P_SKIP_ROWS - skip the first N rows before parsing the rest of the file.

  • P_NULLIF - value to treat as NULL. This is case-insensitive and can be used to implicitly change values like "N/A", "-" or "TBD" to NULL.

  • P_MAX_ROWS  - stop parsing rows after P_MAX_ROWS rows have been processed.

  • P_STORE_PROFILE_TO_COLLECTION - when a name is provided, all column and data type information discovered during parsing will be stored in a collection of that name.
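To illustrate a few of these parameters in isolation, here is a sketch of a PARSE call which treats "N/A" values as NULL, uses the comma as the decimal character and reads a non-Unicode file. The PX_FILE page item and the character set are assumptions for this example.

select line_number, col001, col002, col003
  from apex_application_temp_files f,
       table( apex_data_parser.parse(
                  p_content      => f.blob_content,
                  p_file_name    => f.filename,
                  p_nullif       => 'N/A',           -- treat "N/A" as NULL
                  p_decimal_char => ',',             -- detect numbers like 1234,56
                  p_file_charset => 'WE8ISO8859P1'   -- ISO-8859-1 encoded file
              ) ) p
 where f.name = :PX_FILE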

Let's now build a simple test application. A page in that application will allow the end user to upload a file; the file contents will then be displayed as a report.

  • Create an empty page in your application
  • Create a region of type "Static HTML" on the page
  • Add a File Browse item to the region. In settings, keep the Storage type as APEX_APPLICATION_TEMP_FILES.
  • Finally add a button to submit the file. When that button is clicked, the file will be sent to the server and the APEX_DATA_PARSER package will be able to parse it.

Your page should look as follows:

If you upload a file right now, nothing will happen yet. So add a report to the page to display the parser results: choose a Classic Report and use the following SQL query.

select line_number, col001, col002, col003, col004, col005, 
                    col006, col007, col008, col009, col010
                    -- more columns (col011 to col300) can be selected here.
  from apex_application_temp_files f, 
       table( apex_data_parser.parse(
                  p_content                     => f.blob_content,
                  p_add_headers_row             => 'Y',
                  p_max_rows                    => 500,
                  p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                  p_file_name                   => f.filename ) ) p
 where f.name = :PX_FILE

Then add another Classic Report to display the column and data type information which is discovered while the file is being parsed. The p_store_profile_to_collection parameter above instructs APEX to store that information in the FILE_PARSER_COLLECTION collection, so the second report simply displays the contents of that collection.

select column_position, column_name, data_type, format_mask
  from apex_collections c, 
       table( apex_data_parser.get_columns( p_profile => c.clob001 ) )
 where c.collection_name = 'FILE_PARSER_COLLECTION' 
   and c.seq_id = 1

Add a Server-Side Condition to both report regions to show only when the PX_FILE item is not null - i.e. only when a file has been uploaded. In Page Designer, your page should then look as follows:

Run your page. First, you should only see the File Browse item. However, after uploading a CSV file ...

EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DNAME,LOC
7839,KING,PRESIDENT,,1981-11-17T00:00:00,5000,,ACCOUNTING,NEW YORK
7698,BLAKE,MANAGER,7839,1981-05-01T00:00:00,2850,,SALES,CHICAGO
7782,CLARK,MANAGER,7839,1981-06-09T00:00:00,2450,,ACCOUNTING,NEW YORK
7566,JONES,MANAGER,7839,1981-04-02T00:00:00,2975,,RESEARCH,DALLAS
:

... you should see the first 10 columns and the first 15 rows of that file - nicely parsed, as if it were a table. 

 

Below that, you should see the discovered column and data type information - note that all of this has been done by the PARSE table function; the Classic Reports just display the results.

 

Try this with other CSV files as well. You might also play with the P_CSV_COL_DELIMITER, P_SKIP_ROWS and P_MAX_ROWS parameters.

select line_number, col001, col002, col003, col004, col005, 
                    col006, col007, col008, col009, col010
                    -- more columns (col011 to col300) can be selected here.
  from apex_application_temp_files f, 
       table( apex_data_parser.parse(
                  p_content                     => f.blob_content,
                  p_add_headers_row             => 'Y',
                  --
                  p_max_rows                    => 5,
                  p_skip_rows                   => 2,
                  p_csv_col_delimiter           => ';',
                  --
                  p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                  p_file_name                   => f.filename ) ) p
 where f.name = :PX_FILE

Working with XML files

The APEX_DATA_PARSER package also supports XML files. An XML file needs to have a repeating structure in order to be parsed correctly, like the following example:

<?xml version="1.0"?>
<ROWSET>
 <ROW>
  <EMPNO>7839</EMPNO>
  <ENAME>KING</ENAME>
  <JOB>PRESIDENT</JOB>
  <HIREDATE>1981-11-17T00:00:00.000</HIREDATE>
  <SAL>5000</SAL>
  <DEPTNO>10</DEPTNO>
 </ROW>
 <ROW>
  <EMPNO>7698</EMPNO>
  <ENAME>BLAKE</ENAME>
  <JOB>MANAGER</JOB>
  <MGR>7839</MGR>
  <HIREDATE>1981-05-01T00:00:00.000</HIREDATE>
  <SAL>2850</SAL>
  <DEPTNO>30</DEPTNO>
 </ROW>
 :
</ROWSET>

The tag names do not matter. By default, APEX_DATA_PARSER will treat the first element below the root element as the repeating structure. The tag names below that tag (here: ROW) will be treated as column names. Using the P_ROW_SELECTOR parameter, you can override that behavior, e.g. as follows:

select line_number, col001, col002, col003, col004, col005, 
                    col006, col007, col008, col009, col010
                    -- more columns (col011 to col300) can be selected here.
  from apex_application_temp_files f, 
       table( apex_data_parser.parse(
                  p_content                     => f.blob_content,
                  p_add_headers_row             => 'Y',
                  --
                  p_row_selector                => '/DOCUMENT/ROWSET/ROW',
                  --
                  p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                  p_file_name                   => f.filename ) ) p
 where f.name = :PX_FILE

Apart from the row selector, the package works for XML files exactly as it does for CSV files.

Parsing JSON files

JSON files are handled similarly to XML files. If no P_ROW_SELECTOR argument is given, APEX_DATA_PARSER.PARSE will look for JSON arrays within the file and treat the first array found as the "table"; the elements of that array become the rows. As for XML files, the P_ROW_SELECTOR parameter can be used to override that behavior. For JSON files, JSON path syntax ("dot notation") has to be used. Note, however, that APEX_DATA_PARSER does not support access to nested arrays: a row selector like results.array1[].array2 is not supported.

select line_number, col001, col002, col003, col004, col005, 
                    col006, col007, col008, col009, col010
                    -- more columns (col011 to col300) can be selected here.
  from apex_application_temp_files f, 
       table( apex_data_parser.parse(
                  p_content                     => f.blob_content,
                  p_add_headers_row             => 'Y',
                  --
                  p_row_selector                => 'result.items',
                  --
                  p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                  p_file_name                   => f.filename ) ) p
 where f.name = :PX_FILE

Parsing XLSX files

Until now, all three file types (CSV, JSON and XML) followed the same approach: a File Browse item and a button to submit the page were all we needed. An XLSX file, on the other hand, can contain one or more worksheets, so we need to tell APEX_DATA_PARSER which worksheet to process. If that is omitted, APEX_DATA_PARSER will parse the first worksheet it finds. We therefore amend our application page with a select list showing the existing worksheets, so that the end user can pick a specific one.

Add a new item of type Select List or Radio Group and use the following SQL Query as List of Values query.

select sheet_display_name, sheet_file_name
  from apex_application_temp_files f,
       table( apex_data_parser.get_xlsx_worksheets( p_content => blob_content ) ) p
 where f.name = :PX_FILE

 

Add a Server-Side Condition to render this item only when we're dealing with an XLSX file, i.e. when the file extension is .xlsx.
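One possible way to express that condition - an assumption on our side, not something the wizard generates - is a PL/SQL Expression on the file item:

-- Condition Type: PL/SQL Expression; PX_FILE is the File Browse item from above
lower( :PX_FILE ) like '%.xlsx'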

Then add the P_XLSX_SHEET_NAME parameter to pass the chosen worksheet to APEX_DATA_PARSER.PARSE ...

select line_number, col001, col002, col003, col004, col005, 
                    col006, col007, col008, col009, col010
                    -- more columns (col011 to col300) can be selected here.
  from apex_application_temp_files f, 
       table( apex_data_parser.parse(
                  p_content                     => f.blob_content,
                  p_add_headers_row             => 'Y',
                  --
                  p_xlsx_sheet_name             => :PX_XLSX_WORKSHEET,
                  --
                  p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                  p_file_name                   => f.filename ) ) p
 where f.name = :PX_FILE

... and add the PX_XLSX_WORKSHEET item as Page Items to Submit to the Parser Results report region.

Finally we need to add a Dynamic Action in order to refresh both the reports when a different XLSX worksheet is chosen from the select list. So, add a new Dynamic Action as follows:

  • Name is Worksheet change
  • Event is Change
  • Selection Type is Item
  • Item is PX_XLSX_WORKSHEET
  • Add a Refresh on the Parser Results region as the TRUE action

The Column Information region needs to be refreshed as well. However, that may only happen after the Parser Results region has been refreshed, since the new column information is discovered during parsing. So add another Dynamic Action as follows:

  • Name is Refresh Column Information
  • Event is After Refresh
  • Selection Type is Region
  • Region is Parser Results
  • Add a Refresh on the Column Information region as the TRUE action

The result should look as follows:

Then run your page and upload an XLSX file. You will now see the XLSX Worksheet select list (or radio group), as shown in the following screen shot.

When you choose another worksheet, the Parser Results region refreshes first, followed by the Column Information region.

Summary

With the above examples, we have gained some understanding of how the APEX_DATA_PARSER package works. It provides a very simple means to access data in CSV, XLSX, JSON or XML files: simply select from them as if they were tables. Beyond this, the function also discovers column and data type information, which can be retrieved from a collection or directly using the APEX_DATA_PARSER.DISCOVER and APEX_DATA_PARSER.GET_COLUMNS functions. 
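For reference, here is a sketch of how the same column information could be retrieved without a collection, by chaining DISCOVER and GET_COLUMNS; treat the exact call as an assumption to verify against the final 19.1 API documentation.

select column_position, column_name, data_type, format_mask
  from apex_application_temp_files f,
       table( apex_data_parser.get_columns(
                  apex_data_parser.discover(
                      p_content   => f.blob_content,
                      p_file_name => f.filename ) ) ) p
 where f.name = :PX_FILE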

APEX_DATA_PARSER simply expects the file contents as a BLOB - it does not matter where this BLOB comes from. While we used a file upload in this example, the BLOB can also come from an existing table or be fetched from the web using the APEX_WEB_SERVICE or UTL_HTTP packages.
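As a sketch of the latter (the URL is just a placeholder), a CSV file could be fetched over HTTP with APEX_WEB_SERVICE.MAKE_REST_REQUEST_B and parsed directly:

-- Sketch only: the URL and column list are placeholders for illustration.
select line_number, col001, col002, col003
  from table( apex_data_parser.parse(
                  p_content   => apex_web_service.make_rest_request_b(
                                     p_url         => 'https://www.example.com/data.csv',
                                     p_http_method => 'GET' ),
                  p_file_name => 'data.csv' ) ) p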

From parsing to loading data into tables it is only a small step: add a SQL INSERT statement on top of the parse function and the magic is done. Moreover, we can use database functionality like the DML error logging clause to collect bad rows in a separate table, so no rows are lost during processing; a sketch of that variant follows after the two examples below. Here are two pseudo-code examples for further data processing.

  • Direct insert into a target table 
    insert into target_table
    select line_number, col001, col002, col003, col004, col005, 
                        col006, col007, col008, col009, col010
                        -- more columns (col011 to col300) can be selected here.
      from apex_application_temp_files f, 
           table( apex_data_parser.parse(
                      p_content                     => f.blob_content,
                      p_file_name                   => f.filename ) ) p
     where f.name = :PX_FILE
    
  • Process in a PL/SQL loop
    declare
        l_row target_table%rowtype;
    begin
        for p in (
            select line_number, col001, col002, col003, col004, col005, 
                                col006, col007, col008, col009, col010
                                -- more columns (col011 to col300) can be selected here.
              from apex_application_temp_files f, 
                   table( apex_data_parser.parse(
                              p_content                     => f.blob_content,
                              p_add_headers_row             => 'Y',
                              p_max_rows                    => 50,
                              p_store_profile_to_collection => 'FILE_PARSER_COLLECTION',
                              p_file_name                   => f.filename ) ) p
             where f.name = :PX_FILE
        ) loop
            l_row.ename := upper( p.col001 );
            l_row.sal   := my_function.get_sal_from_value( p.col002 );
            l_row.comm  := to_number( p.col003);
    
            -- :
            -- further processing        
    
            insert into target_table values l_row;
    
        end loop;
    end;
    

Joins ..? Subqueries ..? Running parsing and loading as a job in the background ...? This is all pretty straightforward now, since all database functionality is available at your fingertips. Comprehensive API documentation for the APEX_DATA_PARSER package will be part of the APEX documentation published when APEX 19.1 is released.

So try it out today: if you haven't done so already, create your workspace on the APEX 19.1 Early Adopter instance right now!

 
