X

Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Query ORC files and complex data types in Object Storage with Autonomous Database

Martin Gubar
Director Product Management

Apache ORC is a columnar file type that is common to the Hadoop ecosystem.  It is similar in concept to Apache Parquet; Hortonworks Hadoop clusters often leveraged ORC – while Cloudera clusters utilized Parquet.  Like parquet, ORC is a database file designed for efficient reads. Files embed a schema, data is stored in a compressed columnar format, predicate pushdown enables data pruning – sound familiar?  See the Parquet blog post if you want a refresher :)

In addition, ORC provides the ability to capture complex data types – including arrays, maps, unions and objects (or structs).  This capability too is similar to both Parquet and Avro.

Autonomous Database now supports querying object store data that is captured in ORC format – in addition to text, Avro and Parquet.  And, across the structured file types – you can now also query complex data types.  Let’s take a look at an example.  We’ll extend the movie file that was used in our previous Avro post (we downloaded this data from Wikipedia) - this time using the ORC file type with an extra column. The movie file has the following schema:

id int
original_title string
overview string
poster_path string
release_date string
vote_count int
runtime int
popularity double
genres array<struct<id:int,name:string>

Notice that each movie is categorized by multiple genres (an array of genres).  This array is an array of objects - or structs:  each genre has an id (integer) and a name (string).  Let's take a look at how to access this data in Autonomous Database.  After creating a credential object that enables access to Oracle Object Storage, create a table using dbms_cloud.create_external table:

begin
  DBMS_CLOUD.create_credential (
    credential_name => 'OBJ_STORE_CRED',
    username => abc@oracle.com',
    password => '12345'
  ) ;

    dbms_cloud.create_external_table (
    table_name =>'movie_info',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc',
    format =>  '{"type":"orc",  "schema": "first"}'
    );
end;
/

Notice you don’t have to specify the shape of the data.  The columns and data types are automatically derived by reading the metadata contained in the ORC file.  This created a table with the following DDL: 

CREATE TABLE "ADMIN"."MOVIE_INFO" 
   ( "ID" NUMBER(10,0), 
     "ORIGINAL_TITLE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
     "OVERVIEW" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
     "POSTER_PATH" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
     "RELEASE_DATE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", 
     "VOTE_COUNT" NUMBER(10,0), 
     "RUNTIME" NUMBER(10,0), 
     "POPULARITY" BINARY_DOUBLE, 
     "GENRES" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP"
   )  DEFAULT COLLATION "USING_NLS_COMP" 
   ORGANIZATION EXTERNAL 
    ( TYPE ORACLE_BIGDATA
      DEFAULT DIRECTORY "DATA_PUMP_DIR"
      ACCESS PARAMETERS
      ( com.oracle.bigdata.credential.name=OBJ_STORE_CRED
        com.oracle.bigdata.fileformat=ORC
  )
      LOCATION
       ( 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc'
       )
    )
   REJECT LIMIT UNLIMITED 
  PARALLEL ;

So, how will you handle the complex data?  Simple… data will be returned as JSON.  Thus, despite the fact that different file types store the same data in different ways, your query can be totally agnostic of this fact.  You see the same JSON output for all common complex types.  Use Oracle's rich JSON query processing capabilities against the table.

Running a simple query against this table yields the following:

select original_title, release_date, genres
from movie_info
where release_date > '2000'
order by original_title;

original_title release_date genres
(500) Days of Summer 2009 [{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC 2008 [{"id":6,"name":"Comedy"}]
11:14 2003 [{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours 2010 [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30 2004 [{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
1408 2007 [{"id":45,"name":"Sci-Fi"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":6,"name":"Comedy"},{"id":18,"name":"War"}]

Notice that the genres returned as a json array.  Let's make that JSON data more useful.  The JSON can be transformed using Oracle's JSON functions - including the simple "." notation as well as the more powerful transform functions like JSON_TABLE.  The example below queries the table - turning each value of the array into a row in the result set:

select original_title,
       release_date,
       m.genre_name,
       genres
from movie_info mi,
     json_table(mi.genres, '$.name[*]'
       COLUMNS (genre_name VARCHAR2(25) PATH '$')                 
                ) AS m
where rownum < 10;

original_title release_date genre_name genres
(500) Days of Summer
2009
Drama
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer
2009
Comedy
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer
2009
Horror
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer
2009
Western
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer
2009
War
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
(500) Days of Summer
2009
Romance
[{"id":3,"name":"Drama"},{"id":6,"name":"Comedy"},{"id":17,"name":"Horror"},{"id":19,"name":"Western"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
10,000 BC
2008
Comedy
[{"id":6,"name":"Comedy"}]
11:14
2003
Family
[{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
11:14
2003
Thriller
[{"id":9,"name":"Thriller"},{"id":14,"name":"Family"}]
127 Hours
2010
Comedy
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
127 Hours
2010
Drama
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"}]
13 Going on 30
2004
Romance
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30
2004
Comedy
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30
2004
War
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]
13 Going on 30
2004
Drama
[{"id":6,"name":"Comedy"},{"id":3,"name":"Drama"},{"id":18,"name":"War"},{"id":15,"name":"Romance"}]

 

JSON_TABLE was used in this case to 1) create a row for each value of the array (think outer join) and 2) the struct was parsed to extract the name of the genre.

Autonomous Database continues to expand its ability to effectively query external source.  Look for more to come in this area soon!

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.