Query ORC files and complex data types in Object Storage with Autonomous Database

Apache ORC is a columnar file type that is common to the Hadoop ecosystem. It is similar in concept to Apache Parquet; Hortonworks Hadoop clusters often leveraged ORC – while Cloudera clusters utilized Parquet. Like parquet, ORC is a database file designed for efficient reads. Files embed a schema, data is stored in a compressed columnar format, predicate pushdown enables data pruning – sound familiar? See the Parquet blog post if you want a refresher 🙂

In addition, ORC provides the ability to capture complex data types – including arrays, maps, unions and objects (or structs). This capability too is similar to both Parquet and Avro.

Autonomous Database now supports querying object store data that is captured in ORC format – in addition to text, Avro and Parquet. And, across the structured file types – you can now also query complex data types. Let’s take a look at an example. We’ll extend the movie file that was used in our previous Avro post (we downloaded this data from Wikipedia) – this time using the ORC file type with an extra column. The movie file has the following schema:

id int original_title string overview string poster_path string release_date string vote_count int runtime int popularity double genres array<struct<id:int,name:string>

Notice that each movie is categorized by multiple genres (an array of genres). This array is an array of objects – or structs: each genre has an id (integer) and a name (string). Let’s take a look at how to access this data in Autonomous Database. After creating a credential object that enables access to Oracle Object Storage, create a table using dbms_cloud.create_external table:

begin DBMS_CLOUD.create_credential ( credential_name => 'OBJ_STORE_CRED', username => abc@oracle.com', password => '12345' ) ; dbms_cloud.create_external_table ( table_name =>'movie_info', credential_name =>'OBJ_STORE_CRED', file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc', format => '{"type":"orc", "schema": "first"}' ); end; /

Notice you don’t have to specify the shape of the data. The columns and data types are automatically derived by reading the metadata contained in the ORC file. This created a table with the following DDL:

CREATE TABLE "ADMIN"."MOVIE_INFO" ( "ID" NUMBER(10,0), "ORIGINAL_TITLE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", "OVERVIEW" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", "POSTER_PATH" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", "RELEASE_DATE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP", "VOTE_COUNT" NUMBER(10,0), "RUNTIME" NUMBER(10,0), "POPULARITY" BINARY_DOUBLE, "GENRES" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP" ) DEFAULT COLLATION "USING_NLS_COMP" ORGANIZATION EXTERNAL ( TYPE ORACLE_BIGDATA DEFAULT DIRECTORY "DATA_PUMP_DIR" ACCESS PARAMETERS ( com.oracle.bigdata.credential.name=OBJ_STORE_CRED com.oracle.bigdata.fileformat=ORC ) LOCATION ( 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc' ) ) REJECT LIMIT UNLIMITED PARALLEL ;

So, how will you handle the complex data? Simple… data will be returned as JSON. Thus, despite the fact that different file types store the same data in different ways, your query can be totally agnostic of this fact. You see the same JSON output for all common complex types. Use Oracle’s rich JSON query processing capabilities against the table.

Running a simple query against this table yields the following:

select original_title, release_date, genres from movie_info where release_date > '2000' order by original_title;

original_title	release_date	genres
(500) Days of Summer	2009	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
10,000 BC	2008	[{“id”:6,”name”:”Comedy”}]
11:14	2003	[{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}]
127 Hours	2010	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}]
13 Going on 30	2004	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
1408	2007	[{“id”:45,”name”:”Sci-Fi”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:6,”name”:”Comedy”},{“id”:18,”name”:”War”}]

Notice that the genres returned as a json array. Let’s make that JSON data more useful. The JSON can be transformed using Oracle’s JSON functions – including the simple “.” notation as well as the more powerful transform functions like JSON_TABLE. The example below queries the table – turning each value of the array into a row in the result set:

select original_title, release_date, m.genre_name, genres from movie_info mi, json_table(mi.genres, '$.name[*]' COLUMNS (genre_name VARCHAR2(25) PATH '$') ) AS m where rownum < 10;

original_title	release_date	genre_name	genres
(500) Days of Summer	2009	Drama	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
(500) Days of Summer	2009	Comedy	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
(500) Days of Summer	2009	Horror	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
(500) Days of Summer	2009	Western	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
(500) Days of Summer	2009	War	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
(500) Days of Summer	2009	Romance	[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
10,000 BC	2008	Comedy	[{“id”:6,”name”:”Comedy”}]
11:14	2003	Family	[{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}]
11:14	2003	Thriller	[{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}]
127 Hours	2010	Comedy	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}]
127 Hours	2010	Drama	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}]
13 Going on 30	2004	Romance	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
13 Going on 30	2004	Comedy	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
13 Going on 30	2004	War	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
13 Going on 30	2004	Drama	[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]

JSON_TABLE was used in this case to 1) create a row for each value of the array (think outer join) and 2) the struct was parsed to extract the name of the genre.

Autonomous Database continues to expand its ability to effectively query external source. Look for more to come in this area soon!

Query ORC files and complex data types in Object Storage with Autonomous Database

Marty Gubar

Director Product Management

Oracle Database Vault on Autonomous Database: Quickly and Easily

Announcing Autonomous Data Guard!

Query ORC files and complex data types in Object Storage with Autonomous Database

Authors

Marty Gubar

Director Product Management

Oracle Database Vault on Autonomous Database: Quickly and Easily

Announcing Autonomous Data Guard!