Apache ORC is a columnar file type that is common to the Hadoop ecosystem. It is similar in concept to Apache Parquet; Hortonworks Hadoop clusters often leveraged ORC – while Cloudera clusters utilized Parquet. Like parquet, ORC is a database file designed for efficient reads. Files embed a schema, data is stored in a compressed columnar format, predicate pushdown enables data pruning – sound familiar? See the Parquet blog post if you want a refresher 🙂
In addition, ORC provides the ability to capture complex data types – including arrays, maps, unions and objects (or structs). This capability too is similar to both Parquet and Avro.
Autonomous Database now supports querying object store data that is captured in ORC format – in addition to text, Avro and Parquet. And, across the structured file types – you can now also query complex data types. Let’s take a look at an example. We’ll extend the movie file that was used in our previous Avro post (we downloaded this data from Wikipedia) – this time using the ORC file type with an extra column. The movie file has the following schema:
id int
original_title string
overview string
poster_path string
release_date string
vote_count int
runtime int
popularity double
genres array<struct<id:int,name:string>
Notice that each movie is categorized by multiple genres (an array of genres). This array is an array of objects – or structs: each genre has an id (integer) and a name (string). Let’s take a look at how to access this data in Autonomous Database. After creating a credential object that enables access to Oracle Object Storage, create a table using dbms_cloud.create_external table:
begin
DBMS_CLOUD.create_credential (
credential_name => 'OBJ_STORE_CRED',
username => abc@oracle.com',
password => '12345'
) ;
dbms_cloud.create_external_table (
table_name =>'movie_info',
credential_name =>'OBJ_STORE_CRED',
file_uri_list =>'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc',
format => '{"type":"orc", "schema": "first"}'
);
end;
/
Notice you don’t have to specify the shape of the data. The columns and data types are automatically derived by reading the metadata contained in the ORC file. This created a table with the following DDL:
CREATE TABLE "ADMIN"."MOVIE_INFO"
( "ID" NUMBER(10,0),
"ORIGINAL_TITLE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"OVERVIEW" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"POSTER_PATH" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"RELEASE_DATE" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP",
"VOTE_COUNT" NUMBER(10,0),
"RUNTIME" NUMBER(10,0),
"POPULARITY" BINARY_DOUBLE,
"GENRES" VARCHAR2(4000 BYTE) COLLATE "USING_NLS_COMP"
) DEFAULT COLLATION "USING_NLS_COMP"
ORGANIZATION EXTERNAL
( TYPE ORACLE_BIGDATA
DEFAULT DIRECTORY "DATA_PUMP_DIR"
ACCESS PARAMETERS
( com.oracle.bigdata.credential.name=OBJ_STORE_CRED
com.oracle.bigdata.fileformat=ORC
)
LOCATION
( 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/abcd/b/movies/o/movie-info.orc'
)
)
REJECT LIMIT UNLIMITED
PARALLEL ;
So, how will you handle the complex data? Simple… data will be returned as JSON. Thus, despite the fact that different file types store the same data in different ways, your query can be totally agnostic of this fact. You see the same JSON output for all common complex types. Use Oracle’s rich JSON query processing capabilities against the table.
Running a simple query against this table yields the following:
select original_title, release_date, genres
from movie_info
where release_date > '2000'
order by original_title;
| original_title | release_date | genres |
|---|---|---|
| (500) Days of Summer | 2009 | [{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}] |
| 10,000 BC | 2008 | [{“id”:6,”name”:”Comedy”}] |
| 11:14 | 2003 | [{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}] |
| 127 Hours | 2010 | [{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}] |
| 13 Going on 30 | 2004 | [{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}] |
| 1408 | 2007 | [{“id”:45,”name”:”Sci-Fi”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:6,”name”:”Comedy”},{“id”:18,”name”:”War”}] |
Notice that the genres returned as a json array. Let’s make that JSON data more useful. The JSON can be transformed using Oracle’s JSON functions – including the simple “.” notation as well as the more powerful transform functions like JSON_TABLE. The example below queries the table – turning each value of the array into a row in the result set:
select original_title,
release_date,
m.genre_name,
genres
from movie_info mi,
json_table(mi.genres, '$.name[*]'
COLUMNS (genre_name VARCHAR2(25) PATH '$')
) AS m
where rownum < 10;
| original_title | release_date | genre_name | genres |
|
(500) Days of Summer
|
2009
|
Drama
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
(500) Days of Summer
|
2009
|
Comedy
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
(500) Days of Summer
|
2009
|
Horror
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
(500) Days of Summer
|
2009
|
Western
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
(500) Days of Summer
|
2009
|
War
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
(500) Days of Summer
|
2009
|
Romance
|
[{“id”:3,”name”:”Drama”},{“id”:6,”name”:”Comedy”},{“id”:17,”name”:”Horror”},{“id”:19,”name”:”Western”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
10,000 BC
|
2008
|
Comedy
|
[{“id”:6,”name”:”Comedy”}]
|
|
11:14
|
2003
|
Family
|
[{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}]
|
|
11:14
|
2003
|
Thriller
|
[{“id”:9,”name”:”Thriller”},{“id”:14,”name”:”Family”}]
|
|
127 Hours
|
2010
|
Comedy
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}]
|
|
127 Hours
|
2010
|
Drama
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”}]
|
|
13 Going on 30
|
2004
|
Romance
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
13 Going on 30
|
2004
|
Comedy
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
13 Going on 30
|
2004
|
War
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
|
13 Going on 30
|
2004
|
Drama
|
[{“id”:6,”name”:”Comedy”},{“id”:3,”name”:”Drama”},{“id”:18,”name”:”War”},{“id”:15,”name”:”Romance”}]
|
JSON_TABLE was used in this case to 1) create a row for each value of the array (think outer join) and 2) the struct was parsed to extract the name of the genre.
Autonomous Database continues to expand its ability to effectively query external source. Look for more to come in this area soon!
