Cloud has become a reality for many enterprise customers. What was just marketing buzz a few years ago is now a fundamental architecture for many customers. The elasticity of infrastructure (IaaS), the automation of the platform (PaaS), and the convenience of the service approach (SaaS) bring many people to the Cloud. But the elephant in the room is vendor lock-in. What if a customer moves all of its IT resources into one Cloud and the vendor one day decides to significantly raise prices or introduce a new charge? Another common scenario is hidden costs that appear during service usage and can dramatically impact TCO. The solution is to employ a multi-cloud strategy, where similar functionality can be replicated in another Cloud.
Also, each cloud vendor has its strengths and weaknesses, so customers may want to use the best of both worlds. In this blog post, I will discuss the integration of Amazon Web Services (AWS) and Autonomous Data Warehouse running on Oracle Cloud Infrastructure (OCI).
For a customer using multi-cloud, it's entirely possible to have data in one object store (because some applications wrote it there) and need to process these data assets with an engine running in another cloud. Oracle ADB can fulfill these requirements and process data stored in AWS S3 buckets on the fly. For example, in addition to the most commonly used authentication method based on Access/Secret keys, Oracle Autonomous Data Warehouse can also authenticate using an Amazon Resource Name (ARN). With either authentication method, the user can create an external table on top of an AWS S3 bucket using the Database Actions UI or the PL/SQL API. But both cases cover direct table creation on top of S3, without using the AWS Glue catalog, which can capture and centrally store metadata about files in S3.
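As a quick illustration of the direct approach, the following is a minimal sketch of creating an external table over Parquet files in S3 with the DBMS_CLOUD PL/SQL API. The bucket URI and table name here are assumptions for illustration; the credential is created the same way as shown later in this post:

-- Hypothetical example: external table over Parquet files in S3.
-- The bucket URI and table name are placeholders, not real resources.
begin
  dbms_cloud.create_external_table(
    table_name      => 'STORE_EXT',
    credential_name => 'MY_CRED_AWS',
    file_uri_list   => 'https://my-bucket.s3.us-west-2.amazonaws.com/store/*.parquet',
    format          => '{"type":"parquet", "schema":"first"}');
end;
/

With "schema":"first", the column list is derived automatically from the first Parquet file, so no explicit column definitions are needed.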
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It can connect to various data sources, including data stored in Amazon S3 and databases on Amazon RDS, and automatically generate the code to extract, transform, and load the data. AWS Glue also includes a data catalog that allows customers to store and retrieve information about their data, such as schema and statistics.
AWS Glue consists of multiple parts. For the scope of this blog, I focus my attention on the catalog component used to store metadata.
The catalog approach has a few benefits over creating an external table directly on S3: metadata is captured and maintained centrally, the schema is discovered automatically by Glue crawlers, and many tables can be synced into the database in a single run.
Now, I’d like to demonstrate how to transfer AWS Glue metadata into the Autonomous Data Warehouse for later examination. The general steps of this guide are:
1. Review what is available in AWS Glue.
2. Create a credential in the Autonomous Database.
3. Associate the credential with the Data Catalog and Object Store, and set the catalog connection.
4. Browse the Glue databases and tables from the database.
5. Run the sync and query the results.
First, I’d like to display what’s available in AWS Glue. I located AWS Glue and selected it:
I’m assuming that some data is already present in S3 and that Glue crawlers have scanned it before, so some entities are already in place:
Now it’s time to link this AWS Glue catalog with the Autonomous Database. First, I have to create a credential in my ADB. The Access and Secret keys will let the database reach AWS Glue and the data in S3:
begin
  dbms_cloud.create_credential(
    credential_name => 'MY_CRED_AWS',
    username        => '<access key>',
    password        => '<secret key>');
end;
/
The credential will be associated with both the Data Catalog and the Object Store:
begin
  dbms_dcat.set_data_catalog_credential('MY_CRED_AWS');
  dbms_dcat.set_object_store_credential('MY_CRED_AWS');
end;
/
Also, the user has to set the AWS region where Glue is running:
begin
  dbms_dcat.set_data_catalog_conn(
    region       => 'us-west-2',
    catalog_type => 'AWS_GLUE');
end;
/
Now that the association is done, it’s time to check, using the ADB data dictionary, what is inside the Glue catalog. To list the Glue databases, query this view:
select * from all_glue_databases order by name;
The following query shows a list of the tables available for sync:
select * from all_glue_tables order by database_name, name;
Having the list, the user can pick a few tables that need to be synced into ADB (note: this is only a metadata sync; the data itself remains in the S3 bucket):
begin
  dbms_dcat.run_sync(
    synced_objects => '{
      "database_list": [
        {
          "database": "parq",
          "table_list": ["store", "item"]
        }
      ]
    }',
    error_semantics => 'STOP_ON_ERROR');
end;
/
To run a full sync (all tables in all databases), run the following command:
begin
  dbms_dcat.run_sync(synced_objects => '{"database_list": ["*"]}');
end;
/
After the sync, it’s a good time to check the newly created objects in the database. A new schema was created (its name was generated automatically):
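One way to find these objects is through the standard data dictionary. Judging by the example below, synced schema names start with a glue$ prefix; this query is a sketch based on that assumption:

-- List schemas and tables created by the sync.
-- The GLUE$ owner prefix is inferred from the generated schema name.
select owner, table_name
  from all_tables
 where owner like 'GLUE$%'
 order by owner, table_name;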
Now the user can run SQL queries over the datasets stored in AWS S3:
SELECT * FROM glue$parq_tpcds_oracle_parq.store;
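Because the synced object behaves like any other table, it can also participate in joins with local data. The following is a hypothetical sketch: LOCAL_SALES and its columns are assumptions, while s_store_sk and s_store_name are standard TPC-DS store columns:

-- Hypothetical join of the S3-backed external table with a local table.
-- LOCAL_SALES, store_id, and amount are illustrative names only.
select s.s_store_name, sum(l.amount) as total_amount
  from glue$parq_tpcds_oracle_parq.store s
  join local_sales l
    on l.store_id = s.s_store_sk
 group by s.s_store_name;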
What is essential: the data remains stored in the AWS Cloud and is not copied over to the Oracle Database.
In this blog, I showed how easily users can expand their data warehouse beyond a single cloud and use metadata and data from other cloud providers such as AWS. The data remains in the remote cloud and is processed only at query runtime. By utilizing AWS Glue, customers can sync multiple objects in one run and easily maintain the connection.
My role involves overseeing the product management of Data Lake features for the Oracle Autonomous Database.