Automate multi-cloud analysis by integration of Autonomous Data Warehouse with AWS Glue

July 31, 2023 | 6 minute read
Alexey Filanovskiy
Product Manager
Text Size 100%:

Why is multi-cloud important?

Cloud has become a reality for many enterprise customers. What was just a marketing buzz a few years ago is now fundamental architecture for many customers. The elasticity of Infrastructure (IaaS), along with automation of platform (PaaS) and convenience of service approach (SaaS), brings many people into Cloud. But the elephant in the room is vendor lock-in. What if a customer moves all my IT resources into one Cloud and a vendor one day decides significantly raise the price or introduce any new charge? Also, a common scenario is when hidden costs appear during service usage, which could dramatically impact TCO. The solution is to employ a multi-cloud strategy when similar functionality can be replicated in another Cloud.
Also, each cloud vendor may have strengths and weaknesses, so customers may want to use the best of two worlds. In this blog post, I will discuss the integration of Amazon Web Services (AWS) and Autonomous Data Warehouse running on Oracle Cloud Infrastructure (OCI).

What could have been done to integrate ADB with Amazon Cloud before?

ADB is multi-cloud ready

For a customer using the multi-cloud, it’s entirely possible to have data in one object store (because some applications wrote it there) and need to process these data assets with an engine running in another cloud. Oracle ADB can fulfill these requirements and process data stored in AWS S3 buckets on the fly. For example, in addition to the most commonly used authentication method using Access/Secret keys, Oracle Autonomous Data Warehouse can be authenticated as Amazon Named Resource. With any authentication, the user can create an external table on top of the AWS S3 bucket using Database Actions UI or PL/SQL API. But both cases cover direct table creation on top of S3, not using the AWS Glue catalog, which can capture and centrally store metadata about files in S3.

What is AWS Glue?

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. It can connect to various data sources, including data stored in Amazon S3 and databases on Amazon RDS, and automatically generate the code to extract, transform, and load the data. AWS Glue also includes a data catalog that allows customers to store and retrieve information about their data, such as schema and statistics.

AWS Glue consists of multiple parts. For the scope of this blog, I focus my attention on the catalog component used to store metadata.

Benefits of using a central glue catalog over creating an external table

The catalog approach has a few benefits over creating an external table directly on S3.

  • Crawler can automate the gathering of metadata. The intelligent component goes over data sitting in S3 and automatically detects datatype, list of the columns, and type of the columns. A crawler can gather metadata over multiple objects at one run.
  • Data is always up to date. For example, in the case of a partitioned table will always reflect the latest operations with partitions.
  • Users can create many external tables at once
  • Table formats like Iceberg may rely on catalog solutions, which can be the only way to create external tables on top of these data formats.

Enrich Autonomous Data Warehouse with data stored in S3, using AWS Glue.

Now, I’d like to demonstrate how to transfer AWS metadata into the Autonomous Data Warehouse for later examination. The general steps of this guide are:

  1. Check the metadata in AWS Glue
  2. Link AWS Glue to the Autonomous Data Warehouse
  3. Synchronize metadata (create XTs in Autonomous Data Warehouse from AWS Glue)
  4. Look at the new items in the Autonomous Data Warehouse and perform a query on S3
Sync Flow

Inspect metadata in AWS Glue.

First, I’d like to display what’s available in AWS Glue. I located AWS Glue and selected it:

AWS Glue interface

I’m assuming that some items are already present in S3, and Glue has scanned them before. So, some entities are already in place:

List of tables in Glue

Associate AWS Glue with Autonomous Data Warehouse.

Now it’s the moment to link this AWS Glue catalog with the Autonomous Database. First off, I have to make some login details in my ADB. The Access and Secret key will help to reach AWS Glue and the data in S3:

begin
    dbms_cloud.create_credential('MY_CRED_AWS','<access key>', '<secret key>');
end;

Credentials will be associated with Data Catalog and Object Store:

begin
    dbms_dcat.set_data_catalog_credential('MY_CRED_AWS');
    dbms_dcat.set_object_store_credential('MY_CRED_AWS');
end;

Also, a user has to set up an AWS region where Glue is running:

begin
    dbms_dcat.set_data_catalog_conn(region => 'us-west-2', 
                                    catalog_type=>'AWS_GLUE');
end;

Synchronize metadata from AWS Glue

Now after the association is done, time to check, using ADB API, what is inside a Glue Catalog. To find Glue databases, check this one:

select * from all_glue_databases order by name;

The following query shows a list of the tables available for sync:

select * from all_glue_tables order by database_name, name;

Having a list of the user of the table can pick a few that need to be synced into ADB (Note: this is just metadata sync, data itself will reside in the S3 bucket):

begin
dbms_dcat.run_sync(
  synced_objects => '
    {
        "database_list": [
            {
            "database": "parq",
            "table_list": ["store","item"]
            }
        ]
    }',
error_semantics => 'STOP_ON_ERROR');
end;
/

To run a full sync (all tables and all databases), run the following command:

begin
    dbms_dcat.run_sync(synced_objects=>'{"database_list":["*"]}')
end;

Inspect new objects in Autonomous Data Warehouse and run a query on top of S3.

After the sync, it’s a good time to check newly created objects in a database. A new schema was created (name has been generated automatically):

Now user can run SQL queries over datasets stored in AWS S3:

SELECT * FROM glue$parq_tpcds_oracle_parq.store;

What is essential — data remain stored in AWS Cloud and not copied over to Oracle Database.

Conclusion

In this blog, I show how easily users can expand their data warehouse beyond a single cloud and use metadata and data from other cloud providers such as AWS. Data remains in the remote cloud and is only processed at query runtime. By utilizing AWS Glue, customers can sync multiple objects at one run and easily maintain such a connection.

Alexey Filanovskiy

Product Manager

My role involves overseeing the product management of Data Lake features for the Oracle Autonomous Database.


Previous Post

How to Use Flashback Time Travel in Autonomous Database

Can Tuzla | 6 min read

Next Post


August Newsletter for Autonomous Database

Keith Laker | 6 min read
Everything you need to know about data warehousing with the world's leading cloud solution provider