Iceberg Tables: A New Data Source for Oracle Autonomous Database

March 21, 2024 | 5 minute read
Alexey Filanovskiy
Product Manager
Text Size 100%:

 

Table formats in data lakes, such as Iceberg, emerged in response to the need for a more efficient and scalable way to store and manage large and complex datasets. In this blog, I’ll illustrate the advantage of using Iceberg table formats over other well-known file formats and demonstrate how Autonomous Database support for Iceberg extends its data lake portfolio, helping users navigate the challenges of today’s data-intensive environments more effectively.

Benefits of table formats

Traditionally, data lakes were built using file formats such as CSV, JSON, Parquet, or Avro. They could store large amounts of data but did not provide a way to organize the data into tables or databases. This limitation made it difficult to query, manage, and analyze data, particularly as the volume and complexity of data in data lakes grew.

For example, a user may have 3 files (assume they are Parquet files that have an embedded schema) in a Data Lake. It is difficult to figure out if the files are for three different tables or for a single table when plain file formats are used.

Table formats, on the other hand, provide a way to organize data into tables and partitions, which makes it easier to query and analyze the data. Here are some benefits table formats may have over file format:

  • In table format, there is a mapping between files and tables. This gives a consistent view of data across multiple execution engines.
  • Data management and performance: Table formats provide improved data management capabilities, including the ability to manage data partitions and evolve schemas. Moreover, the table format helps overcome scalability limitations for large datasets by eliminating the need to list the entire directory structure each time a query is executed.
  • Table formats have DML support. Operations like insert, update, and delete can be executed. It doesn’t mean that organizations can build transactional systems, but some relatively rare DML operations can be performed.

What is Apache Iceberg?

Apache Iceberg is an open-source table format for data lakes that is designed to provide scalable and efficient data management. It was developed to address the limitations of existing file formats for data lakes, such as Parquet and ORC, which do not provide a way to manage data at the table level. Iceberg is designed to work with a range of data storage systems, including cloud object stores and Hadoop Distributed File System (HDFS). It has been widely adopted by the Data Warehouse and Data Lake communities and continues to grow in popularity.

Iceberg Tables. Now with Autonomous Database

To embrace all the advantages of table formats and the growing popularity of Iceberg, Oracle is enabling Autonomous Database customers to fully utilize Iceberg tables and access Iceberg data as clients. This broadens Autonomous Database users’ choice of data types beyond the already-supported file types (Parquet, ORC, JSON, CSV, and Avro). You can now use the Iceberg table format as an external table.

There are two ways users can create external tables:

  • Using a catalog (like AWS Glue).
  • Specify the path to the Iceberg JSON metadata file.

The rule of thumb is to use a catalog when possible because it always gives the latest version. Here is a code example that creates an external table on top of data stored in AWS S3 using the AWS Glue catalog:

BEGIN
 DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
   table_name => ‘SALES_Q1_ICEBERG,
   credential_name => ‘AWS_CRED’,
   file_uri_list => NULL,
   format =>’{
             “access_protocol”:{
             “protocol_type”:”iceberg”,
             “protocol_config”:{
             “iceberg_catalog_type”:”aws_glue”,
             “iceberg_glue_region”:”us-west-2",
             “iceberg_table_path”:”default.sales_q1"
             }
           }
 }’);
END;

If the catalog is not available, there is always an option to point to the metadata file:

BEGIN
 DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
     table_name => ‘SALES_Q1_ICEBERG’,
     credential_name => ‘AWS_CRED’,
     file_uri_list => 'https://sales_data.s3.us-west-2.amazonaws.com/sales_q1/metadata/95ef842d.metadata.json',
     format =>'{"access_protocol":{"protocol_type":"iceberg"}}');
 END;

The main limitation of pointing to the metadata file is that a new file is created every time the table is updated. As a result, the user must update the table definition manually. For instance, if the original state is as follows:

After inserting rows in this table, a new metadata file will point to a fresh snapshot and manifest file:

As a result of this change, the user will not see new data until the external table is updated to point to the new metadata file.

Benefits of using Iceberg tables with Autonomous Database

The following examples describe the value of integrating Iceberg tables with Oracle Autonomous Database (ADB):

  1. Unified View: Iceberg tables allow for a consistent view of the same dataset across multiple engines, eliminating the need to maintain metadata in several places.
  2. DML Operations: While ADB doesn’t perform DML operations directly, Iceberg tables enable the use of various tools that can perform DML operations.
  3. Cross-Platform Data Accessibility: If your datasets are already in Iceberg format on a different cloud, they can be easily read and utilized by ADB, reducing data duplication and enhancing the flexibility of your operations.
  4. Efficient Large-Scale Handling: If an organization needs cross-platform data accessibility, Iceberg tables may help you deal with large datasets.

Conclusion

Oracle Autonomous Database’s new support for Apache Iceberg greatly enhances its data lake capabilities. This is a significant addition for those dealing with vast quantities of data, offering superior organization, updating, and time-travel features.

This new integration offers more flexibility and efficiency, allowing ADB users to handle large, complex datasets more effectively. It’s another tool in your toolbox that can make your data management tasks smoother and more efficient.

Alexey Filanovskiy

Product Manager

My role involves overseeing the product management of Data Lake features for the Oracle Autonomous Database.


Previous Post

How to Use Built-in Encryption and Decryption Support When Exporting and Loading Data in Autonomous Database

Can Tuzla | 5 min read

Next Post


Now automatically refresh Refreshable Clones on Autonomous Database

Nilay Panchal | 3 min read
Everything you need to know about data warehousing with the world's leading cloud solution provider