Guest post from Carl Olofson, Principal Analyst, DBMSGuru LLC

Overview: Why a Lakehouse?

AI has become critical to business success. But enterprise AI depends on the availability of strategic data from across the enterprise, and on that data being current and of good quality. Both, in turn, depend largely on the data management technology in use.

Conventionally, most enterprises have three sets of data management technology. The first is operational: a wide range of databases driving applications that perform business functions. The second is a small set of one or more large databases designed to collect key data from the operational systems and use it for enterprise-wide analytics. These are commonly called data warehouses.

Over the past couple of decades, many have added a third set: data collections used either for more narrowly focused, near-term analysis or for business analysis performed via data science. These may contain ordered files or simple, often single-table databases in an open table format such as Apache Iceberg, holding a heterogeneous collection of data for analysis. These implementations are called data lakes, and their data is analyzed using Spark, Python scripts, or one of several SQL analytics engines.
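The flavor of this lake-side SQL analytics can be sketched with a toy example. Everything here is invented for illustration: the file contents, the table, and the query are hypothetical, and SQLite stands in for a real lake query engine, but the pattern of loading heterogeneous CSV and JSON extracts into one table and querying them with SQL is the same.

```python
import csv
import io
import json
import sqlite3

# Hypothetical extracts standing in for files landed in a data lake:
# one CSV drop and one JSON drop, describing the same kind of records.
csv_orders = "order_id,region,amount\n1,EMEA,120.50\n2,APAC,75.00\n"
json_orders = '[{"order_id": 3, "region": "EMEA", "amount": 200.00}]'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")

# Load the CSV rows.
for row in csv.DictReader(io.StringIO(csv_orders)):
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (int(row["order_id"]), row["region"], float(row["amount"])))

# Load the JSON rows.
for rec in json.loads(json_orders):
    conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                 (rec["order_id"], rec["region"], rec["amount"]))

# A typical near-term analytical query: revenue by region.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"))
print(totals)  # {'APAC': 75.0, 'EMEA': 320.5}
```

In a production lake the engine would be Spark, Trino, or a warehouse's external-table layer reading Iceberg or Parquet directly; the point is that heterogeneous files become queryable once they share a tabular shape.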

A data platform designed to combine the flexibility of the data lake with the rigor of the data warehouse is called a data lakehouse. It pulls and combines data from both types of databases, with additional operational data included on an ad hoc basis. As enterprises seek to apply AI technology to comprehensive data analysis, they need the lakehouse to have depth of data, to support a variety of data formats and models, to scale dynamically, and to be robust, highly performant, and secure.

Operational Challenges of the Lakehouse

Enterprises face a number of daunting challenges in building a data lakehouse. These include the following:

  • Complex management. Some lakehouses are built from separately acquired components, which can create performance and management challenges and require a range of connectivity and management tools. A lakehouse is always more straightforward to run when it is managed with a single set of technologies.
  • Data transfer and consolidation. Some lakehouse offerings provide precious little in terms of enabling timely and flexible data transfer from sources, such as operational databases, to the data lake for analysis. Having a set of such tools from a single vendor, all designed to work as part of a system, is generally best.
  • Definitional metadata. Metadata that defines how data from different sources combines is preferable to doing the work by hand, keeping manual notes about what the data means that can be obscure and incomplete.
  • Connectivity. Inter-system interoperation and adjustment need to be simple and clean. Use one package here, and another package there, and pretty soon you have a mess on your hands.
  • Support for open standards. Building a stable team of developers and ensuring flexible access to all kinds of data and tools requires support for leading open standards and, in many cases, open-source solutions.
  • Performance. Many enterprises, in order to support the widest range of data formats and models, employ database systems from a variety of vendors with differing performance characteristics. The result can be inconsistent system-wide performance, leading to frustration and lost staff time.

Oracle Autonomous AI Lakehouse: An Open Multicloud Data Platform

Oracle has provided an integrated product solution to the issues described above. This facility enables users to build and manage a lakehouse with consistent functionality that has been engineered together and includes built-in AI capabilities. It addresses the challenges listed above as follows:

  • Complex management. Instead of dealing with a potpourri of tools for collecting and managing data lakes, the Oracle Autonomous AI Lakehouse provides a simple, consistent set of tooling and user interfaces for all the jobs to be done in finding, defining, collecting and querying the data.
  • Data transfer and consolidation. Oracle provides GoldenGate and Oracle Data Integrator for data movement and synchronization into the data lakehouse from almost any source. For those with a substantial investment in other existing data lake and data warehouse technologies, Oracle Autonomous AI Lakehouse can interoperate with Microsoft Fabric, Amazon EMR, and Databricks.
  • Definitional metadata is supported through the Autonomous Data Catalog, which captures definitional metadata for data in databases, data lakes, data platforms, and various catalog-driven systems including Apache Iceberg, AWS Glue, Databricks Unity, and Snowflake Horizon. Instead of users navigating through individual catalogs, databases, and data stores, Oracle organizes the metadata into a single “catalog of catalogs” for ease in finding and understanding the data.
  • Connectivity. Oracle supports direct, fast access to data in Iceberg tables and object storage through the Data Lake Accelerator, which optimizes such access for smooth, even performance.
  • Support for open standards. Although Oracle supports a variety of open standard access methods for Oracle AI Database, the Autonomous AI Lakehouse also provides broad data access capability for Apache Iceberg data, including support for a range of tools in the open-source ecosystem, data access through Apache Spark, Python, Scala, and SQL, and storage of Iceberg tables on object storage. In addition to Iceberg (and its storage format, Parquet), the platform can also access data in CSV, JSON, Avro, and Delta UniForm formats, demonstrating its openness and flexibility for organizations.
  • Access to real-time, operational data. Some data warehouse platforms only ingest and store data that represents past events. This means that users or AI agents of such systems are in fact accessing and making decisions based on stale data. In contrast, Autonomous AI Lakehouse has access to real-time, operational data blended with historical data, empowering decisions and taking actions based on up-to-date information.
  • Performance. Oracle Database, of course, offers comprehensive features and tools for optimal support, but access to Iceberg tables is also highly optimized by the Data Lake Accelerator, which speeds up Iceberg data access for large-scale queries with servers provided on a pay-as-you-go basis. In addition, query performance for frequently accessed Iceberg data is improved by caching that data within Oracle Exadata flash storage. So, although your Iceberg data may reside on object storage, accessing it won’t slow down your complex data queries.
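The "catalog of catalogs" idea above can be sketched in a few lines. This is a hypothetical, greatly simplified model, not Oracle's implementation: the class names, fields, and sample entries are all invented. It shows the core behavior of federating entries from several source catalogs into one searchable index, so users query one place instead of each catalog individually.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    source_catalog: str   # e.g. "AWS Glue", "Unity", "Iceberg" (illustrative)
    dataset: str          # qualified dataset name
    description: str      # definitional metadata about what the data means

class CatalogOfCatalogs:
    """Federates entries from many source catalogs into one index."""
    def __init__(self):
        self._entries = []

    def register(self, entry: CatalogEntry):
        self._entries.append(entry)

    def search(self, term: str):
        # One search spans every registered source catalog.
        term = term.lower()
        return [e for e in self._entries
                if term in e.dataset.lower() or term in e.description.lower()]

federated = CatalogOfCatalogs()
federated.register(CatalogEntry("AWS Glue", "sales.orders", "Retail order facts"))
federated.register(CatalogEntry("Unity", "crm.customers", "Customer master records"))
federated.register(CatalogEntry("Iceberg", "logs.clickstream", "Web clickstream events"))

hits = federated.search("customer")
print([(h.source_catalog, h.dataset) for h in hits])  # [('Unity', 'crm.customers')]
```

A real federated catalog adds synchronization with each source, access control, and lineage, but the user-facing contract is the same: one search surface over many catalogs.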
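The value of blending real-time operational data with historical data can also be illustrated with a minimal, invented sketch. The account names, balances, and overlay logic here are hypothetical stand-ins for a warehouse snapshot and a live operational feed; the point is that a query against the blended view sees current values rather than last night's.

```python
# Hypothetical warehouse snapshot, loaded in last night's batch.
warehouse_snapshot = {
    "ACCT-1": {"balance": 1000, "as_of": "2025-01-01T00:00"},
    "ACCT-2": {"balance": 500,  "as_of": "2025-01-01T00:00"},
}

# Hypothetical operational rows that arrived after the snapshot.
operational_updates = {
    "ACCT-2": {"balance": 750, "as_of": "2025-01-01T09:30"},
}

def current_view(history, live):
    """Overlay live operational rows on the historical snapshot."""
    blended = dict(history)
    blended.update(live)  # the freshest row wins for each key
    return blended

view = current_view(warehouse_snapshot, operational_updates)
print(view["ACCT-2"]["balance"])  # 750, not the stale 500
```

A snapshot-only warehouse would answer 500 here; a user or AI agent acting on that figure would be deciding on stale data.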

Cloud Support Without Limits

While other vendors adapt their platforms to the constraints of different hyperscaler clouds, Oracle Autonomous AI Lakehouse operates with no changes across Oracle Cloud Infrastructure (OCI), AWS, Google Cloud, and Microsoft Azure. This cloud support without limits gives customers the same experience with Autonomous AI Lakehouse wherever their data resides.

Conclusions

Whatever narrow criteria have been applied to application databases and analytic databases in the past need to be revised in light of AI’s need to blend any data, anytime, from anywhere, to provide comprehensive answers to questions both tactical and strategic. This AI hunger for data calls for vendors to build a different kind of data lakehouse, one that, like Autonomous AI Lakehouse, is open, interoperable, and operates in real time, with robust connectivity to operational and application data sources as well as a strong data warehouse platform and a catalog of catalogs, allowing users or AI agents to easily search for and find the data they need.

The data lakehouse that provides the comprehensive information needed to guide the enterprise in the future must be scalable, performant, and robust. Oracle Autonomous AI Lakehouse addresses these needs, and with its integrated AI capabilities it can simplify the task of incorporating AI at every logical level of business management. Anyone looking to build a robust base of data access for their AI and other analytical work should seriously consider Oracle Autonomous AI Lakehouse as their enabling data platform.