When I took wood shop back in eighth grade, my shop teacher taught us to create a design for our project before we started building it. We captured the design in what was called a working drawing. In those days it was a neat hand sketch showing shapes and dimensions from different perspectives, and it provided enough information to cut and assemble the wood project.
The big data solutions we work with today are much more complex and built with layers of technology and collections of services, but we still need something like working drawings to see how the pieces fit together.
Solution patterns (sometimes called architecture patterns) are a form of working drawing that help us see the components of a system and where they integrate but without some of the detail that can keep us from seeing the forest for the trees. That detail is still important, but it can be captured in other architecture diagrams.
In this blog I want to introduce some solution patterns for data lakes. (If you want to learn more about what data lakes are, read "What Is a Data Lake?") Data lakes have many uses and play a key role in providing solutions to many different business problems.
The solution patterns described here show some of the different ways data lakes are used in combination with other technologies to address some of the most common big data use cases. I’m going to focus on cloud-based solutions using Oracle’s platform (PaaS) cloud services.
These are the patterns: Data Science Lab, Data Warehouse ETL Offload, Big Data Advanced Analytics, and Stream Analytics.
Let’s start with the Data Science Lab use case. We call it a lab because it’s a place for discovery and experimentation using the tools of data science. Data Science Labs are important for working with new data, for working with existing data in new ways, and for combining data from different sources that are in different formats. The lab is the place to try out machine learning and determine the value in data.
Before describing the pattern, let me provide a few tips on how to interpret the diagrams. Each blue box represents an Oracle cloud service. A smaller box attached under a larger box represents a required supporting service that is usually transparent to the user. Arrows show the direction of data flow but don’t necessarily indicate how the data flow is initiated.
The data science lab contains a data lake and a data visualization platform. The data lake is a combination of object storage plus the Apache Spark™ execution engine and related tools contained in Oracle Big Data Cloud. Oracle Analytics Cloud provides data visualization and other valuable capabilities like data flows for data preparation and blending relational data with data in the data lake. It also uses an instance of the Oracle Database Cloud Service to manage metadata.
The data lake object store can be populated by the data scientist using an OpenStack Swift client or the Oracle Software Appliance. If automated bulk upload of data is required, Oracle offers data integration capabilities for any need, as described in the other solution patterns. The object storage used by the lab can be dedicated to the lab or shared with other services, depending on your data governance practices.
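To make the population step concrete, here is a minimal sketch of staging raw files into the lab's object store. An in-memory class stands in for the Swift-style container/object model so the flow is runnable anywhere; the container and object names are illustrative, and a real upload would go through an OpenStack Swift client against your cloud account.

```python
class ObjectStore:
    """In-memory stand-in for a Swift-style object store (container -> objects)."""
    def __init__(self):
        self.containers = {}

    def put_object(self, container, name, data):
        self.containers.setdefault(container, {})[name] = data

    def list_objects(self, container):
        return sorted(self.containers.get(container, {}))

store = ObjectStore()
# Stage two raw data files into a container dedicated to the lab.
store.put_object("datalake-lab", "raw/clicks-day1.csv", b"user,page\n1,home\n")
store.put_object("datalake-lab", "raw/clicks-day2.csv", b"user,page\n2,cart\n")

print(store.list_objects("datalake-lab"))
# ['raw/clicks-day1.csv', 'raw/clicks-day2.csv']
```

Whether the real container is dedicated to the lab or shared, the same put/list flow applies; only the governance around who may write to it changes.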
Data warehouses are an important tool for enterprises to manage their most important business data as a source for business intelligence. Data warehouses, being built on relational databases, are highly structured. Data therefore must often be transformed into the desired structure before it is loaded into the data warehouse.
In some cases this transformation processing can become a significant load on the data warehouse, driving up the cost of operation. Depending on the level of transformation needed, offloading that processing to other platforms can both reduce operational costs and free up data warehouse resources to focus on its primary role of serving data.
Oracle’s Data Integration Platform Cloud (DIPC) is the primary tool for extracting, loading, and transforming data for the data warehouse. Oracle Database Cloud Service provides required metadata management for DIPC. Using Extract-Load-Transform (E-LT) processing, data transformations are performed where the data resides.
For cases where additional transformation processing is required before loading (Extract-Transform-Load, or ETL), or new data products are going to be generated, data can be temporarily staged in object storage and processed in the data lake using Apache Spark™. This also provides an opportunity to extend the data warehouse using technology to query the data lake directly, a capability of Oracle Autonomous Data Warehouse Cloud.
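The offloaded transform step above can be sketched as follows. A plain list stands in for the staged data in object storage, and a Python function stands in for the Spark job; the field names and validation rules are illustrative assumptions, not any Oracle schema.

```python
# Raw records staged in object storage before the warehouse load.
raw_staged = [
    {"order_id": "1001", "amount": "19.99", "country": "us"},
    {"order_id": "1002", "amount": "bad-data", "country": "DE"},
    {"order_id": "1003", "amount": "5.00", "country": "fr"},
]

def transform(records):
    """Offloaded transform: validate amounts, normalize country codes."""
    clean = []
    for r in records:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # route bad rows away from the warehouse load
        clean.append({"order_id": int(r["order_id"]),
                      "amount": amount,
                      "country": r["country"].upper()})
    return clean

# Only the cleaned rows are handed to the warehouse load step.
warehouse_ready = transform(raw_staged)
print(warehouse_ready)
```

The point of the pattern is that this validation and reshaping consumes data lake compute, not warehouse compute, so the warehouse stays focused on serving queries.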
Advanced analytics is one of the most common use cases for a data lake: operationalizing the analysis of data using machine learning, geospatial, and/or graph analytics techniques. Big data advanced analytics extends the Data Science Lab pattern with enterprise grade data integration.
Also, whereas a lab may use fewer processors and less storage, the advanced analytics pattern supports a system scaled up to the demands of the workload.
Oracle Data Integration Platform Cloud provides a remote agent to capture data at the source and deliver it to the data lake either directly to Spark in Oracle Big Data Cloud or to object storage. The processing of data here tends to be more automated through jobs that run periodically.
Results are made available to Oracle Analytics Cloud for visualization and consumption by business users and analysts. Results like machine learning predictions can also be delivered to other business applications to drive innovative services and applications.
The Stream Analytics pattern is a variation of the Big Data Advanced Analytics pattern that is focused on streaming data. Streaming data brings with it additional demands because the data arrives as it is produced and often the objective is to process it just as quickly.
Stream Analytics is used to detect patterns in transactions, like detecting fraud, or to make predictions about customer behavior like propensity to buy or churn. It can be used for geo-fencing to detect when someone or something crosses a geographical boundary.
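The geo-fencing idea above reduces to detecting when consecutive points in a stream cross a boundary. Here is a minimal sketch using a simple axis-aligned bounding box as the fence; a production system would use real geospatial tooling, and the coordinates are made up for illustration.

```python
# Illustrative fence: a lat/lon bounding box (not a real deployment boundary).
FENCE = {"min_lat": 37.70, "max_lat": 37.82, "min_lon": -122.52, "max_lon": -122.35}

def inside(lat, lon):
    """True if the point falls within the fence."""
    return (FENCE["min_lat"] <= lat <= FENCE["max_lat"]
            and FENCE["min_lon"] <= lon <= FENCE["max_lon"])

def crossings(track):
    """Return (index, direction) each time consecutive points cross the fence."""
    events = []
    for i in range(1, len(track)):
        was, now = inside(*track[i - 1]), inside(*track[i])
        if was != now:
            events.append((i, "exit" if was else "enter"))
    return events

track = [(37.75, -122.45), (37.76, -122.44), (37.85, -122.44), (37.78, -122.43)]
print(crossings(track))  # [(2, 'exit'), (3, 'enter')]
```

In the streaming pattern, each crossing event would be emitted immediately as the point arrives, rather than computed over a stored track.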
Business transactions are captured at the source using the Oracle Data Integration Platform Cloud remote agent and published to an Apache Kafka® topic in Oracle Event Hub Cloud Service. The Stream Analytics Continuous Query Language (CQL) engine running on Spark subscribes to the Kafka topic and performs the desired processing like looking for specific events, responding to patterns over time, or other work that requires immediate action.
Other data sources that can be fed directly to Kafka, like public data feeds or mobile application data, can be processed by business-specific Spark jobs. Results like detected events and machine learning predictions are published to other Kafka topics for consumption by downstream applications and business processes.
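As a rough sketch of the kind of pattern-over-time detection the CQL engine performs on the transaction stream, the example below flags a card when too many transactions arrive within a sliding time window. A list of (timestamp, card_id) tuples stands in for the Kafka topic, and the thresholds are illustrative assumptions.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60       # illustrative sliding window
MAX_TXNS_IN_WINDOW = 3    # illustrative burst threshold

def detect_bursts(events):
    """Return card ids with more than MAX_TXNS_IN_WINDOW txns in WINDOW_SECONDS."""
    recent = defaultdict(deque)   # card_id -> timestamps in the current window
    flagged = set()
    for ts, card in events:       # events assumed to arrive in timestamp order
        window = recent[card]
        window.append(ts)
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()      # slide the window forward
        if len(window) > MAX_TXNS_IN_WINDOW:
            flagged.add(card)     # in the pattern, published to an alerts topic
    return flagged

stream = [(0, "A"), (10, "A"), (20, "B"), (25, "A"), (40, "A"), (300, "A")]
print(detect_bursts(stream))  # {'A'}
```

In the actual pattern, the detected events would be published to another Kafka topic for downstream applications rather than collected into a set.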
The four different solution patterns shown here support many different data lake use cases, but what happens if you want a solution that includes capabilities from more than one pattern? You can have it. Patterns can be combined, but the cloud also makes it easy to have multiple Oracle Big Data Cloud instances for different purposes with all accessing data from a common object store.
Now you’ve seen some examples of how Oracle Platform Cloud Services can be combined in different ways to address different classes of business problem. Use these patterns as a starting point for your own solutions. And even though it’s been a few years since eighth grade, I still enjoy woodworking and I always start my projects with a working drawing.
If you're ready to test these data lake solution patterns, try Oracle Cloud for free with a guided trial, and build your own data lake.