Monday Sep 15, 2014

Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL

Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL.  The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.  

Let's take a look at what it takes to get started and then preview this new capability.

Setting up Connections to Hive

The first thing you need to do is set up a JDBC connection to Hive.  Follow these steps to set up the connection:

Download and Unzip JDBC Drivers

Cloudera provides high performance JDBC drivers that are required for connectivity:

  • Download the Hive Drivers from the Cloudera Downloads page to a local directory
  • Unzip the archive
    • unzip Cloudera_HiveJDBC_2.5.4.1006.zip
  • Two zip files are contained within the archive.  Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): 
    • unzip Cloudera_HiveJDBC4_2.5.4.1006.zip -d /home/oracle/jdbc/

Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

Update SQL Developer to use the Cloudera Hive JDBC Drivers

Update the preferences in SQL Developer to leverage the new drivers:

  • Start SQL Developer
  • Go to Tools -> Preferences
  • Navigate to Database -> Third Party JDBC Drivers
  • Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.  It should look like the picture below:
    sql developer preferences

  • Restart SQL Developer

Create a Connection

Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2.  Click the New Connection button in the SQL Developer toolbar.  You'll need to have an ID, password and the port where Hiveserver2 is running:

connect to hiveserver2

The example above is creating a connection called hive which connects to Hiveserver2 on localhost running on port 10000.  The Database field is optional; here we are specifying the default database.
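Behind the scenes those connection fields are assembled into a Hive JDBC URL. Assuming the values from the example above (your host, port and database will differ), the URL takes roughly this form:

    jdbc:hive2://localhost:10000/default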

Using the Hive Connection

The Hive connection is now treated like any other connection in SQL Developer.  The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:

sqldeveloper - view data in hive

And, you can use the SQL Worksheet to run custom queries, perform DDL operations - whatever is supported in Hive:

worksheet

Here, we've altered the definition of a Hive table and then queried that table in the worksheet.
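As an illustration of the kind of statements the worksheet accepts (the column names here are hypothetical), the session above might look something like this:

    -- add a column to the Hive table definition (illustrative names)
    ALTER TABLE movieapp_log_json ADD COLUMNS (browser STRING);

    -- then query the table directly from the worksheet
    SELECT custid, movieid, rating
    FROM movieapp_log_json
    WHERE rating IS NOT NULL
    LIMIT 10;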

Create Big Data SQL-enabled Tables Using Oracle Data Modeler

Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.  Let's create a few tables using the metadata from the Hive Metastore.  Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item.  You will see the same connections found in the SQL Developer connection navigator:

pick a connection

After selecting the hive connection and a database, select the tables to import:

pick tables to import

There could be any number of tables here - in our case we will select three tables to import.  After completing the import, the logical table definitions appear in our palette:

imported tables

You can update the logical table definitions - and in our case we will want to do so.  For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000).  We have domain knowledge and understand that this field is really much smaller - so we'll update it to the appropriate size:

update prop

Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c.  Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:

preview ddl
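To give a sense of what the generated DDL looks like, here is a minimal sketch of a Big Data SQL-enabled external table. The column list, directory and cluster name are illustrative; the real definitions come straight from the Data Modeler:

    -- illustrative only: the actual DDL is generated by the Data Modeler
    CREATE TABLE movieapp_log_json (
      custid    NUMBER,
      movieid   NUMBER,
      activity  NUMBER,
      rating    NUMBER,
      log_time  VARCHAR2(20)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY default_dir
      ACCESS PARAMETERS (
        com.oracle.bigdata.cluster=my_hadoop_cluster
        com.oracle.bigdata.tablename=default.movieapp_log_json
      )
    )
    REJECT LIMIT UNLIMITED;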

Edit the Table Definitions

The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing.  For example, edit table movieapp_log_json:

edit table props

You can update the source cluster for the data, how invalid records should be processed, how to map hive table columns to the corresponding Oracle table columns (if they don't match), and much more.
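Those properties correspond to the com.oracle.bigdata access parameters in the external table definition. As a hedged sketch with illustrative values: the cluster parameter names the source Hadoop cluster, erroropt controls how invalid records are handled, and colmap maps an Oracle column onto a differently named Hive field:

    ACCESS PARAMETERS (
      com.oracle.bigdata.cluster=my_hadoop_cluster
      com.oracle.bigdata.tablename=default.movieapp_log_json
      com.oracle.bigdata.erroropt={"action":"setnull"}
      com.oracle.bigdata.colmap={"col":"log_time", "field":"time"}
    )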

Query All Your Data

You now have full Oracle SQL access to data across the platform.  In our example, we can combine data from Hadoop with data in our Oracle Database.  The data in Hadoop can be in any format - Avro, JSON, XML, CSV - as long as there is a SerDe that can parse the data, Big Data SQL can access it!  Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:

compare to blockbuster movies
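The query behind that result is along these lines (table and column names are illustrative):

    SELECT m.title,
           ROUND(AVG(l.rating), 1) AS avg_rating
    FROM   movieapp_log_json l
           JOIN movie m ON m.movie_id = l.movieid
    WHERE  m.budget > 100000000     -- blockbusters only
    GROUP BY m.title
    ORDER BY avg_rating DESC;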

Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)

Friday Sep 12, 2014

Announcement: Big Data SQL is Generally Available

Oracle Big Data SQL and Big Data Appliance 4.0 are generally available

Big Data Appliance 4.0 receives the following upgrades:

 

  • Big Data SQL: Join data in Hadoop with Oracle Database using Oracle SQL 
    • New external table types for handling data stored in Hadoop 
    • Smart Scan for Hadoop to provide fast query performance 
    • Requires additional license for Big Data SQL 
    • Requires Exadata Database Machine running DB 12.1.0.2 
  • Automated recovery from server failure 
    • This includes migration of master roles to a different server and re-provisioning of a slave node on a server that has been replaced 
  • NoSQL DB multiple-zone configurations on BDA 
    • When adding nodes to a NoSQL DB BDA cluster, they can be added to an existing zone or to a new zone. 
  •  Update to using the 12c ODI Agent on BDA clusters 

For more information on Big Data SQL, check out:

 

 


Your indispensable guide to DW and Big Data at OpenWorld in iBook and PDF formats

There's so much to see and learn at Oracle OpenWorld because it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. 

What to expect at OOW 2014 - We will be announcing a wide range of continuous data warehouse innovations in both hardware and software. Join Oracle experts as we dive deep into the latest generation of data warehouse innovations for analyzing enterprise data and diverse big data streams to derive real business value. You will also learn data warehouse best practices and hear from customers consolidating business analysis onto a common scalable platform. Hands-on labs are available for both beginners and experts giving you the chance to try some of these innovative data warehouse technologies first-hand.

To help you get the most from this year’s event I have put together a comprehensive downloadable guide of all the data warehousing and big data activities at @OracleOpenWorld 2014. If you are a smartphone and/or tablet user then check out our amazing web apps (see previous post OpenWorld on your iPad and iPhone - Now Fully Operational!). If you don't have a tablet or a suitable smartphone, or just want a downloadable booklet, then this guide contains everything to help you get the most from this year’s conference, including the following:

  • Overview of OpenWorld - why you have got to be there!
  • Video Guide to Data Warehousing with Oracle Database 12c
  • Comprehensive day-by-day session calendar
  • List of must-see sessions
  • List of hands-on labs
  • Map of all the most important session and lab venues
  • List of demo pods and guide to demo grounds 
  • Comprehensive presenter biographies
  • Profiles for key Data Warehouse customers
  • Live twitter feeds from your data warehouse product managers
  • Links for more information

iBook Cover PDF Cover
Click here to download Guide in Apple iBook format

Please note that this Apple iBook can be used on any Apple Mac computer or iPad running the iBook application. iPod touch and iPhone users should use the PDF version of this guide.
Click here to download Guide in PDF format

Enjoy @OracleOpenWorld 2014  and if you have time please stop by the Parallel Execution and Analytical SQL demo booth in the demo grounds and say hello.

Thursday Sep 11, 2014

OpenWorld on your iPad and iPhone - Now Fully Operational!

In one of my recent blog posts I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the hands-on lab schedule (it is now live on the OpenWorld site) I have updated my smartphone and tablet apps to include the list of hands-on labs on a day-by-day basis (Monday, Tuesday, Wednesday, Thursday). The list of hands-on labs can still be viewed in subject area order (data warehousing and big data) within the app via the “Switch to subject view” link in the top left part of the screen.

iPad-Labs.jpg iPhone-Labs.jpg

I have also added a location map which can be viewed by clicking on the linked text, “View location map”, which is in the top right part of the screen on each application. The location map that is available within both the tablet and smartphone apps is shown below:

Oow locations

If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames since scrolling within frames does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.

Please note that I have also published online calendars (via my Google account) which can be viewed via the following blog posts:


If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

Wednesday Sep 10, 2014

Online Calendar for Data Warehousing Hands-on Labs at OpenWorld now available

There are so many exciting hands-on labs at this year’s OpenWorld conference and the schedule builder is now live so you can start booking your seat at these labs. To help you get organized and pick the most useful labs to attend I have published a new shared online calendar that contains all the most important data warehouse and big data hands-on labs at this year’s OpenWorld. The following links will allow you to add this shared calendar to your own calendar application:

HOL-Calendar.jpg

Hope this helps you get organised for this year’s incredible conference. Any comments then let me know. The online calendar for all the most important data warehousing and big data sessions is available via my previous blog post “Online Calendar for Data Warehousing Sessions at OpenWorld now available”.

Enjoy.

Thursday Sep 04, 2014

Online Calendar for Data Warehousing Sessions at OpenWorld now available

I have published a shared online calendar that contains all our data warehouse and big data sessions at this year’s OpenWorld. The following links will allow you to add this shared calendar to your own calendar application:

oow-calendar

As soon as the dates and times for the keynote sessions are published I will add these to the calendar so keep checking for updates.

Hope this helps you get organised for this year’s incredible conference. If you have any comments, or if I missed your data warehouse/big data session, let me know and I will add it to the calendar.

Enjoy.

Wednesday Sep 03, 2014

OpenWorld on your iPad and iPhone - UPDATED!

In my last blog post I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the session schedule (it is now live on the OpenWorld site) I have updated my iPhone and iPad apps to include the list of sessions on a day-by-day basis (Sunday, Monday, Tuesday, Wednesday, Thursday). The list of sessions can still be viewed in subject area order (data warehousing, performance and scalability, analytics, unstructured data, industry models and big data) within the app via the “Switch to subject view” link in the top left part of the screen.

Updated-iPad-OOW2014 Updated-iPhone-OOW2014

I have also added a location map which can be viewed by clicking on the linked text, “View location map”, which is in the top right part of the screen on each application. The location map that is available within both the iPad and iPhone apps is shown below:

Oow locations



If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames since scrolling within frames does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.

If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

Monday Aug 25, 2014

OpenWorld on your iPad and iPhone

Most of you probably know that each year I publish a data warehouse guide for OpenWorld which contains links to the latest data warehouse videos, a calendar for the most important sessions and labs and a section that provides profiles and relevant links for all the most important data warehouse presenters. For this year’s conference I have made all this information available via an HTML app that runs on most smartphones and tablets. The pictures below show the HTML app running on iPad and iPhone.

IPad OOW IPhone OOW

This exciting new web app contains information about why you should attend OpenWorld - just in case you have not yet booked your ticket! - as well as the following information:

  • Getting to know 12c - a series of video interviews with George Lumpkin, Vice President of Data Warehouse Product Management
  • Your presenters - full biographies and links to social media sites for all the key data warehouse presenters
  • Must-see sessions - list of all the most important data warehouse presentations at this year’s conference
  • Our customers - profiles of our most important data warehouse customers
  • Must attend labs - list of all the most important data warehouse hands-on labs at this year’s conference
  • Links - a list of links to the most important data warehouse sites

If you want to run these web apps on your smartphone and/or tablet then follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays iframes because scrolling does not work. The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser.

If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld.

Tuesday Jul 22, 2014

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

The benefits of Big Data at StubHub:

  • StubHub enabled data scientists to work directly with customer-related data—such as ticket-purchasing history—inside the database, and to use database options to explore the data graphically, build and evaluate multiple data-mining models, and deploy predictions and insights throughout the enterprise—drastically improving StubHub’s agility and responsiveness
  • Developed highly targeted ticket promotional campaigns and offers by having the ability to calculate 180 million customers’ lifetime value (or propensity) instead of just 20,000 values at a time
  • Used Oracle R Enterprise component of Oracle Advanced Analytics—an Oracle Database option—to reduce a fraud issue by up to 90%

Read more or watch the video:

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data

Introduction

Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider: SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data than ever before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important, if not more so.  As Daniel Abadi recently wrote, the flexibility of schema-on-read gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to really apply schema-on-read principles to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

Yes, but the external table Big Data SQL provides is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes poor parallelism and removes the value of schema-on-read.

The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HDFS is 100 blocks, Oracle Database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement, and it impedes performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data and discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field for every document, we're able to cut down 100s of TB to 100s of GB.
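Expressed in SQL, the query might look something like this (the table and column names are illustrative); Smart Scan for Hadoop evaluates both the predicate and the column projection on the DataNodes, so only the matching email and status values travel to the database:

    SELECT email, status
    FROM   movielog_json
    WHERE  state = 'TX';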

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Summary

So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving what makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.


Monday Jun 30, 2014

My Data Warehousing and Big Data Must-See Guide for OpenWorld

Data Warehousing Must-See Session Guide for Open World 2014
IBook2014

There’s so much to learn at this year’s Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To get the most from this year’s event I have prepared an initial guide which lists all the most important data warehousing and big data sessions. The first part of the guide provides links to videos and content from last year’s event so you can either re-live the highlights from last year or see what you missed by not being there! If you have an iPad or use OSX 10.9 then you will want to download the iBook version because this contains video highlights from last year’s database keynote session.

The session guide is divided into the following chapters:

  • Data Warehousing
  • Performance and Scalability
  • Database Analytics
  • Industry Data Models and Data Integration
  • Unstructured Data
  • Big Data

The last part of the booklet lists all the key data warehouse and big data product managers who will be presenting at this year’s conference. Included alongside each speaker’s biography are links to their social media sites and blogs.

Please note that, as usual, there will be hands-on labs at this year’s OpenWorld but these have not yet been included in the session catalog. I am expecting them to be added shortly. Details of all our labs will appear in the more detailed guide that I will publish before the conference begins.

The Must-See guide is available in two formats: 

For iPad and OSX 10.9 (Mavericks) users please download the iBook, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.ibooks

For all other platforms please download the PDF, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.pdf

 Hope to see you in September at OpenWorld. 

Thursday Jun 26, 2014

Big Data Breakthrough; Watch the Webcast on July 15th

Wednesday Jun 25, 2014

Big Data Appliance now part of Exastack Partner Program

With data growing at unprecedented levels, application providers need a robust, scalable and high-performance infrastructure to manage and analyze data at the speed of business. Helping Independent Software Vendors (ISVs) meet these challenges, Oracle PartnerNetwork (OPN) is today extending its Oracle Exastack program to give partners the opportunity to achieve Oracle Exastack Ready or Optimized status for Oracle Big Data Appliance and Oracle Database Appliance.

The announcement was made today during the sixth annual Oracle PartnerNetwork (OPN) global kickoff event, “Engineered for Success: Oracle and Partners Winning Together.” For a replay of the event, visit: http://bit.ly/1pgDjXL

Here is a word from one of our partners: “TCS was looking for an enterprise grade and reliable Hadoop solution for our HDMS document management application,” said Jayant Dani, principal consultant, Component Engineering Group, TCS.  “We were drawn to Oracle's Big Data Appliance for its ready to use platform, superior performance, scalability, and simple scale-out capabilities.”

Read the entire news release here.

Monday Jun 23, 2014

Part 3: DBAs guide to building and deploying sandboxes

Recently I started a series of blog posts covering the need for a private, secure and safe area for data discovery within the data warehouse ecosystem. The most common name for this type of area is a “sandbox”. The demand for sandboxing is growing rapidly as many companies start exploring the new types of data streams linked to “big data”. In the first of my series of blog posts I started with an overview of the basic elements of sandboxes in the cloud. In the second post I looked at the differences between sandboxes vs. data marts.  As part of the first post I put forward the idea of a sandbox lifecycle model, as shown below, and in this post I am going to explore the first stage of that lifecycle model: the BUILD stage:

Many companies are actively working on lots of different big data led projects: customer sentiment analysis, clickstream analysis, product/service recommendations, real time event monitoring etc. Consequently there are lots of different project teams wanting to evaluate a wide variety of new data sources inside their own private space. They are looking for new customers or segments to target, new product opportunities, new ways to retain customers and ways to predict system faults before they happen. Consequently, DBAs are being flooded with requests to create new sandbox environments. The process of building, or deploying, a sandbox can be based on one of two approaches:

  • Deliver an empty environment: the project team then finds the required data sources and loads them into the sandbox before starting their analysis
  • Deliver a pre-populated environment: the project team loads any additional new data sources into the sandbox before starting their analysis

At the moment we can easily support both of these scenarios. Below is a review of the features available in earlier, pre-12c, versions of the database followed by new features that are part of Database 12c. Hopefully, if you are already using sandboxes and looking for a reason to move to Database 12c then this post will help you make the case for moving your data warehouse to 12c.

Building Sandboxes pre-Database 12c

There are a number of ways to create sandboxes using pre-12c versions of the Oracle Database. The two most popular choices that I have seen used are Database Configuration Assistant and Workspace Manager. 

Database Configuration Assistant

Creating an empty sandbox is relatively easy for DBAs and business users. The Database Configuration Assistant (DBCA) provides a wizard for creating and configuring an empty, but ready to go, database. This tool is very flexible and allows DBAs to control all aspects of the database creation process. For most business users the wizard provides far too many parameters to configure so Oracle provides three default configuration templates which provide default values for all the main configuration parameters. The pre-built templates include: general purpose, data warehouse and custom. The DBA can add additional templates for specific types of sandboxes by simply creating new template files and this is covered in the database documentation (see here: http://docs.oracle.com/cd/E16655_01/server.121/e17643/install.htm#BABEGFCG). The information in each template includes details of database options, initialisation parameters, and storage attributes (for data files, tablespaces, control files, and online redo logs). A common approach might be to create templates to configure large, medium and small sandboxes with appropriate settings for the specific hardware platform being used.

Once the new sandbox is ready the business users/data scientists can then start loading data. I am not going to cover the data loading process here, however, please note that this can be done using a data integration tool such as Oracle Data Integrator or flat files. In fact ODI now has a scripting language called Groovy which allows ETL developers to create an easy-to-use dialog for selecting a table/view from a source system and copying its data to a target schema. For more information about ODI’s Groovy scripting language check out the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy.

Overall, creating empty sandboxes using DBCA is a simple process. However, this approach does mean that data is replicated from the original database and each instance requires its own dedicated hardware resources (I/O, CPU and memory) which cannot be easily transferred to other sandboxes. Every time a new sandbox is required the DBA and systems administrators have to ensure that sufficient memory for managing the data along with all the typical database background processes, storage and CPU are available. As more and more sandboxes are requested and deployed these replicated overheads quickly consume the available server resources and this limits the number of sandboxes that can be run on a specific platform. Having dedicated, non-sharable, resources is extremely limiting and very costly.

Many teams find that trying to get approval to create a new database is a long and painful process. Sometimes it can take months to work through the bureaucracy and ensure all the pieces are in place ready to create the new database: storage, processing nodes, network connectivity etc.

Key Benefits: relatively simple process for creating new, empty, databases;

Key issues: requires dedicated system resources; cannot copy an existing database with its schemas and data; no simple mechanism for moving discoveries (objects, data etc) to production environment; may take a long time to provision due to internal bureaucracy

Workspace Manager

If projects require pre-populated sandbox environments then creating a sandbox with access to existing data is best done using Oracle Workspace Manager. This provides the ability to manage current, proposed and historical versions of data within the same database instance so there is no need to move data. Workspace Manager creates a self-contained virtual area within a database where users can update existing data without affecting the original data. This virtual area isolates all the changes until they are explicitly merged with production data or discarded. This means there is no need to duplicate existing data and any new data can be loaded in a controlled way without impacting the production data.

Users in a workspace always see a transactionally consistent view of the entire database; that is, they see changes made in their current workspace plus the original data in the database as it existed either when the workspace was created or when the workspace was last refreshed with changes from the parent workspace. Workspace Manager even supports a hierarchy of workspaces for very complex sandbox requirements involving multiple teams and/or project goals.

There is a great example of how to use Workspace Manager for what-if analysis on the Oracle Learning Library. The step-by-step tutorial shows you how to use workspaces to do location-based analysis on current and proposed data. The fictitious company MyCompany is planning to build a new warehouse to provide better service to its customers. Two potential sites are under consideration. The company wants to use SQL to analyse the prospective sites. To do this, data for both prospective sites must be entered into the production warehouse schema. However, this new data must be kept separate from the existing production data so that it does not impact the work of employees who are not part of the site selection team. Using Workspace Manager it is possible to isolate data for each of the prospective sites and allow the two site selection teams to work concurrently. This use case creates the following hierarchy of workspaces, as shown below:

Version-enabled warehouse table and associated team workspaces: SITE1, SITE2, and LIVE.

You can take this sandbox tutorial by following the steps on the Oracle Learning Library page, see here: https://apex.oracle.com/pls/apex/f?p=44785:24:15780601154836::NO:24:P24_CONTENT_ID%2CP24_PREV_PAGE:4553%2C29
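As a rough sketch of the mechanics behind that tutorial (the table name here is hypothetical; the workspace names follow the tutorial), the DBMS_WM package drives the whole lifecycle:

    -- version-enable the shared warehouse table (hypothetical table name)
    EXECUTE DBMS_WM.EnableVersioning('WAREHOUSE_SITES');

    -- create an isolated workspace for each site-selection team
    EXECUTE DBMS_WM.CreateWorkspace('SITE1');
    EXECUTE DBMS_WM.CreateWorkspace('SITE2');

    -- a team member switches into their workspace and edits proposed data
    EXECUTE DBMS_WM.GotoWorkspace('SITE1');

    -- later, merge the proposed changes back into LIVE or discard them
    EXECUTE DBMS_WM.MergeWorkspace('SITE1');
    -- or: EXECUTE DBMS_WM.RollbackWorkspace('SITE1');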

One of the challenges of using workspaces relates to resource management and the ability to increase or decrease resources which is important for overall resource utilisation.

Obviously at some point users will want to add new data into the workspace and, as with the DBCA approach above, ETL developers can use ODI’s Groovy scripting language to create an easy-to-use dialog for selecting a table/view from a source system and copying its data to a target schema. For more information about ODI’s Groovy scripting language check out the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy

Key Benefits: relatively simple process; no need to copy/clone data; easy to move discoveries to production

Key issues: requires careful workload management; monitoring and chargeback is difficult to achieve

For more information about Oracle Workspace Manager visit the product home page on OTN: http://www.oracle.com/technetwork/database/enterprise-edition/index-087067.html.

Summary 

In general, it is likely that most customers will use both approaches for delivering sandboxes to their project teams. Where there are hundreds of projects taking place at the same time then both approaches will probably create additional work for the DBA and systems administrators in terms of having to monitor and track lots of workspaces and/or plan-monitor-track resource usage across lots of database instances. Ideally what we need is an approach that can easily support the deployment of thousands of sandboxes, offers simple and fast provisioning, leaves data in place to prevent unnecessary movement of large volumes of data, provides a single centrally managed platform and, most importantly, allows for expansion/reduction of resources as needed across all sandboxes. Is this possible? 

Building Sandboxes with Database 12c

The good news is that the new features we introduced as part of Database 12c combine all the best parts of the sandboxing features from the pre-12c approach with none of the drawbacks. Specifically, Database 12c introduced the concept of a multitenant database. This was originally positioned as a way to consolidate databases and applications under the control of a single environment but the approach works amazingly well for sandboxes. Therefore, with 12c you can now deploy sandboxes into the cloud!

This new feature (Oracle Multitenant) delivers a new architecture that provides a multitenant container database to hold a series of pluggable databases, or in this case sandboxes. There is a great quick overview of the new multitenant feature available on YouTube, follow the link below to watch the video:

Video: Oracle Multitenant Architecture

With the new multitenant architecture featuring pluggable databases (sandboxes) it is now possible to create a single container database that allows multiple sandboxes to be simply plugged in. Below is a simple overview of the main multitenant terminology. When we create a new multitenant sandbox environment we plug our pluggable sandboxes into the root database. The combination of root database and pluggable sandboxes (PDBs) is referred to as a “Container Database” or CDB as shown below:

Architecture

The great thing about the multitenant feature is that it offers a very fast and efficient way of creating new sandboxes. It is a bit like the template feature in Database Configuration Assistant with one important difference: deploying a new pluggable sandbox takes seconds as shown here:

Clone

The above is taken from the Multitenant infographic, which is here: http://www.oracle.com/us/products/database/oracle-multitenant-infographic-1961308.pdf. The actual speed at which we can deploy pluggable sandboxes is outlined in the graph below:

P3

The first bar on the graph shows the time taken to create a “standard” database, which involves copying template data files and then configuring the instance, which can take a considerable amount of time. The second bar shows the time taken to create a new pluggable sandbox, which is significantly faster. The third bar shows that we can deliver pluggable sandboxes in real time by using the built-in “clone” feature - this allows us to deploy a new pluggable sandbox which is able to “see” the source data in the PDB which acted as the source for the clone. The cloned pluggable sandbox also gets access to new/refreshed data as it is changed in the source system. This means that our cloned pluggable sandbox is always up to date with the very latest data.

What is clone?

Provisioning a sandbox that is a copy of an existing database so the project teams have access to existing data sets used to be a lengthy multistep process.  Multitenant provides a capability called cloning. It is now possible to create local clones (from a pluggable database in the same container database) or remote clones (across a database link to a remote container database). What are the benefits of using this feature? Firstly, it is very simple - a single SQL statement. Secondly, it keeps the overall security requirements nice and simple because all that is required is database access and not OS access. Lastly, business users and data scientists like it because the whole process is so fast. The process of creating full clones is fast. Even faster—ludicrously fast—is the process for creating snapshot clones. Snapshot clones are built on a capability of the underlying file system—where available—called copy-on-write. This makes it possible to request and deploy a pre-populated pluggable sandbox in a matter of seconds!
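A minimal sketch of the SQL involved (the sandbox names are illustrative, and this assumes the source PDB is open read-only as 12c requires for cloning, with file name conversion handled for you, for example via Oracle Managed Files):

    -- full local clone of an existing pluggable sandbox
    CREATE PLUGGABLE DATABASE sandbox2 FROM miketest;

    -- snapshot clone: copy-on-write on a supporting file system, near-instant
    CREATE PLUGGABLE DATABASE sandbox3 FROM miketest SNAPSHOT COPY;

    -- open the new sandbox for use
    ALTER PLUGGABLE DATABASE sandbox2 OPEN;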

Now we understand the overall architecture and speed at which we can deploy new sandboxes, how do you actually request a new sandbox/PDB?

Self-Service provisioning of sandboxes

In prior releases of the Oracle Database a lot of the tools and features that were used for sandboxing were typically aimed at DBAs. Today, many business users and data scientists want to control and own the process of requesting and building a sandbox. As part of Database 12.1.0.2 we have provided a new self-service application that DBAs can make available to their business teams for self-service provisioning.  This new application provides a simple 4-step process for creating a new pluggable sandbox:

Step 1: Start the APEX self-service provisioning app and click on the “New Database” link:

Build 1

Step 2: Select to create a new sandbox/pluggable database or plug-in an existing database that has been moved from another container. In the majority of cases business users will typically provision a new sandbox:

Build 2

Step 3: Determine if you are going to create a new empty sandbox or clone an existing database (mart or warehouse or existing sandbox). In the example below we have selected an existing sandbox, MIKETEST, and the option to clone this database is available at the bottom of the screen.

Build 3

Step 4: Check the details and confirm the deployment. 

Build 4

As mentioned earlier the speed at which pluggable sandboxes can be created means that business users can request a new pluggable sandbox and have the environment ready for use in a matter of seconds (of course your mileage might vary according to your hardware!).

Workspace Manager and Multitenant

In many cases there will still be a need to have an environment where the same data sets can be shared across multiple project teams, with each team working independently on their own view of the data. This is exactly the sort of use case for Workspace Manager, as described earlier, and the good news is that Workspace Manager functions transparently in a multitenant architecture. Workspaces benefit from the efficient administration of one multitenant container database, and the separation and resource prioritisation allowed by multiple pluggable databases.

SQL Developer and Multitenant

For data scientists and more sophisticated business users the latest version of SQL Developer is now multitenant aware. Therefore, rather than using the self-service application, all the required operations for creating and managing a sandbox are now available from within SQL Developer as shown here:

SQLDeveloper 1

From within SQL Developer you can deploy a sandbox (CREATE PLUGGABLE or CLONE PLUGGABLE), move a sandbox (UNPLUG and PLUG) and delete a sandbox (DROP PLUGGABLE):

SQLDeveloper 2
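Under the covers those operations map onto SQL along these lines (names and paths are illustrative):

    -- move a sandbox: unplug it into an XML manifest, then plug it in elsewhere
    ALTER PLUGGABLE DATABASE sandbox2 CLOSE;
    ALTER PLUGGABLE DATABASE sandbox2 UNPLUG INTO '/u01/sandboxes/sandbox2.xml';

    -- in the target container database
    CREATE PLUGGABLE DATABASE sandbox2 USING '/u01/sandboxes/sandbox2.xml';

    -- delete a sandbox that is no longer needed (it must be closed first)
    DROP PLUGGABLE DATABASE old_sandbox INCLUDING DATAFILES;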

SQL Developer offers fine-grained control and most business users and data scientists will probably need some guidance from their DBA on the various parameter settings for deploying a new sandbox. For more information about using SQL Developer to manage (create, modify, plug/unplug) pluggable sandboxes see the online tutorial on the Oracle Learning Library: https://apex.oracle.com/pls/apex/f?p=44785:24:113456037406764:::24:P24_CONTENT_ID%2CP24_PROD_SECTION_GRP_ID%2CP24_PREV_PAGE:7649%2C%2C24

Summary

Database 12c offers significant advantages for delivering sandboxes - deploying pluggable sandboxes into the cloud is now simple and really fast. The fast and efficient deployment of pluggable sandboxes can be delivered as a self-service approach using a variety of tools and applications. These new pluggable sandboxes also support the existing data isolation features such as Workspaces. Overall, Database 12c makes sandboxing faster, simpler and more efficient.

Conclusion

In this latest post on sandboxing I have examined the “Build” phase of our BOX’D sandbox lifecycle. For customers who have not yet moved to Database 12c there are a number of tools and features that can be used to deploy sandboxes. Customers who have moved to Database 12c can use the new multitenant feature alongside some of the earlier features to deliver real-time deployment of pre-populated and empty sandboxes.

Thursday Jun 05, 2014

Globacom and mCentric Deploy BDA and NoSQL Database to analyze network traffic 40x faster

In a fast evolving market, speed is of the essence. mCentric and Globacom leveraged Oracle Big Data Appliance and Oracle NoSQL Database to save over 35,000 call-processing minutes daily and analyze network traffic 40x faster.

Here are some highlights from the profile:

Why Oracle

“Oracle Big Data Appliance works well for very large amounts of structured and unstructured data. It is the most agile events-storage system for our collect-it-now and analyze-it-later set of business requirements. Moreover, choosing a prebuilt solution drastically reduced implementation time. We got the big data benefits without needing to assemble and tune a custom-built system, and without the hidden costs required to maintain a large number of servers in our data center. A single support license covers both the hardware and the integrated software, and we have one central point of contact for support,” said Sanjib Roy, CTO, Globacom.

Implementation Process

It took only five days for Oracle partner mCentric to deploy Oracle Big Data Appliance and perform the software install and configuration, certification, and resiliency testing. The entire process—from site planning to phase-I go-live—was executed in just over ten weeks, well ahead of the four months allocated to complete the project.

Oracle partner mCentric leveraged Oracle Advanced Customer Support Services’ implementation methodology to ensure configurations are tailored for peak performance, all patches are applied, and software and communications are consistently tested using proven methodologies and best practices.

Read the entire profile here.
