Thursday Sep 11, 2014

OpenWorld on your iPad and iPhone - Now Fully Operational!

In one of my recent blog post I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the hands-on lab schedule (it is now live on the OpenWorld site) I have updated my smartphone and tablet apps to include the list of hands-on labs on a day-by-day basis (Monday Tuesday, Wednesday, Thursday). The list of hands-on labs can still be viewed in subject area order (data warehousing and big data) within the app via the “Switch to subject view” link in the top left part of the screen. 

iPad-Labs.jpg iPhone-Labs.jpg

I have also added a location map which can be viewed by clicking on the linked-text, “View location map", which is in the top right part of the screen on each application. The location map that is available within both the tablet and smartphone apps is shown below:

Oow locations

If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then  follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames since scrolling within frames does not work . The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices. 

Please note that I have also published online calendars (via my Google account) which can viewed via the following blog posts:


If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

Wednesday Sep 10, 2014

Online Calendar for Data Warehousing Hands-on Labs at OpenWorld now available

There so many exciting hands-on labs at this years OpenWorld conference and the schedule builder is now live so you can start booking your seat at these labs. To help you get organized and pick the most useful labs to attend I have published a new shared online calendar that contains all the most important data warehouse and big data hands-on labs at this year’s OpenWorld. The following links will allow you to add this shared calendar to your own calendar application:

HOL-Calendar.jpg

Hope this helps you get organised for this year’s incredible conference. Any comments then let me know. The online calendar for all the most important data warehousing and big sessions is available via my previous blog post “Online Calendar for Data Warehousing Sessions at OpenWorld now available”. 

Enjoy.

Thursday Sep 04, 2014

Online Calendar for Data Warehousing Sessions at OpenWorld now available

I have published a shared online calendar that contains all our data warehouse and big data sessions at this year’s OpenWorld. The following links will allow you to add this shared calendar to your own calendar application:

oow-calendar

As soon as the dates and times for the keynote sessions are published I will add these to the calendar so keep checking for updates.

Hope this helps you get organised for this year’s incredible conference. Any comments then let me know and if I missed your data warehouse/big data session then let me know and I will add it to the calendar.

Enjoy.

Wednesday Sep 03, 2014

OpenWorld on your iPad and iPhone - UPDATED!

In my last blog post I provided links to our OpenWorld data warehouse web app for smartphones and tablets. Now that the OOW team has released the session schedule (it is now live on the OpenWorld site) I have updated my iPhone and iPad apps to include the list of sessions on a day-by-day basis (Sunday, Monday Tuesday, Wednesday, Thursday). The list of sessions can still be viewed in subject area order (data warehousing, performance and scalability, analytics, unstructured data, industry models and big data) within the app via the “Switch to subject view” link in the top left part of the screen.

Updated-iPad-OOW2014 Updated-iPhone-OOW2014

I have also added a location map which can be viewed by clicking on the linked-text, “View location map", which is in the top right part of the screen on each application. The location map that is available within both the iPad and iPhone apps is shown below:

Oow locations



If you want to run these updated web apps on your smartphone and/or tablet then you can reuse the existing links that I published on my last blog post. If you missed that post then  follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays frames since scrolling within frames does not work . The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser on Samsung devices.

If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld and, if you have time, it would be great to see you if you want to stop by at the Parallel Execution and Analytical SQL demo booth.

Monday Aug 25, 2014

OpenWorld on your iPad and iPhone

Most of you probably know that each year I publish a data warehouse guide for OpenWorld which contains links to the latest data warehouse videos, a calendar for the most important sessions and labs and a section that provides profiles and relevant links for all the most important data warehouse presenters. For this year’s conference I have made all this information available via an HTML app that runs on most smartphones and tablets. The pictures below show the HTML app running on iPad and iPhone.

IPad OOW IPhone OOW

This exciting new web app contains information about why you should attend OpenWorld - just in case you have not yet booked your ticket! - as well as the following information:

  • Getting to know 12c - a series of video interviews with George Lumpkin, Vice President of Data Warehouse Product Management
  • Your presenters - full biographies and links to social media sites for all the key data warehouse presenters
  • Must sees sessions - list of all the most important data warehouse presentations at this year’s conference
  • Our customers - profiles our most important data warehouse customers
  • Must attend labs - list of all the most important data warehouse hands-on labs at this year’s conference
  • Links - a list of links to the most important data warehouse sites

If you want to run these web apps on your smartphone and/or tablet then follow these links:

Android users: I have tested the app on Android and there appears to be a bug in the way the Chrome browser displays iframes because scrolling does not work . The app does work correctly if you use either the Android version of the Opera browser or the standard Samsung browser.

If you have any comments about the app (content you would like to see) then please let me know. Enjoy OpenWorld.

Tuesday Jul 22, 2014

StubHub Taps into Big Data for Insight into Millions of Customers’ Ticket-Buying Patterns, Fraud Detection, and Optimized Ticket Prices

The benefits of Big Data at Stubhub:

  • Stubhub enabled data scientists to work directly with customer-related data—such as ticket-purchasing history—inside the database, and to use database options to explore the data graphically, build and evaluate multiple data-mining models, and deploy predictions and insights throughout the enterprise—drastically improving StubHub’s agility and responsiveness
  • Developed highly targeted ticket promotional campaigns and offers by having the ability to calculate 180 million customers’ lifetime value (or propensity) instead of just 20,000 values at a time
  • Used Oracle R Enterprise component of Oracle Advanced Analytics—an Oracle Database option—to reduce a fraud issue by up to 90%

Read more or watch the video:

Tuesday Jul 15, 2014

Oracle Big Data SQL: One Fast Query, All Your Data

Introduction

Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

SQL on Hadoop

As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider, SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.

SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

SQL On More Than Hadoop

For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data then every before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

This is why we think Oracle Big Data SQL is obvious too.

It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

  • Maintain the valuable characteristics of the system storing the data
  • Unify metadata to understand how to execute queries
  • Optimize execution to take advantage of the systems storing the data

For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important if not more.  As Daniel Abadi recently wrote, the flexibility of schema-on-read is gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to really apply schema-on-read principals to data stored in Hadoop.

Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

Unifying Metadata

To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

 Yes, but Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes both poor parallelism and removes the value of schema-on-read.

  The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HFDS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

Optimizing Performance

Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into to the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

Smart Scan for Hadoop

Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement and impeding performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

For example, suppose we were querying a 100 of TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data, discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.

The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

Summary

So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving the makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

It's SQL that plays data where it lies, letting you place data where you think it belongs.

[Read More]

Monday Jun 30, 2014

My Data Warehousing and Big Data Must-See Guide for OpenWorld

Data Warehousing Must-See Session Guide for Open World 2014
IBook2014

There’s so much to learn at this year’s Oracle OpenWorld - it provides more educational and networking opportunities than any other conference dedicated to Oracle business and technology users. To get the most from this year’s event I have prepared an initial guide which lists all the must important data warehousing and big data sessions. The first part of the guide provides links to videos and content from last year’s event so you can either re-live the highlights from last year or see what you missed by not being there! If you have an iPad or use OSX 10.9 then you will want to download the iBook version because this contains video highlights from last year’s database keynote session.

The session guide is divided into the following chapters:

  • Data Warehousing
  • Performance and Scalability
  • Database Analytics
  • Industry Data Models and Data Integration
  • Unstructured Data
  • Big Data

The last part of the booklet lists all the key data warehouse and big data product managers who will be presenting at this year’s conference. Included alongside each speaker’s biography are links to their social media sites and blogs.

Please note that, as usual, there will be hands-on labs at this year’s OpenWorld but these have not been included at in the session catalog. I am expecting them to be added shortly. Details of all our labs will appear in the more detailed guide that I will publish before the conference begins.

The Must-See guide is available in two formats: 

For iPad and OSX 10.9 (Mavericks) users please download the iBook, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.ibooks

For all other platforms please download the PDF, which is available here: https://dl.dropboxusercontent.com/u/69436560/OOW/Must-See%20OpenWorld%202014_Sessions.pdf

 Hope to see you in September at OpenWorld. 

Thursday Jun 26, 2014

Big Data Breakthrough; Watch the Webcast on July 15th

Wednesday Jun 25, 2014

Big Data Appliance now part of Exastack Partner Program

With data growing at unprecedented levels, application providers need a robust, scalable and high-performance infrastructure to manage and analyze data at the speed of business. Helping Independent Software Vendors (ISVs) meet these challenges, Oracle PartnerNetwork (OPN) is today extending its Oracle Exastack program to give partners the opportunity to achieve Oracle Exastack Ready or Optimized status for Oracle Big Data Appliance and Oracle Database Appliance.

The announcement was made today during the sixth annual Oracle PartnerNetwork (OPN) global kickoff event, “Engineered for Success: Oracle and Partners Winning Together.” For a replay of the event, visit: http://bit.ly/1pgDjXL

Here is a word from one of our partners: “TCS was looking for an enterprise grade and reliable Hadoop solution for our HDMS document management application,” said Jayant Dani, principal consultant, Component Engineering Group, TCS.  “We were drawn to Oracle's Big Data Appliance for its ready to use platform, superior performance, scalability, and simple scale-out capabilities.”

Read the entire news release here.

Monday Jun 23, 2014

Part 3: DBAs guide to building and deploying sandboxes

Recently I started a series of blog posts covering the need for a private, secure and safe area for data discovery within the data warehouse ecosystem. The most common name for this type area is a “sandbox”. The demand for sandboxing is growing rapidly as many companies start exploring the new types of data streams linked to “big data”. In the first of my series of blog posts I started with an overview of the basic elements of sandboxes in the cloud. In the second post I looked at the differences between sandboxes vs. data marts.  As part of the first posts I put forward the idea of a sandbox lifecycle model, as shown below, and in this post I am going to explore the first stage of that lifecycle model: the BUILD stage:

Many companies are actively working on lots of different big data led projects: customer sentiment analysis, clickstream analysis, product/service recommendations, real time event monitoring etc. Consequently there are lots of different project teams wanting to evaluate a wide variety of new data sources inside their own private space. They are looking for new customers or segments to target, new product opportunities, new ways to retain customers and ways to predict system faults before they happen. Consequently, DBAs are being flooded with requests to create new sandbox environments. The process of building, or deploying, a sandbox can be based on one of two approaches:

  • Deliver an empty environment: Project team then finds the required data sources and loads them into the sandbox before starting their analysis
  • Delivered a pre-populated environment: Project team loads any additional new data sources into the sandbox before starting their analysis

At the moment we can easily support both of these scenarios. Below is a review of the features available in earlier, pre-12c, versions of the database followed by new features that are part of Database 12c. Hopefully, if you are already using sandboxes and looking for a reason to move to Database 12c then this post will help you make the case for moving your data warehouse to 12c.

Building Sandboxes pre-Database 12c

There are a number of ways to create sandboxes using pre-12c versions of the Oracle Database. The two most popular choices that I have seen used are Database Configuration Assistant and Workspace Manager. 

Database Configuration Assistant

Creating an empty sandbox is a relatively easy for DBAs and business users. The Database Configuration Assistant (DBCA) provides a wizard for creating and configuring an empty, but ready to go, database. This tool is very flexible and allows DBAs to control all aspects of the database creation process. For most business users the wizard provides far too many parameters to configure so Oracle provides three default configuration templates which provide default values for all the main configuration parameters. The pre-built templates include: general purpose, data warehouse and custom. The DBA can add additional templates for specific types of sandboxes by simply creating new template files and this is covered in the database documentation (see here: http://docs.oracle.com/cd/E16655_01/server.121/e17643/install.htm#BABEGFCG). The information in each template includes details of database options, initialisation parameters, and storage attributes (for data files, tablespaces, control files, and online redo logs). A common approach might be to create templates to configure large, medium and small sandboxes with appropriate settings for the specific hardware platform being used.

Once the new sandbox is ready the business users/data scientists can then start loading data. I am not going to over the data loading process, however, please note that this can be done using a data integration tool such as Oracle Data Integrator or flat files. In fact ODI now has a scripting language called Groovy which allows ETL developers to create easy-to-use dialog for selecting a table/view from a source system and copy its data to a target schema. For more information about ODI’s Groovy scripting language checkout the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy.

Overall, creating empty sandboxes using DBCA is a simple process. However, this approach does mean that data is replicated from the original database and each instance requires its own dedicated hardware resources (I/O, CPU and memory) which cannot be easily transferred to other sandboxes. Every time a new sandbox is required the DBA and systems administrators have to ensure that sufficient memory for managing the data along with all the typical database background processes, storage and CPU are available. As more and more sandboxes are requested and deployed these replicated overheads quickly consume the available server resources and this limits the number of sandboxes that can be run on a specific platform. Having dedicated, non-sharable, resources is extremely limiting and very costly.

Many teams find that trying to get approval to create a new database is a long and  painful process. Sometimes it can months to work through the bureaucracy and ensure all the pieces are in place ready to create the new database: storage, processing nodes, network connectivity etc.

Key Benefits: relatively simple process for creating new, empty, databases;

Key issues: requires dedicated system resources; cannot copy an existing database with its schemas and data; no simple mechanism for moving discoveries (objects, data etc) to production environment; may take a long time to provision due to internal bureaucracy

Workspace Manager

If projects require pre-populated sandbox environments then creating a sandbox with access to existing data is best done using Oracle Workspace Manager. This provides the ability to manage current, proposed and historical versions of data within the same database instance so there is no need to move data. Workspace Manager creates a self-contained virtual area within a database where users can update existing data without affecting the original data. This virtual area isolates all the changes until they are explicitly merged with production data or discarded. This means there is no need to duplicate existing data and any new data can be loaded in a controlled way without impacting the production data.

Users in a workspace always see a transactionally consistent view of the entire database; that is, they see changes made in their current workspace plus the original data in the database as it existed either when the workspace was created or when the workspace was last refreshed with changes from the parent workspace. Workspace Manager even supports a hierarchy of workspaces for very complex sandbox requirements involving multiple teams and/or project goals.

There is a great example of how to use Workspace Manager for what-if analysis on the Oracle Learning Library. The set-by-step tutorial shows you how to use workspaces to do location-based analysis on current and proposed data. The fictitious company MyCompany is planning to build a new warehouse to provide better service to its customers. Two potential sites are under consideration. The company wants to use SQL to analyse the prospective sites. To do this, data for both prospective sites must be entered into the production warehouse schema. However, this new data must be kept separate from the existing production data so that it does not impact the work of employees who are not part of the site selection team. Using Workspace Manager it is possible to isolate data for each of prospective sites and allow the two site selection teams to work concurrently. This use case creates the following hierarchy of workspaces, as shown below:

 version-enabled warehouse table and associated team workspaces: SITE1, SITE2, and LIVE. 

You can take this sandbox tutorial by following the steps on the Oracle Learning Library page, see here: https://apex.oracle.com/pls/apex/f?p=44785:24:15780601154836::NO:24:P24_CONTENT_ID%2CP24_PREV_PAGE:4553%2C29

One of the challenges of using workspaces relates to resource management and the ability to increase or decrease resources which is important for overall resource utilisation.

Obviously at some point users will want to add new data into the workspace and as with the DBCA overview, ETL developers can use ODI’s scripting language called Groovy to create easy-to-use dialog for selecting a table/view from a source system and copy its data to a target schema. For more information about ODI’s Groovy scripting language checkout the posts on the data integration blog: https://blogs.oracle.com/warehousebuilder/tags/groovy

Key Benefits: relatively simple process; no need to copy/clone data; easy to move discoveries to production

Key issues: requires careful workload management; monitoring and chargeback is difficult to achieve

For more information about Oracle Workspace Manager visit the product home page on OTN: http://www.oracle.com/technetwork/database/enterprise-edition/index-087067.html.

Summary 

In general, it is likely that most customers will use both approaches for delivering sandboxes to their project teams. Where there are hundreds of projects taking place at the same time then both approaches will probably create additional work for the DBA and systems administrators in terms of having to monitor and track lots of workspaces and/or plan-monitor-track resource usage across lots of database instances. Ideally what we need is an approach that can easily support the deployment of thousands of sandboxes, offers simple and fast provisioning, leaves data in place to prevent unnecessary movement of large volumes of data, provides a single centrally managed platform and, most importantly, allows for expansion/reduction of resources as needed across all sandboxes. Is this possible? 

Building Sandboxing with Database 12c

The good news is that the new features we introduced as part of Database 12c combine the all best parts of the sandboxing features from the pre-12c approach with none of the drawbacks. Specifically, Database 12c introduced the concept of a multitenant database. This was originally positioned as a way to consolidate databases and applications under the control of a single environment but the approach works amazingly well for sandboxes. Therefore, with 12c you can now deploy sandboxes into the cloud!

This new feature (Oracle Multitenant) delivers a new architecture that provides a multitenant container database to hold a series of pluggable databases, or in this case sandboxes. There is a great quick overview of the new multitenant feature available on YouTube, follow the link below to watch the video:

Video: Oracle Multitenant Architecture

With the new multitenant architecture featuring pluggable databases (sandboxes) it is now possible to create a single container database that allows multiple sandboxes to be simply plugged in. Below is a simple overview of the main multitenant terminology. When we start to create a new multitenant sandbox environment we plug-in our multiple pluggable sandboxes to the root database. The combination of root database and pluggable sandboxes (PDBs) is referred to as a “Container Database” or CDB as shown below:

Architecture

The great thing about the multitenant feature us that it offers a very fast and efficient way of creating new sandboxes. It is a bit like the template feature in Database Configuration Assistant with one important difference: deploying a new pluggable sandbox takes seconds as shown here:

Clone

the above is taken from the Mutitenant info graphic which is here: http://www.oracle.com/us/products/database/oracle-multitenant-infographic-1961308.pdf. The actual speed at which we can deploy pluggable sandboxes is outlined in the graph below:

P3

The first bar on the graph shows the time taken to create a “standard” database, which involves coping template data files and then configuring the instance which can take a considerable amount of time. The second bar shows the time taken to create a new pluggable sandbox, which is significantly faster. The third bar shows that we can deliver pluggable sandboxes in real time by using the built-in “clone” feature - this allows us to deploy a new pluggable sandbox which is able to “see” the source data in the PDB which acted as the source for the clone. The cloned-pluggable sandbox also gets access to new/refreshed data as it is changed in the source system. This means that our cloned pluggable sandbox is always up to date with the very latest data. 

What is clone?

Provisioning a sandbox that is a copy of an existing database so the project teams have access to existing data sets used to be a lengthy multistep process.  Multitenant provides a capability called cloning. It is now possible to create local clones (from a pluggable database in the same container database) or remote clones (across a database link to a remote container database). What are the benefits of using this feature? Firstly, it is very simple - a single SQL statement. Secondly, it keeps the overall security requirements nice and  simple  because all that is required is database access and not OS access. Lastly, business users and data scientists like it because the whole process is so fast. The process of creating full clones is fast. Even faster—ludicrously fast—is the process for creating snapshot clones. Snapshot clones are built on a capability of the underlying file system—where available—called copy-on-write. This makes it possible to request and deploy a pre-populated pluggable sandbox in a matter of seconds!

Now we understand the overall architecture and speed at which we can deploy new sandboxes, how do you actually request a new sandbox/PDB?

Self-Service provisioning of sandboxes

In prior releases of the Oracle Database a lot of the tools and features that were used for sandboxing were typically aimed at DBAs. Today, many business users and data scientists want to control and own the process of requesting and building a sandbox. As part of Database 12.1.0.2 we have provided a new self-service application that DBAs can make available to their business teams for self-service provisioning.  This new application provides is a simple 4-step process for creating a new pluggable sandbox :

Step 1: Start the APEX self-service provisioning app and click on the “New Database” link:

Build 1

Step 2: Select to create a new sandbox/pluggable database or plug-in an existing database that has been moved from another container. In the majority of cases business users will typically provision a new sandbox:

Build 2

Step 3: Determine if you are going to create a new empty sandbox or clone an existing database (mart or warehouse or existing sandbox). In the example below we have selected and existing sandbox MIKETEST and the option to clone this database is available at the bottom of the screen. 

Build 3

Step 4: Check the details and confirm the deployment. 

Build 4

As mentioned earlier the speed at which pluggable sandboxes can be created means that business users can request a new pluggable sandbox and have the environment ready for use in a matter of seconds (of your course your mileage might vary according to your hardware!).

Workspace Manager and Multitenant

In many cases the will still be a need to have an environment where the same data sets can be shared across multiple project teams with each team work independently on their own view of the data. This is exactly the sort of use case for Workspace Manager, as described earlier, and the good news is that Workspace Manager functions transparently in a multitenant architecture. Workspaces benefit from the efficient administration of one multitenant container database, and the separation and resource prioritisation allowed by multiple pluggable databases. 

SQL Developer and Multitenant

For data scientists and more sophisticated business users the latest version of SQL Developer is now multitenant aware. Therefore, rather than using the self-service application, all the required operations for creating and managing a sandbox are now available from within SQL Developer as shown here:

SQLDeveloper 1

 From within SQLDeveloper you can: deploy a sandbox (CREATE PLUGGABLE or CLONE PLUGGABLE), move a sandbox (UNPLUG and PLUG) and delete a sandbox (DROP PLUGGABLE)

SQLDeveloper 2

SQL Developer offers fine grained control and most business users and data scientists will probably need some guidance from their DBA on the various parameter settings for deploying a new sandbox. For more information about using SQL Developer to manage ( creating, modifying, plugging/unplugging) pluggable sandboxes see the online tutorial on the Oracle Learning Library: https://apex.oracle.com/pls/apex/f?p=44785:24:113456037406764:::24:P24_CONTENT_ID%2CP24_PROD_SECTION_GRP_ID%2CP24_PREV_PAGE:7649%2C%2C24

Summary

Database 12c offers significant advantages for delivering sandboxes - deploying pluggable sandboxes into the cloud is now simple and really fast. The fast and efficient deployment of pluggable sandboxes can be delivered as a self-service approach using a variety of tools and applications. These new pluggable sandboxes also support the existing data isolation features such as Workspaces. Overall, Database 12c makes sandboxing faster, simpler and more efficient.

Conclusion

In this latest post on sandboxing I have examined the “Build” phase of our BOX’D sandbox lifecycle. For customers who have not yet moved to Database 12c there are a number of tools and features that can be used to deploy sandboxes. Customers who have moved to Database 12c can use the new multitenant feature alongside some of the earlier features to deliver real-time deployment of pre-populated and empty sandboxes.

Thursday Jun 05, 2014

Globacom and mCentric Deploy BDA and NoSQL Database to analyze network traffic 40x faster

In a fast evolving market, speed is of the essence. mCentric and Globacom leveraged Big Data Appliance, Oracle NoSQL Database to save over 35,000 Call-Processing minutes daily and analyze network traffic 40x faster. 

Here are some highlights from the profile:

Why Oracle

“Oracle Big Data Appliance works well for very large amounts of structured and unstructured data. It is the most agile events-storage system for our collect-it-now and analyze-it-later set of business requirements. Moreover, choosing a prebuilt solution drastically reduced implementation time. We got the big data benefits without needing to assemble and tune a custom-built system, and without the hidden costs required to maintain a large number of servers in our data center. A single support license covers both the hardware and the integrated software, and we have one central point of contact for support,” said Sanjib Roy, CTO, Globacom.

Implementation Process

It took only five days for Oracle partner mCentric to deploy Oracle Big Data Appliance, perform the software install and configuration, certification, and resiliency testing. The entire process—from site planning to phase-I, go-live—was executed in just over ten weeks, well ahead of the four months allocated to complete the project.

Oracle partner mCentric leveraged Oracle Advanced Customer Support Services’ implementation methodology to ensure configurations are tailored for peak performance, all patches are applied, and software and communications are consistently tested using proven methodologies and best practices.

Read the entire profile here.

Thursday May 29, 2014

Oracle Data Warehouse and Big Data Magazine MAY Edition for Customers + Partners

Follow us on Facebook Twitter Blogger

The latest edition of our monthly data warehouse and big data magazine for Oracle customers and partners is now available. The content for this magazine is taken from the various data warehouse and big data Oracle product management blogs, Oracle press releases, videos posted on Oracle Media Network and Oracle Facebook pages. Click here to view the May Edition


Please share this link http://flip.it/fKOUS to our magazine with your customers and partners


This magazine is optimized for display on tablets and smartphones using the Flipboard App which is available from the Apple App store and Google Play store





Thursday May 15, 2014

DBAs Guide to Sandboxes vs. Data Marts

I had an interesting response to my first post on the topic of sandboxing (DBA's Guide to Deploying Sandboxes in the Cloud). The following question was asked: what is the difference between a data mart and sandbox?

This is actually a great question so I thought it would be useful to convert my answer into a short blog post. I am sure there will be lots of different opinions on this topic just as there are alternative names for "sandbox environment" (from analytical sandbox, to analytical appliance to discovery zone etc etc) but here is my attempt at an answer:

OLAP1 IndustryDataModels

In my experience data marts tend to be a single subject area data repository and/or linked to a specific corporate application (such as finance, HR, CRM, ERP, logistics, sales tracking etc). The source data is pushed to a specific line of business for analysis. The push and loading processes implements all the necessary data cleansing and transformation routines so the data arrives into its destination schema ready for use. Most importantly, the data push happens on a regular basis and is driven by the needs of the business.  Many customers implement a data lifecycle management workflow to ensure that sufficient historical data is available to support the required analysis. In many cases the life of the data mart is largely open-ended and the IT team will ensure that regular backups and all the usual patching and maintenance operations are performed on a regular basis. Where the mart is seen as a mission critical system then high-availability features can be implemented and in extreme cases a disaster recovery site is setup and data synchronised between the production and DR systems.

The schema itself is typically organized to support reporting requirements and will be based around the standard relational models such as star and/or snowflake schemas although this is not a mandatory requirements. Sometimes a 3NF approach is required to support the particular needs of the business. The majority of queries within the mart are "well-defined" and "well-known" and subject to tuning and monitoring by the DBA team. 

A "sandbox" is generally meant as a non-operational environment where business analysts and data scientists can test ideas, manipulate data and model "what if" scenarios without placing an excessive computational load on the core operational processes. It has a finite life expectancy so that when timer runs out the sandbox is deleted and the associated discoveries are either incorporated into the enterprise warehouse, or data mart, or simply abandoned. The primary driver from an organisational perspective is to use a 'fail-fast" approach. At any one point in time an organization might be running any number of analytical experiments spread across hundreds of sandboxes. However, at some point in time those experiments will be halted and evaluated and the "hardware" resources being consumed will be returned to a general pool for reuse by existing projects or used to create environments for new projects. 

In general terms a sandbox environment is never patched or upgraded, except in exceptional circumstances. There is never a great urgency to apply software or operating system patches so the ITM team will  just incorporated these tasks into the normal cycles.

A sandbox should never be considered mission critical so there is no need to implement high availability features or build and manage a DR environment. If a sandbox becomes unavailable due to a fault (hardware or software) there is no pressing urgency to resolve the issue - in Oracle parlance a "sandbox going down" is not a P1 issue for support.

Below is a summary of how I view the differences between marts and sandboxes:


Business Centric Attributes

Data Mart

Sandbox

Business scope

Single subject area

Potentially a mixing pot of data sourced from multiple systems

Core objective

Managing the business

Discovery of new products, markets and/or customer segments

Query scope

Batch reporting for dashboards and pre-configured reports along with limited ad-hoc analysis

Ad –hoc data discovery

Frequency of update

Regular, scheduled data loads

One-off and ad-hoc loads as required

Data Volumes

Driven by external factors such as GRC requirements

Driven by needs of project

Data Quality

Very important – data is fully cleansed and transformed during load process

Raw data is loaded, transformations, cleansing and enhancements are incorporated into discovery process

Typical Output

Historical reports, KPIs, scorecards, multi-dimensional analysis

Data mining models: forecasts, predictions, scoring

Sophisticated analytics (aggregations, spatial, graph etc)

Typical query patterns

Pre-defined patterns returning small data sets, easily tuned

Complex ad hoc queries over massive data volumes



IT Centric Attributes

Data Mart

Sandbox

Mission critical

Yes

No

Performance SLAs

Yes

No

Individual/LOB chargeback for resource usage

Unlikely

Yes

HA features/DR Site

Yes

No

Backups

Full + incremental

No

Patching+Upgrades

As required to resolve specific issues along with scheduled maintenance programs

Only when necessary

Use of beta software

Never

Possibly if project needs specific features

Life expectancy

Unlimited

Limited – typically 90 days or less

Number of instances

Most companies choose to
implent a very small number of subject-specific marts

Large companies likely to have many hundreds of sandoxes running at any given point in time

Table: Key differences between data mart and sandbox

As you can see from this table, in some ways sandboxes are similar to data marts and in other ways they are not. For me, the key difference is in the life expectancy - a sandbox should never outstay its welcome. The best sandbox environments that I have come across are those where strict time limits are enforced on their duration. If you let a sandbox live on too long then you run the danger of it morphing into a shadow data mart and that is a very dangerous situation if you look at the attributes and descriptions listed in the first table (Business Centric Attributes). 

So why is there so much confusion about the differences between marts and sandboxes? Much of this is down to niche vendors trying to jump on specific marketing bandwagons. At the moment the latest marketing bandwagon is the concept of "analytical databases" which in reality is nothing more than a data mart (and in many cases these vendors are simple peddling highly specialised data silos). These niche vendor platforms are simply not designed to run hundreds of environments with resources being continually returned to a centralised pool for redistribution to existing or new projects - which is a core requirement for effective sandboxing.

Over the next 2-3 months I am will share why and how the unique features of Oracle Database 12c provide the perfect platform for supporting environments running hundreds of thousands of sandbox-driven projects.

Tuesday May 13, 2014

Announcing: Big Data Lite VM 3.0

Last week we released Big Data Lite VM 3.0. It contains the latest update on the VM for the entire stack.

Oracle Enterprise Linux 6.4
Oracle Database 12c Release 1 Enterprise Edition (12.1.0.1) with Oracle Advanced Analytics, Spatial & Graph, and more
Cloudera’s Distribution including Apache Hadoop (CDH 5.0)
Oracle Big Data Connectors 3.0
        Oracle SQL Connector for HDFS 3.0.0
        Oracle Loader for Hadoop 3.0.0
        Oracle Data Integrator 12c
        Oracle R Advanced Analytics for Hadoop 2.4.0
        Oracle XQuery for Hadoop 3.0.0
Oracle NoSQL Database Enterprise Edition 12cR1 (3.0.5)
Oracle JDeveloper 11g
Oracle SQL Developer 4.0
Oracle Data Integrator 12cR1
Oracle R Distribution 3.0.1

The download page is on OTN in its usual place.

About

The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

Search

Archives
« March 2015
SunMonTueWedThuFriSat
1
2
4
5
6
7
8
12
13
14
15
16
17
21
22
23
25
26
27
28
29
30
31
    
       
Today