By Klaker-Oracle on Jan 27, 2015
Oracle and SAS have an ongoing commitment to our joint customers to deliver value-added technology integrations through engineered systems such as Exadata, Big Data Appliance, SuperCluster, Exalogic and ZFS Storage Appliance. Dedicated resources manage and execute joint SAS/Oracle Database, Fusion Middleware, and Oracle Solaris integration projects, providing customer support, including sizing and IT infrastructure optimization and consolidation. Oracle support teams are onsite at SAS Headquarters in Cary, NC (USA), and in the field on a global basis.
The latest in this effort is to enable our joint customers to deploy SAS High Performance Analytics on Big Data Appliance. This enables SAS users to leverage the lower-cost infrastructure Hadoop offers in a production-ready deployment on Oracle Big Data Appliance. Hear from Paul Kent (VP Big Data, SAS) on some of the details.
After this year’s OpenWorld I ran a 1-day workshop on analytical SQL for our Data Warehouse and Big Data Global Leaders customers. This was part of the Global Leaders 'Days with Development' programme. We had a packed room at the Oracle Conference Center and I was very lucky to have Stew Ashton, Technical Architect, BNP Paribas; Dr. Holger Friedrich, CTO at sumIT AG; and Joerg Otto, Head of DB Engineering, IDS GmbH, co-present with me and explain how they were using analytical SQL in their projects.
The workshop covered the following topics:
For the workshop I created a comprehensive slide deck (I will post the presentation shortly on our OTN home page) which included code samples and explain plans to highlight the key benefits of using our analytical SQL features and functions. The great news is that I now have a repository for analytical SQL code samples on GitHub. To kick things off on this new repository I have posted all the SQL scripts that I created for this workshop so you can now download and work through a series of use cases that explain how to use window functions, intelligently aggregate data, manipulate rows and columns of data, find patterns and create what-if scenarios. Below is my repository home page where you can download the code:
At a simple level, GitHub is an online version control system (and a lot more!) that stores and manages modifications to code script files within a central repository. Its key benefit is that it makes it very easy for developers to work together on a common project. This environment makes it easy to download a new version of code scripts, make changes, and upload the revisions to those files. Everyone can then see these new changes, download them, and contribute. This system is very popular with developers so we have decided to join this community and make our SQL scripts available via this site. It is the ability to “collaborate” which is most important for me.
To help you get started there is a great section on the website called “Set Up Git”. If like me you are using a Mac then GitHub has a Mac client! You can use it without ever touching the command line interface (which can be a little frustrating at times!).
It would be great if you could contribute your own scripts to this repository so I can build up a library of scripts for analytical SQL. All you need to do is create an account on GitHub, search for the analytical SQL repository and then either download the repository as a zip file or use the “Clone in Desktop” option. What I want to do is build up a series of well-documented use cases, and when we have enough content I will create industry-specific folders to help organize it.
So my new repository is now live. Please have a look and feel free to upload your own scripts that show how you have used analytical SQL to solve specific business problems within your projects. I am looking forward to lots of files arriving in this great new repository.
This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle. Part 4 explored the Observe-phase of our lifecycle, so we have now arrived at the X-Charge part of our model.
To manage the chargeback process for our sandbox environment we are going to use the new Enterprise Manager 12c Cloud Management pack. For more information, visit the EM home page on OTN.
Why charge for providing sandbox services? The simple answer is that placing a price or cost on a service ensures that the resources are used wisely. If a project team incurs zero costs for its database environment then there is no incentive to evaluate the effectiveness of the data set, and the cost-benefit calculation for the project is skewed by the lack of real-world cost data. This type of approach is the main reason why sandbox projects evolve over time into “production” data marts. Even if the project is not really delivering on its expected goals there is absolutely no incentive to kill the project and free up resources. Therefore, by not knowing the cost, it is impossible to establish the value...
As my first blog post as the product manager for Parallel Execution (still known as Parallel Query by many people) I want to point out a new white paper about parallel execution fundamentals in the Oracle Database. You can find the paper here.
Also, stay tuned for another one in which we will talk about using Database Resource Manager to manage concurrent workloads with parallel execution; it will be published soon.
Please send any comments, questions and feedback about the paper, or generally about parallel execution in Oracle, to email@example.com, or even better use the comments section below.
Most of you will be familiar with partition pruning, where the Oracle Database will avoid the need to scan table and index partitions based on query predicates. This optimization is transparent to your application, but for it to work, the database has to find a way of mapping a query filter predicate to the partitioning key column (or columns). Partition pruning can only occur if the query has predicates that match the predetermined shape of a partitioned object. For example, a query on a SALES table partitioned by ORDER_DATE will need to include ORDER_DATE in a join or WHERE clause for it to be optimized by partition pruning.
What if you could do better than this? What if you could prune partitions using a variety of column predicates and dimension hierarchies, irrespective of their appearance in the partitioning key? How about pruning at a much finer level of granularity than a partition? Perhaps we want to optimize queries that filter SALES by SHIP_DATE, STATE and COUNTY, as well as ORDER_DATE. The new Oracle 12.1.0.2 zone map feature is designed to achieve this, and just like partitioning, zone maps are transparent to your queries; you don’t have to change your applications to make use of them.
Zone maps are available in Oracle Database 12c for Oracle Engineered Systems. Conceptually, they divide a table up into contiguous regions of blocks called zones (the default zone size being 1024 blocks). For each zone, the Oracle database records the minimum and maximum values for specified columns using a new database object called a zone map. Queries that filter on zone map columns have the potential to be optimized; it’s possible to prune zones that contain ranges of column values outside the match specified in the query predicate.
Consider a query that filters a sales table by (North American) state; in this case “CA”. A zone map on the STATE column will record the minimum and maximum values for this column for each zone in the table. This makes it possible to skip the zones that we can be certain won’t contain rows for “CA”.
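To make the min/max idea concrete, here is a small Python sketch of zone pruning. It is a conceptual model only, not how Oracle implements zone maps: the zone size of 4 rows (Oracle zones are ranges of blocks), the sample STATE values, and the helper names build_zone_map and zones_to_scan are all assumptions made for illustration.

```python
# Conceptual model of zone pruning; not Oracle's implementation.
# ZONE_SIZE, the sample data and both function names are illustrative assumptions.
ZONE_SIZE = 4  # rows per zone here; Oracle zones are ranges of blocks


def build_zone_map(values, zone_size=ZONE_SIZE):
    """Record (min, max) of the column for each contiguous zone of rows."""
    zone_map = []
    for start in range(0, len(values), zone_size):
        zone = values[start:start + zone_size]
        zone_map.append((min(zone), max(zone)))
    return zone_map


def zones_to_scan(zone_map, wanted):
    """Return the zone numbers whose min/max range could contain `wanted`."""
    return [n for n, (lo, hi) in enumerate(zone_map) if lo <= wanted <= hi]


# STATE column values in the order the rows are stored in the table.
states = ["AZ", "AZ", "CA", "CA",
          "NM", "NM", "NV", "NV",
          "CA", "OR", "TX", "TX"]

zone_map = build_zone_map(states)
print(zone_map)                       # [('AZ', 'CA'), ('NM', 'NV'), ('CA', 'TX')]
print(zones_to_scan(zone_map, "CA"))  # [0, 2]
```

The middle zone is skipped because “CA” falls outside its NM to NV range. Note that a zone left in the scan list is only a candidate: its range includes “CA”, but it is not guaranteed to hold any “CA” rows.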
You are probably aware that Exadata storage cells and the Oracle Database In-Memory Column Store use similar storage index techniques, so what benefits do zone maps add? Besides the fact that you can control zone maps explicitly, the most significant difference between zone maps and storage indexes is that zone maps can be used to prune zones using column predicates from multiple (joined) tables. Consider a more realistic scenario, in which the SALES table doesn’t have a STATE column, but instead has a LOCATION_ID referencing a dimension table called LOCATIONS. This is our query for summing the sales figures in California:
SELECT SUM(amount) FROM sales s, locations l WHERE s.location_id = l.location_id AND l.state = 'CA';
It would be great if we could avoid scanning zones in SALES that don’t contain rows associated with “CA”. Before we look at how we can do this, we’ll make the scenario even more realistic by assuming that LOCATIONS is a dimensional hierarchy of State and County, like this:
Each State is made up of multiple Counties, so “CA” will be associated with multiple LOCATION_ID values. If we want the “CA” rows in SALES, we’ll need to match the ones marked below in bold/red:
If we want to optimize a scan for “CA” rows, we will have to address a few issues:
Of course, zone maps are designed to address these issues - with a little bit of help from another Oracle Database 12c feature called attribute clustering. I introduced attribute clustering in an earlier post, but don’t worry if you haven’t read that yet; I’ll cover the basics here anyway. You’ve probably deduced that we can reduce the number of zones that contain “CA” rows if we cluster or sort the rows in SALES, keeping these rows close to one another, like this:
Attribute clustering is the feature that’s used to cluster the rows together. Zone maps are used to record the min/max values for specified columns for each zone (and this can include column values derived from joins; LOCATIONS.STATE and LOCATIONS.COUNTY in our case).
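The interaction between clustering and a join-derived zone map can be sketched in a few lines of Python. Again, this is only a conceptual model under stated assumptions: the LOCATIONS contents, the LOCATION_ID values in SALES, the zone size of 3 rows and the function names are all made up for illustration.

```python
# Conceptual model only: shows why clustering helps a join-derived zone map.
# The LOCATIONS contents, the sales rows and ZONE_SIZE are illustrative assumptions.
ZONE_SIZE = 3

# location_id -> (state, county), standing in for the LOCATIONS dimension table.
LOCATIONS = {
    1: ("AZ", "Pima"),
    2: ("CA", "Kern"),
    3: ("CA", "Marin"),
    4: ("NM", "Taos"),
    5: ("TX", "Bexar"),
}


def state_zone_map(location_ids):
    """Min/max of the joined STATE value for each zone of SALES rows."""
    states = [LOCATIONS[loc][0] for loc in location_ids]
    return [(min(states[i:i + ZONE_SIZE]), max(states[i:i + ZONE_SIZE]))
            for i in range(0, len(states), ZONE_SIZE)]


def count_candidate_zones(zone_map, state):
    """Number of zones that cannot be pruned for a STATE = state predicate."""
    return sum(1 for lo, hi in zone_map if lo <= state <= hi)


# SALES.LOCATION_ID in arrival order: 'CA' rows (ids 2 and 3) are scattered.
unclustered = [2, 5, 1, 4, 3, 1, 5, 2, 4, 1, 3, 5]

# The same rows ordered by (state, county), as a linear-order clustering would.
clustered = sorted(unclustered, key=lambda loc: LOCATIONS[loc])

print(count_candidate_zones(state_zone_map(unclustered), "CA"))  # 4
print(count_candidate_zones(state_zone_map(clustered), "CA"))    # 2
```

With the rows scattered, every zone has a min/max range that spans “CA”, so nothing can be pruned. Once the rows are ordered by state and county, only two of the four zones remain candidates.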
The following DDL will create a zone map on our SALES fact table using the dimension table columns LOCATIONS.STATE and LOCATIONS.COUNTY. It will also enable attribute clustering, using the same columns to cluster the table’s rows:
ALTER TABLE sales ADD CLUSTERING sales JOIN locations ON (sales.location_id = locations.location_id) BY LINEAR ORDER (locations.state, locations.county) WITH MATERIALIZED ZONEMAP;
The LINEAR ORDER clause specifies a linear clustering algorithm, which is ideal for this example. Another algorithm is available; it is specified with "INTERLEAVED" and is optimized for more complex combinations of query predicates and dimension tables. Note that the definition of attribute clustering by itself does not change any data stored on disk; instead, it provides a directive for direct path operations (INSERT APPEND and MOVE) that will physically perform the clustering operation for us. If there are pre-existing rows in SALES, we can MOVE the table (or its partitions) to re-order them.
Joins between SALES and the dimension table are now candidates for optimization when the query includes predicates on the dimension hierarchy “state” and “state, county”. For example:
SELECT SUM(amount) FROM sales JOIN locations ON (sales.location_id = locations.location_id) WHERE locations.state = 'NM';
SELECT SUM(amount) FROM sales JOIN locations ON (sales.location_id = locations.location_id) WHERE locations.state = 'CA' AND locations.county = 'Kern';
By clustering the rows and recording appropriate min/max column values for our zones, we have addressed all of the issues I identified above. What’s more, we can still get benefit from Exadata storage indexes because zone maps and storage indexes complement one another, and they work together transparently.
Zone maps are explicitly created and controlled by the database administrator on a table-by-table basis. They are an inherent part of the physical database design and can be thought of as a coarse anti-index structure (unlike an index, a zone map tells you what zones not to access). Zone maps are very compact, and in some cases it is possible to use them where you would otherwise use an index. This is most relevant in data warehousing environments where scanning is often more appropriate than indexed row retrieval, and where indexes can use a considerable amount of storage space. Zone maps must be refreshed to be synchronized with the underlying table data, so you will need to give some consideration to how you want them to be kept up-to-date if you decide to use them as an alternative to indexes.
In summary, take a look at zone maps if you want to:
Here’s an example of using zone maps to optimize a table scan. To compare before and after, start by creating a table that has no zone map or attribute clustering:
CREATE TABLE sales_zm (sale_id NUMBER(10), customer_id NUMBER(10));
Insert 8 million rows with the following PL/SQL code. Why that many? With our example, we'll read one or two zones rather than the entire table, so I'm aiming to make the difference pretty obvious when you look at the block read statistics:
DECLARE
  i NUMBER(10);
BEGIN
  FOR i IN 1..80 LOOP
    INSERT INTO sales_zm SELECT ROWNUM, MOD(ROWNUM,1000) FROM dual CONNECT BY LEVEL <= 100000;
    COMMIT;
  END LOOP;
END;
/
EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_ZM');
SET AUTOTRACE ON STATISTIC
Run the following query a few times to see what value “consistent gets” settles at:
SELECT COUNT(DISTINCT sale_id) FROM sales_zm WHERE customer_id = 50;
On my machine, I read 7,545 blocks from the buffer cache, but since the value depends on some storage defaults don’t be surprised if your value is different:
The following DDL will create a zone map, but since attribute clustering is a property of the table (like compression), any existing rows will not be re-ordered:
ALTER TABLE sales_zm ADD CLUSTERING BY LINEAR ORDER (customer_id) WITH MATERIALIZED ZONEMAP;
The zone map will not be efficient until we cluster the rows together, so we’ll MOVE the table to achieve this. This will refresh the zone map too:
ALTER TABLE sales_zm MOVE;
Run the same query a few times to see what value “consistent gets” settles at:
SELECT COUNT(DISTINCT sale_id) FROM sales_zm WHERE customer_id = 50;
On my database, I read around 1,051 database blocks instead of 7,545, a considerable improvement.
You'll find more examples covering zone maps and attribute clustering in the Oracle Learning Library and inside the Oracle GitHub repository. Full details on zone maps and attribute clustering can be found in the Oracle documentation library, particularly the Oracle 12c Database Data Warehousing Guide.
There's an earlier post on attribute clustering if you haven't read it already.
If there's anything you need to ask, or if you can't find what you need regarding zone maps or attribute clustering, please let me know by leaving a comment below. Thanks!
This is the next part in my on-going series of posts on the topic of how to successfully manage sandboxes within an Oracle data warehouse environment. In Part 1 I provided an overview of sandboxing (key characteristics, deployment models) and introduced the concept of a lifecycle called BOX’D (Build, Observe, X-Charge and Drop). In Part 2 I briefly explored the key differences between data marts and sandboxes. Part 3 explored the Build-phase of our lifecycle.
Now, in this post I am going to focus on the Observe-phase. At this stage in the lifecycle we are concerned with managing our sandboxes. Most modern data warehouse environments will be running hundreds of data discovery projects so it is vital that the DBA can monitor and control the resources that each sandbox consumes by establishing rules to control the resources available to each project both in general terms and specifically for each project.
In most cases, DBAs will set up a sandbox with dedicated resources. However, this approach does not make efficient use of resources since sharing of unused resources across other projects is just not possible. The key advantage of Oracle Multitenant is its unique approach to resource management. The only realistic way to support thousands of sandboxes, which in today’s analytics-driven environments is entirely possible if not inevitable, is to allocate one chunk of memory and one set of background processes for each container database. This provides much greater utilisation of existing IT resources and greater scalability as multiple pluggable sandboxes are consolidated into the multitenant container database.
Using multitenant we can now expand and reduce our resources as required to match our workloads. In the example below we are running an Oracle RAC environment, with two nodes in the cluster. You can see that only certain PDBs are open on certain nodes of the cluster and this is achieved by opening the corresponding services on these nodes as appropriate. In this way we are partitioning the SGA across the various nodes of the RAC cluster. This allows us to achieve the scalability we need for managing lots of sandboxes. At this stage we have a lot of project teams running large, sophisticated workloads which is causing the system to run close to capacity as represented by the little resource meters.
It would be great if our DBA could add some additional processing power to this environment to handle this increased workload. With 12c what we can do is simply drop another node into the cluster, which allows us to spread the processing of the various sandbox workloads out across the expanded cluster.
Now our little resource meters are showing that the load on the system is a lot more comfortable. This shows that the new multitenant feature integrates really well with RAC. It’s a symbiotic relationship whereby Multitenant makes RAC better and RAC makes Multitenant better.
So now that we can add resources to the cluster, how do we actually manage resources across each of our sandboxes? As a DBA I am sure that you are familiar with the features in Resource Manager that allow you to control system resources: CPU, sessions, parallel execution servers, Exadata I/O. If you need a quick refresher on Resource Manager then check out this presentation by Dan Norris “Overview of Oracle Resource Manager on Exadata” and the chapter on resource management in the 12c DBA guide.
With 12c, Resource Manager is now multitenant-aware. Using Resource Manager we can configure policies to control how system resources are shared across the sandboxes/projects. Policies control how resources are utilised across PDBs, creating hard limits that can enforce a “get what you pay for” model, which is an important point when we move forward to the next phase of the lifecycle: X-Charge. Within Resource Manager we have adopted an “industry standard” approach to controlling resources based on two notions:
To help DBAs quickly deploy PDBs with a pre-defined set of shares and utilisation limits there is a “Default” configuration that works, even as PDBs are added or removed. How would this work in practice? Using a simple example this is how we could specify resource plans for the allocation of CPU between three PDBs:
As you can see, there are four total shares, 2 for the data warehouse and one each for our two sandboxes. This means that our data warehouse is guaranteed 50% of the CPU whatever else is going on in the other sandboxes (PDBs). Similarly each of our sandbox projects is guaranteed at least 25%. However, in this case we did not specify settings for maximum utilisation. Therefore, our marketing sandbox could use 100% of the CPU if both the data warehouse and the sales sandbox were idle.
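The share arithmetic is simple enough to check with a short Python sketch. This is illustrative arithmetic only, not a Resource Manager API: the PDB names, the extra finance sandbox and the function name guaranteed_cpu are assumptions for the example.

```python
# Illustrative arithmetic only; this is not a Resource Manager API.
# The PDB names and the function name are assumptions for this sketch.

def guaranteed_cpu(shares):
    """Map each PDB to its guaranteed CPU percentage: its shares / total shares."""
    total = sum(shares.values())
    return {pdb: 100 * n / total for pdb, n in shares.items()}


shares = {"data_warehouse": 2, "marketing_sandbox": 1, "sales_sandbox": 1}
print(guaranteed_cpu(shares))
# {'data_warehouse': 50.0, 'marketing_sandbox': 25.0, 'sales_sandbox': 25.0}

# Plugging in another (hypothetical) one-share sandbox rebalances every guarantee.
shares["finance_sandbox"] = 1
print(guaranteed_cpu(shares))
# {'data_warehouse': 40.0, 'marketing_sandbox': 20.0, 'sales_sandbox': 20.0, 'finance_sandbox': 20.0}
```

These figures are minimum guarantees under contention. Without a utilisation limit, any one PDB can still use idle capacity beyond its share.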
By using the “Default” profile we can simplify the whole process of adding and removing sandboxes/PDBs. As we add and remove sandboxes, the system resources are correctly rebalanced across all the plugged-in sandboxes/PDBs, using the settings specified in the default profile, as shown below.
In this latest post on sandboxing I have examined the “Observe” phase of our BOX’D sandbox lifecycle. With the new multitenant-aware Resource Manager we can configure policies to control how system resources are shared across sandboxes. Using Resource Manager it is possible to configure a policy so that the first tenant in a large, powerful server experiences a realistic share of the resources that will eventually be shared as other tenants are plugged in.
In the next post I will explore the next phase of our sandbox lifecycle, X-Charge, which will cover the metering and chargeback services for pluggable sandboxes.
Attribute clustering is a feature that’s new to Oracle Database 12.1.0.2 (Enterprise Edition). It was designed to work with other features, such as compression, storage indexes, and especially with a new feature called zone maps, but since attribute clustering has some useful benefits of its own I’ll cover them here and make zone maps the subject of a later post.
So what is attribute clustering? It is simply a table property - just like compression - but it defines how rows should be ordered and clustered together in close physical proximity, based on one or more column values. For example, in a sales transaction table you could choose to cluster together rows that share common customer ID values. Why would you consider doing this? Perhaps your system frequently queries sales data relating to particular customers, and perhaps there is a requirement for extremely fast and consistent response times (a call center CRM application would be a good example). Your physical database design will probably incorporate an index on the customer ID column already, but you can gain further benefit if sales table rows are physically near to one another when they share common customer IDs. The diagram below represents an index being scanned to find a particular customer, followed by some reads that fetch the corresponding sales table rows:
With attribute clustering, the matching sales table rows are near to one another, so it is likely that fewer database blocks will need to be read from storage (or database cache) than if the rows are scattered throughout the sales table. The reason for this is that database blocks will usually contain multiple rows, so it is beneficial if each block we read happens to contain multiple matching rows. Technically, attribute clustering improves index clustering factors, an important metric with regard to the efficiency of scanning indexes and fetching the corresponding table rows.
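A rough Python model shows how much difference row placement can make. It is a sketch under stated assumptions, not Oracle internals: ROWS_PER_BLOCK is an invented figure, only table blocks are counted (no index blocks), and the data simply mimics the MOD(ROWNUM, 1000) pattern used in the example later in this post.

```python
# Conceptual model of why clustering reduces block reads; not Oracle internals.
# ROWS_PER_BLOCK is an assumed figure; real block capacity depends on row and block size.
ROWS_PER_BLOCK = 500
ROW_COUNT = 100_000


def blocks_touched(customer_ids, wanted):
    """Count the distinct blocks holding at least one row for `wanted`."""
    return len({pos // ROWS_PER_BLOCK
                for pos, cust in enumerate(customer_ids) if cust == wanted})


# Rows in load order: customer IDs cycle 1, 2, ..., 999, 0, 1, ... like MOD(ROWNUM, 1000).
load_order = [n % 1000 for n in range(1, ROW_COUNT + 1)]

# The same rows after clustering by customer ID.
clustered = sorted(load_order)

print(blocks_touched(load_order, 50))  # 100
print(blocks_touched(clustered, 50))   # 1
```

In load order, each of the 100 rows for customer 50 lands in a different block. After clustering they sit side by side in a single block. The absolute numbers will not match the “consistent gets” you see in a real database, but the shape of the improvement is the same.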
Many DBAs have used a similar trick in the past by ordering rows as they are loaded into the database (using an explicit SQL “ORDER BY” clause). Attribute clustering has the advantage of being transparent and a property of the table itself; clustering behavior is inherited from the table definition and is implemented automatically. Just like compression, attribute clustering is a directive that transparently kicks in for certain operations, namely direct path insertion or data movement. This is especially useful because row clustering will occur during table and partition movement as well as during data load.
It's pretty common for database systems to order rows on data load or data movement to improve table compression ratios. Attribute clustering can be used instead to achieve a similar result.
The name “attribute clustering” might remind you of another database feature called Oracle Table Clusters, but be careful not to confuse the two. Oracle Table Clusters store rows from one or multiple tables in close proximity inside a specialized storage structure. Attribute clustering simply orders rows within a table (or its partitions and subpartitions); related rows will be physically close to one another, but they will not be held inside any new type of storage structure that's specific to attribute clustering.
Although attribute clustering is especially useful in conjunction with zone maps, it can be used as a transparent, declarative way to cluster or order table rows in order to improve:
Here’s an example of using attribute clustering to speed up a query. We'll compare before and after; so start by creating a table that is not attribute clustered:
CREATE TABLE sales_ac (sale_id NUMBER(10), customer_id NUMBER(10));
INSERT INTO sales_ac SELECT ROWNUM, MOD(ROWNUM,1000) FROM dual CONNECT BY LEVEL <= 100000;
EXEC dbms_stats.gather_table_stats(ownname=>NULL, tabname=>'SALES_AC');
CREATE INDEX sales_ac_cix ON sales_ac (customer_id);
SET AUTOTRACE ON STATISTIC
Our table is not very large, so I’ve included a hint in the test query to encourage the optimizer to use the index:
SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id) FROM sales_ac WHERE customer_id = 50;
Run the query a few times, and see what the reported value for “consistent gets” settles at. I get 96, but since the value depends on some storage defaults, don’t be surprised if your value is different:
Attribute clustering is a property of the table, so when it is added, existing rows are not re-ordered. The following command is very fast because it just makes a change to the data dictionary:
ALTER TABLE sales_ac ADD CLUSTERING BY LINEAR ORDER (customer_id) WITHOUT MATERIALIZED ZONEMAP;
Now we can physically cluster the existing table data by moving the table:
ALTER TABLE sales_ac MOVE;
Moving tables and partitions is much cleaner and simpler than the manual “ORDER BY” method, where we would have to create a new table, add indexes, drop the old table and then rename. The simpler MOVE approach is particularly relevant in real-world scenarios, where it would be more usual to move tables a partition or sub-partition at a time, potentially using on-line operations.
Rebuild the index:
ALTER INDEX sales_ac_cix REBUILD;
Use the same test query:
SELECT /*+ INDEX(sales_ac sales_ac_cix) */ COUNT(DISTINCT sale_id) FROM sales_ac WHERE customer_id = 50;
Again, run it a number of times to settle the reported value for “consistent gets”. In my case, I now read 3 database blocks instead of 96: a considerable improvement!
Full details on zone maps and attribute clustering can be found in the Oracle documentation, particularly the Oracle 12c Data Warehousing Guide.
The zone maps blog post includes some more on attribute clustering, plus links to a bunch of scripts for you to try out.
Data Warehousing and Big Data were at the heart of this year’s OpenWorld conference, featuring in a number of keynotes and a huge number of general sessions. Our hands-on labs were all completely full as people got valuable hands-on time with our most important new features. The key areas at this year’s conference were:
All these topics appeared in the main keynote sessions, including live on-stage demonstrations of how each feature can be used to increase the performance and analytical capability of your data warehouse.
If you want to revisit the most important sessions, or if you simply missed this year’s conference and want to catch up on all the most important topics, then I have put together a book of the highlights from this year’s conference. The booklet is divided into the following sections:
You can download my review in PDF format by clicking here. Hope this proves useful and if I missed anything then let me know.
Today, Oracle is using big data technology and concepts to significantly improve the effectiveness of its support operations, starting with its hardware support group. While the company is just beginning this journey, the initiative is already delivering valuable benefits.
In 2013, Oracle’s hardware support group began to look at how it could use automation to improve support quality and accelerate service request (SR) resolution. Its goal is to use predictive analytics to automate SR resolution with 80% to 95% accuracy.
Oracle’s support group gathers a tremendous amount of data. Each month, for example, it logs 35,000 new SRs and receives nearly 6 TB of telemetry data via automated service requests (ASRs)—which represent approximately 18% of all SRs. Like many organizations, Oracle had a siloed view of this data, which hindered analysis. For example, it could look at SRs but could not analyze the associated text, and it could review SRs and ASRs separately, but not together.
Oracle was conducting manual root-cause analysis to identify which types of SRs were the best candidates for automation. This was a time-consuming, difficult, and costly process, and the company looked to introduce big data and predictive analytics to automate insight.
The team knew that it had to walk before it could run. It started by taking information from approximately 10 silos, such as feeds from SRs and ASRs, parts of databases, and customer experience systems, and migrating the information to an Oracle Endeca Information Discovery environment. Using the powerful Oracle Endeca solution, Oracle could look at SRs, ASRs, and associated notes in a single environment, which immediately yielded several additional opportunities for automation. On the first day of going live with the solution, Oracle identified 4% more automation opportunities.
Next, Oracle focused its efforts on gaining insight in near real time, leveraging the parallel processing of Hadoop to automatically feed Oracle Endeca Information Discovery—dramatically improving data velocity. Oracle’s first initiative with this new environment looked at Oracle Solaris SRs. In the first few weeks of that project, Oracle identified automation opportunities that will increase automated SR resolution from less than 1% to approximately 5%—simply by aggregating all of the data in near real-time.
Once Oracle proved via these early proofs of concept that it could process data more efficiently and effectively to feed analytical projects, it began to deploy Oracle Big Data Appliance and Oracle Exalytics In-Memory Machine.
Looking around northern California and inside many technology kitchens makes me believe that we are about to see the Data Scientist bubble burst. And then I read the Fortune Magazine article on Peter Thiel, and the excerpt from Zero to One (his new book) in that article, and it dawned on me that this is one of the interesting ways to look at the Data Scientist bubble.
Without trying to simplify and/or bastardize Mr. Thiel's theory, the example in the Fortune Mag article will make this visible to most people (I hope). In the article the analogy is: going from one typewriter to 100 typewriters is 1 to N; inventing a word processor is moving us from 0 to 1. In other words, true innovation dramatically changes things by giving previously unknown power to the masses. It is that innovation that moves us from 0 to 1. Expansion of existing ideas - not true innovation - moves us from 1 to N. Of course, don't take my word on this but read the article or the book...
The above paradigm explains the Data Scientist bubble quite nicely. Once upon a time companies hired a few PhD students who by chance had a degree in statistics, had learned how to program and had figured out how to deal with (large) data sets. These newly minted data scientists proved that there is potential value in mashing data together and running analytics on these newly created data sets, and thus caused a storm of publicity. Companies large and small are now frantically trying to hire these elusive data scientists or, somewhat more down to earth, are creating data scientists (luckily not in the lab) by forming teams whose members each bring a part of the skillset to the table.
This approach all starts to smell pretty much like a whole busload of typewriters being thrown at a well-known data analysis and data wrangling problem. Neither the problem nor the solution is new or innovative. Data Scientists are therefore not moving us from 0 to 1...
One could argue that while the data scientist quest is not innovative, at least it solves the problem of doing analytics. Fair and by some measure correct, but there is one bigger issue with the paradigm of "data scientists will solve our analytics problem" and that is scale. Giving the keys to all that big data to only a few data scientists is not going to work because these smart and amazing people are now becoming, often unbeknownst to them, an organizational bottleneck to gaining knowledge from big data.
The only real solution, our 0 to 1, is to expose a large number of consumers to all that big data, while enabling these consumers to apply a lot of the cool data science to all that data. In other words, we need to provide tools which include data science smarts. Those tools will enable us to apply the 80% common data science rules to the 80% of common business problems. This approach drives real business value at scale. With large chunks of issues resolved, we can then focus our few star data scientists on the 20% of problems or innovations that drive competitive advantage and change markets.
The bubble is bursting because I am seeing more and more tools coming to market (soon) that will drive data science into the day-to-day job of all business people. Innovation is not the building of a better tool for data scientists or hiring more of them; instead, the real 0 to 1 innovation is tools that make all of us data scientists and let us solve our own data science problems. The future of Data Science is smarter tools, not smarter humans.
Since the term big data first appeared in our lexicon of IT and business technology it has been intrinsically linked to the no-SQL, or anything-but-SQL, movement. However, we are now seeing that SQL is experiencing a renaissance. The term “noSQL” has softened to a much more realistic "not-only-SQL" approach. And now there is an explosion of SQL-based implementations designed to support big data. Leveraging the Hadoop ecosystem, there are Hive, Stinger, Impala, Shark, Presto and many more. NoSQL systems such as Cassandra are also adopting flavors of SQL. Why is there a growing level of interest in the re-emergence of SQL? Probably, a more pertinent question is: did SQL ever really go away? Proponents of SQL often cite the following explanations for the re-emergence of SQL for analysis:
However, despite the virtues of these explanations, they alone do not explain the recent proliferation of SQL implementations. Consider this: how often does the open-source community embrace a technology just because it is the corporate orthodoxy? The answer is: probably not ever. If the open-source community believed that there was a better language for basic data analysis, they would be implementing it. Instead, a huge range of emerging projects, as mentioned earlier, have SQL at their heart. The simple conclusion is that SQL has emerged as the de facto language for big data because, frankly, it is technically superior. Let’s examine the four key reasons for this:
The concept of SQL is underpinned by the relational algebra - a consistent framework for organizing and manipulating sets of data - and the SQL syntax concisely and intuitively expresses this mathematical system.
Most business users, data analysts and even data scientists think about data within the context of a spreadsheet. If you think about a spreadsheet containing a set of customer orders then what do most people do with that spreadsheet? Typically, they might filter the records to look only at the customer orders for a given region. Alternatively, they might hide some columns: maybe the customer address is not needed for a particular piece of analysis, but the customer name and their orders are important data points. Finally, they might add calculations to compute totals and/or perhaps create a cross tabular report.
Within the language of SQL these are common steps: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are core operators in SQL. The vast majority of people have found the fundamental SQL query constructs to be a straightforward and readable representation of everyday data analysis operations.
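To make the spreadsheet analogy concrete, here is a minimal sketch that runs the three operators against a tiny customer-orders table. It uses Python's built-in sqlite3 module purely as a convenient stand-in SQL engine; the table name, column names and data are hypothetical and not taken from any real system.

```python
import sqlite3

# Hypothetical customer-orders "spreadsheet"; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer TEXT, address TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("Ann", "1 Main St", "West", 100.0),
        ("Ann", "1 Main St", "West", 50.0),
        ("Bob", "2 Oak Ave", "East", 75.0),
        ("Cy", "3 Elm Rd", "West", 20.0),
    ],
)

# Projection: SELECT lists only the columns we need (the address stays hidden).
# Filter: WHERE keeps only the orders for one region.
# Aggregation: GROUP BY computes one total per customer.
totals = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE region = 'West'
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

print(totals)
```

Each clause maps onto one of the spreadsheet actions described above: hiding columns, filtering rows and adding totals.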
When a developer writes a SQL query, he or she simply describes the results that they want. The developer does not have to get into any of the nitty-gritty of describing how to get the results.
This type of approach is often referred to as 'declarative programming', and it makes the developer's job easier. Even the simplest SQL query illustrates the benefits of declarative programming:
SELECT day, prcp, temp FROM weather
WHERE city = 'San Francisco' AND prcp > 0.0;
SQL engines may have multiple ways to execute this query (for example, by using an index). Fortunately the developer doesn't need to understand any of the underlying database processing techniques. The developer simply specifies the desired set of data using projections (SELECT) and filters (WHERE).
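As a rough illustration of this point, the sketch below runs the weather query twice against SQLite (a stand-in engine, not Oracle Database): once without an index and once after adding one. The sample rows and the index name weather_city_idx are assumptions made for this example. The query text never changes; only the engine's plan may.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, day TEXT, prcp REAL, temp REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?, ?)",
    [
        ("San Francisco", "day 1", 0.0, 18.0),
        ("San Francisco", "day 2", 0.3, 16.0),
        ("Hayward", "day 2", 0.1, 19.0),
    ],
)

# The same declarative query as in the text; it says WHAT we want, not HOW.
QUERY = (
    "SELECT day, prcp, temp FROM weather "
    "WHERE city = 'San Francisco' AND prcp > 0.0"
)


def run_and_explain(connection):
    """Return the query result plus the engine's description of its plan."""
    rows = connection.execute(QUERY).fetchall()
    plan_rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The last column of each plan row holds the human-readable detail.
    plan = " ".join(str(row[-1]) for row in plan_rows)
    return rows, plan


rows_before, plan_before = run_and_explain(conn)

# Adding an index gives the engine a second way to execute the same query.
conn.execute("CREATE INDEX weather_city_idx ON weather (city)")
rows_after, plan_after = run_and_explain(conn)

print(plan_before)
print(plan_after)
```

The answer is identical in both runs. Whether the engine actually switches to the index is its own decision, which is exactly the detail the developer no longer has to manage.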
This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same "query", the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.
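The difference in effort can be sketched with a small comparison. The code below is not Hadoop code: it imitates the map, shuffle and reduce steps in plain Python and then expresses the same calculation (total precipitation per city on rainy days, a hypothetical question chosen for this example) as one SQL statement run through SQLite.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (city, day, precipitation).
ROWS = [
    ("San Francisco", "day 1", 0.0),
    ("San Francisco", "day 2", 0.25),
    ("Hayward", "day 2", 0.5),
    ("Hayward", "day 3", 0.25),
]


def map_phase(rows):
    """Emit (city, prcp) pairs for rainy days; the developer codes the filter."""
    for city, _day, prcp in rows:
        if prcp > 0.0:
            yield city, prcp


def shuffle_phase(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key."""
    return sorted((city, sum(values)) for city, values in groups.items())


mapreduce_result = reduce_phase(shuffle_phase(map_phase(ROWS)))

# The same question, stated declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, day TEXT, prcp REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", ROWS)
sql_result = conn.execute(
    "SELECT city, SUM(prcp) FROM weather "
    "WHERE prcp > 0.0 GROUP BY city ORDER BY city"
).fetchall()

print(mapreduce_result)
print(sql_result)
```

Both approaches return the same totals, but the procedural version spells out every processing step while the SQL version only names the result.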
The fact that SQL is a declarative language not only shields the developer from the complexities of the underlying query techniques, but also gives the underlying SQL engine a lot of flexibility in how to optimize any given query.
In a lot of programming languages, if the code runs slowly, then it's the programmer's fault. For the SQL language, however, if a SQL query runs slowly, then it's the SQL engine's fault.
This is where analytic databases really earn their keep – databases can easily innovate ‘under the covers’ to deliver faster performance; parallelization techniques, query transformations, indexing and join algorithms are just a few key areas of database innovation that drive query performance.
SQL provides a robust framework that adapts to new requirements
SQL has stayed relevant over the decades because, even though its core is grounded in universal data processing techniques, the language itself can be extended with new processing techniques and new calculations. Simple time-series calculations, statistical functions, and pattern-matching capabilities have all been added to SQL over the years.
Consider, as a recent example, what many organizations realized as they started to ask queries such as 'how many distinct visitors came to my website last month?' These organizations realized that it is not vital to have a precise answer to this type of query ... an approximate answer (say, within 1%) would be more than sufficient. This requirement has now been quickly delivered by implementing the existing HyperLogLog algorithms within SQL engines for 'approximate count distinct' operations.
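For readers curious about what sits behind an 'approximate count distinct' operation, here is a minimal sketch of the basic HyperLogLog idea in plain Python. It is a simplified teaching version, not the implementation used by any particular SQL engine; the function name hll_estimate, the choice of SHA-1 as the hash and the register count are all assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items with a basic HyperLogLog sketch."""
    m = 1 << p                       # number of registers (4096 when p = 12)
    registers = [0] * m
    for item in items:
        # 64-bit hash taken from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[idx]:
            registers[idx] = rank
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# Hypothetical web log: 150,000 visits from 50,000 distinct visitors.
visits = ["visitor-%d" % (i % 50000) for i in range(150000)]
exact = len(set(visits))
estimate = hll_estimate(visits)

print(exact, round(estimate))
```

With 4096 registers the expected relative error is roughly 1.6%, so reaching the 1% mentioned above would need more registers. The appeal is that the sketch needs only a few kilobytes of state, however many rows are scanned.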
More importantly, SQL is a language that is not explicitly tied to a storage model. While some might think of SQL as synonymous with relational databases, many of the new adopters of SQL are built on non-relational data. SQL is well on its way to being a standard language for accessing data stored in JSON and other serialized data structures.
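A small sketch can show what querying non-relational data with SQL looks like. It stores raw JSON documents in a single text column and queries them through SQLite; json_get is a hypothetical helper registered from Python for this example (it is not a built-in of SQLite or Oracle Database), and the field names are invented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")


def json_get(document, key):
    """Return one top-level field from a JSON document, or None if absent."""
    return json.loads(document).get(key)


# Register the hypothetical helper so SQL statements can call it.
conn.create_function("json_get", 2, json_get)

# Each row is an unparsed JSON document, as it might appear in a log file.
conn.execute("CREATE TABLE click_log (doc TEXT)")
conn.executemany(
    "INSERT INTO click_log VALUES (?)",
    [
        (json.dumps({"custid": 1, "movieid": 10, "rating": 4}),),
        (json.dumps({"custid": 2, "movieid": 10, "rating": 2}),),
        (json.dumps({"custid": 1, "movieid": 11}),),  # no rating recorded
    ],
)

# Ordinary SQL operators applied to fields pulled out of the documents.
avg_ratings = conn.execute(
    """
    SELECT json_get(doc, 'movieid') AS movieid,
           AVG(json_get(doc, 'rating')) AS avg_rating
    FROM click_log
    WHERE json_get(doc, 'rating') IS NOT NULL
    GROUP BY movieid
    ORDER BY movieid
    """
).fetchall()

print(avg_ratings)
```

The storage format here is schemaless text, yet the filter and aggregation are written exactly as they would be against a relational table.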
SQL is an immensely popular language today … and if anything its popularity is growing as the language is adopted for new data types and new use cases. The primacy of SQL for big data is not simply a default choice, but a conscious realization that SQL is the best suited language for basic analysis.
PS. Next week, many sessions at this year’s OpenWorld will focus on the power, richness and performance of SQL for sophisticated data analysis including the following:
Monday September 28
Using Analytical SQL to Intelligently Explore Big Data @ 4:00PM Moscone North 131
Joerg Otto - Head of Database Engineering, IDS GmbH
Marty Gubar - Director, Oracle
Keith Laker - Senior Principal Product Manager, Data Warehousing and Big Data, Oracle
YesSQL! A Celebration of SQL and PL/SQL @ 6:00PM Moscone South 103
Tuesday September 29
SQL Is the Best Development Language for Big Data @ 10:45AM Moscone South 104
Enjoy OpenWorld 2014 and if you have time please come and meet the Analytical SQL team in the Moscone South Exhibition Hall. We will be on the Parallel Execution and Advanced SQL Processing demo booth (id 3720).
Here's the inventory of all the features and versions:
Just in time for Oracle OpenWorld, the new Big Data Information Management Reference Architecture is posted on our OTN pages. The reference architecture attempts to create order in the wild west of new technologies and the flurry of new ideas and, most importantly, tries to go from marketing hype to a real, implementable architecture.
To get all the details, read the paper here. Thanks to the EMEA architecture team, the folks at Rittman Mead Consulting and all others involved.
Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL. The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.
Let's take a look at what it takes to get started and then preview this new capability.
The first thing you need to do is set up a JDBC connection to Hive. Follow these steps to set up the connection:
Cloudera provides high performance JDBC drivers that are required for connectivity:
unzip Cloudera_HiveJDBC4_126.96.36.1996.zip -d /home/oracle/jdbc/
Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.
Update the preferences in SQL Developer to leverage the new drivers:
Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2. Click the New Connection button in the SQL Developer toolbar. You'll need to have an ID, password and the port where Hiveserver2 is running:
hive which connects to Hiveserver2 on localhost running on port 10000. The Database field is optional; here we are specifying the
The Hive connection is now treated like any other connection in SQL Developer. The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:
Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables. Let's create a few tables using the metadata from the Hive Metastore. Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item. You will see the same connections found in the SQL Developer connection navigator:
After selecting the hive connection and a database, select the tables to import:
There could be any number of tables here - in our case we will select three tables to import. After completing the import, the logical table definitions appear in our palette:
You can update the logical table definitions - and in our case we will want to do so. For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000). We have domain knowledge and understand that this field is really much smaller - so we'll update it to the
Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c. Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:
The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing. For example, edit table movieapp_log_json:
You now have full Oracle SQL access to data across the platform. In our example, we can combine data from Hadoop with data in our Oracle Database. The data in Hadoop can be in any format - Avro, JSON, XML, CSV - and if there is a SerDe that can parse the data, then Big Data SQL can access it! Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:
Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)
The data warehouse insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.