Friday Apr 24, 2015

Managing overflows in LISTAGG

This is an interesting problem that has come up a few times in discussions (and I think it has been mentioned on the SQL forums as well).  When using LISTAGG on very large data sets you can sometimes create a list that is too long and consequently get an ORA-01489: result of string concatenation is too long  error.

Obviously, it is possible to determine in advance if the error is going to occur using some bespoke PL/SQL code and then take appropriate action within your application to manage the rows that contain stings that exceed the VARCHAR2 limit.

Many customers have implemented workarounds to overcome the ORA-01489 error, however, this has mostly involved the use of complex code which has impacted performance. Wouldn’t it be great if there was a simple yet elegant way to resolve this issue? Actually there is and we can use a few of the most recent analytical SQL functions. If you are using Database 12c you can make use of the MATCH_RECOGNIZE function to effectively create chunks of strings that do not exceed the VARCHAR2 limit...

[Read More]

Tuesday Apr 14, 2015

Statement of Direction -- Big Data Management System

Click here to start reading the Full Statement of Direction. 

Introduction: Oracle Big Data Management System Today 

As today's enterprises embrace big data, their information architectures must evolve. Every enterprise has data warehouses today, but the best-practices information architecture embraces emerging technologies such as Hadoop and NoSQL. Today’s information architecture recognizes that data not only is stored in increasingly disparate data platforms, but also in increasingly disparate locations: on-premises and potentially multiple cloud platforms. The ideal of a single monolithic ‘enterprise data warehouse’ has faded as a new more flexible architecture has emerged. Oracle calls this new architecture the Oracle Big Data Management System, and today it consists of three key components

  • The data warehouse, running on Oracle Database and Oracle Exadata Database Machine, is the primary analytic database for storing much of a company’s core transactional data: financial records, customer data, point- of-sale data and so forth. Despite now being part of a broader architecture, the requirements on the RDBMS for performance, scalability, concurrency and workload management are in more demand than ever; Oracle Database 12c introduced Oracle Database In-Memory (with columnar tables, SIMD processing, and advanced compression schemes) as latest in a long succession of warehouse-focused innovations. The market-leading Oracle Database is the ideal starting point for customers to extend their architecture to the Big Data Management System.
  • The ‘data reservoir’, hosted on Oracle Big Data Appliance, will augment the data warehouse as a repository for the new sources of large volumes of data: machine-generated log files, social-media data, and videos and images -- as well as a repository for more granular transactional data or older transactional data which is not stored in the data warehouse. Oracle’s Big Data Management System embraces complementary technologies and platforms, including open-source technologies: Oracle Big Data Appliance includes Cloudera’s Distribution of Hadoop and Oracle NoSQL Database for data management.
  • A ‘franchised query engine,’ Oracle Big Data SQL, enables scalable, integrated access in situ to the entire Big Data Management System. SQL is the accepted language for day-to-day data access and analytic queries, and thus SQL is the primary language of the Big Data Management System.  Big Data SQL enables users to combine data from Oracle Database, Hadoop and NoSQL sources within a single SQL statement.  Leveraging the architecture of Exadata Storage Software and the SQL engine of the Oracle Database, Big Data SQL delivers high-performance access to all data in the Big Data Management System.

Using this architecture, the Oracle Big Data Management System combines the performance of Oracle’s market-leading relational database, the power of Oracle’s SQL engine, and the cost-effective, flexible storage of Hadoop and NoSQL. The result is an integrated architecture for managing Big Data, providing all of the benefits of Oracle Database, Exadata, and Hadoop, without the drawbacks of independently-accessed data repositories.  

Note that the scope of this statement of direction is the data platform for Big Data. An enterprise Big Data solution would also be comprised of big data tools and big data applications built upon this data platform. 

Read the full Statement of Direction -- Big Data Management System here.


Friday Mar 27, 2015

Why SQL Part 4 - Intelligent and continuous evolution

In the third part of this series of posts on (Why SQL Part 3 - Simple optimization makes life simplerof this series of blog posts I explained that by letting the database perform the optimisations at source, where the data is located, it is possible to share tuning improvements across a wide spectrum of front-end systems (BI reports, dashboards, applications etc). The ability of the database to “own” the optimization and processing characteristics of how a SQL statement is executed is an important element in the success of SQL as a data analysis language. In the whitepaper that forms the basis for this series of blog posts this section is quite brief. This was done for reasons that will become obvious as you work through this blog post

Now that we have looked at the powerful framework that supports SQL and the simplified optimization layer we can move on to look at how SQL has evolved over time to support ever more sophisticated features and functions and compare this with some of the newer open source frameworks.

[Read More]

Thursday Mar 19, 2015

Why SQL Part 3 - Simple optimization makes life simpler

In the second part of this series of posts on (Why SQL Part 2 - It has a powerful frameworkof this series of blog posts I explained the underlying framework that SQL is based on. That framework is referred to as “relational algebra” and it was first described in 1970 by E.F. Codd while at IBM . It provides well-founded semantics for both modelling and querying a set of data. Codd's framework established a series of basic principles that govern the construction and execution of a query and these can be summarized as follows:

  1. Projection
  2. Filter
  3. Join
  4. Aggregate

Now that we have examined the basics of the relational model let’s now move on to how the language actually determines the best way to answer a question based on its use of simplified and transparent optimizations. To understand why simplified and transparent optimization is so important it is useful to consider two of the most common programming paradigms and these are: procedural and declarative. Let’s explore these two approaches...

[Read More]

Wednesday Mar 18, 2015

Production workloads blend Cloud and On-Premise Capabilities

Prediction #7 - blending production workloadsacross cloud and on-premise in Oracle's Enterprise Big Data Predictions 2015 is a tough nut to crack. Yet, we at Oracle think this is really the direction we all will go. Sure we can debate the timing, and whether or not this happens in 2015, but it is something that will come to all of us who are looking towards that big data future. So let’s discuss what we think is really going to happen over the coming years in the big data and cloud world.

Reality #1 – Data will live both in the cloud and on-premise

We see this today. Organizations run Human Capital Management systems in the cloud, integrate these with data from outside cloud based systems (think for example LinkedIn, staffing agencies etc.) while their general data warehouses and new big data systems are all deployed as on-premise systems. We also see the example in the prediction where various auto dealer systems uplink into the cloud to enable the manufacturer to consolidate all of their disparate systems. This data may be translated into data warehouse entries and possibly live in two worlds – both in the cloud and on-premise for deep diving analytics or in aggregated form.

Reality #2 – Hybrid deployments are difficult to query and optimize

We also see this today and it is one of the major issues of living in the hybrid world of cloud and on-premise. A lot of the issues are driven by low level technical limitations, specifically in network bandwidth and upload / download capacity into and out of the cloud environment. The other challenges are really (big) data management challenges in that they go into the art of running queries across two ecosystems with very different characteristics. We see a trend to use engineered systems on-premise, which delivers optimized performance for the applications, but in the cloud we often see virtualization pushing the trade-off towards ease of deployment and ease of management. These completely different ecosystems make optimization of queries across them very difficult.

Solution – Equality brings optimizations to mixed environments

As larger systems like big data and data warehouse systems move to the cloud, better performance becomes a key success criterion. Oracle is uniquely positioned to drive both standardization and performance optimizations into the cloud by deploying on engineered systems like Oracle Exadata and Oracle Big Data Appliance. Deploying engineered systems enables customers to run large systems in the cloud delivering performance as they see today in on-premise deployments. This then means that we do not live in a world divided in slow and fast, but in a world of fast and fast.
This equivalence also means that we have the same functionality in both worlds, and here we can sprinkle in some – future – Oracle magic, where we start optimizing queries to take into account where the data lives, how fast we can move it around (the dreaded networking bandwidth issue) and where we need to execute code. Now, how are we going to do this? That is a piece magic, and you will just need to wait a bit… suffice it to say we are hard at work at solving this challenging topic.

Monday Mar 09, 2015

Why SQL Part 2 - It has a powerful framework

In the first part (Why SQL is the natural language for data analysis) of this series of blog posts I explained why analytics is important to the business, how the volume of data along with the types of data is continuing to expand and why that makes it vital that you select the right language for data analysis. Many of us work with SQL every day and take for granted many of its unique features, its power, flexibility and the richness of the analytics.

This familiarity with SQL means that sometimes we are a bit slow at preventing some of our projects investing in other big data languages such as MapReduce, Impala, Spark, Java and many of the new generation of SQL-like open source projects. While many of these new languages are considered “cool”, it is very easy to end up building new proprietary data silos or investing in a language that eventually is replaced by another open source project or investing time and effort in code that eventually fails to deliver the required analytics.

One of the aims of this series of blog posts is to refresh your memory about why SQL has been so successful in the area of analytics. SQL has four unique features that make it perfect for data analysis...

    [Read More]

    Tuesday Mar 03, 2015

    Are you leveraging Oracle's database innovations for Cloud and Big data?

    If you are interested in big data, Hadoop, SQL and data warehousing then mark your calendars because on March 18th at 10:00AM PST/1:00PM EST, you will be able to hear Tom Kyte (Oracle Database Architect) talk about how you can use Oracle Big Data SQL to seamlessly integrate all your Hadoop big data datasets with your relational schemas stored in Oracle Database 12c. As part of this discussion Tom will outline how you can build the perfect foundation for your enterprise big data management system using Oracle's innovative technology.

    If you are working on a data warehousing project and/or a big data project then this is one webcast you will not want to miss so register today (click here) to hear the latest about Oracle Database innovations and best practices. The full list of speakers is:

    Tom Kyte
    Oracle Database Architect
    Keith Wilcox
    VP, Database Administration
    Epsilon
    Bill Callahan
    Director, Principal Engineer,
    CCC Information Services, Inc.

    Why SQL is the natural language for data analysis

    Analytics is a must-have component of every corporate data warehousing and big data project. It is the core driver for the business: the development of new products, better targeting of customers with promotions, hiring of new talent and retention of existing key talent. Yet the analysis of especially “big data environments”, data stored and processed outside of classical relational systems, continues to be a significant challenge for the majority companies. According to Gartner, 72% of companies are planning to increase their expenditure on big data yet 55% state they don’t have the necessary skills to make use of it. 

    The objective of this series of articles, which will appear over the coming weeks, is to explain why SQL is the natural language for amy kind of data analysis including big data and the benefits that this brings for application developers, DBAs and business users. 

    [Read More]

    Monday Sep 29, 2014

    The End of the Data Scientist Bubble...

    Looking around northern California and inside many technology kitchens makes me believe that we are about to see the Data Scientist bubble burst. And then I read the Fortune Magazine article on Peter Thiel - and the excerpt on Zero to One (his new book) in that article and it dawned on me that is one of the intersting ways to look at the Data Scientist bubble.

    Thiel's Classification of Innovation

    Without trying to simplify and/or bastardize mr. Thiel's theory, the example in the Fortune Mag article will make this visible to most people (I hope). In the article the analogy is; going from one type writer to 100 type writers is 1 to N, inventing a word processor is moving us from 0 to 1. In other words, true innovation dramatically changes things by giving previously unknown power to the masses. It is that innovation that moves us from 0 to 1. Expansion of existing ideas - not true innovation - moves us from 1 to N. Of course, don't take my word on this but read the article or the book...

    The Demise of the Human Data Scientist

    The above paradigm explains the Data Scientist bubble quite nicely. Once upon a time companies hired a few PhD students who by chance had a degree in statistics and had learned how to program and figured out how to deal with (large) data sets. These newly minted data scientists proved that there is potential value in mashing data together, running analytics on these newly created data sets and thus caused a storm of publicity. Companies large and small are now frantically trying to hire these elusive data scientists, or something a little more down to earth, are creating data scientists (luckily not in the lab) by forming teams that bring a part of the skillset to the table.

    This approach all starts to smell pretty much like a whole busload of typewriters being thrown at a well-known data analysis and data wrangling problem. Neither the problem nor the solution are new, nor innovative. Data Scientists are therefore not moving us from 0 to 1...

    One could argue that while the data scientist quest is not innovative, at least is solves the problem of doing analytics. Fair and by some measure correct, but there is one bigger issue with the paradigm of "data scientists will solve our analytics problem" and that is scale. Giving the keys to all that big data to only a few data scientists is not going to work because these smart and amazing people are now becoming, often unbeknownst to them, an organizational bottleneck to gaining knowledge from big data.

    The only real solution, our 0 to 1, is to expose a large number of consumers to all that big data, while enabling these consumers to apply a lot of the cool data science to all that data. In other words, we need to provide tools which include data science smarts. Those tools will enable us to apply the 80% common data science rules to the 80% of common business problems. This approach drives real business value at scale. With large chunks of issues resolved, we can then focus our few star data scientists on the 20% of problems or innovations that drive competitive advantage and change markets.

    My Conclusion

    The bubble is bursting because what I am seeing is more and more tools coming to market (soon) that will drive data science into the day-to-day job of all business people. Innovation is not the building of a better tool for data scientists or hiring more of them, instead the real 0 to 1 innovation is tools that make make all of us data scientists and lets us solve our own data science problems. The future of Data Science is smarter tools, not smarter humans.

    Friday Sep 26, 2014

    Why SQL is becoming the goto language for Big Data analysis

    Since the term big data first appeared in our lexicon of IT and business technology it has been intrinsically linked to the no-SQL, or anything-but-SQL, movement. However, we are now seeing that SQL is experiencing a renaissance. The term “noSQL” has softened to a much more realistic approach "not-only-SQL" approach. And now there is an explosion of SQL-based implementations designed to support big data. Leveraging the Hadoop ecosystem, there is: Hive, Stinger, Impala, Shark, Presto and many more. Other NoSQL vendors such as Cassandra are also adopting flavors of SQL. Why is there a growing level of interest in the reemergence of SQL? Probably, a more pertinent question is: did SQL ever really go away? Proponents of SQL often cite the following explanations for the re-emergence of SQL for analysis:

    1. There are legions of developers who know SQL. Leveraging the SQL language allows those developers to be immediately productive.
    2. There are legions of tools and applications using SQL today.
    3. Any platform that provides SQL will be able to leverage the existing SQL ecosystem.

    However, despite the virtues of these explanations, they alone do not explain the recent proliferation of SQL implementations. Consider this: how often does the open-source community embrace a technology just because it is the corporate orthodoxy? The answer is: probably not ever. If the open-source community believed that there was a better language for basic data analysis, they would be implementing it. Instead, a huge range of emerging projects, as mentioned earlier, have SQL at their heart The simple conclusion is that SQL has emerged as the de facto language for big data because, frankly, it is technically superior. Let’s examine the four key reasons for this:

    1. SQL is a natural language for data analysis.
    2. SQL is a productive language for writing queries.
    3. SQL queries can be optimised.
    4. SQL is extensible.

    1. SQL is a natural language for data analysis.

    The concept of SQL is underpinned by the relational algebra - a consistent framework for organizing and manipulating sets of data - and the SQL syntax concisely and intuitively expresses this mathematical system.

    Most business users, data analysts and even data scientists think about data within the context of a spreadsheet. If you think about a spreadsheet containing a set of customer orders then what do most people do with that spreadsheet? Typically, they might filter the records to look only at the customer orders for a given region. Alternatively, they might hide some columns: maybe the customer address is not needed for a particular piece of analysis, but the customer name and their orders are important data points. Finally, they might add calculations to compute totals and/or perhaps create a cross tabular report.

    Within the language of SQL these are common steps: 1) projections (SELECT), 2) filters and joins (WHERE), and 3) aggregations (GROUP BY). These are core operators in SQL. The vast majority of people have found the fundamental SQL query constructs to be straightforward and readable representation of everyday data analysis operations.

    2. SQL is a productive language for writing queries.

    When a developer writes a SQL query, he or she simply describes the results that they want. The developer does not have to get into any of the nitty-gritty of describing how to get the results 

    This type of approach is often referred to as  'declarative programming,’ and it makes the developer's job easier. Even the simplest SQL query illustrates the benefits of declarative programming:

    SELECT day, prcp, temp FROM weather
    WHERE city = 'San Francisco' AND prcp > 0.0;

    SQL engines may have multiple ways to execute this query (for example, by using an index). Fortunately the developer doesn't need to understand any of the underlying database processing techniques. The developer simply specifies the desired set of data using projections (SELECT) and filters (WHERE).

    This is perhaps why SQL has emerged as such an attractive alternative to the MapReduce framework for analyzing HDFS data. MapReduce requires the developer to specify, at each step, how the underlying data is to be processed. For the same “query", the code is longer and more complex in MapReduce. For the vast majority of data analysis requirements, SQL is more than sufficient, and the additional expressiveness of MapReduce introduces complexity without providing significant benefits.


    3. SQL queries can be optimized

    The fact that SQL is a declarative language not only shields the developer from the complexities of the underlying query techniques, but also gives the underlying SQL engine has a lot of flexibility in how to optimize any given query. 

    In a lot of programming languages, if the code runs slow, then it's the programmer's fault. For the SQL language, however, if a SQL query runs slow, then it's the SQL engine's fault.

    This is where analytic databases really earn their keep – databases can easily innovate ‘under the covers’ to deliver faster performance; parallelization techniques, query transformations, indexing and join algorithms are just a few key areas of database innovation that drive query performance.

    4. SQL is extensible

    SQL provides a robust framework that adapts to new requirements

    SQL has stayed relevant over the decades because, even though its core is grounded in universal data processing techniques, the language itself can be extended with new processing techniques and new calculations. Simple time-series calculations, statistical functions, and pattern-matching capabilities have all been added to SQL over the years. 

    Consider, as a recent example, what many organizations realized as they started to ask queries such as 'how many distinct visitors came to my website last month?' These organizations realized that it is not vital to have a precise answer to this type of query ... an approximate answer (say, within 1%) would be more than sufficient. This has requirement has now been quickly delivered by implementing the existing hyperloglog algorithms within SQL engines for 'approximate count distinct' operations. 

    More importantly, SQL is a language that is not explicitly tied to a storage model. While some might think of SQL as synonymous with relational databases, many of the new adopters of SQL are built on non-relational data. SQL is well on its way to being a standard language for accessing data stored in JSON and other serialized data structures.  

    Summary

    SQL is an immensely popular language today … and if anything its popularity is growing as the language is adopted for new data types and new use cases. The primacy of SQL for big data is not simply a default choice, but a conscious realization that SQL is the best suited language for basic analysis

    PS. Next week, many sessions at this year’s OpenWorld will focus on the power, richness and performance of SQL for sophisticated data analysis including the following:

    Monday September 28

    Using Analytical SQL to Intelligently Explore Big Data @ 4:00PM Moscone North 131

    Joerg Otto - Head of Database Engineering, IDS GmbH
    Marty Gubar - Director, Oracle
    Keith Laker - Senior Principal Product Manager, Data Warehousing and Big Data, Oracle


    YesSQL! A Celebration of SQL and PL/SQL @ 6:00PM Moscone South 103

    Steven Feuerstein - Architect, Oracle
    Thomas Kyte - Architect, Oracle


    Tuesday September 29

    SQL Is the Best Development Language for Big Data @ 10:45AM Moscone South 104

    Thomas Kyte - Architect, Oracle

    Enjoy OpenWorld 2014 and if you have time please come and meet the Analytical SQL team in the Moscone South Exhbition Hall. We will be on the Parallel Execution and Advanced SQL Processing demo booth (id 3720).

    Tuesday Sep 23, 2014

    Big Data IM Reference Architecture

    Just in time for Oracle Openworld, the new Big Data Information Management Reference Architecture is posted on our OTN pages. The reference architecture attempts to create order in the wild west of new technologies, the flurry of new ideas and most importantly tries to go from marketing hype to a real, implementable architecture.

    To get all the details, read the paper here. Thanks to the EMEA architecture team , the folks at Rittman Mead Consulting and all others involved.

    Monday Sep 15, 2014

    Oracle SQL Developer & Data Modeler Support for Oracle Big Data SQL

    Oracle SQL Developer and Data Modeler (version 4.0.3) now support Hive and Oracle Big Data SQL.  The tools allow you to connect to Hive, use the SQL Worksheet to query, create and alter Hive tables, and automatically generate Big Data SQL-enabled Oracle external tables that dynamically access data sources defined in the Hive metastore.  

    Let's take a look at what it takes to get started and then preview this new capability.

    Setting up Connections to Hive

    The first thing you need to do is set up a JDBC connection to Hive.  Follow these steps to set up the connection:

    Download and Unzip JDBC Drivers

    Cloudera provides high performance JDBC drivers that are required for connectivity:

    • Download the Hive Drivers from the Cloudera Downloads page to a local directory
    • Unzip the archive
      • unzip Cloudera_HiveJDBC_2.5.4.1006.zip
    • Two zip files are contained within the archive.  Unzip the JDBC4 archive to a target directory that is accessible to SQL Developer (e.g. /home/oracle/jdbc below): 
      • unzip Cloudera_HiveJDBC4_2.5.4.1006.zip -d /home/oracle/jdbc/

    Now that the JDBC drivers have been extracted, update SQL Developer to use the new drivers.

    Update SQL Developer to use the Cloudera Hive JDBC Drivers

    Update the preferences in SQL Developer to leverage the new drivers:

    • Start SQL Developer
    • Go to Tools -> Preferences
    • Navigate to Database -> Third Party JDBC Drivers
    • Add all of the jar files contained in the zip to the Third-party JDBC Driver Path.  It should look like the picture below:
      sql developer preferences

    • Restart SQL Developer

    Create a Connection

    Now that SQL Developer is configured to access Hive, let's create a connection to Hiveserver2.  Click the New Connection button in the SQL Developer toolbar.  You'll need to have an ID, password and the port where Hiveserver2 is running:

    connect to hiveserver2

    The example above is creating a connection called hive which connects to Hiveserver2 on localhost running on port 10000.  The Database field is optional; here we are specifying the default database.

    Using the Hive Connection

    The Hive connection is now treated like any other connection in SQL Developer.  The tables are organized into Hive databases; you can review the tables' data, properties, partitions, indexes, details and DDL:

    sqldeveloper - view data in hive

    And, you can use the SQL Worksheet to run custom queries, perform DDL operations - whatever is supported in Hive:

    worksheet

    Here, we've altered the definition of a hive table and then queried that table in the worksheet.

    Create Big Data SQL-enabled Tables Using Oracle Data Modeler

    Oracle Data Modeler automates the definition of Big Data SQL-enabled external tables.  Let's create a few tables using the metadata from the Hive Metastore.  Invoke the import wizard by selecting the File->Import->Data Modeler->Data Dictionary menu item.  You will see the same connections found in the SQL Developer connection navigator:

    pick a connection

    After selecting the hive connection and a database, select the tables to import:

    pick tables to import

    There could be any number of tables here - in our case we will select three tables to import.  After completing the import, the logical table definitions appear in our palette:

    imported tables

    You can update the logical table definitions - and in our case we will want to do so.  For example, the recommended column in Hive is defined as a string (i.e. there is no precision) - which the Data Modeler casts as a varchar2(4000).  We have domain knowledge and understand that this field is really much smaller - so we'll update it to the appropriate size:

    update prop

    Now that we're comfortable with the table definitions, let's generate the DDL and create the tables in Oracle Database 12c.  Use the Data Modeler DDL Preview to generate the DDL for those tables - and then apply the definitions in the Oracle Database SQL Worksheet:

    preview ddl

    Edit the Table Definitions

    The SQL Developer table editor has been updated so that it now understands all of the properties that control Big Data SQL external table processing.  For example, edit table movieapp_log_json:

    edit table props

    You can update the source cluster for the data, how invalid records should be processed, how to map hive table columns to the corresponding Oracle table columns (if they don't match), and much more.

    Query All Your Data

    You now have full Oracle SQL access to data across the platform.  In our example, we can combine data from Hadoop with data in our Oracle Database.  The data in Hadoop can be in any format - Avro, json, XML, csv - if there is a SerDe that can parse the data - then Big Data SQL can access it!  Below, we're combining click data from the JSON-based movie application log with data in our Oracle Database tables to determine how the company's customers rate blockbuster movies:

    compare to blockbuster movies

    Looks like they don't think too highly of them! Of course - the ratings data is fictitious ;)

    Friday Sep 12, 2014

    Announcement: Big Data SQL is Generally Available

    Oracle Big Data SQL and Big Data Appliance 4.0 are generally available

    Big Data Appliance 4.0 receives the following upgrades:

     

    • Big Data SQL: Join data in Hadoop with Oracle Database using Oracle SQL 
      • New external table types for handling data stored in Hadoop 
      • Smart Scan for Hadoop to provide fast query performance 
      • Requires additional license for Big Data SQL 
      • Requires Exadata Database Machine running DB 12.1.0.2 
    • Automated recovery from server failure 
      • This includes migration of master roles to a different server and re-provisioning of a slave node on a server that has been replaced 
    • NoSQL DB multiple-zone configurations on BDA 
      • When adding nodes to a NoSQL DB BDA cluster, they can be added to an existing zone or to a new zone. 
    •  Update to using the 12c ODI Agent on BDA clusters 

    For more information on Big Data SQL, check out:

     

     

    [Read More]

    Tuesday Jul 15, 2014

    Oracle Big Data SQL: One Fast Query, All Your Data

    Introduction

    Today we're pleased to announce Big Data SQL, Oracle's unique approach to providing unified query over data in Oracle Database, Hadoop, and select NoSQL datastores.  Big Data SQL has been in development for quite a while now, and will be generally available in a few months.  With today's announcement of the product, I wanted to take a chance to explain what we think is important and valuable about Big Data SQL.

    SQL on Hadoop

    As anyone paying attention to the Hadoop ecosystem knows, SQL-on-Hadoop has seen a proliferation of solutions in the last 18 months, and just as large a proliferation of press.  From good, ol' Apache Hive to Cloudera Impala and SparkSQL, these days you can have SQL-on-Hadoop any way you like it.  It does, however, prompt the question: Why SQL?

    There's an argument to be made for SQL simply being a form of skill reuse.  If people and tools already speak SQL, then give the people what they know.  In truth, that argument falls flat when one considers the sheer pace at which the Hadoop ecosystem evolves.  If there were a better language for querying Big Data, the community would have turned it up by now.

    I think the reality is that the SQL language endures because it is uniquely suited to querying datasets.  Consider, SQL is a declarative language for operating on relations in data.  It's a domain-specific language where the domain is datasets.  In and of itself, that's powerful: having language elements like FROM, WHERE and GROUP BY make reasoning about datasets simpler.  It's set theory set into a programming language.

    It goes beyond just the language itself.  SQL is declarative, which means I only have to reason about the shape of the result I want, not the data access mechanisms to get there, the join algorithms to apply, how to serialize partial aggregations, and so on.  SQL lets us think about answers, which lets us get more done.

    SQL on Hadoop, then, is somewhat obvious.  As data gets bigger, we would prefer to only have to reason about answers.

    SQL On More Than Hadoop

    For all the obvious goodness of SQL on Hadoop, there's a somewhat obvious drawback.  Specifically, data rarely lives in a single place.  Indeed, if Big Data is causing a proliferation of new ways to store and process data, then there are likely more places to store data then every before.  If SQL on Hadoop is separate from SQL on a DBMS, I run the risk of constructing every IT architect's least favorite solution: the stovepipe.

    If we want to avoid stovepipes, what we really need is the ability to run SQL queries that work seamlessly across multiple datastores.  Ideally, in a Big Data world, SQL should "play data where it lies," using the declarative power of the language to provide answers from all data.

    This is why we think Oracle Big Data SQL is obvious too.

    It's just a little more complicated than SQL on any one thing.  To pull it off, we have to do a few things:

    • Maintain the valuable characteristics of the system storing the data
    • Unify metadata to understand how to execute queries
    • Optimize execution to take advantage of the systems storing the data

    For the case of a relational database, we might say that the valuable storage characteristics include things like: straight-through processing, change-data logging, fine-grained access controls, and a host of other things.

    For Hadoop, I believe that the two most valuable storage characteristics are scalability and schema-on-read.  Cost-effective scalability is one of the first things that people look to HDFS for, so any solution that does SQL over a relational database and Hadoop has to understand how HDFS scales and distributes data.  Schema-on-read is at least equally important if not more.  As Daniel Abadi recently wrote, the flexibility of schema-on-read is gives Hadoop tremendous power: dump data into HDFS, and access it without having to convert it to a specific format.  So, then, any solution that does SQL over a relational database and Hadoop is going to have to respect the schemas of the database, but be able to really apply schema-on-read principals to data stored in Hadoop.

    Oracle Big Data SQL maintains all of these valuable characteristics, and it does it specifically through the approaches taken for unifying metadata and optimizing performance.

    Big Data SQL queries data in a DBMS and Hadoop by unifying metadata and optimizing performance.

    Unifying Metadata

    To unify metadata for planning and executing SQL queries, we require a catalog of some sort.  What tables do I have?  What are their column names and types?  Are there special options defined on the tables?  Who can see which data in these tables?

    Given the richness of the Oracle data dictionary, Oracle Big Data SQL unifies metadata using Oracle Database: specifically as external tables.  Tables in Hadoop or NoSQL databases are defined as external tables in Oracle.  This makes sense, given that the data is external to the DBMS.

    Wait a minute, don't lots of vendors have external tables over HDFS, including Oracle?

     Yes, but Big Data SQL provides as an external table is uniquely designed to preserve the valuable characteristics of Hadoop.  The difficulty with most external tables is that they are designed to work on flat, fixed-definition files, not distributed data which is intended to be consumed through dynamically invoked readers.  That causes both poor parallelism and removes the value of schema-on-read.

      The external tables Big Data SQL presents are different.  They leverage the Hive metastore or user definitions to determine both parallelism and read semantics.  That means that if a file in HFDS is 100 blocks, Oracle database understands there are 100 units which can be read in parallel.  If the data was stored in a SequenceFile using a binary SerDe, or as Parquet data, or as Avro, that is how the data is read.  Big Data SQL uses the exact same InputFormat, RecordReader, and SerDes defined in the Hive metastore to read the data from HDFS.

    Once that data is read, we need only to join it with internal data and provide SQL on Hadoop and a relational database.

    Optimizing Performance

    Being able to join data from Hadoop with Oracle Database is a feat in and of itself.  However, given the size of data in Hadoop, it ends up being a lot of data to shift around.  In order to optimize performance, we must take advantage of what each system can do.

    In the days before data was officially Big, Oracle faced a similar challenge when optimizing Exadata, our then-new database appliance.  Since many databases are connected to shared storage, at some point database scan operations can become bound on the network between the storage and the database, or on the shared storage system itself.  The solution the group proposed was remarkably similar to much of the ethos that infuses MapReduce and Apache Spark: move the work to the data and minimize data movement.

    The effect is striking: minimizing data movement by an order of magnitude often yields performance increases of an order of magnitude.

    Big Data SQL takes a play from both the Exadata and Hadoop books to optimize performance: it moves work to the data and radically minimizes data movement.  It does this via something we call Smart Scan for Hadoop.

    Moving the work to the data is straightforward.  Smart Scan for Hadoop introduces a new service into to the Hadoop ecosystem, which is co-resident with HDFS DataNodes and YARN NodeManagers.  Queries from the new external tables are sent to these services to ensure that reads are direct path and data-local.  Reading close to the data speeds up I/O, but minimizing data movement requires that Smart Scan do some things that are, well, smart.

    Smart Scan for Hadoop

    Consider this: most queries don't select all columns, and most queries have some kind of predicate on them.  Moving unneeded columns and rows is, by definition, excess data movement and impeding performance.  Smart Scan for Hadoop gets rid of this excess movement, which in turn radically improves performance.

    For example, suppose we were querying a 100 of TB set of JSON data stored in HDFS, but only cared about a few fields -- email and status -- and only wanted results from the state of Texas.

    Once data is read from a DataNode, Smart Scan for Hadoop goes beyond just reading.  It applies parsing functions to our JSON data, discards any documents which do not contain 'TX' for the state attribute.  Then, for those documents which do match, it projects out only the email and status attributes to merge with the rest of the data.  Rather than moving every field, for every document, we're able to cut down 100s of TB to 100s of GB.

    The approach we take to optimizing performance with Big Data SQL makes Big Data much slimmer.

    Summary

    So, there you have it: fast queries which join data in Oracle Database with data in Hadoop while preserving the makes each system a valuable part of overall information architectures.  Big Data SQL unifies metadata, such that data sources can be queried with the best possible parallelism and the correct read semantics.  Big Data SQL optimizes performance using approaches inspired by Exadata: filtering out irrelevant data before it can become a bottleneck.

    It's SQL that plays data where it lies, letting you place data where you think it belongs.

    [Read More]

    Thursday Jun 26, 2014

    Big Data Breakthrough; Watch the Webcast on July 15th

    About

    The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.

    Search

    Archives
    « September 2015
    SunMonTueWedThuFriSat
      
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
       
           
    Today