Wednesday Nov 27, 2013

Big Data - Real and Practical Use Cases

The goal of this post is to explain, in a few succinct patterns, how organizations can start to work with big data and identify credible, achievable big data projects. It does so by describing a set of general patterns that can be seen in the market today.
Big Data Usage Patterns
The following usage patterns are derived from actual customer projects across a large number of industries and span both commercial enterprises and the public sector. These patterns are also geographically applicable and technically feasible with today’s technologies.
This post will address the following four usage patterns:
  • Data Factory – a pattern that enables an organization to integrate and transform – in a batch method – large, diverse data sets before moving the data into an upstream system like an RDBMS or a NoSQL system. Data in the data factory is possibly transient and the focus is on data processing.
  • Data Warehouse Expansion with a Data Reservoir – a pattern that expands the data warehouse with a large-scale Hadoop system to capture data at lower grain and higher diversity, which is then fed into upstream systems. Data in the data reservoir is persistent and the focus is on data processing and data storage as well as on the reuse of data.
  • Information Discovery with a Data Reservoir – a pattern that creates a data reservoir for discovery data marts or discovery systems like Oracle Endeca to tap into a wide range of data elements. The goal is to simplify data acquisition into discovery tools and to initiate discovery on raw data.
  • Closed Loop Recommendation and Analytics system – a pattern that is often considered the holy grail of data systems. This pattern combines analytics on historical data with event processing or real-time actions on current events, and closes the loop between the two to continuously improve real-time actions based on the correlation of current and historical events.

Pattern 1: Data Factory

The core business reason to build a Data Factory as it is presented here is to implement a cost-savings strategy by placing long-running batch jobs on a cheaper system. The project is often funded by not spending money on the more expensive system – for example by switching Mainframe MIPS off – and instead leveraging those savings to fund the Data Factory. The first figure shows a simplified implementation of the Data Factory.
As the image below shows, the data factory must be scalable, flexible and (more) cost effective for processing the data. The typical system used to build a data factory is Apache Hadoop or, in the case of Oracle’s Big Data Appliance, Cloudera’s Distribution including Apache Hadoop (CDH).

data factory

Hadoop (and therefore Big Data Appliance and CDH) offers an extremely scalable environment to process large data volumes (or a large number of small data sets) and jobs. Most typical is the offload of large batch updates, matching and de-duplication jobs, and so on. Hadoop also offers a very flexible model, where data is interpreted on read rather than on write. This enables a data factory to quickly accommodate all types of data, which can then be processed in programs written in Hive, Pig or MapReduce, as sketched below.
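To make this concrete, here is a minimal HiveQL sketch of a typical factory job – de-duplicating a raw customer feed before handing it to an upstream system. The table names, columns and HDFS path are hypothetical and only meant to illustrate the schema-on-read style of processing:

-- Interpret the raw landing files on read: nothing is transformed on write.
CREATE EXTERNAL TABLE raw_customers (
  customer_id  STRING,
  name         STRING,
  email        STRING,
  updated_at   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/factory/raw_customers';

-- A simple de-duplication pass; the result is what gets pushed to the upstream system.
CREATE TABLE clean_customers AS
SELECT DISTINCT customer_id, name, email, updated_at
FROM raw_customers;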
As the figure above shows, the data factory is an integration platform, much like an ETL tool. Data sets land in the data factory, batch jobs process the data, and the processed data moves into the upstream systems. These upstream systems include RDBMSs, which are then used for various information needs. In the case of a Data Warehouse, this is very close to pattern 2 described below, with the difference that in the data factory data is often transient and removed after the processing is done.
This transient nature of data is not a required feature, but it is often implemented to keep the Hadoop cluster relatively small. The aim is generally to just transform data in a more cost effective manner.
When the upstream system is a NoSQL database, data is often prepared in a specific key-value format to be served up to end applications like a website. NoSQL databases work very well for that purpose, but the batch processing is better left to the Hadoop cluster.
It is also very common for data to flow in the reverse direction, that is, for data from RDBMS or NoSQL databases to flow into the data factory. In most cases this is reference data, such as customer master data, which is required in the Data Factory in order to process new customer data.
Because of its low risk profile – the logic of these batch processes is well known and understood – and because it is funded from savings in other systems, the Data Factory is typically an IT department’s first attempt at a big data project. The downside of a Data Factory project is that business users see very little benefit, because they do not get new insights out of big data.

Pattern 2: Data Warehouse Expansion

The common way to drive new insights out of big data is pattern two. Expanding the data warehouse with a data reservoir enables an organization to expand the raw data it captures, in a system that adds agility to the organization. The graphical pattern is shown below.


DW Expansion

A Data Reservoir – like the Data Factory from Pattern 1 – is based on Hadoop and Oracle Big Data Appliance, but rather than holding transient data, processing it and handing it off, a Data Reservoir aims to store data at a lower grain than previously stored, and for a much longer period than before.
The Data Reservoir is initially used to capture data, aggregate new metrics and augment (not replace) the data warehouse with new and expansive KPIs or context information. A very typical addition is the sentiment of a customer towards a product or brand, which is added to a customer table in the data warehouse.
The addition of new KPIs or new context information is a continuous process. That is, new analytics on raw and correlated data should find their way into the upstream Data Warehouse on a very, very regular basis.
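As a hedged illustration of that flow, the sketch below folds a sentiment score computed in the reservoir into the warehouse customer table, assuming the scores have already been moved into a staging table. The table and column names (dw_customer, stg_customer_sentiment) are illustrative, not from the original post:

-- Oracle SQL sketch: augment (not replace) the customer dimension with a new KPI.
MERGE INTO dw_customer c
USING stg_customer_sentiment s
ON (c.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET c.brand_sentiment_score  = s.sentiment_score,
             c.sentiment_last_updated = s.score_date;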
As the Data Reservoir grows and becomes known within the organization because of the new KPIs or context it provides, users should start to look at the Data Reservoir as an environment to “experiment” and “play” with data. With some rudimentary programming skills, power users can start to combine various data elements in the Data Reservoir, using for example Hive (see the sketch below). This enables the users to verify a hypothesis without the need to build a new data mart. Hadoop and the Data Reservoir now become an economically viable sandbox for power users, driving innovation, agility and possibly revenue from hitherto unused data.
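A power user’s exploration query might look like the hedged HiveQL sketch below – testing the hypothesis that customers with negative social sentiment return products more often. The tables (customer_sentiment, with one row per customer, and product_returns) and their columns are hypothetical:

-- HiveQL sketch: ad-hoc hypothesis test directly against reservoir data.
SELECT s.sentiment_bucket,
       COUNT(DISTINCT r.customer_id) AS customers_with_returns,
       COUNT(r.return_id)            AS total_returns
FROM customer_sentiment s
LEFT OUTER JOIN product_returns r
  ON s.customer_id = r.customer_id
GROUP BY s.sentiment_bucket;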

Pattern 3: Information Discovery

Agility for power users and expert programmers is one thing, but eventually the goal is to enable business users to discover new and exciting things in the data. Pattern 3 combines the data reservoir with a special information discovery system to provide a Graphical User Interface specifically for data discovery. This GUI emulates in many ways how an end user today searches for information on the internet.
To empower a set of business users to truly discover information, they first and foremost require a Discovery tool. A project should therefore always start with that asset.


Once the Discovery tool (like Oracle Endeca) is in place, it pays to start leveraging the Data Reservoir to feed it. As shown above, the Data Reservoir is continuously fed with new data. The Discovery tool lets business users create ad-hoc data marts within the discovery environment. Having the Data Reservoir simplifies data acquisition for end users because they only need to look in one place for data.
In essence, the Data Reservoir is now used to drive two different systems: the Data Warehouse and the Information Discovery environment. In practice users will very quickly gravitate to the appropriate system, but no matter which system they use, they now have the ability to drive value from data into the organization.

Pattern 4: Closed Loop Recommendation and Analytics System

So far, most of what was discussed was analytical and batch based. But a lot of organizations want to move to a real-time interaction model with their end customers (or, in the world of the Internet of Things, with other machines and sensors).

Closed Loop System

Hadoop is very good at providing the Data Factory and the Data Reservoir, at providing a sandbox, and at providing massive storage and processing capabilities, but it is less good at doing things in real time. Therefore, to build a closed loop recommendation system – which should react in real time – Hadoop is only one of the components.
Typically, the bottom half of the last figure is akin to pattern 2 and is used to catch all data, analyze the correlations between recorded events (detected fraud, for example) and generate a set of predictive models describing something like “if a, b and c occur during a transaction – mark it as suspect and hand it off to an agent”. Such a model would, for example, block a credit card transaction.
To make such a system work it is important to use the right technology at both levels. Real time technologies like Oracle NoSQL Database, Oracle Real Time Decisions and Oracle Event Processing work on the data stream in flight. Oracle Big Data Appliance, Oracle Exadata/Database and Oracle Advanced Analytics provide the infrastructure to create, refine and expose the models.
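One way to expose such a model to the real-time side is through SQL scoring functions in the database. The hedged sketch below assumes a fraud classification model named fraud_model built with Oracle Advanced Analytics and an illustrative incoming_transactions table; the names and the 0.8 threshold are hypothetical:

-- Oracle SQL sketch: score transactions against the model and surface likely fraud.
SELECT transaction_id,
       PREDICTION_PROBABILITY(fraud_model, 'SUSPECT' USING *) AS suspect_probability
FROM   incoming_transactions
WHERE  PREDICTION_PROBABILITY(fraud_model, 'SUSPECT' USING *) > 0.8;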

Summary

Today’s big data technologies offer a wide variety of capabilities. Leveraging these capabilities together with the existing environment and skills already in place, according to the four patterns described, enables an organization to benefit from big data today. It is a matter of identifying the applicable pattern for your organization and then starting on the implementation.

The technology is ready. Are you?

Read-All-About-It: new weekly Oracle Data Warehousing newspaper

Thanks to Brendan Tierney for bringing this excellent online automated news service to my attention….

For a long time I have been wondering how to pull together all the articles from my favourite Twitter feeds, Facebook pages and blogs. Well, thanks to Brendan I have discovered a service called Paper.li. This weekend I spent some time setting up feeds from all my favourite sources related to data warehousing, big data, Exadata and other related Oracle technologies. The result is the "#Oracle DW-Big Data Weekly Roundup", which is designed to "keep you up to date on all the weekly sql analytics, data warehousing and big data news from #Oracle". The newspaper is refreshed every Sunday night so that it is ready for Monday morning to read over breakfast. It is the perfect way to start the working week….



Newspaper


If you want to subscribe to this weekly newspaper then go here: http://paper.li/OracleBigData/1384259272 and click on the red SUBSCRIBE link in the top right region of the screen. To give you some guidance on where all this content is coming from, I am pulling articles from the following sources:

  • Oracle Twitter accounts
    • OracleBigData
    • Oracle Database
    • SQLMaria (Optimizer)
    • CharlieDataMine (Advanced Analytics)
    • NoSQL Database
    • SQL Developer
    • Hardware team
    • BI technology
    • Profit Online Magazine
    • Mark Hornick (R Enterprise)
    • Oracle University
  • Oracle Blogs
    • Data Warehousing
    • Data Mining
    • R
  • Oracle Facebook pages
    • Data Warehousing and Big Data page

I am looking for feedback on how useful this is to people: we have so many ways to communicate with you that it is good to know what works and what does not.
If you want to subscribe to Brendan's data mining/analytics newsletter it is here: http://paper.li/brendantierney/1364568794.

Now I am off to investigate creating the same thing on Flipboard for all you iPad/iPhone and Android users…..hope to have an update for you on this very soon so stay tuned!

Friday Nov 22, 2013

Using Oracle Exadata to improve crop yields

It is not often you read about how the agricultural industry uses data warehousing, so this article in the latest edition of Oracle Magazine, with the related video from OOW 2013, on how Land O'Lakes is using Exadata caught my attention: The Business of Growing, by Marta Bright http://www.oracle.com/technetwork/issue-archive/2013/13-nov/o63lol-2034253.html

A little background on Land O'Lakes: Land O’Lakes is a US company that has grown far beyond its roots as a small cooperative of dairy farmers with forward-thinking ideas about producing and packaging butter. It is a Fortune 500 company and is now the second-largest cooperative in the United States, with annual sales of more than US$14 billion. Over the years, Land O’Lakes has expanded its operations into a variety of subsidiaries, including WinField Solutions (WinField), which provides farmers with a wide variety of crop seeds and crop protection products.

This implementation on our engineered systems highlights one of the key unique features of Oracle: the ability to run a truly mixed operational + data warehouse workload on the same platform, in the same rack. Land O'Lakes uses a lot of Oracle applications which push data into their data warehouse. In many cases we talk about the need for the data warehouse to support small windows of opportunity. For WinField Solutions this is exactly why they invested in Oracle.

What makes seed sales unique and challenging is that they are directly tied to seasonal purchasing. “There’s somewhat of a Black Friday in the seed business,” explains Tony Taylor, director of technology services at Land O’Lakes. “WinField is a US$5 billion company that sells all of its seed during about a six-week period of time”. With that sort of compressed sales cycle you need up-to-date information at your fingertips so you can make all the right decisions at the right time. Speed and efficiency are the key factors. If you watch the video, Chris Malott, Manager of the DBA team at Land O'Lakes, explains the real benefits that her team and the business teams have gained from moving their systems (operational and data warehouse) on to our engineered systems platform.

Land OLakes

Click on the image above to link to the video. If the link does not work then click here: http://www.youtube.com/watch?v=-MDYI7IakR8

WinField is maximising the use of Oracle's analytical capabilities by incorporating huge volumes of imaging data which it then uses to help farmers make smarter agronomic decisions and ultimately gain higher yields. How many people would expect Oracle's in-database analytics to be driving improved crop yields? Now that is a great use case!

WinField also drills down and across all the information it collects to help identify new opportunities and what in the telco space we would call "churn" - for instance, in January they run reports to identify particular farmers who typically purchase seed and products in November but have not yet ordered. This provides WinField with the information it needs in order to perform proactive outreach to farmers and find out if they simply haven’t had time to place an order.
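A hedged sketch of what such a "lapsed buyer" report could look like in SQL is shown below; the customers and orders tables, their columns and the dates are purely illustrative and not taken from the article:

-- SQL sketch: farmers who ordered last November but have not ordered this season.
SELECT c.customer_id, c.customer_name
FROM   customers c
WHERE  EXISTS (SELECT 1
               FROM   orders o
               WHERE  o.customer_id = c.customer_id
               AND    o.order_date BETWEEN DATE '2012-11-01' AND DATE '2012-11-30')
AND    NOT EXISTS (SELECT 1
                   FROM   orders o
                   WHERE  o.customer_id = c.customer_id
                   AND    o.order_date >= DATE '2013-11-01');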

“We’re able to help farmers and our co-op members, even in cases where we’re not sure whether it’s going to directly benefit Land O’Lakes or WinField. Because this is truly a cooperative system, these are the people we work for, and we’re willing to invest in them.” -  Mike Macrie, vice president and CIO at Land O’Lakes. 

For next steps, Land O'Lakes is looking to move to Database 12c and I am sure that will open up even more opportunities for their business users to help farmers. Hopefully, they will be able to make use of new analytical features such as SQL Pattern Matching.


Thursday Nov 14, 2013

Data Scientist Boot camp (Skills and Training)

As almost everyone is interested in data science, take the boot camp to get ahead of the curve. Leverage this free Data Science Boot camp from Oracle Academy to learn some of the following things:

  • Introduction: Providing Data-Driven Answers to Business Questions
  • Lesson 1: Acquiring and Transforming Big Data
  • Lesson 2: Finding Value in Shopping Baskets
  • Lesson 3: Unsupervised Learning for Clustering
  • Lesson 4: Supervised Learning for Classification and Prediction
  • Lesson 5: Classical Statistics in a Big Data World
  • Lesson 6: Building and Exploring Graphs

You will also find the code samples that go with the training, so you can get off to a running start.


Wednesday Nov 13, 2013

ADNOC talks about 50x increase in performance

In this new video Awad Ahmen Ali El Sidddig, Senior DBA at ADNOC, talks about the impact that Exadata has had on his team and the whole business. ADNOC is using our engineered systems to drive and manage all their workloads: from transaction systems to payment systems to the data warehouse to the BI environment. A true Disk-to-Dashboard revolution using Engineered Systems. This engineered approach is delivering a 50x improvement in performance, with some queries running 100x faster! The IT team has even revolutionised some of their data warehouse related processes with the help of Exadata, and jobs that were taking over 4 hours now run in a few minutes.


Tuesday Nov 12, 2013

Big Data Appliance X4-2 Release Announcement

Today we are announcing the release of the 3rd generation Big Data Appliance. Read the Press Release here.

Software Focus

The focus for this 3rd generation of Big Data Appliance is:

  • Comprehensive and Open - Big Data Appliance now includes all Cloudera software, including Back-up and Disaster Recovery (BDR), Search, Impala and Navigator, as well as the previously included components (like CDH, HBase and Cloudera Manager) and Oracle NoSQL Database (CE or EE).
  • Lower TCO than DIY Hadoop systems
  • Simplified Operations while providing an open platform for the organization
  • Comprehensive security including the new Audit Vault and Database Firewall software, Apache Sentry and Kerberos configured out-of-the-box

Hardware Update

A good place to start is to quickly review the hardware differences (no price changes!). On a per-node basis, the following is a comparison between the previous (X3-2) and new (X4-2) hardware:


                   Big Data Appliance X3-2                       Big Data Appliance X4-2
CPU                2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)     2 x 8-Core Intel® Xeon® E5-2650 V2 (2.6 GHz)
Memory             64GB                                          64GB
Disk               12 x 3TB High Capacity SAS                    12 x 4TB High Capacity SAS
InfiniBand         40Gb/sec                                      40Gb/sec
Ethernet           10Gb/sec                                      10Gb/sec

For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X4-2. The larger disks give BDA X4-2 33% more capacity than the previous generation while adding faster CPUs. Memory for BDA is expandable to 512 GB per node, and the expansion can be done on a per-node basis, for example for NameNodes, HBase region servers or NoSQL Database nodes.

Software Details

More details on the software and the current versions (note that BDA follows a three-monthly update cycle for Cloudera and other software):


                   Big Data Appliance 2.2 Software Stack         Big Data Appliance 2.3 Software Stack
Linux              Oracle Linux 5.8 with UEK 1                   Oracle Linux 6.4 with UEK 2
JDK                JDK 6                                         JDK 7
Cloudera CDH       CDH 4.3                                       CDH 4.4
Cloudera Manager   CM 4.6                                        CM 4.7

And, as we said at the beginning, it is important to understand that all other Cloudera components are now included in the price of Oracle Big Data Appliance. They are fully supported by Oracle and available to all BDA customers.

For more information:



Wednesday Nov 06, 2013

Swiss Re increases data warehouse performance and deploys in record time

Great information on yet another data warehouse deployment on Exadata.

A little background on Swiss Re:

In 2002, Swiss Re established a data warehouse for its client markets and products to gather reinsurance information across all organizational units into an integrated structure. The data warehouse provided the basis for reporting at the group level with drill-down capability to individual contracts, while facilitating application integration and data exchange by using common data standards. Initially focusing on property and casualty reinsurance information only, it now includes life and health reinsurance, insurance, and nonlife insurance information.

Key highlights of the benefits that Swiss Re achieved by using Exadata:

  • Reduced the time to feed the data warehouse and generate data marts by 58%
  • Reduced average runtime by 24% for standard reports
  • Comfortably loaded two data warehouse refreshes per day with incremental feeds
  • Freed up technical experts by significantly minimizing time spent on tuning activities

Most importantly, this was one of the fastest project deployments in Swiss Re's history. They went from installation to production in just four months! What is truly surprising is that it only took two weeks from power-on to testing the machine with full data volumes! Business teams at Swiss Re are now able to fully exploit up-to-date analytics across property, casualty, life, health insurance, and reinsurance lines to identify successful products.

These points are highlighted in the following quotes from Dr. Stephan Gutzwiller, Head of Data Warehouse Services at Swiss Re: 

"We were operating a complete Oracle stack, including servers, storage area network, operating systems, and databases that was well optimized and delivered very good performance over an extended period of time. When a hardware replacement was scheduled for 2012, Oracle Exadata was a natural choice—and the performance increase was impressive. It enabled us to deliver analytics to our internal customers faster, without hiring more IT staff"

“The high quality data that is readily available with Oracle Exadata gives us the insight and agility we need to cater to client needs. We also can continue re-engineering to keep up with the increasing demand without having to grow the organization. This combination creates excellent business value.”

Our full press release is available here: http://www.oracle.com/us/corporate/customers/customersearch/swiss-re-1-exadata-ss-2050409.html. If you want more information about how Exadata can increase the performance of your data warehouse visit our home page: http://www.oracle.com/us/products/database/exadata-database-machine/overview/index.html


Monday Nov 04, 2013

New Big Data Appliance Security Features

The Oracle Big Data Appliance (BDA) is an engineered system for big data processing.  It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing.  The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security.  Oracle Database customers have benefited from a rich set of security features:  encryption, redaction, data masking, database firewall, label based access control – and much, much more.  They want similar capabilities with their Hadoop cluster.   

Unfortunately, Hadoop wasn’t developed with security in mind.  By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database.  Some critical security features have been implemented – but even those capabilities are arduous to set up and configure.  Oracle believes that a key element of an optimized appliance is that its data should be secure.  Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing.

Security Starts at Authentication

A successful security strategy is predicated on strong authentication – for both users and software services.  Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”.  The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users.

Default Authentication in Hadoop

By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system.  Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user.  In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt.  When logged in as the hr user, you can see the following files.  Notice, we’re using the Hadoop command line utilities for accessing the data:

$ hadoop fs -ls /user/hrdata

Found 1 items
-rw-r--r--   1 oracle supergroup         70 2013-10-31 10:38 /user/hrdata/salaries.txt

$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”. 

$ hadoop fs -ls /user
Found 1 items
drwx------   - hr supergroup          0 2013-10-31 10:38 /user/hrdata

However, DrEvil cannot view the contents of the folder due to lack of access privileges:

$ hadoop fs -ls /user/hrdata
ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------

Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster:

$ sudo useradd hr
$ sudo passwd hr
$ su hr
$ hadoop fs -cat /user/hrdata/salaries.txt
Tom Brady,11000000
Tom Hanks,5000000
Bob Smith,250000
Oprah,300000000

Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS.

Big Data Appliance Provides Secure Authentication

The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this through Kerberos integration.


Figure 1: Kerberos Integration

The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc.

At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system.

It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment.

Authorize Access to Sensitive Data

Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server.

For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration.

Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements:

  • Secure Authorization:  the ability to control access to data and/or privileges on data for authenticated users.
  • Fine-Grained Authorization:  the ability to give users access to a subset of the data (e.g. column) in a database
  • Role-Based Authorization:  the ability to create/apply template-based privileges based on functional roles.
With Sentry, “all”, “select” or “insert” privileges are granted to an object. The descendants of that object automatically inherit that privilege. A collection of privileges across many objects may be aggregated into a role – and users/groups are then assigned that role. This leads to simplified administration of security across the system.

Sentry Object Hierarchy

Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views.
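As a hedged sketch of that role-based model, the statements below grant SELECT on a database to a role and assign the role to a group. The object names are illustrative, and the SQL-style GRANT syntax shown here is how Sentry policies are expressed through HiveServer2 in later releases; early Sentry versions define the same rules in a policy file:

-- Illustrative Sentry-style role-based grants (role, database and group names are hypothetical).
CREATE ROLE analyst_role;
GRANT SELECT ON DATABASE sales TO ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;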

Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future.

Audit Hadoop Cluster Activity

Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster:


Figure 3: Monitored Hadoop services.

At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include:

  • MapReduce:  correlate the MapReduce job that accessed the file
  • Oozie:  describes who ran what as part of a workflow
  • Hive:  captures changes made to the Hive metadata

The audit data is captured in the Audit Vault Server – which consolidates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to the activity from the BDA.

Audit Vault Server

Figure 4: Consolidated audit data across the enterprise. 

Once the data is in the Audit Vault Server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger when audit policies are violated.

Conclusion

Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.

Friday Nov 01, 2013

SQL analytical mash-ups deliver real-time WOW! for big data

One of the overlooked capabilities of SQL as an analysis engine, because we all just take it for granted, is that you can mix and match analytical features to create some amazing mash-ups. As we move into the exciting world of big data these mash-ups can really deliver those "wow, I never knew that" moments.

While Java is an incredibly flexible and powerful framework for managing big data there are some significant challenges in using Java and MapReduce to drive your analysis to create these "wow" discoveries. One of these "wow" moments was demonstrated at this year's OpenWorld during Andy Mendelsohn's general keynote session. 

Here is the scenario - we are looking for fraudulent activities in our big data stream, and in this case we are identifying potentially fraudulent activities by looking for specific patterns. We are using geospatial tagging of each transaction so we can create a real-time fraud-map for our business users.

OOW PM  2

Where we start to move towards a "wow" moment is to extend this basic use of spatial and pattern matching, as shown in the above dashboard screen, to incorporate spatial analytics within the SQL pattern matching clause. This will allow us to compute the distance between transactions. Apologies for the quality of this screenshot….hopefully below you can see where we have extended our SQL pattern matching clause to use the location of each transaction and to calculate the distance between each transaction:

OOW PM  4

This allows us to compare the time of the last transaction with the time of the current transaction and see if the distance between the two points is possible given the time frame. Obviously, if I buy something in Florida from my favourite bike store (maybe a new carbon saddle for my Trek) and then 5 minutes later the system sees my credit card details being used in Arizona, there is a high probability that the transaction in Arizona is actually fraudulent (I am fast on my Trek but not that fast!) and we can flag this up in real-time on our dashboard:

OOW PM  3
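Since the screenshots are hard to read, here is a hedged sketch of what combining MATCH_RECOGNIZE with a spatial distance calculation could look like. It is not the exact SQL from the demo; the table, columns and the 800 km/h "impossible speed" threshold are illustrative:

-- 12c SQL sketch: flag a transaction whose distance from the previous transaction
-- on the same card implies an impossible travel speed.
SELECT card_number, suspect_time, suspect_city
FROM   card_transactions
MATCH_RECOGNIZE (
  PARTITION BY card_number
  ORDER BY transaction_time
  MEASURES flagged.transaction_time AS suspect_time,
           flagged.city             AS suspect_city
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (flagged)
  DEFINE flagged AS
    SDO_GEOM.SDO_DISTANCE(flagged.location, PREV(flagged.location), 0.005, 'unit=KM')
      > 800 * ((flagged.transaction_time - PREV(flagged.transaction_time)) * 24)
);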

In this post I have used the term "real-time" a couple of times and this is an important point and one of the key reasons why SQL really is the only language to use if you want to analyse big data. One of the most important questions that comes up in every big data project is: how do we do analysis? Many enlightened customers are now realising that using Java-MapReduce to deliver analysis does not result in "wow" moments. These "wow" moments only come with SQL because it offers a much richer environment, it is simpler to use and it is faster - which makes it possible to deliver real-time "Wow!". Below is a slide from Andy's session showing the results of a comparison of Java-MapReduce vs. SQL pattern matching to deliver our "wow" moment during our live demo.

OOW PM  1

You can watch our analytical mash-up "Wow" demo that compares the power of 12c SQL pattern matching + spatial analytics vs. Java-MapReduce here:

OOW PM  5

You can get more information about SQL Pattern Matching on our SQL Analytics home page on OTN, see here http://www.oracle.com/technetwork/database/bi-datawarehousing/sql-analytics-index-1984365.html

You can get more information about our spatial analytics here: http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/index.html

If you would like to watch the full Database 12c OOW presentation see here: http://medianetwork.oracle.com/video/player/2686974264001


About

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
