By Jean-Pierre Dijcks-Oracle on Mar 13, 2014
The certification of Oracle Big Data Connectors on Intel Distribution for Hadoop is now complete (see our previous post). This video from Strata gives you a nice overview of IDH and BDC.
Intel partnered with Oracle to certify compatibility between Intel® Distribution for Apache Hadoop* (IDH) and Oracle Big Data Connectors*. Users can now connect IDH to Oracle Database with Oracle Big Data Connectors, taking advantage of the high performance feature-rich components of that product suite. Applications on IDH can leverage the connectors for fast load into Oracle Database, in-place query of data in HDFS with Oracle SQL, analytics in Hadoop with R, XQuery processing on Hadoop, and native Hadoop integration within Oracle Data Integrator.
You've been hearing a lot about Oracle's big data platform. Today, we're pleased to announce Oracle Big Data Lite Virtual Machine - an environment to help you get started with the platform. And, we have a great OTN Virtual Developer Day event scheduled where you can start using our big data products as part of a series of workshops.
Oracle Big Data Lite Virtual Machine is an Oracle VM VirtualBox image that contains many key components of Oracle's big data platform, including: Oracle Database 12c Enterprise Edition, Oracle Advanced Analytics, Oracle NoSQL Database, Cloudera Distribution including Apache Hadoop, Oracle Data Integrator 12c, Oracle Big Data Connectors, and more. It has been configured to run on a "developer class" computer; all Big Data Lite needs is a couple of cores and about 5GB of memory (this means your computer should have at least 8GB of total memory). With Big Data Lite, you can develop your big data applications and then deploy them to the Oracle Big Data Appliance. Or, you can use Big Data Lite as a client to the BDA during application development.
How do you get started? Why not start by registering for the Virtual Developer Day scheduled for Tuesday, February 4, 2014 - 9am to 1pm PT / 12pm to 4pm ET / 3pm to 7pm BRT:
There will be 45 minute sessions delivered by product experts (from both Oracle and Oracle Aces) - highlighted by Tom Kyte and Jonathan Lewis' keynote "Landscape of Oracle Database Technology Evolution". Some of the big data technical sessions include:
Keep an eye on this space - we'll be publishing how-to's that leverage the new Oracle Big Data Lite VM. And, of course, we'd love to hear about the clever applications you build as well!
CaixaBank is Spain’s largest domestic bank by market share, with a customer base of 13.7 million. It is also Spain’s leading bank in terms of innovation and technology, and one of the most prominent innovators worldwide. CaixaBank was recently awarded the title of the World’s Most Innovative Bank at the 2013 Global Banking Innovation Awards (November 2013).
Like most financial services companies, CaixaBank wants to get closer to its customers by collecting data about their activities across all of its channels (offices, internet, phone banking, ATMs, etc.). In the old days we used to call this CRM, and it later morphed into the "360-degree view" and similar concepts. Many companies have delivered these types of projects, and customers feel much more connected to and in control of their relationship with their bank; however, the capture of streams of big data has the potential to create another revolution in the way we interact with our banks. What banks like CaixaBank want to do is capture data in one part of the business and make it available to all other lines of business as quickly as possible.
Big data is allowing businesses like CaixaBank to significantly enhance the business value of their existing customer data by integrating it with all sorts of other internal and external data sets. This is probably the most exciting part of big data because the potential business benefits are really only constrained by the imagination of the team working on these types of projects. However, that in itself creates problems in terms of securing funding and ownership of projects, because the benefits can be difficult to estimate. This is where industry use cases, conference papers and blog posts can help, by providing insight into what is going on across the market in broad general terms.
To help them implement a strategic Big Data project, CaixaBank has selected Oracle for the deployment of its new Big Data infrastructure. This project, which includes an array of Oracle solutions, positions CaixaBank at the forefront of innovation in the banking industry. The new infrastructure will allow CaixaBank to maximize the business value from any kind of data and embark on new business innovation projects based on valuable information gathered from large data sets. Projects currently under review include:
The Oracle solution (including Oracle Engineered Systems, Oracle Software and Oracle Consulting Services) consists of the implementation of a new Information Management Architecture that provides a unified corporate data model and new advanced analytic capabilities. (For more information about how Oracle's Reference Architecture can help you integrate structured, semi-structured and unstructured information into a single logical information resource that can be exploited for commercial gain, click here to view our whitepaper.)
The importance of the project is best explained by Juan Maria Nin, CEO of CaixaBank:
As a follow-on to the previous post (here) on use cases, follow the link below to a recording that explains how to go about expanding the data warehouse into a big data platform.
The idea behind it all is to cover a best practice on adding to the existing data warehouse and expanding the system (shown in the figure below) to deal with:
To access the webcast:
The core business reason to build a Data Factory as it is presented here is to implement a cost savings strategy by placing long-running batch jobs on a cheaper system. The project is often funded by not spending money on the more expensive system – for example, by switching Mainframe MIPS off – and instead leveraging those cost savings to fund the Data Factory. The first figure shows a simplified implementation of the Data Factory.
As the image below shows, the data factory must be scalable, flexible and (more) cost effective for processing the data. The typical system used to build a data factory is Apache Hadoop or in the case of Oracle’s Big Data Appliance – Cloudera’s Distribution including Apache Hadoop (CDH).
Hadoop (and therefore Big Data Appliance and CDH) offers an extremely scalable environment to process large data volumes (or a large number of small data sets) and jobs. Most typical is the offload of large batch updates, matching and de-duplication jobs, etc. Hadoop also offers a very flexible model, where data is interpreted on read rather than on write. This enables a data factory to quickly accommodate all types of data, which can then be processed in programs written in Hive, Pig or other Hadoop frameworks.
As shown above, the data factory is an integration platform, much like an ETL tool. Data sets land in the data factory, batch jobs process the data, and the processed data moves into the upstream systems. These upstream systems include RDBMSs, which are then used for various information needs. In the case of a Data Warehouse, this is very close to Pattern 2 described below, with the difference that in the data factory data is often transient and removed after the processing is done.
This transient nature of data is not a required feature, but it is often implemented to keep the Hadoop cluster relatively small. The aim is generally to just transform data in a more cost effective manner.
In the case of an upstream NoSQL system, data is often prepared in a specific key-value format to be served up to end applications such as a website. NoSQL databases work really well for that purpose, but the batch processing is better left to the Hadoop cluster.
It is very common for data to flow in the reverse order or for data from RDBMS or NoSQL databases to flow into the data factory. In most cases this is reference data, like customer master data. In order to process new customer data, this master data is required in the Data Factory.
Because of its low risk profile – the logic of these batch processes is well known and understood – and its funding from savings in other systems, the Data Factory is typically an IT department’s first attempt at a big data project. The downside of a Data Factory project is that business users see very little benefit, in that they do not get new insights out of big data.
The common way to drive new insights out of big data is Pattern 2. Expanding the data warehouse with a data reservoir enables an organization to capture raw data in a system that adds agility to the organization. The graphical pattern is shown below.
A Data Reservoir – like the Data Factory from Pattern 1 – is based on Hadoop and Oracle Big Data Appliance, but rather than holding transient data, processing it and handing it off, a Data Reservoir aims to store data at a finer grain than previously stored, and for a much longer period than before.
The Data Reservoir is initially used to capture data, aggregate new metrics and augment (not replace) the data warehouse with new and expansive KPIs or context information. A very typical addition is the sentiment of a customer towards a product or brand which is added to a customer table in the data warehouse.
The addition of new KPIs or new context information is a continuous process. That is, new analytics on raw and correlated data should find their way into the upstream Data Warehouse on a very, very regular basis.
As the Data Reservoir grows and becomes known through the new KPIs and context it delivers, users should start to look at it as an environment to “experiment” and “play” with data. With some rudimentary programming skills, power users can start to combine various data elements in the Data Reservoir, using for example Hive. This enables users to verify a hypothesis without the need to build a new data mart. Hadoop and the Data Reservoir thus become an economically viable sandbox for power users, driving innovation, agility and possibly revenue from hitherto unused data.
Agility for power users and expert programmers is one thing, but eventually the goal is to enable business users to discover new and exciting things in the data. Pattern 3 combines the data reservoir with a special information discovery system to provide a Graphical User Interface specifically for data discovery. This GUI emulates in many ways how an end user today searches for information on the internet.
To empower a set of business users to truly discover information, they first and foremost require a Discovery tool. A project should therefore always start with that asset.
Once the Discovery tool (like Oracle Endeca) is in place, it pays to start leveraging the Data Reservoir to feed it. As shown above, the Data Reservoir is continuously fed with new data. The Discovery tool is a business user’s tool for creating ad-hoc data marts for exploration. Having the Data Reservoir simplifies data acquisition for end users because they only need to look in one place for data.
In essence, the Data Reservoir now drives two different systems: the Data Warehouse and the Information Discovery environment. In practice, users will very quickly gravitate to the appropriate system; but no matter which system they use, they now have the ability to drive value from data into the organization.
So far, most of what was discussed was analytics- and batch-based. But many organizations want to move to a real-time interaction model with their end customers (or, in the world of the Internet of Things, with other machines and sensors).
Hadoop is very good at providing the Data Factory and the Data Reservoir, at providing a sandbox, and at providing massive storage and processing capabilities, but it is less good at doing things in real time. Therefore, to build a closed-loop recommendation system – which should react in real time – Hadoop is only one of the components.
Typically the bottom half of the last figure is akin to Pattern 2 and is used to catch all data, analyze the correlations between recorded events (detected fraud, for example) and generate a set of predictive models describing something like “if a, b and c occur during a transaction, mark it as suspect and hand it off to an agent”. Such a model could, for example, block a credit card transaction.
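As a sketch, the kind of rule handed from the batch layer to the real-time layer can be as simple as a predicate over transaction attributes. The feature names and thresholds below are hypothetical, purely to illustrate the "if a, b and c, mark as suspect" shape; a real deployment would use models produced by the analytics layer rather than hand-written rules:

```python
# Minimal sketch of a rule produced by the batch-analytics layer and
# applied to transactions in flight. All feature names and thresholds
# are made up for illustration.
def is_suspect(txn):
    """Mark a transaction as suspect when conditions a, b and c all hold."""
    a = txn["amount"] > 5000                   # unusually large amount
    b = txn["country"] != txn["home_country"]  # card used abroad
    c = txn["merchant_category"] == "cash"     # cash-like merchant
    return a and b and c

def route(txn):
    # Suspect transactions are blocked and handed off to an agent;
    # everything else flows through in real time.
    return "hand_off_to_agent" if is_suspect(txn) else "approve"

print(route({"amount": 9000, "country": "RU",
             "home_country": "ES", "merchant_category": "cash"}))
```

In the pattern above, the bottom half of the figure would periodically re-derive and redeploy such rules, while the real-time layer evaluates them per transaction.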
To make such a system work it is important to use the right technology at both levels. Real time technologies like Oracle NoSQL Database, Oracle Real Time Decisions and Oracle Event Processing work on the data stream in flight. Oracle Big Data Appliance, Oracle Exadata/Database and Oracle Advanced Analytics provide the infrastructure to create, refine and expose the models.
Today’s big data technologies offer a wide variety of capabilities. Leveraging these capabilities alongside the environment and skills already in place, according to the four patterns described, enables an organization to benefit from big data today. It is a matter of identifying the applicable pattern for your organization and then starting on the implementation.
The technology is ready. Are you?
With almost everyone interested in data science, take this boot camp to get ahead of the curve. Leverage this free Data Science Boot Camp from Oracle Academy to learn some of the following:
You will also find the code samples that go with the training, so you can get off to a running start.
Today we are announcing the release of the 3rd generation Big Data Appliance. Read the Press Release here.
The focus for this 3rd generation of Big Data Appliance is:
A good place to start is to quickly review the hardware differences (no price changes!). On a per-node basis, the following is a comparison between the old (X3-2) and new (X4-2) hardware:
|| ||Big Data Appliance X3-2||Big Data Appliance X4-2|
|CPU|2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz)|2 x 8-Core Intel® Xeon® E5-2650 V2 (2.6 GHz)|
|Disk|12 x 3TB High Capacity SAS|12 x 4TB High Capacity SAS|
For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X4-2. The larger disks give BDA X4-2 33% more capacity over the previous generation while adding faster CPUs. Memory for BDA is expandable to 512 GB per node and can be done on a per-node basis, for example for NameNodes or for HBase region servers, or for NoSQL Database nodes.
More details in terms of software and the current versions (note that BDA follows a three-month update cycle for Cloudera and other software):
||Big Data Appliance 2.2 Software Stack||Big Data Appliance 2.3 Software Stack|
|Oracle Linux 5.8 with UEK 1|Oracle Linux 6.4 with UEK 2|
And, as we said at the beginning, it is important to understand that all other Cloudera components are now included in the price of Oracle Big Data Appliance. They are fully supported by Oracle and available to all BDA customers.
For more information:
The Oracle Big Data Appliance (BDA) is an engineered system for big data processing. It greatly simplifies the deployment of an optimized Hadoop Cluster – whether that cluster is used for batch or real-time processing. The vast majority of BDA customers are integrating the appliance with their Oracle Databases and they have certain expectations – especially around security. Oracle Database customers have benefited from a rich set of security features: encryption, redaction, data masking, database firewall, label based access control – and much, much more. They want similar capabilities with their Hadoop cluster.
Unfortunately, Hadoop wasn’t developed with security in mind. By default, a Hadoop cluster is insecure – the antithesis of an Oracle Database. Some critical security features have been implemented – but even those capabilities are arduous to setup and configure. Oracle believes that a key element of an optimized appliance is that its data should be secure. Therefore, by default the BDA delivers the “AAA of security”: authentication, authorization and auditing.
A successful security strategy is predicated on strong authentication – for both users and software services. Consider the default configuration for a newly installed Oracle Database; it’s been a long time since you had a legitimate chance at accessing the database using the credentials “system/manager” or “scott/tiger”. The default Oracle Database policy is to lock accounts thereby restricting access; administrators must consciously grant access to users.
By default, a Hadoop cluster fails the authentication test. For example, it is easy for a malicious user to masquerade as any other user on the system. Consider the following scenario that illustrates how a user can access any data on a Hadoop cluster by masquerading as a more privileged user. In our scenario, the Hadoop cluster contains sensitive salary information in the file /user/hrdata/salaries.txt. When logged in as the hr user, you can see the following files. Notice, we’re using the Hadoop command line utilities for accessing the data:
$ hadoop fs -ls /user/hrdata
Found 1 items
-rw-r--r-- 1 oracle supergroup 70 2013-10-31 10:38 /user/hrdata/salaries.txt
$ hadoop fs -cat /user/hrdata/salaries.txt
User DrEvil has access to the cluster – and can see that there is an interesting folder called “hrdata”.
$ hadoop fs -ls /user
Found 1 items
drwx------ - hr supergroup 0 2013-10-31 10:38 /user/hrdata
However, DrEvil cannot view the contents of the folder due to lack of access privileges:
$ hadoop fs -ls /user/hrdata
ls: Permission denied: user=drevil, access=READ_EXECUTE, inode="/user/hrdata":oracle:supergroup:drwx------
Accessing this data will not be a problem for DrEvil. He knows that the hr user owns the data by looking at the folder’s ACLs. To overcome this challenge, he will simply masquerade as the hr user. On his local machine, he adds the hr user, assigns that user a password, and then accesses the data on the Hadoop cluster:
$ sudo useradd hr
$ sudo passwd hr
$ su hr
$ hadoop fs -cat /user/hrdata/salaries.txt
Hadoop has not authenticated the user; it trusts that the identity that has been presented is indeed the hr user. Therefore, sensitive data has been easily compromised. Clearly, the default security policy is inappropriate and dangerous to many organizations storing critical data in HDFS.
The BDA provides secure authentication to the Hadoop cluster by default – preventing the type of masquerading described above. It accomplishes this through Kerberos integration.
Figure 1: Kerberos Integration
The Key Distribution Center (KDC) is a server that has two components: an authentication server and a ticket granting service. The authentication server validates the identity of the user and service. Once authenticated, a client must request a ticket from the ticket granting service – allowing it to access the BDA’s NameNode, JobTracker, etc.
At installation, you simply point the BDA to an external KDC or automatically install a highly available KDC on the BDA itself. Kerberos will then provide strong authentication for not just the end user – but also for important Hadoop services running on the appliance. You can now guarantee that users are who they claim to be – and rogue services (like fake data nodes) are not added to the system.
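The two-step flow can be pictured as a toy model: the authentication server validates credentials and issues a ticket-granting ticket (TGT), the ticket granting service exchanges the TGT for a service ticket, and a service such as the NameNode admits only ticket holders. This sketch is purely illustrative – there is no real cryptography here, and the principal name and password are made up:

```python
# Toy model of the Kerberos flow: authenticate with the KDC, obtain a
# TGT, exchange it for a service ticket, present the ticket to a service.
# Illustration only -- real Kerberos uses encrypted tickets, not tuples.
class KDC:
    def __init__(self, passwords):
        self.passwords = passwords  # principal -> secret

    def authenticate(self, principal, secret):
        """Authentication server: validate identity, issue a TGT."""
        if self.passwords.get(principal) == secret:
            return ("TGT", principal)
        raise PermissionError("authentication failed")

    def grant(self, tgt, service):
        """Ticket granting service: exchange a TGT for a service ticket."""
        kind, principal = tgt
        assert kind == "TGT"
        return ("TICKET", principal, service)

def namenode_access(ticket):
    # The service only admits callers holding a valid ticket for it.
    kind, principal, service = ticket
    if kind == "TICKET" and service == "namenode":
        return f"{principal} admitted to NameNode"
    raise PermissionError("no valid ticket")

kdc = KDC({"hr@EXAMPLE.COM": "s3cret"})
tgt = kdc.authenticate("hr@EXAMPLE.COM", "s3cret")
ticket = kdc.grant(tgt, "namenode")
print(namenode_access(ticket))
```

Note how the masquerading attack from the earlier scenario fails under this model: without the hr principal's secret, no TGT is ever issued, so no service ticket can be obtained.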
It is common for organizations to want to leverage existing LDAP servers for common user and group management. Kerberos integrates with LDAP servers – allowing the principals and encryption keys to be stored in the common repository. This simplifies the deployment and administration of the secure environment.
Kerberos-based authentication ensures secure access to the system and the establishment of a trusted identity – a prerequisite for any authorization scheme. Once this identity is established, you need to authorize access to the data. HDFS will authorize access to files using ACLs with the authorization specification applied using classic Linux-style commands like chmod and chown (e.g. hadoop fs -chown oracle:oracle /user/hrdata changes the ownership of the /user/hrdata folder to oracle). Authorization is applied at the user or group level – utilizing group membership found in the Linux environment (i.e. /etc/group) or in the LDAP server.
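As a rough mental model, the owner/group/other check that HDFS applies can be sketched in a few lines of Python. This is a simplification of the real NameNode logic (which also checks the execute bit on every directory along the path); the users and modes are taken from the examples in this post:

```python
# Simplified model of how HDFS decides whether a user may read a file:
# select the owner, group or "other" permission triplet, then test the
# read bit. Real HDFS also verifies execute permission on each ancestor
# directory -- omitted here for brevity.
def can_read(user, user_groups, owner, group, mode):
    """mode is an octal int such as 0o750 (rwxr-x---)."""
    if user == owner:
        bits = (mode >> 6) & 0o7   # owner triplet
    elif group in user_groups:
        bits = (mode >> 3) & 0o7   # group triplet
    else:
        bits = mode & 0o7          # other triplet
    return bool(bits & 0o4)        # read bit

# /user/hrdata is hr:supergroup drwx------ (0o700)
print(can_read("hr", ["supergroup"], "hr", "supergroup", 0o700))  # True
print(can_read("drevil", ["users"], "hr", "supergroup", 0o700))   # False
```

This also shows why Kerberos matters: the check is only as strong as the identity it is given, so without authenticated identities the triplet selection can be subverted by masquerading.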
For SQL-based data stores – like Hive and Impala – finer grained access control is required. Access to databases, tables, columns, etc. must be controlled. And, you want to leverage roles to facilitate administration.
Apache Sentry is a new project that delivers fine grained access control; both Cloudera and Oracle are the project’s founding members. Sentry satisfies the following three authorization requirements:
Figure 2: Object Hierarchy – granting a privilege on the database object will be inherited by its tables and views.
Sentry is currently used by both Hive and Impala – but it is a framework that other data sources can leverage when offering fine-grained authorization. For example, one can expect Sentry to deliver authorization capabilities to Cloudera Search in the near future.
Auditing is a critical component to a secure system and is oftentimes required for SOX, PCI and other regulations. The BDA integrates with Oracle Audit Vault and Database Firewall – tracking different types of activity taking place on the cluster:
Figure 3: Monitored Hadoop services.
At the lowest level, every operation that accesses data in HDFS is captured. The HDFS audit log identifies the user who accessed the file, the time that file was accessed, the type of access (read, write, delete, list, etc.) and whether or not that file access was successful. The other auditing features include:
The audit data is captured in the Audit Vault Server – which integrates audit activity from a variety of sources, adding databases (Oracle, DB2, SQL Server) and operating systems to activity from the BDA.
Figure 4: Consolidated audit data across the enterprise.
Once the data is in the Audit Vault server, you can leverage a rich set of prebuilt and custom reports to monitor all the activity in the enterprise. In addition, alerts may be defined to trigger violations of audit policies.
Security cannot be considered an afterthought in big data deployments. Across most organizations, Hadoop is managing sensitive data that must be protected; it is not simply crunching publicly available information used for search applications. The BDA provides a strong security foundation – ensuring users are only allowed to view authorized data and that data access is audited in a consolidated framework.
Big Data has the power to change the way we work, live, and think. The datafication of everything will create unprecedented demand for data scientists, software developers and engineers who can derive value from unstructured data to transform the world.
The Oracle Academy Big Data Resource Guide is a collection of articles, videos, and other resources organized to help you gain a deeper understanding of the exciting field of Big Data. To start your journey visit the Oracle Academy website here: https://academy.oracle.com/oa-web-big-data.html. This landing pad will guide you through the whole area of big data using the following structure:
This is a great resource packed with must-see videos and must-read whitepapers and blog posts by industry leaders.
For those who did go to Openworld, the session catalog now has the download links to the session materials online. You can now refresh your memory and share your experience with the rest of your organization. For those who did not go, here is your chance to look over some of the materials.
On the big data side, here are some of the highlights:
There are a great number of other sessions, simply look for: Solutions => Big Data and Business Analytics and you will find a wealth of interesting content around big data, Hadoop and analytics.
Join us for the inaugural Apache Sentry meetup at Oracle's offices in NYC, on the evening of the last day of Strata + Hadoop World 2013 in New York.
(@ Oracle Offices, 120 Park Ave, 26th Floor -- Note: Bring your ID and check in with security in the lobby!)
We'll kick-off the meetup with the following presentation:
Getting Serious about Security with Sentry
Shreepadma Venugopalan - Lead Engineer for Sentry
Arvind Prabhakar - Engineering Manager for Sentry
Jacco Draaijer - Development Manager for Oracle Big Data
Apache Hadoop offers strong support for authentication and coarse grained authorization - but this is not necessarily enough to meet the demands of enterprise applications and compliance requirements. Providing fine-grained access to data will enable organizations to store more sensitive information in Hadoop; only those users with the appropriate privileges will ever see that sensitive data.
Cloudera and Oracle are taking the lead on Sentry - a new open source authorization module that integrates with Hadoop-based SQL query engines. Key developers for the project will provide details on its implementation, including:
-Motivations for the project
-Key requirements that Sentry satisfies
-Utilizing Sentry in your applications
Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories. However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig. To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.
In this scenario, we have three users: oracle, dan, and not_dan. The oracle user has captured some data in an HDFS directory. The directory has 750 permissions: read/write/execute for oracle, read/execute for dan, and no access for not_dan. One of the files in the directory has 700 permissions, meaning that only the oracle user can read it. Each user tries to do the following tasks:
Each user issues the command:

hadoop fs -ls /user/shared/moving_average|more

And what do they see?

[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items

Obviously, the oracle user can see all the files in its own directory.

[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items
Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.
[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
ls: Permission denied: user=not_dan, access=READ_EXECUTE,
In this test, each user pipes a set of HDFS files into a unix command and counts rows. Recall, one of the files has 700 permissions.
The oracle user, again, can see all the available data:
[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
40
The user with partial permissions receives an error on the console, but can access the data they have permissions on. Naturally, the user without permissions only receives the error.
[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
30

[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---
0
In this final test, the oracle user defines an external Hive table over the shared directory. Each user issues a simple COUNT(*) query against the directory. Interestingly, the results are not the same as piping the datastream to the shell.
The oracle user's query runs correctly, while both dan and not_dan's queries fail:
Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'
Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException (Permission denied: user=not_dan, access=READ_EXECUTE,
So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails because of a FileNotFound exception. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker. The rule then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories, but not another, they can issue queries against the readable partitions, but not against the entire table.
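That rule can be modeled in a few lines. The helper below is a toy illustration of the planning behavior observed above (not Hive code): a query is only submitted when the user can read every file backing the table or partition it touches:

```python
# Toy model of the Hive planning rule: the MapReduce job is only built
# and submitted when every backing file is readable by the querying user.
# One unreadable file fails the whole query before submission.
def query_outcome(user_readable_files, table_files):
    for f in table_files:
        if f not in user_readable_files:
            return "fails before job submission"
    return "runs"

table = ["FlumeData.1", "FlumeData.2"]
print(query_outcome({"FlumeData.1", "FlumeData.2"}, table))  # oracle
print(query_outcome({"FlumeData.1"}, table))                 # dan
```

This is why dan's query fails even though a plain `hadoop fs -cat` returns partial results for the same files: the shell streams per file, while Hive plans over the whole file set up front.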
In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permission structures should be tested against the tools which will access the data. This ensures that users can read what they are allowed to, in the manner that they need to.
There is a lot of hype around big data, but here at Oracle we try to help customers implement big data solutions to solve real business problems. For those of you interested in understanding more about how you can put big data to work at your organization, consider joining these events:
The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.