By Jean-Pierre Dijcks-Oracle on Jan 21, 2013
Apart from the 2012 Government Big Data Award, hear from Thomson Reuters in this video:
Hello world. Wow, time went by too fast. Happy new year, and here is the long-overdue update on the new Big Data Appliance and the software updates.
Both the software and the hardware of the Big Data Appliance got a refresher.
A good place to start is to quickly review the hardware differences (no price changes!). On a per-node basis, the following is a comparison between old and new (X3-2) hardware:
| Big Data Appliance v1 | Big Data Appliance X3-2 |
| --- | --- |
| 2 x 6-Core Intel® Xeon® 5675 (3.06 GHz) | 2 x 8-Core Intel® Xeon® E5-2660 (2.2 GHz) |
| 48GB memory | 64GB memory, expandable to 512GB |
| 12 x 3TB High Capacity SAS | 12 x 3TB High Capacity SAS |
| 1 KVM Switch | No KVM Switch |
For all the details on the environmentals and other useful information, review the data sheet for Big Data Appliance X3-2. For those wondering what we did with the 2RU left over from the KVM: that is now open space at the top of the rack.
The higher core count gives the BDA X3-2 more parallel compute power while saving some 30% in energy and heat.
As with the hardware, a good place to start is a quick overview of the software changes in the table below:
| Big Data Appliance v1.1.x Software Stack | Big Data Appliance V2.0.1 Software Stack |
| --- | --- |
| Oracle Linux 5.6 | Oracle Linux 5.8 with UEK |
| Oracle Enterprise Manager | Big Data Appliance Plug-In for Enterprise Manager |
| Open Source R | Oracle R Distribution 2.x |
| Big Data Connectors 1.1.x * | Big Data Connectors 2.0.x * |
| Oracle NoSQL Database CE 1.x ** | Oracle NoSQL Database CE 2.x ** |
* Oracle Big Data Connectors is a separately licensed product which can be pre-installed and pre-configured on BDA
** Oracle NoSQL DB 2.x will be pre-installed in a future update to Mammoth but can be applied manually today
Apart from the version updates, bug fixes and a great number of performance improvements across the entire system, the biggest updates are: the inclusion of CDH 4.1.2 with highly available NameNodes set up by default for Hadoop, Enterprise Manager management of the BDA, the uptake of the Oracle R Distribution, and the updates to Oracle NoSQL Database. In a nutshell, these updates deliver the following improvements:
The latest versions of CDH and Cloudera Manager deliver a long list of improvements. On top of this, BDA now has both ZooKeeper and Oozie configured out of the box.
The new Big Data Appliance Plug-In for Enterprise Manager delivers the first end-to-end management of the Hadoop cluster, from hardware metrics to software and Hadoop metrics. To achieve this, Enterprise Manager delivers all the system metrics users are used to from the Exadata Plug-In for Enterprise Manager, and it enables a seamless transition between the hardware and high-level software monitoring and the expanded Hadoop monitoring and diagnostics from Cloudera Manager. This combination of functionality makes operating a BDA simpler and allows operations staff to switch easily between their Exadata, Big Data Appliance and other Oracle Engineered Systems.
The big difference between Oracle R Distribution and the open source R distribution is that Oracle R Distribution dynamically loads the optimized math kernel libraries for CPUs from both Intel and AMD. This increases the performance of basic calculations, which in turn increases the performance of overall R calculations, because more math is off-loaded onto the CPUs.
A great number of new features have been added in NoSQL DB 2.x. Most of these are in both the Community Edition as well as the Enterprise Edition. Charles Lamb has a nice concise post describing what is new here.
To close out, Big Data Connectors got a refresher focused on performance, so download the new products and give them a go via this download page. For more information, read the data sheet here.
Hot off the press:
The winner of the 2012 Government Big Data Solutions Award is the National Cancer Institute! Read all the details on CTOLabs.com.
A short excerpt to whet your appetite: "... This solution, based on the Oracle Big Data Appliance with the Cloudera Distribution of Apache Hadoop (CDH), leverages capabilities available from the Big Data community today in pioneering ways that can serve a broad range of researchers. The promising approach of this solution is repeatable across many other Big Data challenges for bioinformatics, making this approach worthy of its selection as the 2012 Government Big Data Solution Award."
Congrats to the entire team!!
There's a new whitepaper up on Oracle Technology Network which details the use of Digital Reasoning Systems' Synthesys software on Oracle Big Data Appliance. Digital Reasoning's approach is inherently "big data friendly," as it leverages multiple components of the Hadoop ecosystem. Moreover, the paper addresses the oldest big data problem of them all: extracting knowledge from human text.
You can find the paper here.
From the Executive Summary:
There is a wealth of information to be extracted from natural language, but that extraction is challenging. The volume of human language we generate constitutes a natural Big Data problem, while its complexity and nuance requires a particular expertise to model and mine. In this paper we illustrate the impressive combination of Oracle Big Data Appliance and Digital Reasoning Synthesys software. The combination of Synthesys and Big Data Appliance makes it possible to analyze tens of millions of documents in a matter of hours. Moreover, this powerful combination achieves four times greater throughput than conducting the equivalent analysis on a much larger cloud-deployed Hadoop cluster.
For those who have had the pleasure of being in SF for Oracle Openworld, you might have seen or heard about this story already. If you did not, here is a great story on how to use Oracle NoSQL Database.
Apart from all the cool technology, I'm just excited that this is a company founded by a football international, dealing with sports data, games and other cool things. An all-things-cool combo in one place.
More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want to codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site.
In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie.
I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me the flexibility of Hive's query language when I want it, but lets me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code.
CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column1 STRING, column2 STRING)
PARTITIONED BY (yr string)
STORED AS ...
As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of its long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing a SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access.
ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010');

INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history'
SELECT w.stn, w.wban, w.weather_year, w.weather_month,
       w.weather_day, w.temp, w.dewp, w.weather
FROM (
  FROM historic_weather
  SELECT TRANSFORM(...) USING '...'
  AS stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather
) w;
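The TRANSFORM clause streams each raw row through an external script (the script name is elided above). As a rough sketch of what such a fixed-width parser could look like -- the field names match the query above, but the byte offsets here are illustrative placeholders, not the actual NCDC layout:

#!/usr/bin/env python
# Hypothetical fixed-width parser for Hive's TRANSFORM: reads raw observation
# lines on stdin and emits tab-separated fields on stdout.
import sys

# Illustrative (name, start, end) byte offsets -- the real NCDC layout differs.
FIELDS = [('stn', 0, 6), ('wban', 7, 12), ('weather_year', 14, 18),
          ('weather_month', 18, 20), ('weather_day', 20, 22),
          ('temp', 24, 30), ('dewp', 35, 41), ('weather', 42, 46)]

for line in sys.stdin:
    # Slice each fixed-width field and strip the padding
    print '\t'.join(line[start:end].strip() for (_name, start, end) in FIELDS)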
Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive actions using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called ncdc_parse.hql.
Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions as workflow jobs, but they can be started automatically, either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling (sketched after the workflow definition below). The bare minimum workflow XML defines a name, a starting point, and an end point:
<workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
  <start to="ParseNCDCData"/>
  <end name="end"/>
</workflow-app>
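For comparison, a minimal coordinator definition is just a thin scheduling wrapper around a workflow. A sketch (the frequency, dates, and application path here are illustrative):

<coordinator-app name="WeatherMan-coord" frequency="${coord:days(1)}"
                 start="2013-01-01T00:00Z" end="2014-01-01T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://namenode/user/oracle/weather_ooze</app-path>
    </workflow>
  </action>
</coordinator-app>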
To the workflow definition we need to add an action, and within that we'll specify the Hive parameters. Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure.
<action name="ParseNCDCData"><hive xmlns="uri:oozie:hive-action:0.2"><job-tracker>localhost:8021</job-tracker><name-node>localhost:8020</name-node><configuration><property><name>oozie.hive.defaults</name><value>/user/oracle/weather_ooze/hive-default.xml</value></property></configuration><script>ncdc_parse.hql</script></hive><ok to="WeatherMan"/><error to="end"/></action>
Note two things: the hive-default.xml and the script file must be stored in HDFS, and you may need different hive-defaults.xml files on different clusters (e.g., MySQL- or Postgres-backed metastores).
A quick way to ensure that all the assets end up in the right place in HDFS is to make a working directory locally, build your workflow.xml in it, and copy in the assets you'll need as you add actions to workflow.xml. At this point, our local directory should contain:
- workflow.xml
- ncdc_parse.hql
- hive-defaults.xml (make sure this file contains your metastore connection data)
Adding our Pig script as an action is slightly simpler from an XML standpoint. All we do is add an action to workflow.xml as follows:
<action name="WeatherMan"><pig><job-tracker>localhost:8021</job-tracker><name-node>localhost:8020</name-node><script>weather_train.pig</script></pig><ok to="end"/><error to="end"/></action>
Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My Pig script registers the Weka Jar and a chunk of Jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the Pig script, because Pig attempts to load Jython files from the directory in which the script executes. However, that's not where our Weka Jar goes.
While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that does use a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the Pig action with an <archive> tag.
Yes, that's as confusing as you think it is.
You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook.
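As an illustration, our Pig action could attach such a jar roughly like this -- the HDFS path is made up for the example, and the cookbook has the exact rules for your Oozie version:

<action name="WeatherMan">
  <pig>
    <job-tracker>localhost:8021</job-tracker>
    <name-node>localhost:8020</name-node>
    <script>weather_train.pig</script>
    <!-- illustrative: a jar stored outside working_directory/lib,
         symlinked into the job's working directory via the # fragment -->
    <archive>/user/oracle/shared_jars/weka.jar#weka.jar</archive>
  </pig>
  <ok to="end"/>
  <error to="end"/>
</action>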
We've got a workflow defined and have collected all the components we'll need to run it. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server. In the same working directory, we'll make a file called job.properties.
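A representative sketch of its contents (the values are illustrative and must match your cluster):

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
weatherRoot=weather_ooze

# Oozie's shared libraries and the location of workflow.xml in HDFS
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.wf.application.path=${nameNode}/user/oracle/${weatherRoot}
outputDir=weather_analysis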
While some of the pieces of the properties file are familiar (e.g., the JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName); we're simply using them to simplify the directives for the Oozie job. The oozie.libpath entry is extremely important: it points to a directory in HDFS which holds Oozie's shared libraries, a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory.
We're finally ready to submit our job! After all that work we only need to do a few more things:

1. Validate the workflow definition:
oozie validate workflow.xml

2. Copy the working directory up to HDFS:
hadoop fs -put working_dir /user/oracle/working_dir

3. Submit the job to the Oozie server, passing the job.properties file as an argument:
oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit

Oozie responds with a job ID and leaves the job in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job:

oozie job -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle
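From there, the standard Oozie CLI can poll the job's progress (same server URL and job ID):

oozie job -oozie http://url.to.oozie.server:port_number/ -info 14-20120525161321-oozie-oracle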
So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.
It is a news day today: check out the pictures of the BDA getting wheeled into Enkitec...
Always fun to see happy faces when the new toy arrives. Looking forward to more Enkitec news in the coming weeks (starting at Openworld of course).
Here I am with an actual story of one of my evenings: trying to spend money with a company and ultimately failing. I just gave up and bought a service from another vendor, not the incumbent. Below is that story, and how I think big data could actually fix this (and potentially prevent some of it from happening). In the end, this story should illustrate how big data can benefit both me (getting me what I want without causing grief) and the company I am trying to buy something from.
Note: Lots of details left out, I have no intention of being the annoyed blogger moaning about a specific company.
We watch TV, we have internet and we do have a land line. The land line is from a different vendor than the TV and the internet. I have decided that this makes no sense, and I was going to get a bundle (no need to infer who this is, I just picked the generic bundle word as this is what I want to get) of all three services, as this seems to save me money. I also do not want to talk to people; I just want to click on a website when I feel like it and get it all sorted. I do think that is reality these days. I want to just do my shopping at 9.30pm while watching silly reruns on TV.
So, I'm an existing customer of the company I want to buy my bundle from. I go to the website, I click on offers. Turns out they are offers for new customers. After grumbling about how good they are, I click on offers for existing customers. Bummer, it goes to offers for new customers, so I click again on the link for offers for existing customers. No cigar... it just does not work.
Big data solutions:
1) Do not show an existing customer the offers for new customers unless they are the same => This is only partially doable without a login, but once a customer logs in, the application should always know that this is an existing customer. And in general: imagine I do this from my home, going through this vendor's own internet service to their domain... an instant filter should move me onto the "existing customer" route.
2) Flag dead or incorrect links => I've clicked the link for "existing customer offers" at least 3 times in under 5 seconds... Identifying patterns like this is easy in Hadoop and can very quickly produce a list of potentially incorrect links. No need for real-time fixing; just the fact that links like this can be proactively fixed across the entire web domain is a good thing. Preventative maintenance!
Apart from the fact that the browsing path to actually get to what I want is poorly designed, my purchase never gets past a specific point. In other words, I put something into my shopping cart, and when I want to move on, the application either crashes (sending me to an error page), hangs, or drops into something like a chat window. So I try again, and again and again. I think I tried this entire path (while being logged in!!) at least 10 times over the course of 20 minutes. I also clicked on the feedback button and, frustrated as I was, tried to explain that this did not work...
Big Data Solutions:
1) This web site does shopping cart analysis. I got an email the next day stating I had things in my shopping cart; just click here to complete my purchase. After the above experience, this just added insult to my pain...
2) What should have happened is a Hadoop job going over all logged-in customers on the buy flow. It should flag anyone who keeps trying (multiple attempts from the same user to do the same thing), analyze the shopping cart and the clicks to identify what the customer wants, take in the feedback provided (note: always own your own website feedback, never just farm this out!!) and, within a short turnaround time (30 minutes to 2 hours or so), email me a link to complete my purchase. Not a link to my shopping cart 12 hours later, but a link to actually achieve what I wanted...
I do believe this is relatively easy to do using our Oracle Event Processing and Big Data Appliance solutions combined. It is almost so simple (to my mind) that it makes no sense that this is not in place. But now I am ranting... Why is this interesting? Because of $$$$.
I tried really hard: I did this all in the evening, and again in the morning before going to work. I kept on failing, but I really wanted this to work... An email that said "sorry, we noticed you tried to get a bundle (the log knows what I wanted and where I failed, so this is easy to generate); here is the link to click and complete your purchase, and here are two movies on us as an apology" would have kept me as a customer and brought in the additional $$$$ per month for the next couple of years. It would also have led to upsell on my phone package, etc.
Instead, I went to a completely different company and bought service from them. The result: lost money for company A, negative sentiment towards company A, and me telling this story at the water cooler, influencing more people to think negatively about company A. All in all, a loss of easy money and a ding in sentiment and image, where a relatively simple solution exists and can be put in place on the software I describe routinely in this blog...
For those who are coming to Openworld and maybe see value in solving the above, or are thinking of how to solve this, come visit us in Moscone North - Oracle Red Lounge or in the Engineered Systems Showcase.
As we are gearing up for Openworld, here is a nice E-book on big data to start paging through. It contains Gartner's take on big data, customer and partner interviews and a lot more good info. Enjoy the read so you come prepared for Openworld!!
For those coming to Oracle Openworld (or the Americas Cup races around the same time), you can find big data sessions via this URL.
There's a lot to learn from log data, but to get the most value from it, that data needs to be easy to collect and analyze. Otherwise, time that could be used to learn from data is spent writing parsers and transport components. In this entry we'll simplify log collection and transport using JSON serialization and parts of the Hadoop ecosystem.
Logging everything in JSON is a great idea. As serialization formats go it's engineer-friendly: you and your favorite programming language can both read it. Moreover, having all of your log data structured as universal data structures makes getting started with analytics much simpler. To illustrate how much simpler, we'll take JSON logs written to a flat file, stream them into HDFS, and expose them via Hive for exploration and aggregation.
We're going to use three components to put our system together:
- Flume, to collect the JSON logs from our application servers and transport them
- HDFS, as the destination for the log data
- Hive with a JSON SerDe, to expose the data for exploration and aggregation

We'll begin by setting up the final destination for our log data. This requires that we create a directory in HDFS to hold the log data and define a Hive table over it. Making the directory is easy:
hadoop fs -mkdir /user/oracle/tail_table
Similarly, defining the external table is straightforward in the Hive command line:
CREATE EXTERNAL TABLE IF NOT EXISTS tail_table(fieldA int, fieldB string, fieldC float)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/oracle/tail_table';
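To make that concrete: a log line like the following (a made-up record matching the columns above) lands as a row we can query with plain HiveQL.

{"fieldA": 1, "fieldB": "checkout", "fieldC": 0.35}

-- e.g., aggregate over the streamed-in logs
SELECT fieldB, COUNT(*) AS events, AVG(fieldC) AS avg_c
FROM tail_table
GROUP BY fieldB;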
CDH 3u4 ships with two very different versions of Flume. The default is Flume 0.9.4, or Flume OG. It's great at streaming data into HDFS, but Flume OG has some requirements.
collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30)
This configuration will tail the log file and write a new message into HDFS with each new line. The collectorSink will commit data to our Hive table every 30 seconds.
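Put together as a Flume OG dataflow spec, a single-node sketch would look something like this (the node name and log path are illustrative):

tail_node : tail("/path/to/json.log") | collectorSink("hdfs://namenode/user/oracle/tail_table", "logdata", 30) ;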
The alternative, Flume NG (the 1.x line), makes multi-hop transfer much simpler. For this type of multi-hop log transfer, we need a flume-ng agent running on each application server and one on the Hadoop cluster. The application servers will have a flume.conf file which includes something like this:
app-agent.sources = tail
app-agent.channels = memoryChannel
app-agent.sinks = avro-forward-sink

# In-memory channel buffering events between source and sink
app-agent.channels.memoryChannel.type = memory

app-agent.sources.tail.type = exec
app-agent.sources.tail.command = tail -f /path/to/json.log
app-agent.sources.tail.channels = memoryChannel

app-agent.sinks.avro-forward-sink.type = avro
app-agent.sinks.avro-forward-sink.hostname = 10.1.1.100
app-agent.sinks.avro-forward-sink.port = 10000
app-agent.sinks.avro-forward-sink.channel = memoryChannel
This sets up a source that runs "tail" and a sink that forwards the data via Avro RPC to 10.1.1.100 on port 10000. On the Hadoop cluster, the receiving agent's flume.conf defines the matching Avro source and an HDFS sink:
hdfs-agent.sources = avro-collect
hdfs-agent.channels = memoryChannel
hdfs-agent.sinks = hdfs-write

# Buffering channel on the collector side as well
hdfs-agent.channels.memoryChannel.type = memory

hdfs-agent.sources.avro-collect.type = avro
hdfs-agent.sources.avro-collect.bind = 10.1.1.100
hdfs-agent.sources.avro-collect.port = 10000
hdfs-agent.sources.avro-collect.channels = memoryChannel

hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.path = hdfs://namenode/user/oracle/tail_table
hdfs-agent.sinks.hdfs-write.rollInterval = 30
hdfs-agent.sinks.hdfs-write.channel = memoryChannel
On this side we've defined a source that reads Avro messages from port 10000 on 10.1.1.100 and writes the results into HDFS, rolling the file every 30 seconds. It's just like our setup in Flume OG, but now multi-hop forwarding is a snap.
No matter which version you deploy, the combination of Flume, Hive and JSON makes it straightforward to stand up an end-to-end pipeline for consuming and analyzing serialized log data. With deployments this simple, you can spend more time focusing on your applications and analytics.
You can view the videos on YouTube. They are a great place to start learning about big data, the value it can bring to your organisation and how Oracle can help you start working with big data today.
Sitting in Dominic Delmolino's session at ODTUG (KScope 12). If the session count at conferences is any indication, we will see more and more people start to deploy MapReduce in the database. And yes, that would be with SQL and PL/SQL first and foremost. Both Dominic and our own Bryn Llewellyn are doing MapReduce-in-the-database presentations.
Since I have seen both, I would advise people to first look through Dominic's session to get a good grasp of what mappers do and what reducers do, then dive into Bryn's for a bunch of PL/SQL examples. The thing I like about Dominic's is the last slide (a recursive WITH statement) showing how to do this in SQL...
Now I am hoping that next year we will see tools vendors show off how they work with Hadoop and MapReduce (at least talking about the concepts!!).
Since I was on vacation I missed an interesting question on one of the posts:
I see the interesting pyramid in the middle of your article showing the current status of big data market in the many eyes of the beholders and have made many thoughts on that. In your opinion, do you think "Big Data is our business" is only applicable for those Internet-scale companies, independent software vendors OR it will happen soon for most large-scale corporations in the upcoming future? I will think that "Big Data improves our business" shall be applicable to those large-scale corporate and up to that line only in the mainstream. Would like to hear more comments from your side especially on the use of Big Data for Corporate IT or in the Corporate level.
I think you can argue about this quite a bit, but I do believe that the top tier is something we should be able to achieve, even if we are not an IT, web or data company. But you make a good point. In its current state, the wording is a bit off and could lead people to believe what you said... Does a manufacturing company really become a data company?
Maybe it should read "Big Data Transforms Our Business"... which is what I am really after. Improvements are good, but creativity should make us leverage data and change the outcomes of business processes. Then I think it is achievable for everyone...
Then a manufacturing company can transform its business, and a telecom company can all of a sudden personalize every interaction with every customer.
Thanks for the great comment; it certainly helped formulate some of this more crisply, and I hope it makes sense.
We are seeing a lot of discussions about pricing and comparable pricing between a "standard Hadoop cluster" and the Oracle Big Data Appliance. This post is aimed at providing a simple apples-to-apples comparison and a clarification of what is, and what is not, included in the pricing and packaging of Oracle Big Data Appliance. Read the updated post here.
The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.