Monday Feb 20, 2012

We are hiring!

Oracle’s Big Data development team is hiring a Product Manager

About Us and the Product

Big Data is here today. Leading-edge enterprises are implementing Big Data projects. Oracle has extended its product line to embrace the new technologies and the new opportunities in Big Data. The Big Data development team has recently introduced the Big Data Appliance, an engineered system that combines hardware and software optimized for Hadoop and Oracle NoSQL Database (More information on Oracle’s Big Data offerings can be found at oracle.com/bigdata). The Big Data development team is part of the data warehouse organization in Oracle’s Database Server Technologies division, a vibrant engineering organization with deep experience in scalable, parallel data processing, complex query optimization, and advanced analytics.

About the Role

This position is located at our headquarters in Redwood Shores, CA (San Francisco Bay Area). We are seeking a product manager for the Big Data Appliance. As product manager, you will leverage your strong technical background to help define the roadmap for the Oracle Big Data Appliance and become one of the faces of Big Data at Oracle. You will actively write collateral, deliver presentations, and visit customers to ensure the success of Oracle Big Data Appliance and other products in Oracle’s Big Data portfolio, while also working within the development organization to ensure that the Big Data Appliance meets the current and future requirements of our customers.

Learn more and apply: http://www.linkedin.com/jobs?viewJob=&jobId=2555091

Thursday Feb 09, 2012

Sizing for data volume or performance or both?

Before you start reading this post please understand the following:

  • General Hadoop capacity planning guidelines and principles are here. I’m not doubting, replicating or replacing them.
  • I’m going to use our (UPDATED as of Dec 2012) Big Data Appliance as the baseline set of components to discuss capacity. That is because my brain does not hold theoretical hardware configurations.
  • This is (a bit of) a theoretical post to make the point that you need to worry about both (I just gave away the conclusion!), but I’m not writing the definitive guide to sizing your cluster. I just want to get everyone to think of Hadoop as processing AND storage, not either/or!

Now that you are warned, read on…

Imagine you want to store ~50TB (sounded like a nice round number) of data on a Hadoop cluster without worrying about any data growth for now (I told you this was theoretical). I’ll leave the default replication for Hadoop at 3 and simple math now dictates that I need 3 * 50TB = 150TB of Hadoop storage.

I also need space to do my MapReduce work (and this is the same for Pig and Hive, which generate MapReduce), so the system will need to be even bigger to ensure Hadoop can write temporary files, shuffle data, etc.
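As a back-of-the-envelope sketch of that storage math (the 25% temp/shuffle headroom below is my own assumption, not a Hadoop rule), it looks something like this:

```python
# Back-of-the-envelope Hadoop storage sizing for the example above.
# The temp/shuffle headroom is an assumed number, not a Hadoop default.
user_data_tb = 50        # ~50TB of user data
replication = 3          # default HDFS replication factor
temp_overhead = 0.25     # assumed headroom for temporary files and shuffle space

replicated_tb = user_data_tb * replication            # 150 TB of replicated data
total_raw_tb = replicated_tb * (1 + temp_overhead)    # plus working space

print(f"Replicated data: {replicated_tb} TB")
print(f"With temp/shuffle headroom: {total_raw_tb:.0f} TB raw capacity")
```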

Now, that is fabulous, so I need to talk to my lovely Oracle rep and buy a Big Data Appliance, which would easily hold the above-mentioned 50TB with triple replication. It actually holds 150TB (to make the math easy for me) or so of user data, and you will instantly say that the BDA is way too big!

Ah, but how fast do you want to process data? A Big Data Appliance has 18 nodes, and each node has 12 cores to do the work for you. MapReduce uses processes called mappers and reducers (really!) to do the actual work.

Let’s assume that we are allowing Hadoop to spin up 15 mappers per node and 10 reducers per node. Let’s further assume we are going full bore and have every slot allocated to the current and only job’s mappers and reducers (they do not all run together, I know – theoretical exercise, remember?).

Because you decided the Big Data Appliance was way too big, you bought 8 equivalent nodes to fit your data. Two of these run your NameNode, your JobTracker and your Secondary NameNode (you should actually have three nodes for all of this, but I’m generous and assume we run the JobTracker on the Secondary NameNode). That leaves you, however, with 6 data nodes based on your capacity-based sizing.

That system you just bought based on storage will give us 6 * 15 = 90 mappers and 6 * 10 = 60 reducers working on my workload (the 2 other nodes do not run data nodes and do not run mappers and reducers).

Now let’s assume that I finish my job in N minutes on my lovely 8-node cluster by leveraging the full set of workers, and that my business users want to refresh the state of the world every N/2 minutes (it always has to go faster). Assuming linear scalability, the simple answer would be to get 2x the number of nodes in my original cluster… That assumption is reasonable, by the way, for a lot of workloads, certainly for social, search and other data patterns that show little data skew because of their overall data size.

A Big Data Appliance (with 15 of its 18 nodes acting as data nodes) gives us 15 * 15 = 225 mappers and 15 * 10 = 150 reducers working on my 50TB of user data… providing a 2.5x speed-up on my data set.
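To make the slot arithmetic above concrete, here is a small sketch (the 15 mappers and 10 reducers per node are the assumed slot counts from this example, not product defaults):

```python
# Slot math for the two clusters in the example (assumed per-node slot counts).
MAPPERS_PER_NODE = 15
REDUCERS_PER_NODE = 10

def slots(data_nodes):
    """Return (map slots, reduce slots) for a cluster with this many data nodes."""
    return data_nodes * MAPPERS_PER_NODE, data_nodes * REDUCERS_PER_NODE

small_maps, small_reds = slots(6)    # 8-node cluster: 2 master nodes + 6 data nodes
bda_maps, bda_reds = slots(15)       # Big Data Appliance: 15 of 18 nodes as data nodes

print(f"Capacity-sized cluster: {small_maps} mappers, {small_reds} reducers")
print(f"Big Data Appliance:     {bda_maps} mappers, {bda_reds} reducers")
print(f"Speed-up (assuming linear scaling): {bda_maps / small_maps:.1f}x")
```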

Just another reference point on this: a TeraSort of 100GB is run on a 20-node cluster with a total disk capacity of 80TB. That is of course a little too much, but you see the point of not worrying too much about “that big system” and thinking processing power rather than storage.

Conclusion?

You will need to worry about the processing requirements and you will need to understand the characteristics of the machine and the data. You should not size a system, or discard something as too big right away by just thinking about your raw data size. You should really, really consider Hadoop to be a system that scales processing and data storage together and use the benefits of the scale-out to balance data size with runtimes.

PS. Yes, I completely ignored those fabulous compression algorithms… Compression can certainly play a role here but I’ve left it out for now. Mostly because it is extremely hard (at least for my brain) to figure out an average compression rate and because you may decide to only compress older data, and compression costs CPU, but allows faster scan speeds and more of this fun stuff…

Monday Jan 09, 2012

Big Data Appliance and Big Data Connectors are now Generally Available

Today, January 10th, we announced the general availability of Oracle Big Data Appliance and Oracle Big Data Connectors, as well as a partnership with Cloudera. Now that should be fun to start the new year in big data land!!

Big Data Appliance

Oracle Big Data Appliance brings Big Data solutions to mainstream enterprises. Built using industry-standard hardware from Sun and Cloudera's Distribution including Apache Hadoop, the Big Data Appliance is designed and optimized for big data workloads. By integrating the key components of a big data platform into a single product, Oracle Big Data Appliance delivers an affordable, scalable and fully supported big data infrastructure without the risks of a custom-built solution. The Big Data Appliance integrates tightly with Oracle Exadata and Oracle Database using Oracle Big Data Connectors, and enables analysis of all data in the enterprise – structured and unstructured.

Big Data Connectors

Built from the ground up by Oracle, Oracle Big Data Connectors delivers a high-performance Hadoop to Oracle Database integration solution and enables optimized analysis using Oracle’s distribution of open source R analysis directly on Hadoop data. By providing efficient connectivity, Big Data Connectors enables analysis of all data in the enterprise – both structured and unstructured.

Cloudera CDH and Cloudera Manager

Oracle Big Data Appliance contains Cloudera’s Distribution including Apache Hadoop (CDH) and Cloudera Manager. CDH is the #1 Apache Hadoop-based distribution in commercial and non-commercial environments. CDH consists of 100% open source Apache Hadoop plus the comprehensive set of open source software components needed to use Hadoop. Cloudera Manager is an end-to-end management application for CDH. Cloudera Manager gives a cluster-wide, real-time view of nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.

More Information

Data sheets, white papers and other interesting information can be found here:

    * Big Data Appliance OTN page
    * Big Data Connectors OTN page

Happy new year and I hope life just got a bit more interesting!!

Thursday Dec 15, 2011

Understanding a Big Data Implementation and its Components

I often get asked about big data, and more often than not we seem to be talking at different levels of abstraction and understanding. Words like “real time” show up, words like “advanced analytics” show up, and we are instantly talking about products. The latter is typically not a good idea.

So let’s try to step back and look at what big data means from a use case perspective, and how we then map this use case into a usable, high-level infrastructure picture. As we walk through this, you will – hopefully – start to see a pattern and start to understand how words like real time and analytics fit…

The Use Case in Business Terms

Rather than inventing something from scratch, I’ve looked at the keynote use case describing Smart Mall (you can see a nice animation and explanation of smart mall in this video).

The idea behind this is often referred to as “multi-channel customer interaction”, meaning as much as “how can I interact with customers who are in my brick-and-mortar store via their phone”. Rather than having each customer pop out their smart phone to go browse prices on the internet, I would like to drive their behavior pro-actively.

The goals of smart mall are straightforward of course:

  • Increase store traffic within the mall
  • Increase revenue per visit and per transaction
  • Reduce the non-buy percentage

What do I need?

In terms of technologies you would be looking at:

  • Smart Devices with location information tied to an individual
  • Data collection / decision points for real-time interactions and analytics
  • Storage and Processing facilities for batch oriented analytics

In terms of data sets you would want to have at least:

  • Customer profiles tied to an individual linked to their identifying device (phone, loyalty card etc.)
  • A very fine grained customer segmentation
  • Tied to detailed buying behavior
  • Tied to elements like coupon usage, preferred products and other product recommendation like data sets

High-Level Components

A picture speaks a thousand words, so the picture below shows both the real-time decision-making infrastructure and the batch data processing and model generation (analytics) infrastructure.

The first – and arguably most important – step, and the most important piece of data, is the identification of a customer. Step 1 in this case is a user with a cell phone walking into a mall. By doing so we trigger the lookups in step 2a and 2b in a user profile database. We will discuss this a little more later, but in general this is a database leveraging an indexed structure to do fast and efficient lookups. Once we have found the actual customer, we feed the profile of this customer into our real-time expert engine – step 3. The models in the expert system (custom-built or COTS software) evaluate the offers and the profile and determine what action to take (send a coupon for something). All of this happens in real time… keeping in mind that websites do this in milliseconds and our smart mall would probably be ok doing it in a second or so.
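As a toy illustration of steps 1 through 3 (every name and the scoring rule below are hypothetical stand-ins, not any particular Oracle product API), the real-time path boils down to a fast key-value lookup followed by a model evaluation:

```python
# Hypothetical sketch of the real-time path: identify -> look up profile -> decide.
# The profile store and the "model" here are illustrative stand-ins only.

profile_store = {   # stands in for the indexed user profile database (steps 2a/2b)
    "device-42": {"name": "Pat", "segment": "coffee-lover", "visits_this_month": 3},
}

offers = [
    {"id": "latte-coupon", "segment": "coffee-lover", "discount": 0.20},
    {"id": "shoe-coupon",  "segment": "runner",       "discount": 0.15},
]

def decide(device_id):
    """Step 3: feed the profile into a (very simple) expert rule and pick an action."""
    profile = profile_store.get(device_id)   # step 2: fast lookup keyed on the device
    if profile is None:
        return None                          # unknown visitor: no targeted offer
    # Trivial stand-in for a real scoring model: match the segment, prefer the bigger discount.
    matching = [o for o in offers if o["segment"] == profile["segment"]]
    return max(matching, key=lambda o: o["discount"], default=None)

print(decide("device-42"))   # step 1: device-42 just walked into the mall
```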

To build accurate models – and this is where a lot of the typical big data buzzwords come in – we add a batch-oriented, massive processing farm into the picture. The lower half of the picture above shows how we leverage a set of components to create a model of buying behavior. Traditionally we would leverage the database (DW) for this. We still do, but we now leverage an infrastructure before that to go after much more data and to continuously re-evaluate all that data as new data is added.

A word on the sources. One key element is POS data (in the relational database), which I want to link to customer information (either from my web store, from cell phones or from loyalty cards). The NoSQL DB – Customer Profiles in the picture – shows the web store element. It is very important to make sure this multi-channel data is integrated (and de-duplicated, but that is a different topic) with my web browsing, purchasing, searching and social media data.

Once that is done, I can puzzle together the behavior of an individual. In essence, big data allows micro-segmentation at the person level. In effect, for every one of my millions of customers!

The final goal of all of this is to build a highly accurate model to place within the real time decision engine. The goal of that model is directly linked to our business goals mentioned earlier. In other words, how can I send you a coupon while you are in the mall that gets you to the store and gets you to spend money…

Detailed Data Flows and Product Ideas

Now, how do I implement this with real products and how does my data flow within this ecosystem? That is something shown in the following sections…

Step 1 – Collect Data

To look up data, collect it and make decisions on it, you will need to implement a system that is distributed. As these devices essentially keep on sending data, you need to be able to load the data (collect or acquire it) without much delay. That is done, as shown below, in the collection points. That is also the place to evaluate data for real-time decisions. We will come back to the collection points later…

The data from the collection points flows into the Hadoop cluster – in our case, of course, a Big Data Appliance. You would also feed other data into this. The social feeds shown above would come from a data aggregator (typically a company) that sorts out relevant hashtags, for example. Then you use Flume or Scribe to load the data into the Hadoop cluster.

The next step is to add data and start collating, interpreting and understanding the data sets in relation to each other.

For instance, add user profiles to the social feeds and the location data to build up a comprehensive understanding of an individual user and the patterns associated with this user. Typically this is done using MapReduce on Hadoop. The user profiles are batch loaded from the NoSQL DB via a Hadoop InputFormat and thus added to the MapReduce data sets.
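A minimal local sketch of that kind of enrichment is below; in a real cluster the grouping and merging would be done by the MapReduce shuffle, and the record layouts here are made up purely for illustration:

```python
# Toy reduce-side join: enrich location events with user profiles by user id.
# In a real Hadoop job the "map" keying and "reduce" merging happen across the cluster.
from collections import defaultdict

profiles = {                         # batch-loaded profiles (e.g. read via an InputFormat)
    "u1": {"segment": "coffee-lover"},
    "u2": {"segment": "runner"},
}

location_events = [                  # events arriving from the collection points
    ("u1", "mall-entrance", "2011-12-15T10:01"),
    ("u1", "coffee-shop",   "2011-12-15T10:07"),
    ("u2", "mall-entrance", "2011-12-15T10:12"),
]

# "Map" phase: key every event record by user id.
keyed = defaultdict(list)
for user_id, place, ts in location_events:
    keyed[user_id].append((place, ts))

# "Reduce" phase: join each user's events with that user's profile.
for user_id, events in keyed.items():
    segment = profiles.get(user_id, {}).get("segment", "unknown")
    for place, ts in events:
        print(f"{user_id}\t{segment}\t{place}\t{ts}")
```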

To combine it all with Point of Sales (POS) data, with our Siebel CRM data and all sorts of other transactional data, you would use Oracle Loader for Hadoop to efficiently move the reduced data into Oracle. Now you have a comprehensive view of the data that your users can go after, either via Exalytics or BI tools or – and this is the interesting piece for this post – via things like data mining.
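Oracle Loader for Hadoop is the tool described above; purely to illustrate the shape of the hand-off (this is not equivalent to the loader), a simple cx_Oracle sketch with made-up connection details and a hypothetical table might look like this:

```python
# NOT Oracle Loader for Hadoop -- just a toy illustration of landing reduced output
# in an Oracle table. Connection details and table layout are hypothetical.
import cx_Oracle

reduced_rows = [                 # e.g. the per-user output of the join/aggregation step
    ("u1", "coffee-lover", 3),
    ("u2", "runner", 1),
]

conn = cx_Oracle.connect(user="dw_user", password="secret", dsn="dwhost/orcl")
cur = conn.cursor()
cur.executemany(
    "INSERT INTO customer_segments (user_id, segment, visits) VALUES (:1, :2, :3)",
    reduced_rows,
)
conn.commit()
conn.close()
```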

That latter phase – here called Analyze – will create data mining models and statistical models that are going to be used to produce the right coupons. These models are the real crown jewels, as they allow an organization to make decisions in real time based on very accurate models. The models then go into the Collection and Decision points to act on real-time data.

In the picture above you see the gray model being utilized in the Expert Engine. That model describes/predicts the behavior of an individual customer, and based on that prediction we determine what action to undertake.

The above is an end-to-end look at Big Data and real time decisions. Big Data allows us to leverage tremendous data and processing resources to come to accurate models. It also allows us to find out all sorts of things that we were not expecting, creating more accurate models, but also creating new ideas, new business etc.

Once the Big Data Appliance is available you can implement the entire solution as shown here on Oracle technology… now you just need to find a few people who understand the programming models and create those crown jewels.

Friday Dec 09, 2011

Big Data Videos

I've been a bit quiet; I've been busy working towards releasing our Big Data Appliance. But I thought I'd share the YouTube versions of the Openworld videos on big data:

Big Data -- The Challenge
http://www.youtube.com/watch?v=DeQIdp6vYHg

Big Data -- Gold Mine, or just Stuff
http://www.youtube.com/watch?v=oiWlOeGG26U

Big Data -- Big Data Speaks
http://www.youtube.com/watch?v=Qz8bRyf1374

Big Data -- Everything You Always Wanted to Know
http://www.youtube.com/watch?v=pwQ9ztbSEpI

Big Data -- Little Data
http://www.youtube.com/watch?v=J2H6StHNJ18

Should be fun to watch over the weekend!


Thursday Nov 17, 2011

Article on Oracle NoSQL Database Testing

A quick post: it looks like InfoWorld did a test with Oracle NoSQL Database and wrote about it. Read more here:

http://www.infoworld.com/d/data-explosion/first-look-oracle-nosql-database-179107

Enjoy and maybe test this one when you start your investigations into a NoSQL Database!

Friday Nov 11, 2011

My Take on Hadoop World 2011

I’m sure some of you have read pieces about Hadoop World and I did see some headlines which were somewhat, shall we say, interesting?

I thought the keynote by Larry Feinsmith of JP Morgan Chase & Co. was one of the highlights of the conference for me. The reason was very simple: he addressed some real use cases outside of internet and ad platforms.

The following are my notes; since the keynote was recorded, I presume you can go and look at Hadoopworld.com at some point…

On the use cases that were mentioned:

  1. ETL – how can I do complex data transformation at scale
    1. Doing Basel III liquidity analysis
    2. Private banking – transaction filtering to feed [relational] data marts
  2. Common Data Platform – a place to keep data that is (or will be) valuable some day, to someone, somewhere
    1. 360 Degree view of customers – become pro-active and look at events across lines of business. For example, make sure the mortgage folks know when direct deposits into an account stop, and ensure the bank is pro-active in servicing the customer
    2. Treasury and Security – Global Payment Hub [I think this is really consolidation of data to cross reference activity across business and geographies]
  3. Data Mining
    1. Bypass data engineering [I interpret this as running analysis on the full, large data set rather than on samples]
    2. Fraud prevention – work on event triggers, say a number of failed log-ins to the website. When they occur, grab web logs, firewall logs and rules and start to figure out who is trying to log in. Is this me, who forgot his password, or is it someone in some other country trying to guess passwords?
    3. Trade quality analysis – do a batch analysis of all trades done and run them through an analysis or comparison pipeline

One of the key requests – if you can say it like that – was for vendors and entrepreneurs to make sure that new tools work with existing tools. JPMC has a large footprint of BI tools, and big data reporting tools should work with those existing tools rather than be separate.

Security and entitlements – how to protect data within a large cluster from unwanted snooping – was another topic that came up.

I thought his elephant-ears graph was interesting (I couldn’t actually read the points on it, but the concept certainly made some sense), and it was telling – when asked to show hands – that the audience did not (!) think that RDBMS and Hadoop technology would overlap completely within a few years.

Another interesting session was the one from Disney, discussing how Disney is building a DaaS (Data as a Service) platform and how Hadoop processing capabilities are mixed with database technologies. I thought this was one of the best sessions I have seen in a long time. It discussed a real use case, where problems existed, how they were solved and how Disney planned some of it.

The planning focused on three things/phases:

  1. Determine the Strategy – Design a platform and evangelize this within the organization
  2. Focus on the people – Hire key people, grow and train the staff (and do not overload what you have with new things on top of their day-to-day job), leverage a partner with experience
  3. Work on Execution of the strategy – Implement the Hadoop platform next to the other technologies and work toward the DaaS platform

This fit nicely with some of the LinkedIn comments, best summarized as “Think Platform – Think Hadoop”. In other words [my interpretation], step back and engineer a platform (like DaaS in the Disney example), then layer the rest of the solutions on top of this platform.

One general observation: I got the impression that we have knowledge gaps left and right. On the one hand, people are looking for more information and details on the Hadoop tools and languages. On the other hand, I got the impression that the capabilities of today’s relational databases are underestimated, mostly in terms of data volumes, parallel processing capabilities and things like commodity-hardware scale-out models.

All in all I liked this conference; it was great to chat with a wide range of people about Oracle big data, big data in general, use cases and all sorts of other stuff. Just hope they get a set of bigger rooms next time… and yes, I hope I’m going to be back next year!

Monday Oct 31, 2011

Join us for Webcast on Big Data and ask your questions!

Join us for a webcast on big data and Oracle's offerings in this space:

When: November 3rd, 10am PT / 1pm ET
Where: Register here to attend

What:
As the world becomes increasingly digital, aggregating and analyzing new and diverse digital data streams can unlock new sources of economic value, provide fresh insights into customer behavior, and help you identify market trends early on. But this influx of new data can also create problems for IT departments. Attend this Webcast to learn how to capture, organize, and analyze your big data to deliver new insights with Oracle.


Tuesday Oct 25, 2011

Read Up on the Overall Big Data Solution

On top of the NoSQL Database release, I wanted to share the new paper on big data with you all. It gives you an overview of the end-to-end solution as presented at Openworld and places it in the context of the importance of big data for our customers.

Here is a quick look at the Executive Summary and the Introduction (or click here for the paper):

Executive Summary

Today the term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of non-traditional, less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Decreases in the cost of both storage and compute power have made it feasible to collect this data - which would have been thrown away only a few years ago.  As a result, more and more companies are looking to include non-traditional yet potentially very valuable data with their traditional enterprise data in their business intelligence analysis.

To derive real business value from big data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze it within the context of all your enterprise data. Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize these diverse data types and analyze them alongside your existing data to find new insights and capitalize on hidden relationships.

Introduction

With the recent introduction of Oracle Big Data Appliance, Oracle is the first vendor to offer a complete and integrated solution to address the full spectrum of enterprise big data requirements. Oracle's big data strategy is centered on the idea that you can evolve your current enterprise data architecture to incorporate big data and deliver business value. By evolving your current enterprise architecture, you can leverage the proven reliability, flexibility and performance of your Oracle systems to address your big data requirements.

Defining Big Data

Big data typically refers to the following types of data:

  • Traditional enterprise data - includes customer information from CRM systems, transactional ERP data, web store transactions, general ledger data.
  • Machine-generated /sensor data - includes Call Detail Records ("CDR"), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust), trading systems data.
  • Social data - includes customer feedback streams, micro-blogging sites like Twitter, social media platforms like Facebook

The McKinsey Global Institute estimates that data volume is growing 40% per year, and will grow 44x between 2009 and 2020. But while it's often the most visible parameter, volume of data is not the only characteristic that matters. In fact, there are four key characteristics that define big data:

  • Volume. Machine-generated data is produced in much larger quantities than non-traditional data. For instance, a single jet engine can generate 10TB of data in 30 minutes. With more than 25,000 airline flights per day, the daily volume of just this single data source runs into the Petabytes. Smart meters and heavy industrial equipment like oil refineries and drilling rigs generate similar data volumes, compounding the problem.
  • Velocity. Social media data streams - while not as massive as machine-generated data - produce a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes (over 8 TB per day).
  • Variety. Traditional data formats  tend to be relatively well described and change slowly. In contrast, non-traditional data formats exhibit a dizzying rate of change. As new services are added, new sensors deployed, or new marketing campaigns executed, new data types are needed to capture the resultant information.
  • Value. The economic value of different data varies significantly. Typically there is good information hidden amongst a larger body of non-traditional data; the challenge is identifying what is valuable and then transforming and extracting that data for analysis.

To make the most of big data, enterprises must evolve their IT infrastructures to handle the rapid rate of delivery of extreme volumes of data, with varying data types, which can then be integrated with an organization's other enterprise data to be analyzed. 

The Importance of Big Data

When big data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position and greater innovation - all of which can have a significant impact on the bottom line.

For example, in the delivery of healthcare services, management of chronic or long-term conditions is expensive. Use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health and reduce both office visits and hospital admittance.

Manufacturing companies deploy sensors in their products to return a stream of telemetry. Sometimes this is used to deliver services like OnStar, which delivers communications, security and navigation services. Perhaps more importantly, this telemetry also reveals usage patterns, failure rates and other opportunities for product improvement that can reduce development and assembly costs.

The proliferation of smart phones and other GPS devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, a coffee shop or a restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers.

Retailers usually know who buys their products. Use of social media and web log files from their ecommerce sites can help them understand who didn't buy and why they chose not to, information not available to them today. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies.

Finally, social media sites like Facebook and LinkedIn simply wouldn't exist without big data. Their business model requires a personalized experience on the web, which can only be delivered by capturing and using all the available data about a user or member.
-------

The full paper is linked here. Happy reading...

Monday Oct 24, 2011

Oracle NoSQL Database EE is now Available

Oracle NoSQL Database was announced at Openworld, just under a month ago. It is now available (Enterprise Edition) and can be downloaded from OTN.

Read more in the press release here. Obviously, this is great news and we are happy to see the first of a set of big data products becoming generally available. More products coming!

 

Monday Oct 17, 2011

More information on Oracle NoSQL Database

Since everyone asks me for more information all the time, here are some additional resources on Oracle NoSQL Database:

  • The OTN page discussing NoSQL Database - make sure to browse all the way down (link)
  • Data Sheet on Oracle NoSQL Database (link)
  • Technical white paper (link)
  • Cisco's solutions brief (link) on the Cisco site
  • Technical blog from the development team (link)

As we progress to the release dates for the other components, expect details like this to show up across the board...
About

The Data Warehouse Insider is written by the Oracle product management team and sheds light on all things data warehousing and big data.
