Wednesday Oct 09, 2013
By Jean-Pierre Dijcks-Oracle on Oct 09, 2013
For those who did go to OpenWorld, the session catalog now has the download links to the session materials online. You can now refresh your memory and share your experience with the rest of your organization. For those who did not go, here is your chance to look over some of the materials.
On the big data side, here are some of the highlights:
- Big Data Deep Dive - Lessons Learned from a Big Data Project (Dan McClary - Oracle and Terry McFadden - P&G)
- Hadoop your ETL - Using Big Data technologies to Enhance Today's DW (George Lumpkin - Oracle)
- Big Data Deep Dive - Oracle Big Data Appliance (Jacco Draaijer - Oracle, Justin Erickson - Cloudera and Jean-Pierre Dijcks - Oracle)
- Big Data Deep Dive - Oracle's Strategy and Roadmap (Marty Gubar - Oracle and Jean-Pierre Dijcks - Oracle)
- Programming with Oracle Big Data Connectors (Melliyal Annamalai - Oracle and Robert Abbott - Oracle)
- Oracle's Big Data Solutions - NoSQL, R, Connectors and Appliance Technologies (Oracle and Customer speakers)
There are a great number of other sessions; simply look for Solutions => Big Data and Business Analytics and you will find a wealth of interesting content around big data, Hadoop and analytics.
By Jean-Pierre Dijcks-Oracle on Oct 09, 2013
Join us for the inaugural Apache Sentry meetup at Oracle's offices in NYC, on the evening of the last day of Strata + Hadoop World 2013 in New York.
(@ Oracle Offices, 120 Park Ave, 26th Floor -- Note: Bring your ID and check in with security in the lobby!)
We'll kick off the meetup with the following presentation:
Getting Serious about Security with Sentry
Shreepadma Venugopalan - Lead Engineer for Sentry
Arvind Prabhakar - Engineering Manager for Sentry
Jacco Draaijer - Development Manager for Oracle Big Data
Apache Hadoop offers strong support for authentication and coarse grained authorization - but this is not necessarily enough to meet the demands of enterprise applications and compliance requirements. Providing fine-grained access to data will enable organizations to store more sensitive information in Hadoop; only those users with the appropriate privileges will ever see that sensitive data.
Cloudera and Oracle are taking the lead on Sentry - a new open source authorization module that integrates with Hadoop-based SQL query engines. Key developers for the project will provide details on its implementation, including:
-Motivations for the project
-Key requirements that Sentry satisfies
-Utilizing Sentry in your applications
Thursday Jul 18, 2013
By dan.mcclary on Jul 18, 2013
Documentation and most discussions are quick to point out that HDFS provides OS-level permissions on files and directories. However, there is less readily-available information about what the effects of OS-level permissions are on accessing data in HDFS via higher-level abstractions such as Hive or Pig. To provide a bit of clarity, I decided to run through the effects of permissions on different interactions with HDFS.
In this scenario, we have three users: oracle, dan, and not_dan. The oracle user has captured some data in an HDFS directory. The directory has 750 permissions: read/write/execute for oracle, read/execute for dan, and no access for not_dan. One of the files in the directory has 700 permissions, meaning that only the oracle user can read it. (A minimal sketch reproducing this layout follows the task list below.) Each user then tries the following tasks:
- List the contents of the directory
- Count the lines in a subset of files including the file with 700 permissions
- Run a simple Hive query over the directory
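To make the scenario concrete, here is a minimal sketch (not from the original post) that reproduces the permission layout using the Hadoop FileSystem Java API; the paths and the file name are taken from the listings below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class SetupSharedDir {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/user/shared/moving_average");

    // 750 on the directory: rwx for oracle, r-x for dan's group, nothing for not_dan
    fs.setPermission(dir, new FsPermission((short) 0750));

    // 700 on a single file: only the oracle user can read it
    fs.setPermission(new Path(dir, "FlumeData.1374082184056"),
                     new FsPermission((short) 0700));
  }
}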
Each user issues the command
hadoop fs -ls /user/shared/moving_average|more
And what do they see?
[oracle@localhost ~]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items
Obviously, the oracle user can see all the files in its own directory.
[dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
Found 564 items
Similarly, since dan has group read access, that user can also list all the files. The user without group read permissions, however, receives an error.
[not_dan@localhost oracle]$ hadoop fs -ls /user/shared/moving_average|more
ls: Permission denied: user=not_dan, access=READ_EXECUTE,
Counting Rows in the Shell
In this test, each user pipes a set of HDFS files into a unix command and counts rows. Recall, one of the files has 700 permissions.
The oracle user, again, can see all the available data:
[oracle@localhost ~]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
40
The user with partial permissions receives an error on the console, but can access the data they have permissions on. Naturally, the user without permissions only receives the error.
[dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=dan, access=READ, inode="/user/shared/moving_average/FlumeData.1374082184056":oracle:shared_hdfs:-rw-------
30
[not_dan@localhost oracle]$ hadoop fs -cat /user/shared/moving_average/FlumeData.137408218405*|wc -l
cat: Permission denied: user=not_dan, access=READ_EXECUTE, inode="/user/shared/moving_average":oracle:shared_hdfs:drwxr-x---
0
Permissions on Hive
In this final test, the oracle user defines an external Hive table over the shared directory. Each user issues a simple COUNT(*) query against the directory. Interestingly, the results are not the same as piping the datastream to the shell.
The oracle user's query runs correctly, while both dan and not_dan's queries fail:
Job Submission failed with exception 'java.io.FileNotFoundException(File /user/shared/moving_average/FlumeData.1374082184056 does not exist)'
Job Submission failed with exception 'org.apache.hadoop.security.AccessControlException (Permission denied: user=not_dan, access=READ_EXECUTE,
So, what's going on here? In each case, the query fails, but for different reasons. In the case of not_dan, the query fails because the user has no permissions on the directory. However, the query issued by dan fails because of a FileNotFound exception. Because dan does not have read permissions on the file, Hive cannot find all the files necessary to build the underlying MapReduce job. Thus, the query fails before being submitted to the JobTracker. The rule, then, becomes simple: to issue a Hive query, a user must have read permissions on all files read by the query. If a user has permissions on one set of partition directories, but not another, they can issue queries against the readable partitions, but not against the entire table.
In a nutshell, the OS-level permissions of HDFS behave just as we would expect in the shell. However, problems can arise when tools like Hive or Pig try to construct MapReduce jobs. As a best practice, permissions structures should be tested against the tools which will access the data. This ensures that users can read what they are allowed to, in the manner that they need to.
Wednesday Jul 17, 2013
By Jean-Pierre Dijcks-Oracle on Jul 17, 2013
There is a lot of hype around big data, but here at Oracle we try to help customers implement big data solutions to solve real business problems. For those of you interested in understanding more about how you can put big data to work at your organization, consider joining these events:
Event Registration Page
Event Registration Page
Friday May 10, 2013
By Jean-Pierre Dijcks-Oracle on May 10, 2013
A quick update on some of the integration components needed to build things like M2M (machine-to-machine communication) applications and to integrate fast-moving data (events) with Hadoop and the NoSQL Database. As of 18.104.22.168 of the Oracle Event Processing product you now have:
OEP Data Cartridge for Hadoop (the real doc is here)
OEP Data Cartridge for NoSQL Database (the real doc is here)
The fun with these products is that you can now model (in a UI!!) how to interact with them. For example, you can sink data into Hadoop without impacting the stream logic and stream performance, and you can do a quick CQL (the OEP language) lookup against our NoSQL DB to resolve, for example, a customer profile or status.
More to come, but this is very interesting and really something cool: products working together out of the box.
Friday May 03, 2013
By Jean-Pierre Dijcks-Oracle on May 03, 2013
For those interested in understanding how to actually build a big data solution, including things like NoSQL Database, Hadoop, MapReduce, Hive, Pig and analytics (data mining, R), have a look at the big data videos Marty did:
- Video 1: Using Big Data to Improve the Customer Experience
- Video 2: Using Big Data to Deliver a Personalized Service
- Video 3: Using Big Data and NoSQL to Manage On-line Profiles
- Video 4: Oracle Big Data and Hadoop to Process Log Files
- Video 5: Integrate All your data with Big Data Connectors
- Video 6: Maximizing business impact with Big Data
Happy watching and learning.
Thursday Apr 11, 2013
By Jean-Pierre Dijcks-Oracle on Apr 11, 2013
This week Oracle announced the availability (yes, you can buy and use these systems right away) of the Big Data Appliance X3-2 Starter Rack and the Big Data Appliance X3-2 In-Rack Expansion. You can read the press release here. For those who are interested in the operating specs, it is best to look at the data sheet on OTN.
So what does this mean? In effect, it means that you can now start any big data project with an appliance. Whether you are looking to try your hand at your first Hadoop project, or you are building your enterprise Hadoop solution with a large number of nodes, you can now get the benefits of Oracle Big Data Appliance. By leveraging Big Data Appliance for all your big data needs (be that Hadoop or Oracle NoSQL Database) you always get:
- Reduced risk by having the best of Oracle and Cloudera engineering available in an easy to consume appliance
- Faster time to value by not spending weeks or months building and tuning your own Hadoop system
- No cost creep for the cluster as your system is set up and configured for a known cost
Assume you want to start your first Hadoop implementation: you can now start with the BDA Starter Rack, six servers which you can fully deploy for HDFS and MapReduce capabilities (of course we also support, for example, HBase). All the services are pre-configured, so you have highly available NameNodes, automatic failover and a balanced approach to leveraging the six servers as Hadoop nodes. As your project grows (and you need more compute power and space to store data), you simply add nodes in chunks of six using the In-Rack Expansion, filling up the rack.
Once full you can either add another Starter Rack or add Full Racks to the system. As you do that, Mammoth - the install, configure and patch utility for BDA - ensures that your service nodes are in the appropriate place. For example, once you have 2 cabinets assigned to a single cluster, Mammoth will move the second NameNode to the second Rack for higher fault tolerance.
This new release of Big Data Appliance (the software parts of it) also includes Cloudera CDH 4.2 and Cloudera Manager 4.5. On top of that, you can now create multiple clusters on a single BDA Full Rack using just Mammoth, which means you can now patch and update individual clusters on that Full Rack. As you add nodes to a cluster, Mammoth will allow you to choose where to add nodes, how to grow a set of clusters, etc.
Last but not least, there is more flexibility in how to acquire Big Data Appliance Full Rack, as it is now a part of Oracle's Infrastructure as a Service offering, allowing for a smooth capital outflow for your Big Data Appliance.
Sunday Apr 07, 2013
By dan.mcclary on Apr 07, 2013
In the final installment in our series on Hive UDFs, we're going to tackle the least intuitive of the three types: the User Defined Aggregating Function. While they're challenging to implement, UDAFs are necessary if we want functions for which the distinction between map-side and reduce-side operations is opaque to the user. If a user is writing a query, most would prefer to focus on the data they're trying to compute, not which part of the plan is running a given function.
The UDAF also provides a valuable opportunity to consider some of the nuances of distributed programming and parallel database operations. Since each task in a MapReduce job operates in a bit of a vacuum (e.g. Map task A does not know what data Map task B has), a UDAF has to explicitly account for more operational states than a simple UDF. We'll return to the notion of a simple Moving Average function, but ask yourself: how do we compute a moving average if we don't have state or order around the data?
As before, the code is available on github, but we'll excerpt the important parts here.
Prefix Sum: Moving Average without State
In order to compute a moving average without state, we're going to need a specialized parallel algorithm. For moving average, the "trick" is to use a prefix sum, effectively keeping a table of running totals for quick computation (and recomputation) of our moving average. A full discussion of prefix sums for moving averages is beyond the length of a blog post, but John Jenq provides an excellent discussion of the technique as applied to CUDA implementations.
What we'll cover here is the necessary implementation of a pair of classes to store and operate on our prefix sum entry within the UDAF.
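A hedged sketch of those two classes (the names here are illustrative; the exact source is on github, and imports such as java.util.ArrayList, java.util.Collections, java.util.List and Hive's DoubleWritable are assumed for all the fragments that follow):

public class PrefixSumMovingAverage {

  // one row of the prefix-sum table
  static class PrefixSumEntry implements Comparable<PrefixSumEntry> {
    int period;              // time index of the value (its order)
    double value;            // the value itself
    double prefixSum;        // running total up to and including this entry
    double subsequenceTotal; // total over the current moving-average window
    double movingAverage;    // the moving average at this period

    public int compareTo(PrefixSumEntry other) {
      return Integer.compare(period, other.period);
    }
  }

  // initialization, add, merge, and serialize are sketched below
}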
Here we have the definition of our moving average class and the static inner class which serves as an entry in our table. What's important here are some of the variables we define for each entry in the table: the time-index or period of the value (its order), the value itself, the prefix sum, the subsequence total, and the moving average itself. Every entry in our table requires not just the current value to compute the moving average, but also the sum of entries in our moving average window. It's the pair of these two values which allows prefix sum methods to work their magic.
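A sketch of those routines (again, the field and method names are assumptions):

private int windowSize;
private ArrayList<PrefixSumEntry> entries;

public PrefixSumMovingAverage() {
}

// drop any existing table and remember the window size
public void reset(int window) {
  windowSize = window;
  entries = null;
}

// do we have a prefix-sum table to operate on yet?
public boolean isReady() {
  return entries != null;
}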
The above are simple initialization routines: a constructor, a method to reset the table, and a boolean method on whether or not the object has a prefix sum table on which to operate. From here, there are 3 important methods to examine: add, merge, and serialize. The first is intuitive: as we scan rows in Hive, we want to add them to our prefix sum table. The latter two are important because of partial aggregation.
We cannot say ahead of time where this UDAF will run, and partial aggregation may be required. That is, it's entirely possible that some values may run through the UDAF during a map task, but then be passed to a reduce task to be combined with other values. The serialize method will allow Hive to pass the partial results from the map side to the reduce side. The merge method allows reducers to combine the results of partial aggregations from the map tasks.
The first part of the add task is simple: we add the element to the list and update our table's prefix sums.
In the second half of the add function, we compute our moving averages based on the prefix sums. It's here you can see the hinge on which the algorithm swings: thisEntry.prefixSum - backEntry.prefixSum -- that offset between the current table entry and its nth predecessor makes the whole thing work.
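Putting the two halves together, a hedged sketch of add() (the exact source differs in details):

public void add(int period, double v) {
  if (entries == null) {
    entries = new ArrayList<PrefixSumEntry>();
  }
  PrefixSumEntry e = new PrefixSumEntry();
  e.period = period;
  e.value = v;
  entries.add(e);
  Collections.sort(entries);

  // first half: rebuild the running prefix sums
  double runningTotal = 0.0;
  for (PrefixSumEntry entry : entries) {
    runningTotal += entry.value;
    entry.prefixSum = runningTotal;
  }

  // second half: moving averages from prefix-sum differences
  for (int i = windowSize - 1; i < entries.size(); i++) {
    PrefixSumEntry thisEntry = entries.get(i);
    PrefixSumEntry backEntry = (i - windowSize >= 0) ? entries.get(i - windowSize) : null;
    double windowTotal = (backEntry == null)
        ? thisEntry.prefixSum
        : thisEntry.prefixSum - backEntry.prefixSum;
    thisEntry.subsequenceTotal = windowTotal;
    thisEntry.movingAverage = windowTotal / windowSize;
  }
}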
The serialize method needs to package the results of our algorithm to pass to another instance of the same algorithm, and it needs to do so in a type that Hadoop can serialize. In the case of a method like sum, this would be relatively simple: we would only need to pass the sum up to this point. However, because we cannot be certain whether this instance of our algorithm has seen all the values, or seen them in the correct order, we actually need to serialize the whole table. To do this, we create a list of DoubleWritables, pack the window size at its head, and then each period and value. This gives us a structure that's easy to unpack and merge with other lists with the same construction.
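A sketch of serialize() along those lines (DoubleWritable here is Hive's org.apache.hadoop.hive.serde2.io.DoubleWritable):

public ArrayList<DoubleWritable> serialize() {
  ArrayList<DoubleWritable> result = new ArrayList<DoubleWritable>();
  // window size at the head of the list...
  result.add(new DoubleWritable(windowSize));
  // ...followed by each (period, value) pair
  if (entries != null) {
    for (PrefixSumEntry entry : entries) {
      result.add(new DoubleWritable(entry.period));
      result.add(new DoubleWritable(entry.value));
    }
  }
  return result;
}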
Merging results is perhaps the most complicated thing we need to handle. First, we check the case in which there was no partial result passed -- just return and continue. Second, we check to see if this instance of PrefixSumMovingAverage already has a table. If it doesn't, we can simply unpack the serialized result and treat it as our window.
The third case is the non-trivial one: if this instance has a table and receives a serialized table, we must merge them together. Consider a Reduce task: as it receives outputs from multiple Map tasks, it needs to merge all of them together to form a larger table. Thus, merge will be called many times to add these results and reassemble a larger time series.
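A hedged sketch of merge(), covering the cases just described (the recompute() helper is an assumption, standing in for the sort-and-recompute logic discussed next):

public void merge(List<DoubleWritable> other) {
  // case 1: no partial result was passed -- nothing to do
  if (other == null || other.isEmpty()) {
    return;
  }

  // the serialized table carries the window size at its head
  windowSize = (int) other.get(0).get();

  // case 2: no table yet -- start one and simply adopt the unpacked values
  if (entries == null) {
    entries = new ArrayList<PrefixSumEntry>();
  }

  // case 3 (and the tail of case 2): unpack the (period, value) pairs and fold them in
  for (int i = 1; i + 1 < other.size(); i += 2) {
    PrefixSumEntry e = new PrefixSumEntry();
    e.period = (int) other.get(i).get();
    e.value = other.get(i + 1).get();
    entries.add(e);
  }

  // sort by period and recompute prefix sums and averages, just like add()
  Collections.sort(entries);
  recompute();
}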
This part should look familiar; it's just like the add method. Now that we have new entries in our table, we need to sort by period and recompute the moving averages. In fact, the rest of the merge method is exactly like the add method, so we might consider putting sorting and recomputing in a separate method.
Orchestrating Partial Aggregation
We've got a clever little algorithm for computing moving average in parallel, but Hive can't do anything with it unless we create a UDAF that understands how to use our algorithm. At this point, we need to start writing some real UDAF code. As before, we extend a generic class, in this case GenericUDAFEvaluator.
As in the case of a UDTF, we create ObjectInspectors to handle type checking. However, notice that we have inspectors for different states: PARTIAL1, PARTIAL2, COMPLETE, and FINAL. These correspond to the different states in which our UDAF may be executing. Since our serialized prefix sum table isn't the same input type as the values our add method takes, we need different type checking for each.
Here's the beginning of our overridden initialization function. We check the parameters for two modes, PARTIAL1 and COMPLETE. Here we assume that the arguments to our UDAF are the same as the user passes in a query: the period, the input, and the size of the window. If the UDAF instance is consuming the results of our partial aggregation, we need a different ObjectInspector; a hedged sketch of how init() branches on mode (the field names are assumptions):
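// sketch of the evaluator fragment; periodOI, inputOI, windowSizeOI are PrimitiveObjectInspector
// fields and loi is a StandardListObjectInspector field of the GenericUDAFEvaluator subclass
@Override
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
  super.init(m, parameters);
  if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
    // raw query arguments: period, input value, window size
    periodOI = (PrimitiveObjectInspector) parameters[0];
    inputOI = (PrimitiveObjectInspector) parameters[1];
    windowSizeOI = (PrimitiveObjectInspector) parameters[2];
  } else {
    // PARTIAL2 / FINAL: the single argument is the serialized prefix-sum table
    loi = (StandardListObjectInspector) parameters[0];
  }
  // (the method continues below with the output ObjectInspectors)
}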
Similar to the UDTF, we also need type checking on the output types -- but for both partial and full aggregation. In the case of partial aggregation, we're returning lists of DoubleWritables:
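(A hedged sketch of that branch, continuing init():)

if (m == Mode.PARTIAL1 || m == Mode.PARTIAL2) {
  // partial results travel as a list of doubles -- the serialized prefix-sum table
  return ObjectInspectorFactory.getStandardListObjectInspector(
      PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
}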
But in the case of FINAL or COMPLETE, we're dealing with the types that will be returned to the Hive user, so we need to return a different output. We're going to return a list of structs that contain the period, moving average, and residuals (since they're cheap to compute).
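A sketch of that final branch (the field names are assumptions based on the description above):

// FINAL / COMPLETE: a list of (period, moving_average, residual) structs for the user
ArrayList<String> fieldNames = new ArrayList<String>();
fieldNames.add("period");
fieldNames.add("moving_average");
fieldNames.add("residual");
ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
fieldOIs.add(PrimitiveObjectInspectorFactory.writableIntObjectInspector);
fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
return ObjectInspectorFactory.getStandardListObjectInspector(
    ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs));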
Next come methods to control what happens when a Map or Reduce task is finished with its data. In the case of partial aggregation, we need to serialize the data. In the case of full aggregation, we need to package the result for Hive users.
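Hedged sketches of terminatePartial() and terminate(); the MaState buffer class (introduced with the AggregationBuffer further down) and the getResult() helper are assumptions:

@Override
public Object terminatePartial(AggregationBuffer agg) throws HiveException {
  // partial aggregation: hand Hive the serialized prefix-sum table
  MaState myagg = (MaState) agg;
  return myagg.prefixSum.serialize();
}

@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
  // full aggregation: package the finished time series for the Hive user
  MaState myagg = (MaState) agg;
  return myagg.prefixSum.getResult();  // assumed helper returning the list of structs
}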
We also need to provide instruction on how Hive should merge the results of partial aggregation. Fortunately, we already handled this in our PrefixSumMovingAverage class, so we can just call that.
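A sketch of the evaluator's merge(), which just unpacks the partial result and delegates:

@Override
public void merge(AggregationBuffer agg, Object partial) throws HiveException {
  if (partial == null) {
    return;
  }
  MaState myagg = (MaState) agg;
  // loi is the list ObjectInspector set up in init() for partial results
  @SuppressWarnings("unchecked")
  List<DoubleWritable> partialTable = (List<DoubleWritable>) loi.getList(partial);
  myagg.prefixSum.merge(partialTable);
}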
Of course, merging and serializing isn't very useful unless the UDAF has logic for iterating over values. The iterate method handles this and -- as one would expect -- relies entirely on the PrefixSumMovingAverage class we created.
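A sketch of iterate() under the same assumptions (the query below passes a time string as the first argument; converting it into an integer period is omitted here):

@Override
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
  if (parameters[1] == null) {
    return;  // nothing to add for null values
  }
  MaState myagg = (MaState) agg;
  int period = PrimitiveObjectInspectorUtils.getInt(parameters[0], periodOI);
  double value = PrimitiveObjectInspectorUtils.getDouble(parameters[1], inputOI);
  int window = PrimitiveObjectInspectorUtils.getInt(parameters[2], windowSizeOI);
  if (!myagg.prefixSum.isReady()) {
    myagg.prefixSum.reset(window);
  }
  myagg.prefixSum.add(period, value);
}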
Aggregation Buffers: Connecting Algorithms with Execution
One might notice that the code for our UDAF references an object of type AggregationBuffer quite a lot. This is because the AggregationBuffer is the interface which allows us to connect our custom PrefixSumMovingAverage class to Hive's execution framework. While it doesn't constitute a great deal of code, it's glue that binds our logic to Hive's execution framework. We implement it as such:
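(A hedged sketch; the buffer class name MaState is an assumption.)

// the buffer simply wraps an instance of our algorithm class
static class MaState implements AggregationBuffer {
  PrefixSumMovingAverage prefixSum = new PrefixSumMovingAverage();
}

@Override
public AggregationBuffer getNewAggregationBuffer() throws HiveException {
  return new MaState();
}

@Override
public void reset(AggregationBuffer agg) throws HiveException {
  ((MaState) agg).prefixSum = new PrefixSumMovingAverage();
}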
Using the UDAF
The goal of a good UDAF is that, no matter how complicated it was for us to implement, it should be simple for our users. For all that code and parallel thinking, usage of the UDAF is very straightforward:
ADD JAR /mnt/shared/hive_udfs/dist/lib/moving_average_udf.jar;
CREATE TEMPORARY FUNCTION moving_avg AS 'com.oracle.hadoop.hive.ql.udf.generic.GenericUDAFMovingAverage';
#get the moving average for a single tail number
SELECT TailNum, moving_avg(timestring, delay, 4) FROM ts_example WHERE TailNum='N967CA' GROUP BY TailNum LIMIT 100;
Here we're applying the UDAF to get the moving average of arrival delay for a particular flight. It's a really simple query for all that work we did underneath. We can do a bit more and leverage Hive's ability to handle complex types as columns; here's a query which creates a table of timeseries as arrays.
#create a set of moving averages for every plane starting with N
#Note: this UDAF blows up unpleasantly in heap; there will be data volumes for which you need to throw
#excessive amounts of memory at the problem
CREATE TABLE moving_averages AS
SELECT TailNum, moving_avg(timestring, delay, 4) as timeseries FROM ts_example
WHERE TailNum LIKE 'N%' GROUP BY TailNum;
We've covered all manner of UDFs: from simple class extensions which can be written very easily, to very complicated UDAFs which require us to think about distributed execution and plan orchestration done by query engines. With any luck, the discussion has provided you with the confidence to go out and implement your own UDFs -- or at least pay some attention to the complexities of the ones in use every day.
Thursday Apr 04, 2013
By dan.mcclary on Apr 04, 2013
In our ongoing exploration of Hive UDFs, we've covered the basic row-wise UDF. Today we'll move to the UDTF, which generates multiple rows for every row processed. This UDF built its house from sticks: it's slightly more complicated than the basic UDF and allows us an opportunity to explore how Hive functions manage type checking.
We'll step through some of the more interesting pieces, but as before the full source is available on github here.
Our UDTF is going to produce pairwise combinations of elements in a comma-separated string. So, for a string column "Apples, Bananas, Carrots" we'll produce three rows:
- Apples, Bananas
- Apples, Carrots
- Bananas, Carrots
As with the UDF, the first few lines are a simple class extension with a decorator so that Hive can describe what the function does.
We also create an object of PrimitiveObjectInspector, which we'll use to ensure that the input is a string. Once this is done, we need to override methods for initialization, row processing, and cleanup.
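A hedged sketch of that skeleton (the class name matches the usage example further down; the description text is illustrative):

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;

@Description(name = "pairwise",
    value = "_FUNC_(basket) - emits pairwise combinations of the items in a comma-separated string")
public class PairwiseUDTF extends GenericUDTF {

  // used to validate and read the single string argument
  private PrimitiveObjectInspector stringOI = null;

  // initialize, process, and close are overridden below
}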
This UDTF is going to return an array of structs, so the initialize method needs to return a StructObjectInspector object. Note that the arguments arrive as an array of ObjectInspector objects. This allows us to handle arguments in a "normal" fashion but with the benefit of methods to broadly inspect type. We only allow a single argument -- the string column to be processed -- so we check the length of the array and validate that the sole element is both a primitive and a string.
The second half of the initialize method is more interesting:
Here we set up information about what the UDTF returns. We need this in place before we start processing rows; otherwise Hive can't correctly build execution plans before submitting jobs to MapReduce. The structures we're returning will be two strings per struct, which means we'll need ObjectInspector objects for both the values and the names of the fields. We create two lists, one of strings for the names, the other of ObjectInspector objects. We pack them manually and then use a factory to get the StructObjectInspector which defines the actual return value.
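A hedged sketch of the whole initialize method (the output field names are assumptions; the query below aliases them anyway, and the usual serde2.objectinspector imports are assumed):

@Override
public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
  // exactly one argument, and it must be a string primitive
  if (args.length != 1) {
    throw new UDFArgumentException("pairwise() takes exactly one argument");
  }
  if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE
      || ((PrimitiveObjectInspector) args[0]).getPrimitiveCategory()
          != PrimitiveObjectInspector.PrimitiveCategory.STRING) {
    throw new UDFArgumentException("pairwise() takes a string as its argument");
  }
  stringOI = (PrimitiveObjectInspector) args[0];

  // the more interesting half: declare the two string fields each output struct carries
  List<String> fieldNames = new ArrayList<String>(2);
  fieldNames.add("memberA");
  fieldNames.add("memberB");
  List<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>(2);
  fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
  return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
}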
Now we're ready to actually do some processing, so we override the process method.
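(A hedged sketch; the splitting and trimming details are assumptions.)

@Override
public void process(Object[] record) throws HiveException {
  // use the ObjectInspector to get a real String out of the Hive object
  final String document = (String) stringOI.getPrimitiveJavaObject(record[0]);
  if (document == null) {
    return;  // bail out early on null column values
  }

  String[] members = document.split(",");
  java.util.Arrays.sort(members);

  // textbook nested loop for pairwise expansion; forward() emits each struct
  for (int i = 0; i < members.length - 1; i++) {
    for (int j = i + 1; j < members.length; j++) {
      forward(new Object[] { members[i].trim(), members[j].trim() });
    }
  }
}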
This is simple pairwise expansion, so the logic isn't anything more than a nested for-loop. There are, though, some interesting things to note. First, to actually get a string object to operate on, we have to use an ObjectInspector and some typecasting. This allows us to bail out early if the column value is null. Once we have the string, splitting, sorting, and looping is textbook stuff.
The last notable piece is that the process method does not return anything. Instead, we call forward to emit our newly created structs. For those used to database internals, this follows the producer-consumer model of most RDBMSs. For those used to MapReduce semantics, this is equivalent to calling write on the Context object.
If there were any cleanup to do, we'd take care of it here. But this is simple emission, so our override doesn't need to do anything.
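(Sketched, for completeness:)

@Override
public void close() throws HiveException {
  // nothing to clean up: every pair has already been forwarded
}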
Using the UDTF
Once we've built our UDTF, we can access it via Hive by adding the jar and assigning it to a temporary function. However, mixing the results of a UDTF with other columns from the base table requires that we use a LATERAL VIEW.
#Add the Jar
add jar /mnt/shared/market_basket_example/pairwise.jar;
#Create a function
CREATE temporary function pairwise AS 'com.oracle.hive.udtf.PairwiseUDTF';
# view the pairwise expansion output
SELECT m1, m2, COUNT(*) FROM market_basket
LATERAL VIEW pairwise(basket) pwise AS m1,m2 GROUP BY m1,m2;
By dan.mcclary on Apr 04, 2013
In our ongoing series of posts explaining the in's and out's of Hive User Defined Functions, we're starting with the simplest case. Of the three little UDFs, today's entry built a straw house: simple, easy to put together, but limited in applicability. We'll walk through important parts of the code, but you can grab the whole source from github here.
The first few lines of interest are very straightforward:
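(A hedged sketch; the class name and the description text are assumptions, and the full source is on github.)

import java.util.ArrayDeque;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;

@Description(name = "moving_avg",
    value = "_FUNC_(x, n) - returns the moving average of x over the last n rows")
@UDFType(deterministic = false, stateful = true)
public class UDFSimpleMovingAverage extends UDF {
  // fields, constructor, and evaluate are sketched below
}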
We're extending the UDF class with some decoration. The decoration is important for usability and functionality. The description decorator allows us to give Hive some information to show users about how to use our UDF and what its method signature will be. The UDFType decoration tells Hive what sort of behavior to expect from our function.
A deterministic UDF will always return the same output given a particular input. A square-root computing UDF will always return the same square root for 4, so we can say it is deterministic; a call to get the system time would not be. The stateful annotation of the UDFType decoration is relatively new to Hive (e.g., CDH4 and above). The stateful directive allows Hive to keep some static variables available across rows. The simplest example of this is a "row-sequence," which maintains a static counter which increments with each row processed.
Since square-root and row-counting aren't terribly interesting, we'll use the stateful annotation to build a simple moving average function. We'll return to the notion of a moving average later when we build a UDAF, so as to compare the two approaches.
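(Sketched; field names are assumptions.)

private final DoubleWritable result = new DoubleWritable();  // MapReduce-serializable result
private ArrayDeque<Double> window;                           // the sliding window
private int windowSize;                                      // how many values it holds

public UDFSimpleMovingAverage() {
}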
The above code is basic initialization. We make a double in which to hold the result, but it needs to be of class DoubleWritable so that MapReduce can properly serialize the data. We use a deque to hold our sliding window, and we need to keep track of the window's size. Finally, we implement a constructor for the UDF class.
Here's the meat of the class: the evaluate method. This method will be called on each row by the map tasks. For any given row, we can't say whether or not our sliding window exists, so we initialize it if it's null.
Here we deal with the deque and compute the sum of the window's elements. Deques are essentially double-ended queues, so they make excellent sliding windows. If the window is full, we pop the oldest element and add the current value.
Computing the moving average without weighting is simply dividing the sum of our window by its size. We then set that value in our Writable variable and return it. The value is then emitted as part of the map task executing the UDF function.
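Putting those pieces together, a hedged sketch of evaluate() (the parameter types are assumptions):

public DoubleWritable evaluate(DoubleWritable value, IntWritable size) {
  if (value == null) {
    return null;
  }
  // initialize the static window on first use
  if (window == null) {
    window = new ArrayDeque<Double>();
    windowSize = size.get();
  }

  // if the window is full, pop the oldest element, then add the current value
  if (window.size() == windowSize) {
    window.removeFirst();
  }
  window.addLast(value.get());

  // unweighted moving average: sum of the window divided by its size
  double sum = 0.0;
  for (double d : window) {
    sum += d;
  }
  result.set(sum / window.size());
  return result;
}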
The stateful annotation made it simple for us to compute a moving average since we could keep the deque static. However, how would we compute a moving average if there was no notion of state between Hadoop tasks? At the end of the series we'll examine a UDAF that does this, but the algorithm ends up being much different. In the meantime, I challenge the reader to think about what sort of approach is needed to compute the window.
Tuesday Apr 02, 2013
By dan.mcclary on Apr 02, 2013
User-defined Functions (UDFs) have a long history of usefulness in SQL-derived languages. While query languages can be rich in their expressiveness, there's just no way they can anticipate all the things a developer wants to do. Thus, the custom UDF has become commonplace in our data manipulation toolbox.
Apache Hive is no different in this respect from other SQL-like languages. Hive allows extensibility via both Hadoop Streaming and compiled Java. However, largely because of the underlying MapReduce paradigm, not all Hive UDFs are created equal. Some UDFs are intended for "map-side" execution, while others are portable and can be run on the "reduce-side." Moreover, UDF behavior via streaming requires that queries be formatted so as to direct script execution where we desire it.
The intricacies of where and how a UDF executes may seem like minutiae, but we would be disappointed if the time spent coding a cumulative-sum UDF produced a function that only executed on single rows. To that end, I'm going to spend the rest of the week diving into the three primary types of Java-based UDFs in Hive. You can find all of the sample code discussed here.
The Three Little UDFs
Hive provides three classes of UDFs that most users are interested in: UDFs, UDTFs, and UDAFs. Broken down simply, the three classes can be explained as such:
- UDFs -- User Defined Functions; these operate row-wise, generally during map execution. They're the simplest UDFs to write, but constrained in their functionality.
- UDTFs -- User Defined Table-Generating Functions; these also execute row-wise, but they produce multiple rows of output (i.e., they generate a table). The most common example of this is Hive's explode function.
- UDAFs -- User Defined Aggregating Functions; these can execute on either the map-side or the reduce-side and are far more flexible than UDFs. The challenge, however, is that in writing UDAFs we have to think not just about what to do with a single row or a group of rows; we also have to consider partial aggregation and serialization between map and reduce processes.
Tuesday Mar 05, 2013
By Jean-Pierre Dijcks-Oracle on Mar 05, 2013
About a year ago we did a comparison (with an update here) of a build-your-own Hadoop cluster versus the Big Data Appliance, where we focused purely on the hardware and software cost. We thought it could use an update, but luckily an analyst firm did one for us, and this time it covers not just the hardware/software costs but also ventures much further into the other costs involved.
Some highlights from the report:
- Oracle Big Data Appliance ($450k) is 39% less costly than "build your own" ($733k)
- OBDA reduces time-to-market by 33% vs "build"
But, the report is not just about those numbers, it covers a number of very interesting things like 3 Hadoop Myths, the importance of big data in the near future and the priority customers give to improving their analytics footprint.
Enjoy the read!
Wednesday Feb 20, 2013
By Jean-Pierre Dijcks-Oracle on Feb 20, 2013
Look no further: Infosys today announced its Infosys BigDataEdge developer platform to help you drive value from your big data stack.
By empowering business users to rapidly develop insights from vast amounts of structured and unstructured data, better business decisions can be made in near real-time. With Infosys BigDataEdge, enterprises can reduce the time taken to extract information by up to 40 percent and generate insights up to eight times faster.
Thursday Feb 07, 2013
By dan.mcclary on Feb 07, 2013
Contrary to my habit of posting How-To articles, today I'm not going to tell you how to do anything. Instead, I'm going to ask you to go do something. What I want you to do is check out the series of video tutorials I'm doing around pieces of the Hadoop ecosystem and Big Data analytics.
The first entry is on YouTube and in the Big Data Playlist on Oracle Media Network. It gives a brief overview of setting up Flume flows into HDFS. Of course, if you're looking for a more detailed How-To, just check the archives -- I covered it a while back.
Some of the topics I'll cover in upcoming months include:
- Hive User Defined Functions
- Market Basket Analysis with Hive, R and Data Miner
- Customer Segmentation and Clustering in Hadoop
- Classification and Targeting using Random Forests