Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Sizing for data volume or performance or both?

Jean-Pierre Dijcks
Master Product Manager

Before you start reading this post please understand the following:

  • General Hadoop capacity planning guidelines and principles are here. I’m not doubting, replicating or replacing them.
  • I’m going to use our (UPDATED as of Dec 2012)Big Data Appliance as the baseline set of components to discuss capacity. That is because my brain does not hold theoretical hardware configurations.
  • This is (a bit of) a theoretical post to make the point that you worry about both (just given away the conclusion!) but I’m not writing the definitive guide to sizing your cluster. I just want to get everyone to think of Hadoop as processing AND storage. Not either or!

Now that you are warned, read on…

Imagine you want to store ~50TB (sounded like a nice round number) of data on a Hadoop cluster without worrying about any data growth for now (I told you this was theoretical). I’ll leave the default replication for Hadoop at 3 and simple math now dictates that I need 3 * 50TB = 150TB of Hadoop storage.

I also need space to do my MapReduce work (and this is the same for Pig and Hive which generate MapReduce) so your system will be even bigger to ensure Hadoop can write temporary files, shuffle date etc.

Now, that is fabulous, so I need to talk to my lovely Oracle Rep and buy a Big Data Appliance which would easily hold the above mentioned 50TB with triple replication. It actually holds 150TB (to make the math easy for me) or so of user data, and you will instantly say that the BDA is way to big!

Ah, but how fast do you want to process data? A Big Data Appliance has 18 nodes, each node has 12 cores to do the work for you. MapReduce is using processes called mappers and reducers (really!) to do the actual work.

Let’s assume that we are allowing Hadoop to spin up 15 mappers per node and 10 reducers per node. Let’s further assume we are going full bore and have every slot allocated to the current and only job’s mappers and reducers (they do not run together I know, theoretical exercise – remember?).

Because you decide the Big Data Appliance was way to big, you have bought 8 equivalent nodes to fit your data. Two of these run your Name Node, your Jobtracker and secondary Name Node (and you should actually have three nodes of all this, but I’m generous and say we run Jobtracker on Secondary Name Node). You have however 6 nodes for the data nodes based on your capacity based sizing.

That system you just bought based on storage will give us 6 * 15 = 90 mappers and 6 * 10 = 60 reducers working on my workload (the 2 other nodes do not run data nodes and do not run mappers and reducers).

Now let’s assume that I finish my job in N minutes on my lovely 8 node cluster by leveraging the full set of workers, and assume that my business users want to refresh the state of the world every N/2 minutes (it always has to go faster), then the assumption would be to simply get 2 * the number of nodes in my original cluster assuming linear scalability… The assumption is reasonable by the way for a lot of workloads, certainly for the ones in social, search and other data patterns that show little data skew because of their overall data size.

A Big Data Appliance gives us 15 * 15 = 225 mappers and 15 * 10 = 150 reducers working on my 50TB of user data… providing a 2.5x speed up on my data set.

Just another reference point on this, a Terasort of 100GB is run on a 20 node cluster with a total disk capacity of 80TB. Now that is of course a little too much, but you will see the point of not worrying too much about “that big system” and think processing power rather than storage.


You will need to worry about the processing requirements and you will need to understand the characteristics of the machine and the data. You should not size a system, or discard something as too big right away by just thinking about your raw data size. You should really, really consider Hadoop to be a system that scales processing and data storage together and use the benefits of the scale-out to balance data size with runtimes.

PS. Yes, I completely ignored those fabulous compression algorithms… Compression can certainly play a role here but I’ve left it out for now. Mostly because it is extremely hard (at least for my brain) to figure out an average compression rate and because you may decide to only compress older data, and compression costs CPU, but allows faster scan speeds and more of this fun stuff…

Join the discussion

Comments ( 2 )
  • guest Monday, February 27, 2012


    In your topic, you have mentioned the below:

    "A Big Data Appliance gives us 15 * 15 = 225 mappers and 15 * 10 = 150 reducers working on my 50TB of user data… providing a 2.5x speed up on my data set"

    I have a question, where "15 * 15 = 225" coming from? I know the second "15" coming from your assumption "spin up 15 mappers per node". could you explain the first 15 in detail?



  • Jean-Pierre Dijcks Monday, February 27, 2012

    The first 15 comes from the assumption that we are just using the 15 non-master nodes to run mappers and reducers on. Those nodes run task trackers, so I must took that number.

    In actual fact, you can use all 18 nodes (the master do run data nodes, just don't run task trackers).

    Choosing to use just 15 nodes is the "worst case scenario" on the appliance in terms of processing capacity.

Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.