Before you start reading this post please understand the following:
Now that you are warned, read on…
Imagine you want to store ~50TB (sounded like a nice round number) of data on a Hadoop cluster without worrying about any data growth for now (I told you this was theoretical). I’ll leave the default replication for Hadoop at 3 and simple math now dictates that I need 3 * 50TB = 150TB of Hadoop storage.
I also need space to do my MapReduce work (and this is the same for Pig and Hive which generate MapReduce) so your system will be even bigger to ensure Hadoop can write temporary files, shuffle date etc.
Now, that is fabulous, so I need to talk to my lovely Oracle Rep and buy a Big Data Appliance which would easily hold the above mentioned 50TB with triple replication. It actually holds 150TB (to make the math easy for me) or so of user data, and you will instantly say that the BDA is way to big!
Ah, but how fast do you want to process data? A Big Data Appliance has 18 nodes, each node has 12 cores to do the work for you. MapReduce is using processes called mappers and reducers (really!) to do the actual work.
Let’s assume that we are allowing Hadoop to spin up 15 mappers per node and 10 reducers per node. Let’s further assume we are going full bore and have every slot allocated to the current and only job’s mappers and reducers (they do not run together I know, theoretical exercise – remember?).
Because you decide the Big Data Appliance was way to big, you have bought 8 equivalent nodes to fit your data. Two of these run your Name Node, your Jobtracker and secondary Name Node (and you should actually have three nodes of all this, but I’m generous and say we run Jobtracker on Secondary Name Node). You have however 6 nodes for the data nodes based on your capacity based sizing.
That system you just bought based on storage will give us 6 * 15 = 90 mappers and 6 * 10 = 60 reducers working on my workload (the 2 other nodes do not run data nodes and do not run mappers and reducers).
Now let’s assume that I finish my job in N minutes on my lovely 8 node cluster by leveraging the full set of workers, and assume that my business users want to refresh the state of the world every N/2 minutes (it always has to go faster), then the assumption would be to simply get 2 * the number of nodes in my original cluster assuming linear scalability… The assumption is reasonable by the way for a lot of workloads, certainly for the ones in social, search and other data patterns that show little data skew because of their overall data size.
A Big Data Appliance gives us 15 * 15 = 225 mappers and 15 * 10 = 150 reducers working on my 50TB of user data… providing a 2.5x speed up on my data set.
Just another reference point on this, a Terasort of 100GB is run on a 20 node cluster with a total disk capacity of 80TB. Now that is of course a little too much, but you will see the point of not worrying too much about “that big system” and think processing power rather than storage.
You will need to worry about the processing requirements and you will need to understand the characteristics of the machine and the data. You should not size a system, or discard something as too big right away by just thinking about your raw data size. You should really, really consider Hadoop to be a system that scales processing and data storage together and use the benefits of the scale-out to balance data size with runtimes.
PS. Yes, I completely ignored those fabulous compression algorithms… Compression can certainly play a role here but I’ve left it out for now. Mostly because it is extremely hard (at least for my brain) to figure out an average compression rate and because you may decide to only compress older data, and compression costs CPU, but allows faster scan speeds and more of this fun stuff…