Cloud Computing: It's raining data... hallelujah!

The information technology industry has never seen the likes of the data tsunami, or more appropriately, the perpetual data hurricane that is raining down on us. Many cloud pundits talk about Infrastructure as a Service, Platform as a Service, and Software as a Service, but very few discuss a critical aspect of the cloud: big data. Tim O'Reilly calls it the network effect in data, and Amazon recently gave its nod to big data by putting public data sets of census, genome, economics, and 3-D chemical data online. Google has been indexing public books (Google Books) for quite a while now.

So how big is this data, and how fast is it growing? I did some research, and the results are astounding!

(For reference: a petabyte (PB) is 1,000 terabytes (TB), where 1 TB = 1,000 GB.)

DATA SET                                     SIZE (plus Compound Annual Growth Rate, where known)

Wikipedia                                    10 GB (100% CAGR)
Merck bio research DB                        1.5 TB/quarter
Wal-Mart transaction DB                      600 TB
UPMC hospitals imaging data                  500 TB/year
Typical oil company data per oil field       350 TB
One day of instant messaging in 2002         750 GB
World Wide Web                               1 PB
Internet Archive                             1 PB+
TeraShake earthquake model of the LA Basin   1 PB
MIT BabyTalk speech experiment               1.5 PB
Estimated online RAM at Google               8 PB+
Large Hadron Collider                        15 PB per run (300 exabytes per year!)
Annual email traffic (excluding spam)        300 PB
Personal digital photos                      1,000 PB+ (100% CAGR)
Human genomics                               7,000 PB (1 GB per person / 200 PB+ captured) (200% CAGR)
Total digital data created in 2007 (IDC)     281,000 PB (281 exabytes), roughly 60% CAGR implied by the 2011 projection below

There are some interesting data points to dwell upon here:

a. The TOTAL size of the documents on the World Wide Web is only around 1 PB, compared to personal digital photos, which are roughly 1,000 times the size of the WWW, or human genomics, estimated at roughly 7,000 times the size of the WWW.

b. Many of the largest data sets are created not by the social web but by large institutions. High-performance computing involving audio/video analysis and simulation produces data sets that dwarf the others.

c. The IDC report quoted above estimates that by the year 2011 there will be 1,773 exabytes of digital data in the world! The report contains many jewels of information, one being that only 5% of this data is generated by the enterprise and only about 35% emanates from workers overall (from their workstations); the rest is created by consumers themselves, or by workers in enterprises capturing personal information about their customers. In fact, if you divide the data evenly across the world population, each person accounts for about 45 GB of data. I know I have probably created much more than that over the last year alone.
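To make that back-of-the-envelope arithmetic explicit, here is a minimal Python sketch. The world population figure (roughly 6.6 billion in 2007) is my assumption, not a number from the IDC report, and the implied growth rate is derived purely from the 281 EB (2007) and 1,773 EB (2011) figures above.

# Back-of-the-envelope arithmetic on the IDC figures quoted above.
# Assumption: world population of roughly 6.6 billion in 2007.

total_2007_eb = 281        # IDC estimate for 2007
total_2011_eb = 1_773      # IDC projection for 2011
world_population = 6.6e9   # assumed, not from the report

# Per-person share of the 2007 digital universe (1 EB = 1e9 GB).
per_person_gb = total_2007_eb * 1e9 / world_population
print(f"per person: ~{per_person_gb:.0f} GB")   # ~43 GB, close to the ~45 GB above

# Compound annual growth rate implied by the 2007 -> 2011 projection.
years = 2011 - 2007
cagr = (total_2011_eb / total_2007_eb) ** (1 / years) - 1
print(f"implied CAGR: ~{cagr:.0%}")             # roughly 58-60% per year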

This data points to some interesting trends:

1. The storage market will continue to grow by double digits over the next 5 years. Systems that are bigger, better, faster, more cost-effective, and easier to manage and operate will do better. (Note: to store 280 exabytes of data, it would take roughly 560,000 Sun Storage 7410 servers; see the quick sizing sketch after this list.)

2. Applications that can mine this data and expose meaningful information will become widely popular, as will providers that offer data-mining services on top of these data sets. Data warehousing technologies and new distributed analytics models (like Hadoop) will thrive; a tiny flavor of that model is sketched after this list.

3. Security of this data will remain relevant. With tons of personally identifiable information (PII) in play, privacy and security regulations (think Sarbanes-Oxley, HIPAA, GLBA, etc.) will continue to force enterprises and guardians of this data to address security at all levels.
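A quick sanity check on the sizing note in point 1: the numbers there imply a per-server capacity of about 500 TB, which is derived purely from the figures quoted above, not from any official Sun Storage 7410 specification.

# Sizing note from point 1: what per-server capacity does
# "280 exabytes on 560,000 Sun Storage 7410 servers" imply?
TOTAL_EB = 280
SERVERS = 560_000
tb_per_server = TOTAL_EB * 1_000_000 / SERVERS   # 1 EB = 1,000,000 TB
print(f"~{tb_per_server:.0f} TB per server")     # ~500 TB each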
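And for point 2, a tiny flavor of the Hadoop model: a word-count mapper and reducer in the Hadoop Streaming style, where plain scripts are plugged into the distributed map/shuffle/reduce pipeline. This is a generic illustrative sketch (the file name wordcount.py and the map/reduce switch are my own), not code from any of the projects mentioned above.

#!/usr/bin/env python3
# wordcount.py -- minimal Hadoop Streaming-style word count.
# Hadoop Streaming feeds records to the mapper on stdin and hands the
# mapper's output to the reducer sorted by key.
import sys

def mapper():
    # Emit "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

You can simulate the whole pipeline locally with:
cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce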

I do see a bright future for companies that are in the data management, retention, and safeguarding business, as well as those that can corral this data hurricane and offer meaningful analysis and services on top of it.

Thoughts/comments? Please fire away!
