is a convergence of High Performance Computing architectures, Web
2.0 data models, and Enterprise computing data scale.
Cloud Analytics should leverage Sun's compelling storage.
Hadoop Distributed File System
is scalable with high availability and
high performance. HDFS runs on a cluster of at least 3 nodes (1
Master Node and 2 Slave Nodes). Data blocks are 64 MB (default) or
128 MB, and every block is replicated 3 times (default). The
NameNode holds the metadata of the file system; files are split
into blocks and distributed across the DataNodes.
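As a minimal sketch of how an application talks to HDFS, assuming a NameNode reachable at hdfs://master:9000 (a hypothetical address), the Hadoop FileSystem Java API can write and read a file whose blocks HDFS replicates transparently:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the NameNode (hypothetical host and port).
        conf.set("fs.default.name", "hdfs://master:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file: HDFS splits it into blocks, replicates each block
        // 3 times (default), and distributes them across the DataNodes.
        Path file = new Path("/demo/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("Hello, HDFS");
        out.close();

        // Read the file back through the same API.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}
```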
MapReduce
is data processing software designed to store and stream extremely
large datasets in batch; it is not intended for real-time querying
and does not support random access. The JobTracker schedules and
manages jobs, while a TaskTracker executes the individual map() and
reduce() tasks on each cluster node.
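To illustrate the map()/reduce() model, here is the classic word-count job as a sketch against the Hadoop 0.20-era Java API; input and output paths are passed on the command line:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map(): emit (word, 1) for every word in a line of input.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // reduce(): sum the counts collected for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```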
HBase
is a distributed storage system, column-oriented and
multi-dimensional. This software is very interesting for managing
very large structured data for the semantic web. HBase can manage
billions of rows, millions of columns, thousands of versions and
petabytes across thousands of servers.
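A minimal client sketch, assuming a table named webtable with a column family content has already been created (both names hypothetical), using an early version of the HBase Java client:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings.
        HTable table = new HTable(new HBaseConfiguration(), "webtable");

        // Insert one cell: row "row1", column family "content", qualifier "html".
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Random access by row key, which plain HDFS does not offer.
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(value));
        table.close();
    }
}
```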
Hive
is a system for managing and querying structured data, built on
top of Hadoop, with SQL as its data warehousing tool. It does not
support real-time querying.
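A sketch of querying Hive from Java over JDBC, assuming a Hive server running on the master node at its default port 10000 and a table named weblogs (both assumptions, not from the original setup):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (Hive 0.x-era class name).
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive://master:10000/default", "", "");
        Statement stmt = con.createStatement();

        // Hive compiles this SQL-like query into MapReduce jobs, so it runs
        // in batch: useful for warehousing, not for real-time lookups.
        ResultSet rs = stmt.executeQuery(
            "SELECT page, COUNT(*) FROM weblogs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}
```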
- The NameNode is a single point of failure (SPOF). To mitigate
this, the transaction log is stored in multiple directories, each
on the local file system or on a remote file system (NFS/CIFS);
see the configuration sketch after this list.
- The Secondary NameNode copies the FsImage and the transaction
log from the NameNode to a temporary directory.
- To increase the high availability of the Hadoop cluster, it is
possible to interconnect 2 master node servers (active/passive)
with Solaris Cluster.
- For the security of the Hadoop cluster, you should encrypt the
data to safeguard all transactions on the web.
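A minimal sketch of the redundancy settings mentioned above, using the Hadoop 0.20-era property names; the paths are hypothetical, and in practice these values would normally live in hdfs-site.xml rather than be set in code:

```java
import org.apache.hadoop.conf.Configuration;

public class NameNodeRedundancy {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Write the file system image and transaction log to two directories:
        // one on local disk and one on an NFS mount (hypothetical paths),
        // so a copy survives the loss of the NameNode's local storage.
        conf.set("dfs.name.dir", "/var/hadoop/name,/mnt/nfs/hadoop/name");
        // Where the Secondary NameNode stores its checkpoint copies.
        conf.set("fs.checkpoint.dir", "/var/hadoop/namesecondary");
        return conf;
    }
}
```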
Proof Of Concept
- Create an architecture with a minimum of three nodes and test
the performance and feasibility of Hadoop.
- To test Hadoop rapidly, you can use the OpenSolaris
Hadoop Live CD.
- The OpenSolaris LiveHadoop setup installs three virtual nodes.
- Once OpenSolaris boots, two virtual servers are created using
Zones.
- Zones are very lightweight, minimizing virtualization overheads
and leaving more memory for Hadoop.
- The "Global" zone hosts the NameNode and JobTracker, and two
"Local" zones each host a DataNode and TaskTracker.
- Interface your application with HDFS and implement the "Save as
Cloud..." and "Open from Cloud..." functionalities, as sketched
below. Use the Hadoop Java API for your development.
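A minimal sketch of those two functionalities, assuming a Configuration prepared as in the earlier HDFS example (class and path names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloudFileActions {
    private final FileSystem fs;

    public CloudFileActions(Configuration conf) throws Exception {
        this.fs = FileSystem.get(conf);
    }

    // "Save as Cloud...": copy a local document into HDFS.
    public void saveAsCloud(String localFile, String cloudPath) throws Exception {
        fs.copyFromLocalFile(new Path(localFile), new Path(cloudPath));
    }

    // "Open from Cloud...": copy a document from HDFS back to local disk.
    public void openFromCloud(String cloudPath, String localFile) throws Exception {
        fs.copyToLocalFile(new Path(cloudPath), new Path(localFile));
    }
}
```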
Service and Support
- HDFS, MapReduce, HBase, and Hive are Open Source software and
are supported on OpenSolaris.
- In the US, it is possible to contact Cloudera for bringing big
data to the enterprise with Hadoop.
- Who supports Hadoop across the globe? http://wiki.apache.org/hadoop/Support
Sizing for HA Cluster
- Business Data Volume = Customer needs
- No RAID factor, no HBA port
- 2 quad-core CPUs for all servers
- 2 system hard disks
- Number of replication blocks = 3
- Block size = 128 MB
- Temporary Space = 25% of the total hard disk
- Raw Data Volume = 1.25 * (Business Data Volume * Number of
replication blocks)
- Number of NameNode Servers = 2
- Number of DataNode Servers = Raw Data Volume / Server Capacity
- NameNode RAM = 64 GB
- DataNode RAM = 32 GB minimum
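As a worked example of this sizing (illustrative numbers only): for a
Business Data Volume of 100 TB, the Raw Data Volume is
1.25 * (100 TB * 3) = 375 TB. With DataNode servers of, say, 24 TB raw
capacity each, that comes to 375 / 24, rounded up to 16 DataNode
servers, plus the 2 NameNode servers.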