By Philippe Julio on Jul 19, 2009
Cloud Analytics should leverage Sun's compelling storage architecture.
Hadoop Distributed File System (HDFS) is scalable with high availability and high performance. HDFS runs on a cluster of at least 3 nodes (1 master node and 2 slave nodes). Data blocks are 64 MB by default (128 MB is also common); every block is replicated 3 times by default. The NameNode holds the metadata of the file system; files are divided into blocks and distributed across the DataNodes.
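To illustrate the figures above, here is a small sketch (plain Python, not the HDFS API; the file size is an assumed example) of how a file is divided into blocks and how many block replicas end up stored on the DataNodes:

```python
import math

def block_replicas(file_size_mb, block_size_mb=64, replication=3):
    """Return (number of blocks, total stored replicas) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication

# A 200 MB file with the 64 MB default block size:
blocks, replicas = block_replicas(200)
print(blocks, replicas)  # 4 blocks, 12 replicas spread across the DataNodes
```

With the default replication factor of 3, every megabyte of business data occupies three megabytes of raw cluster storage, which is why the sizing rules at the end of this post multiply by the number of replicas.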
MapReduce is a data processing framework designed to process extremely large datasets in batch; it is not intended for realtime querying and does not support random access. The JobTracker schedules and manages jobs; TaskTrackers execute the individual map() and reduce() tasks on each cluster node.
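The map/shuffle/reduce phases can be sketched in a few lines of plain Python (a toy word count, not Hadoop's actual runtime; the input lines are made up):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # map(): emit one (key, value) pair per word
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce(): sum all values emitted for one key
    return word, sum(counts)

lines = ["big data on hadoop", "hadoop on opensolaris"]

# shuffle/sort: group intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in map_phase(line))
result = dict(reduce_phase(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=itemgetter(0)))
print(result)  # {'big': 1, 'data': 1, 'hadoop': 2, 'on': 2, 'opensolaris': 1}
```

In a real job, the map and reduce tasks run in parallel on the TaskTrackers and the shuffle moves intermediate pairs between nodes.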
HBase is a distributed, column-oriented, multi-dimensional storage system. It is very interesting for managing very large structured datasets, for example for the semantic web. HBase can manage billions of rows, millions of columns, thousands of versions, and petabytes of data across thousands of servers, with realtime querying.
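The "multi-dimensional" part means each cell is addressed by row key, column family, column qualifier, and timestamp. A minimal sketch of that data model in plain Python (nested dicts standing in for HBase; names and values are made up):

```python
from collections import defaultdict

# row key -> column family -> qualifier -> timestamp -> value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(row, family, qualifier, value, ts):
    table[row][family][qualifier][ts] = value

def get_latest(row, family, qualifier):
    versions = table[row][family][qualifier]
    return versions[max(versions)]  # most recent timestamp wins

put("row1", "info", "title", "draft", ts=1)
put("row1", "info", "title", "final", ts=2)
print(get_latest("row1", "info", "title"))  # final
```

Keeping multiple timestamped versions per cell is what gives HBase its "thousands of versions" dimension.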
Hive is a data warehousing system for managing and querying structured data, built on top of Hadoop, with an SQL-like query language. No realtime querying.
- The NameNode is a single point of failure (SPOF); to mitigate this, the transaction log can be stored in multiple directories, one on the local file system and one on a remote file system (NFS/CIFS).
- The Secondary NameNode periodically copies the FsImage and transaction log from the NameNode into a temporary directory and merges them.
- To increase the availability of the Hadoop cluster, the two master node servers can be interconnected as an active/passive pair with Solaris Cluster.
- For the security of the Hadoop cluster, you should encrypt the data to safeguard all transactions over the web.
Proof Of Concept
- Build an architecture with a minimum of three nodes and test the performance and feasibility of Hadoop.
- To test Hadoop quickly, you can use the OpenSolaris Hadoop Live CD
- The OpenSolaris Live Hadoop setup installs a three-node virtual Hadoop cluster
- Once OpenSolaris boots, two virtual servers are created using Zones
- Zones are very lightweight, minimizing virtualization overheads and leaving more memory for your application
- The "Global" zone hosts the NameNode and JobTracker, and two "Local" zones each host a DataNode and TaskTracker
- Interface your application with HDFS and implement the "Save as Cloud..." and "Open from Cloud..." functionality. Use the Hadoop Java API for your development.
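For prototyping before wiring up the Java API, the same "Save as Cloud" / "Open from Cloud" actions can be driven through the standard `hadoop fs` shell. A Python sketch (the function names, file names, and HDFS paths are hypothetical; `hadoop fs -put` and `hadoop fs -get` are the real CLI commands):

```python
import subprocess

def save_to_cloud_cmd(local_path, hdfs_path):
    # "Save as Cloud...": copy a local file into HDFS via the hadoop CLI
    return ["hadoop", "fs", "-put", local_path, hdfs_path]

def open_from_cloud_cmd(hdfs_path, local_path):
    # "Open from Cloud...": copy a file out of HDFS to the local disk
    return ["hadoop", "fs", "-get", hdfs_path, local_path]

cmd = save_to_cloud_cmd("report.odt", "/user/demo/report.odt")
print(" ".join(cmd))
# On a node with Hadoop installed, you would then execute it:
# subprocess.run(cmd, check=True)
```

For a production integration, the Hadoop Java API (`FileSystem.create()` / `FileSystem.open()`) avoids spawning a JVM per operation.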
Service and Support
- HDFS, MapReduce, HBase and Hive are Open Source software and supported on OpenSolaris.
- In the US, it is possible to contact Cloudera for bringing big data to the enterprise with Hadoop.
- Who supports Hadoop across the globe? http://wiki.apache.org/hadoop/Support
Sizing for HA Cluster
- Business Data Volume = Customer needs
- No RAID factor and no HBA ports (HDFS replication provides the redundancy)
- 2 CPU Quad-core for all servers
- 2 System hard disks
- Number of replication blocks = 3
- Block size = 128 MB
- Temporary Space = 25% of the total hard disk
- Raw Data Volume = 1.25 * (Business Data Volume * Number of replication blocks)
- Number of NameNode Servers = 2
- Number of DataNode Servers = Raw Data Volume / Server Storage Capacity
- NameNode RAM = 64 GB
- DataNode RAM = 32 GB minimum
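The sizing rules above can be turned into a small calculator. A sketch in Python (the 10 TB of business data and 12 TB per-server capacity are assumed example figures, not customer numbers):

```python
import math

def size_cluster(business_tb, server_capacity_tb,
                 replication=3, temp_factor=1.25):
    """Apply the sizing rules: 3 replicas plus 25% temporary space."""
    raw_tb = temp_factor * (business_tb * replication)
    datanodes = math.ceil(raw_tb / server_capacity_tb)
    return raw_tb, datanodes

# e.g. 10 TB of business data on servers with 12 TB of storage each:
raw, nodes = size_cluster(10, 12)
print(raw, nodes)  # 37.5 TB raw, 4 DataNode servers (plus the 2 NameNode servers)
```

Rounding the DataNode count up ensures the raw volume, including the temporary space, always fits on the cluster.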