Sunday Jul 19, 2009

Hadoop Architecture

Cloud computing is a convergence of High Performance Computing architectures, Web 2.0 data models, and Enterprise computing data scale.
Cloud Analytics should leverage Sun's compelling storage architecture.

Hadoop Distributed File System (HDFS) is scalable, with high availability and high performance. HDFS runs on a cluster of at least 3 nodes (1 master node and 2 slave nodes). Data blocks are 64 MB (default) or 128 MB, and every block is replicated 3 times (default). The NameNode holds the file system metadata. Files are divided into blocks and distributed across the DataNodes.
MapReduce is a data processing framework designed to process extremely large datasets in batch. It is not intended for realtime querying and does not support random access. The JobTracker schedules and manages jobs; a TaskTracker executes the individual map() and reduce() tasks on each cluster node.
HBase is a distributed, column-oriented, multi-dimensional storage system. It is well suited to managing very large structured datasets for the semantic web. HBase can manage billions of rows, millions of columns, thousands of versions and petabytes of data across thousands of servers. It supports realtime querying.
Hive is a data warehousing tool built on top of Hadoop for managing and querying structured data with an SQL-like language. It does not support realtime querying.
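
The map() and reduce() tasks mentioned above can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not the Hadoop Java API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the grouping step only imitates what the framework does between the two phases.

```python
from collections import defaultdict

def map_phase(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group the emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """reduce(): combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three steps in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map tasks would run on the nodes that hold the input blocks, and only the grouped intermediate pairs would travel over the network to the reduce tasks.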

High Availability

- The NameNode is a single point of failure (SPOF). Its transaction log can be stored in multiple directories, each of which is either on the local file system or on a remote file system (NFS/CIFS).
- The Secondary NameNode copies the FsImage and the transaction log from the NameNode to a temporary directory.
- To increase the availability of the Hadoop cluster, it is possible to interconnect 2 master node servers (active/passive) with Solaris Cluster.


- To secure the Hadoop cluster, you should encrypt the data in order to safeguard all transactions on the web.

Proof Of Concept

- Create an architecture with a minimum of three nodes and test the performance and feasibility of Hadoop.
- To test Hadoop quickly, you can use the OpenSolaris Hadoop Live CD.
- The OpenSolaris LiveHadoop setup installs a three-node virtual Hadoop cluster:
        - Once OpenSolaris boots, two virtual servers are created using Zones
        - Zones are very lightweight, minimizing virtualization overheads and leaving more memory for your application
        - The "Global" zone hosts the NameNode and JobTracker, and two "Local" zones each host a DataNode and TaskTracker


- Interface your application with HDFS and implement the "Save as Cloud..." and "Open from Cloud..." functionalities. Use the Hadoop Java API for your development.

Service and Support

- HDFS, MapReduce, HBase and Hive are Open Source software and are supported on OpenSolaris.
- In the US, it is possible to contact Cloudera, which brings big data to the enterprise with Hadoop.
- Who supports Hadoop across the globe?

Architecture Overview

Sizing for HA Cluster

- Business Data Volume = Customer needs
- No RAID factor, No HBA port
- 2 CPU Quad-core for all servers
- 2 System hard disks
- Number of replication blocks = 3
- Block size = 128 MB
- Temporary Space = 25% of the total hard disk
- Raw Data Volume = 1.25 \* (Business Data Volume \* Number of replication blocks), where the factor 1.25 accounts for the 25% temporary space
- Number of NameNode Servers = 2
- Number of DataNode Servers = Raw Data Volume / Server Storage Capacity
- NameNode RAM = 64 GB
- DataNode RAM = 32 GB minimum
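
The two sizing formulas above can be worked through with a short Python sketch. The function name size_cluster, the 100 TB business volume and the 24 TB per-server capacity are illustrative assumptions, not figures from a real project. Rounding the server count up to a whole number is also an assumption; the rule itself only gives the division.

```python
import math

def size_cluster(business_tb, server_capacity_tb, replication=3, temp_factor=1.25):
    """Apply the sizing rules: raw volume first, then the DataNode count.

    temp_factor=1.25 adds the 25% temporary space;
    replication=3 is the number of replication blocks.
    """
    raw_tb = temp_factor * (business_tb * replication)
    # Assumption: a partial server is rounded up to a full one.
    datanodes = math.ceil(raw_tb / server_capacity_tb)
    return raw_tb, datanodes

if __name__ == "__main__":
    # Hypothetical input: 100 TB of business data, 24 TB usable per server.
    raw, nodes = size_cluster(100, 24)
    print(f"Raw data volume: {raw} TB, DataNode servers: {nodes}")
```

With these example inputs the raw volume is 375 TB, which needs 16 DataNode servers in addition to the 2 NameNode servers.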



Saturday Jul 05, 2008

Sun and Greenplum Appliance

Greenplum Appliance: Data Warehousing Without Limits.

The Data Warehouse Appliance powered by Sun and Greenplum is the industry's first cost-effective, high-performance super-capacity data warehouse appliance. Purpose-built for high-performance, large-scale data warehousing, the solution integrates best-in-class database, server, and storage components into one easy-to-use, plug-and-play system.

At the heart of the Data Warehouse Appliance is the Sun Fire™ X4500 server powered by Dual-Core AMD Opteron™ processors. The Sun Fire X4500 server represents a revolution in server architecture for data warehousing. With up to 24 terabytes of on-board, high-density storage (48 drives in a 4 RU system), it delivers industry leading compute power, storage density, and near-zero latency access to data in a single, integrated solution. The hot-swappable disks provide 2 gigabytes per second serial read throughput per system. Utilizing the massively parallel processing architecture provided by Greenplum's MPP PostgreSQL database, the Data Warehouse Appliance distributes data across all disks in the system, enabling query-in-storage processing for today's demanding data warehousing applications. The Data Warehouse Appliance powered by Sun and Greenplum changes the game in data warehousing with low acquisition costs, quality, global support, and technical expertise. The Data Warehouse Appliance truly transforms the economics of data warehousing.

Value Proposition

  • Data Warehouse Appliance powered by Sun and Greenplum
  • Open Source : PostgreSQL
  • Solaris™10 Operating System and Solaris ZFS
  • Sun Fire™ X4500 servers with 2 Dual-Core AMD Opteron processors
  • Sun Fire X4100 server (parallel optimizer planner)
  • 1 TB/min scan
  • Scale to hundreds of terabytes
  • Massively Parallel Processing
  • MPP PostgreSQL
  • Parallel Loading 500GB/hour
  • Modular Design
  • 100 TB/rack (DW100 hardware)
  • 9 KW/rack
  • Global Support
  • Sun Solution Center

Sunday Jun 01, 2008

Follow the Sun

The Helios-synchronous Dynamic Architecture

This is the best architecture for the sales management of a multinational company. System performance must be at its best for both user activity and data integration. User activity and data integration alternate between day and night in every time zone across the world. The system must follow the sun and stay synchronized with user activity. Data integration runs at night, so a reserve of power must be allocated in every time zone to guarantee system availability and performance. It is thus necessary to design a dynamic system that follows the alternation of days and nights.

This solution is based on SAP BI Software and Solaris Containers.
  • Solaris 10 and Solaris Resource Manager
  • Resources guaranteed for any Time Zone
  • One Local Zone per Time Zone for the AS Instances
  • Resources consolidation
  • Global Zone for DB/CI Instances
  • User activity by Day & Data Integration by Night
  • Resources in Day & Night Alternation
  • Dynamic integration of new countries
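
The day and night alternation described above can be sketched in a few lines of Python. This is only an illustration of the scheduling idea, not a Solaris Resource Manager configuration: the function names, the zone names, the fixed UTC offsets (daylight saving time is ignored) and the 08:00 to 19:00 user window are all assumptions.

```python
def zone_mode(utc_hour, utc_offset, day_start=8, day_end=19):
    """Return which workload should own the resources of one time zone.

    "users" during local working hours, "integration" otherwise.
    """
    local_hour = (utc_hour + utc_offset) % 24
    return "users" if day_start <= local_hour < day_end else "integration"

def plan(utc_hour, zones):
    """Compute the mode of every zone for a given UTC hour."""
    return {name: zone_mode(utc_hour, offset) for name, offset in zones.items()}

if __name__ == "__main__":
    # Hypothetical zones with their UTC offsets in hours.
    zones = {"paris": 1, "tokyo": 9, "sao_paulo": -3}
    for hour in (0, 10, 20):
        print(hour, plan(hour, zones))
```

Adding a new country then amounts to adding one entry to the zones table, which matches the idea of dynamic integration of new countries.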

Saturday Feb 16, 2008

MySQL For Business Intelligence

MySQL, Yes for Business Intelligence

I think that MySQL can be a good compromise compared to the major databases of the Business Intelligence market, such as Oracle or Sybase, which have advanced functionalities in this field. Today we can certainly propose MySQL as an alternative to Oracle or Sybase. MySQL is evolving and is being improved with various functionalities related to Business Intelligence.
Today, if you want to extend the Business Intelligence functionalities of MySQL, you should add Infobright BrightHouse, a Data Warehouse engine for very large databases.

It's a solution for Analytic Data Warehousing that delivers high performance for complex analytic queries across vast amounts of data. BrightHouse delivers the following key features:

  • high query performance for analysis across terabytes of data.
  • average data compression of 10:1 (10TB of raw data can be stored at 1TB).
  • low administration requirements.
  • runs on low cost, commodity hardware.
  • compatible with all major BI tools including Cognos, Business Objects, etc.

BrightHouse at its core is a highly compressed column-oriented datastore that incorporates MySQL technology. BrightHouse leverages MySQL’s pluggable storage engine architecture and bundles MySQL Version 5.1.
The MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.) are used in BrightHouse. The MySQL management services and utilities are used as the technology around connection pooling. As with other MySQL storage engines, MyISAM is used to store catalogue information such as table definitions, views, users, permissions, etc.

For the reporting functionalities, we work with partners such as JasperSoft and Actuate. For the ETL functionality, we work with Talend (Open Source ETL).

Our Business Intelligence stack is thus: MySQL + Talend + JasperSoft.

For IT governance, I see real value in positioning MySQL as a reference database for implementing a CMDB according to ITIL best practices: first, to control the performance of the IT infrastructure by deploying the IT model in MySQL, and then to have this model deployed in the field by professional services (standardization phase).


Friday Feb 15, 2008

BI Architecture Design

The Best Architecture for Business Driving

If you want to size this type of architecture, you must first read the BI rules and definitions in the BI Architecture Definition post.

Sizing Methodology
The major parameters for sizing a business intelligence technical architecture are: concurrent queries launched by users (low, medium, high), processor (type, frequency), operating system (name, version), analysis tools and databases (name, version), data (raw data volume, usable data volume), data flow (size, frequency, timing, complexity, period) and aggregate building.

Architecture Design example
• User activity in different time zones (ex: France, Japan, Brazil, Australia...)
• Standardization : ITIL v3 best practices, server and storage consolidation
• Virtualization : server virtualization (Solaris Containers cloned by country) and storage virtualization with the Sun StorageTek 9990V
• Dynamic infrastructure : more flexibility for dynamic user activity and data integration. Resource management with Solaris Resource Manager (SRM) to balance data integration against user activity performance. Data replication for high availability.
• Governance : performance and cost management


Tuesday Feb 12, 2008

BI Architecture Definition

Understanding Business Intelligence Rules and Definitions

If you want to understand Business Intelligence and design the best architecture for the customer's needs, you must know its rules and definitions. This is the best way to be able to talk more easily with the specialists.

What is Business Intelligence?
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. Business intelligence applications can be mission-critical and integral to an enterprise's operations, or occasional to meet a special requirement; enterprise-wide, or local to one division, department, or project; centrally initiated, or driven by user demand.

Raw Data vs Usable Data
Raw data is the source data coming from the operational systems (CRM, HR, billing, purchasing, SCM).
Usable data is the raw data plus the technical data required by the database organization, such as indexes, aggregates, metadata, axes, indicators and work data. Usable data does not include the RAID factor.

Data Structure

The database is structured in 3 levels. The Staging Area is the storage area for data validation. The Data Warehouse is the storage area for detailed data and metadata (ex: Oracle, DB2...). The Data Marts are the storage area for business data, including axes, indicators and aggregates (ex: Oracle, DB2, Sybase, Essbase...).

Users Activity

Named users are those who may access the Business Intelligence system. Concurrent users access the BI system resources at the same time. Low users perform reporting by means of requests scanning around 1,000 records. Medium users perform navigation and analysis on around 100,000 records. High users perform ad hoc navigation and analyze large volumes of data, with several table joins or full fact table scans, on around 1 million records.
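
The three user profiles above can be turned into a simple classification for sizing purposes. The sketch below is a minimal Python illustration: the function names are hypothetical, and treating the indicative volumes of 1,000 and 100,000 records as hard upper bounds is an assumption, since the definitions only give orders of magnitude.

```python
from collections import Counter

def user_profile(records_scanned):
    """Map the number of records a query scans to a user profile."""
    if records_scanned <= 1_000:
        return "low"      # reporting requests
    if records_scanned <= 100_000:
        return "medium"   # navigation and analysis
    return "high"         # ad hoc analysis, joins, full fact table scans

def profile_mix(queries):
    """Count how many concurrent queries fall into each profile."""
    return Counter(user_profile(n) for n in queries)

if __name__ == "__main__":
    # Hypothetical sample of concurrent queries (records scanned by each).
    sample = [800, 950, 40_000, 100_000, 1_000_000]
    print(dict(profile_mix(sample)))
```

The resulting mix of low, medium and high queries is the kind of input the sizing methodology expects for concurrent user activity.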

Extraction, Transformation, Loading

Data integration is more or less complex depending on the transformations performed.
Simple processing represents simple calculations and simple concatenations. Medium processing represents average calculations and medium concatenations. Heavy processing represents heavy calculations, statistics, complex algorithms and heavy concatenations.


Software is classified according to several technology topics: ETL tools for data extraction, transformation and loading (PowerCenter, DataStage, Ab Initio...). A relational database (RDBMS) is an entity/relationship data structure (ex: Oracle, Sybase, SAS Base, DB2...). A multidimensional database (MDBMS) is a matrix data structure stored on disk (ex: Essbase, Powerplay...). Reporting/analysis tools (ex: Business Objects, Cognos, SAS, SAP/BI...).

Time Management

The Business Intelligence system is different from the operational system because it integrates the time factor.
Time management is very important in Business Intelligence: data retention duration (ex: 3 years), operational period (ex: Monday - Friday), operation frequency (ex: daily) and associated time frame (ex: 08:00 AM - 07:00 PM)


Thursday Jan 17, 2008

Business Intelligence


Business Intelligence drives Business and IT Performance

In today’s highly competitive business climate, making better decisions faster can mean the difference between surviving and thriving. The challenges are managing the exponential growth of data in a cost-effective and secure manner, while transforming relevant data into information for decision support needs. Sun takes the cost and complexity out of today’s business intelligence and data warehouse requirements with a single open platform whose architecture can scale to meet all your needs, from deployment today to growth tomorrow. The results are faster access to information, the ability to make better decisions quickly, and a shorter time to market.

Sun has more than 2,000 customer references in business intelligence and data warehousing worldwide, across all industries (banking/finance, manufacturing, retail, government, telco...).
Sun Microsystems has developed a network of competencies and experts in Business Intelligence and Data Warehousing around the world, working with its partners: SAS, Oracle, Informatica, SAP, etc.
The Sun Microsystems Business Intelligence Solutions integrate specific services around Extraction, Transformation and Loading, Database, Reporting, OnLine Analytical Processing, Technical architectures, Proof-of-Concept and Benchmarks.

See performance results: DMreview, Wintercorp, TPC

The qualification of the Business Intelligence technical architecture is broken down according to three assumptions:

  1. Data storage volume : for disk sizing. The technical architecture must support the data volume. The useful volume is the raw volume from the operational systems plus the indexes, aggregates, metadata and work data required by the database system.
  2. Extract, transform and load : for data extraction, transformation and loading. The technical architecture supporting the ETL process is sized on the data flow volume and the data processing.
  3. Users volume : for sizing user activities (Reporting, OnLine Analytical Processing). The technical architecture supporting the reporting process is sized on the number of concurrent users on the Data Warehouse and Data Marts.

Our Technology Assets

Business stakes are changing, and the IT infrastructure must be increasingly reactive to significantly reduce Time To Market. Today, we have the technology and methodology to address these new business challenges.

