Years ago one of the big business trends was to create a data warehouse. The idea was to take all of the corporate operational data, put it into one database, and grind on it to generate reports. History has shown that aggregating the data was a difficult task, as was finding the processing power required to grind through reports. The task took significant resources to architect the data, to host the data, and to write select statements to generate reports for users. As retail moved more and more onto the web, sources outside the company became highly relevant to and influential on products and services. Big Data and Hadoop have come with tools to pull from unstructured data like Twitter, Yelp, and other public web services and correlate comments and reviews to products and services.
The three characterizations of Big Data according to Hadoop for Dummies are volume, variety, and velocity.
Hadoop is architected to handle high volumes of data and data with a variety of structures. It is not necessarily suited to analyzing data in motion as it enters the organization; it works on data once it is stored and at rest.
Since we touched on the subject, let's define different data structures. Structured data is characterized by a high degree of organization and is typically stored in a database or spreadsheet. There is a relational mapping to the data and programs can be written to analyze and process the relationships. Semi-structured data is a bit more difficult to understand than structured data. It is typically stored in the form of text data or log files. The data is typically somewhat structured and is either comma, tab, or character delimited. Unfortunately, different log files have different formats, so each file needs its own parsing rules and analysis is a little more challenging. Unstructured data has none of the advantages of the other two data types. What structure there is might come from the directory structure, server location, or file type. The actual architecture of the data might or might not be predictable and needs a special translator to parse the data. Analyzing this type of data typically requires a data architect or data scientist to look at the data and reformat it to make it usable.
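To make the semi-structured case concrete, here is a minimal Python sketch of parsing log lines that carry the same fields but use different delimiters. The sample lines, the field names, and the parse_line helper are all made up for illustration; real log formats will differ.

```python
import csv
import io

# Hypothetical log lines: same four fields, different delimiter per source.
SAMPLES = {
    "comma": "2016-05-01,GET,/catalog/shoes,200",
    "tab": "2016-05-01\tGET\t/catalog/socks\t200",
    "pipe": "2016-05-01|POST|/cart/add|302",
}

# Assumed delimiter for each hypothetical source.
DELIMITERS = {"comma": ",", "tab": "\t", "pipe": "|"}

# Assumed field layout shared by all three sources.
FIELDS = ["date", "method", "path", "status"]


def parse_line(line, delimiter):
    """Split one delimited log line into a dict keyed by FIELDS."""
    reader = csv.reader(io.StringIO(line), delimiter=delimiter)
    values = next(reader)
    return dict(zip(FIELDS, values))


# Normalize every source into the same record shape.
records = [parse_line(line, DELIMITERS[kind]) for kind, line in SAMPLES.items()]
for record in records:
    print(record)
```

Once every source is normalized into the same record shape, the downstream analysis no longer cares which file a line came from. That is the step that gets harder as the number of log formats grows.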
From Hadoop for Dummies again, Hadoop is a framework for storing data on large clusters of commodity hardware. This lends itself well to running on a cloud infrastructure that is predictable and scalable. Layer 3 networking is the foundation for the cluster. An application that is running on Hadoop gets its work divided among the nodes in the cluster. Some nodes aggregate data through MapReduce or YARN and the data is stored and managed by other nodes using a distributed file system known as the Hadoop Distributed File System (HDFS). Hadoop started back in 2002 with the Apache Nutch project. The purpose of this project was to create the foundation for an open source search engine. The project needed to be able to scale to billions of web pages, and in 2004 Google published a paper that introduced MapReduce as a way of parsing these web pages.
MapReduce performs a sequence of operations on distributed data sets. The data consists of key-value pairs, and the processing has two phases, mapping and data reduction. During the map phase, input data is split into a large number of fragments, each of which is assigned to a map task. Each map task processes the key-value pairs that it was assigned and produces a set of intermediate key-value pairs. This data is sorted by key and stored into a number of fragments that matches the number of reduce tasks. If, for example, we are trying to parse data for the National Football League in the US, we would want to spawn 32 task nodes so that we could parse data for each team in the league. Fewer nodes would cause one node to do double duty and more than 32 nodes would cause a duplication of effort. During the reduction phase each task processes the data fragment that was assigned to it and produces an output key-value pair. For example, if we were looking for passing yardage by team we would spawn 32 task nodes. Each node would look for yardage data for its team and categorize it as either passing or rushing yardage. We might have two quarterbacks play for a team or have a wide receiver throw a pass. The key for this team would be the passer and the value would be the yards gained. These reduce tasks are distributed across the cluster and their output is stored on HDFS when finished. We should end up with 32 data files from 32 different task nodes, each reporting passing yardage for one team.
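The map, sort, and reduce steps can be sketched in plain Python on a single machine. This is not the Hadoop API, just an illustration of the data flow. The play records, team labels, and function names are hypothetical, and we assume the key is the (team, passer) pair so that every passer on a team gets a separate total.

```python
from collections import defaultdict

# Hypothetical play records: (team, player, play_type, yards).
PLAYS = [
    ("TEAM_A", "QB1", "pass", 12),
    ("TEAM_A", "RB1", "rush", 5),
    ("TEAM_A", "QB1", "pass", 30),
    ("TEAM_A", "WR1", "pass", 8),   # a wide receiver throwing a pass
    ("TEAM_B", "QB2", "pass", 25),
    ("TEAM_B", "RB2", "rush", 7),
]


def map_play(play):
    """Map phase: emit ((team, passer), yards) for passing plays only."""
    team, player, play_type, yards = play
    if play_type == "pass":
        yield (team, player), yards


def shuffle(pairs):
    """Group intermediate values by key, standing in for the sort between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_yards(key, values):
    """Reduce phase: sum the yards collected for one key."""
    return key, sum(values)


def passing_yards(plays):
    """Run map, shuffle, and reduce in sequence over the play records."""
    intermediate = (pair for play in plays for pair in map_play(play))
    grouped = shuffle(intermediate)
    return dict(reduce_yards(key, values) for key, values in sorted(grouped.items()))


result = passing_yards(PLAYS)
for (team, passer), yards in result.items():
    print(team, passer, yards)
```

In a real cluster the map calls would run on many nodes at once, and the grouped keys would be partitioned so that all of the keys for one team land on the same reduce task.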
Hadoop is more than just distributed storage and MapReduce. It also contains components to help administer and coordinate servers (Hue, Ambari, and ZooKeeper), data movement management (Flume and Sqoop), resource management (YARN), processing frameworks (MapReduce, Tez, Hoya), workflow engines (Oozie), data serialization (Avro), data collection (MapReduce, Pig, Hive, and HBase), and data analysis (Mahout). We will look into these systems individually later.
There are both commercial and open source offerings for Hadoop.
A good first small Hadoop project is log analysis. If you have a web server, it generates logs every time a web page is requested. When a change is made to the web site, logs are generated when people log in to manage the pages or change the page content. If your web site is a transactional system, orders are being placed for goods and services and credit card transactions are being processed. All of these generate log files. If we wanted to look at a product catalog and correlate what people look at with what is ordered, we could do what Amazon has done for years. We could come up with recommendations based on what other people are looking at as well as what other people ordered along with this item. Suppose, for example, we are buying a pair of athletic shoes. A common purchase with a pair of shoes is socks. We could recommend socks that go with the shoes or a shoe deodorant product that yields a higher profit margin. These items could be displayed with the product in the catalog or shopping cart to help sell more goods on the web. We can also identify the products that no one is looking at, even casually, and reduce our inventories of them.
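As a rough sketch of the "ordered along with this item" idea, the following Python counts which items show up in the same order and ranks them. The order log format (an order id, a tab, then a comma-separated item list) and the function names are assumptions for illustration. A real site would pull this from its own transaction logs and run the counting across the cluster.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical order log: one line per order, "order_id<TAB>item,item,...".
ORDER_LOG = """\
1001\tshoes,socks
1002\tshoes,socks,deodorant
1003\tshoes,laces
1004\tsocks
"""


def co_purchases(log_text):
    """Count how often each item is ordered together with each other item."""
    together = defaultdict(Counter)
    for line in log_text.splitlines():
        _order_id, items_field = line.split("\t")
        items = set(items_field.split(","))
        # Every ordered pair of distinct items in the same order counts once.
        for item, other in permutations(items, 2):
            together[item][other] += 1
    return together


def recommend(together, item, limit=2):
    """Return the items most often ordered with `item`, ties broken by name."""
    ranked = sorted(together[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _count in ranked[:limit]]


together = co_purchases(ORDER_LOG)
print(recommend(together, "shoes"))
```

With this sample data socks rank first for shoes because they appear together in two orders. Weighting the ranking by profit margin, as in the deodorant example, would be a small change to the sort key.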
We can also use Hadoop as a fraud detection or risk modeling engine. Both provide significant value to companies and allow executives to look at revenue losses as well as potential transactions that could cause a loss. For example, we might want to look at the packing material that we use for a fragile item that we sell. If we have a high rate of return on a specific item we might want to change the packing, change the shipper, or stop shipping to a part of the country that tends to have a high return rate. Any and all of these solutions can be implemented but a typical data warehouse will not be able to coordinate the data and answer these questions. Some of the data might be stored in plain text files or log files on our return web site. Parsing and processing this data is a good job for Hadoop.
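The return-rate question can be sketched the same way. The Python below computes a return rate per item and region and flags the combinations above a threshold. The shipment records, the region names, and the 0.5 threshold are all hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical shipment records: (item, region, was_returned).
SHIPMENTS = [
    ("vase", "northeast", True),
    ("vase", "northeast", True),
    ("vase", "northeast", False),
    ("vase", "southwest", False),
    ("vase", "southwest", False),
    ("lamp", "northeast", False),
]


def return_rates(shipments):
    """Compute the fraction of shipments returned for each (item, region)."""
    shipped = defaultdict(int)
    returned = defaultdict(int)
    for item, region, was_returned in shipments:
        shipped[(item, region)] += 1
        if was_returned:
            returned[(item, region)] += 1
    return {key: returned[key] / shipped[key] for key in shipped}


def flag_high_return(rates, threshold=0.5):
    """List the (item, region) pairs whose return rate exceeds the threshold."""
    return sorted(key for key, rate in rates.items() if rate > threshold)


rates = return_rates(SHIPMENTS)
flagged = flag_high_return(rates)
print(flagged)
```

Adding the shipper or the packing material to the key would let us compare those options in the same pass over the data.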
In the upcoming weeks we will dive into installing a Hadoop framework on the Oracle Cloud. We will look at the resources required, pick a project, and deploy sample code into an IaaS solution. We will also look at other books and resources to help us understand and deploy sandboxes to build a prototype that might help us solve a business problem.