Chapter 3
Framework Architecture
SERVICES OF HADOOP
• Storage: HDFS
• Processing: MapReduce
• HDFS
• Files stored in HDFS are divided into blocks, which are then copied to multiple DataNodes.
• A Hadoop cluster contains only one NameNode and many DataNodes.
• Data blocks are replicated for high availability and fast access.
• NameNode
• Runs on a separate machine.
• Manages the file system namespace and controls access by external clients.
• Stores the file system metadata in memory: file information and the location of each block on the DataNodes.
• DataNode
• DataNodes run on separate machines and are the basic unit of file storage.
• They periodically report the blocks they hold to the NameNode.
• DataNodes serve read and write requests from HDFS clients.
• They also create, delete, and replicate blocks on instruction from the NameNode (a small client-side sketch of this layout follows below).
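To make the NameNode/DataNode split concrete, here is a minimal client-side sketch (not from the original slides) that asks HDFS for a file's block layout via the Hadoop FileSystem API; the path /data/input/sample.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                // client handle backed by the NameNode
        Path file = new Path("/data/input/sample.txt");      // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers from its in-memory metadata: one entry per block,
        // each listing the DataNodes that hold a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}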
• MapReduce
• Programming model for data processing.
• Hadoop can run MapReduce programs written in various languages, e.g. Java and Python.
• Parallel processing makes MapReduce suitable for very large-scale data analysis.
• Mappers produce intermediate results.
• Reducers integrate those results.
• Files are split into fixed-size blocks (64 MB by default) and stored on DataNodes.
• Programs written against MapReduce can be processed on distributed clusters in parallel.
• The input data is a set of key/value pairs, and the output is also key/value pairs.
• There are mainly two phases: map and reduce (a minimal job-driver sketch follows this list).
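As a concrete illustration of the key/value model, below is a minimal word-count job-driver sketch (not from the slides). The classes WordCountMapper and WordCountReducer are defined after the Map/Reduce bullets further down, and the input/output paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Both the intermediate and the final records are key/value pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The framework splits the input into blocks/splits and runs one map task per split.
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // hypothetical path
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // hypothetical path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}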
• Map
• The map phase processes each block/split separately, in parallel.
• It generates a set of intermediate key/value pairs.
• The results of these logical blocks are then reassembled (shuffled and sorted) for the reduce phase.
• Reduce
• Accepts an intermediate key and its related set of values.
• Processes those intermediate key/value pairs.
• Forms a relatively small set of output values (see the mapper/reducer sketch after this list).
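The mapper and reducer for the word-count driver sketched above might look as follows; this is an illustrative example, not code from the slides (each public class would normally live in its own .java file).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each map task processes one input split in parallel with the others.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // intermediate key/value pair
        }
    }
}

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All intermediate values for the same key arrive at one reducer.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // small final result set
    }
}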
HADOOP ECOSYSTEM
• Avro
• Avro is a framework for performing remote procedure calls and data serialization.
• It can be used to pass data from one program or language to another, e.g. from C to Pig.
• It is well suited for use with scripting languages such as Pig, because Avro data is always stored with its schema and is therefore self-describing.
• Avro can also handle changes in schema while still preserving access to the data (a small write example follows this list).
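A minimal sketch of writing self-describing Avro data from Java: the schema is embedded in the data file alongside the records. The "User" schema and the output file name are hypothetical.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical record schema with two fields.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        // The schema is written into the file header, so any reader (Java, C, Pig, ...)
        // can decode the records without out-of-band information.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}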
• Pig
• Pig is a framework consisting of a high-level scripting language (Pig Latin) and a runtime environment that allows users to execute MapReduce on a Hadoop cluster.
• Like HiveQL in Hive, Pig Latin is a higher-level language that compiles to MapReduce.
• Pig is more flexible than Hive with respect to the possible data formats.
• Pig's data model is similar to the relational data model, except that tuples (records or rows) can be nested (a small PigServer sketch follows this list).
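A hedged sketch of running a Pig Latin script from Java via PigServer; the log file, field names, and script are hypothetical, and in practice scripts are usually run from the Grunt shell or a .pig file. ExecType.MAPREDUCE would compile the script into MapReduce jobs on the cluster; ExecType.LOCAL runs it locally for testing.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Hypothetical log analysis: total bytes served per client IP.
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
                        + "AS (ip:chararray, url:chararray, bytes:int);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("traffic = FOREACH by_ip GENERATE group, SUM(logs.bytes);");

        pig.store("traffic", "traffic_by_ip");   // triggers the compiled job(s)
        pig.shutdown();
    }
}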
• Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, querying, and analysis.
• Using Hadoop directly is not easy for end users who are not familiar with the MapReduce framework.
• A Hive (HiveQL) query is converted into MapReduce tasks, as sketched below.
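A minimal sketch of submitting a HiveQL query to HiveServer2 over JDBC; Hive then compiles the query into MapReduce tasks. The host, table, and column names are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");   // hypothetical host
             Statement stmt = conn.createStatement();
             // Hypothetical table: page views per URL.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}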
• HBase
• HBase is a distributed, column-oriented database built on top of HDFS.
• HBase is not relational and does not support SQL, but given its problem space, it is able to do what an RDBMS cannot (see the client sketch below).
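A minimal sketch of the HBase Java client API doing a put and a row-key get against a column-oriented table stored on HDFS; the table name, column family, and values are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read by row key, the access pattern HBase is built for.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}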
• Mahout
• Mahout is a scalable machine-learning and data-mining library.
• There are currently four main groups of algorithms in Mahout (a recommender sketch follows this list):
• Recommendations
• Classification
• Clustering
• Frequent itemset mining
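A hedged sketch of a user-based recommender built with Mahout's "Taste" API; the ratings.csv file (one hypothetical "userID,itemID,rating" line per preference) and the user ID are placeholders.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));   // hypothetical input
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 item recommendations for user 1, based on similar users' preferences.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}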
• Sqoop
• Sqoop allows easy import and export of data between Hadoop and structured data stores.
• It is a command-line tool that can import any JDBC-supported database into Hadoop.
• It provides high-performance connectors for several RDBMSs.
• Flume
• Flume is a distributed, reliable, and available service for efficiently moving large amounts of data as it is produced.
• It is well suited for gathering logs from multiple systems.