FUNDAMENTALS OF BIG DATA AND BUSINESS INTELLIGENCE

CHAPTER THREE–HADOOP ECOSYSTEM


2 INTRODUCTION

Hadoop is an Apache project: a distributed system that serves as a framework
for big data.
Apache Hadoop is an open-source software framework for storage and large-scale
processing of datasets on clusters of commodity hardware.
Characteristics of Hadoop
Open source
Distributed storage
Distributed processing
Reliable
Economical
Flexible
3 HADOOP FRAMEWORK MODULES

The base Apache Hadoop framework is composed of the following modules:


Hadoop Common: contains the libraries and utilities needed by other Hadoop
modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data
on commodity machines, providing very high aggregate bandwidth across the
cluster.
Hadoop YARN (Yet Another Resource Negotiator): a resource-management
platform responsible for managing computing resources in clusters and using them
for scheduling users' applications.
Hadoop MapReduce: an implementation of the MapReduce programming model
for large-scale data processing.
4 HADOOP FRAMEWORK MODULES

Framework Architecture
5 SERVICES OF HADOOP

Storage:

1. HDFS (Hadoop Distributed File System):


• Horizontal scalability with no fixed limit on the number of slave nodes.
• Horizontal scalability means that the system can expand its capacity by adding
more machines (nodes) to the cluster.
• When you add new nodes to the cluster, the storage capacity and processing
power increase roughly in proportion.
• HDFS integrates new nodes automatically and places new blocks on them;
existing data can be redistributed with the HDFS balancer to maintain balance.
• Default block size: 64 MB (older versions), 128 MB (newer versions).
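The effect of the block size can be worked out with simple arithmetic. The sketch below is only an illustration: the function name and the 1 GB file size are assumptions, not part of Hadoop.

```python
MB = 1024 * 1024

def block_count(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of fixed-size blocks needed to hold a file (last block may be partial)."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-file_size_bytes // block_size_bytes)

file_size = 1024 * MB  # hypothetical 1 GB file
print(block_count(file_size, 64 * MB))   # 16
print(block_count(file_size, 128 * MB))  # 8
```

A larger block size means fewer blocks per file, and therefore less block metadata to track.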
6 SERVICES OF HADOOP

Process

1. MapReduce (old model): programming-based, disk-based data processing
(intermediate data is written to disk between each stage); slower due to
repeated disk I/O.
2. Spark (new model): in-memory data processing (keeps data in memory as much as
possible, falling back to disk when needed).
7 HADOOP ECOSYSTEM

• There are countless commercial Hadoop-integrated products focused
on making Hadoop more usable. The ones covered here were chosen
because they provide core functionality and speed in Hadoop; together
they are called the Hadoop ecosystem.
8 HADOOP ECOSYSTEM

• HDFS
• Files are stored in HDFS and divided into blocks, which are then copied
to multiple Data Nodes.
• A Hadoop cluster contains a single Name Node and many Data Nodes.
• Data blocks are replicated for high availability and fast access.
• Name Node
• Runs on a separate machine.
• Manages the file system namespace and controls access by external clients.
• Stores file system metadata in memory: file information and, for each
block, which Data Nodes hold it.
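The block-to-Data-Node mapping kept by the Name Node can be pictured with a small sketch. It assumes a replication factor of 3 and a simple round-robin placement. Real HDFS uses a rack-aware placement policy, so the function and node names here are purely illustrative.

```python
def place_blocks(file_name, num_blocks, data_nodes, replication=3):
    """Return {block_id: [Data Nodes holding a replica]} using round-robin placement."""
    if replication > len(data_nodes):
        raise ValueError("replication factor exceeds number of Data Nodes")
    placement = {}
    for i in range(num_blocks):
        block_id = f"{file_name}#blk_{i}"
        # Choose `replication` distinct nodes, starting at a rotating offset.
        placement[block_id] = [
            data_nodes[(i + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement

metadata = place_blocks("log.txt", 4, ["dn1", "dn2", "dn3", "dn4"])
for block_id, holders in metadata.items():
    print(block_id, holders)
```

Because every block lives on several Data Nodes, losing one node does not make any block unavailable.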
9 HADOOP ECOSYSTEM

• Data Node
• Runs on separate machines and is the basic unit of file storage.
• Periodically sends a report of all its existing blocks to the Name Node.
• Serves read and write requests from file system clients.
• Also carries out block create, delete, and copy commands from the
Name Node.
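The periodic block reports can be sketched as follows. This is a toy model, not the Hadoop API: the class, its method names, and the replication target are assumptions made for illustration.

```python
from collections import defaultdict

class NameNode:
    """Toy Name Node that learns block locations from Data Node reports."""

    def __init__(self):
        # block id -> set of Data Node names that reported holding it
        self.block_locations = defaultdict(set)

    def receive_block_report(self, data_node, block_ids):
        # Forget what this node reported earlier, then record its current blocks.
        for holders in self.block_locations.values():
            holders.discard(data_node)
        for block_id in block_ids:
            self.block_locations[block_id].add(data_node)

    def under_replicated(self, target=3):
        """Blocks with fewer known replicas than the target."""
        return sorted(
            b for b, holders in self.block_locations.items() if len(holders) < target
        )

nn = NameNode()
nn.receive_block_report("dn1", ["b1", "b2"])
nn.receive_block_report("dn2", ["b1"])
print(nn.under_replicated(target=2))  # ['b2']
```

A Name Node could use a list like this to decide which blocks to ask Data Nodes to copy.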
10 HADOOP ECOSYSTEM

• MapReduce
• A programming model for data processing.
• Hadoop can run MapReduce programs written in various languages, such as
Java and Python.
• Parallel processing makes MapReduce suitable for very large-scale data
analysis.
• Mappers produce intermediate results.
• Reducers integrate the results.
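A minimal word-count sketch shows the mapper and reducer roles in plain Python. It runs locally, does not use Hadoop, and its function names are illustrative; the `sorted` call stands in for the sort step a real framework performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit an intermediate (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(sorted_pairs):
    """Sum the counts for each word; the input must already be sorted by key."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

def run_job(lines):
    intermediate = sorted(mapper(lines))  # stands in for the framework's sort step
    return dict(reducer(intermediate))

print(run_job(["big data", "Big clusters process big data"]))
# {'big': 3, 'clusters': 1, 'data': 2, 'process': 1}
```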
11 HADOOP ECOSYSTEM

• MapReduce
• Files are split into fixed-size blocks and stored on Data Nodes (default
64 MB in older versions, 128 MB in newer ones).
• Programs, once written, can run in parallel on distributed clusters.
• Input data is a set of key/value pairs; the output is also a set of
key/value pairs.
• There are two main phases: map and reduce.

12 HADOOP ECOSYSTEM

• MapReduce
• Map
• Map processes each block separately, in parallel.
• Generates a set of intermediate key/value pairs.
• The results from these logical blocks are reassembled.
• Reduce
• Accepts an intermediate key and its related values.
• Processes the intermediate key and values.
• Forms a relatively small set of values.
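The block-wise flow can be sketched end to end: each block is mapped independently, the intermediate pairs are grouped by key, and each key's values are reduced to one small result. The record format ("year,temperature"), the sample data, and the thread pool standing in for a cluster are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_block(block):
    """Map: turn one block of 'year,temperature' records into (year, temp) pairs."""
    return [(year, int(temp)) for year, temp in (rec.split(",") for rec in block)]

def shuffle(mapped_blocks):
    """Reassemble intermediate pairs from all blocks, grouped by key."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_max(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, max(values)

blocks = [["1949,111", "1950,0"], ["1949,78", "1950,22"]]
with ThreadPoolExecutor() as pool:  # blocks are mapped in parallel
    mapped = list(pool.map(map_block, blocks))
result = dict(reduce_max(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'1949': 111, '1950': 22}
```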
13 HADOOP ECOSYSTEM

• YARN(Yet Another Resource Negotiator)


• MapReduce 1.0 had issues with scalability, memory usage, and
synchronization.
• YARN addresses problems with MapReduce 1.0's architecture,
specifically with the JobTracker service.
• YARN splits the two major functions of the JobTracker, resource
management and job scheduling/monitoring, into separate daemons.
• Rather than burdening a single node with handling scheduling and
resource management for the entire cluster, YARN distributes this
responsibility across the cluster.
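The split of responsibilities can be pictured with a toy model. The class names follow YARN's ResourceManager and per-application ApplicationMaster, but the methods and the "tasks finish immediately" behaviour are simplifying assumptions, not the YARN API.

```python
class ResourceManager:
    """Cluster-wide resource management only: hands out containers."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Per-application scheduling and monitoring of that application's tasks."""

    def __init__(self, name, tasks):
        self.name = name
        self.pending = list(tasks)
        self.done = []

    def run_round(self, rm):
        granted = rm.allocate(len(self.pending))
        running, self.pending = self.pending[:granted], self.pending[granted:]
        self.done.extend(running)  # pretend the tasks finish immediately
        rm.release(granted)
        return running


rm = ResourceManager(total_containers=2)
am = ApplicationMaster("wordcount", ["t1", "t2", "t3"])
print(am.run_round(rm))  # ['t1', 't2']
print(am.run_round(rm))  # ['t3']
```

Each application gets its own ApplicationMaster, so task tracking is not concentrated in one cluster-wide service as it was with the JobTracker.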
14 HADOOP ECOSYSTEM

• Avro
• Avro is a framework for performing remote procedure calls and data
serialization.
• It can be used to pass data from one program or language to another,
e.g. from C to Pig.
• It is well suited for use with scripting languages such as Pig because
data is always stored with its schema in Avro and is therefore
self-describing.
• Avro can also handle changes in schema while still preserving access to
the data.
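The idea of self-describing data and schema changes can be sketched without the Avro library. The example below uses JSON instead of Avro's binary encoding, and its function names are hypothetical. It mimics one schema-resolution rule: a field that the reader expects but the writer did not supply takes the reader's default.

```python
import json

def write_record(schema, record):
    """Serialize a record together with the schema that describes it."""
    return json.dumps({"schema": schema, "data": record})

def read_record(payload, reader_schema):
    """Resolve the written data against the reader's (possibly newer) schema."""
    data = json.loads(payload)["data"]
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in data:
            resolved[name] = data[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

schema_v1 = {"type": "record", "name": "User",
             "fields": [{"name": "name", "type": "string"}]}
schema_v2 = {"type": "record", "name": "User",
             "fields": [{"name": "name", "type": "string"},
                        {"name": "country", "type": "string", "default": "unknown"}]}

payload = write_record(schema_v1, {"name": "Ada"})
print(read_record(payload, schema_v2))  # {'name': 'Ada', 'country': 'unknown'}
```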
15 HADOOP ECOSYSTEM

• Pig
• Pig is a framework consisting of a high-level scripting language (Pig
Latin) and a run-time environment that allows users to execute MapReduce
on a Hadoop cluster.
• Like HiveQL in Hive, Pig Latin is a higher-level language that compiles
to MapReduce.
• Pig is more flexible than Hive with respect to possible data formats.
• Pig's data model is similar to the relational data model, except that
tuples (records or rows) can be nested.
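Nested tuples can be illustrated in Python. Grouping in Pig Latin produces, for each key, a tuple that contains a bag of the original tuples; the sketch below imitates that shape with lists. The function name and sample data are assumptions.

```python
from collections import defaultdict

def group_by(records, key_index):
    """Return (key, bag) tuples where each bag is a list of the original tuples."""
    bags = defaultdict(list)
    for record in records:
        bags[record[key_index]].append(record)
    return [(key, bag) for key, bag in bags.items()]

visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart")]
for key, bag in group_by(visits, 0):
    print(key, bag)
# alice [('alice', '/home'), ('alice', '/cart')]
# bob [('bob', '/cart')]
```

A flat relational table cannot hold the bag inside a row directly; that nesting is the difference the last bullet refers to.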
16 HADOOP ECOSYSTEM

• Hive
• Apache Hive is a data warehouse infrastructure built on top of Hadoop
for providing data summarization, query, and analysis.
• Using Hadoop directly is not easy for end users who are not familiar
with the MapReduce framework.
• A Hive query is converted to MapReduce tasks.
17 HADOOP ECOSYSTEM

• Building blocks of Hive


• Metastore stores the system catalog and metadata about tables,
columns, partitions, and so on.
• Driver manages the lifecycle of a HiveQL statement as it moves
through Hive.
• Query compiler compiles HiveQL into a directed acyclic graph of
MapReduce tasks.
• Execution engine executes the tasks produced by the compiler in
proper dependency order.
• Hive Server provides a Thrift interface and a JDBC/ODBC server.
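Executing a directed acyclic graph of tasks in dependency order amounts to a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the task names are a hypothetical plan for a join-and-aggregate query, not actual Hive compiler output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# task -> set of tasks that must finish before it can start
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}

def execution_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(plan)
print(order)  # both scans come before 'join', and 'aggregate' comes last
```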
18 HADOOP ECOSYSTEM

• HBase
• HBase is a distributed, column-oriented database built on top of HDFS.
• HBase is not relational and does not support SQL but, given the right
problem space, it is able to do what an RDBMS cannot.
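The non-relational data model can be pictured as a sorted map from row key to columns. The class below is a toy, not the HBase client API: its names are assumptions, and it leaves out cell versions and distribution across servers. It does follow two HBase conventions: columns are named "family:qualifier", and a scan includes the start key and excludes the stop key.

```python
class ToyTable:
    """Toy row-key -> {"family:qualifier": value} store."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        return row if column is None else row.get(column)

    def scan(self, start, stop):
        """Yield rows with start <= key < stop in sorted key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = ToyTable()
table.put("user#002", "info:name", "Bob")
table.put("user#001", "info:name", "Ada")
table.put("user#001", "stats:logins", 7)   # rows need not share the same columns
print(list(table.scan("user#001", "user#002")))
# [('user#001', {'info:name': 'Ada', 'stats:logins': 7})]
```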
19 HADOOP ECOSYSTEM

• Mahout
• Mahout is a scalable machine-learning and data-mining library.
• There are currently four main groups of algorithms in Mahout:
• Recommendations
• Classification
• Clustering
• Frequent itemset mining
• Mahout is not simply a collection of pre-existing algorithms.
• Algorithms in the Mahout library belong to the subset that can be
executed in a distributed fashion.
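What "executed in a distributed fashion" means can be sketched with a simple frequent-pair count: each partition of the data is counted independently, and the partial counts are then merged. This is neither Mahout's API nor its actual algorithm; the function names, sample baskets, and support threshold are assumptions.

```python
from collections import Counter
from itertools import combinations

def count_pairs(partition):
    """Count item pairs within one partition; needs no data from other partitions."""
    counts = Counter()
    for basket in partition:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(partitions, min_support):
    """Merge the per-partition counts and keep pairs seen at least min_support times."""
    total = Counter()
    for partial in map(count_pairs, partitions):  # each call could run on its own node
        total.update(partial)
    return {pair: n for pair, n in total.items() if n >= min_support}

partitions = [
    [["milk", "bread"], ["milk", "bread", "eggs"]],
    [["bread", "eggs"], ["milk", "bread"]],
]
print(frequent_pairs(partitions, min_support=2))
# {('bread', 'milk'): 3, ('bread', 'eggs'): 2}
```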
20 HADOOP ECOSYSTEM

• Sqoop
• Sqoop allows easy import and export of data between structured data
stores and Hadoop.
• It is a command-line tool that can import from any JDBC-supported
database into Hadoop.
• It offers high-performance connectors for RDBMSs.
• Flume
• Flume is a distributed, reliable, available service for efficiently
moving large amounts of data as it is produced.
• It is suited for gathering logs from multiple systems.
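A Sqoop import is started from the command line. The sketch below only builds the argument list for such a command in Python and does not run it. The JDBC URL, table, user, and target directory are hypothetical placeholders, and the helper function is not part of Sqoop.

```python
def sqoop_import_args(jdbc_url, table, username, target_dir, num_mappers=4):
    """Build the argument list for a `sqoop import` invocation (not executed here)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--table", table,                  # table to copy into Hadoop
        "--target-dir", target_dir,        # HDFS directory for the imported files
        "--num-mappers", str(num_mappers), # number of parallel import tasks
    ]

args = sqoop_import_args(
    "jdbc:mysql://db.example.com/sales",  # hypothetical database
    table="orders",
    username="report_user",
    target_dir="/data/sales/orders",
)
print(" ".join(args))
```

On a machine with Sqoop installed, a list like this could be passed to `subprocess.run`.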