
International Journal of Trend in Scientific Research and Development (IJTSRD)
International Open Access Journal
ISSN No: 2456-6470 | www.ijtsrd.com | Volume - 2 | Issue – 4

HADOOP: A Solution to Big Data Problems using Partitioning Mechanism Map-Reduce

Jagjit Kaur
Assistant Professor, Chandigarh Group of Colleges (CGC-COE), Landran, Mohali, Punjab, India

Heena Girdher
Assistant Professor, Department of Computer Applications, Chandigarh Group of Colleges, Landran, Mohali, Punjab, India

ABSTRACT
With the increased usage of the internet, data usage is also increasing exponentially year on year. To handle such enormous data, a better platform for processing it was needed. A programming model called MapReduce was therefore introduced, which processes large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Since Hadoop has emerged as a popular tool for Big Data implementation, this paper deals with the overall architecture of Hadoop along with the details of its various components.

Keywords: Hadoop, Big Data, HDFS, YARN, SAS


INTRODUCTION
Hadoop is open-source software for reliable, scalable and distributed computing. It is a Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment, and it is part of the Apache project. Hadoop makes it possible to run applications on systems with thousands of hardware nodes and to handle thousands of terabytes of data. This approach facilitates:

• Rapid data transfer
• Cost-effective and flexible operation
• Distribution of data and computation
• Independent tasks
• A simple programming model: the end-user programmer only writes map-reduce tasks
• A highly scalable storage platform
• A lower risk of catastrophic system failure and unexpected data loss

[Fig 1.1 - Various features of Hadoop]

Challenges:
• Security concerns
• Vulnerable by nature
• Not fit for small data
• Potential stability issues

Doug Cutting built the parallel Hadoop Distributed File System (HDFS). The software framework that supports HDFS and MapReduce is known as Hadoop. Hadoop is open source and is distributed by Apache.

Hadoop Framework:-

Currently, four core modules are included in the basic framework from the Apache Foundation:

Hadoop Common – the libraries and utilities used by other Hadoop modules.

Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data across multiple machines without prior organization.

YARN (Yet Another Resource Negotiator) – provides resource management for the processes running on Hadoop.

MapReduce – a parallel processing software framework. It comprises two steps. In the map step, a master node takes the inputs, partitions them into smaller sub-problems and distributes them to worker nodes. After the map step has taken place, the master node takes the answers to all of the sub-problems and combines them to produce the output.

[Fig 1.2 - Hadoop Framework Model]

Hadoop Map Reduce:-

MapReduce is mainly used for parallel processing of large sets of data stored in a Hadoop cluster. It was originally designed by Google to provide parallelism, data distribution and fault-tolerance. MapReduce processes data in the form of key-value pairs. A key-value (KV) pair is a mapping element between two linked data items: a key and its value.

[Fig 1.3 - Hadoop Map Reduce Model]
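
To make the key-value flow concrete, consider a hypothetical word-count job over the two input lines "big data big" and "data hadoop"; the data would pass through the stages roughly as follows:

Input records (byte offset, line): (0, "big data big"), (13, "data hadoop")
Map output: (big, 1), (data, 1), (big, 1), (data, 1), (hadoop, 1)
After shuffle and sort: (big, [1, 1]), (data, [1, 1]), (hadoop, [1])
Reduce output: (big, 2), (data, 2), (hadoop, 1)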

This algorithm divides the task into small parts and assigns those parts to many computers connected over the
network, and collects the results to form the final result dataset.

• The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).

• The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job; the word-count sketch below illustrates both steps.
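
For concreteness, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API. It is an illustrative example rather than code from this paper; the input and output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: break each input line into (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit intermediate key-value pair
      }
    }
  }

  // Reduce step: sum the counts for each word after shuffle and sort.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // final (word, count) pair
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}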

Hadoop Map Reduce architecture

[Fig 1.4 - Hadoop Map Reduce Architecture]

The MapReduce architecture consists of two main processing stages: the first is the map stage and the second is the reduce stage. The actual MapReduce processing happens in the task trackers. Between the map and reduce stages an intermediate process takes place, which performs operations such as shuffling and sorting of the mapper output data; this intermediate data is stored in the local file system.

Mapper Phase

In the mapper phase, the input data is split into two components, a key and a value. The key is writable and comparable during the processing stage; the value is writable only. When a client submits input data to the Hadoop system, the job tracker assigns tasks to the task trackers, and the input data is split into several input splits.

Intermediate Process

The mapper output data undergoes shuffling and sorting in the intermediate process. The intermediate data is stored in the local file system without being replicated across Hadoop nodes. This intermediate data is generated after computations based on certain logic. Hadoop uses a round-robin algorithm to write the intermediate data to local disk, and several other sorting factors determine when the data is written to local disks.

Reducer Phase

The shuffled and sorted data is passed as input to the reducer. In this phase, all incoming data is combined, and the key-value pairs sharing the same key are merged and written into the HDFS system by the record writer. The reducer is not mandatory for jobs that only perform searching or mapping.
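
Since the partitioning mechanism is the paper's theme, the sketch below shows how a custom Partitioner could decide which reduce task receives each intermediate key during the shuffle. The hash-based logic mirrors Hadoop's default HashPartitioner; the class name WordPartitioner is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (word, count) pair to a reduce task.
// Keys with the same hash always reach the same reducer, so all
// values for a given key are combined in one place.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Same idea as Hadoop's default HashPartitioner.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

A job driver would register it with job.setPartitionerClass(WordPartitioner.class). For purely searching or mapping workloads where the reducer is unnecessary, job.setNumReduceTasks(0) turns the job into a map-only job whose output is written directly to HDFS.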

Hadoop - HDFS File System

The Hadoop File System was developed using a distributed file system design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware. HDFS holds very large amounts of data and provides easy access. To store such huge data, files are stored across multiple machines, in a redundant fashion, to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the name node and data node help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
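
Besides the command interface, HDFS can also be used programmatically through the Java FileSystem API. The sketch below writes a small file and streams it back; the namenode address hdfs://localhost:9000 and the path /user/demo/hello.txt are assumptions made only for illustration.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical namenode address; normally taken from core-site.xml (fs.defaultFS).
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write a small file; HDFS transparently splits large files into blocks
    // and replicates them across data nodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back via streaming access.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}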

HDFS Architecture:-

[Fig 1.5 - HDFS Architecture]

HDFS follows the master-slave architecture and has the following elements.

Name Node

The name node is commodity hardware that contains the GNU/Linux operating system and the name node software. The system running the name node acts as the master server and performs the following tasks:

• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.

Data Node

The data node is commodity hardware having the GNU/Linux operating system and the data node software. For every node (commodity hardware/system) in a cluster, there is a data node. These nodes manage the data storage of their system. Data nodes perform read-write operations on the file systems, as per client requests. They also perform operations such as block creation, deletion, and replication according to the instructions of the name node.

Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
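
As an illustrative sketch of such a configuration change (not taken from the paper), the cluster-wide default can be set through the dfs.blocksize property (older releases used dfs.block.size), and a per-file block size can also be requested through the FileSystem API; the path and sizes below are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Cluster-wide default block size (equivalent to setting dfs.blocksize in hdfs-site.xml).
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB

    FileSystem fs = FileSystem.get(conf);

    // Request a 256 MB block size for this particular file:
    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out = fs.create(
        new Path("/user/demo/large-output.dat"),
        true, 4096, (short) 3, 256L * 1024 * 1024)) {
      out.writeUTF("data that would normally be much larger");
    }
    fs.close();
  }
}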

Goals of HDFS

• Fault detection and recovery: Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
• Huge data sets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
• Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

How Application Works:-

There are machine nodes at the bottom. The data nodes, which make up the Hadoop Distributed File System, hold the data, and the task trackers perform the operations on it. The job tracker makes sure that each operation is completed, and if there is a process failure at any node, it assigns a duplicate task to some other task tracker. The job tracker also distributes the entire task to all the machines.

[Fig 1.6 - Hadoop Architecture Model]

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, this is cheaper than one high-end server, which is the first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:

• Data is initially divided into directories and files. Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.

Data Management for Hadoop:-

Big data skills are in high demand. Business users can now profile, transform and cleanse data, on Hadoop or anywhere else it may reside, using an intuitive user interface, and data analysts can run SAS code on Hadoop for even better performance. With SAS ("Statistical Analysis System"), we can:

• Access and load Hadoop data fast: Turn big data into valuable data with quick, easy access to Hadoop and the ability to load to and from relational data sources as well as SAS datasets.

• Stop the "garbage in, garbage out" cycle: Integrated data quality delivers pristine data that fuels accurate analytics, amplified by the power of Hadoop.

• Put big data to work for you: Transform, filter and summarize data yourself, and get more value from your big data.

• Get more out of your computing resources: Optimize your workloads, and gain high availability across the Hadoop cluster.

WHY SAS ("Statistical Analysis System")?

• Better productivity through faster management of big data: In-Hadoop data quality and code execution take advantage of MapReduce and YARN to speed the process of accessing trusted data.

• Big data management: Big data is becoming the backbone of business information. SAS helps business and IT work together to deliver big data that is enterprise ready, with no need to write code (unless you want to).

• Data you can trust: Make big data better. SAS provides multiple data integration and data quality transformations to profile, parse and join your data without moving it out of Hadoop.

Conclusion

We have entered the Big Data era. This paper describes the concept of Big Data analytics with the help of the partitioning mechanism MapReduce and describes the management of large amounts of data through HDFS. The paper also focuses on Big Data processing problems; these technical challenges must be addressed for efficient and fast processing of Big Data, and Hadoop provides a solution to these big data problems.
