
Wasit University
College of Information Technology and Computer Science

Report title:
Data-Intensive Computing and MapReduce / Hadoop

Student Name: زهراء حيدر فاضل (Evening)

Grade:

Evaluation score: -----
Introduction:

Data-intensive computing is a class of parallel computing applications which use a data-parallel approach to process large volumes of data, typically terabytes or petabytes in size and commonly referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive. [1]
➢ Data-intensive computing and its relationship to data curation/preservation.
➢ The topic is slightly tangential:
– but there are many overlaps in the subjects, technologies, and aims of data preservation and data-intensive computing/research.
➢ Data is being preserved so that it can be re-used.
Data Intensive Computing
Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive and typically require small volumes of data, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive. – Wikipedia
➢ My working definition:
– I/O-bound computations
➢ Data is (generally) too big to fit in memory
– Efficient disk access is required to get the data to the CPU on time
– Having the data in the right place at the right time is vital

Making best use of a machine designed for data-intensive computing

➢ Work on streams of data, not files:
– streams are not (so easily) searchable or sortable;
– not all programs can benefit from this approach, and those that can might require extra work.
➢ Use multiple threads and asynchronous I/O.
➢ If you’re using files, use a library that does some of the hard work for you, e.g. MPI-IO. A minimal sketch of the stream-oriented style follows this list.
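
As an illustration of the stream-oriented style, the following minimal Java sketch scans its input line by line, so memory use stays constant however large the file is. The filename argument and the filter condition are hypothetical, chosen only for the example:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamScan {
        public static void main(String[] args) throws IOException {
            long matches = 0;
            // Process the input as a stream of lines: only one line is
            // held in memory at a time, so the input may exceed RAM.
            try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains("ERROR")) {  // hypothetical filter
                        matches++;
                    }
                }
            }
            System.out.println("matching lines: " + matches);
        }
    }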
Conclusions
➢ Data-intensive computing is a new(ish) kind of computing:
– necessitated by the huge amounts of data, and offering new opportunities.
➢ We need to think about new ways of doing computing:
– it is usually parallel computing, but not “traditional HPC”.
➢ It matters for data preservation. Either:
– you are preserving huge amounts of data that need to be easily reused, or
– you need to process large amounts of data to do a meaningful reduction so that the stored data retains its value.

MapReduce and Hadoop

➢ MapReduce paradigm
The MapReduce [24] programming model is inspired by two main functions
commonly used in functional programming: Map and Reduce. The Map function
processes key/value pairs to generate a set of intermediate key/value pairs, and the Reduce function merges all intermediate values associated with the same intermediate key. Many real-world
applications are expressed using this model. The most popular implementation of the
MapReduce model is the Hadoop framework [25], which allows applications to run on
large clusters built from commodity hardware. The Hadoop framework transparently
provides both reliability and data transfer. Other MapReduce implementations are
available for various architectures, such as for CUDA [26], in a multicore architecture
[27], in FPGA platforms [28], for a multiprocessor architecture [29], in a large-scale
shared-memory system [30], in a large-scale cluster [31], in multiple virtual machines
[32], in a .NET environment [33], in a streaming runtime environment [34], in a Grid
environment [35], in an opportunistic environment [36], and in a mobile computing
environment [37]. The Apache Hadoop on Demand (HOD) [38] provides virtual
Hadoop clusters over a large physical cluster. It uses the Torque resource manager to
do node allocation. myHadoop [39] is a system for provisioning on-demand Hadoop
instances via traditional schedulers on HPC resources. To the best of our knowledge,
there is no existing implementation of the MapReduce paradigm across multiple distributed clusters.
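
To make the model concrete, the canonical word-count program can be sketched against the standard Hadoop Java API (org.apache.hadoop.mapreduce). This is a minimal illustration, not code from any of the works cited above:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit an intermediate (word, 1) pair for every word.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);  // intermediate key/value pair
                }
            }
        }

        // Reduce: merge all values that share the same intermediate key.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);  // final (word, total) pair
            }
        }
    }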
➢ Hadoop
The MapReduce programming model is designed to process large volumes of data in parallel by dividing the Job into a set of independent Tasks. The Job referred to here is a full MapReduce program, which is the execution of a Mapper or Reducer across a set of data. A Task is an execution of a Mapper or Reducer on a slice of data. So the
MapReduce Job usually splits the input data set into independent chunks, which are
processed by the map tasks in a completely parallel manner. The Hadoop MapReduce framework consists of a single Master node that runs a JobTracker instance, which accepts Job requests from a client node, and Slave nodes, each running a TaskTracker instance. The JobTracker assumes the responsibility of distributing the software configuration to the Slave nodes, scheduling the job’s component tasks on the TaskTrackers, monitoring them, and reassigning tasks to the TaskTrackers when they fail. It is also responsible for providing status and diagnostic information to the client. The TaskTrackers execute the tasks as directed by the JobTracker. The TaskTracker executes each task in a separate Java process, so several task instances can run in parallel. Fig. 1 depicts the different components of the MapReduce framework; a sketch of a driver program that submits such a Job appears after the figure.

Fig. 1. Hadoop MapReduce
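
A client submits a Job through a small driver program. Below is a minimal sketch using the standard Job API; the input/output paths are taken from the command line, and the WordCount Mapper/Reducer classes from the earlier sketch are assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            // Mapper and Reducer from the word-count sketch above.
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            // Submit the Job and wait; the framework schedules the Tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }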

➢ The Hadoop framework consists of two main layers:
• Distributed file system (HDFS)
• Execution engine (MapReduce)

➢ HDFS
HDFS has several desirable features for massive data-parallel processing: (1) it works on commodity clusters where hardware failures are expected, (2) it supports streaming data access, (3) it handles very large data sets, (4) it employs a simple coherency model, and (5) it is portable
across heterogeneous hardware and software platforms. HDFS has a master/slave architecture (Fig. 3). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. A sketch of client interaction with HDFS appears after the figure.

Fig. 3. HDFS architecture
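
From a client’s point of view, these details sit behind an ordinary file-system API. The following minimal sketch uses the org.apache.hadoop.fs.FileSystem API; the NameNode address and the file path are hypothetical placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The client contacts the NameNode for namespace operations;
            // block data flows directly to and from the DataNodes.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            Path file = new Path("/user/demo/sample.txt");  // hypothetical path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello HDFS");  // HDFS replicates the blocks
            }
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }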

❖ Summary
• Hadoop is a distributed system for processing large-scale datasets.
• It scales to thousands of nodes and petabytes of data.
• It has two main layers:
– HDFS: distributed file system (the NameNode is centralized).
– MapReduce: execution engine (the JobTracker is centralized).
• Simple data model: any format will fit.
• At query time, you specify how to read (or write) the data using input (or output) formats.
• Simple computation model based on Map and Reduce phases.
• Very efficient for aggregation and joins.
• Higher-level languages run on top of Hadoop: Hive, Jaql, Pig.

References

[1] A.M. Middleton, "Data-Intensive Technologies for Cloud Computing," in: Handbook of Cloud Computing, Springer, 2010.
[24] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (2008) 107–113.
[25] Apache Hadoop project, web page. http://hadoop.apache.org/.
[26] B. He, W. Fang, Q. Luo, N.K. Govindaraju, T. Wang, Mars: a MapReduce framework on graphics processors, in: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT’08, ACM, New York, NY, USA, 2008, pp. 260–269.
[27] R. Chen, H. Chen, B. Zang, Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling, in: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT’10, ACM, New York, NY, USA, 2010, pp. 523–534.
[28] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, H. Yang, FPMR: MapReduce framework on FPGA, in: Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA’10, ACM, New York, NY, USA, 2010, pp. 93–102.
[29] C. Ranger, R. Raghuraman, A. Penmetsa, G.R. Bradski, C. Kozyrakis, Evaluating MapReduce for multi-core and multiprocessor systems, in: 13th International Conference on High-Performance Computer Architecture, 2007, pp. 13–24.
[30] R.M. Yoo, A. Romano, C. Kozyrakis, Phoenix rebirth: scalable MapReduce on a large-scale shared-memory system, in: Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IEEE, Austin, TX, USA, 2009, pp. 198–207.
[31] M.M. Rafique, B. Rose, A.R. Butt, D.S. Nikolopoulos, Supporting MapReduce on large-scale asymmetric multi-core clusters, SIGOPS Operating Systems Review 43 (2009) 25–34.
[32] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, L. Qi, Cloudlet: towards MapReduce implementation on virtual machines, in: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, HPDC’09, ACM, New York, NY, USA, 2009, pp. 65–66.
[33] C. Jin, R. Buyya, MapReduce programming model for .NET-based cloud computing, in: Euro-Par, Lecture Notes in Computer Science, vol. 5704, Springer, 2009, pp. 417–428.
[34] S. Pallickara, J. Ekanayake, G. Fox, Granules: a lightweight, streaming runtime for cloud computing with support for Map-Reduce, in: CLUSTER, IEEE, New Orleans, Louisiana, USA, 2009, pp. 1–10.
[35] C. Miceli, M. Miceli, S. Jha, H. Kaiser, A. Merzky, Programming abstractions for data intensive computing on clouds and grids, in: Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, CCGRID’09, IEEE Computer Society, Washington, DC, USA, 2009, pp. 478–483.
[36] H. Lin, X. Ma, J. Archuleta, W.-C. Feng, M. Gardner, Z. Zhang, MOON: MapReduce on opportunistic environments, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC’10, ACM, New York, NY, USA, 2010, pp. 95–106.
[37] A. Dou, V. Kalogeraki, D. Gunopulos, T. Mielikainen, V.H. Tuulos, Misco: a MapReduce framework for mobile systems, in: Proceedings of the 3rd International Conference on PErvasive Technologies Related to Assistive Environments, PETRA’10, ACM, New York, NY, USA, 2010, pp. 32:1–32:8.
[38] Apache Hadoop on Demand (HOD), website. http://hadoop.apache.org/common/docs/r0.21.0/hod_scheduler.html.
[39] S. Krishnan, M. Tatineni, C. Baru, myHadoop: Hadoop-on-Demand on traditional HPC resources, Technical Report, University of California, San Diego, 2011.
