Wasit University

Report Title: Data-Intensive Computing and MapReduce / Hadoop
Grade:
Introduction:
Making the best use of a machine designed for data-intensive computing:
➢ Work on streams of data, not files
– Streams are not (so easily) searchable
– Streams are not (so easily) sortable
– Not all programs can benefit from this approach, and those that can might require extra work
➢ Use multiple threads and asynchronous I/O (a minimal sketch follows this list)
➢ If you are using files, use a library that does some of the hard work for you, e.g. MPI-IO
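As a hedged illustration of the threads-and-asynchronous-I/O point, here is a minimal Java sketch using the standard library's AsynchronousFileChannel to overlap computation with a pending read; the file name input.dat is a made-up example.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadDemo {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("input.dat"); // hypothetical input file
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            // Kick off the read without blocking the current thread.
            Future<Integer> pending = ch.read(buf, 0);
            // ... do other useful work here while the read is in flight ...
            int bytesRead = pending.get(); // block only when the data is needed
            System.out.println("Read " + bytesRead + " bytes");
        }
    }
}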
Conclusions
➢ Data-intensive computing is a new(ish) kind of computing
– necessitated by the huge amounts of data
– and offering new opportunities
➢ Need to think about new ways of doing computing
– It is usually parallel computing, but not "traditional HPC"
➢ Matters for data preservation. Either:
– you are preserving huge amounts of data that need to be easily reused, or
– you need to process large amounts of data to do a meaningful reduction, so that the stored data retains its value
MapReduce and Hadoop
➢ MapReduce paradigm
The MapReduce [24] programming model is inspired by two main functions
commonly used in functional programming: Map and Reduce. The Map function
processes key/value pairs to generate a set of intermediate key/value pairs, and the
Reduce function merges all intermediate values associated with the same
intermediate key. Many real-world
applications are expressed using this model. The most popular implementation of the
MapReduce model is the Hadoop framework [25], which allows applications to run on
large clusters built from commodity hardware. The Hadoop framework transparently
provides both reliability and data transfer. Other MapReduce implementations are
available for various architectures, such as for CUDA [26], in a multicore architecture
[27], in FPGA platforms [28], for a multiprocessor architecture [29], in a large-scale
shared-memory system [30], in a large-scale cluster [31], in multiple virtual machines
[32], in a .NET environment [33], in a streaming runtime environment [34], in a Grid
environment [35], in an opportunistic environment [36], and in a mobile computing
environment [37]. Apache Hadoop on Demand (HOD) [38] provides virtual
Hadoop clusters over a large physical cluster. It uses the Torque resource manager to
do node allocation. myHadoop [39] is a system for provisioning on-demand Hadoop
instances via traditional schedulers on HPC resources. To the best of our knowledge,
there is no existing implementation of the MapReduce paradigm that spans multiple
distributed clusters.
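To make the Map and Reduce functions concrete, here is a minimal word-count sketch in Hadoop's Java MapReduce API, along the lines of the canonical example in the Hadoop documentation: the Mapper emits an intermediate (word, 1) pair for every token, and the Reducer sums the values that share the same intermediate key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: tokenize each input line and emit an intermediate (word, 1) pair.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all counts that share the same word (intermediate key).
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}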
➢ Hadoop
The MapReduce programming model is designed to process large volumes of data in
parallel by dividing a Job into a set of independent Tasks. A Job, as the term is used
here, is a full MapReduce program: the execution of a Mapper and Reducer across an
entire data set. A Task is an execution of a single Mapper or Reducer on a slice of
that data. A MapReduce Job therefore usually splits the input data set into
independent chunks, which are processed by the map tasks in a completely parallel
manner. The Hadoop MapReduce framework consists of a single Master node, which
runs a JobTracker instance that accepts Job requests from a client node, and Slave
nodes, each running a TaskTracker instance. The JobTracker assumes responsibility
for distributing the software configuration to the Slave nodes, scheduling the job's
component tasks on the TaskTrackers, monitoring them, and reassigning tasks when
they fail. It is also responsible for providing status and diagnostic information to the
client. The TaskTrackers execute the tasks as directed by the JobTracker; each
TaskTracker executes tasks in separate Java processes so that several task instances
can run in parallel at the same time. Fig. 1 depicts the different components
of the MapReduce framework.
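A minimal driver sketch, again following the usual Hadoop tutorial pattern, shows how a client assembles and submits such a Job; the class names reuse the word-count sketch above, and the input/output paths taken from the command line are assumptions of the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    // Submit the job to the cluster (the JobTracker in classic MapReduce)
    // and poll for progress until it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}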
➢ HDFS
HDFS has several features that are desirable for massive parallel data processing: (1)
it runs on commodity clusters and tolerates hardware failures, (2) it provides
streaming data access, (3) it handles large data sets, (4) it employs a simple
coherency model, and (5) it is portable across heterogeneous hardware and software
platforms. HDFS has a master/slave architecture (Fig. 3). An HDFS cluster consists
of a single NameNode, a master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number of DataNodes,
usually one per node in the cluster, which manage the storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to be
stored in files. Internally, a file is split into one or more blocks, and these blocks are
stored in a set of DataNodes. The NameNode executes file system namespace
operations such as opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible for
serving read and write requests from the file system's clients. The DataNodes also
perform block creation, deletion, and replication upon instruction from the
NameNode.
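This division of labor is visible from the client side. The following minimal sketch uses Hadoop's FileSystem API to create and read back a file; the path /user/demo/hello.txt is hypothetical, and the NameNode address is assumed to be configured in the cluster's core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write: the NameNode allocates blocks on DataNodes;
    // the client then streams the data to those DataNodes.
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello, HDFS\n");
    }

    // Read: the NameNode returns block locations;
    // the bytes themselves are served by the DataNodes.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}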
❖ Summary
• Hadoop is a distributed system for processing large-scale datasets
• Scales to thousands of nodes and petabytes of data
• Two main layers:
  • HDFS: distributed file system (the NameNode is centralized)
  • MapReduce: execution engine (the JobTracker is centralized)
• Simple data model: any format will fit
• At query time, specify how to read (write) the data using input (output) formats, as shown in the sketch after this list
• Simple computation model based on the Map and Reduce phases
• Very efficient for aggregation and joins
• Higher-level languages are available on top of Hadoop
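To illustrate the point about input and output formats, the fragment below (with a made-up job name) selects the stock text formats at job-configuration time; this is a sketch of the mechanism, not a complete job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format demo");
    // The stored data is schema-free; how it is parsed is decided here,
    // at job submission time, by the chosen format classes.
    job.setInputFormatClass(TextInputFormat.class);   // read (offset, line) pairs
    job.setOutputFormatClass(TextOutputFormat.class); // write tab-separated key/value lines
  }
}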
References
1. A.M. Middleton, Data-intensive technologies for cloud computing, in: Handbook
of Cloud Computing, Springer, 2010.
2. [24] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large
clusters, Communications of the ACM 51 (2008) 107–113.
3. [25] Apache Hadoop project, Website. http://hadoop.apache.org/.
4. [26] B. He, W. Fang, Q. Luo, N.K. Govindaraju, T. Wang, Mars: a MapReduce
framework on graphics processors, in: Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques, PACT’08,
ACM, New York, NY, USA, 2008, pp. 260–269.
5. [27] R. Chen, H. Chen, B. Zang, Tiled-MapReduce: optimizing resource usages of
data-parallel applications on multicore with tiling, in: Proceedings of the 19th
International Conference on Parallel Architectures and Compilation Techniques,
PACT’10, ACM, New York, NY, USA, 2010, pp. 523–534.
6. [28] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, H. Yang, FPMR: MapReduce
framework on FPGA, in: Proceedings of the 18th Annual ACM/SIGDA International
Symposium on Field Programmable Gate Arrays, FPGA’10, ACM, New York,
NY, USA, 2010, pp. 93–102.
7. [29] C. Ranger, R. Raghuraman, A. Penmetsa, G.R. Bradski, C. Kozyrakis,
Evaluating MapReduce for multi-core and multiprocessor systems, in: 13th
International Conference on High-Performance Computer Architecture, 2007, pp.
13–24.
8. [30] R.M. Yoo, A. Romano, C. Kozyrakis, Phoenix rebirth: scalable MapReduce on
a large-scale shared-memory system, in: Proceedings of the 2009 IEEE
International Symposium on Workload Characterization, IEEE, Austin, TX, USA,
2009, pp. 198–207.
9. [31] M.M. Rafique, B. Rose, A.R. Butt, D.S. Nikolopoulos, Supporting MapReduce
on large-scale asymmetric multi-core clusters, SIGOPS Operating Systems Review
43 (2009) 25–34.
10. [32] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, L. Qi, Cloudlet: towards
MapReduce implementation on virtual machines, in: Proceedings of the 18th ACM
International Symposium on High Performance Distributed Computing, HPDC’09,
ACM, New York, NY, USA, 2009, pp. 65–66.
11. [33] C. Jin, R. Buyya, MapReduce programming model for .NET-based cloud
computing, in: Euro-Par, in: Lecture Notes in Computer Science, vol. 5704,
Springer, 2009, pp. 417–428.
12. [34] S. Pallickara, J. Ekanayake, G. Fox, Granules: a lightweight, streaming
runtime for cloud computing with support for map-reduce, in: CLUSTER, IEEE,
New Orleans, Louisiana, USA, 2009, pp. 1–10.
13. [35] C. Miceli, M. Miceli, S. Jha, H. Kaiser, A. Merzky, Programming
abstractions for data intensive computing on clouds and grids, in: Proceedings of
the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the
Grid, CCGRID’09, IEEE Computer Society, Washington, DC, USA, 2009, pp.
478–483.
14. [36] H. Lin, X. Ma, J. Archuleta, W.-C. Feng, M. Gardner, Z. Zhang, MOON:
MapReduce on opportunistic environments, in: Proceedings of the 19th ACM
International Symposium on High Performance Distributed Computing, HPDC’10,
ACM, New York, NY, USA, 2010, pp. 95–106.
15. [37] A. Dou, V. Kalogeraki, D. Gunopulos, T. Mielikainen, V.H. Tuulos, Misco: a
MapReduce framework for mobile systems, in: Proceedings of the 3rd International
Conference on PErvasive Technologies Related to Assistive Environments,
PETRA’10, ACM, New York, NY, USA, 2010, pp. 32:1–32:8.
16. [38] Apache Hadoop on Demand (HOD), Website.
http://hadoop.apache.org/common/docs/r0.21.0/hod_scheduler.html.
17. [39] S. Krishnan, M. Tatineni, C. Baru, myHadoop: Hadoop-on-Demand on
traditional HPC resources, University of California, San Diego, Technical Report,
2011.