Wasit University

Report Title: Data-Intensive Computing and MapReduce / Hadoop
Grade:
Introduction:
Making the best use of a machine designed for data-intensive computing:
➢ Work on streams of data, not files
– Streams are not (so easily) searchable
– Streams are not (so easily) sortable
– Not all programs can benefit from this approach, and those that can might require extra work
➢ Use multiple threads and asynchronous I/O (a minimal sketch follows this list)
➢ If you are using files, use a library that does some of the hard work for you, e.g. MPI-IO
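As a hedged illustration of the threads-and-asynchronous-I/O point, here is a minimal Java sketch using the standard library's AsynchronousFileChannel to overlap computation with a pending read; the file name input.dat is a made-up example.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncReadDemo {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("input.dat"); // hypothetical input file
        try (AsynchronousFileChannel ch =
                 AsynchronousFileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            // Kick off the read without blocking the current thread.
            Future<Integer> pending = ch.read(buf, 0);
            // ... do other useful work here while the read is in flight ...
            int bytesRead = pending.get(); // block only when the data is needed
            System.out.println("Read " + bytesRead + " bytes");
        }
    }
}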
Conclusions
➢ Data-intensive computing is a new(ish) kind of computing
– necessitated by the huge amounts of data
– and offering new opportunities
➢ Need to think about new ways of doing computing
– It is usually parallel computing, but not "traditional HPC"
➢ Matters for data preservation. Either:
– you are preserving huge amounts of data that need to be easily reused, or
– you need to process large amounts of data to do a meaningful reduction, so that the stored data retains its value
MapReduce and Hadoop
➢ MapReduce paradigm
The MapReduce [24] programming model is inspired by two main functions
commonly used in functional programming: Map and Reduce. The Map function
processes key/value pairs to generate a set of intermediate key/value pairs, and the
Reduce function merges all intermediate values associated with the same
intermediate key. Many real-world
applications are expressed using this model. The most popular implementation of the
MapReduce model is the Hadoop framework [25], which allows applications to run on
large clusters built from commodity hardware. The Hadoop framework transparently
provides both reliability and data transfer. Other MapReduce implementations are
available for various architectures, such as for CUDA [26], in a multicore architecture
[27], in FPGA platforms [28], for a multiprocessor architecture [29], in a large-scale
shared-memory system [30], in a large-scale cluster [31], in multiple virtual machines
[32], in a .NET environment [33], in a streaming runtime environment [34], in a Grid
environment [35], in an opportunistic environment [36], and in a mobile computing
environment [37]. Apache Hadoop on Demand (HOD) [38] provides virtual
Hadoop clusters over a large physical cluster. It uses the Torque resource manager to
do node allocation. myHadoop [39] is a system for provisioning on-demand Hadoop
instances via traditional schedulers on HPC resources. To the best of our knowledge,
there is no existing implementation of the MapReduce paradigm that spans multiple
distributed clusters.
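To make the Map and Reduce functions concrete, here is a minimal word-count sketch in Hadoop's Java MapReduce API, along the lines of the canonical example in the Hadoop documentation: the Mapper emits an intermediate (word, 1) pair for every token, and the Reducer sums the values that share the same intermediate key.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: tokenize each input line and emit an intermediate (word, 1) pair.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum all counts that share the same word (intermediate key).
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}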
➢ Hadoop
The MapReduce programming model is designed to process large volumes of data in
parallel by dividing a Job into a set of independent Tasks. A Job, as the term is used
here, is a full MapReduce program: the execution of a Mapper and Reducer across an
entire data set. A Task is an execution of a single Mapper or Reducer on a slice of
that data. A MapReduce Job therefore usually splits the input data set into
independent chunks, which are processed by the map tasks in a completely parallel
manner. The Hadoop MapReduce framework consists of a single Master node, which
runs a JobTracker instance that accepts Job requests from a client node, and Slave
nodes, each running a TaskTracker instance. The JobTracker assumes responsibility
for distributing the software configuration to the Slave nodes, scheduling the job's
component tasks on the TaskTrackers, monitoring them, and reassigning tasks when
they fail. It is also responsible for providing status and diagnostic information to the
client. The TaskTrackers execute the tasks as directed by the JobTracker; each
TaskTracker executes tasks in separate Java processes so that several task instances
can run in parallel at the same time. Fig. 1 depicts the different components
of the MapReduce framework.
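A minimal driver sketch, again following the usual Hadoop tutorial pattern, shows how a client assembles and submits such a Job; the class names reuse the word-count sketch above, and the input/output paths taken from the command line are assumptions of the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    // Submit the job to the cluster (the JobTracker in classic MapReduce)
    // and poll for progress until it completes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}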
➢ HDFS
HDFS has several features that are desirable for massive parallel data processing: (1)
it runs on commodity clusters and tolerates hardware failures, (2) it provides
streaming data access, (3) it handles large data sets, (4) it employs a simple
coherency model, and (5) it is portable across heterogeneous hardware and software
platforms. HDFS has a master/slave architecture (Fig. 3). An HDFS cluster consists
of a single NameNode, a master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number of DataNodes,
usually one per node in the cluster, which manage the storage attached to the nodes
that they run on. HDFS exposes a file system namespace and allows user data to be
stored in files. Internally, a file is split into one or more blocks, and these blocks are
stored in a set of DataNodes. The NameNode executes file system namespace
operations such as opening, closing, and renaming files and directories. It also
determines the mapping of blocks to DataNodes. The DataNodes are responsible for
serving read and write requests from the file system's clients. The DataNodes also
perform block creation, deletion, and replication upon instruction from the
NameNode.
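This division of labor is visible from the client side. The following minimal sketch uses Hadoop's FileSystem API to create and read back a file; the path /user/demo/hello.txt is hypothetical, and the NameNode address is assumed to be configured in the cluster's core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write: the NameNode allocates blocks on DataNodes;
    // the client then streams the data to those DataNodes.
    Path file = new Path("/user/demo/hello.txt"); // hypothetical path
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeBytes("hello, HDFS\n");
    }

    // Read: the NameNode returns block locations;
    // the bytes themselves are served by the DataNodes.
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}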
❖ Summary
• Hadoop is a distributed system for processing large-scale datasets
• Scales to thousands of nodes and petabytes of data
• Two main layers:
  • HDFS: distributed file system (the NameNode is centralized)
  • MapReduce: execution engine (the JobTracker is centralized)
• Simple data model: any format will fit
• At query time, specify how to read (write) the data using input (output) formats, as shown in the sketch after this list
• Simple computation model based on the Map and Reduce phases
• Very efficient for aggregation and joins
• Higher-level languages are available on top of Hadoop
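To illustrate the point about input and output formats, the fragment below (with a made-up job name) selects the stock text formats at job-configuration time; this is a sketch of the mechanism, not a complete job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format demo");
    // The stored data is schema-free; how it is parsed is decided here,
    // at job submission time, by the chosen format classes.
    job.setInputFormatClass(TextInputFormat.class);   // read (offset, line) pairs
    job.setOutputFormatClass(TextOutputFormat.class); // write tab-separated key/value lines
  }
}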
References
1. A.M. Middleton, Data-intensive technologies for cloud computing, in: Handbook
of Cloud Computing, Springer, 2010.
2. [24] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large
clusters, Communications of the ACM 51 (2008) 107–113.
3. [25] Apache Hadoop project, Website. http://hadoop.apache.org/.
4. [26] B. He, W. Fang, Q. Luo, N.K. Govindaraju, T. Wang, Mars: a MapReduce
framework on graphics processors, in: Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques, PACT’08,
ACM, New York, NY, USA, 2008, pp. 260–269.
5. [27] R. Chen, H. Chen, B. Zang, Tiled-MapReduce: optimizing resource usages of
data-parallel applications on multicore with tiling, in: Proceedings of the 19th
International Conference on Parallel Architectures and Compilation Techniques,
PACT’10, ACM, New York, NY, USA, 2010, pp. 523–534.
6. [28] Y. Shan, B. Wang, J. Yan, Y. Wang, N. Xu, H. Yang, FPMR: MapReduce
framework on FPGA, in: Proceedings of the 18th Annual ACM/SIGDA International
Symposium on Field Programmable Gate Arrays, FPGA’10, ACM, New York,
NY, USA, 2010, pp. 93–102.
7. [29] C. Ranger, R. Raghuraman, A. Penmetsa, G.R. Bradski, C. Kozyrakis,
Evaluating MapReduce for multi-core and multiprocessor systems, in: 13th
International Conference on High-Performance Computer Architecture, 2007, pp.
13–24.
8. [30] R.M. Yoo, A. Romano, C. Kozyrakis, Phoenix rebirth: scalable MapReduce on
a large-scale shared-memory system, in: Proceedings of the 2009 IEEE
International Symposium on Workload Characterization, IEEE, Austin, TX, USA,
2009, pp. 198–207.
9. [31] M.M. Rafique, B. Rose, A.R. Butt, D.S. Nikolopoulos, Supporting MapReduce
on large-scale asymmetric multi-core clusters, SIGOPS Operating Systems Review
43 (2009) 25–34.
10. [32] S. Ibrahim, H. Jin, B. Cheng, H. Cao, S. Wu, L. Qi, Cloudlet: towards
MapReduce implementation on virtual machines, in: Proceedings of the 18th ACM
International Symposium on High Performance Distributed Computing, HPDC’09,
ACM, New York, NY, USA, 2009, pp. 65–66.
11. [33] C. Jin, R. Buyya, MapReduce programming model for .NET-based cloud
computing, in: Euro-Par, in: Lecture Notes in Computer Science, vol. 5704,
Springer, 2009, pp. 417–428.
12. [34] S. Pallickara, J. Ekanayake, G. Fox, Granules: a lightweight, streaming
runtime for cloud computing with support for map-reduce, in: CLUSTER, IEEE,
New Orleans, Louisiana, USA, 2009, pp. 1–10.
13. [35] C. Miceli, M. Miceli, S. Jha, H. Kaiser, A. Merzky, Programming
abstractions for data intensive computing on clouds and grids, in: Proceedings of
the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the
Grid, CCGRID’09, IEEE Computer Society, Washington, DC, USA, 2009, pp.
478–483.
14. [36] H. Lin, X. Ma, J. Archuleta, W.-C. Feng, M. Gardner, Z. Zhang, MOON:
MapReduce on opportunistic environments, in: Proceedings of the 19th ACM
International Symposium on High Performance Distributed Computing, HPDC’10,
ACM, New York, NY, USA, 2010, pp. 95–106.
15. [37] A. Dou, V. Kalogeraki, D. Gunopulos, T. Mielikainen, V.H. Tuulos, Misco: a
MapReduce framework for mobile systems, in: Proceedings of the 3rd International
Conference on PErvasive Technologies Related to Assistive Environments,
PETRA’10, ACM, New York, NY, USA, 2010, pp. 32:1–32:8.
16. [38] Apache Hadoop on Demand (HOD), Website.
http://hadoop.apache.org/common/docs/r0.21.0/hod_scheduler.html.
17. [39] S. Krishnan, M. Tatineni, C. Baru, myHadoop: Hadoop-on-Demand on
traditional HPC resources, University of California, San Diego, Technical Report,
2011.