Overview: This document discusses data-intensive computing, which involves the production, manipulation, and analysis of large datasets ranging from hundreds of megabytes to petabytes in size. It characterizes the challenges of data-intensive applications, including scalable algorithms, metadata management, high-performance computing platforms, distributed file systems, and data reduction techniques, and then gives a historical perspective on the technologies that have enabled data-intensive computing, such as high-speed networking, data grids, cloud computing, databases, and programming models like MapReduce.
CHAPTER 8 Data-Intensive Computing
What is data-intensive computing?
• Data-intensive computing is concerned with the production, manipulation, and analysis of large-scale data, ranging from hundreds of megabytes (MB) to petabytes (PB).

Characterizing data-intensive computations
• Data-intensive applications not only deal with huge volumes of data but, very often, also exhibit compute-intensive properties.
• Datasets are commonly persisted in several formats and distributed across different locations.

Challenges ahead
• Scalable algorithms that can search and process massive datasets
• New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources
• Advances in high-performance computing platforms aimed at providing better support for accessing in-memory multiterabyte data structures
• High-performance, highly reliable, petascale distributed file systems
• Data signature-generation techniques for data reduction and rapid processing
• New approaches to software mobility for delivering algorithms that are able to move the computation to where the data are located
• Specialized hybrid interconnection architectures that provide better support for filtering multigigabyte data streams coming from high-speed networks and scientific instruments
• Flexible and high-performance software integration techniques

Historical perspective
Data-intensive computing emerged from the combined evolution of storage, networking technologies, algorithms, and infrastructure software:
• The early age: high-speed wide-area networking
• Data grids
• Data clouds and “Big Data”
• Databases and data-intensive computing

The early age: high-speed wide-area networking
• In 1989, the first experiments in high-speed networking as a support for remote visualization of scientific data led the way.
• Two years later, the potential of using high-speed wide-area networks for enabling high-speed, TCP/IP-based distributed applications was demonstrated at Supercomputing 1991.
• Another important milestone was set with the Clipper project.

Data grids
• A data grid combines huge computational power with large storage facilities.
• A data grid provides services that help users discover, transfer, and manipulate large datasets stored in distributed repositories.
• Data grids offer two main functionalities: high-performance and reliable file transfer for moving large amounts of data, and scalable replica discovery and management for easy access to distributed datasets.

Data grids exhibit characteristics that introduce new challenges:
• Massive datasets
• Shared data collections
• Unified namespace
• Access restrictions

Data clouds and “Big Data”
• Large datasets are no longer exclusive to scientific computing: companies centered on searching, online advertising, and social media face the same scale.
• It is critical for such companies to efficiently analyze these huge datasets because they constitute a precious source of information about their customers.
• Log analysis is an example of such a data-intensive operation.

Cloud technologies support data-intensive computing in several ways:
• By providing a large number of compute instances on demand
• By providing a storage system suited to large datasets
• By providing frameworks and programming APIs

Databases and data-intensive computing
• Distributed database technologies also support data-intensive computing.
• Data-intensive computing concerns the development of applications that are mainly focused on processing large quantities of data.

Storage systems
Several trends are reshaping storage:
• Growing popularity of Big Data
• Growing importance of data analytics in the business chain
• Presence of data in several forms, not only structured
• New approaches and technologies for computing

The main storage technologies are:
• High-performance distributed file systems and storage clouds
– Lustre
– IBM General Parallel File System (GPFS)
– Google File System (GFS)
– Sector
– Amazon Simple Storage Service (S3)
• NoSQL systems
– Apache CouchDB and MongoDB
– Amazon Dynamo
– Google Bigtable
– Hadoop HBase

High-performance distributed file systems and storage clouds

Lustre
• The Lustre file system is a massively parallel distributed file system that covers needs ranging from a small workgroup of clusters to a large-scale computing cluster.
• The file system is used by several of the Top 500 supercomputing systems.
• Lustre is designed to provide access to petabytes (PB) of storage and to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s).

IBM General Parallel File System (GPFS)
• A high-performance distributed file system developed by IBM, originally in support of the RS/6000 supercomputer and Linux computing clusters.
• GPFS is built on the concept of shared disks.
• GPFS distributes the metadata of the entire file system and provides transparent access to it.

Google File System (GFS)
• GFS supports the distributed applications in Google’s computing cloud.
• The system has been designed to be a fault-tolerant, highly available, distributed file system built on commodity hardware and standard Linux operating systems.
• It is optimized for large files.
• Workloads primarily consist of two kinds of reads: large streaming reads and small random reads.

Sector
• Sector is a storage cloud that supports the execution of data-intensive applications.
• It is deployed on commodity hardware across a wide-area network.
• Compared to other file systems, Sector does not partition a file into blocks; it replicates entire files on multiple nodes.
• The system’s architecture is composed of four components: a security server, one or more master nodes, slave nodes, and client machines.

Amazon Simple Storage Service (S3)
• Amazon S3 is the online storage service provided by Amazon.
• It is designed to support high availability, reliability, scalability, and virtually infinite storage.
• The system offers a flat storage space organized into buckets.
• Each bucket can store multiple objects, each identified by a unique key. Objects are identified by unique URLs and exposed through HTTP.

NoSQL systems
• Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore)
• Graphs (AllegroGraph, Neo4j, FlockDB, Cerebrum)
• Multivalue databases (OpenQM, Rocket U2, OpenInsight)
• Graphs and multivalue aside, the systems below are covered in more detail.
• Object databases (ObjectStore, JADE, ZODB)
• Tabular stores (Google Bigtable, Hadoop HBase, Hypertable)
• Tuple stores (Apache River)

Apache CouchDB and MongoDB
– Document stores
– Schema-less
– Expose a RESTful interface and represent data in JSON format
– Allow querying and indexing data by using the MapReduce programming model
– Use JavaScript as the base language for data querying and manipulation rather than SQL

Amazon Dynamo
– The main goal of Dynamo is to provide an incrementally scalable and highly available storage system, serving 10 million requests per day.
– Objects are stored and retrieved through a unique identifier (key).

Google Bigtable
– Scales up to petabytes of data across thousands of servers.
– Bigtable provides storage support for several Google applications.
– Bigtable’s key design goals are wide applicability, scalability, high performance, and high availability.
– Bigtable organizes the data storage in tables whose rows are distributed over the distributed file system supporting the middleware.

Apache Cassandra
– Designed for managing large amounts of structured data spread across many commodity servers.
– Cassandra was initially developed by Facebook.
– Currently, it provides storage support for several very large Web applications such as Facebook, Digg, and Twitter.
– A second-generation distributed database, built around the column-family data model.

Hadoop HBase
– The distributed database underlying the Hadoop distributed programming platform.
– HBase is designed by taking inspiration from Google Bigtable.
– Its main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware.

Programming platforms
• Data-intensive applications must process large quantities of information, so they require runtime systems able to efficiently manage huge volumes of data.
• Database management systems based on the relational model have proved unsuccessful in this context, because the data are often:
– Unstructured or semistructured
– Stored as files of large size, or as a huge number of medium-sized files, rather than as rows in a database
– Processed through distributed workflows

The MapReduce programming model
• The model is expressed in terms of two functions, map and reduce, and was introduced by Google for processing large quantities of data.
• Data transfer and management are completely handled by the distributed storage infrastructure.

Examples of MapReduce
• Distributed grep
– Recognition of patterns within text streams
• Count of URL-access frequency
– Key-value pairs <URL, 1> reduced to <URL, total-count>
• Reverse Web-link graph
– <target, source> pairs reduced to <target, list(source)>
• Term vector per host
– Word counting
• Inverted index
– <word, document-id> pairs reduced to <word, list(document-id)>
• Distributed sort
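To make the model concrete, here is a minimal, framework-free sketch of word counting in plain Python. A real deployment would rely on a runtime such as Hadoop for partitioning, data movement, and fault tolerance; the in-process shuffle below only stands in for that machinery.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Map: emit a <word, 1> pair for every word occurrence.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: sum all values associated with the same key.
    return word, sum(counts)

documents = {1: "the cat sat", 2: "the cat ran"}

# Map phase (embarrassingly parallel in a real runtime).
intermediate = [kv for d, t in documents.items() for kv in map_fn(d, t)]

# Shuffle phase: group intermediate pairs by key.
intermediate.sort(key=itemgetter(0))
grouped = groupby(intermediate, key=itemgetter(0))

# Reduce phase.
result = dict(reduce_fn(w, (c for _, c in g)) for w, g in grouped)
print(result)   # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```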
• Statistical algorithms such as Support Vector Machines (SVM), Linear Regression (LR), Naive Bayes (NB), and Neural Networks (NN) can be expressed in terms of MapReduce computations with two major stages (a minimal sketch follows the framework list below):
– Analysis: operates directly on the data input file and is embarrassingly parallel.
– Aggregation: operates on the intermediate results of the previous stage and is aimed at aggregating, summing, and/or elaborating them to present the data in their final form.

Variations and extensions of MapReduce
• MapReduce constitutes a simplified model for processing large quantities of data.
• The model can be applied to several different problem scenarios.
• Variations and extensions aim at enlarging the MapReduce application space and providing developers with an easier interface for designing distributed algorithms.

Frameworks
• Hadoop
• Pig
• Hive
• Map-Reduce-Merge
• Twister
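As a concrete instance of the Analysis/Aggregation pattern above, here is a minimal Python sketch of linear regression over partitioned data. The shard layout, helper names, and the normal-equations solve are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

# Hypothetical partitioned input: each shard is an (X, y) block of the dataset.
def make_shards(n_shards=4, rows=100, features=3, seed=0):
    rng = np.random.default_rng(seed)
    w_true = np.array([2.0, -1.0, 0.5])
    shards = []
    for _ in range(n_shards):
        X = rng.normal(size=(rows, features))
        y = X @ w_true + rng.normal(scale=0.01, size=rows)
        shards.append((X, y))
    return shards

# Analysis stage (map): embarrassingly parallel over shards; each mapper
# emits the sufficient statistics (X^T X, X^T y) of its shard.
def analysis(shard):
    X, y = shard
    return X.T @ X, X.T @ y

# Aggregation stage (reduce): sum the partial statistics and solve the
# normal equations (X^T X) w = X^T y once, on the aggregated values.
def aggregation(partials):
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

if __name__ == "__main__":
    partials = [analysis(s) for s in make_shards()]  # would run in parallel
    print(aggregation(partials))                     # ~ [2.0, -1.0, 0.5]
```

Note that only the small (features x features) intermediate matrices cross the network, which is exactly why the analysis stage parallelizes so well over massive inputs.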
Hadoop
• Apache Hadoop is an open-source software framework used for distributed storage and processing of big-data datasets using the MapReduce programming model.
• Initially developed and supported by Yahoo!, which runs it on 40,000 machines and more than 300,000 cores.
• http://hadoop.apache.org/

Pig
• A platform that allows the analysis of large datasets.
• It offers a high-level language for expressing data analysis programs.
• Developers express their data analysis programs in a textual language called Pig Latin.
• https://pig.apache.org/

Hive
• Provides a data warehouse infrastructure on top of Hadoop MapReduce.
• It provides tools for easy data summarization and for querying data in the style of a classical data warehouse.
• https://hive.apache.org/

Map-Reduce-Merge
• Map-Reduce-Merge is an extension of the MapReduce model that adds a phase for merging data already partitioned and sorted by the map and reduce modules.

Twister
• An extension of the MapReduce model that allows the creation of iterative executions of MapReduce jobs (a minimal sketch follows the steps below):
1. Configure Map
2. Configure Reduce
3. While Condition Holds True Do
   a. Run MapReduce
   b. Apply Combine Operation to Result
   c. Update Condition
4. Close
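The following is a minimal, framework-free Python sketch of this iterative pattern, using a trivial one-dimensional k-means job. Twister itself is a Java runtime, so the function names and the in-process loop here are purely illustrative.

```python
import random

def map_phase(points, centroids):
    # Map: assign each point to its nearest centroid, emit (centroid_id, point).
    out = []
    for p in points:
        cid = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        out.append((cid, p))
    return out

def reduce_phase(pairs, k):
    # Reduce: recompute each centroid as the mean of its assigned points.
    sums, counts = [0.0] * k, [0] * k
    for cid, p in pairs:
        sums[cid] += p
        counts[cid] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

if __name__ == "__main__":
    random.seed(1)
    data = [random.gauss(0, 1) for _ in range(100)] + \
           [random.gauss(10, 1) for _ in range(100)]
    centroids = [0.0, 1.0]                 # 1-2. configure map and reduce
    while True:                            # 3. while condition holds true do
        pairs = map_phase(data, centroids)           # a. run MapReduce
        new = reduce_phase(pairs, len(centroids))    # b. combine the result
        if max(abs(a - b) for a, b in zip(new, centroids)) < 1e-6:
            break                          # c. update (here: test) condition
        centroids = new
    print(centroids)                       # 4. close: converges near [0, 10]
```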
Alternatives to MapReduce
• Sphere
• All-Pairs
• DryadLINQ

Sphere
• Sphere operates on data stored in the Sector Distributed File System (SDFS).
• Sphere implements the stream processing model (Single Program, Multiple Data).
• Computations are expressed as user-defined functions (UDFs).
• It is built on top of Sector’s API for data access.
• UDFs are expressed in terms of programs that read and write streams.
• A Sphere client sends a request for processing to the master node, which returns the list of available slaves, and the client chooses the slaves on which to execute Sphere processes.

All-Pairs
• Targets all-pairs comparison problems such as biometrics. The model proceeds in four steps:
(1) model the system;
(2) distribute the data;
(3) dispatch batch jobs; and
(4) clean up the system.

DryadLINQ
• A Microsoft Research project that investigates programming models for writing parallel and distributed programs, scaling from a small cluster to a large datacenter.
• It automatically parallelizes the execution of applications without requiring the developer to know about distributed and parallel programming.

Aneka MapReduce programming
• Aneka supports developing MapReduce applications through its Mapper and Reducer abstractions, exposed by the Aneka MapReduce APIs.
• Three classes are of importance for application development:
– Mapper<K,V>
– Reducer<K,V>
– MapReduceApplication<M,R>
• The submission and execution of a MapReduce job is performed through the class MapReduceApplication<M,R>.

Map Function APIs
• IMapInput<K,V> provides access to the input key-value pair on which the map operation is performed (see the illustrative sketch below).
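Aneka's actual APIs are C#/.NET generics; the following Python transliteration is only meant to show the shape of the Map-function pattern described above. MapInput stands in for IMapInput<K,V>, and the emit method is assumed from the word-count example rather than quoted from the real API.

```python
class MapInput:
    """Stand-in for IMapInput<K,V>: exposes the input key-value pair."""
    def __init__(self, key, value):
        self.key, self.value = key, value

class WordCounterMapper:
    """Stand-in for a Mapper<K,V> subclass: splits a line into words."""
    def __init__(self):
        self.emitted = []              # collected (key, value) pairs

    def emit(self, key, value):
        self.emitted.append((key, value))

    def map(self, map_input):          # Map(IMapInput<K,V> input) in Aneka
        for word in map_input.value.split():
            self.emit(word, 1)         # one <word, 1> pair per occurrence

mapper = WordCounterMapper()
mapper.map(MapInput(0, "the quick brown fox jumps over the lazy dog"))
print(mapper.emitted[:3])              # [('the', 1), ('quick', 1), ('brown', 1)]
```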
Reduce Function APIs
• Reduce(IReduceInputEnumerator<V> input)
• The reduce operation is applied to a collection of values that are mapped to the same key.

MapReduceApplication<M,R>
• Jobs are submitted with the InvokeAndWait method inherited from ApplicationBase<M,R>.
• The word-count example is implemented through the WordCounterMapper and WordCounterReducer classes.
• The parameters that can be controlled:
– Partitions
– Attempts
– SynchReduce
– IsInputReady
– FetchResults
– LogFile
• WordCounter job: driver program.

Runtime support
• Task scheduling: the MapReduceScheduler class.
• Task execution: the MapReduceExecutor class.

Distributed file system support
• Unlike the other programming models Aneka supports, the MapReduce model does not leverage the default Storage Service for storage and data transfer but uses a distributed file system implementation.
• Its file management requirements are significantly different from those of the other models.
• Distributed file system implementations guarantee high availability and better efficiency.
• Aneka provides the capability of interfacing with different storage implementations by:
– Retrieving the location of files and file chunks
– Accessing a file by means of a stream
• File-based input and output are handled through the SeqReader and SeqWriter classes.

Example application
• MapReduce is a very useful model for processing large quantities of data, which in many cases are semistructured, such as logs or Web pages.
• This example parses the logs produced by the Aneka container.

Parsing Aneka logs
• Aneka components produce a lot of information that is stored in the form of log files.
• Each log line has the format: DD MMM YY hh:mm:ss level - message
• The full example comprises the mapper design and implementation, the reducer design and implementation, the driver program, and the results; a minimal sketch follows.
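Here is a minimal Python sketch of the log-parsing job, counting messages per log level. The sample lines, the regular expression, and the in-process driver are illustrative assumptions; the real Aneka example is written against the C#/.NET MapReduce APIs.

```python
import re
from collections import defaultdict

# Hypothetical sample lines in the "DD MMM YY hh:mm:ss level - message" format.
LOG = [
    "15 Mar 11 10:22:05 INFO  - SchedulerService: scheduling task 42",
    "15 Mar 11 10:22:07 ERROR - ExecutorService: task 42 failed",
    "15 Mar 11 10:22:09 INFO  - SchedulerService: rescheduling task 42",
]

LINE = re.compile(r"^\d{2} \w{3} \d{2} \d{2}:\d{2}:\d{2} (\w+)\s*- (.*)$")

def mapper(line):
    """Emit a <level, 1> pair for each well-formed log line."""
    m = LINE.match(line)
    if m:
        yield m.group(1), 1

def reducer(pairs):
    """Sum the values for each key, yielding <level, total> pairs."""
    totals = defaultdict(int)
    for level, count in pairs:
        totals[level] += count
    return dict(totals)

if __name__ == "__main__":
    # Driver: run the map phase over every line, then reduce the results.
    intermediate = [kv for line in LOG for kv in mapper(line)]
    print(reducer(intermediate))   # {'INFO': 2, 'ERROR': 1}
```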