
Cloud Short Note by Dipu

Slide 1
Data: facts and statistics collected together for reference or analysis.

Big data is defined as large pools of data that can be captured, communicated, aggregated, stored, and analyzed.

What do we do with data?

Store, Access, Share, Process, Encrypt & so on.

Requirements to Transform IT to a Service (CIRPP EMES)

 Connectivity – Internet
 Interactivity - Web 2.0 & 3.0
 Reliability – Fault Tolerance
 Performance – Parallel or Distributed Computing
 Pay-as-you-Go – Utility computing
 Ease of Programmability – Programming Model
 Manage Large Amounts of Data – Big data
 Efficiency (Cost & Power) – Storage technology
 Scalability & Elasticity – Virtualization Technology

Cloud Computing is the delivery of computing as a service rather than a product, whereby shared resources,
software, and information are provided to computers and other devices, as a metered service over a network.

Why Cloud Computing?

 Pay-as-You-Go economic model


 Simplified IT management
 Scale quickly and effortlessly
 Flexible options
 Resource Utilization is improved
 Carbon Footprint decreased

Applications Enabled by Cloud Computing

 Startup Business
 Seasonal Business
 Research Computing
 Changing computational power over time

Technical Challenges

 Programming is tricky but improving


 Tools are continuously evolving
 Moving large data is still expensive
 Security
 Quality of Service
 Green computing
 Internet Dependence
Non-Technical Challenges

 Vendor Lock-In
 Non-standardized
 Security
 Risks
 Privacy
 Legal
 Service Level Agreements

Slide – 2
Servers are computers that provide “services” to “clients”. They are typically designed for reliability and to
service a large number of requests. Organizations typically require many physical servers to provide various
services (Web, Email, Database, etc.)

Equipment (e.g., servers) is typically placed in racks. A single rack can hold up to 42 1U servers.

A blade server is a stripped down computer with a modular design. A blade enclosure holds multiple blade
servers and provides power, interfaces and cooling for the individual blade servers.

A data center is a facility used to house computer systems and associated components, such as networking and
storage systems, cooling, an uninterruptible power supply, and air filters.

Data Center Components (FAR PCM)

 Air conditioning
 Redundant Power
 Fire protection
 Physical security
 Monitoring Systems
 Connectivity

The Network of a Modern Data Center

Communication In Data Centers


 Based on networks running the IP protocol suite
 Contain a set of routers and switches

Traffic in today’s data centers:

 80% of the packets stay inside the data center


 Trend is towards even more internal communication

Typically, data centers run two kinds of applications:

 Outward facing (serving web pages to users)


 Internal computation (data mining and index computations–think of MapReduce and HPC)

Communication Latency

Propagation delay in the data center is essentially zero: light travels about a foot in a nanosecond.

End-to-end latency comes from:

 Switching latency: 10G to 10G, ~2.5 usec (store-and-forward) or ~2 usec (cut-through)
 Queuing latency: depends on the size of queues and network load

Typical times across a quiet data center: 10 to 20 usec

Elasticity and Performance

 Bare data centers make it hard for applications to grow/shrink


 VLANs can be used to isolate applications from each other
 IP addresses are topologically determined by Access Routers

Power in Data Centers

Reasonably good data centers have an efficiency ratio of 1.7, i.e., 0.7 W is lost for each 1 W delivered to the servers.

A conventional server uses 200 to 500 W.

The total power consumed by switches amortizes to 10 to 20 W per server.

Utilization In Data Centers

Utilization of 10% to 30% is considered “good” in data centers

Causes:

 Uneven application fit


 Long provisioning timescales
 Uncertainty in demand
 Risk management

Two main requirements:

 Elasticity – enabled by virtualization
 Performance at scale – enabled by programming models and distributed file systems

SaaS: Software is delivered as a service over the Internet, eliminating the need to install and run the application
on the customer's own computer.

Attributes: configurability, multi-tenant efficiency, scalability

PaaS: The cloud provider exposes a set of tools (a platform) that allows users to create SaaS applications.

IaaS: The cloud provider leases Virtual Machine instances (i.e., compute infrastructure) to users using virtualization technology. The user has access to a standard operating system environment and can install and configure all the layers above it.

The cloud Stack:

 Application: Web applications


 Data: cloud-specific databases and data management systems, e.g., HBase, Cassandra, Hive, Pig, etc.
 Runtime: runtime platforms to support cloud programming models, e.g., MPI, MapReduce, Pregel, etc.
 Middleware: Resource Management, Monitoring, Provisioning, Identity Management and Security
 Operating System: AMI
 Virtualization: the key component; resource virtualization (e.g., EC2’s Xen-based virtualization platform)
 Servers
 Storage

Cloud Service Layers in the Service Levels

Types of Cloud:

 Public: Open market for on demand computing and IT resources


 Private: For enterprises/corporations with large scale IT
 Hybrid: Extend the private cloud(s) by connecting it to other external cloud vendors to make use of their
available cloud services

Cloud Burst: use the local (private) cloud, and when you need more resources, burst into the public cloud.

Primary costs associated with software:

 Software Cost (Media + License cost/user)


 Support Cost (Vendor Support, Updates and Patches etc.)
 Management Cost (IT Infrastructure costs, Manpower, etc.)

Software Models:

 Traditional (Classical): the vendor develops software and charges a license fee per user; management is handled by the client (e.g., Oracle)
 Open Source (Free): the software is available at little or no cost, but support carries a high charge; management costs around 4x the cost of the software
 Outsourcing: the primary cost of management is manpower, which is reduced through cheaper labor costs
 Hybrid: quickly configured and deployed; support is automated through remote access; easy-to-deploy software is sold to many clients
 Hybrid+: a flat monthly fee is charged for the software, support, and management
 SaaS: develop a web application and offer it to customers over the Internet; no deployment costs, and the lowest management and support costs spread over many clients
Slide 3
Amdahl’s Law: Suppose that the sequential execution of a program takes T1 time units and the parallel execution
on p processors takes Tp time units.

Suppose that, out of the entire execution of the program, an s fraction is not parallelizable while the remaining (1 - s) fraction is
parallelizable. Amdahl’s Law then bounds the speedup: T1/Tp = 1 / (s + (1 - s)/p), which approaches 1/s as p grows large.
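
A small numeric sketch of this bound in plain Python (the function name and example values are illustrative, not from the slides):

# amdahl.py - illustrate the Amdahl's Law speedup bound
def amdahl_speedup(s, p):
    # Upper bound on speedup when an s fraction is serial and p processors are used
    return 1.0 / (s + (1.0 - s) / p)

if __name__ == "__main__":
    # With 10% serial work, even 1000 processors give less than 10x speedup
    for p in (2, 8, 64, 1000):
        print(p, round(amdahl_speedup(0.1, p), 2))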

In order to efficiently benefit from parallelization, we ought to follow these guidelines:

 Maximize the fraction of our program that can be parallelized


 Balance the workload of parallel processes
 Minimize the time spent for communication

Parallel Computer Architectures

 Multi-Chip Multiprocessor
 Single-Chip Multiprocessor

Multi-Chip Multiprocessor

 Symmetric Multiprocessor (SMP)


1. Shared memory that can be accessed equally from all processors
2. A single OS controls
3. Example: Intel Xeon Scalable Processors, AMD EPYC Processors, Sun/Oracle SPARC Enterprise
Servers, etc.
 Massively Parallel Multiprocessor (MPP)
1. Consists of nodes with each having its own processor, memory and I/O subsystem
2. An independent OS runs at each node
2. Examples: IBM Power Systems, Cray supercomputers, NVIDIA DGX systems, High-Performance
Computing (HPC) clusters, etc.

 Distributed Shared Memory (DSM)


1. Typically built on a similar hardware model as MPP
2. Provides a shared address space
3. Typically a single OS controls a DSM system

Moore’s Law

As chip manufacturing technology improves, transistors are getting smaller and smaller and it is possible to put
more of them on a chip. This empirical observation is often called Moore’s Law (# of transistors doubles every
18 to 24 months)

 Option 1 (spend the extra transistors on bigger caches): at some point, increasing the cache size may only increase the hit rate from 99% to 99.5%,
which does not improve application performance much
 Option 2 (spend them on multiple cores per chip): reduces complexity and power consumption as well as improves performance

Chip Multiprocessor: the kind of processor we use in our laptops and desktop computers for regular use

 A single-chip multiprocessor is referred to as a Chip Multiprocessor (CMP)


 Considered the architecture of choice
 Cores might be coupled either tightly or loosely
1. Cores may or may not share caches
2. Cores may implement a message passing or a shared memory inter-core communication method
 CMPs could be homogeneous or heterogeneous:
1. Homogeneous CMPs include only identical cores (e.g., Intel Core i7 processor series)
2. Heterogeneous CMPs have cores that are not identical (e.g., the ARM big.LITTLE architecture)
 Examples: Intel Core i-series processors, AMD Ryzen processors, ARM Cortex-series processors

Parallel programming model:

 A programming model is an abstraction provided by the hardware to programmers


 It determines how easily programmers can express their algorithms as parallel units of computation
(i.e., tasks) that the hardware understands
 It determines how efficiently parallel tasks can be executed on the hardware

Main Goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, CMP) and minimize the
elapsed time of your program
Parallel programming model Types:

 Shared Memory
 Message Passing

Shared Memory Model:

 Parallel tasks can access any location of the memory


 Parallel tasks can communicate through reading and writing common memory locations
 This is similar to threads from a single process which share a single address space
 Multi-threaded programs (e.g., OpenMP programs) are the best fit with shared memory programming
model

Shared Memory Model Types:

 Single Thread
 Multi Thread

Why are locks needed in the shared-memory parallel programming model?

 Unfortunately, threads in a shared memory model need to synchronize


 This is usually achieved through mutual exclusion
 Mutual exclusion requires that when there are multiple threads, only one thread is allowed to write to a
shared memory location (or the critical section) at any time

To solve the synchronization problem, we use Peterson’s Algorithm (the slides cover the cases without and with a race); a minimal sketch follows below.
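
A minimal Python sketch of Peterson’s two-thread mutual exclusion (illustrative only: the names and iteration count are my own, and a production version would need real memory fences rather than plain Python variables):

# peterson.py - Peterson's algorithm for two threads protecting a shared counter
import threading

flag = [False, False]   # flag[i] is True when thread i wants the critical section
turn = 0                # whose turn it is to wait
counter = 0             # shared state protected by the algorithm

def worker(i, iterations=50_000):
    global turn, counter
    other = 1 - i
    for _ in range(iterations):
        flag[i] = True
        turn = other                          # politely give priority to the other thread
        while flag[other] and turn == other:  # busy-wait: entry protocol
            pass
        counter += 1                          # critical section
        flag[i] = False                       # exit protocol

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # expected 100000 when mutual exclusion holds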
Message Passing Model

 Parallel tasks have their own local memories


 One task cannot access another task’s memory
 Hence, to communicate data they have to rely on explicit messages sent to each other
 This is similar to the abstraction of processes which do not share an address space
 Message Passing Interface (MPI) programs are the best fit with the message passing programming model

In message passing, no mutual exclusion is required because there is no shared memory or address space.

Examples of Parallel Processing:

 Single Program Multiple Data (SPMD) model


 There is only one program and each process uses the same executable working on different sets of
data.
 Multiple Programs Multiple Data (MPMD) model
 Uses different programs for different processes, but the processes collaborate to solve the same problem
 There are two styles: master/worker and coupled analysis

Key Notes:

 The purpose of parallelization is to reduce the time spent on computation


 Ideally, the parallel program is p times faster than the sequential program, where p is the number of
processes involved in the parallel execution, but this is not always achievable
 Message-passing is the tool to consolidate what parallelization has separated. It should not be
regarded as the parallelization itself

Slide 4
Message Passing Interface:

 The Message Passing Interface (MPI) is a message passing library standard for writing message
passing programs
 The goal of MPI is to establish a portable, efficient, and flexible standard for message passing
 By itself, MPI is NOT a library - but rather the specification of what such a library should be
 MPI is not an IEEE or ISO standard, but has in fact, become the industry standard for writing
message passing programs on HPC platforms

Reasons for Using MPI (SP FAP): Standardization, Portability, Functionality, Availability, Performance.

MPI is now used on just about any common parallel architecture, including MPP, SMP clusters, workstation
clusters, and heterogeneous networks. With MPI, the programmer is responsible for correctly identifying
parallelism and implementing parallel algorithms using MPI constructs.

 MPI uses objects called communicators and groups to define which collection of processes may
communicate with each other to solve a certain problem
 Most MPI routines require you to specify a communicator as an argument
 The communicator MPI_COMM_WORLD is often used in calling communication subroutines
 MPI_COMM_WORLD is the predefined communicator that includes all of your MPI processes

Within a communicator, every process has its own unique integer identifier, referred to as its rank, assigned by the
system when the process initializes. A rank is also called a task ID; ranks start from zero and are contiguous.
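
A minimal sketch of ranks within MPI_COMM_WORLD, assuming the mpi4py Python bindings are available (the file name is illustrative):

# hello_mpi.py - run with: mpirun -np 4 python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD      # predefined communicator containing all MPI processes
rank = comm.Get_rank()     # this process's rank (task ID), starting at 0
size = comm.Get_size()     # total number of processes in the communicator
print(f"Hello from rank {rank} of {size}")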

It is possible that a problem consists of several sub-problems, each of which can be solved independently. This type
of application is typically found in the category of MPMD coupled analysis. MPI allows you to achieve this by
using MPI_COMM_SPLIT.
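
A hedged sketch of splitting MPI_COMM_WORLD into sub-communicators, assuming mpi4py (the even/odd color rule is just an example, not from the slides):

# split_comm.py - divide processes into independent groups for separate sub-problems
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

color = rank % 2                        # example rule: even ranks vs. odd ranks
subcomm = world.Split(color, key=rank)  # processes with the same color share a sub-communicator
print(f"world rank {rank} -> sub-communicator rank {subcomm.Get_rank()} (color {color})")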
A blocking send routine will only return after it is safe to modify the application buffer for reuse.

A blocking send can be:

 Synchronous: Means there is a handshaking occurring with the receive task to confirm a safe send
 Asynchronous: Means the system buffer at the sender side is used to hold the data for eventual delivery
to the receiver.

A blocking receive only returns after the data has arrived (i.e., stored at the application recvbuf) and is ready for
use by the program.

Non-blocking send and non-blocking receive routines behave similarly: they return almost immediately.

They do not wait for the message to be copied from the user buffer to the system buffer, or for the actual arrival of the message.

It is unsafe to modify the application buffer until the request has completed. You can make sure the copy has
completed by calling MPI_WAIT() after the send or receive operation.
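
A minimal non-blocking send/receive sketch, assuming mpi4py (the tag and payload are illustrative):

# nonblocking.py - non-blocking send/receive with an explicit wait (run with at least 2 processes)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    req = comm.isend({"payload": 42}, dest=1, tag=7)  # returns almost immediately
    # ... overlap useful computation here ...
    req.wait()            # after this, it is safe to reuse the application buffer
elif rank == 1:
    req = comm.irecv(source=0, tag=7)                 # returns almost immediately
    data = req.wait()     # blocks until the message has actually arrived
    print("rank 1 received", data)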

Why do we use non-blocking communication despite its complexity?

 Non-blocking communication is generally faster than its corresponding blocking communication
 We can overlap computations while the system is copying data back and forth between application and
system buffers

MPI Point-To-Point Communication Routines


Types of Communication in Point to Point:

 Unidirectional Communication
 Blocking send and blocking receive
 Non-blocking send and blocking receive
 Blocking send and non-blocking receive
 Non-blocking send and non-blocking receive
 Bidirectional Communication
 Case 1: Both processes call the send routine first, and then receive
 Case 2: Both processes call the receive routine first, and then send
 Case 3: One process calls send and receive routines in this order, and the other calls them in the
opposite order

With bidirectional communication, we have to be careful about deadlocks. When a deadlock occurs, the processes
involved in the deadlock will not proceed any further.

Deadlocks can take place:

 Either due to the incorrect order of send and receive


 Or due to the limited size of the system buffer

 Send First and Then Receive:

Deadlock Free Code:

Why?
 The program immediately returns from MPI_ISEND and starts receiving data from the other
process
 In the meantime, data transmission is completed and the calls of MPI_WAIT for the completion
of send at both processes do not lead to a deadlock
 Receive First & Then Send:
 One Process Sends and Receives; the other Receives and Sends

It is always safe to order the calls of MPI_(I)SEND and MPI_(I)RECV at the two processes in opposite orders.
 Recommended:
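
One deadlock-free pattern for a bidirectional exchange is MPI's combined send-receive; a minimal sketch, assuming mpi4py (this code is illustrative, not the original slide listing):

# exchange.py - deadlock-free bidirectional exchange between ranks 0 and 1
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
partner = 1 - rank   # assumes exactly two processes

# sendrecv posts the send and the receive together, so neither side blocks the other
received = comm.sendrecv(f"hello from {rank}", dest=partner, source=partner)
print(f"rank {rank} got: {received}")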

Collective communication:
Collective communication allows you to exchange data among a group of processes. It must involve all processes
in the scope of a communicator.

Patterns of Collective Communication (BSG ARSR)

 Broadcast
 Scatter
 Gather
 Allgather
 Alltoall
 Reduce
 Allreduce
 Scan
 Reducescatter

 Broadcast sends a message from the process with rank root to all other processes in the group.

 Scatter distributes distinct messages from a single source task to each task in the group
 Gather gathers distinct messages from each task in the group to a single destination task
 Allgather gathers data from all tasks and distributes the data to all tasks. Each task in the group, in effect,
performs a one-to-all broadcasting operation within the group

 With Alltoall, each task in a group performs a scatter operation, sending a distinct message to all the
tasks in the group in order by index

 Reduce applies a reduction operation on all tasks in the group and places the result in one task
 Allreduce applies a reduction operation and places the result in all tasks in the group. This is equivalent
to an MPI_Reduce followed by an MPI_Bcast
 Scan computes the scan (partial reductions) of data on a collection of processes

 Reduce Scatter combines values and scatters the results. It is equivalent to an MPI_Reduce followed by
an MPI_Scatter operation.
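
A compact sketch of a few of these collective patterns, assuming mpi4py (the data values are illustrative):

# collectives.py - broadcast, scatter, gather, and reduce with mpi4py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Broadcast: the root sends the same value to every process
config = comm.bcast({"chunk": 1024} if rank == 0 else None, root=0)

# Scatter: the root hands each process one distinct element
chunks = [i * 10 for i in range(size)] if rank == 0 else None
my_chunk = comm.scatter(chunks, root=0)

# Gather: the root collects one value from every process
gathered = comm.gather(my_chunk * 2, root=0)

# Reduce: combine values from all processes into a single result at the root
total = comm.reduce(my_chunk, op=MPI.SUM, root=0)

if rank == 0:
    print(config, gathered, total)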

Slide 5 - MapReduce
 MapReduce is a programming model for data processing, which has the ability to scale to hundreds or thousands of
computers, each with several processor cores.
 MapReduce is designed to efficiently process large volumes of data by connecting many commodity
computers together to work in parallel.

Commodity Cluster:

 A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-
CPU or 250 quad-core machines
 MapReduce ties smaller and more reasonably priced machines together into a single cost-effective
commodity cluster

Isolated tasks:

 MapReduce divides the workload into multiple independent tasks and schedules them across cluster
nodes
 The work performed by each task is done in isolation from the other tasks
 Keeping the nodes synchronized at all times would prevent the model from performing reliably and efficiently at
large scale

Data Distribution:

 An underlying distributed file system (e.g., GFS) splits large data files into chunks which are managed by
different nodes in the cluster
 Even though the file chunks are distributed across several machines, they form a single namespace

MapReduce: A Bird’s-Eye View

 Chunks are processed in isolation by tasks called Mappers


 The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second
set of tasks called Reducers
 The process of bringing together IOs into a set of Reducers is known as the shuffling process
 The Reducers produce the final outputs (FOs)
 Overall, MapReduce breaks the data flow into two phases, map phase and reduce phase

Key & Values: In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs

Partitions: All values with the same key are presented to a single Reducer together
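
A minimal word-count sketch of the map and reduce phases over (K, V) pairs, written as plain Python functions for illustration (the function names and sample lines are my own, not Hadoop code):

# wordcount_sketch.py - map and reduce over (key, value) pairs
from collections import defaultdict

def map_phase(line):
    # Mapper: emit (word, 1) for every word in an input line
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reducer: all values for the same key arrive together; sum them
    return (word, sum(counts))

if __name__ == "__main__":
    lines = ["the cat sat", "the cat ran"]
    groups = defaultdict(list)
    for line in lines:                 # map phase
        for k, v in map_phase(line):
            groups[k].append(v)        # shuffle: group intermediate values by key
    print([reduce_phase(k, vs) for k, vs in groups.items()])  # reduce phase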

Network Topology In MapReduce:


MapReduce (an example) Steps:

 Input
 Map Phase
 Shuffle and Sort Phase
 Reduce Phase

Hadoop: Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity.

Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer
referred to as Hadoop Distributed File System (HDFS)

 Input Files: Input files are where the data for a MapReduce task is initially stored; they typically reside in a
distributed file system (e.g., HDFS). The format of input files is arbitrary:
 Line-based log files
 Binary files
 Multi-line input records
 Or something else entirely

 InputFormat: How the input files are split up and read is defined by the InputFormat

InputFormat is a class that does the following:

 Files loaded from local HDFS store


 Selects the files that should be used for input
 Defines the InputSplits that break a file into tasks
 Provides a factory for RecordReader objects that read the file
 Input Splits: An input split describes a unit of work that comprises a single map task in a MapReduce
program
 By default, InputFormat breaks a file up into 64 MB splits
 By dividing the file into splits, we allow several map tasks to operate on a single file in parallel
 If the file is very large, this can improve performance significantly through parallelism
 Each map task corresponds to a single input split
 RecordReader: The RecordReader class actually loads data from its source and converts it into (K, V) pairs
suitable for reading by Mappers. The RecordReader is invoked repeatedly on the input until the entire
split is consumed
 Mapper and Reducer:

The Mapper performs the user-defined work of the first phase of the MapReduce program. A new instance of
Mapper is created for each split

The Reducer performs the user-defined work of the second phase of the MapReduce program.

 Partitioner: The Partitioner class determines which partition a given (K, V) pair will go to. The default
partitioner computes a hash value for the given key and assigns the pair to a partition based on this result (a sketch of this logic appears after this list).
 Sort: The set of intermediate keys on a single node is automatically sorted by MapReduce before they are
presented to the Reducer.
 OutputFormat: The OutputFormat class defines the way (K,V) pairs produced by Reducers are written to
output files. The instances of OutputFormat provided by Hadoop write to files on the local disk or in
HDFS.
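
As referenced in the Partitioner item above, here is a sketch of what the default hash-partitioning logic amounts to, in plain Python rather than Hadoop's actual HashPartitioner source:

# partitioner_sketch.py - map a key to one of R reduce partitions by hashing
def partition(key, num_reducers):
    # Equal keys always land in the same partition, so one Reducer sees all their values
    return hash(key) % num_reducers

print(partition("apple", 4), partition("apple", 4))  # same key -> same partition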

MapReducer Additional Function:

 Combiner Functions: It pays to minimize the data shuffled between map and reduce tasks. Hadoop
allows the user to specify a combiner function (just like the reduce function) to be run on a map output.

Task Scheduling in MapReduce:

 MapReduce adopts a master-slave architecture


 The master node in MapReduce is referred to as Job Tracker (JT)
 Each slave node in MapReduce is referred to as Task Tracker (TT)
 MapReduce adopts a pull scheduling strategy rather than a push one
Map and Reduce Task Scheduling: Every TT sends a heartbeat message periodically to JT encompassing a request
for a map or a reduce task to run

 Map Task Scheduling


 Reduce Task Scheduling

Job Scheduling in MapReduce: A job encompasses multiple map and reduce tasks. MapReduce in Hadoop comes
with a choice of schedulers:

 The default is the FIFO scheduler, which schedules jobs in order of submission
 There is also a multi-user scheduler called the Fair scheduler which aims to give every user a fair share of
the cluster capacity over time

Fault Tolerance in Hadoop: If a TT fails to communicate with JT for a period of time (by default, 1 minute in
Hadoop), JT will assume that TT in question has crashed

 If the job is still in the map phase, JT asks another TT to re-execute all Mappers that previously ran at the
failed TT
 If the job is in the reduce phase, JT asks another TT to re-execute all Reducers that were in progress on
the failed TT

What Makes MapReduce Unique?

 Its simplified programming model which allows the user to quickly write and test distributed systems
 Its efficient and automatic distribution of data and workload across machines
 Its flat scalability curve. Specifically, after a MapReduce program is written and functioning on 10 nodes,
very little, if any, work is required to make that same program run on 1000 nodes
 Its fault tolerance approach
Slide 6 - Virtualization

Operating System Limitations:

 OSs provide a way of virtualizing hardware resources among processes


 This may help isolate processes from one another
 Having hardware resources managed by a single OS limits the flexibility of the system in terms of
available software, security, and failure isolation

Virtualization Properties:

Virtualization:

 Informally, a virtualized system (or subsystem) is a mapping of its interface, and all resources visible
through that interface, to the interface and resources of a real system
 Formally, virtualization involves the construction of an isomorphism that maps a virtual guest system to
a real host system
Abstraction: The key to managing complexity in computer systems is their division into levels of abstraction
separated by well-defined interfaces. Levels of abstraction allow implementation details at lower levels of a
design to be ignored or simplified

Virtualization and Abstraction: Virtualization uses abstraction but is different in that it doesn’t necessarily hide
details; the level of detail in a virtual system is often the same as that in the underlying real system.

Virtualization provides a different interface and/or resources at the same level of abstraction.

Virtual Machines and Hypervisors:

 The concept of virtualization can be applied not only to subsystems such as disks, but to an entire
machine denoted as a virtual machine (VM)
 A VM is implemented by adding a layer of software to a real machine so as to support the desired VM’s
architecture
 This layer of software is often referred to as virtual machine monitor (VMM)
 Early VMMs were implemented in firmware
 Today, VMMs are often implemented as a co-designed firmware-software layer, referred to as the
hypervisor

Traditional VMMs provide full-virtualization:

 The functionality provided is identical to that of the underlying physical hardware


 The functionality is exposed to the VMs
 They allow unmodified guest OSs to execute on the VMs
 This might result in some performance degradation
 VMWare provides full virtualization

VMMs provide para-virtualization:

 They provide a virtual hardware abstraction that is similar, but not identical to the real hardware
 They modify the guest OS to cooperate with the VMM
 They result in lower overhead leading to better performance
 E.g., Xen provides both para-virtualization as well as full-virtualization

Virtualization and Emulation

 VMs can employ emulation techniques to support cross-platform software compatibility


 Compatibility can be provided either at the system level (e.g., to run a Windows OS on Macintosh) or at
the program or process level (e.g., to run Excel on a Sun Solaris/SPARC platform)
 Emulation is the process of implementing the interface and functionality of one system on a system
having a different interface and functionality
 It can be argued that virtualization itself is simply a form of emulation
Slide 7 – Virtualization II
Types of Virtual Machines:

 Process VM: Capable of supporting an individual process


 System VM: Supports an OS with potentially many types of processes

Process Virtual Machine:

 Runtime is placed at the ABI interface


 Runtime emulates both user-level instructions and OS system calls

System Virtual Machine:

 VMM emulates the ISA of one hardware platform on another, forming a system VM
 A system VM is capable of executing a system software environment developed for a different set of
hardware

A Taxonomy
Resource Virtualization

 CPU Virtualization
 Interpretation and Binary Translation
 Virtualizable ISAs
 Memory Virtualization
 I/O Virtualization

Instruction Set Architecture:

Typically, the architecture of a processor defines:

1. A set of storage resources (e.g., registers and memory)


2. A set of instructions that manipulate data held in storage resources

The definition of the storage resources and the instructions that manipulate data are documented in what is
referred to as Instruction Set Architecture (ISA)

Two parts in the ISA are important in the definition of VMs:


1. User ISA: visible to user programs
2. System ISA: visible to supervisor software (e.g., OS)

Ways to Virtualize CPUs:

1. Emulation: the only processor virtualization mechanism available when the ISA of the guest is different from
the ISA of the host
2. Direct native execution: possible only if the ISA of the host is identical to the ISA of the guest

Emulation: Emulation is the process of implementing the interface and functionality of one system (or subsystem)
on a system (or subsystem) having a different interface and functionality.

In other words, emulation allows a machine implementing one ISA (the target) to reproduce the behavior of
software compiled for another ISA (the source).

Slide 8 - DFS
Why File Systems?

 To organize data (as files)


 To provide a means for applications to store, access, and modify data

Why Distributed File Systems?

Big data continues to grow. In contrast to a local file system, a distributed file system (DFS) can hold big data and
provide access to this data to many clients distributed across a network.

Network-attached storage (NAS) refers to attaching storage to network servers that provide file systems.

Storage area network (SAN) makes storage devices (not file systems) available over a network.

Benefits of DFSs

 File sharing over a network: without a DFS, we would have to exchange files by e-mail or use
applications such as the Internet’s FTP
 Transparent file access: A user’s programs can access remote files as if they were local. The remote files
have no special APIs; they are accessed just like local ones
 Easy file management: managing a DFS is easier than managing multiple local file systems
DFS Components:

 The data state: This is the contents of files


 The attribute state (meta data): This is the information about each file (e.g., file’s size and access control
list)
 The open-file state: This includes which files are open or otherwise in use, as well as describing how files
are locked

Network File System

 Many distributed file systems are organized along the lines of client-server architectures
 Sun Microsystem’s Network File System (NFS) is one of the most widely-deployed DFSs for Unix-based
systems
 NFS comes with a protocol that describes precisely how a client can access a file stored on a (remote)
NFS file server
 NFS allows a heterogeneous collection of processes, possibly running on different OSs and machines, to
share a common file system

Remote Access Model

 The model underlying NFS and similar systems is that of the remote access model. Clients:
 Are offered transparent access to a file system that is managed by a remote server
 Are normally unaware of the actual location of files
 Are offered an interface to a file system similar to the interface offered by a conventional local file
system

Upload/Download Model:

 A contrary model, referred to as the upload/download model, allows a client to access a file locally after having
downloaded it from the server
 The Internet’s FTP service can be used this way when a client downloads a complete file, modifies it, and
then puts it back
The Basic NFS Architecture:

Cluster-Based Distributed File Systems

 The underlying cluster-based file system is a key component for providing scalable data-intensive
application performance
 The cluster-based file system divides and distributes big data, using file striping techniques, to allow
concurrent data accesses
 The cluster-based file system could be either a cloud computing or an HPC oriented distributed file
system
 Google File System (GFS) and S3 are examples of cloud computing DFSs
 Parallel Virtual File System (PVFS) and IBM’s General Parallel File System (GPFS) are examples of HPC
DFSs

File Striping Techniques

 Server clusters are often used for parallel applications and their associated file systems are adjusted to
satisfy their requirements
 One well-known technique is to deploy file-striping techniques, by which a single file is distributed across
multiple servers
 Hence, it becomes possible to fetch different parts in parallel
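
A hedged sketch of the arithmetic behind round-robin file striping (the stripe size and server count are illustrative):

# striping_sketch.py - which server holds a given byte offset under round-robin striping
def server_for_offset(offset, stripe_size, num_servers):
    # Consecutive stripe_size-byte chunks are placed on consecutive servers
    return (offset // stripe_size) % num_servers

# With 64 KB stripes over 4 servers, byte offset 200_000 falls in stripe 3 -> server 3
print(server_for_offset(200_000, 64 * 1024, 4))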

Best Wishes for Cloud Computing Exam


Dipu (CSE18)
