Distributed File Systems
What is a Distributed File System?
A file system is a subsystem of an operating system that performs file management activities such as organizing, storing, retrieving, naming, sharing, and protecting files.
A Distributed File System provides abstraction to the users of a distributed system and makes it convenient for them to use files in a distributed environment.
Purpose of files:
- Permanent storage of information
- Sharing of information
A distributed file system is more complex than a centralized one because users and storage devices are physically dispersed. It supports:
- Remote information sharing
- User mobility
- Availability/replication
- Use of diskless workstations
Distributed File System Services
- Storage service: allocation and management of space on storage devices; provides a logical view of the storage system.
- True file service: creating and deleting files, accessing and modifying data in files.
- Name service (directory service): provides the mapping between text names for files and references to files, that is, file ids.
Desirable Features of DFS
Transparency:
- Structure transparency: the multiplicity of file servers should be transparent to clients.
- Access transparency: both local and remote files should be accessible in the same way.
- Naming transparency: the name of a file should give no hint of its location.
- Replication transparency: clients should be unaware of both the existence of multiple copies of a file and their locations.
Other desirable features:
- User mobility
- Performance
- Simplicity and ease of use
- Scalability
- High availability
- High reliability
- Data integrity
- Security
- Heterogeneity
File Models
- Unstructured files: a file appears as an uninterpreted sequence of bytes.
- Structured files: a file is treated as an ordered sequence of records; a record is the smallest unit of access, and different files can have different properties, as their record sizes vary.
  - Non-indexed: records are accessed by their position in the file.
  - Indexed: records are accessed by the value of key fields.
- Mutable files: an update of a file overwrites its old content to produce the new content.
- Immutable files: a file cannot be modified once it has been created; it can only be deleted. File updates are implemented through file versioning, which results in increased use of disk space and increased disk allocation activity; the versions to be retained can be specified.
File Accessing Models
The file accessing model of a distributed file system depends on two factors: the method used for accessing remote files and the unit of data access.
Accessing remote files:
- Remote service model: the processing of a client's request is performed at the server's node, at the cost of data packing and communication overhead.
- Data caching model: if the required data is not present locally, it is copied from the server's node to the client's node and cached. This reduces server contention and network traffic, but write operations can lead to cache consistency problems.
Unit of data transfer:
- File-level transfer model: more efficient than transmitting page by page; reduces server load and network traffic; disk access routines can be optimized; after caching, the client is immune to server or network failures; simplifies transfer between heterogeneous systems. However, the client needs large storage space, and moving an entire file is wasteful if only a small fraction of it is required.
- Block-level transfer model: does not require the client to have large storage space, but multiple requests are needed to access an entire file.
- Byte-level transfer model: flexible, as it allows storage and retrieval of an arbitrary sequential subrange of a file, but variable-length data makes cache management difficult.
- Record-level transfer model: suitable for structured files.
File Sharing Semantics
File sharing semantics define when modifications of file data made by a user are observable by other users.
- UNIX semantics: enforces an absolute time ordering on all operations and ensures that every read operation on a file sees the effects of all previous write operations performed on that file; a write to a file is immediately visible to all other users who have the file open at the same time. This is difficult to achieve in a distributed file system; it cannot be guaranteed even by having all access requests processed by a single server, due to network delays.
- Immutable shared-file semantics: a file is treated as immutable once it is declared sharable (read-only mode); changes to the file are handled by creating a new, updated version of the file.
- Session semantics: a session is a series of file accesses between the open and close operations. Changes made to a file become visible to remote processes only after the session is closed; already-open instances of the file do not reflect the changes. (What is the final file image sent back to the server when multiple sessions on the same file are closed one after another?) Typically used with the file-level transfer model.
- Transaction-like semantics: used for controlling concurrent access to shared, mutable data. A transaction is a set of operations enclosed between a pair of begin_transaction and end_transaction operations. Partial modifications made to shared data by a transaction are not visible to other concurrently executing transactions until the transaction ends, and the final file content is the same as if all transactions had run in some sequential order.
File Caching
File caching retains recently accessed file data in main memory so that the same data can be accessed repeatedly without additional disk transfers. The design issues for a file caching scheme in a centralized file system are the granularity of cached data, the cache size (fixed or dynamically changing), and the replacement policy.
A distributed file system raises three additional design issues: cache location, modification propagation, and cache validation.
Cache Location
Cache location         | Cache-hit access cost          | Advantages
No caching             | One network + one disk access  | Easy to implement; easy to maintain consistency
Server's main memory   | One network access             | Easy to implement; easy to maintain consistency; can support UNIX-like file sharing semantics
Client's disk          | One disk access                | Reliability against crashes; large storage capacity; scalable and reliable (but does not support diskless workstations)
Client's main memory   | No disk or network access      | Maximum performance gain; permits diskless workstations; scalable
Modification Propagation
Modification propagation determines when the master copy of a file at the server node is updated after a cache entry is modified.
When the caches of all the nodes contain exactly the same copies of the file data, we say that the caches are consistent.
Write-through scheme
When a cache entry is modified, the new value is immediately sent to the server to update the master copy of the file. This is reliable and supports UNIX-like semantics, but gives poor write performance and increases network traffic.

Delayed-write scheme
All updated cache entries corresponding to a file are gathered together and sent to the server at one time. Write accesses are faster, and not all modifications need to be propagated to the server. Three approaches:
- Write on ejection from cache: write when the modified data needs to be replaced.
- Periodic write.
- Write on close: write when the corresponding file is closed by the client.
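A minimal sketch contrasting the two schemes; the server object and its write_block call are hypothetical placeholders:

```python
# Sketch of write-through vs. delayed-write propagation (names hypothetical).

class WriteThroughCache:
    """Every cache modification is immediately pushed to the server."""
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, block_id, data):
        self.cache[block_id] = data
        self.server.write_block(block_id, data)   # propagate at once: reliable, slow writes

class DelayedWriteCache:
    """Modifications are gathered and flushed later (here: on close)."""
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()                        # blocks modified since last flush

    def write(self, block_id, data):
        self.cache[block_id] = data
        self.dirty.add(block_id)                  # fast: no network traffic yet

    def close(self):
        for block_id in self.dirty:               # "write on close" flavour
            self.server.write_block(block_id, self.cache[block_id])
        self.dirty.clear()
```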
Cache Validation Schemes
Cache validation schemes verify whether the data cached at a client node is consistent with the master copy. If not, the cached data must be invalidated and the updated version fetched again from the server.
- Client-initiated approach: the client contacts the server and checks whether its locally cached data is consistent with the master copy. The validity check is performed by comparing the time of last modification of the cached data with that of the master copy. The frequency of validity checks may be: before every access, periodic, or on file open.
- Server-initiated approach: the server monitors the file usage modes of different clients and reacts whenever two or more clients try to open a file in conflicting modes. When a client closes a file, it informs the server, along with any modifications made to the file. This approach requires file servers to be stateful.
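A minimal sketch of the client-initiated approach, validating by last-modification time before each use; the server.get_mtime and server.fetch calls and the cache layout are hypothetical:

```python
# Sketch of client-initiated cache validation: before using a cached entry,
# compare its last-modification timestamp with the server's master copy.

def read_with_validation(client_cache, server, file_id):
    entry = client_cache.get(file_id)
    if entry is not None and entry["mtime"] == server.get_mtime(file_id):
        return entry["data"]                      # cached copy is still valid
    data, mtime = server.fetch(file_id)           # invalid or missing: refetch
    client_cache[file_id] = {"data": data, "mtime": mtime}
    return data
```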
File Replication
A replicated file is a file that has multiple copies, each located on a separate file server.

Replication vs. caching:
- A replica is associated with a server, whereas a cached copy is normally associated with a client.
- A cached copy depends on locality in file access patterns; a replica depends on availability and performance requirements.
- Compared to a cached copy, a replica is more persistent.
- A cached copy needs to be validated with respect to a replica.
Advantages of File Replication
- Increased availability
- Increased reliability
- Improved response time
- Reduced network traffic
- Improved system throughput
- Better scalability
- Autonomous operation
Replication Transparency
Replication of files should be transparent to users, so that the multiple copies of a replicated file appear as a single logical file.
- Naming of replicas: for immutable objects, a single identifier can be assigned to all replicas; for mutable objects, the naming system maps a user-supplied identifier onto an appropriate replica.
- Replication control: determines the number and locations of the replicas of a replicated file. With explicit replication, users control the entire replication process; with implicit (lazy) replication, it is controlled automatically by the system.
Multi - copy Update Problem
The multi-copy update problem is that of maintaining consistency among the copies when a replicated file is updated. Approaches:
- Read-only replication: allows replication of only immutable files.
- Read-any-write-all protocol: a read is performed on any copy of the file and a write is performed on all copies; all copies must be locked before updating. A write cannot be performed if any server holding a replica is down.
- Available-copies protocol: a read is performed on any available copy and a write is performed on all currently available copies.
- Primary-copy protocol: a read can be done on the primary or a secondary copy, but a write must be done on the primary copy only. Each server holding a secondary copy updates it either by receiving a notification of changes from the server holding the primary copy or by requesting the updated copy from it.
Quorum-Based Protocols
The read-any-write-all and available-copies protocols cannot handle the network partition problem, in which the copies of a replicated file are partitioned into two or more active groups. The primary-copy protocol is too restrictive.
For a total of N copies of a replicated file, a minimum of r copies must be read (the read quorum) and w copies must be written (the write quorum), where r + w > N. This guarantees that there is at least one common, up-to-date copy in any pair of read and write quorums. The version number of a copy is updated every time the copy is modified; the copy with the largest version number in a quorum is the current one.

[Figure: eight copies of a file, with an overlapping read quorum and write quorum marked.]
Read: a read is executed as follows:
1. Retrieve a read quorum (any r copies) of F.
2. Of the r copies retrieved, select the copy with the largest version number.
3. Perform the read operation on the selected copy.
Write: a write is executed as follows:
1. Retrieve a write quorum (any w copies) of F.
2. Of the w copies retrieved, get the version number of the copy with the largest version number.
3. Increment the version number.
4. Write the new value and the new version number to all w copies of the write quorum.
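A minimal sketch of these two procedures, with each replica modeled as a dict holding a version number and a value; choosing "any r (or w) copies" is simplified to a random sample:

```python
# Sketch of quorum-based read and write for N replicas with r + w > N.
import random

def quorum_read(replicas, r):
    quorum = random.sample(replicas, r)           # any r copies form a read quorum
    newest = max(quorum, key=lambda c: c["version"])
    return newest["value"]                        # largest version number is current

def quorum_write(replicas, w, new_value):
    quorum = random.sample(replicas, w)           # any w copies form a write quorum
    version = max(c["version"] for c in quorum) + 1
    for copy in quorum:                           # write value + version to all w copies
        copy["value"] = new_value
        copy["version"] = version

replicas = [{"version": 0, "value": None} for _ in range(8)]
quorum_write(replicas, w=5, new_value="v1")       # r=4, w=5 satisfies r + w > 8
print(quorum_read(replicas, r=4))                 # any 4 copies overlap the 5 written
```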
The quorum protocol described above is very general, and several special algorithms can be derived from it. A few are:
1. Read-any-write-all protocol: a special case of the generalized quorum protocol with r = 1 and w = N.
2. Read-all-write-any protocol: the case r = N and w = 1.
3. Majority-consensus protocol: the sizes of the read quorum and the write quorum are made equal or nearly equal. For example, if N = 11, a possible quorum assignment is r = 6 and w = 6; when N = 12, a possible assignment is r = 6 and w = 7.
4. Consensus with weighted voting: in the protocols above, all copies of a replicated file are given equal importance (each copy is assigned a single vote). Suppose that of the N replicas of a file, located on different nodes, the replica at node A is accessed more frequently than the others. This can be modeled by assigning more votes to the copy at node A; quorum sizes are then defined in terms of votes rather than numbers of copies.
Fault Tolerance
Fault Tolerance is an important issue in the design of a distributed file system. Various types of faults could harm the integrity of the data stored by such a system. The primary file properties that directly influence the ability of a distributed file system to tolerate faults are as follows:
- Availability: depends on the locations of the file and its clients; replication improves availability.
- Robustness: the power to survive crashes and decay of the storage device; stable storage is used.
- Recoverability: the ability to be rolled back to an earlier, consistent state when an operation on the file fails or is aborted by the client; atomic updates are used.
Stable Storage
In the context of crash resistance, storage may be broadly classified into three types:
- Volatile storage: cannot withstand power failures or machine crashes (e.g., RAM).
- Nonvolatile storage: can withstand CPU failures but not transient I/O faults or decay of the storage medium (e.g., disk).
- Stable storage: can withstand transient I/O faults and decay of the storage medium; implemented using duplicate storage devices.
Server Implementation
A server may be implemented using either of the following two service paradigms:
- Stateful servers (the server maintains information about its clients between requests)
- Stateless servers (each request is self-contained, and the server keeps no client state)
Atomic Transaction
A transaction is a logical unit of program execution. An atomic transaction is a computation consisting of a collection of operations that take place indivisibly in the presence of failures and concurrent computations. Transactions help preserve the consistency of a set of shared data objects in the face of failures and concurrent access.
Transaction Properties (ACID)
- Atomicity (failure atomicity): the all-or-nothing property.
- Consistency (concurrency atomicity): if the database state at the start of a transaction is consistent, it will be consistent at the end of the transaction; only the final state becomes visible to other processes after the transaction completes.
- Isolation (serializability): ensures that concurrently executing transactions do not interfere with each other; the result of performing them concurrently is always the same as if they had been executed one at a time.
- Durability (permanence): once a transaction completes successfully, its results become permanent and cannot be lost even if the corresponding process, or the processor on which it is running, crashes.
Why Use Transactions?
For improving the recoverability of files in the event of failures, and for allowing concurrent sharing of mutable files by multiple clients in a consistent manner.
Inconsistency due to system failure
Consider a banking transaction composed of four operations (a1, a2, a3, a4) that transfer $5 from account X to account Z:

a1: read balance (x) of account X
a2: read balance (z) of account Z
a3: write (x - 5) to account X
a4: write (z + 5) to account Z

Let the initial balance in both accounts be $100.

Successful execution: a1: x = 100; a2: z = 100; a3: x = 95; a4: z = 105. Final result: x = 95, z = 105.

Unsuccessful execution: a1: x = 100; a2: z = 100; a3: x = 95; then the system crashes. Final result when the four operations are not treated as a transaction: x = 95 and z = 100, so $5 has vanished. When they are treated as a transaction: x = 100 and z = 100, since the partial update is undone.
Transaction-based File Server Operation
In a file system that provides a transaction facility, a transaction consists of a sequence of elementary file access operations such as read and write. Operations for the transaction service:
- begin_transaction: begins a new transaction and returns a unique transaction id (TID).
- end_transaction: returns a status indicating whether the transaction committed or is inactive because it was aborted.
- abort_transaction(TID): restores any changes made so far within the transaction and changes its status to inactive.
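A minimal client-side sketch of how these operations might be used, assuming a transaction-service stub ts; the read and write calls, and the transfer example itself, are hypothetical additions for illustration:

```python
# Hypothetical use of the transaction service operations listed above.
def transfer(ts, acct_x, acct_z, amount):
    tid = ts.begin_transaction()            # returns a unique transaction id
    try:
        x = ts.read(tid, acct_x)            # elementary file access operations
        z = ts.read(tid, acct_z)
        ts.write(tid, acct_x, x - amount)
        ts.write(tid, acct_z, z + amount)
        return ts.end_transaction(tid)      # status: committed or aborted
    except Exception:
        ts.abort_transaction(tid)           # undo all changes made under tid
        raise
```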
Recovery Techniques
A transaction has two phases:
- First phase: begins with begin_transaction and ends with end_transaction; clients progressively add changes to data items.
- Second phase: starts after end_transaction or abort_transaction; the changes made by the transaction are either permanently recorded or undone by the server.

Two approaches record file updates in a reversible manner: the file versions approach and the write-ahead log approach.
File Versions Approach
The server uses the current version of a file for all access operations that do not modify the file; if an access modifies the file, the server creates a tentative version from the current one.

If the transaction commits, the tentative version becomes the new current version; if the transaction aborts, the tentative version is discarded.

Several tentative versions of a file may exist at a time, since a file may be involved in several concurrent transactions. The approach can be implemented using shadow blocks.
[Figure: shadow-block implementation of file versions. The current index of the file points to its data blocks. When a transaction modifies block 4 and appends a new block, a tentative index is created whose entries for the changed blocks point to shadow blocks allocated from the free-block list. If the transaction aborts, the tentative index is discarded and the shadow blocks return to the free list; if it commits, the tentative index becomes the new current index and the superseded blocks are freed.]
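A minimal sketch of the file-versions approach; for simplicity the tentative version is a full copy rather than a shadow-block index:

```python
# Sketch of the file-versions approach: reads go to the current version; the
# first modifying access creates a tentative copy, which replaces the current
# version on commit and is discarded on abort.

class VersionedFile:
    def __init__(self, blocks):
        self.current = blocks                     # current version of the file
        self.tentative = None

    def read_block(self, i):
        return self.current[i]                    # non-modifying access

    def write_block(self, i, data):
        if self.tentative is None:
            self.tentative = list(self.current)   # copy-on-first-write
        self.tentative[i] = data

    def commit(self):
        if self.tentative is not None:
            self.current = self.tentative         # tentative becomes new current
        self.tentative = None

    def abort(self):
        self.tentative = None                     # tentative version is discarded
```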
Write Ahead Log Approach
A write-ahead log is maintained on stable storage and contains a record for each operation that makes changes to files. For each modifying operation of a transaction, a record is created and written to the log; only after this is the operation performed on the file to modify its contents. Each record contains the transaction id, the identifier of the data item being modified, and its old and new values.
If a transaction aborts, the information in the log is used to roll the files back to their initial state. For the banking example (x = 100, z = 100):

begin_transaction
  read balance (x) of account X        log: -
  read balance (z) of account Z        log: -
  write (x - 5) to account X           log: x = 100/95
  write (z + 5) to account Z           log: x = 100/95, z = 100/105
end_transaction

Each log entry records the old and new values of the data item (old/new).
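A minimal sketch of a write-ahead log over an in-memory store (a real log would live on stable storage):

```python
# Sketch of a write-ahead log: each record holds (tid, item, old, new); the
# record is appended *before* the item is updated, so an abort can replay the
# log backwards and restore old values.

class WriteAheadLog:
    def __init__(self):
        self.records = []                         # stable storage in a real system

    def write(self, store, tid, item, new_value):
        self.records.append((tid, item, store[item], new_value))  # log first...
        store[item] = new_value                   # ...then perform the update

    def rollback(self, store, tid):
        for t, item, old, _new in reversed(self.records):
            if t == tid:
                store[item] = old                 # undo this transaction's updates

store = {"x": 100, "z": 100}
log = WriteAheadLog()
log.write(store, tid=1, item="x", new_value=95)   # log record: x = 100/95
log.rollback(store, tid=1)                        # crash/abort: x restored to 100
```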
Concurrency Control
A good concurrency control mechanism allows maximum concurrency with minimum overhead and ensures that transactions run in such a manner that their effects on shared data are serially equivalent.
Approaches: locking, the two-phase locking protocol, optimistic concurrency control, and timestamps.
Locking
A transaction locks a data item before accessing it. Each lock is labeled with the transaction identifier, and only the transaction that locked the data item can access it, any number of times.

Type of lock already set | Read lock to be set | Write lock to be set
None                     | Permitted           | Permitted
Read                     | Permitted           | Not permitted
Write                    | Not permitted       | Not permitted

Type-specific locking with intention-to-write (I-write) locks:

Type of lock already set | Read          | I-write       | Commit
None                     | Permitted     | Permitted     | Permitted
Read                     | Permitted     | Permitted     | Not permitted
I-write                  | Permitted     | Not permitted | Not permitted
Commit                   | Not permitted | Not permitted | Not permitted
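A minimal sketch of the lock-compatibility check corresponding to the first table above:

```python
# Lock-compatibility table: (already set, to be set) -> permitted?
COMPATIBLE = {
    ("none", "read"): True,   ("none", "write"): True,
    ("read", "read"): True,   ("read", "write"): False,
    ("write", "read"): False, ("write", "write"): False,
}

def can_lock(already_set, to_be_set):
    return COMPATIBLE[(already_set, to_be_set)]

assert can_lock("read", "read")                   # concurrent readers are fine
assert not can_lock("read", "write")              # a writer must wait for readers
```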
Two Phase Locking Protocol
To increase concurrency, it is tempting to lock a data item only for the period during which the transaction actually works on it, releasing the lock as soon as the access operation is over. However, locking and unlocking data items precisely at the moments they are required can lead to inconsistency: if locks are released early, another transaction may update a data item between two subsequent accesses to it by the first transaction and commit, which can result in cascading aborts.
For example, suppose transaction T performs read(A), read(B), write(A) and releases its lock on A as soon as the write completes. Transaction T1 then reads and writes A, and transaction T2 then reads A. If T now fails and must be rolled back, T1 and T2 have used a value written by T, so they must be rolled back as well.
To avoid the data inconsistency problems, transaction systems use the two phase locking protocol.
In the first phase of a transaction, known as the growing phase, all locks needed by the transaction are gradually acquired. In the second phase, known as the shrinking phase, the acquired locks are released.

Two further design issues:
- Granularity of locking: the unit of lockable data items.
- Handling of locking deadlocks:
  - Avoidance: lock data items in a predefined order so that no cycle can form.
  - Detection: construct and check a who-waits-for-whom graph; a cycle in the graph indicates a deadlock.
  - Timeouts: associating a timeout period with each lock is another method for handling deadlocks.
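A minimal sketch of two-phase locking around a unit of work, assuming a hypothetical lock_manager with acquire and release calls; locking in a fixed (sorted) order also implements the deadlock-avoidance idea above:

```python
# Sketch of two-phase locking: all locks are acquired in the growing phase and
# released only after the work is done (shrinking phase).

def run_two_phase(lock_manager, tid, items, work):
    held = []
    try:
        for item in sorted(items):                # predefined order avoids deadlock
            lock_manager.acquire(tid, item)       # growing phase: only acquire
            held.append(item)
        return work()                             # all accesses happen fully locked
    finally:
        for item in reversed(held):
            lock_manager.release(tid, item)       # shrinking phase: only release
```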
Optimistic Concurrency Control
Transactions proceed uncontrolled up to the end of their first phase. In the second phase, before a transaction is committed, it is validated to check whether any of its data items have been changed by another transaction since it started; the transaction is committed if found valid, otherwise it is aborted. For the validation process, two records are kept of the data items accessed within a transaction: a read set containing the data items read by the transaction, and a write set containing the data items changed, created, or deleted by it. To validate a transaction, its read set and write set are compared with the write sets of all concurrent transactions that reached the end of their first phase before it. This approach allows maximum parallelism.
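A minimal sketch of the validation step, assuming each transaction object carries its read_set and write_set as Python sets:

```python
# Sketch of optimistic validation: a committing transaction is valid only if
# its read and write sets do not conflict with the write sets of concurrent
# transactions that finished their first phase before it.

def validate(txn, earlier_finished):
    for other in earlier_finished:
        if txn.read_set & other.write_set:        # we read something they changed
            return False
        if txn.write_set & other.write_set:       # conflicting writes
            return False
    return True                                   # safe to commit; else abort & retry
```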
Timestamps
Validation is performed at the operation level. Each transaction is assigned a unique timestamp the moment it performs begin_transaction, and every data item has a read timestamp and a write timestamp associated with it. When a transaction accesses a data item, the item's read timestamp or write timestamp (depending on the type of access) is updated to the transaction's timestamp.
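The text above describes only how the timestamps are maintained; the sketch below additionally assumes the standard timestamp-ordering rejection rule, under which an access arriving "too late" aborts the transaction:

```python
# Sketch of timestamp-ordering checks on one data item, modeled as a dict
# {"value": ..., "rts": 0, "wts": 0} (read/write timestamps).

def ts_read(item, txn_ts):
    if txn_ts < item["wts"]:
        raise RuntimeError("abort: item already overwritten by a younger txn")
    item["rts"] = max(item["rts"], txn_ts)        # remember the latest reader
    return item["value"]

def ts_write(item, txn_ts, value):
    if txn_ts < item["rts"] or txn_ts < item["wts"]:
        raise RuntimeError("abort: a younger txn already read/wrote this item")
    item["value"], item["wts"] = value, txn_ts    # update value and write timestamp
```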
Distributed Transaction Service
A distributed transaction involves files managed by more than one server. The client begins the transaction by sending a begin_transaction request to any server; the contacted server (the coordinator) executes the request and returns a transaction id (TID) to the client. The coordinator is responsible for aborting or committing the transaction and for enlisting the other servers involved, called workers.
Two Phase Multiserver Commit Protocol
end_transaction is performed in two phases:
- Preparation phase: the coordinator makes an entry in its log that it is starting the commit protocol, then sends a prepare message (with an associated timeout value) to all workers. If a worker is ready to commit, it makes an entry in its log and sends a ready message; otherwise it sends an abort message.
- Commitment phase: if all workers are ready to commit, the transaction is committed and a commit message is sent to the workers; if any worker replied abort, or a timeout expired, the transaction is aborted. When a worker receives the commit message, it makes a committed entry in its log and sends an acknowledgement to the coordinator. When the coordinator has received acknowledgements from all workers, the transaction is finally committed and its entry is deleted from the log.
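A minimal sketch of the coordinator's side of this protocol; the worker objects and their prepare/commit/abort calls are hypothetical, and a timeout is folded into prepare() returning False:

```python
# Sketch of the coordinator in the two-phase multiserver commit protocol.

def two_phase_commit(log, workers, tid):
    log.append(("start-commit", tid))             # coordinator logs protocol start
    ready = all(w.prepare(tid) for w in workers)  # phase 1: collect ready/abort votes
    if ready:
        log.append(("commit", tid))
        for w in workers:
            w.commit(tid)                         # phase 2: everyone commits
        return "committed"
    else:
        log.append(("abort", tid))
        for w in workers:
            w.abort(tid)                          # any 'no' vote or timeout aborts
        return "aborted"
```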
Nested Transaction
A transaction may be composed of other transactions, called subtransactions. A transaction may commit only after all its descendants have committed, and a subtransaction appears atomic to its parent. The abort of a subtransaction need not cause the abort of its ancestors.
Advantages:
- Allows concurrency within a transaction, since subtransactions can run in parallel on different processors.
- Subtransactions act as checkpoints within a transaction, which provides protection against failures.