Lecture Notes Course Outcome 1 & Session 4 Topic: SFS File System Implementation
Simple File System (SFS): The SFS, also known as the Very Simple File System (VSFS), is a simplified
version of a typical UNIX file system.
It serves to introduce some of the basic on-disk structures, access methods, and policies. The
file system is pure software; no hardware features are added to make any aspect of the
file system work better.
To understand how a file system works, two aspects must be understood. The first is the
data structures of the file system. The second is its access methods.
The data structures are the on-disk structures that the file system uses to organize its data and
metadata. A file system may employ simple structures such as arrays of blocks, or more
complicated tree-based structures.
The access methods describe how the file system maps the system calls made by a process, such as
open(), write(), etc., onto its structures: which structures are read during the execution of a
particular system call, which are written, and how efficiently all of these steps are performed.
Overall Organization of the data structures: In the overall on-disk organization of the data structures,
the disk is divided into blocks.
The simple file system uses just one block size. The disk partition holding the file system is treated
as a series of blocks, each of equal size. The blocks are addressed from 0 to N-1, where N is the
number of blocks in the partition.
Most of the space in any file system is user data, so most of the blocks of SFS hold user
data. The region of the disk that is used for user data is called the data region.
The file system also has to track information about each file. This metadata records which data
blocks comprise the file, the size of the file, its owner and access rights, and its access and
modify times.
This information is stored in a structure called an inode. To accommodate inodes, some space on the
disk is reserved, and this portion is called the inode table. To track whether inodes or data blocks
are free or allocated, allocation structures are used.
Many allocation-tracking methods are possible. One example is a free list that points to the first
free block, which then points to the next free block, and so on.
A bitmap is another allocation structure, in which each bit indicates whether the corresponding
inode or block is free (0) or in use (1). For inodes an inode bitmap is used, and for data blocks a
data bitmap is used.
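A bitmap of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a bytearray as the backing store; the class name Bitmap and its method names are made up for this sketch and are not part of SFS.

```python
class Bitmap:
    """Allocation bitmap: bit i is 0 if item i is free, 1 if it is in use."""

    def __init__(self, nbits):
        self.nbits = nbits
        self.bits = bytearray((nbits + 7) // 8)  # all zero: everything is free

    def is_used(self, i):
        return bool((self.bits[i // 8] >> (i % 8)) & 1)

    def set_used(self, i):
        self.bits[i // 8] |= 1 << (i % 8)

    def set_free(self, i):
        self.bits[i // 8] &= ~(1 << (i % 8)) & 0xFF

    def allocate(self):
        """Find the first free item, mark it used, and return its index (None if full)."""
        for i in range(self.nbits):
            if not self.is_used(i):
                self.set_used(i)
                return i
        return None
```

The same class can serve as the inode bitmap or the data bitmap; only the number of bits differs.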
The superblock contains information about the particular file system including how many inodes and
data blocks are in the file system, where the inode table begins and so on.
When mounting a file system, the operating system will read the superblock to initialize various
parameters and attach the volume to the file-system tree. When the files within the volume are
accessed, the system will know exactly where to look for the needed structures.
Now, let us develop a simple file system with a block size of 4KB. Assume that we have a small disk
with just 64 blocks, as shown below.
Now add the data region for user data. Assume that the last 56 of the 64 blocks are reserved for it,
as shown.
The part represented with D is the data region in our small file system. Since we also need to keep
track of the files, inodes are used, and we reserve some space on the disk for them as well.
In our file system, 5 blocks are used for the inode table. Inodes are not that big; they are
typically 128 or 256 bytes. Assuming 256 bytes per inode, a 4KB block can hold 16 inodes, so the
5 blocks hold 80 inodes in total.
Now, the allocation structures must be given space. Here we use bitmaps, one block each: the inode
bitmap is denoted as i and the data bitmap is denoted as d.
Now, we are left with one block in our file system. This block is reserved for the superblock,
denoted by S. The superblock contains information about this particular file system: for example,
how many inodes and data blocks it has (here 80 inodes and 56 data blocks) and where the inode
table begins (here at block 3).
This completes the layout of the simple file system.
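The layout can be summarized in a short Python sketch. The block counts come from the text; placing the superblock in block 0 and the inode bitmap before the data bitmap is an assumption made here, consistent with the inode table starting at block 3. The names REGIONS and layout are illustrative.

```python
BLOCK_SIZE = 4096   # 4KB blocks
NUM_BLOCKS = 64     # size of the small disk
INODE_SIZE = 256    # bytes per inode, as assumed in the text

# (label, first block, number of blocks)
# Assumption: S, i, d occupy blocks 0, 1, 2 in that order.
REGIONS = [
    ("S", 0, 1),    # superblock
    ("i", 1, 1),    # inode bitmap
    ("d", 2, 1),    # data bitmap
    ("I", 3, 5),    # inode table
    ("D", 8, 56),   # data region
]

# One character per block, in disk order.
layout = "".join(label * count for label, _, count in REGIONS)

inodes_per_block = BLOCK_SIZE // INODE_SIZE   # inodes that fit in one block
num_inodes = inodes_per_block * 5             # five inode-table blocks
inode_table_start = 3 * BLOCK_SIZE            # byte address of block 3

print(layout)
```

With these values, each block holds 16 inodes, the table holds 80 inodes, and the inode table begins at byte address 12KB.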
File Organization – The Inode: One of the most important on-disk structures of a file system is the
inode. The name inode is short for index node, the historical name given to it in UNIX.
Inodes were originally arranged in an array, and the array was indexed when accessing a
particular inode. Each inode is referred to by a number called the i-number, which is also called the
low-level name of the file.
Given an i-number, one should be able to calculate directly where on the disk the corresponding
inode is located. For example, take the inode table from the above file system: it is 20KB in size
(5 x 4KB blocks) and consists of 80 inodes, assuming each inode is 256 bytes. As the figure shows,
the inode region starts at 12KB.
To read inode number 32, the file system first calculates the offset into the inode region:
32 * sizeof(inode) = 32 * 256 = 8192.
It adds this offset to the start address of the inode table on disk, inodeStartAddr = 12KB, to arrive
at the correct byte address of the desired block of inodes, 20KB. Disks are not byte addressable;
instead they consist of a large number of addressable sectors, each usually 512 bytes in size.
To fetch the block of inodes that contains inode 32, the file system issues a read to the
corresponding sector. The block blk and the sector address sector are calculated as follows:
blk = (inumber * sizeof(inode)) / blockSize
sector = ((blk * blockSize) + inodeStartAddr) / sectorSize
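As a worked check of this calculation for inode 32, here is a minimal Python sketch. The constants are the values assumed in the text (4KB blocks, 512-byte sectors, 256-byte inodes, inode table at 12KB); the function name locate_inode is illustrative.

```python
BLOCK_SIZE = 4096              # bytes per block
SECTOR_SIZE = 512              # bytes per sector
INODE_SIZE = 256               # bytes per inode
INODE_START_ADDR = 12 * 1024   # inode table starts at 12KB


def locate_inode(inumber):
    """Return (blk, sector, byte_addr) for the given i-number.

    blk is the block index within the inode table, sector is the disk sector
    at which that block starts, and byte_addr is the inode's own byte address.
    """
    offset = inumber * INODE_SIZE
    blk = offset // BLOCK_SIZE
    sector = (blk * BLOCK_SIZE + INODE_START_ADDR) // SECTOR_SIZE
    return blk, sector, INODE_START_ADDR + offset


blk, sector, byte_addr = locate_inode(32)
```

For inode 32 this gives block 2 of the inode table, sector 40, and a byte address of 20KB.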
Each inode contains information about a file: its type, its size, the number of blocks
allocated to it, protection information, time information such as when the file was created, modified
or last accessed, and information about where its data blocks reside on disk.
This information is referred to as metadata. An example inode from ext2 is shown in the figure.
One of the most important decisions in the inode design is how it refers to where data blocks are.
One simple approach is to have one or more direct pointers inside the inode.
Each pointer refers to one disk block that belongs to the file. Direct pointers alone limit the file
size to the number of pointers times the block size. To support bigger files, file system designers
have introduced different structures within inodes; one of them is a special pointer known as an
indirect pointer.
Instead of pointing to a block that contains user data, it points to a block that contains more
pointers, each of which points to user data. An inode may have a fixed number of direct pointers and
a single indirect pointer.
If the file grows large enough, an indirect block is allocated and the inode’s slot for an indirect
pointer is set to point to it. To support even larger files, another pointer is added to the inode:
the double indirect pointer.
This pointer refers to a block that contains pointers to indirect blocks, each of which contains
pointers to data blocks. For still larger files, a triple indirect pointer can be used. The result is
an imbalanced tree, and the scheme is referred to as the multi-level index approach to pointing to
file blocks.
Many file systems use a multi-level index, including commonly used file systems such as Linux ext2
and ext3, NetApp’s WAFL, and the original UNIX file system.
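To get a feel for how much each level of indirection adds, the following Python sketch computes how many blocks a multi-level index can address. The numbers are assumptions chosen for illustration and do not come from the text: 12 direct pointers per inode and 4-byte disk addresses, with the 4KB block size used above.

```python
BLOCK_SIZE = 4096
PTR_SIZE = 4                              # assumed: 4-byte disk addresses
NUM_DIRECT = 12                           # assumed: direct pointers per inode
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE   # pointers held by one indirect block


def max_blocks(levels):
    """Blocks addressable with the direct pointers plus `levels` indirect levels."""
    return NUM_DIRECT + sum(PTRS_PER_BLOCK ** k for k in range(1, levels + 1))


def pointer_level(block_index):
    """Return 0 if the file block is reached by a direct pointer,
    or 1, 2, 3 for the single, double or triple indirect pointer."""
    if block_index < NUM_DIRECT:
        return 0
    remaining = block_index - NUM_DIRECT
    for level in (1, 2, 3):
        span = PTRS_PER_BLOCK ** level
        if remaining < span:
            return level
        remaining -= span
    raise ValueError("file block beyond triple-indirect range")
```

Under these assumptions, the direct pointers plus one indirect block address 12 + 1024 = 1036 blocks, that is, a file of up to 4144KB. The tree is imbalanced because the first blocks of a file are reached directly while later blocks need one or more extra lookups.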
Directory Organization: In the file system, directories have a simple organization. A directory just
contains a list of (entry name, inode number) pairs.
For each file or directory in a given directory, there is a string and a number in the data blocks of the
directory. For example, assume a directory dir with inode number 5 has three files in it, foo,
bar and foobar_is_a_pretty_longname, with inode numbers 12, 13 and 24.
Each directory has two extra entries, . “dot” and .. “dot-dot”. The dot entry refers to the current
directory, here dir. The dot-dot entry refers to the parent directory, here the root.
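The contents of dir from this example can be modelled as a simple list of pairs. This Python sketch is only an in-memory illustration, not the on-disk format; the root inode number of 2 is an assumption (it is the usual value in UNIX file systems), and the function name lookup is made up.

```python
ROOT_INUM = 2  # assumption: inode number of the root directory

# Data of directory "dir" (inode 5): (entry name, inode number) pairs.
dir_entries = [
    (".", 5),              # the directory itself
    ("..", ROOT_INUM),     # its parent, here the root
    ("foo", 12),
    ("bar", 13),
    ("foobar_is_a_pretty_longname", 24),
]


def lookup(entries, name):
    """Scan the directory linearly; return the inode number, or None if absent."""
    for entry_name, inum in entries:
        if entry_name == name:
            return inum
    return None
```

For example, lookup(dir_entries, "bar") returns 13, and a name that is not in the directory yields None.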
Deleting a file may leave an empty space in the middle of the directory, so there must be some way
to mark that space as unused, for example with a reserved inode number such as zero. Such deletions
are one reason each entry stores a record length: a new entry may reuse an old, bigger entry and
thus have extra space within. File systems treat directories as a special type of file. So, a
directory has an inode in the inode table with the type field marked as “directory”.
The directory has data blocks, and possibly indirect blocks, pointed to by its inode. A simple list
is not the only option; directory entries can be stored using other data structures. For example,
XFS stores directories in B-tree form.
Free Space Management: Free space management is important for all file systems. A file system
must track which inodes and data blocks are free and which are not, so that when a new file or
directory is allocated, the file system can find space for it. In SFS, two simple bitmaps are used
for free space management: one for inodes and one for data blocks.
When a file is created, an inode must be allocated for it. The file system searches through the
inode bitmap for an inode that is free and allocates it to the file.
The file system marks the inode as used with a 1 and updates the on-disk bitmap. It uses the same
procedure for allocating data blocks, although other considerations may also come into play when
allocating data blocks for a new file.
For example, some Linux file systems such as ext2 and ext3 look for a sequence of free blocks when
a new file is created and needs data blocks. By finding such a sequence and allocating it to the
new file, the file system guarantees that a portion of the file will be contiguous on the disk,
thereby improving performance. This pre-allocation policy is commonly used when allocating space
for data blocks.
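The search for a sequence of free blocks can be sketched as a scan over the data bitmap. This is a simplified Python illustration, with the bitmap held as a list of 0/1 values; it is not the actual ext2 or ext3 allocator, and the function names are hypothetical.

```python
def find_free_run(bitmap, run_length):
    """Return the index of the first run of `run_length` free (0) bits, or None."""
    run_start, count = 0, 0
    for i, bit in enumerate(bitmap):
        if bit == 0:
            if count == 0:
                run_start = i          # a new run of free blocks begins here
            count += 1
            if count == run_length:
                return run_start
        else:
            count = 0                  # run interrupted by a block in use
    return None


def allocate_run(bitmap, run_length):
    """Mark the first free run of `run_length` blocks as used; return its start."""
    start = find_free_run(bitmap, run_length)
    if start is not None:
        for i in range(start, start + run_length):
            bitmap[i] = 1
    return start
```

If no run of the requested length exists, the sketch returns None; a real file system would then fall back to allocating whatever free blocks it can find.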
Access Paths: To understand how a file system works, one should follow the flow of operations during
the activity of reading or writing a file. These flows are the access paths for a file.
Reading a file from disk: Assume that the file system has been mounted and that the superblock is
therefore already in memory. Everything else, such as inodes and directories, is still on the disk.
For example, assume that we simply want to open a file, /foo/bar, read it, and then close it. Let us
assume the file is just 12KB in size (three blocks). When an open system call for reading the file
is issued, the file system first needs to find the inode for the file bar, to obtain information
about the file such as its access permissions and size.
To do so, the file system must traverse the pathname and locate the desired inode. All traversals
begin at the root of the file system, in the root directory, which is simply called “/”.
The file system first has to find the inode of the root directory by using its i-number. Normally
an i-number is found in the parent directory, but the root has no parent, so its inode number must
be well known: the file system must know it when the file system is mounted.
In most UNIX file systems, the inode number of the root is 2. So the file system reads in the block
that contains inode number 2, which is the first inode block. In the inode, it finds pointers to the
data blocks that contain the contents of the root directory.
The file system uses these pointers to read the directory and look for an entry for “foo”. When the
entry is found, the file system has the inode number of foo. It then reads the inode of “foo” and
its directory data to find an entry for “bar”.
When the entry is found, the file system has the inode number of bar. The final step of open()
is to read the inode of the file bar into memory. The file system checks the access permissions,
allocates a file descriptor for this process in the per-process open-file table, and returns it to
the user.
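The traversal can be simulated with a toy in-memory "disk". In this Python sketch the inode numbers 44 and 45 for foo and bar are made up for illustration, directory contents are kept inside the inode dictionary for brevity, and each simulated disk read is recorded in a log so the reads can be counted.

```python
ROOT_INUM = 2  # well-known inode number of the root directory

# Toy disk: inode number -> inode. Inode numbers 44 and 45 are hypothetical.
INODES = {
    2: {"type": "dir", "entries": {"foo": 44}},
    44: {"type": "dir", "entries": {"bar": 45}},
    45: {"type": "file", "size": 12 * 1024},
}


def open_path(path, log):
    """Walk `path` from the root and return the inode number of its last component.

    Every simulated disk read is appended to `log`.
    """
    inum = ROOT_INUM
    log.append(f"read inode {inum}")
    for name in path.strip("/").split("/"):
        inode = INODES[inum]
        if inode["type"] != "dir":
            raise NotADirectoryError(name)
        log.append(f"read data of directory inode {inum}")
        if name not in inode["entries"]:
            raise FileNotFoundError(path)
        inum = inode["entries"][name]
        log.append(f"read inode {inum}")
    return inum


reads = []
bar_inum = open_path("/foo/bar", reads)
```

Opening /foo/bar in this model takes five reads: the root inode, the root directory data, the inode of foo, the data of foo, and finally the inode of bar.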
After opening the file “bar”, the program can issue a read() system call to read from the file. The
first read reads the first block of the file, finding the location of the block through the inode.
The read also updates the file offset in the in-memory open-file table, so that the next read
continues with the following block of the file. When the file is closed, the file descriptor is
deallocated. The entire process is as shown.
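The role of the file offset can be illustrated with a small Python sketch of an open-file table entry. The class OpenFile and the disk block numbers 20, 21 and 22 are hypothetical; the sketch assumes the 12KB file from the example, reached through three direct pointers, and reads one block at a time.

```python
BLOCK_SIZE = 4096


class OpenFile:
    """Hypothetical open-file table entry: inode information plus the current offset."""

    def __init__(self, size, block_ptrs):
        self.size = size              # file size in bytes, from the inode
        self.block_ptrs = block_ptrs  # direct pointers, from the inode
        self.offset = 0               # next byte to be read

    def read_block(self):
        """Return the disk block holding the current offset and advance the offset.

        Returns None once the offset has reached the end of the file.
        """
        if self.offset >= self.size:
            return None
        disk_block = self.block_ptrs[self.offset // BLOCK_SIZE]
        self.offset = min(self.offset + BLOCK_SIZE, self.size)
        return disk_block


f = OpenFile(size=12 * 1024, block_ptrs=[20, 21, 22])
blocks_read = [f.read_block() for _ in range(4)]
```

Three reads return the three data blocks in order, and the fourth finds the offset at the end of the file.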
Writing a file to disk: Writing to a file is a similar process. First the file must be opened, then the
application can issue write() system calls to update the file with new contents. Finally the file is
closed.
Unlike reading, writing to a file may also allocate a block. When writing out a new file, each write
not only has to write data to disk but also has to decide which block to allocate to the file.
Each such write logically generates five I/Os: one to read the data bitmap, one to write the bitmap,
two to read and then write the inode, and one to write the actual block itself.
To create a file, the file system must not only allocate an inode, but also allocate space within the
directory containing the new file. The total amount of I/O traffic for this is high: the file system
must read and write the inode bitmap, write the new inode, write the directory data, and read and
write the directory inode. If the directory needs to grow to accommodate the new entry,
additional I/Os are needed.
For example, the file /foo/bar is created and three blocks are written to it, as shown in the figure.
In the figure, reads and writes to the disk are grouped under the system call that caused them,
and time runs from top to bottom. In this case, 10 I/Os are needed to walk the
pathname and then create the file.
Each allocating write then costs 5 I/Os: a read and a write to update the inode, another read and
write to update the data bitmap, and a final write for the data itself.
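The total for this example can be tallied in a few lines of Python. The counts are the ones stated in the text (10 I/Os for the create, 5 per allocating write); the names are illustrative.

```python
CREATE_IOS = 10   # walk /foo/bar and create the file, as stated in the text

# The five I/Os caused by each write() that allocates a new block.
ALLOCATING_WRITE_IOS = [
    "read data bitmap",
    "write data bitmap",
    "read inode",
    "write inode",
    "write data block",
]


def total_ios(num_blocks_written):
    """I/Os to create the file and then write `num_blocks_written` new blocks."""
    return CREATE_IOS + num_blocks_written * len(ALLOCATING_WRITE_IOS)
```

Creating /foo/bar and writing three blocks to it therefore costs 10 + 3 * 5 = 25 I/Os in this model.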
Caching and Buffering: Reading and writing files can be expensive, because they incur many I/Os to
the disk. To remedy what would otherwise be a huge performance problem, file systems use
system memory to cache important blocks.
In the example above, without caching, every file open requires at least two reads for every level
in the directory hierarchy: one to read the inode of the directory and at least one to read its
data. If the file has a long pathname, the system may have to perform hundreds of
reads just to open it.
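This lower bound is easy to compute. The Python sketch below assumes exactly one data-block read per directory (a large directory may need more) and adds one read for the inode of the file itself; the function name reads_to_open is illustrative.

```python
def reads_to_open(path):
    """Lower bound on disk reads needed to open `path` when nothing is cached."""
    components = [c for c in path.split("/") if c]
    # Directories traversed: the root plus every intermediate directory,
    # which equals the number of path components.
    num_dirs = len(components)
    # Two reads per directory (inode + data), then the file's own inode.
    return 2 * num_dirs + 1
```

For /foo/bar this gives 5 reads, and a file nested 100 directories deep needs more than 200 reads just to be opened.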
Early file systems introduced a fixed-size cache to hold popular blocks. Replacement strategies such
as LRU (least recently used) and its variants decide which blocks to keep in the cache.
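A fixed-size block cache with LRU replacement can be sketched with Python's collections.OrderedDict. The class name LRUBlockCache and the read_from_disk callback are assumptions made for this illustration; a real cache would also handle dirty blocks.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Fixed-size cache of disk blocks with least-recently-used replacement."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk   # callable: block number -> data
        self.blocks = OrderedDict()            # oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_no)  # now the most recently used
            return self.blocks[block_no]
        self.misses += 1
        data = self.read_from_disk(block_no)   # cache miss: go to the disk
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return data
```

Every hit is a disk read avoided, which is why repeated opens of the same file become cheap once its directory blocks and inodes are cached.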
This fixed-size cache was usually allocated at boot time, at roughly 10% of the total memory. With
this static partitioning, unused pages in the file cache cannot be reused for other purposes and are
thus wasted.
Modern systems use a dynamic partitioning approach. Many modern operating systems integrate
virtual memory pages and file system pages into a unified page cache.
In this way, memory can be allocated more flexibly across virtual memory and the file system,
depending on which needs more memory at a given time. Now, returning to the example above, imagine
the file open with caching.
The first open may generate a lot of I/O traffic to read in the directory inodes and data. But
subsequent opens of the same file mostly hit in the cache, and no I/O is needed. With a sufficiently
large cache, read I/O can be avoided altogether.
Writes are different: write traffic has to go to disk to become persistent, so a cache does not
filter write I/O the way it filters reads. Even so, write buffering has a number of
performance benefits. By delaying writes, the file system can batch some updates into a smaller set
of I/Os.
By buffering a number of writes in memory, the system can schedule the subsequent I/Os and
increase performance. Some writes are avoided altogether by delaying them, for example when a file
is created and then deleted before its blocks reach the disk. For these reasons, the file system
buffers writes in memory for between five and thirty seconds.
This is a trade-off: if the system crashes before the updates have been written to disk, the updates
are lost. By keeping writes in memory longer, however, performance can be improved by batching,
scheduling and even avoiding writes.
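The three benefits of write buffering (batching, scheduling and avoiding writes) can be seen in a small Python sketch. The class WriteBuffer and the write_to_disk callback are hypothetical; a real file system would also flush on a timer rather than only on request.

```python
class WriteBuffer:
    """Holds dirty blocks in memory until flush() sends them to the disk."""

    def __init__(self, write_to_disk):
        self.write_to_disk = write_to_disk   # callable: (block number, data)
        self.dirty = {}                      # block number -> latest data

    def write(self, block_no, data):
        # Batching: a later write to the same block replaces the earlier one.
        self.dirty[block_no] = data

    def discard(self, block_no):
        # Avoiding: e.g. the file was deleted before its block reached the disk.
        self.dirty.pop(block_no, None)

    def flush(self):
        """Write all dirty blocks in block-number order; return how many were written."""
        for block_no in sorted(self.dirty):  # scheduling: issue in disk order
            self.write_to_disk(block_no, self.dirty[block_no])
        count = len(self.dirty)
        self.dirty.clear()
        return count
```

Anything still in the buffer when the system crashes is lost, which is the cost of these savings.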
To avoid unexpected data loss due to write buffering, some applications such as databases
force writes to disk by calling fsync(), by using direct I/O interfaces that work around the cache,
or by using the raw disk interface and avoiding the file system altogether.
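From an application's point of view, forcing a write to disk looks roughly like the following Python sketch, which uses the standard os.fsync() call. The function name durable_write is illustrative; a careful application would also consider syncing the containing directory after creating a new file.

```python
import os
import tempfile


def durable_write(path, data):
    """Write `data` to `path` and ask the OS to force it to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's user-space buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its buffered data to disk


# Usage: write a small record into a temporary directory.
tmp_dir = tempfile.mkdtemp()
record_path = os.path.join(tmp_dir, "record.bin")
durable_write(record_path, b"committed")
```

The call to fsync() returns only after the operating system reports that the data has been written, so the application trades some performance for durability.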