II BSC IT
DATA STRUCTURES
UNIT-4
EXTERNAL SORTING
4.1 Storage Devices
4.2 Sorting with Disks
4.3 K-Way Merging
4.4 Sorting with Tapes
4.5 Symbol Tables
4.6 Static & Dynamic Tree Tables
4.7 Hash Tables
4.8 Hashing Functions
4.9 Overflow Handling
4.1 STORAGE DEVICES
A storage device is a piece of hardware, also known as storage, storage medium, digital
storage, or storage media, that can store information either temporarily or
permanently. Generally, it is used to hold, transport, and retrieve data files.
It can be used either internally or externally to a computer system, server, or any
comparable computing device to hold information.
For any computing device, a storage device is one of the core components, and it is
available in several structures and sizes depending on requirements and
functionality.
Storage devices come in various form factors; for example, a computer includes
different storage media such as a hard disk, RAM, and cache. Computers may also have
optical disk drives and externally connected USB drives.
Two types of storage devices
1. Primary storage devices
2. Secondary storage devices
Primary storage devices: They are fitted internally to the computer and are very fast
in terms of accessing data. RAM and cache memory are examples of primary storage
devices.
Secondary storage devices: The hard disk, USB storage devices, and optical disk
drive are examples of secondary storage devices, which are designed to store data
permanently. They offer a larger storage capacity than primary storage devices.
Why is storage needed in a computer?
Without a storage device, a computer would be considered a dumb terminal: it cannot
store or hold any type of information or settings.
Although a computer can run without storage media, it can then only view or read
information, unless it is connected to another computer that has storage
capabilities. Even everyday tasks such as browsing the Internet require a storage
device to hold information.
What is a storage location?
When you store information on a computer or a similar device, it may ask you for the
storage location where you want to save the information. By default, most types of
data are stored on the computer's hard disk.
If you want to move this information to another device, you need to transfer it to
another storage medium, such as a USB flash drive, which enables you to move it to any
other computer.
Why so many different storage devices?
As the use of computers increases rapidly, the technologies used to store data are
also advancing day by day because of the growing need for storage capacity. New
technologies are needed because storage devices are used more and more, and people
want to carry their data with them. As new storage devices are invented, people
replace older devices with them, so the older devices fall out of use.
Floppy diskettes were replaced by CD-ROM drives, and CD-ROM drives were in turn
replaced by DVD drives. Flash drives were then designed to replace DVD drives. The
first hard disk drive from IBM, which held only 5 MB, cost $50,000. Today, smartphones
offer far more storage capacity at a much lower price and can easily be carried in a
pocket. Furthermore, every enhancement in storage devices enables a computer system
to store more data and to access it faster.
Examples of computer storage
Magnetic storage devices
Nowadays, magnetic storage is commonly found in hybrid hard drives or extremely large
HDDs.
A list is given below of magnetic storage devices:
Floppy diskette: A floppy disk drive (FDD) lets users save data to removable
diskettes. FDDs have been replaced by other means of storing and moving data, such as
USB drives and network file transfer.
Hard drive: A hard disk drive (HDD) is a non-volatile computer storage device used to
store data permanently. It is connected directly to the disk controller on the
computer's motherboard. It is usually installed internally in a computer and is known
as a secondary storage device.
Magnetic card: A magnetic card is a card that may hold information about an
individual, such as passcodes to enter secure buildings or the available credit on a
credit card.
Tape cassette: A tape cassette is a flat, rectangular container holding magnetic tape
that is capable of storing data. Compared with other storage media, it is less
expensive and is commonly used for backing up large amounts of data.
Zip diskette: A Zip drive is a hardware data storage device, developed by Iomega, that
is an advanced version of the floppy disk drive. It functions like a standard 1.44 MB
floppy drive and diskette. It became very popular in the late 1990s because it could
store amounts of data that were not possible with ordinary floppy disks.
Optical storage devices
Optical storage devices include the following:
1. Blu-ray disc
2. CD-ROM disc
3. CD-R and CD-RW disc
4. DVD-R, DVD+RW, DVD+R, and DVD-RW disc
Flash memory devices
Flash memory is inexpensive and portable. Because it has become a more reliable and
efficient solution, flash memory has replaced most magnetic and optical media.
Flash drive: A USB flash drive is a portable storage device used for data storage. It
is also known as a pen drive, thumb drive, data stick, or keychain drive. It connects
to a computer via a USB port and is often the size of a human thumb.
Memory card: A memory card is commonly used in digital cameras, printers, MP3
players, PDAs, digital camcorders, game consoles, and handheld computers. For many
years the most common memory card format was CompactFlash; today the common formats
are CFexpress, SD, MicroSD, and XQD.
Compact Flash (CF): Compact Flash is a type of flash memory that is commonly found
in digital cameras, PDAs and other portable devices. It is a 50-pin connection storage
device that is capable of storing data ranging from 2 MB to 128 GB.
Multi Media Card: A MultiMediaCard or MMC is an Integrated Circuit that is used in
car radios, printers, PDAs, MP3 players, and digital cameras. It acts as external storage
for data. The MMCP (MMCplus) and MMCM (MMCmobile / MMCmicro) are the
variations of the MMC card.
Sony Memory Stick: Sony Memory Stick is a family of flash memory cards, first
introduced by Sony in October 1998. It is designed for digital storage in cameras and
other Sony products.
SD Card: An SD Card (Secure Digital Card) is most commonly used with electronics that
are designed to offer high-capacity memory in a small size. It is often used in small
portable devices such as cell phones, digital cameras, digital video camcorders, and
MP3 players. It is used by more than 400 brands of electronic devices.
Online and cloud storage
The need to store data online and in cloud storage is increasing rapidly.
Cloud storage: Cloud storage is a cloud computing model that transmits and holds
data on remote storage systems, where a cloud computing provider manages, maintains,
and makes the data available to users over a network. It offers users reliability,
confidentiality, durability, and access to their data at any time.
Network media: Network media refers to any audio, video, images, or text used on a
computer network such as the Internet.
Paper storage
Early computers could not store data on storage technologies such as flash memory
devices or optical storage devices; they had to depend on paper. In modern times,
paper storage is rarely used or found.
Punch card: Punch cards, also known as Hollerith cards or IBM cards, store data in
the form of small punched holes. A punch card is a simple piece of paper stock that
was widely used to input data into early computers.
OMR: It stands for optical mark recognition or optical mark reading. It is a method of
capturing data marked by people by identifying certain markings on a document, such
as checkboxes and fill-in fields on printed forms. Generally, the OMR process is
accomplished by a scanner that detects the reflection or transmission of light
through the paper. This technology is well suited to applications such as ballots,
reply cards, surveys, and questionnaires, which require a large number of hand-filled
forms to be processed quickly and accurately.
4.2 SORTING WITH DISKS
Sorting refers to the operation or technique of arranging and rearranging sets of data
in some specific order.
A collection of records is called a list, where every record has one or more fields. A
field that contains a unique value for each record is termed the key field.
For example, a phone directory can be thought of as a list where each record has
three fields: the 'name' of a person, the 'address' of that person, and their 'phone number'.
1. Being unique, the phone number can work as a key to locate any record in the list.
2. Sorting is the operation performed to arrange the records of a table or list in some
order according to a specific ordering criterion. Sorting is performed according
to the key value of each record.
3. Records are sorted either numerically or alphanumerically, and are arranged in
ascending or descending order of the key value.
4. For example, the list of marks obtained by the students of a class in a particular
subject can be sorted.
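The idea of sorting records on a key field can be sketched in Python as follows. The directory entries are made-up sample data, and the choice of the phone number as the key follows the example above; this is an illustration, not a prescribed implementation.

```python
# Hypothetical directory records: (name, address, phone number).
directory = [
    ("Ravi", "12 Lake Road", "9840012345"),
    ("Anita", "7 Hill Street", "9840098765"),
    ("Kumar", "3 Park Avenue", "9840054321"),
]

# Sort in ascending order of the key field (the phone number, index 2).
by_phone = sorted(directory, key=lambda record: record[2])

# Sort on a different field (the name, index 0) in descending order.
by_name_desc = sorted(directory, key=lambda record: record[0], reverse=True)

for record in by_phone:
    print(record)
```

The `key` argument selects the field used for comparison, and `reverse=True` switches from ascending to descending order.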
Categories of Sorting
The techniques of sorting can be divided into two categories. These are:
1. Internal Sorting
2. External Sorting
Internal Sorting: If all the data to be sorted can be held in main memory at one time,
an internal sorting method is used.
External Sorting: When the data to be sorted cannot all be accommodated in memory at
the same time, and some of it has to be kept in auxiliary memory such as a hard disk,
floppy disk, or magnetic tape, external sorting methods are used.
The Complexity of Sorting Algorithms
The complexity of a sorting algorithm expresses its running time as a function of 'n', the
number of items to be sorted. Which sorting method is suitable for a problem depends on
several considerations, which differ from problem to problem. The most noteworthy of
these considerations are:
1. The length of time spent by the programmer in programming a specific sorting
program
2. Amount of machine time necessary for running the program
3. The amount of memory necessary for running the program
The Efficiency of Sorting Techniques
To get the amount of time required to sort an array of 'n' elements by a particular method,
the normal approach is to analyze the method to find the number of comparisons (or
exchanges) required by it. Most sorting techniques are data sensitive, so their
metrics depend on the order in which the elements appear in the input array.
Sorting techniques are therefore analyzed in the following cases:
1. Best case
2. Worst case
3. Average case
The result of analyzing these cases is often a formula giving the time required for a
particular sort of size 'n'. Most sorting methods have time requirements that range from
O(n log n) to O(n^2).
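To make the notion of data sensitivity concrete, the sketch below counts the key comparisons made by insertion sort (chosen here only as a convenient illustration) on an already sorted input and on a reversed input. The function name insertion_sort_count is an assumption introduced for this example.

```python
def insertion_sort_count(items):
    """Sort a copy of items with insertion sort and count key comparisons."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1          # one comparison of key against a[j]
            if a[j] > key:
                a[j + 1] = a[j]       # shift the larger element right
                j -= 1
            else:
                break
        a[j + 1] = key
    return a, comparisons

n = 8
best_input = list(range(n))           # already sorted: best case
worst_input = list(range(n, 0, -1))   # reversed: worst case

_, best = insertion_sort_count(best_input)
_, worst = insertion_sort_count(worst_input)
print(best, worst)                    # n - 1 versus n * (n - 1) / 2
```

For the same n, the sorted input needs n - 1 comparisons while the reversed input needs n(n - 1)/2, which is why the best and worst cases are analyzed separately.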
Types of Sorting Techniques
Bubble Sort
Selection Sort
Merge Sort
Insertion Sort
Quick Sort
Heap Sort
4.3 K-WAY MERGING
k-way merge algorithms, or multiway merges, are a specific type of sequence merge
algorithm that takes in k sorted lists and merges them into a single sorted list. The
term generally refers to merge algorithms that take in more than two sorted lists.
The k-way merge problem consists of merging k sorted arrays to produce a single
sorted array with the same elements. Let n denote the total number of elements; n is
equal to the size of the output array and to the sum of the sizes of the k input arrays.
Merge algorithms are a family of algorithms that take multiple sorted lists as input
and produce a single list as output, containing all the elements of the input lists in
sorted order. These algorithms are used as subroutines in various sorting algorithms,
most famously merge sort.
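One common way to perform a k-way merge is to keep the current front element of each list in a min-heap. The sketch below assumes this heap-based approach; the function name k_way_merge is introduced for illustration.

```python
import heapq

def k_way_merge(sorted_lists):
    """Merge k sorted lists into one sorted list using a min-heap."""
    heap = []
    # Seed the heap with the first element of every non-empty list.
    for list_index, lst in enumerate(sorted_lists):
        if lst:
            heapq.heappush(heap, (lst[0], list_index, 0))

    merged = []
    while heap:
        value, list_index, position = heapq.heappop(heap)
        merged.append(value)
        next_position = position + 1
        # Replace the removed element with the next one from the same list.
        if next_position < len(sorted_lists[list_index]):
            next_value = sorted_lists[list_index][next_position]
            heapq.heappush(heap, (next_value, list_index, next_position))
    return merged

print(k_way_merge([[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]))
```

The heap never holds more than k entries, so each of the n output elements costs O(log k) work, giving O(n log k) overall.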
Merge sort is an algorithm that follows the divide and conquer approach. Consider an array
A of n elements.
The algorithm processes the elements in 3 steps:
1. Divide: if A contains 0 or 1 elements, it is already sorted; otherwise, divide A into two
sub-arrays with an equal (or nearly equal) number of elements.
2. Conquer: sort the two sub-arrays recursively using merge sort.
3. Combine: merge the two sorted sub-arrays into a single final sorted array, maintaining
the ordering.
The main idea behind merge sort is that a short list takes less time to sort.
Example:
Consider the following array of 7 elements. Sort the array by using merge sort.
A = {10, 5, 2, 23, 45, 21, 7}
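A minimal Python sketch of the divide, conquer, and combine steps, applied to the array A above, might look like this. The helper names merge and merge_sort are illustrative.

```python
def merge(left, right):
    """Combine two sorted lists into one sorted list."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One of the two lists is exhausted; append the remainder of the other.
    result.extend(left[i:])
    result.extend(right[j:])
    return result

def merge_sort(a):
    """Return a sorted copy of a using merge sort."""
    if len(a) <= 1:                  # 0 or 1 elements: already sorted
        return list(a)
    mid = len(a) // 2                # divide
    left = merge_sort(a[:mid])       # conquer the left half
    right = merge_sort(a[mid:])      # conquer the right half
    return merge(left, right)        # combine

A = [10, 5, 2, 23, 45, 21, 7]
print(merge_sort(A))
```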
4.4 SORTING WITH TAPES
Sorting with tapes is essentially similar to the merge sort used for sorting with disks.
The differences arise due to the sequential access restriction of tapes.
This makes the selection time prior to data transmission an important factor, unlike
seek time and latency time. Thus, in sorting with tapes we will be more concerned
with arrangement of blocks and runs on the tape so as to reduce the selection or
access time.
Example: A file of 6000 records is to be sorted. It is stored on a tape, and the block length is
500 records. The main memory can sort up to 1000 records at a time. In addition, we have 4
scratch tapes, T1 to T4. The steps in merging can be summarized as follows:
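As a rough companion to this example, the Python sketch below simulates the same numbers in memory: 6000 records and room to sort 1000 at a time. It assumes plain lists standing in for tapes and a simple 2-way merge per pass; the names make_runs and merge_passes are illustrative, and the sketch does not model tape positioning or how runs are distributed over T1 to T4.

```python
import heapq
import random

def make_runs(records, memory_capacity):
    """Internally sort successive chunks that fit in memory (run generation)."""
    return [sorted(records[i:i + memory_capacity])
            for i in range(0, len(records), memory_capacity)]

def merge_passes(runs):
    """Repeatedly 2-way merge adjacent runs until a single run remains."""
    passes = 0
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]                 # one or two runs
            next_runs.append(list(heapq.merge(*pair)))
        runs = next_runs
        passes += 1
    return (runs[0] if runs else []), passes

random.seed(0)                                   # sample data, not from the text
records = [random.randint(0, 99999) for _ in range(6000)]

runs = make_runs(records, memory_capacity=1000)
sorted_file, passes = merge_passes(runs)
print(len(runs), "initial runs,", passes, "merge passes")
```

With these numbers the internal sort produces 6 runs of 1000 records, and 2-way merging reduces them 6 to 3 to 2 to 1.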
4.5 SYMBOL TABLE
A symbol table is a data structure used by a language translator such as a compiler or
interpreter, where each identifier (or symbol) in a program's source code is
associated with information relating to its declaration or appearance in the source.
The symbol table is an important data structure used in a compiler. It stores
information about the occurrence of various entities such as objects, classes,
variable names, interfaces, and function names. It is used by both the analysis and
synthesis phases.
The symbol table is used for the following purposes:
It stores the names of all entities in a structured form in one place.
It is used to verify whether a variable has been declared.
It is used to determine the scope of a name.
It is used to implement type checking by verifying that assignments and expressions in
the source code are semantically correct.
A symbol table can be implemented either as a linear list or as a hash table. It maintains
an entry for each name in the following format:
<symbol name, type, attribute>
For example, suppose the symbol table stores information about the following variable
declaration:
static int salary
Its entry would then be:
<salary, int, static>
Operations:
A symbol table, whether linear or hashed, should provide the following operations.
allocate: to allocate a new empty symbol table.
free: to remove all entries and free the storage of a symbol table.
insert: to insert a name in a symbol table and return a pointer to its entry.
lookup: to search for a name and return a pointer to its entry.
set_attribute: to associate an attribute with a given entry.
get_attribute: to get an attribute associated with a given entry.
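The operations listed above could be sketched in Python as follows. This is a simplified illustration, assuming a hashed table backed by a Python dict and entries represented as small dictionaries; the class name SymbolTable and the entry layout are assumptions, not a real compiler's API.

```python
class SymbolTable:
    """A toy hashed symbol table; creating an instance plays the role of allocate."""

    def __init__(self):
        self._entries = {}                      # name -> entry

    def free(self):
        """Remove all entries."""
        self._entries.clear()

    def insert(self, name, type_, attribute=None):
        """Insert a name and return its entry (standing in for a pointer)."""
        entry = {"name": name, "type": type_, "attribute": attribute}
        self._entries[name] = entry
        return entry

    def lookup(self, name):
        """Return the entry for name, or None if it was never declared."""
        return self._entries.get(name)

    def set_attribute(self, entry, attribute):
        entry["attribute"] = attribute

    def get_attribute(self, entry):
        return entry["attribute"]

table = SymbolTable()
entry = table.insert("salary", "int")           # from: static int salary
table.set_attribute(entry, "static")
print(table.lookup("salary"))
print(table.lookup("bonus"))                    # undeclared name
```

A lookup that returns None is how this sketch would signal that a variable has not been declared.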
4.6 STATIC & DYNAMIC TREE TABLE
A data structure is a way of storing and organizing data so that the required
operations on it can be performed efficiently with respect to both time and
memory.
Simply put, a data structure is used to reduce the complexity (mostly the time
complexity) of the code.
Data structures can be of two types:
1. Static Data Structure
2. Dynamic Data Structure
What is a Static Data Structure?
In a static data structure, the size of the structure is fixed. The content of the data
structure can be modified, but without changing the memory space allocated to it.
Example of a static data structure: Array
What is a Dynamic Data Structure?
In a dynamic data structure, the size of the structure is not fixed and can be modified by
the operations performed on it. Dynamic data structures are designed to facilitate changes
to the structure at run time.
Example of a dynamic data structure: Linked List
Static Data Structure vs Dynamic Data Structure
A static data structure has a fixed memory size, whereas in a dynamic data structure the
size can be changed as needed during run time, which may be considered efficient with
respect to the memory complexity of the code.
A static data structure provides easier access to elements than a dynamic data
structure. Unlike static data structures, dynamic data structures are flexible.
A binary search tree holds data items in sorted order by following one simple
rule.
Rule: The LEFT node always contains values that come before the root node, and the RIGHT
node always contains values that come after the root node.
For numbers, this means the left sub-tree contains numbers less than the root and the right
sub-tree contains numbers greater than the root. For words, as in a sorted
dictionary, the order is alphabetic.
Example - forming a binary search tree.
A sequence of numbers is to be formed into a binary search tree. The numbers arrive
in this order: 20, 17, 29, 22, 45, 9, 19. Task: form the binary search tree.
This is done step by step.
1. The first item is 20. It becomes the root node.
2. Each node in a binary search tree has two child positions available, the LEFT and the
RIGHT. The next number is 17. Applying the rule (left is less than the parent node), 17
becomes the LEFT child of 20.
3. The next number is 29. It is higher than the root, so it goes to the RIGHT sub-tree,
which is empty at this stage; 29 becomes the RIGHT child of 20.
4. The next number is 22. It is more than the root, so it belongs in the RIGHT sub-tree.
That position is already occupied by 29, so the rule is applied again at that node: 22
comes before 29, so it becomes the LEFT child of 29.
5. The next number is 45. It is more than the root and more than 29, so it becomes the
RIGHT child of 29.
6. The next number is 9. It is less than the root, the left child 17 is occupied, and 9 is
less than 17 too, so it becomes the LEFT child of 17.
7. The next number is 19. It is less than the root, so it belongs in the left sub-tree. It
is greater than the occupied node 17, so it becomes the RIGHT child of 17.
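The insertion rule can be sketched in Python as follows, using the same sequence of numbers. The class name Node and the functions insert and inorder are illustrative; duplicates are sent to the right sub-tree here, which is an assumption the text does not address.

```python
class Node:
    """One node of a binary search tree."""
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None

def insert(root, value):
    """Insert value below root and return the (possibly new) root."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)    # smaller values go LEFT
    else:
        root.right = insert(root.right, value)  # larger values go RIGHT
    return root

def inorder(root):
    """Visit left sub-tree, node, right sub-tree: yields values in sorted order."""
    if root is None:
        return []
    return inorder(root.left) + [root.value] + inorder(root.right)

root = None
for number in [20, 17, 29, 22, 45, 9, 19]:
    root = insert(root, number)

print(inorder(root))
```

An in-order traversal of the finished tree visits the numbers in ascending order, which is a quick check that the rule was applied consistently.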
4.7 HASH TABLE
In search techniques such as linear search, binary search, and search trees, the time
required to search for an element depends on the total number of elements present in
the data structure. In all these techniques, as the number of elements increases, the
time required to search for an element also increases (linearly for linear search,
logarithmically for binary search and balanced search trees).
Hashing is another approach, in which the time required to search for an element does
not depend on the total number of elements. Using a hashing data structure, a given
element is searched for in constant time on average. Hashing is an effective way to
reduce the number of comparisons needed to search for an element in a data structure.
Hashing is defined as follows
Hashing is the process of indexing and retrieving element (data) in a data structure
to provide a faster way of finding the element using a hash key.
Here, the hash key is a value which provides the index value where the actual data is
likely to be stored in the data structure.
In this data structure, we use a concept called Hash table to store data. All the data
values are inserted into the hash table based on the hash key value. The hash key
value is used to map the data with an index in the hash table. And the hash key is
generated for every data using a hash function. That means every entry in the hash
table is based on the hash key value generated using the hash function.
Hash Table is defined as follows
Hash Table is a data structure which stores data in an associative manner. In a hash
table, data is stored in an array format, where each data value has its own unique
index value. Access of data becomes very fast if we know the index of the desired data.
Thus, it becomes a data structure in which insertion and search operations are very
fast irrespective of the size of the data. Hash Table uses an array as a storage medium
and uses hash technique to generate an index where an element is to be inserted or
is to be located from.
A hash function is defined as follows
A hash function is a function which takes a piece of data (i.e. a key) as input and
produces an integer (i.e. a hash value) as output, which maps the data to a particular
index in the hash table.
Basic concept of hashing and hash table is shown in the following figure.
4.8 HASH FUNCTION
A hash function is a function that is applied to a key and produces an integer in
constant time. This integer is used as the address of the value in the hash table, so
that values can be stored and retrieved quickly.
Because the function always produces the same integer for the same key, the same hash
function is used both for placing data in the hash table and for accessing it later.
The integer returned by the hash function is called the hash key.
Types of hash function
There are various types of hash function that are used to place data in a hash table:
1. Division Method
2. Mid Square Method
3. Digit Folding Method
Division Method
In this method, the hash function depends on the remainder of a division. For example,
suppose the records 52, 68, 99, and 84 are to be placed in a hash table of size 10.
h(key) = key % table size
h(52) = 52 % 10 = 2
h(68) = 68 % 10 = 8
h(99) = 99 % 10 = 9
h(84) = 84 % 10 = 4
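The division method can be sketched in a few lines of Python, using the same records and a table size of 10. The function name division_hash is illustrative.

```python
def division_hash(key, table_size):
    """Division method: the index is the remainder of key / table_size."""
    return key % table_size

table_size = 10
table = [None] * table_size

for record in [52, 68, 99, 84]:
    index = division_hash(record, table_size)
    table[index] = record        # no collisions occur for these four records
    print(record, "->", index)
```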
Mid Square Method
In this method, the key is first squared, and then the middle part of the result is taken as
the index.
For example, suppose we want to place the record 3101 and the size of the table is 1000.
3101 * 3101 = 9616201, so h(3101) = 162 (the middle 3 digits).
As a further example, let the elements to be placed in the hash table be 3205, 7148, and
890, and let the size of the table be 100.
3205 * 3205 = 10272025, index = 72 as the middle part of the result (10272025) is 72.
7148 * 7148 = 51093904, index = 93 as the middle part of the result (51093904) is 93.
890 * 890 = 792100, index = 21 as the middle part of the result (792100) is 21.
Mid-square hashing is a hashing technique in which keys are generated from a seed value.
The seed is squared, and some digits from the middle of the square are extracted. These
extracted digits form a number which is taken as the new seed. This technique can
generate keys with high randomness if a big enough seed value is taken.
However, it has a limitation. Because the seed is squared, a 6-digit seed produces a
square with up to 12 digits, which exceeds the range of a 32-bit int data type. So,
overflow must be taken care of: use the long long int data type, or perform the
multiplication on strings if overflow still occurs. The chances of a collision in
mid-square hashing are low, but not zero. If a collision does occur, it is handled using
a collision-resolution technique.
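The mid-square examples above can be reproduced with the sketch below. It assumes that the number of digits extracted equals the number of digits in the largest index (table size minus 1) and that, when the middle cannot be centred exactly, the window leans one digit to the left; both conventions are assumptions chosen to match the worked examples. Python integers do not overflow, so the overflow concern does not arise in this sketch.

```python
def mid_square_hash(key, table_size):
    """Mid-square method: square the key and take its middle digits."""
    digits_needed = len(str(table_size - 1))     # e.g. 3 digits for size 1000
    squared = str(key * key).zfill(digits_needed)
    start = (len(squared) - digits_needed) // 2  # left-leaning middle window
    return int(squared[start:start + digits_needed])

print(mid_square_hash(3101, 1000))
for key in [3205, 7148, 890]:
    print(key, "->", mid_square_hash(key, 100))
```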
Digit Folding Method
In this method, the key is divided into separate parts, and these parts are combined by
some simple operation to produce a hash key.
For example, parts such as 123, 456, and 789 are combined by adding them. In the
simplest case, each part is a single digit of the key:
H(123) = 1 + 2 + 3 = 6
H(43) = 4 + 3 = 7
H(56) = 5 + 6 = 11
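Both forms of folding can be sketched in Python as follows. The function names digit_fold and fold_parts, and the sample key 123456789, are assumptions introduced for illustration.

```python
def digit_fold(key):
    """Fold a key by adding its individual decimal digits."""
    return sum(int(digit) for digit in str(key))

def fold_parts(key, part_size):
    """Fold a key by splitting it into groups of part_size digits and adding them."""
    digits = str(key)
    parts = [digits[i:i + part_size] for i in range(0, len(digits), part_size)]
    return sum(int(part) for part in parts)

print(digit_fold(123), digit_fold(43), digit_fold(56))
print(fold_parts(123456789, 3))      # 123 + 456 + 789
```

In practice the folded sum is usually reduced modulo the table size so that it is a valid index.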
4.9 OVERFLOW HANDLING
The following collision resolution techniques are used to handle overflow in a hash table:
1. Separate chaining
2. Quadratic probing
3. Double hashing
Separate Chaining
Separate chaining is one of the most commonly used collision resolution techniques.
It is usually implemented using linked lists. In separate chaining, each element of the
hash table is a linked list.
To store an element in the hash table, you insert it into a specific linked list. If
there is a collision (i.e. two different elements have the same hash value), both
elements are stored in the same linked list.
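A minimal sketch of this idea in Python is shown below. It assumes integer keys, the division method as the hash function, and a Python list standing in for each linked list; the class name ChainedHashTable is illustrative.

```python
class ChainedHashTable:
    """Hash table in which every slot holds a chain of colliding keys."""

    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # one chain per slot

    def _index(self, key):
        return key % self.size                     # division method

    def insert(self, key):
        chain = self.buckets[self._index(key)]
        if key not in chain:                       # ignore duplicates
            chain.append(key)

    def contains(self, key):
        return key in self.buckets[self._index(key)]

table = ChainedHashTable(10)
for key in (52, 68, 99, 84, 62):                   # 52 and 62 collide at index 2
    table.insert(key)

print(table.buckets[2])
print(table.contains(62), table.contains(72))
```

A search only has to scan the one chain that the key hashes to, so it stays fast as long as the chains remain short.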
Quadratic Probing
Here, when the slot at the hashed index for a record is already occupied, you probe
further slots until you find an unoccupied one.
The interval between probed slots is computed by adding the successive values of a
polynomial (here i^2, for i = 1, 2, 3, ...) to the original hashed index, taken mod the
table size.
Example: Table size is 11 (0…10)
Hash function h(x) = x mod 11
Insert keys: 20, 30, 2, 13, 25, 24, 10, 9
20 % 11 = 9
30 % 11 = 8
2 % 11 = 2
13 % 11 = 2 -> 2 + 1^2 = 3
25 % 11 = 3 -> 3 + 1^2 = 4
24 % 11 = 2 -> 2 + 1^2 = 3 (occupied), 2 + 2^2 = 6
10 % 11 = 10
9 % 11 = 9 -> 9 + 1^2 = 10 (occupied), (9 + 2^2) % 11 = 2 (occupied), (9 + 3^2) % 11 = 7
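The probing procedure can be checked with a short Python sketch. It assumes the probe sequence (h(x) + i^2) mod 11 for i = 0, 1, 2, ..., with the table size and keys from this example; the function name quadratic_insert is illustrative.

```python
def quadratic_insert(table, key):
    """Insert key using quadratic probing and return the slot it landed in."""
    size = len(table)
    home = key % size                        # h(x) = x mod table size
    for i in range(size):
        slot = (home + i * i) % size         # probe home, home+1, home+4, ...
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found for key %d" % key)

table = [None] * 11
slots = {}
for key in [20, 30, 2, 13, 25, 24, 10, 9]:
    slots[key] = quadratic_insert(table, key)
    print(key, "->", slots[key])
```

Quadratic probing does not always visit every slot, which is why the sketch gives up after a bounded number of probes instead of looping forever.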
Double hashing
Double hashing is similar to linear probing; the only difference is the interval between
successive probes. Here, the interval between probes is computed by a second hash
function, so the i-th probe examines slot (h1(key) + i * h2(key)) mod table size.
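A sketch of double hashing in Python follows. The primary hash h1(x) = x mod 11 matches the table size used in the quadratic probing example; the secondary hash h2(x) = 7 - (x mod 7) is a commonly used choice but is an assumption here, not something the text specifies. The function names are illustrative.

```python
TABLE_SIZE = 11

def h1(key):
    """Primary hash: the home slot."""
    return key % TABLE_SIZE

def h2(key):
    """Secondary hash: the probe step (assumed form; never returns 0)."""
    return 7 - (key % 7)

def double_hash_insert(table, key):
    """Insert key, stepping by h2(key) from h1(key) until a free slot is found."""
    for i in range(len(table)):
        slot = (h1(key) + i * h2(key)) % len(table)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found for key %d" % key)

table = [None] * TABLE_SIZE
slots = {}
for key in [20, 30, 2, 13, 25, 24, 10, 9]:
    slots[key] = double_hash_insert(table, key)
    print(key, "->", slots[key])
```

Because the step size depends on the key, two keys with the same home slot usually follow different probe paths, which reduces clustering compared with a fixed interval.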