File Structures by Folk, Zoellick, and Riccardi
Chap 8. Cosequential Processing
and the Sorting of Large Files
SNU-OOPSLA-LAB
Chapter Objectives(1)
Describe a class of frequently used processing activities known as cosequential processing
Provide a general object-oriented model for implementing
varieties of cosequential processes
Illustrate the use of the model to solve a number of
different kinds of cosequential processing problems,
including problems other than simple merges and
matches
Introduce heapsort as an approach to overlapping I/O with
sorting in RAM
Chapter Objectives(2)
Show how merging provides the basis for sorting very
large files
Examine the costs of K-way merges on disk and find ways
to reduce those costs
Introduce the notion of replacement selection
Examine some of the fundamental concerns associated
with sorting large files using tapes rather than disks
Introduce UNIX utilities for sorting, merging, and
cosequential processing
Contents
8.1 Cosequential operations
8.2 Application of the OO Model to a General Ledger Program
8.3 Extension of the OO Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix
8.1 An Object-Oriented Model for Implementing Cosequential Processes
Cosequential operations
Coordinated processing of two or more sequential
lists to produce a single list
Kinds of operations
merging, or union
matching, or intersection
a combination of the above
Matching Names in Two Lists(1)
Also called the intersection operation
Output the names common to the two lists
Things that must be dealt with to make the match procedure work reasonably
initializing: arranging things so that the procedure gets going properly
methods for getting and accessing the next list item
synchronizing the two lists
handling end-of-file conditions
recognizing errors
e.g. duplicate names or names out of sequence
Matching Names in Two Lists(2)
In comparing two names
if Item(1) is less than Item(2), read the next name from List 1
if Item(1) is greater than Item(2), read the next name from List 2
if the names are the same, output the name and read the next names from the two lists
Cosequential match procedure(1)
Figure: the match program reads Item(1) from List 1 and Item(2) from List 2 (using initialize() and input() methods); when the two names are the same, the name is output; when Item(1) < Item(2) the next name is read from List 1, and when Item(1) > Item(2) the next name is read from List 2.
Cosequential match procedure(2)
int Match(char * List1, char * List2, char * OutputList)
{
   int MoreItems;   // true if items remain in both of the lists

   // initialize input and output lists
   InitializeList(1, List1); InitializeList(2, List2);
   InitializeOutput(OutputList);

   // get first item from both lists
   MoreItems = NextItemInList(1) && NextItemInList(2);

   while (MoreItems) {   // loop until no items remain in one of the lists
      if (Item(1) < Item(2)) MoreItems = NextItemInList(1);
      else if (Item(1) == Item(2)) {   // match found
         ProcessItem(1);
         MoreItems = NextItemInList(1) && NextItemInList(2);
      }
      else   // Item(1) > Item(2)
         MoreItems = NextItemInList(2);
   }
   FinishUp();
   return 1;
}
General Class for Cosequential Processing(1)
template <class ItemType> class CosequentialProcess
// base class for cosequential processing
{ public:
   // the following methods provide basic list processing
   // these must be defined in subclasses
   virtual int InitializeList (int ListNumber, char * ListName) = 0;
   virtual int InitializeOutput (char * OutputListName) = 0;
   virtual int NextItemInList (int ListNumber) = 0;
      // advance to next item in this list
   virtual ItemType Item (int ListNumber) = 0;
      // return current item from this list
   virtual int ProcessItem (int ListNumber) = 0;
      // process the item in this list
   virtual int FinishUp() = 0;   // complete the processing
   // 2-way cosequential match method
   virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};
General Class for Cosequential Processing(2)
A Subclass to support lists that are files of strings, one per line
class StringListProcess : public CosequentialProcess<String &>
{ public:
   StringListProcess (int NumberOfLists);   // constructor
   // Basic list processing methods
   int InitializeList (int ListNumber, char * ListName);
   int InitializeOutput (char * OutputListName);
   int NextItemInList (int ListNumber);   // get next item from this list
   String & Item (int ListNumber);        // return current item from this list
   int ProcessItem (int ListNumber);      // process the item in this list
   int FinishUp();                        // complete the processing
protected:
   ifstream * List;        // array of list files
   String * Items;         // array of current items, one from each list
   ofstream OutputList;
   static const char * LowValue;    // used so that NextItemInList() doesn't
                                    // have to get the first item in a special way
   static const char * HighValue;   // sorts after all legal input values
};
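The full implementations of these methods are in Appendix H. As a rough sketch only (not the text's exact code), NextItemInList can substitute HighValue for the current item at end-of-file, so the match and merge loops never need special EOF handling; the fixed line-buffer size below is an assumption made for the example.

// Sketch only -- see Appendix H for the actual implementation.
int StringListProcess::NextItemInList (int ListNumber)
{
   char line[256];                        // assumed maximum line length
   if (List[ListNumber].getline(line, sizeof(line)))
   {
      Items[ListNumber] = line;           // the line just read becomes the current item
      return 1;                           // more items remain in this list
   }
   Items[ListNumber] = HighValue;         // at EOF, substitute the sentinel value
   return 0;                              // no more real items in this list
}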
General Class for Cosequential Processing(3)
Appendix H: full implementation
An example of a main program
#include "coseq.h"
int main()
{
   StringListProcess ListProcess(2);   // process with 2 lists
   ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
   return 0;
}
Merging Two Lists(1)
Based on the matching operation
Differences
must read each of the lists completely
must change the MoreItems behavior
keep this flag set to true as long as there are records in either list
HighValue
the special value (we use \xFF)
comes after all legal input values in the files, ensuring that both input files are read to completion
Merging Two Lists(2)
Cosequential merge procedure based on a single loop
This method has been added to class CosequentialProcess
No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists
   (char * List1Name, char * List2Name, char * OutputListName)
{
   int MoreItems1, MoreItems2;   // true if more items remain in each list
   InitializeList(1, List1Name);
   InitializeList(2, List2Name);
   InitializeOutput(OutputListName);
   MoreItems1 = NextItemInList(1);
   MoreItems2 = NextItemInList(2);

   while (MoreItems1 || MoreItems2) {   // loop while either list has more items
      if (Item(1) < Item(2)) {          // list 1 has the next item to be processed
         ProcessItem(1);
         MoreItems1 = NextItemInList(1);
      }
      else if (Item(1) == Item(2)) {    // same item: output it once, advance both lists
         ProcessItem(1);
         MoreItems1 = NextItemInList(1);
         MoreItems2 = NextItemInList(2);
      }
      else {                            // Item(1) > Item(2)
         ProcessItem(2);
         MoreItems2 = NextItemInList(2);
      }
   }
   FinishUp();
   return 1;
}
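By analogy with the Match2Lists driver shown earlier, a hypothetical main program for the merge (the file names are placeholders):

#include "coseq.h"
int main()
{
   StringListProcess ListProcess(2);   // process with 2 input lists
   ListProcess.Merge2Lists("list1.txt", "list2.txt", "merge.txt");
   return 0;
}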
Cosequential merge procedure(1)
Figure: the merge program reads NAME_1 from List 1 and NAME_2 from List 2; when Item(1) < Item(2), or when the names match, the name from List 1 is written to the OutputList; when Item(1) > Item(2), the name from List 2 is written to the OutputList.
Summary of the Cosequential Processing Model(1)
Assumptions
two or more input files are processed in a parallel fashion
each file is sorted
in some cases, a high key value and/or a low key value must exist
records are processed in logical sorted order
for each file, there is only one current record
records are manipulated only in internal memory
Summary of the Cosequential Processing Model(2)
Essential Components
initialization: read the first logical record from each file
one main synchronization loop, which continues as long as relevant records remain
selection in the main synchronization loop:
   if (Item(1) > Item(2)) then ..........
   else if (Item(1) < Item(2)) then .........
   else ...........   /* current keys equal */
   endif
input and output files are sequence checked by comparing the previous item value with the new one
Summary of the Cosequential Processing Model(3)
Essential components (cont'd)
substitute high values for the actual key when EOF is reached
main loop terminates when high values have occurred for all relevant input files
no special code is needed to deal with EOF
I/O and error detection are relegated to supporting methods
so the details of these activities do not obscure the principal processing logic
8.2 The General Ledger Program (1)
Account table (Fig 8.6): columns Acct-No, Acct-Title, and monthly balances for Jan, Feb, Mar, Apr, for accounts such as 101 check #1, 102 check #2, and 505 advertising.
Journal entry table (Fig 8.7): columns Acct-No, Check-No, Date, Description, Debit/Credit, with entries such as
   101   112   04/02/86   auto-repair
   505   213   05/13/86   newspaper
   540   670   04/13/86   printer
   and debit/credit amounts such as -30, -39, +60.
Ledger printout (Fig 8.8): each ledger account (e.g. 101 check #1) is followed by its posted journal entries (e.g. 1271 04/02/86 auto-expense, 1272 04/03/86 advertise) and their debit/credit amounts (-78, -30).
8.2 The General Ledger Program(2)
Ledger List and Journal List (Fig 8.10)
The ledger (master) file is keyed on account number, e.g. 101 check#1, 102 check#2.
The journal (transaction) file carries the same account numbers, e.g. 101 1271 Auto-expense, 101 1272 Rent, 101 1273 Advertising, 102 670 Office-expense.
Class MasterTransactionProcess (Fig 8.12)
Subclass LedgerProcess (Fig 8.14)
8.2 The General Ledger Program (3)
template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{ public:
   MasterTransactionProcess();   // constructor
   virtual int ProcessNewMaster() = 0;   // processing when a new master record is read
   virtual int ProcessCurrentMaster() = 0;
   virtual int ProcessEndMaster() = 0;
   virtual int ProcessTransactionError() = 0;
   // cosequential processing of master and transaction records
   int PostTransactions (char * MasterFileName, char * TransactionFileName,
      char * OutputListName);
};
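Figure 8.13 of the text contains the actual PostTransactions method; the sketch below is only meant to show the three-way selection it is built on, with the details simplified (list 1 is the master/ledger file, list 2 the transaction/journal file):

// Simplified sketch of master/transaction posting -- not the text's exact code.
template <class ItemType>
int MasterTransactionProcess<ItemType>::PostTransactions
   (char * MasterFileName, char * TransactionFileName, char * OutputListName)
{
   InitializeList(1, MasterFileName);
   InitializeList(2, TransactionFileName);
   InitializeOutput(OutputListName);
   int MoreMasters = NextItemInList(1);
   int MoreTransactions = NextItemInList(2);
   if (MoreMasters) ProcessNewMaster();        // set up the first master account
   while (MoreMasters || MoreTransactions) {
      if (Item(1) < Item(2)) {                 // no more transactions for this master
         ProcessEndMaster();                   // e.g. print the account total
         MoreMasters = NextItemInList(1);
         if (MoreMasters) ProcessNewMaster();
      }
      else if (Item(1) == Item(2)) {           // transaction matches the current master
         ProcessCurrentMaster();               // post it to the account
         MoreTransactions = NextItemInList(2);
      }
      else {                                   // transaction with no matching master
         ProcessTransactionError();
         MoreTransactions = NextItemInList(2);
      }
   }
   FinishUp();
   return 1;
}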
8.3 Extension of the Model to Include
Multiway Merging
A K-way Merge Algorithm
A very general form of cosequential file processing
Merge K input lists to create a single, sequentially
ordered output list
Algorithm
begin loop
determine which list has the key with the lowest value
output that key
move ahead one key in that list
for duplicate input entries, move ahead one key in each list that contains the duplicate
loop again
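As an illustration of this loop, here is a minimal in-memory sketch that merges K sorted vectors of strings by scanning for the lowest current key on every pass (the version in the text works on lists and files rather than vectors):

#include <string>
#include <vector>
using namespace std;

// Minimal K-way merge sketch: linear scan for the lowest key; duplicates output once.
vector<string> KWayMerge(const vector< vector<string> > & lists)
{
   vector<string> output;
   vector<size_t> pos(lists.size(), 0);             // current position in each list
   while (true) {
      int lowest = -1;
      for (size_t k = 0; k < lists.size(); k++) {   // which list has the lowest key?
         if (pos[k] == lists[k].size()) continue;   // this list is exhausted
         if (lowest < 0 || lists[k][pos[k]] < lists[lowest][pos[lowest]])
            lowest = (int) k;
      }
      if (lowest < 0) break;                        // every list is exhausted
      output.push_back(lists[lowest][pos[lowest]]); // output the lowest key
      for (size_t k = 0; k < lists.size(); k++)     // move ahead in each list holding it
         if (pos[k] < lists[k].size() && lists[k][pos[k]] == output.back())
            pos[k]++;
   }
   return output;
}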
Selection Tree for Merging Large Number of Lists
K-way merge
works well if K is no larger than 8 or so
if K > 8, the set of comparisons needed to find the key with the minimum value becomes expensive (a straight loop of comparisons over the K current keys)
Selection Tree (if K > 8)
a time vs. space trade-off
a kind of tournament tree
the minimum value is at the root node
the depth of the tree is ceil(log2 K), so each selection costs on the order of log2 K comparisons
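The text builds this as a tournament tree; a heap-based priority queue (an assumption of this sketch, not the book's structure) is an alternative way to get the same O(log K) selection of the minimum key per output record:

#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Each queue entry pairs a key with the index of the list it came from,
// so the smallest current key is always available at the top in O(log K).
vector<string> KWayMergeWithHeap(const vector< vector<string> > & lists)
{
   typedef pair<string, size_t> Entry;                         // (key, list index)
   priority_queue< Entry, vector<Entry>, greater<Entry> > pq;  // min-heap of entries
   vector<size_t> pos(lists.size(), 0);
   for (size_t k = 0; k < lists.size(); k++)
      if (!lists[k].empty()) pq.push(Entry(lists[k][0], k));   // seed with each list head

   vector<string> output;
   while (!pq.empty()) {
      Entry e = pq.top(); pq.pop();              // list with the lowest current key
      output.push_back(e.first);
      size_t k = e.second;
      if (++pos[k] < lists[k].size())            // move ahead one key in that list
         pq.push(Entry(lists[k][pos[k]], k));
   }
   return output;
}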
Selection Tree
Figure: a selection tree merging 8 input lists whose current heads are
   List 0: 7, 10, 17 ...     List 4: 12, 14, 21 ...
   List 1: 9, 19, 23 ...     List 5: 5, 6, 25 ...
   List 2: 11, 13, 32 ...    List 6: 15, 20, 30 ...
   List 3: 18, 22, 24 ...    List 7: 8, 16, 29 ...
Each internal node holds the smaller of its two children (7, 11, 5, and 8 at the first level), so the overall minimum key, 5, appears at the root.
8.4 A Second Look at Sorting in Memory
Read the whole file into memory, perform the sort, then write the whole file back to disk
Can we improve on the time that it takes for this RAM sort?
perform some parts of the work in parallel
selection sort can work this way, but it is not practical for sorting an entire file
Use the heap technique!
processing and I/O can occur in parallel
keep all the keys in a heap
heap building while reading a block
heap rebuilding while writing a block
Overlapping processing and I/O : Heapsort
Heap
a complete binary tree
each node has a single key, and that key is greater than or equal to the key at its parent node (so the minimum key is at the root)
storage for the tree can be allocated sequentially as an array
so there is no need for pointers or other dynamic overhead for maintaining the heap
A heap in both its tree form and
as it would be stored in an array
Figure: array positions 1 through 9 hold the keys A B C E H I D G F; the children of the node at position n are at positions 2n and 2n+1, and every key is less than or equal to the keys of its children.
Class Heap and Method Insert(1)
class Heap
{ public:
   Heap(int maxElements);
   int Insert (char * newKey);
   char * Remove();
protected:
   int MaxElements; int NumElements;
   char ** HeapArray;
   void Exchange (int i, int j);   // exchange elements i and j
   int Compare (int i, int j)      // compare elements i and j
      { return strcmp(HeapArray[i], HeapArray[j]); }
};
Class Heap and Method Insert(2)
int Heap::Insert(char * newKey)
{
   if (NumElements == MaxElements) return FALSE;
   NumElements++;                         // add the new key at the last position
   HeapArray[NumElements] = newKey;
   // re-order the heap
   int k = NumElements; int parent;
   while (k > 1) {                        // k has a parent
      parent = k / 2;
      if (Compare(k, parent) >= 0) break; // HeapArray[k] is in the right place
      // else exchange k and parent
      Exchange(k, parent);
      k = parent;
   }
   return TRUE;
}
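Remove() is declared above but not shown on these slides; the following is only a sketch consistent with Insert() (Appendix H has the actual code): hand back the smallest key from the root, move the last element into the root, and sift it down.

// Sketch of removing the smallest key -- see Appendix H for the text's version.
char * Heap::Remove()
{
   if (NumElements == 0) return 0;            // nothing left to remove
   char * smallest = HeapArray[1];            // smallest key is at the root
   HeapArray[1] = HeapArray[NumElements];     // move the last element into the root
   NumElements--;
   int k = 1;                                 // sift the new root down
   while (2 * k <= NumElements) {
      int child = 2 * k;
      if (child < NumElements && Compare(child + 1, child) < 0)
         child++;                             // pick the smaller of the two children
      if (Compare(k, child) <= 0) break;      // heap property restored
      Exchange(k, child);
      k = child;
   }
   return smallest;
}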
Heap Building Algorithm(1)
input key order: F D C G H I B E A
New key to be inserted, and the heap (array positions 1-9) after its insertion:
   F        F
   D        D F
   C        C F D
   G        C F D G
   H        C F D G H
(selected heaps are also shown in tree form in the original figure; continued ...)
Heap Building Algorithm(2)
input key order: F D C G H I B E A (continued)
New key to be inserted, and the heap after its insertion:
   I        C F D G H I
   B        B F C G H I D
   E        B E C F H I D G
   A        A B C E H I D G F
(selected heaps are also shown in tree form in the original figure; continued ...)
Heap Building Algorithm(3)
input key order: F D C G H I B E A
Final heap after the last key, A, is inserted: A B C E H I D G F
In tree form: A is the root; its children are B and C; B's children are E and H; C's children are I and D; E's children are G and F.
Illustration for overlapping input with heap building(1)
(Free ride of main memory processing: heap building is faster than IO!)
Total RAM area allocated for heap
First input buffer. First part of heap is built here. The
first record is added to the heap, then the second
record
is added, and so forth
Second input buffer. This buffer is being filled
while heap is being built in first buffer.
Illustration for overlapping input with heap building(2)
(One Heap is growing during IO time!)
Second part of heap is built here. The first record is
added to the heap, then the second record, etc
Third input buffer. This buffer is filled while heap is being
built in second buffer
Third part of heap is built here
Fourth input buffer is filled while heap is being built in third buffer
Sorting while Writing to the File
Heap rebuilding while writing a block
(Free ride of main memory processing)
Retrieving the keys in order (Fig 8.20)
while (elements remain in the heap)
   output the smallest value (the root)
   move the last heap element into the root
   decrease the number of elements
   reorder the heap
Overlapping retrieve-in-order with I/O
retrieve a block of records in order
while writing this block, retrieve the next block in order
8.5 Merging as a Way of Sorting Large Files on Disk
Keysort: holding only the keys in memory
Two shortcomings of keysort
substantial cost of seeking: after the keys are sorted, the records must still be retrieved in sorted order, which can cost a seek per record
cannot sort really large files
e.g. a file with 800,000 records, each 100 bytes with a 10-byte key: 800,000 x 10 bytes = 8 MB of keys alone
we may not even be able to sort all the keys in RAM
Multiway merge algorithm
small overhead for maintaining pointers and temporary variables
run: a sorted subfile
use heapsort for each run
split the file, read a piece in, heapsort it, write it back as a run
Sorting through the creation of runs and subsequent merging of runs
Figure: 800,000 unsorted records --> 80 internal sorts --> 80 runs, each containing 10,000 sorted records --> merge --> 800,000 records in sorted order
Multiway merging (K-way merge-sort)
Can be extended to files of any size
Reading during run creation is sequential
no seeking due to sequential reading
Reading & writing is sequential
Sort each run: Overlapping I/O using heapsort
A K-way merge of the K runs
Since I/O is largely sequential, tapes can be used
How Much Time Does a Merge Sort Take?
Assumptions
only one seek is required for any sequential access
only one rotational delay is required per access
Four I/O steps (detailed on the following slides)
during the sort phase
   reading all records into RAM for sorting and forming runs
   writing sorted runs out to disk
during the merge phase
   reading sorted runs into RAM for merging
   writing the sorted file out to disk
Four Steps(1)
Step 1: Reading records into RAM for sorting and forming runs
assume: 10 MB input buffer, 800 MB file size
seek time --> 8 msec, rotational delay --> 3 msec
transmission rate --> 0.0145 MB/msec
Time for step 1:
   access 80 blocks: 80 x 11 msec, plus transfer 80 blocks: (800 / 0.0145) msec
Step 2: Writing sorted runs out to disk
writing is the reverse of reading
the time for step 2 equals the time for step 1
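Plugging in the figures assumed above (rounded, as a rough check): seek + rotation for step 1 is 80 x 11 msec, or about 0.9 seconds, and transfer is 800 MB / 0.0145 MB per msec, or about 55,000 msec, roughly 55-60 seconds; so step 1, and likewise step 2, takes on the order of a minute.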
Four Steps(2)
Step 3: Reading sorted runs into RAM for merging
the 10 MB of RAM is reallocated as 80 input buffers, one for each of the 80 runs
each buffer holds 1/80 of a run (0.125 MB), so each run must be accessed 80 times to read all of it
total seeks --> 80 runs x 80 accesses = 6,400 seeks; 6,400 x 11 msec = about 70 seconds
transfer time --> about 60 seconds
total time = seek & rotation time + transfer time = about 130 seconds
Four Steps(3)
Step 4: Writing the sorted file out to disk
we need to know how big the output buffers are
with 20,000-byte output buffers:
   800,000,000 bytes / 20,000 bytes per seek = 40,000 seeks
   total seek & rotation time = 40,000 x 11 msec = about 440 seconds
   transfer time is still about 60 seconds
Consider Table 8.1 (p. 323) for the totals
What if we use keysort for the 800 MB file? --> 24 hrs 26 mins 40 secs
Effect of buffering on the number of seeks required
Figure: the 800 MB file is written as 80 sorted runs; when the runs are read back for merging, the 80 input buffers (10 MB of RAM in total) mean that each run is read in 80 buffer-loads, i.e. 80 accesses per run for each of the 80 runs.
Sorting a Very Large File
Two kinds of I/O
Sort phase
   I/O is sequential if heapsort is used for the in-memory sorting
   since sequential access means minimal seeking, we cannot algorithmically speed this I/O up
Merge phase
   the RAM buffers for the runs must be reloaded as they empty, which means jumping between runs --> random access
   for performance, look for ways to cut down on the number of random accesses that occur while reading runs
   this is where there is room for improvement!
The Cost of Increasing the File Size
K-way merge of K runs
the merge operation requires roughly K^2 seeks, so its cost grows as O(K^2)
if K is a big number, you are in trouble!
Some ways to reduce the time (8.5.4, 8.5.5, 8.5.6)
   more hardware (disk drives, RAM, I/O channels)
   reduce the order of the merge (K) and increase the buffer size for each run
   increase the lengths of the initial sorted runs
   find ways to overlap I/O operations
Hardware-based Improvements
Increasing the amount of RAM
   longer and fewer initial runs, hence fewer seeks
Increasing the number of disk drives
   assign input and output to separate drives: no delay due to seek time after generation of runs
Increasing the number of I/O channels
   with separate I/O channels, I/O can overlap
Improve transmission time
Decreasing the Number of Seeks Using Multiple-step Merges
K-way merge characteristics
   a selection tree is used
   with N records, the number of comparisons is about N*log2 K
   since K is proportional to N (for a fixed amount of memory), this is O(N*log N): reasonably efficient, so comparisons are not the bottleneck
The way to reduce seeks is to reduce the number of runs merged at once, giving each run a bigger buffer space
   a multiple-step merge provides a way to do this without more RAM
Multiple-step merge(1)
Do not merge all runs at one time
Break the original set of runs into small groups and merge the runs in these groups separately
This leads to fewer seeks, but costs extra transmission time in a second pass
   every record is read twice during merging: once to form the intermediate runs and once to form the final sorted file
Similar in spirit to using a selection tree when merging n lists
Two-step merge of 800 runs
Figure: (25 sets x 32 runs) = 800 runs. First, 25 separate 32-way merges (one per set of 32 runs) produce 25 intermediate runs; then a single 25-way merge of those intermediate runs produces the final sorted file.
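A back-of-the-envelope count, using the document's own figures (800 runs, and the K^2 seek behaviour noted earlier): reading the runs in a single 800-way merge costs on the order of 800 x 800 = 640,000 seeks, whereas the two-step pattern above brings the total number of seeks for sorts and merges down to about 127,200 (see the comparison tables that follow), at the price of two extra passes of transmission over the data.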
Multiple-step merge(2)
Essence of multiple-step merging
Can we do even better with more than two steps?
   each extra step increases the available buffer space for each run
   an extra pass over the data vs. a decrease in random accesses
   a trade-off between seek & rotation time and transmission time
Major cost factors in merge sort
   seek and rotation time, transmission time, buffer size, number of runs
Increasing Run Lengths Using Replacement Selection(1)
Facts of life
   we want to use heapsort in memory
   we want longer output runs
   can we produce output runs longer than the memory used for the heap?
Replacement Selection
Idea
   always select the key from memory that has the lowest value
   output that key
   replace it with a new key from the input list
   use 2 heaps in the memory buffer
Increasing Run Lengths Using Replacement Selection(2)
Implementation
step 1: read records and build a heap from them (the primary heap)
step 2: write out the record with the lowest key value
step 3: bring in a new record and compare its key with that of the record just output
   step 3-a: if the new key is higher, insert the new record into its proper place in the primary heap, along with the other records being selected for output
   step 3-b: if the new key is lower, place the record in a secondary heap, reserved for records with key values lower than those already written out (they belong to the next run)
step 4: repeat step 3 as long as there are records in the primary heap and records to be read in; when the primary heap is empty, make the secondary heap the primary heap and repeat steps 2 and 3
A compact sketch of these steps follows below.
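A compact sketch of the four steps (an illustration only: it uses plain integer keys and in-memory structures instead of the record- and file-based heaps of the text):

#include <functional>
#include <queue>
#include <vector>
using namespace std;

// Replacement selection sketch with a primary and a secondary min-heap.
// Returns the runs produced; P is the number of keys held in memory.
typedef priority_queue< int, vector<int>, greater<int> > MinHeap;

vector< vector<int> > ReplacementSelection(const vector<int> & input, size_t P)
{
   vector< vector<int> > runs;
   MinHeap primary, secondary;
   size_t next = 0;
   while (next < input.size() && primary.size() < P)   // step 1: fill the primary heap
      primary.push(input[next++]);

   while (!primary.empty()) {
      runs.push_back(vector<int>());                   // start a new run
      vector<int> & run = runs.back();
      while (!primary.empty()) {
         int lowest = primary.top(); primary.pop();    // step 2: output the lowest key
         run.push_back(lowest);
         if (next < input.size()) {                    // step 3: bring in a new key
            if (input[next] >= lowest) primary.push(input[next]);  // step 3-a
            else secondary.push(input[next]);          // step 3-b: held for the next run
            next++;
         }
      }
      swap(primary, secondary);                        // step 4: secondary becomes primary
   }
   return runs;
}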
Example of the principle underlying replacement selection
Input: 21, 67, 12, 5, 47, 16   (the front of the input string is at the right; memory holds P = 3 keys, kept as a heap)

Remaining input      Memory (P=3)      Output run
21, 67, 12           5   47   16      5
21, 67               12  47   16      12, 5
21                   67  47   16      16, 12, 5
-                    67  47   21      21, 16, 12, 5
-                    67  47   -       47, 21, 16, 12, 5
-                    67  -    -       67, 47, 21, 16, 12, 5
Replacement Selection(1)
What happens if a key arrives in memory too late to be output into its proper position relative to the other keys? (e.g. if the 4th input key were 2 rather than 12)
   use a second heap; such a key is held back and included in the next run
   refer to Figure 8.25 on page 335
Two questions
   Given P locations in memory, how long a run can we expect replacement selection to produce, on average?
   On average, we can expect a run length of 2P
   Knuth provides an excellent description (pages 335-336)
Comparison of access times required to sort 8 million records, using RAM sorts versus replacement selection:

Approach                           Records per seek   Size of runs   Seeks to    Merge order   Total     Seek & rotation
                                   to form runs       formed         form runs   used          seeks     delay time (min)
800 RAM sorts followed by
an 800-way merge                   10,000             10,000         1,600       800           681,600   58
Replacement selection followed
by a 534-way merge
(records in random order)          2,500              15,000         6,400       534           521,134   48
Replacement selection followed
by a 200-way merge
(records partially ordered)        2,500              40,000         200         200           206,400   30
Step-by-step operation of replacement selection with 2 heaps, working to form two sorted runs(1)
Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16   (the front of the input string is at the right; memory holds P = 3 keys as a heap; keys shown in parentheses are in the secondary heap, held back for the next run)

Remaining input                           Memory (P=3)         Output run A
33, 18, 24, 58, 14, 17, 7, 21, 67, 12     5    47    16        5
33, 18, 24, 58, 14, 17, 7, 21, 67         12   47    16        12, 5
33, 18, 24, 58, 14, 17, 7, 21             67   47    16        16, 12, 5
33, 18, 24, 58, 14, 17, 7                 67   47    21        21, 16, 12, 5
33, 18, 24, 58, 14, 17                    67   47   ( 7)       47, 21, 16, 12, 5
33, 18, 24, 58, 14                        67  (17)  ( 7)       67, 47, 21, 16, 12, 5
33, 18, 24, 58                           (14) (17)  ( 7)       (first run complete)
Step-by-step operation of replacement selection, working to form two sorted runs(2)
First run complete; start building the second: the held-back keys 14, 17, 7 become the primary heap

Remaining input      Memory (P=3)       Output run B
33, 18, 24, 58       14   17    7       7
33, 18, 24           14   17   58       14, 7
33, 18               24   17   58       17, 14, 7
33                   24   18   58       18, 17, 14, 7
-                    24   33   58       24, 18, 17, 14, 7
-                    -    33   58       33, 24, 18, 17, 14, 7
-                    -    -    58       58, 33, 24, 18, 17, 14, 7
Replacement Selection Plus Multiple Merging
The total number of seeks is less than for the one-step merges
The two-step merge requires transferring the data two more times than the one-step merge does
   the two-step merges with replacement selection are still better, but the results are less dramatic
   refer to the tables on the next two slides
Comparison of merges, considering transmission times(1): 1-step merge

Approach                        Records per seek   Merge      Seeks for    Seek + rotational   Total          Total of seek, rotation,
                                to form runs       pattern    sorts and    delay time          transmission   and transmission
                                                   used       merges       (min)               time (min)     times (min)
RAM sorts                       10,000             800-way    681,700      298                 43             341
Replacement selection
(records in random order)       2,500              534-way    521,134      228                 43             271
Replacement selection
(records partially ordered)     2,500              200-way    206,400      90                  43             133
Comparison of merges, considering transmission times(2): 2-step merge

Approach                        Records per seek   Merge pattern        Seeks for    Seek + rotational   Total          Total of seek, rotation,
                                to form runs       used                 sorts and    delay time          transmission   and transmission
                                                                        merges       (min)               time (min)     times (min)
RAM sorts                       10,000             25 x 32-way          127,200      56                  65             121
                                                   (plus one 25-way)
Replacement selection
(records in random order)       2,500              19 x 28-way          124,438      55                  65             120
                                                   (plus one 19-way)
Replacement selection
(records partially ordered)     2,500              20 x 10-way          110,400      48                  65             113
                                                   (plus one 20-way)
Using Two Disks with Replacement Selection
Two disk drives
Sort phase
the run selection & output can overlap
Merge phase
input & output can overlap
reduce transmission by 50%
seeking is virtually eliminated
output disk becomes input disk, and vice versa
seeking will occur on input disk, output is sequential
substantially reducing merge & transmission time
Memory organization for replacement selection
Figure: input buffers are filled from disk 1 and feed the heap; the heap feeds output buffers, which are written to disk 2.
More Drives? More Processors?
More drives?
Until I/O becomes so fast that processing cannot keep up
with it
More processors?
mainframes
vector and array processors
massively parallel machines
very fast local area networks
Effects of Multiprogramming
Increase the efficiency of overall system by
overlapping processing and I/O
Effects are very hard to predict
A Concept Toolkit for External Sorting
For in-RAM sorting, use heapsort, so that sorting can overlap with I/O
Use as much RAM as possible
Use a multiple-step merge when the number of initial runs is so large that seek and rotation time is much greater than transmission time
Use replacement selection when there is a good chance that the file is partially ordered
Use more than one disk drive and I/O channel so that reading and writing can overlap
Look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed networks
Sorting Files on Tape
Balanced Merge with several tape drives
Figure 8.28 (2-way balanced merge on 4 tapes), Step 1: tape T1 contains runs R1 R3 R5 R7 R9; T2 contains runs R2 R4 R6 R8 R10; T3 and T4 are empty.
If P is the number of passes, N the number of runs, and k the number of input drives, then
   P = ceiling(log_k N)
4 tape drives (2 for input, 2 for output), 10 runs ==> ceiling(log_2 10) = 4 passes
20 tape drives (10 for input, 10 for output), 200 runs ==> ceiling(log_10 200) = 3 passes
Other ways of Balanced Merge
Figures 8.30 and 8.31 show, step by step, the lengths of the runs held on tapes T1-T4 under two alternative balanced-merge patterns: starting from initial runs of length 1 distributed over the input tapes, runs are combined at each step (into runs of length 2, 3, 4, or 5) until a single run of length 10 remains on one tape.
K-way Balanced Merge on Tapes
Some difficult questions
How does one choose an initial distribution that leads readily to an
efficient merge pattern?
Are there algorithmic descriptions of the merge patterns, given an
initial distribution?
Given N runs and J tape drives, is there some way to compute the
optimal merging performance so we have a yardstick against which
to compare the performance of any specific algorithm?
Unix: Sorting and Cosequential Processing
Sorting in Unix
The Unix sort command
The qsort library routine
Cosequential processing utilities in Unix
compare: cmp
difference: diff
common lines: comm
Let's Review!!
8.1 Cosequential operations
8.2 Application of the Model to a General Ledger Program
8.3 Extension of the Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix