File Structures by Folk, Zoellick, and Riccardi
Chap 8. Cosequential Processing
and the Sorting of Large Files
SNU-OOPSLA-LAB
Chapter Objectives(1)
Describe a class of frequently used processing activities known as cosequential processing
Provide a general object-oriented model for implementing
varieties of cosequential processes
Illustrate the use of the model to solve a number of
different kinds of cosequential processing problems,
including problems other than simple merges and
matches
Introduce heapsort as an approach to overlapping I/O with
sorting in RAM
Chapter Objectives(2)
Show how merging provides the basis for sorting very
large files
Examine the costs of K-way merges on disk and find ways
to reduce those costs
Introduce the notion of replacement selection
Examine some of the fundamental concerns associated
with sorting large files using tapes rather than disks
Introduce UNIX utilities for sorting, merging, and
cosequential processing
Contents
8.1 Cosequential operations
8.2 Application of the OO Model to a General Ledger Program
8.3 Extension of the OO Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix
8.1 An Object-Oriented Model for Implementing Cosequential Processes
Cosequential operations
Coordinated processing of two or more sequential
lists to produce a single list
Kinds of operations
merging, or union
matching, or intersection
a combination of the above
Matching Names in Two Lists(1)
Also called the intersection operation
Output the names common to the two lists
Things that must be dealt with to make the match procedure work reasonably
initializing: arranging things so that the procedure gets going properly
methods for getting and accessing the next list item
synchronizing the two lists
handling end-of-file conditions
recognizing errors
e.g. duplicate names or names out of sequence
Matching Names in Two Lists(2)
In comparing two names
if Item(1) is less than Item(2), read the next name from List 1
if Item(1) is greater than Item(2), read the next name from List 2
if the names are the same, output the name and read the next names from the two lists
Cosequential match procedure(1)
Figure: the match program reads Item(1) from List 1 and Item(2) from List 2 (using initialize() and input() methods); when the two names are the same, the name is output; when Item(1) < Item(2) the next name is read from List 1, and when Item(1) > Item(2) the next name is read from List 2.
Cosequential match procedure(2)
int Match(char * List1, char * List2, char * OutputList)
{
   int MoreItems;   // true if items remain in both of the lists

   // initialize input and output lists
   InitializeList(1, List1); InitializeList(2, List2);
   InitializeOutput(OutputList);

   // get first item from both lists
   MoreItems = NextItemInList(1) && NextItemInList(2);

   while (MoreItems) {   // loop until no items remain in one of the lists
      if (Item(1) < Item(2)) MoreItems = NextItemInList(1);
      else if (Item(1) == Item(2)) {   // match found
         ProcessItem(1);
         MoreItems = NextItemInList(1) && NextItemInList(2);
      }
      else   // Item(1) > Item(2)
         MoreItems = NextItemInList(2);
   }
   FinishUp();
   return 1;
}
General Class for Cosequential Processing(1)
template <class ItemType> class CosequentialProcess
// base class for cosequential processing
{ public:
   // the following methods provide basic list processing
   // these must be defined in subclasses
   virtual int InitializeList (int ListNumber, char * ListName) = 0;
   virtual int InitializeOutput (char * OutputListName) = 0;
   virtual int NextItemInList (int ListNumber) = 0;
      // advance to next item in this list
   virtual ItemType Item (int ListNumber) = 0;
      // return current item from this list
   virtual int ProcessItem (int ListNumber) = 0;
      // process the item in this list
   virtual int FinishUp() = 0;   // complete the processing
   // 2-way cosequential match method
   virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};
General Class for Cosequential Processing(2)
A Subclass to support lists that are files of strings, one per line
class StringListProcess : public CosequentialProcess<String &>
{ public:
   StringListProcess (int NumberOfLists);   // constructor
   // Basic list processing methods
   int InitializeList (int ListNumber, char * ListName);
   int InitializeOutput (char * OutputListName);
   int NextItemInList (int ListNumber);   // get next item from this list
   String & Item (int ListNumber);        // return current item from this list
   int ProcessItem (int ListNumber);      // process the item in this list
   int FinishUp();                        // complete the processing
protected:
   ifstream * List;        // array of list files
   String * Items;         // array of current items, one from each list
   ofstream OutputList;
   static const char * LowValue;    // used so that NextItemInList() doesn't
                                    // have to get the first item in a special way
   static const char * HighValue;   // sorts after all legal input values
};
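The full implementations of these methods are in Appendix H. As a rough sketch only (not the text's exact code), NextItemInList can substitute HighValue for the current item at end-of-file, so the match and merge loops never need special EOF handling; the fixed line-buffer size below is an assumption made for the example.

// Sketch only -- see Appendix H for the actual implementation.
int StringListProcess::NextItemInList (int ListNumber)
{
   char line[256];                        // assumed maximum line length
   if (List[ListNumber].getline(line, sizeof(line)))
   {
      Items[ListNumber] = line;           // the line just read becomes the current item
      return 1;                           // more items remain in this list
   }
   Items[ListNumber] = HighValue;         // at EOF, substitute the sentinel value
   return 0;                              // no more real items in this list
}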
General Class for Cosequential Processing(3)
Appendix H: full implementation
An example of a main program
#include "coseq.h"
int main()
{
   StringListProcess ListProcess(2);   // process with 2 lists
   ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
   return 0;
}
Merging Two Lists(1)
Based on the matching operation
Differences
must read each of the lists completely
must change the MoreItems behavior
keep this flag set to true as long as there are records in either list
HighValue
the special value (we use \xFF)
comes after all legal input values in the files, ensuring that both input files are read to completion
Merging Two Lists(2)
Cosequential merge procedure based on a single loop
This method has been added to class CosequentialProcess
No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists
   (char * List1Name, char * List2Name, char * OutputListName)
{
   int MoreItems1, MoreItems2;   // true if more items remain in each list
   InitializeList(1, List1Name);
   InitializeList(2, List2Name);
   InitializeOutput(OutputListName);
   MoreItems1 = NextItemInList(1);
   MoreItems2 = NextItemInList(2);

   while (MoreItems1 || MoreItems2) {   // loop while either list has more items
      if (Item(1) < Item(2)) {          // list 1 has the next item to be processed
         ProcessItem(1);
         MoreItems1 = NextItemInList(1);
      }
      else if (Item(1) == Item(2)) {    // same item: output it once, advance both lists
         ProcessItem(1);
         MoreItems1 = NextItemInList(1);
         MoreItems2 = NextItemInList(2);
      }
      else {                            // Item(1) > Item(2)
         ProcessItem(2);
         MoreItems2 = NextItemInList(2);
      }
   }
   FinishUp();
   return 1;
}
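By analogy with the Match2Lists driver shown earlier, a hypothetical main program for the merge (the file names are placeholders):

#include "coseq.h"
int main()
{
   StringListProcess ListProcess(2);   // process with 2 input lists
   ListProcess.Merge2Lists("list1.txt", "list2.txt", "merge.txt");
   return 0;
}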
Cosequential merge procedure(1)
Figure: the merge program reads NAME_1 from List 1 and NAME_2 from List 2; when Item(1) < Item(2), or when the names match, the name from List 1 is written to the OutputList; when Item(1) > Item(2), the name from List 2 is written to the OutputList.
Summary of the Cosequential Processing Model(1)
Assumptions
two or more input files are processed in a parallel fashion
each file is sorted
in some cases, a high key value and/or a low key value must exist
records are processed in logical sorted order
for each file, there is only one current record
records are manipulated only in internal memory
Summary of the Cosequential Processing Model(2)
Essential Components
initialization: read the first logical record from each file
one main synchronization loop, which continues as long as relevant records remain
selection in the main synchronization loop:
   if (Item(1) > Item(2)) then ..........
   else if (Item(1) < Item(2)) then .........
   else ...........   /* current keys equal */
   endif
input and output files are sequence checked by comparing the previous item value with the new one
Summary of the Cosequential Processing Model(3)
Essential components (cont'd)
substitute high values for the actual key when EOF is reached
main loop terminates when high values have occurred for all relevant input files
no special code is needed to deal with EOF
I/O and error detection are relegated to supporting methods
so the details of these activities do not obscure the principal processing logic
8.2 The General Ledger Program (1)
Account table (Fig 8.6): columns Acct-No, Acct-Title, and monthly balances for Jan, Feb, Mar, Apr, for accounts such as 101 check #1, 102 check #2, and 505 advertising.
Journal entry table (Fig 8.7): columns Acct-No, Check-No, Date, Description, Debit/Credit, with entries such as
   101   112   04/02/86   auto-repair
   505   213   05/13/86   newspaper
   540   670   04/13/86   printer
   and debit/credit amounts such as -30, -39, +60.
Ledger printout (Fig 8.8): each ledger account (e.g. 101 check #1) is followed by its posted journal entries (e.g. 1271 04/02/86 auto-expense, 1272 04/03/86 advertise) and their debit/credit amounts (-78, -30).
8.2 The General Ledger Program(2)
Ledger List and Journal List (Fig 8.10)
The ledger (master) file is keyed on account number, e.g. 101 check#1, 102 check#2.
The journal (transaction) file carries the same account numbers, e.g. 101 1271 Auto-expense, 101 1272 Rent, 101 1273 Advertising, 102 670 Office-expense.
Class MasterTransactionProcess (Fig 8.12)
Subclass LedgerProcess (Fig 8.14)
8.2 The General Ledger Program (3)
template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{ public:
   MasterTransactionProcess();   // constructor
   virtual int ProcessNewMaster() = 0;   // processing when a new master record is read
   virtual int ProcessCurrentMaster() = 0;
   virtual int ProcessEndMaster() = 0;
   virtual int ProcessTransactionError() = 0;
   // cosequential processing of master and transaction records
   int PostTransactions (char * MasterFileName, char * TransactionFileName,
      char * OutputListName);
};
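Figure 8.13 of the text contains the actual PostTransactions method; the sketch below is only meant to show the three-way selection it is built on, with the details simplified (list 1 is the master/ledger file, list 2 the transaction/journal file):

// Simplified sketch of master/transaction posting -- not the text's exact code.
template <class ItemType>
int MasterTransactionProcess<ItemType>::PostTransactions
   (char * MasterFileName, char * TransactionFileName, char * OutputListName)
{
   InitializeList(1, MasterFileName);
   InitializeList(2, TransactionFileName);
   InitializeOutput(OutputListName);
   int MoreMasters = NextItemInList(1);
   int MoreTransactions = NextItemInList(2);
   if (MoreMasters) ProcessNewMaster();        // set up the first master account
   while (MoreMasters || MoreTransactions) {
      if (Item(1) < Item(2)) {                 // no more transactions for this master
         ProcessEndMaster();                   // e.g. print the account total
         MoreMasters = NextItemInList(1);
         if (MoreMasters) ProcessNewMaster();
      }
      else if (Item(1) == Item(2)) {           // transaction matches the current master
         ProcessCurrentMaster();               // post it to the account
         MoreTransactions = NextItemInList(2);
      }
      else {                                   // transaction with no matching master
         ProcessTransactionError();
         MoreTransactions = NextItemInList(2);
      }
   }
   FinishUp();
   return 1;
}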
8.3 Extension of the Model to Include
Multiway Merging
A K-way Merge Algorithm
A very general form of cosequential file processing
Merge K input lists to create a single, sequentially
ordered output list
Algorithm
begin loop
determine which list has the key with the lowest value
output that key
move ahead one key in that list
for duplicate input entries, move ahead one key in each list that contains the duplicate
loop again
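As an illustration of this loop, here is a minimal in-memory sketch that merges K sorted vectors of strings by scanning for the lowest current key on every pass (the version in the text works on lists and files rather than vectors):

#include <string>
#include <vector>
using namespace std;

// Minimal K-way merge sketch: linear scan for the lowest key; duplicates output once.
vector<string> KWayMerge(const vector< vector<string> > & lists)
{
   vector<string> output;
   vector<size_t> pos(lists.size(), 0);             // current position in each list
   while (true) {
      int lowest = -1;
      for (size_t k = 0; k < lists.size(); k++) {   // which list has the lowest key?
         if (pos[k] == lists[k].size()) continue;   // this list is exhausted
         if (lowest < 0 || lists[k][pos[k]] < lists[lowest][pos[lowest]])
            lowest = (int) k;
      }
      if (lowest < 0) break;                        // every list is exhausted
      output.push_back(lists[lowest][pos[lowest]]); // output the lowest key
      for (size_t k = 0; k < lists.size(); k++)     // move ahead in each list holding it
         if (pos[k] < lists[k].size() && lists[k][pos[k]] == output.back())
            pos[k]++;
   }
   return output;
}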
Selection Tree for Merging Large Number of Lists
K-way merge
works well if K is no larger than 8 or so
if K > 8, the set of comparisons needed to find the key with the minimum value becomes expensive (a straight loop of comparisons over the K current keys)
Selection Tree (if K > 8)
a time vs. space trade-off
a kind of tournament tree
the minimum value is at the root node
the depth of the tree is ceil(log2 K), so each selection costs on the order of log2 K comparisons
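The text builds this as a tournament tree; a heap-based priority queue (an assumption of this sketch, not the book's structure) is an alternative way to get the same O(log K) selection of the minimum key per output record:

#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Each queue entry pairs a key with the index of the list it came from,
// so the smallest current key is always available at the top in O(log K).
vector<string> KWayMergeWithHeap(const vector< vector<string> > & lists)
{
   typedef pair<string, size_t> Entry;                         // (key, list index)
   priority_queue< Entry, vector<Entry>, greater<Entry> > pq;  // min-heap of entries
   vector<size_t> pos(lists.size(), 0);
   for (size_t k = 0; k < lists.size(); k++)
      if (!lists[k].empty()) pq.push(Entry(lists[k][0], k));   // seed with each list head

   vector<string> output;
   while (!pq.empty()) {
      Entry e = pq.top(); pq.pop();              // list with the lowest current key
      output.push_back(e.first);
      size_t k = e.second;
      if (++pos[k] < lists[k].size())            // move ahead one key in that list
         pq.push(Entry(lists[k][pos[k]], k));
   }
   return output;
}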
Selection Tree
Figure: a selection tree merging 8 input lists whose current heads are
   List 0: 7, 10, 17 ...     List 4: 12, 14, 21 ...
   List 1: 9, 19, 23 ...     List 5: 5, 6, 25 ...
   List 2: 11, 13, 32 ...    List 6: 15, 20, 30 ...
   List 3: 18, 22, 24 ...    List 7: 8, 16, 29 ...
Each internal node holds the smaller of its two children (7, 11, 5, and 8 at the first level), so the overall minimum key, 5, appears at the root.
8.4 A Second Look at Sorting in Memory
Read the whole file into memory, perform the sort, then write the whole file back to disk
Can we improve on the time that it takes for this RAM sort?
perform some parts of the work in parallel
selection sort can work this way, but it is not practical for sorting an entire file
Use the heap technique!
processing and I/O can occur in parallel
keep all the keys in a heap
heap building while reading a block
heap rebuilding while writing a block
Overlapping processing and I/O : Heapsort
Heap
a complete binary tree
each node has a single key, and that key is greater than or equal to the key at its parent node (so the minimum key is at the root)
storage for the tree can be allocated sequentially as an array
so there is no need for pointers or other dynamic overhead for maintaining the heap
A heap in both its tree form and
as it would be stored in an array
Figure: array positions 1 through 9 hold the keys A B C E H I D G F; the children of the node at position n are at positions 2n and 2n+1, and every key is less than or equal to the keys of its children.
Class Heap and Method Insert(1)
class Heap
{ public:
   Heap(int maxElements);
   int Insert (char * newKey);
   char * Remove();
protected:
   int MaxElements; int NumElements;
   char ** HeapArray;
   void Exchange (int i, int j);   // exchange elements i and j
   int Compare (int i, int j)      // compare elements i and j
      { return strcmp(HeapArray[i], HeapArray[j]); }
};
Class Heap and Method Insert(2)
int Heap::Insert(char * newKey)
{
   if (NumElements == MaxElements) return FALSE;
   NumElements++;                         // add the new key at the last position
   HeapArray[NumElements] = newKey;
   // re-order the heap
   int k = NumElements; int parent;
   while (k > 1) {                        // k has a parent
      parent = k / 2;
      if (Compare(k, parent) >= 0) break; // HeapArray[k] is in the right place
      // else exchange k and parent
      Exchange(k, parent);
      k = parent;
   }
   return TRUE;
}
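Remove() is declared above but not shown on these slides; the following is only a sketch consistent with Insert() (Appendix H has the actual code): hand back the smallest key from the root, move the last element into the root, and sift it down.

// Sketch of removing the smallest key -- see Appendix H for the text's version.
char * Heap::Remove()
{
   if (NumElements == 0) return 0;            // nothing left to remove
   char * smallest = HeapArray[1];            // smallest key is at the root
   HeapArray[1] = HeapArray[NumElements];     // move the last element into the root
   NumElements--;
   int k = 1;                                 // sift the new root down
   while (2 * k <= NumElements) {
      int child = 2 * k;
      if (child < NumElements && Compare(child + 1, child) < 0)
         child++;                             // pick the smaller of the two children
      if (Compare(k, child) <= 0) break;      // heap property restored
      Exchange(k, child);
      k = child;
   }
   return smallest;
}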
Heap Building Algorithm(1)
input key order: F D C G H I B E A
New key to be inserted, and the heap (array positions 1-9) after its insertion:
   F        F
   D        D F
   C        C F D
   G        C F D G
   H        C F D G H
(selected heaps are also shown in tree form in the original figure; continued ...)
Heap Building Algorithm(2)
input key order: F D C G H I B E A (continued)
New key to be inserted, and the heap after its insertion:
   I        C F D G H I
   B        B F C G H I D
   E        B E C F H I D G
   A        A B C E H I D G F
(selected heaps are also shown in tree form in the original figure; continued ...)
Heap Building Algorithm(3)
input key order: F D C G H I B E A
Final heap after the last key, A, is inserted: A B C E H I D G F
In tree form: A is the root; its children are B and C; B's children are E and H; C's children are I and D; E's children are G and F.
Illustration for overlapping input with heap building(1)
(Free ride of main memory processing: heap building is faster than IO!)
Total RAM area allocated for heap
First input buffer. First part of heap is built here. The
first record is added to the heap, then the second
record
is added, and so forth
Second input buffer. This buffer is being filled
while heap is being built in first buffer.
Illustration for overlapping input with heap building(2)
(One Heap is growing during IO time!)
Second part of heap is built here. The first record is
added to the heap, then the second record, etc
Third input buffer. This buffer is filled while heap is being
built in second buffer
Third part of heap is built here
Fourth input buffer is filled while heap is being built in third buffer
Sorting while Writing to the File
Heap rebuilding while writing a block
(Free ride of main memory processing)
Retrieving the keys in order (Fig 8.20)
while (elements remain in the heap)
   output the smallest value (the root)
   move the last heap element into the root
   decrease the number of elements
   reorder the heap
Overlapping retrieve-in-order with I/O
retrieve a block of records in order
while writing this block, retrieve the next block in order
8.5 Merging as a Way of Sorting Large Files on Disk
Keysort: holding only the keys in memory
Two shortcomings of keysort
substantial cost of seeking: after the keys are sorted, the records must still be retrieved in sorted order, which can cost a seek per record
cannot sort really large files
e.g. a file with 800,000 records, each 100 bytes with a 10-byte key: 800,000 x 10 bytes = 8 MB of keys alone
we may not even be able to sort all the keys in RAM
Multiway merge algorithm
small overhead for maintaining pointers and temporary variables
run: a sorted subfile
use heapsort for each run
split the file, read a piece in, heapsort it, write it back as a run
Sorting through the creation of runs and subsequent merging of runs
Figure: 800,000 unsorted records --> 80 internal sorts --> 80 runs, each containing 10,000 sorted records --> merge --> 800,000 records in sorted order
Multiway merging (K-way merge-sort)
Can be extended to files of any size
Reading during run creation is sequential
no seeking due to sequential reading
Reading & writing is sequential
Sort each run: Overlapping I/O using heapsort
A K-way merge of the K runs
Since I/O is largely sequential, tapes can be used
How Much Time Does a Merge Sort Take?
Assumptions
only one seek is required for any sequential access
only one rotational delay is required per access
Four I/O steps (detailed on the following slides)
during the sort phase
   reading all records into RAM for sorting and forming runs
   writing sorted runs out to disk
during the merge phase
   reading sorted runs into RAM for merging
   writing the sorted file out to disk
Four Steps(1)
Step 1: Reading records into RAM for sorting and forming runs
assume: 10 MB input buffer, 800 MB file size
seek time --> 8 msec, rotational delay --> 3 msec
transmission rate --> 0.0145 MB/msec
Time for step 1:
   access 80 blocks: 80 x 11 msec, plus transfer 80 blocks: (800 / 0.0145) msec
Step 2: Writing sorted runs out to disk
writing is the reverse of reading
the time for step 2 equals the time for step 1
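Plugging in the figures assumed above (rounded, as a rough check): seek + rotation for step 1 is 80 x 11 msec, or about 0.9 seconds, and transfer is 800 MB / 0.0145 MB per msec, or about 55,000 msec, roughly 55-60 seconds; so step 1, and likewise step 2, takes on the order of a minute.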
Four Steps(2)
Step 3: Reading sorted runs into RAM for merging
the 10 MB of RAM is reallocated as 80 input buffers, one for each of the 80 runs
each buffer holds 1/80 of a run (0.125 MB), so each run must be accessed 80 times to read all of it
total seeks --> 80 runs x 80 accesses = 6,400 seeks; 6,400 x 11 msec = about 70 seconds
transfer time --> about 60 seconds
total time = seek & rotation time + transfer time = about 130 seconds
Four Steps(3)
Step 4: Writing the sorted file out to disk
we need to know how big the output buffers are
with 20,000-byte output buffers:
   800,000,000 bytes / 20,000 bytes per seek = 40,000 seeks
   total seek & rotation time = 40,000 x 11 msec = about 440 seconds
   transfer time is still about 60 seconds
Consider Table 8.1 (p. 323) for the totals
What if we use keysort for the 800 MB file? --> 24 hrs 26 mins 40 secs
Effect of buffering on the number of seeks required
Figure: the 800 MB file is written as 80 sorted runs; when the runs are read back for merging, the 80 input buffers (10 MB of RAM in total) mean that each run is read in 80 buffer-loads, i.e. 80 accesses per run for each of the 80 runs.
Sorting a Very Large File
Two kinds of I/O
Sort phase
   I/O is sequential if heapsort is used for the in-memory sorting
   since sequential access means minimal seeking, we cannot algorithmically speed this I/O up
Merge phase
   the RAM buffers for the runs must be reloaded as they empty, which means jumping between runs --> random access
   for performance, look for ways to cut down on the number of random accesses that occur while reading runs
   this is where there is room for improvement!
The Cost of Increasing the File Size
K-way merge of K runs
the merge operation requires roughly K^2 seeks, so its cost grows as O(K^2)
if K is a big number, you are in trouble!
Some ways to reduce the time (8.5.4, 8.5.5, 8.5.6)
   more hardware (disk drives, RAM, I/O channels)
   reduce the order of the merge (K) and increase the buffer size for each run
   increase the lengths of the initial sorted runs
   find ways to overlap I/O operations
Hardware-based Improvements
Increasing the amount of RAM
   longer and fewer initial runs, hence fewer seeks
Increasing the number of disk drives
   assign input and output to separate drives: no delay due to seek time after generation of runs
Increasing the number of I/O channels
   with separate I/O channels, I/O can overlap
Improve transmission time
Decreasing the Number of Seeks Using Multiple-step Merges
K-way merge characteristics
   a selection tree is used
   with N records, the number of comparisons is about N*log2 K
   since K is proportional to N (for a fixed amount of memory), this is O(N*log N): reasonably efficient, so comparisons are not the bottleneck
The way to reduce seeks is to reduce the number of runs merged at once, giving each run a bigger buffer space
   a multiple-step merge provides a way to do this without more RAM
Multiple-step merge(1)
Do not merge all runs at one time
Break the original set of runs into small groups and merge the runs in these groups separately
This leads to fewer seeks, but costs extra transmission time in a second pass
   every record is read twice during merging: once to form the intermediate runs and once to form the final sorted file
Similar in spirit to using a selection tree when merging n lists
Two-step merge of 800 runs
Figure: (25 sets x 32 runs) = 800 runs. First, 25 separate 32-way merges (one per set of 32 runs) produce 25 intermediate runs; then a single 25-way merge of those intermediate runs produces the final sorted file.
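A back-of-the-envelope count, using the document's own figures (800 runs, and the K^2 seek behaviour noted earlier): reading the runs in a single 800-way merge costs on the order of 800 x 800 = 640,000 seeks, whereas the two-step pattern above brings the total number of seeks for sorts and merges down to about 127,200 (see the comparison tables that follow), at the price of two extra passes of transmission over the data.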
Multiple-step merge(2)
Essence of multiple-step merging
Can we do even better with more than two steps?
   each extra step increases the available buffer space for each run
   an extra pass over the data vs. a decrease in random accesses
   a trade-off between seek & rotation time and transmission time
Major cost factors in merge sort
   seek and rotation time, transmission time, buffer size, number of runs
Increasing Run Lengths Using Replacement Selection(1)
Facts of life
   we want to use heapsort in memory
   we want longer output runs
   can we produce output runs longer than the memory used for the heap?
Replacement Selection
Idea
   always select the key from memory that has the lowest value
   output that key
   replace it with a new key from the input list
   use 2 heaps in the memory buffer
Increasing Run Lengths Using Replacement Selection(2)
Implementation
step 1: read records and build a heap from them (the primary heap)
step 2: write out the record with the lowest key value
step 3: bring in a new record and compare its key with that of the record just output
   step 3-a: if the new key is higher, insert the new record into its proper place in the primary heap, along with the other records being selected for output
   step 3-b: if the new key is lower, place the record in a secondary heap, reserved for records with key values lower than those already written out (they belong to the next run)
step 4: repeat step 3 as long as there are records in the primary heap and records to be read in; when the primary heap is empty, make the secondary heap the primary heap and repeat steps 2 and 3
A compact sketch of these steps follows below.
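A compact sketch of the four steps (an illustration only: it uses plain integer keys and in-memory structures instead of the record- and file-based heaps of the text):

#include <functional>
#include <queue>
#include <vector>
using namespace std;

// Replacement selection sketch with a primary and a secondary min-heap.
// Returns the runs produced; P is the number of keys held in memory.
typedef priority_queue< int, vector<int>, greater<int> > MinHeap;

vector< vector<int> > ReplacementSelection(const vector<int> & input, size_t P)
{
   vector< vector<int> > runs;
   MinHeap primary, secondary;
   size_t next = 0;
   while (next < input.size() && primary.size() < P)   // step 1: fill the primary heap
      primary.push(input[next++]);

   while (!primary.empty()) {
      runs.push_back(vector<int>());                   // start a new run
      vector<int> & run = runs.back();
      while (!primary.empty()) {
         int lowest = primary.top(); primary.pop();    // step 2: output the lowest key
         run.push_back(lowest);
         if (next < input.size()) {                    // step 3: bring in a new key
            if (input[next] >= lowest) primary.push(input[next]);  // step 3-a
            else secondary.push(input[next]);          // step 3-b: held for the next run
            next++;
         }
      }
      swap(primary, secondary);                        // step 4: secondary becomes primary
   }
   return runs;
}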
Example of the principle underlying replacement selection
Input: 21, 67, 12, 5, 47, 16   (the front of the input string is at the right; memory holds P = 3 keys, kept as a heap)

Remaining input      Memory (P=3)      Output run
21, 67, 12           5   47   16      5
21, 67               12  47   16      12, 5
21                   67  47   16      16, 12, 5
-                    67  47   21      21, 16, 12, 5
-                    67  47   -       47, 21, 16, 12, 5
-                    67  -    -       67, 47, 21, 16, 12, 5
Replacement Selection(1)
What happens if a key arrives in memory too late to be output into its proper position relative to the other keys? (e.g. if the 4th input key were 2 rather than 12)
   use a second heap; such a key is held back and included in the next run
   refer to Figure 8.25 on page 335
Two questions
   Given P locations in memory, how long a run can we expect replacement selection to produce, on average?
   On average, we can expect a run length of 2P
   Knuth provides an excellent description (pages 335-336)
Comparison of access times required to sort 8 million records, using RAM sorts versus replacement selection:

Approach                           Records per seek   Size of runs   Seeks to    Merge order   Total     Seek & rotation
                                   to form runs       formed         form runs   used          seeks     delay time (min)
800 RAM sorts followed by
an 800-way merge                   10,000             10,000         1,600       800           681,600   58
Replacement selection followed
by a 534-way merge
(records in random order)          2,500              15,000         6,400       534           521,134   48
Replacement selection followed
by a 200-way merge
(records partially ordered)        2,500              40,000         200         200           206,400   30
Step-by-step operation of replacement selection with 2 heaps, working to form two sorted runs(1)
Input: 33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16   (the front of the input string is at the right; memory holds P = 3 keys as a heap; keys shown in parentheses are in the secondary heap, held back for the next run)

Remaining input                           Memory (P=3)         Output run A
33, 18, 24, 58, 14, 17, 7, 21, 67, 12     5    47    16        5
33, 18, 24, 58, 14, 17, 7, 21, 67         12   47    16        12, 5
33, 18, 24, 58, 14, 17, 7, 21             67   47    16        16, 12, 5
33, 18, 24, 58, 14, 17, 7                 67   47    21        21, 16, 12, 5
33, 18, 24, 58, 14, 17                    67   47   ( 7)       47, 21, 16, 12, 5
33, 18, 24, 58, 14                        67  (17)  ( 7)       67, 47, 21, 16, 12, 5
33, 18, 24, 58                           (14) (17)  ( 7)       (first run complete)
Step-by-step operation of replacement selection, working to form two sorted runs(2)
First run complete; start building the second: the held-back keys 14, 17, 7 become the primary heap

Remaining input      Memory (P=3)       Output run B
33, 18, 24, 58       14   17    7       7
33, 18, 24           14   17   58       14, 7
33, 18               24   17   58       17, 14, 7
33                   24   18   58       18, 17, 14, 7
-                    24   33   58       24, 18, 17, 14, 7
-                    -    33   58       33, 24, 18, 17, 14, 7
-                    -    -    58       58, 33, 24, 18, 17, 14, 7
Replacement Selection Plus Multiple Merging
The total number of seeks is less than for the one-step merges
The two-step merge requires transferring the data two more times than the one-step merge does
   the two-step merges with replacement selection are still better, but the results are less dramatic
   refer to the tables on the next two slides
Comparison of merges, considering transmission times(1): 1-step merge

Approach                        Records per seek   Merge      Seeks for    Seek + rotational   Total          Total of seek, rotation,
                                to form runs       pattern    sorts and    delay time          transmission   and transmission
                                                   used       merges       (min)               time (min)     times (min)
RAM sorts                       10,000             800-way    681,700      298                 43             341
Replacement selection
(records in random order)       2,500              534-way    521,134      228                 43             271
Replacement selection
(records partially ordered)     2,500              200-way    206,400      90                  43             133
Comparison of merges, considering transmission times(2): 2-step merge

Approach                        Records per seek   Merge pattern        Seeks for    Seek + rotational   Total          Total of seek, rotation,
                                to form runs       used                 sorts and    delay time          transmission   and transmission
                                                                        merges       (min)               time (min)     times (min)
RAM sorts                       10,000             25 x 32-way          127,200      56                  65             121
                                                   (plus one 25-way)
Replacement selection
(records in random order)       2,500              19 x 28-way          124,438      55                  65             120
                                                   (plus one 19-way)
Replacement selection
(records partially ordered)     2,500              20 x 10-way          110,400      48                  65             113
                                                   (plus one 20-way)
Using Two Disks with Replacement Selection
Two disk drives
Sort phase
the run selection & output can overlap
Merge phase
input & output can overlap
reduce transmission by 50%
seeking is virtually eliminated
output disk becomes input disk, and vice versa
seeking will occur on input disk, output is sequential
substantially reducing merge & transmission time
Memory organization for replacement selection
Figure: input buffers are filled from disk 1 and feed the heap; the heap feeds output buffers, which are written to disk 2.
More Drives? More Processors?
More drives?
Until I/O becomes so fast that processing cannot keep up
with it
More processors?
mainframes
vector and array processors
massively parallel machines
very fast local area networks
Effects of Multiprogramming
Increase the efficiency of overall system by
overlapping processing and I/O
Effects are very hard to predict
A Concept Toolkit for External Sorting
For in-RAM sorting, use heapsort, so that sorting can overlap with I/O
Use as much RAM as possible
Use a multiple-step merge when the number of initial runs is so large that seek and rotation time is much greater than transmission time
Use replacement selection when there is a good chance that the file is partially ordered
Use more than one disk drive and I/O channel so that reading and writing can overlap
Look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed networks
Sorting Files on Tape
Balanced Merge with several tape drives
Figure 8.28 (2-way balanced merge on 4 tapes), Step 1: tape T1 contains runs R1 R3 R5 R7 R9; T2 contains runs R2 R4 R6 R8 R10; T3 and T4 are empty.
If P is the number of passes, N the number of runs, and k the number of input drives, then
   P = ceiling(log_k N)
4 tape drives (2 for input, 2 for output), 10 runs ==> ceiling(log_2 10) = 4 passes
20 tape drives (10 for input, 10 for output), 200 runs ==> ceiling(log_10 200) = 3 passes
Other ways of Balanced Merge
Figures 8.30 and 8.31 show, step by step, the lengths of the runs held on tapes T1-T4 under two alternative balanced-merge patterns: starting from initial runs of length 1 distributed over the input tapes, runs are combined at each step (into runs of length 2, 3, 4, or 5) until a single run of length 10 remains on one tape.
K-way Balanced Merge on Tapes
Some difficult questions
How does one choose an initial distribution that leads readily to an
efficient merge pattern?
Are there algorithmic descriptions of the merge patterns, given an
initial distribution?
Given N runs and J tape drives, is there some way to compute the
optimal merging performance so we have a yardstick against which
to compare the performance of any specific algorithm?
Unix: Sorting and Cosequential Processing
Sorting in Unix
The Unix sort command
The qsort library routine
Cosequential processing utilities in Unix
compare: cmp
difference: diff
common lines: comm
Let's Review!!
8.1 Cosequential operations
8.2 Application of the Model to a General Ledger Program
8.3 Extension of the Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix