Organizing Files for
Performance
Chapter 6
Jim Skon
File Processing - Organizing file for Performance
MVNC
Organizing Files for
Performance
Data Compression
Reclaiming space in files
Fast Searching
Keysorting
Data Compression
Making files smaller
Use less storage, save space
Faster Transmission
Processed faster
Data Compression
encoding information more efficiently
Many techniques exist
Data Compression
Consider fields with fixed length or fixed set of
values
A binary representation can save space
States - 50 states - 6 bits (one byte)
Zip - 0 to 99999. 17 bits (three bytes)
Called Compact Notation
Redundancy reduction
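A minimal sketch of compact notation, assuming a hypothetical record that holds a two-letter state and a zip code. The names STATES, pack and unpack are illustrative only, and the state list is truncated; the byte widths follow the slide (one byte for the state, three for the zip).

```python
# Hypothetical lookup table: only the position of each state matters.
STATES = ["AL", "AK", "AZ", "AR", "CA", "CO"]  # truncated for brevity

def pack(state: str, zip_code: int) -> bytes:
    """Encode the state as 1 byte and the zip code (0..99999) as 3 bytes."""
    if not 0 <= zip_code <= 99999:
        raise ValueError("zip code out of range")
    return bytes([STATES.index(state)]) + zip_code.to_bytes(3, "big")

def unpack(data: bytes):
    """Reverse of pack(): recover the state abbreviation and the zip code."""
    return STATES[data[0]], int.from_bytes(data[1:4], "big")

packed = pack("CA", 90210)
print(len(packed), unpack(packed))  # 4 ('CA', 90210)
```

The text form "CA90210" takes 7 bytes; the packed form takes 4, but can no longer be read with a text editor.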
Data Compression
Cost of binary representations
File not readable as text
Processing time for conversion
All software must include appropriate, compatible
encoding and decoding routines.
Potential loss of flexibility
Data Compression
Suppressing repeating sequences
Consider a picture
Series of pixels - each a color
Colors represented by 8 bit value
usually come in bunches, e.g.
24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66
Run length encoding
Represent long runs with a prefix (FF) followed by count, followed by color
24 23 FF 05 22 FF 06 25 65 65 FF 04 66
Simple images would be small, busy images would be no
bigger.
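The run-length scheme above can be sketched as follows. This is an illustration under stated assumptions, not a standard format: pixel values are read as hex bytes, runs of 4 or more are encoded (the slide leaves the run of two 65s alone), and a literal FF pixel is always written as a run so the decoder cannot mistake it for a prefix. The function names are hypothetical.

```python
RUN_MARKER = 0xFF  # prefix byte that announces a run, as on the slide

def rle_encode(pixels: bytes, min_run: int = 4) -> bytes:
    """Replace runs of min_run or more equal bytes with FF, count, value."""
    out = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        if run >= min_run or value == RUN_MARKER:
            out += bytes([RUN_MARKER, run, value])   # encoded run
        else:
            out += bytes([value]) * run              # short run stays literal
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Expand every FF, count, value triple back into a run."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == RUN_MARKER:
            out += bytes([data[i + 2]]) * data[i + 1]
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

pixels = bytes.fromhex("24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66")
encoded = rle_encode(pixels)
print(encoded.hex(" ").upper())  # 24 23 FF 05 22 FF 06 25 65 65 FF 04 66
```

Here 19 bytes shrink to 13, and decoding restores the original exactly.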
Data Compression
Assigning variable length codes
Some codes are more likely than others
Use shorter codes for often used values, longer
ones for less used values.
Each code must have the property of a unique
prefix
No code is the prefix of any other code
Thus we always know if we are at the end of a given code
Variable length codes
Example:
Letter:  a    b    c    d     e     f     g
Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011
Can be decoded with a binary tree!
Called Huffman code
Algorithm exists to easily create optimal code
Requires that a table of codes be maintained with file
Most often used for fixed codes
Example - Group 3 FAX
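A small sketch of encoding and decoding with the code table from the example. It uses a dictionary lookup instead of an explicit binary tree, and bit strings instead of packed bits, purely to keep the illustration short; the function names are hypothetical.

```python
# Code table copied from the example slide.
CODES = {"a": "1", "b": "010", "c": "011",
         "d": "0000", "e": "0001", "f": "0010", "g": "0011"}
PROBS = {"a": 0.4, "b": 0.1, "c": 0.1, "d": 0.1, "e": 0.1, "f": 0.1, "g": 0.1}

def encode(text: str) -> str:
    """Concatenate the variable-length code of each letter."""
    return "".join(CODES[ch] for ch in text)

def decode(bits: str) -> str:
    """Read bits until they match a code; the unique-prefix property
    guarantees the first match is the only possible one."""
    lookup = {code: letter for letter, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a code")
    return "".join(out)

bits = encode("abed")
print(bits, decode(bits))  # 101000010000 abed

# Expected code length: 2.6 bits per letter, versus 3 bits for a
# fixed-length code covering seven letters.
average = sum(PROBS[ch] * len(CODES[ch]) for ch in CODES)
```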
Data Compression
Irreversible Compression
Compression which loses some information
Example - compress a 400x400 image into a
100x100 image by averaging groups of 16
adjacent pixels
Saves space, but resolution of picture reduced
Used most often for visual or audio information
(which has inherent redundancy)
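The averaging example can be sketched as below, assuming a grayscale image stored as a list of rows whose dimensions are multiples of the block size. The function name and the synthetic test image are illustrative only.

```python
def downsample(image, block=4):
    """Average each block x block group of pixels into one output pixel."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc]
                for dr in range(block) for dc in range(block)) // (block * block)
            for c in range(0, cols, block)
        ]
        for r in range(0, rows, block)
    ]

# Synthetic 400x400 image; each 4x4 group (16 pixels) becomes one pixel.
image = [[(r + c) % 256 for c in range(400)] for r in range(400)]
small = downsample(image)
print(len(small), len(small[0]))  # 100 100
```

The original 160,000 pixel values cannot be recovered from the 10,000 averages, which is what makes the compression irreversible.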
Data Compression
Compression in UNIX
pack and unpack programs
Uses Huffman coding
25% to 40% savings on text files
much less on binary files
Uses .z file suffix
compress and uncompress programs
Uses Lempel-Ziv compression
No coding table needed - self coding
Uses .Z file suffix
Reclaiming space in files
Suppose a variable length record in the
middle of a file is modified so it is:
Longer?
Shorter?
Suppose a record is
Added to the middle?
Deleted from middle?
Reclaiming space in files
Record deletion and storage compaction
storage compaction
recovering unused space in a file
from deletion or from record size changing
Consider deleted records
Must be able to recognize deleted records
Have a special mark for record
e.g., an asterisk as the first character of the key field
May be undeleted if not overwritten!
Dealing with Deleted
records
Occasional compaction
Dynamic maintenance
Occasional compaction
A process run periodically that reads the file
and rewrites it with no empty space.
Could run automatically
every night/week/month
File unavailable while operation underway.
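A minimal compaction sketch, assuming fixed-length 8-byte records and an asterisk in the first byte as the deletion mark. io.BytesIO stands in for real files, and the record layout is hypothetical.

```python
import io

RECORD_SIZE = 8  # assumed record length for this illustration

def compact(src, dst):
    """Copy every record that is not marked deleted; return how many were kept."""
    src.seek(0)
    kept = 0
    while True:
        record = src.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break                      # end of file
        if not record.startswith(b"*"):
            dst.write(record)
            kept += 1
    return kept

old = io.BytesIO(b"0001aaaa*002bbbb0003cccc")
new = io.BytesIO()
print(compact(old, new), new.getvalue())  # 2 b'0001aaaa0003cccc'
```

Because the whole file is rewritten, readers and writers have to wait until the new copy replaces the old one.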
Dynamic maintenance
Delete records by marking
Reuse deleted records as new records are added or
updated
Need:
Way of knowing if deleted records exist
Where deleted records are so we can jump right to
them
Dynamic maintenance
Solution: linked list of deleted records
Each deleted record contains a mark, and a pointer
to the next deleted record
The file header contains a pointer to the first
deleted record.
Linked list of deleted
records
Fixed-length records
Variable-length records
Linked list of deleted
records
Fixed-length records
Simply maintain a stack of deleted records rooted
in header record
Deletion - add to front of list
Addition - use record at front of list
Minimal list maintenance cost
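The stack of deleted fixed-length records might look like the sketch below. The layout is an assumption for illustration: a 4-byte header holding the RRN of the first deleted record (-1 when the list is empty), 16-byte records, and a deleted record that stores an asterisk followed by the RRN of the next deleted record. io.BytesIO stands in for the data file.

```python
import io
import struct

HEADER_SIZE = 4    # header: RRN of the first deleted record
RECORD_SIZE = 16   # assumed fixed record length
NO_RECORD = -1     # end-of-list marker

def _offset(rrn):
    return HEADER_SIZE + rrn * RECORD_SIZE

def _head(f):
    f.seek(0)
    return struct.unpack("<i", f.read(4))[0]

def _set_head(f, rrn):
    f.seek(0)
    f.write(struct.pack("<i", rrn))

def create(f):
    _set_head(f, NO_RECORD)

def delete(f, rrn):
    """Mark the record and push it on the front of the deleted list."""
    old_head = _head(f)
    f.seek(_offset(rrn))
    f.write(b"*" + struct.pack("<i", old_head))
    _set_head(f, rrn)

def add(f, data):
    """Reuse the record at the front of the list, else append at end of file."""
    rrn = _head(f)
    if rrn == NO_RECORD:
        end = f.seek(0, io.SEEK_END)
        rrn = (end - HEADER_SIZE) // RECORD_SIZE
    else:
        f.seek(_offset(rrn) + 1)            # skip the '*' mark
        _set_head(f, struct.unpack("<i", f.read(4))[0])
    f.seek(_offset(rrn))
    f.write(data.ljust(RECORD_SIZE, b" ")[:RECORD_SIZE])
    return rrn

f = io.BytesIO()
create(f)
for name in (b"Ames", b"Brown", b"Cole"):
    add(f, name)                            # RRNs 0, 1, 2
delete(f, 0)
delete(f, 2)                                # list is now 2 -> 0
print(add(f, b"Dunn"), add(f, b"Eads"), add(f, b"Ford"))  # 2 0 3
```

Both operations touch only the header and one record, which is why the list costs so little to maintain.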
Linked list of deleted
records
Variable-length records
Store for each deleted record
Deletion Marker
Link to next deleted record
record size indicator
Variable-length records
Insertion
Which deleted record?
Deletion
Add records to list (stack?)
Where?
Variable-length records Insertion
Select and use a deleted record
Break up records
pick a record
If size of deleted record bigger, break into two - a record
to use and a new, smaller, deleted record.
Put smaller deleted record back in list
Leave empty space at end
pick a record
If size of deleted record bigger, just leave empty space
at end.
Variable-length records Fragmentation
Recall fragmentation in fixed-length records
At the end of fields, if the fields are fixed length
At the end of records, if the fields are variable length
Called internal fragmentation
Leaving space at the end of a variable-length
record also leads to internal fragmentation.
Breaking up variable-length records gets rid of
fragmentation, right? Wrong!
Variable-length records Fragmentation
As records get broken up, smaller and smaller
pieces get left over.
These pieces are external fragmentation
Variable-length records Insertion strategy
How to pick record to use?
First Fit
Use first deleted record found in list
Best Fit
Use deleted record closest in size
Worst Fit
Use deleted record that is largest
No good when not breaking up records!
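The three placement strategies, combined with breaking up the chosen record, can be sketched as below. This is a simplified illustration: the deleted list is kept in memory as (offset, size) pairs, and the space taken by the deletion mark, link, and size fields is ignored. The function names are hypothetical.

```python
def pick(avail, needed, strategy="first"):
    """Return the index of the deleted record to reuse, or None if none fits.
    avail is a list of (offset, size) pairs in list order."""
    fits = [i for i, (_, size) in enumerate(avail) if size >= needed]
    if not fits:
        return None
    if strategy == "first":
        return fits[0]                                  # first that is big enough
    if strategy == "best":
        return min(fits, key=lambda i: avail[i][1])     # closest in size
    if strategy == "worst":
        return max(fits, key=lambda i: avail[i][1])     # largest
    raise ValueError("unknown strategy: " + strategy)

def allocate(avail, needed, strategy="first"):
    """Take needed bytes from a deleted record and put the remainder back."""
    i = pick(avail, needed, strategy)
    if i is None:
        return None                  # caller appends at the end of the file
    offset, size = avail.pop(i)
    if size > needed:
        avail.append((offset + needed, size - needed))  # smaller deleted record
    return offset

for strategy in ("first", "best", "worst"):
    avail = [(100, 40), (300, 25), (500, 72)]
    print(strategy, allocate(avail, 20, strategy), avail)
```

With a 20-byte request, first fit takes the record at offset 100, best fit the one at 300 (leaving a 5-byte sliver), and worst fit the one at 500 (leaving a still-useful 52 bytes).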
Variable-length records Insertion
How do we find the record with the desired
size?
Search them ALL!
Keep the records in sorted order by record size
Increasing size facilitates Best fit
Decreasing size facilitates worst fit (just pick first in list)
This increases deletion time!
Variable-length records Reducing fragmentation
Merge adjacent free records
How do we know if a newly deleted record is
adjacent to a free record?
Search the deleted list
Keep deleted records sorted by position in file
This makes finding of adjacent free space trivial
Costs more at deletion time
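A sketch of merging adjacent free records when the deleted list is kept sorted by position. The list is an in-memory list of (offset, size) pairs, which is an assumption made for illustration; on disk the same idea applies to the linked list.

```python
import bisect

def free(avail, offset, size):
    """Insert a newly deleted record into a list sorted by offset,
    merging it with the free record just after and just before it."""
    i = bisect.bisect_left(avail, (offset, size))
    # Merge with the following free record if it starts where this one ends.
    if i < len(avail) and offset + size == avail[i][0]:
        size += avail[i][1]
        del avail[i]
    # Merge with the preceding free record if it ends where this one starts.
    if i > 0 and avail[i - 1][0] + avail[i - 1][1] == offset:
        offset, size = avail[i - 1][0], avail[i - 1][1] + size
        del avail[i - 1]
        i -= 1
    avail.insert(i, (offset, size))

avail = [(0, 10), (30, 10)]
free(avail, 10, 20)      # fills the gap between the two free records
print(avail)             # [(0, 40)]
```

Only the two neighbors in the sorted list need to be checked, but keeping the list sorted is what makes deletion more expensive.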
Fast Searching
Binary Searching
O(log n), where n is number of records
requires file be sorted
Question - how do we sort file?
File Sorting
Sort in RAM
read in entire file - sort
Called internal sorting
Limited by size of memory
Binary Search - Problems
Binary searching requires more than one or
two accesses
Accesses are VERY expensive
Accesses are very random (much seek time)
A file of 100,000 records requires an average of 16.5 accesses
We would like to approach the speed of a direct
lookup!
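The access cost can be observed with a small sketch of binary search over a sorted file of fixed-length records. The record layout (a 4-byte key followed by 4 bytes of data) is an assumption, and io.BytesIO stands in for a disk file; every loop iteration corresponds to one seek and one read.

```python
import io

RECORD_SIZE = 8   # assumed: 4-byte key + 4 bytes of data
KEY_SIZE = 4

def binary_search(f, key, n_records):
    """Return (RRN or None, number of file accesses) for the given key."""
    low, high = 0, n_records - 1
    accesses = 0
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * RECORD_SIZE)          # one random access per probe
        record = f.read(RECORD_SIZE)
        accesses += 1
        if record[:KEY_SIZE] == key:
            return mid, accesses
        if record[:KEY_SIZE] < key:
            low = mid + 1
        else:
            high = mid - 1
    return None, accesses

# 1000 records with even keys 0000, 0002, ..., 1998, already in sorted order.
f = io.BytesIO(b"".join(b"%04d" % k + b"data" for k in range(0, 2000, 2)))
print(binary_search(f, b"0042", 1000))
```

With 1000 records no search needs more than 10 probes; each probe lands in a different part of the file, which is where the seek time goes.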
Binary Search - Problems
Keeping a file sorted is expensive
Every record added must be entered in sorted
order
Reordering is costly
Internal sorting is limited to small files
We will see there are sort methods to sort a file
that will not fit in memory. But it is still expensive!
Keysorting
Rather than sorting the file, we could sort an array
of primary keys, where each key is
accompanied by the address of the
associated record.
Pointer could be a byte offset from start, or (if
records fixed length) an RRN.
After sorting the keys, the file can be rewritten in
order.
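A minimal keysort sketch, assuming fixed-length 8-byte records whose first 4 bytes are the key. Only (key, RRN) pairs are held and sorted in memory; the records themselves are then copied in key order. io.BytesIO stands in for the input and output files, and the layout is hypothetical.

```python
import io

RECORD_SIZE = 8   # assumed: 4-byte key + 4 bytes of data
KEY_SIZE = 4

def keysort(f_in, f_out, n_records):
    """Sort (key, RRN) pairs in memory, then rewrite the file in key order."""
    keys = []
    for rrn in range(n_records):
        f_in.seek(rrn * RECORD_SIZE)
        keys.append((f_in.read(KEY_SIZE), rrn))   # keep the key, not the record
    keys.sort()
    for _, rrn in keys:
        f_in.seek(rrn * RECORD_SIZE)              # one seek per record
        f_out.write(f_in.read(RECORD_SIZE))
    return keys

unsorted = io.BytesIO(b"0003cccc0001aaaa0002bbbb")
result = io.BytesIO()
keysort(unsorted, result, 3)
print(result.getvalue())  # b'0001aaaa0002bbbb0003cccc'
```

The second loop reads the input in key order rather than physical order, so every record costs a separate seek.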
Keysorting
Advantages
Keys can be sorted in smaller space than whole
file
Faster to sort (swap!) keys than entire records
Keysorting
Disadvantages
Still limited in size to key lists which fit in memory
Sequential processing cannot take advantage
of buffering!
Keysorting
Alternative - keep the sorted list of (key, pointer)
pairs around.
Is a type of index file!
Can be read in and searched in memory!
Key Sorted Index
Advantages
Keys and pointers can be searched in memory.
Only one I/O per lookup!
File can be maintained in ANY order. Searching
and key order sequential processing still possible.
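A sketch of a key-sorted index over an unsorted file of variable-length records. The record format (key, a '|' delimiter, data, newline) and the function names are assumptions for illustration. The index holds (key, byte offset) pairs, is searched in memory, and each lookup then needs one seek and one read of the data file.

```python
import bisect
import io

def build_index(f):
    """Scan the data file once; return sorted (key, byte offset) pairs."""
    index = []
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        index.append((line.split(b"|", 1)[0], offset))
    index.sort()
    return index

def lookup(f, index, key):
    """Binary-search the in-memory index, then read the record directly."""
    i = bisect.bisect_left(index, (key, 0))
    if i == len(index) or index[i][0] != key:
        return None
    f.seek(index[i][1])
    return f.readline().rstrip(b"\n")

data = io.BytesIO(b"Smith|Ohio\nAdams|Texas\nJones|Utah\n")   # in arrival order
index = build_index(data)
print(lookup(data, index, b"Jones"))  # b'Jones|Utah'
```

Walking the index from front to back visits the records in key order even though the data file itself was never sorted.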
Key Sorted Index
Disadvantages
Sequential processing cannot take advantage
of buffering!
Pinned records
Records in main file cannot change location without
invalidating index file!
Must either maintain index in parallel, or rebuild!