Index Compression
Cam-Tu Nguyen, Ph.D
Email: ncamtu@[Link]
Ch. 5
Today
• Collection statistics in more detail (with RCV1)
• How big will the dictionary and postings be?
• Dictionary compression
• Postings compression
2
Ch. 5
Why compression (in general)?
• Use less disk space
  • Save a little money; give users more space
• Keep more stuff in memory
  • Increases speed
• Increase speed of data transfer from disk to memory
  • [read compressed data | decompress] is faster than [read uncompressed data]
  • Premise: Decompression algorithms are fast
    • True of the decompression algorithms we use
3
Ch. 5
Why compression for inverted indexes?
• Dictionary
• Make it small enough to keep in main memory
• Make it so small that you can keep some postings lists in main memory too
• Postings file(s)
• Reduce disk space needed
• Decrease time needed to read postings lists from disk
• Large search engines keep a significant part of the postings in memory.
• Compression lets you keep more in memory
• We will devise various IR-specific compression schemes
4
Sec. 5.1
Reuters RCV1
symbol   statistic                                        value
N        documents                                        800,000
L        avg. # tokens per doc                            200
M        terms (= word types)                             ~400,000
         avg. # bytes per token (incl. spaces/punct.)     6
         avg. # bytes per token (without spaces/punct.)   4.5
         avg. # bytes per term                            7.5
         non-positional postings                          100,000,000
5
Sec. 5.1
Index parameters vs. what we index
(details IIR Table 5.1, p.80)
               dictionary                  non-positional index          positional index
               Size (K)   ∆%    cumul %    Size (K)    ∆%    cumul %     Size (K)    ∆%    cumul %
Unfiltered     484                         109,971                       197,879
No numbers     474        -2    -2         100,680     -8    -8          179,158     -9    -9
Case folding   392       -17   -19          96,969     -3   -12          179,158      0    -9
30 stopwords   391        -0   -19          83,390    -14   -24          121,858    -31   -38
150 stopwords  391        -0   -19          67,002    -30   -39           94,517    -47   -52
stemming       322       -17   -33          63,812     -4   -42           94,517      0   -52
Exercise: give intuitions for all the ‘0’ entries. Why do some zero entries correspond to big deltas in other columns?
6
Sec. 5.1
Lossless vs. lossy compression
• Lossless compression: All information is preserved.
• What we mostly do in IR.
• Lossy compression: Discard some information
• Several of the preprocessing steps can be viewed as lossy
compression: case folding, stop words, stemming, number
elimination.
• Lossy compression makes sense when the “lost” information is
unlikely ever to be used in IR systems.
• Prune postings entries that are unlikely to turn up in the top k list for any
query. Almost no loss of quality in top k list.
7
Sec. 5.1
Vocabulary size vs. collection size
• How big is the term vocabulary?
• That is, how many distinct words are there?
• Can we assume an upper bound?
• Not really: At least 70^20 ≈ 10^37 different words of length 20.
• In practice, the vocabulary will keep growing with the collection size
• Oxford English Dictionary (OED) defines more than 600,000 words. OED
doesn’t contain names!
• Especially with Unicode ☺
8
Sec. 5.1
Vocabulary size vs. collection size
• Heaps’ law: M = kT^b
• M is the size of the vocabulary, T is the number of tokens in the
collection
• Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
• In a log-log plot of vocabulary size M vs. T, Heaps’ law predicts a line
with slope about ½
• It is the simplest possible (linear) relationship between the two in log-log
space
• log M = log k + b log T
• An empirical finding (“empirical law”)
9
Sec. 5.1
Heaps’ Law
Fig 5.1 p81
For RCV1, the dashed line
log10 M = 0.49 log10 T + 1.64
is the best least-squares fit.
Thus, M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
Good empirical fit for Reuters RCV1!
For the first 1,000,020 tokens, the law predicts 38,323 terms; actually, 38,365 terms.
10
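As a quick sanity check of this fit, the sketch below (Python, plugging in the rounded k = 44 and b = 0.49 from the slide above) reproduces the predicted vocabulary size for the first 1,000,020 tokens.

```python
# Heaps' law M = k * T^b with the RCV1 fit from the slide above (k ≈ 44, b = 0.49).
k, b = 44, 0.49

def heaps(T):
    """Predicted vocabulary size for a collection of T tokens."""
    return k * T ** b

print(round(heaps(1_000_020)))   # ≈ 38,323 predicted terms (38,365 observed)
```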
Sec. 5.1
Exercises
• What is the effect of including spelling errors, vs. automatically
correcting spelling errors on Heaps’ law?
• Compute the vocabulary size M for this scenario:
• Looking at a collection of web pages, you find that there are 3000 different
terms in the first 10,000 tokens and 30,000 different terms in the first
1,000,000 tokens.
• Assume a search engine indexes a total of 20,000,000,000 (2 × 1010) pages,
containing 200 tokens on average
• What is the size of the vocabulary of the indexed collection as predicted by
Heaps’ law?
11
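One way to set up the last exercise (a sketch, not necessarily the intended solution path): estimate b as the slope in log-log space from the two given observations, solve for k, and then apply M = kT^b to the full collection of 2 × 10^10 pages × 200 tokens.

```python
import math

# The two observations given in the exercise: (tokens seen, distinct terms seen)
T1, M1 = 10_000, 3_000
T2, M2 = 1_000_000, 30_000

# log M = log k + b log T, so b is the slope between the two points in log-log space
b = (math.log(M2) - math.log(M1)) / (math.log(T2) - math.log(T1))
k = M1 / T1 ** b

# Indexed collection: 2e10 pages, 200 tokens per page on average
T = 20_000_000_000 * 200
M = k * T ** b
print(f"b = {b:.2f}, k = {k:.0f}, predicted vocabulary M ≈ {M:,.0f}")
```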
Sec. 5.1
Zipf’s law
• Heaps’ law gives the vocabulary size in collections.
• We also study the relative frequencies of terms.
• In natural language, there are a few very frequent terms and very
many very rare terms.
• Zipf’s law: The ith most frequent term has frequency proportional
to 1/i .
• cfi ∝ 1/i, i.e. cfi = K/i where K is a normalizing constant
• cfi is collection frequency: the number of occurrences of the term
ti in the collection.
12
Sec. 5.1
Zipf consequences
• If the most frequent term (the) occurs cf1 times
• then the second most frequent term (of) occurs cf1/2 times
• the third most frequent term (and) occurs cf1/3 times …
• Equivalent: cfi = K/i where K is a normalizing factor, so
• log cfi = log K - log i
• Linear relationship between log cfi and log i
• Another power law relationship
13
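To make the consequence concrete, the small sketch below lists the collection frequencies an ideal Zipf distribution predicts for the three example terms, assuming an arbitrary illustrative value for cf1.

```python
# Ideal Zipf's law: cf_i = cf_1 / i.  The value of cf_1 is made up for illustration.
cf1 = 1_000_000

for i, term in enumerate(["the", "of", "and"], start=1):
    print(f"rank {i}: {term:>3}  cf ≈ {cf1 // i:,}")
# rank 1: the  cf ≈ 1,000,000
# rank 2:  of  cf ≈ 500,000
# rank 3: and  cf ≈ 333,333
```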
Sec. 5.1
Zipf’s law for Reuters RCV1
14
Ch. 5
Compression
• Now, we will consider compressing the space for the
dictionary and postings. We’ll do:
• Basic Boolean index only
• No study of positional indexes, etc.
• But these ideas can be extended
• We will consider compression schemes
15
Sec. 5.2
DICTIONARY COMPRESSION
16
Sec. 5.2
Why compress the dictionary?
• Search begins with the dictionary
• We want to keep it in memory
• Memory footprint competition with other applications
• Embedded/mobile devices may have very little memory
• Even if the dictionary isn’t in memory, we want it to be small for a
fast search startup time
• So, compressing the dictionary is important
17
Sec. 5.2
Dictionary storage – naïve version
• Array of fixed-width entries
• ~400,000 terms; 28 bytes/term = 11.2 MB.
Terms     Freq.     Postings ptr.
a         656,265   →
aachen    65        →
….        ….        ….
zulu      221       →

(Terms: 20 bytes each; Freq. and Postings ptr.: 4 bytes each. A dictionary search structure sits on top of this array.)
18
Sec. 5.2
Fixed-width terms are wasteful
• Most of the bytes in the Term column are wasted – we allot 20 bytes
for 1 letter terms.
• And we still can’t handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
• Written English averages ~4.5 characters/token.
• Exercise: Why is/isn’t this the number to use for estimating the dictionary
size?
• Ave. dictionary term in English: ~8 characters
• How do we use ~8 characters per dictionary term?
• Short words dominate token counts but not type average.
19
Sec. 5.2
Compressing the term list: Dictionary-as-a-String
Store dictionary as a (long) string of characters:
• Pointer to next word shows end of current word
• Hope to save up to 60% of dictionary space

….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….

Freq.   Postings ptr.   Term ptr.
 33         →              →
 29         →              →
 44         →              →
126         →              →

Total string length = 400K x 8B = 3.2 MB
Term pointers must resolve 3.2M positions: log2 3.2M ≈ 22 bits, i.e. 3 bytes per pointer
20
Sec. 5.2
Space for dictionary as a string
• 4 bytes per term for Freq.
• 4 bytes per term for pointer to Postings.
• 3 bytes per term pointer
• Avg. 8 bytes per term in term string
  → now avg. 3 + 8 = 11 bytes per term for the term itself, not 20.
• 400K terms x 19 bytes ⇒ 7.6 MB (against 11.2 MB for fixed width)
21
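A minimal sketch of the dictionary-as-a-string idea (illustrative Python with toy terms and placeholder frequencies/postings): terms are concatenated into one shared string, each entry keeps an offset into it, and the next entry's offset marks where the current term ends.

```python
# Dictionary-as-a-string: concatenate all sorted terms; each dictionary entry
# stores freq, a postings pointer, and an offset (term pointer) into the string.
terms = ["systile", "syzygetic", "syzygial", "syzygy"]   # toy example

term_string = "".join(terms)
entries = []
offset = 0
for t in terms:
    entries.append({"freq": 0, "postings_ptr": None, "term_offset": offset})
    offset += len(t)

def term_at(i):
    """Recover the i-th term: it ends where the (i+1)-th term begins."""
    start = entries[i]["term_offset"]
    end = entries[i + 1]["term_offset"] if i + 1 < len(entries) else len(term_string)
    return term_string[start:end]

assert term_at(2) == "syzygial"
```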
Sec. 5.2
Blocking
• Store pointers to every kth term string.
• Example below: k=4.
• Need to store term lengths (1 extra byte)
….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
Freq.   Postings ptr.   Term ptr.
 33         →              →
 29         →
 44         →
126         →
  7         →

Save 9 bytes on 3 term pointers; lose 4 bytes on term lengths.
22
Sec. 5.2
Blocking Net Gains
• Example for block size k = 4
• Where we used 3 bytes/pointer without blocking (3 x 4 = 12 bytes per block of 4), we now use 3 + 4 = 7 bytes per block.
Shaved another ~0.5MB. This reduces the size of the
dictionary from 7.6 MB to 7.1 MB.
We can save more with larger k.
Question: Why not go with larger k?
23
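The sketch below illustrates blocking with k = 4 (toy Python, not the book's exact on-disk layout): only the first term of each block gets a term pointer, every term carries a one-byte length prefix, and lookup scans within a block after locating its head.

```python
K = 4  # block size

def build_blocked_dictionary(sorted_terms):
    """Length-prefix every term; keep one string offset per block of K terms."""
    parts, block_offsets, pos = [], [], 0
    for i, term in enumerate(sorted_terms):
        if i % K == 0:
            block_offsets.append(pos)
        parts.append(chr(len(term)) + term)   # 1-byte length prefix (toy version)
        pos += 1 + len(term)
    return "".join(parts), block_offsets

def lookup(term, term_string, block_offsets):
    """Scan block by block; a real index would first binary-search the block heads."""
    for start in block_offsets:
        pos, scanned = start, 0
        while scanned < K and pos < len(term_string):
            length = ord(term_string[pos])
            if term_string[pos + 1 : pos + 1 + length] == term:
                return True
            pos += 1 + length
            scanned += 1
    return False

s, offs = build_blocked_dictionary(
    ["syzygetic", "syzygial", "syzygy", "szaibelyite", "szczecin"])
assert lookup("szaibelyite", s, offs)
```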
Sec. 5.2
Dictionary search without blocking
• Assuming each dictionary
term equally likely in
query (not really so in
practice!), average
number of comparisons =
(1+2·2+4·3+4)/8 ~2.6
Exercise: what if the frequencies
of query terms were non-uniform
but known, how would you
structure the dictionary search
tree?
24
Dictionary search with blocking
• Binary search down to 4-term block;
• Then linear search through terms in block.
• Blocks of 4 (binary tree), avg. = (1+2·2+2·3+2·4+5)/8 = 3 compares
Sec. 5.2
Exercises
• Estimate the space usage (and savings compared to 7.6 MB) with
blocking, for block sizes of k = 4, 8 and 16.
• Estimate the impact on search performance (and slowdown
compared to k=1) with blocking, for block sizes of k = 4, 8 and 16.
26
Sec. 5.2
Front coding
• Front-coding:
• Sorted words commonly have long common prefix – store differences only
• (for last k-1 in a block of k)
8automata8automate9automatic10automation
→ 8automat*a 1⋄e 2⋄ic 3⋄ion
(“automat*” encodes the common prefix; the number before each ⋄ gives the extra length beyond automat.)
Begins to resemble general string compression.
27
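A sketch of front coding for one block (illustrative Python): store the first term in full, then for each following term keep only the number of extra characters and the suffix beyond the shared prefix, mirroring the 8automat*a 1⋄e 2⋄ic 3⋄ion encoding above.

```python
import os

def front_code_block(block):
    """Front-code a block of sorted terms.
    Returns the common prefix, the first term's length and suffix,
    and for every later term a pair (extra length, suffix)."""
    prefix = os.path.commonprefix(block)
    first = block[0]
    rest = [(len(t) - len(prefix), t[len(prefix):]) for t in block[1:]]
    return prefix, len(first), first[len(prefix):], rest

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# ('automat', 8, 'a', [(1, 'e'), (2, 'ic'), (3, 'ion')])
```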
Sec. 5.2
RCV1 dictionary compression summary
Technique Size in MB
Fixed width 11.2
Dictionary-as-String with pointers to every term 7.6
+ blocking, k = 4 7.1
+ blocking + front coding 5.9
28
Sec. 5.3
POSTINGS COMPRESSION
29
Sec. 5.3
Postings compression
• The postings file is much larger than the dictionary, factor of at least
10, often over 100 times larger
• Key desideratum: store each posting compactly.
• A posting for our purposes is a docID.
• For Reuters (800,000 documents), we would use 32 bits per docID
when using 4-byte integers.
• Alternatively, we can use log2 800,000 ≈ 20 bits per docID.
• Our goal: use far fewer than 20 bits per docID.
30
Sec. 5.3
Postings: two conflicting forces
• A term like arachnocentric occurs in maybe one doc out of a million –
we would like to store this posting using log2 1M ≈ 20 bits.
• A term like the occurs in virtually every doc, so 20 bits/posting ≈ 2MB
is too expensive.
• Prefer 0/1 bitmap vector in this case (≈100K)
31
Sec. 5.3
Three postings entries
(Figure: example postings lists, annotating the gap between consecutive postings.)
32
Sec. 5.3
Gap encoding of postings file entries
• We store the list of docs containing a term in increasing order of
docID.
• computer: 33,47,154,159,202 …
• Consequence: it suffices to store gaps.
• 33,14,107,5,43 …
• Hope: most gaps can be encoded/stored with far fewer than 20 bits.
• Especially for common words
33
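The transformation itself is a one-liner in each direction; the sketch below (Python) uses the docIDs from the slide above.

```python
def to_gaps(doc_ids):
    """Keep the first docID as-is; store every later posting as a difference."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover docIDs with a running sum over the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

assert to_gaps([33, 47, 154, 159, 202]) == [33, 14, 107, 5, 43]
assert from_gaps([33, 14, 107, 5, 43]) == [33, 47, 154, 159, 202]
```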
Sec. 5.3
Variable length encoding
• Aim:
• For arachnocentric, we will use ~20 bits/gap entry.
• For the, we will use ~1 bit/gap entry.
• If the average gap for a term is G, we want to use ~log2G bits/gap
entry.
• Key challenge: encode every integer (gap) with about as few bits as
needed for that integer.
• This requires a variable length encoding
• Variable length codes achieve this by using short codes for small
numbers
34
Sec. 5.3
Variable Byte (VB) codes
• For a gap value G, we want to use close to the fewest bytes needed
to hold log2 G bits
• Begin with one byte to store G and dedicate 1 bit in it to be a
continuation bit c
• If G ≤127, binary-encode it in the 7 available bits and set c =1
• Else encode G’s lower-order 7 bits and then use additional bytes to
encode the higher order bits using the same algorithm
• At the end set the continuation bit of the last byte to 1 (c =1) – and for
the other bytes c = 0.
35
Sec. 5.3
Example
docIDs    824                  829          215406
gaps                           5            214577
VB code   00000110 10111000    10000101     00001101 00001100 10110001
Postings stored as the byte concatenation
000001101011100010000101000011010000110010110001
Key property: VB-encoded postings are uniquely prefix-decodable.
Note: for a small gap (5), VB still uses a whole byte.
36
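The sketch below implements VB encoding and decoding in Python along the lines of the algorithm on the Variable Byte (VB) codes slide above; it reproduces the byte patterns of this example for 824, 5 and 214577.

```python
def vb_encode_number(n):
    """Variable-byte encode one integer: 7 payload bits per byte,
    continuation bit set only on the last byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                    # mark the final byte of this number
    return bytes(out)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:                # continuation bit not set: keep accumulating
            n = 128 * n + byte
        else:                         # last byte of the current number
            n = 128 * n + (byte - 128)
            numbers.append(n)
            n = 0
    return numbers

encoded = vb_encode([824, 5, 214577])
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
assert vb_decode(encoded) == [824, 5, 214577]
```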
Gamma code preliminary: Unary code
• Represent n as n 1s with a final 0.
• Unary code for 3 is 1110.
• Unary code for 40 is
11111111111111111111111111111111111111110 .
• Unary code for 80 is eighty 1s followed by a 0.
• This doesn’t look promising, but….
• We can use it as part of our solution
37
Sec. 5.3
Gamma codes
• We can compress better with bit-level codes
• The Gamma code is the best known of these.
• Represent a gap G as a pair length and offset
• offset is G in binary, with the leading bit cut off
• For example 13 → 1101 → 101
• length is the length of offset
• For 13 (offset 101), this is 3.
• We encode length with unary code: 1110.
• Gamma code of 13 is the concatenation of length and offset: 1110101
38
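A bit-string sketch of the γ-code (Python, using '0'/'1' strings rather than packed bits, purely for illustration):

```python
def unary(n):
    """n ones followed by a terminating zero."""
    return "1" * n + "0"

def gamma_encode(g):
    """Gamma code of g >= 1: unary(length of offset) followed by the offset,
    where the offset is g in binary with the leading 1 removed."""
    offset = bin(g)[3:]               # bin(13) == '0b1101' -> offset '101'
    return unary(len(offset)) + offset

def gamma_decode_one(bits, i=0):
    """Decode one gamma-coded integer starting at bit position i;
    return (value, position after the code)."""
    length = 0
    while bits[i] == "1":             # read the unary length part
        length += 1
        i += 1
    i += 1                            # skip the terminating 0
    offset = bits[i:i + length]
    return int("1" + offset, 2), i + length

assert gamma_encode(13) == "1110101"
assert gamma_encode(1) == "0"
assert gamma_decode_one("1110101") == (13, 7)
```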
Sec. 5.3
Gamma code examples
number   length        offset        γ-code
0                                    none
1        0                           0
2        10            0             10,0
3        10            1             10,1
4        110           00            110,00
9        1110          001           1110,001
13       1110          101           1110,101
24       11110         1000          11110,1000
511      111111110     11111111      111111110,11111111
1025     11111111110   0000000001    11111111110,0000000001
39
Sec. 5.3
Gamma code properties
• G is encoded using 2⌊log2 G⌋ + 1 bits
• Length of offset is ⌊log2 G⌋ bits
• Length of length is ⌊log2 G⌋ + 1 bits
• All gamma codes have an odd number of bits
• Almost within a factor of 2 of best possible, log2 G
• Gamma code is uniquely prefix-decodable, like VB
• Gamma code can be used for any distribution
• Optimal for P(n) ≈ 1/(2n^2)
• Gamma code is parameter-free
40
Sec. 5.3
Gamma seldom used in practice
• Machines have word boundaries – 8, 16, 32, 64 bits
• Operations that cross word boundaries are slower
• Compressing and manipulating at the granularity of bits can be too
slow
• All modern practice is to use byte or word aligned codes
• Variable byte encoding is a faster, conceptually simpler compression
scheme, with decent compression
41
Sec. 5.3
RCV1 compression
Data structure Size in MB
dictionary, fixed-width 11.2
dictionary, term pointers into string 7.6
with blocking, k = 4 7.1
with blocking & front coding 5.9
collection (text, xml markup etc) 3,600.0
collection (text) 960.0
Term-doc incidence matrix 40,000.0
postings, uncompressed (32-bit words) 400.0
postings, uncompressed (20 bits) 250.0
postings, variable byte encoded 116.0
postings, γ-encoded 101.0
42
Sec. 5.3
Index compression summary
• We can now create an index for highly efficient Boolean retrieval that
is very space efficient
• Only 4% of the total size of the collection
• Only 10-15% of the total size of the text in the collection
• We’ve ignored positional information
• Hence, space savings are less for indexes used in practice
• But techniques substantially the same
47
Acknowledgement
• The slides and examples of this presentation are from Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008.