-
1
Data Warehousing
Need for Speed:
Conventional Indexing Techniques
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology, Islamabad
Pakistan
anwarchaudary@gmail.com
-
2
Need For Indexing: Speed
Consider searching your hard disk using the Windows
SEARCH command.
 Search goes into directory hierarchies.
 Takes about a minute, and there are only a few thousand files.
Assume a fast processor and (even more importantly) a fast
hard disk.
 Assume file size to be 5 KB.
 Assume hard disk scan rate of a million files per second.
 Resulting in scan rate of 5 GB per second.
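The largest search engine indexes more than 8 billion pages.
 At the above scan rate (a million 5 KB pages per second), about 8,000 seconds are required to scan ALL pages; even if pages averaged only 1 KB, it would still take 1,600 seconds.
 This is just for one user!
 No one is going to wait 26 minutes or more, not even 26 seconds.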
Hence, a sequential scan is simply not feasible.
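The estimate can be checked with a few lines of arithmetic. This is only a sketch: the function and constant names are illustrative, decimal units are assumed (1 GB = 1,000,000 KB), and the result depends on the page size assumed. At 5 KB per page the scan takes 8,000 seconds; at 1 KB per page it takes 1,600 seconds.

```python
# Back-of-the-envelope sequential scan time (illustrative names, decimal units).
SCAN_RATE_GB_PER_S = 5           # 1 million files/s * 5 KB per file
PAGES = 8_000_000_000            # pages indexed by the largest search engine
KB_PER_GB = 1_000_000

def scan_seconds(pages: int, page_kb: float, rate_gb_per_s: float) -> float:
    """Seconds needed to read every page once, one after another."""
    total_kb = pages * page_kb
    return total_kb / (rate_gb_per_s * KB_PER_GB)

for page_kb in (5, 1):
    seconds = scan_seconds(PAGES, page_kb, SCAN_RATE_GB_PER_S)
    print(f"{page_kb} KB pages: {seconds:.0f} s (about {seconds / 60:.0f} min)")
```

Either way the answer is tens of minutes or more per query, which is the point of the slide.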
-
3
Need For Indexing: Query Complexity
 How many customers do I have in Karachi?
 How many customers in Karachi made calls during
April?
 How many customers in Karachi made calls to
Multan during April?
 How many customers in Karachi made calls to
Multan during April using a particular calling
package?
-
4
Need For Indexing: I/O Bottleneck
 Throwing hardware at the problem only speeds up CPU-intensive tasks.
 The problem is I/O, which does not scale up easily.
 Putting the entire table in RAM is very expensive.
 Therefore, index!
-
5
Indexing Concept
 Purely a physical concept; it has nothing to do with the logical model.
 Invisible to the end user (programmer): the optimizer chooses it, and it affects only the speed, not the answer.
 Library analogy: what is the time complexity (the average time taken) to find a book?
 Use a card catalog that is sorted and organized in several different ways, e.g. by author, topic, title, etc.
 It takes a little extra time to check the catalog first, but it “gives” a pointer to the shelf and the row where the book is located.
 The catalog has no data about the book itself; it is just an efficient way of searching.
-
6
Indexing Goal
Look at as few blocks as
possible to find the
matching record(s)
-
7
Conventional indexing Techniques
 Dense
 Sparse
 Multi-level (or B-Tree)
 Primary Index vs. Secondary Indexes
-
8
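Dense Index: Concept
[Figure: a dense index file with keys 10, 20, 30, ..., 120, each entry pointing to the record with that key in the data file; the data file is drawn as blocks of two records (10, 20), (30, 40), ..., (90, 100).]
Every key in the data file is represented in the index file.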
-
9
Dense Index: Advantages & Disadvantages
 Advantage:
 A dense index, if it fits in memory, is very efficient at locating a record given a key.
 Disadvantage:
 A dense index that is too big to fit in memory is expensive to use for finding a record given its key.
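A dense index can be sketched as a sorted list with one entry per record. This is a minimal in-memory illustration, not a disk-based implementation: list positions stand in for disk addresses, and the names `data_file`, `dense_index` and `dense_lookup` are hypothetical.

```python
from bisect import bisect_left

# Data file: records sorted by key; the list position stands in for a disk address.
data_file = [(k, f"record-{k}") for k in range(10, 130, 10)]

# Dense index: one (key, position) entry for EVERY record in the data file.
dense_index = [(key, pos) for pos, (key, _) in enumerate(data_file)]
index_keys = [key for key, _ in dense_index]

def dense_lookup(key):
    """Binary-search the index; touch the data file only on a hit."""
    i = bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        _, pos = dense_index[i]
        return data_file[pos][1]
    return None  # absence is decided from the index alone

print(dense_lookup(40), dense_lookup(45))
```

Because every key is in the index, a miss never needs to read the data file at all; the cost is an index with as many entries as the file has records.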
-
10
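Sparse Index: Concept
[Figure: a sparse index file with keys 10, 30, 50, ..., 230, one entry per data block, each pointing to the block that starts with that key; the data file is drawn as blocks of two records (10, 20), (30, 40), ..., (90, 100).]
Normally keeps only one key per data block.
Some keys in the data file will not have an entry in the index file.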
-
11
Sparse Index: Advantages & Disadvantages
 Advantage:
 A sparse index uses less space, at the expense of somewhat more time to find a record given its key.
 Supports a multi-level indexing structure.
 Disadvantage:
 Locating a record given a key has different performance for different key values.
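The trade-off can be sketched with a sparse index that stores only the first key of each data block. This is an in-memory illustration under assumed names (`blocks`, `sparse_keys`, `sparse_lookup`); the second return value counts records scanned inside the block, which is where the key-dependent cost shows up.

```python
from bisect import bisect_right

# Data file as blocks of two records each, sorted by key: (10, 20), (30, 40), ... (90, 100).
blocks = [[(k, f"record-{k}"), (k + 10, f"record-{k + 10}")] for k in range(10, 100, 20)]

# Sparse index: only the first key of each block.
sparse_keys = [block[0][0] for block in blocks]  # [10, 30, 50, 70, 90]

def sparse_lookup(key):
    """Return (record or None, number of records scanned in the data block)."""
    b = bisect_right(sparse_keys, key) - 1   # last block whose first key <= key
    if b < 0:
        return None, 0                       # key is smaller than every indexed key
    scanned = 0
    for k, rec in blocks[b]:
        scanned += 1
        if k == key:
            return rec, scanned
    return None, scanned                     # a miss is only known after reading the block

print(sparse_lookup(30), sparse_lookup(40), sparse_lookup(45))
```

Key 30 is found on the first record of its block, key 40 on the second, and a missing key such as 45 still costs a block read.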
-
12
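Sparse Index: Multi level
[Figure: a second-level sparse index with keys 10, 90, 170, 250, 330, 410, 490, 570 pointing into the first-level sparse index with keys 10, 30, 50, ..., 230, which in turn points to the data file blocks (10, 20), (30, 40), ..., (90, 100).]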
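A two-level lookup can be sketched as one search in the second level followed by one search in a single first-level block. The fan-out of four entries per index block is an assumption chosen so that the second-level keys come out as 10, 90, 170, matching the figure; all names are illustrative.

```python
from bisect import bisect_right

FANOUT = 4  # assumed number of index entries per index block

# First-level sparse index keys: 10, 30, ..., 230.
level1 = list(range(10, 250, 20))
# Split the first level into blocks; the second level keeps each block's first key.
level1_blocks = [level1[i:i + FANOUT] for i in range(0, len(level1), FANOUT)]
level2 = [blk[0] for blk in level1_blocks]   # [10, 90, 170]

def find_data_block(key):
    """Return the first key of the data block that may hold `key`, or None."""
    i = bisect_right(level2, key) - 1        # one read of the second level
    if i < 0:
        return None
    blk = level1_blocks[i]                   # one read of a first-level block
    j = bisect_right(blk, key) - 1
    return blk[j]

print(find_data_block(100), find_data_block(230))
```

Only one first-level block is read per lookup, however large the first level grows.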
-
13
B-tree Indexing: Concept
 Can be seen as a general form of multi-level indexes.
 Generalizes the usual (binary) search trees (BST).
 Allows efficient and fast exploration at the expense of using slightly more space.
 Popular variant: B+-tree
 Efficiently supports queries like:
SELECT * FROM R WHERE a = 11
SELECT * FROM R WHERE 0 <= b AND b < 42
-
14
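B-tree Indexing: Example
[Figure: a B-tree on Empno. Internal node keys shown: 130 and 200, 220, 250, 280. Leaf-level keys: 9, 20, 100, 140, 145, 200, 210, 215, 220, 230, 250, 256, 279, 280, 300, with leaf entries pointing to RID lists.]
Each node is stored in one disk block.
Looking for Empno 250.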
-
15
B-tree Indexing: Limitations
 Less effective when a table is large and has few unique values.
 Capitalization is not programmatically enforced, so case matters: “FLASHMAN” is different from “Flashman”.
 A name spelled differently will produce different results.
 Insertion can be very expensive.
-
16
B-tree Indexing: Limitations Example
Given that MOHAMMED is the most common first name in Pakistan,
a 5-million row Customers table would produce many screens of
matching rows for MOHAMMED AHMAD, yet would skip potential
matching values such as the following:
VALUE MISSED          REASON MISSED
Mohammed Ahmad        Case sensitive
MOHAMMED AHMED        AHMED versus AHMAD
MOHAMMED  AHMAD       Extra space between names
MOHAMMED AHMAD DR     DR after AHMAD
MOHAMMAD AHMAD        Alternative spelling of MOHAMMAD
-
17
Hash Based Indexing
 You may recall that in internal memory, hashing can be used to quickly locate a specific key.
 The same technique can be used on external memory.
 However, its advantage over search trees is smaller for external search than for internal search. WHY?
 Because part of the search tree can be brought into main memory.
-
18
Hash Based Indexing: Concept
In contrast to B-tree indexing, hash-based indexes do not (typically) keep index values in sorted order.
 An index entry is found by hashing on the index value, which requires an exact match.
SELECT * FROM Customers WHERE AccttNo = 110240
 Index entries are kept in hash-organized tables rather than B-tree structures.
 An index entry contains the ROWID values for each row corresponding to the index value.
 Remember that few numbers in real life are useful for hashing.
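The idea can be sketched with a small bucket array in which each entry holds an index value and its list of ROWIDs. The bucket count, the modulo hash function `h`, and the ROWID strings are all assumptions made for illustration; real systems use more elaborate hash functions and on-disk buckets.

```python
NUM_BUCKETS = 8  # assumed bucket count

def h(key: int) -> int:
    """Toy hash function: maps an index value to a bucket number."""
    return key % NUM_BUCKETS

class HashIndex:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]

    def insert(self, key, rowid):
        bucket = self.buckets[h(key)]
        for entry_key, rowids in bucket:
            if entry_key == key:          # value already indexed: extend its ROWID list
                rowids.append(rowid)
                return
        bucket.append((key, [rowid]))

    def lookup(self, key):
        """Exact-match probe: hash to one bucket, then compare keys inside it."""
        for entry_key, rowids in self.buckets[h(key)]:
            if entry_key == key:
                return rowids
        return []

idx = HashIndex()
idx.insert(110240, "row-7")
idx.insert(110240, "row-9")
idx.insert(110248, "row-1")   # collides with 110240 in this toy hash
print(idx.lookup(110240))
```

Entries within a bucket are in no useful order, which is why the probe needs the exact value and cannot serve a range predicate.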
-
19
Hashing as Primary Index
[Figure: key → h(key) → disk block holding the records.]
Note on terminology:
The word "indexing" is often used
synonymously with "B-tree indexing".
-
20
Hashing as Secondary Index
[Figure: key → h(key) → Index → record; the hash leads to an index entry, which in turn points to the record.]
Indexing the Index
Can always be transformed to a secondary index using
indirection, as above.
-
21
B-tree vs. Hash Indexes
 Indexing (using B-trees) is good for range searches, e.g.:
SELECT * FROM R WHERE A > 5
 Hashing is good for match-based searches, e.g.:
SELECT * FROM R WHERE A = 5
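The difference can be illustrated with two small in-memory stand-ins: a sorted list of (value, rid) pairs playing the role of the B-tree, and a dictionary playing the role of the hash index. The column values and function names are made up for the example.

```python
from bisect import bisect_right

values = [3, 5, 5, 7, 9, 2]   # column A of a hypothetical relation R; rid = list position

# Ordered index (stand-in for a B-tree): (value, rid) pairs kept sorted by value.
ordered = sorted((a, rid) for rid, a in enumerate(values))
ordered_keys = [a for a, _ in ordered]

# Hash index: value -> list of rids, with no ordering among the keys.
hashed = {}
for rid, a in enumerate(values):
    hashed.setdefault(a, []).append(rid)

def range_greater_than(x):
    """WHERE A > x: jump to the first qualifying entry, then read sequentially."""
    start = bisect_right(ordered_keys, x)
    return [rid for _, rid in ordered[start:]]

def equal_to(x):
    """WHERE A = x: a single hash probe."""
    return hashed.get(x, [])

print(range_greater_than(5), equal_to(5))
```

Answering `A > 5` from the hash index alone would mean probing every possible value or scanning all buckets, since neighbouring values are not stored near each other.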
-
22
Primary Key vs. Primary Index
Relation Students
Name    ID    dept
AHMAD   123   CS
Akram   567   EE
Numan   999   CS
 Primary Key & Primary Index:
 PK is ALWAYS unique.
 PI can be unique, but does not have to be.
 In a DSS environment, very few queries are PK based.
-
23
Special Index Structures
-
24
Special Index Structures
 Inverted index
 Bit map index
 Cluster index
 Join indexes
-
25
Sample table
Student Name Age Campus Tech
s1 amir 20 Lahore Elect
s2 javed 20 Islamabad CS
s3 salim 21 Lahore CS
s4 imran 20 Peshawar Elect
s5 majid 20 Karachi Telecom
s6 taslim 25 Karachi CS
s7 tahir 21 Peshawar Telecom
s8 sohaib 26 Peshawar CS
s9 afridi 19 Lahore CS
-
26
Inverted index: Concept
-
27
Inverted Index: Example-1
D1: M. Aslam BS Computer Science Lahore Campus
D2: Sana Aslam of Lahore MS Computer Engineering with GPA 3.4 Karachi Campus
Inverted index for the documents D1 and D2 is as follows:
3.4 → [D2]
Aslam → [D1, D2]
BS → [D1]
Campus → [D1, D2]
Computer → [D1, D2]
Engineering → [D2]
GPA → [D2]
Karachi → [D2]
Lahore → [D1, D2]
M. → [D1]
MS → [D2]
of → [D2]
Sana → [D2]
Science → [D1]
with → [D2]
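Building such an index can be sketched in a few lines. The sketch assumes the surname is spelled Aslam in both documents, splits on whitespace only, and keeps terms case-sensitive; the function name is illustrative.

```python
docs = {
    "D1": "M. Aslam BS Computer Science Lahore Campus",
    "D2": "Sana Aslam of Lahore MS Computer Engineering with GPA 3.4 Karachi Campus",
}

def build_inverted_index(documents):
    """Map each term to the list of documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for term in text.split():            # whitespace tokenization (an assumption)
            postings = index.setdefault(term, [])
            if doc_id not in postings:       # record each document once per term
                postings.append(doc_id)
    return index

inverted = build_inverted_index(docs)
for term in sorted(inverted):
    print(term, "->", inverted[term])
```

A query for a term is then a single dictionary lookup instead of a scan over every document.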
-
28
Inverted Index: Example-2
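[Figure: a B-tree index on age (node keys 20, 23; leaf keys 18, 19, 20, 21, 22, 23, 25, 26) whose leaf entries point to inverted-index RID lists (lists shown: r4, r18, r34, r35 and r5, r19, r37, r40), which in turn point to the data records below.]
Data records:
RID    name    age  Tech
r4     amir    20   Elect
r18    javed   20   CS
r19    salim   21   CS
r34    imran   20   Elect
r35    majid   20   Telecom
r36    taslim  25   CS
r5     tahir   21   Telecom
r41    sohaib  26   CS
...
r500   afridi  19   CS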
-
29
Inverted Index: Query
 Query:
 Get students with age = 20 and tech = “telecom”
 List for age = 20: r4, r18, r34, r35
 List for tech = “telecom”: r5, r35
 Answer is intersection: r35
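The query can be sketched as an intersection of two RID lists. The dictionaries below hold only the lists quoted on the slide, and the helper name `intersect` is illustrative.

```python
# RID lists taken from the slide; the dictionaries stand in for the two inverted indexes.
age_index = {20: ["r4", "r18", "r34", "r35"]}
tech_index = {"telecom": ["r5", "r35"]}

def intersect(list_a, list_b):
    """RIDs present in both lists, in the order of the first list."""
    members = set(list_b)
    return [rid for rid in list_a if rid in members]

answer = intersect(age_index[20], tech_index["telecom"])
print(answer)
```

Only the rows named in the answer need to be fetched from the data file.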
-
30
Bitmap Indexes: Concept
-
31
Bitmap Indexes: Example
 The index consists of bitmaps, with a column for
each unique value:
Index on Campus, i.e. city (larger table):
SID Islamabad Lahore Karachi Peshawar
1 0 1 0 0
2 1 0 0 0
3 0 1 0 0
4 0 0 0 1
5 0 0 1 0
6 0 0 1 0
7 0 0 0 1
8 0 0 0 1
9 0 1 0 0
Index on Tech (smaller table):
SID CS Elect Telecom
1 0 1 0
2 1 0 0
3 1 0 0
4 0 1 0
5 0 0 1
6 1 0 0
7 0 0 1
8 1 0 0
9 1 0 0
-
32
Bitmap Index: Query
 Query:
 Get students with age = 20 and campus = “Lahore”
 List for age = 20: 110110000
 List for campus = “Lahore”: 101000001
 Answer is AND: 100000000
 Good if domain cardinality is small
 Bit vectors can be compressed
 Run length encoding
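The AND step can be sketched directly from the nine-row sample table, so each bit vector here has nine bits, one per student, with s1 leftmost. Bit vectors are kept as strings for readability; the function names are illustrative, and a real implementation would use machine words.

```python
students = [  # (SID, age, campus), copied from the sample table
    (1, 20, "Lahore"), (2, 20, "Islamabad"), (3, 21, "Lahore"),
    (4, 20, "Peshawar"), (5, 20, "Karachi"), (6, 25, "Karachi"),
    (7, 21, "Peshawar"), (8, 26, "Peshawar"), (9, 19, "Lahore"),
]

def bitmap(rows, column, value):
    """One bit per row, first row leftmost: '1' where the column equals value."""
    return "".join("1" if row[column] == value else "0" for row in rows)

def bitmap_and(a, b):
    """Bitwise AND of two equal-length bit strings."""
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))

age_20 = bitmap(students, 1, 20)
lahore = bitmap(students, 2, "Lahore")
answer = bitmap_and(age_20, lahore)
matching_sids = [row[0] for row, bit in zip(students, answer) if bit == "1"]
print(age_20, lahore, answer, matching_sids)
```

Only student 1 is both aged 20 and on the Lahore campus, so a single bit survives the AND.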
-
33
Bitmap Index: Compression
Basic Concept
Case-1
INPUT: 1111000011110000001111100000011111
OUTPUT: 14#04#14#06#15#06#15
Case-2
INPUT: 1010101010101010101010101010101010
OUTPUT: 11#01#11#01#11#01#11#01#…
Case-3
INPUT: 11111111111111110000000000000000
OUTPUT: 117#017
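Run-length encoding can be sketched as grouping consecutive equal bits. In the slide's text form each run is written as the bit value followed by its length, with runs separated by "#". The function names are illustrative, and the Case-1 input is rebuilt from its run lengths rather than retyped.

```python
from itertools import groupby

def rle_encode(bits: str):
    """Return a list of (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

def rle_format(bits: str) -> str:
    """Slide-style text: bit value followed by run length, runs joined by '#'."""
    return "#".join(f"{bit}{n}" for bit, n in rle_encode(bits))

def rle_decode(pairs) -> str:
    """Inverse of rle_encode."""
    return "".join(bit * n for bit, n in pairs)

# Case-1 input: runs of 4, 4, 4, 6, 5, 6 and 5 bits.
case_1 = "1" * 4 + "0" * 4 + "1" * 4 + "0" * 6 + "1" * 5 + "0" * 6 + "1" * 5
print(rle_format(case_1))
print(rle_format("10" * 4))   # alternating bits: every run has length 1
```

Long runs compress well, while an alternating pattern such as Case-2 produces output longer than the input, so run-length encoding pays off only on sparse or clustered bitmaps.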
-
34
Cluster Index: Concept
-
35
Cluster Index: Example
Cluster indexing on AGE:
Student Name Age Campus Tech
s9 afridi 19 Lahore CS
s1 amir 20 Lahore Elect
s2 javed 20 Islamabad CS
s4 imran 20 Peshawar Elect
s5 majid 20 Karachi Telecom
s3 salim 21 Lahore CS
s7 tahir 21 Peshawar Telecom
s6 taslim 25 Karachi CS
s8 sohaib 26 Peshawar CS
Cluster indexing on TECH:
Student Name Age Campus Tech
s9 afridi 19 Lahore CS
s2 javed 20 Islamabad CS
s3 salim 21 Lahore CS
s6 taslim 25 Karachi CS
s8 sohaib 26 Peshawar CS
s1 amir 20 Lahore Elect
s4 imran 20 Peshawar Elect
s5 majid 20 Karachi Telecom
s7 tahir 21 Peshawar Telecom
One indexing column at a time
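Clustering amounts to physically ordering the rows on one column. A minimal sketch using Python's stable sort reproduces both orderings shown; it assumes the TECH ordering was obtained by re-sorting the AGE-clustered table, which is what the row order on the slide suggests. The function name `cluster_on` is illustrative.

```python
students = [  # rows from the sample table: (student, name, age, campus, tech)
    ("s1", "amir", 20, "Lahore", "Elect"), ("s2", "javed", 20, "Islamabad", "CS"),
    ("s3", "salim", 21, "Lahore", "CS"), ("s4", "imran", 20, "Peshawar", "Elect"),
    ("s5", "majid", 20, "Karachi", "Telecom"), ("s6", "taslim", 25, "Karachi", "CS"),
    ("s7", "tahir", 21, "Peshawar", "Telecom"), ("s8", "sohaib", 26, "Peshawar", "CS"),
    ("s9", "afridi", 19, "Lahore", "CS"),
]

def cluster_on(rows, column):
    """Return the rows in the physical order a cluster index on `column` implies."""
    return sorted(rows, key=lambda row: row[column])   # sorted() is stable

by_age = cluster_on(students, 2)
by_tech = cluster_on(by_age, 4)    # re-clustering replaces the previous physical order
print([row[0] for row in by_age])
print([row[0] for row in by_tech])
```

Since the rows can be stored in only one physical order, a table can be clustered on one column at a time.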
-
36
Join Index: Example
PROGRAM (jIndex is the join index):
id   name  NoS  jIndex
p1   BS    10   r1,r3,r5,r6
p2   MS    5    r2,r4
CAMPUS:
rId  progid  CID  date  NoS
r1   p1      c1   1     12
r2   p2      c1   1     11
r3   p1      c3   1     50
r4   p2      c2   1     8
r5   p1      c1   2     44
r6   p1      c2   2     4
The rows of the table consist entirely of such references, which are the RIDs of the
relevant rows.
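The jIndex column can be derived from the second table by grouping its row IDs on progid. A minimal sketch, with illustrative names (`detail_rows`, `build_join_index`):

```python
# Rows of the second table: (rId, progid, CID, date, NoS).
detail_rows = [
    ("r1", "p1", "c1", 1, 12), ("r2", "p2", "c1", 1, 11), ("r3", "p1", "c3", 1, 50),
    ("r4", "p2", "c2", 1, 8), ("r5", "p1", "c1", 2, 44), ("r6", "p1", "c2", 2, 4),
]

def build_join_index(rows):
    """Map each program id to the RIDs of the rows that reference it."""
    join_index = {}
    for rid, progid, *_ in rows:
        join_index.setdefault(progid, []).append(rid)
    return join_index

join_index = build_join_index(detail_rows)
print(join_index)
```

With the RID lists precomputed, joining a program to its rows becomes a lookup followed by direct row fetches instead of a scan of the second table.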

Intro to Data warehousing lecture 14
