01 Course Intro & Relational Model
Intro to Database Systems (15-445/15-645)
Andy Pavlo
Fall 2019, Computer Science
Carnegie Mellon University
CMU 15-445/645 (Fall 2019)
Wait List
Overview
Course Logistics
Relational Model
Relational Algebra
WAIT LIST
There are currently 150 people on the waiting list.
Max capacity is 100.
We will enroll people based on your S3 position.
COURSE OVERVIEW
This course is on the design and implementation
of disk-oriented database management systems.
This is not a course on how to use a database to
build applications or how to administer a database.
→ See CMU 95-703 (Heinz College)
Database Applications (15-415/615) is not offered
this semester.
COURSE OUTLINE
Relational Databases
Storage
Execution
Concurrency Control
Recovery
Distributed Databases
Potpourri
COURSE LOGISTICS
Course Policies + Schedule:
→ Refer to course web page.
Academic Honesty:
→ Refer to CMU policy page.
→ If you’re not sure, ask the professors.
→ Don’t be stupid.
All discussion + announcements will be on Piazza.
TEXTBOOK
Database System Concepts
7th Edition
Silberschatz, Korth, & Sudarshan
We will also provide lecture notes
that cover topics not found in the textbook.
COURSE RUBRIC
Homeworks (15%)
Projects (45%)
Midterm Exam (20%)
Final Exam (20%)
Extra Credit (+10%)
HOMEWORKS
Five homework assignments throughout the
semester.
First homework is a SQL assignment. The rest will
be pencil-and-paper assignments.
All homework should be done individually.
PROJECTS
You will build your own storage manager from
scratch over the course of the semester.
Each project builds on the previous one.
We will not teach you how to write/debug C++17.
BUSTUB
All projects will use the new BusTub
academic DBMS.
→ Source code will be released on Github.
Architecture:
→ Disk-Oriented Storage
→ Volcano-style Query Processing
→ Pluggable APIs
→ Currently does not support SQL.
LATE POLICY
You are allowed four slip days for either
homework or projects.
You lose 25% of an assignment’s points for every
24hrs it is late.
Mark on your submission (1) how many days you
are late and (2) how many late days you have left.
PLAGIARISM WARNING
The homework and projects must be your own
work. They are not group assignments.
You may not copy source code from other people
or the web.
Plagiarism will not be tolerated.
See CMU's Policy on Academic Integrity for
additional information.
DATABASE RESEARCH
Database Group Meetings
→ Mondays @ 4:30pm (GHC 8102)
→ https://db.cs.cmu.edu
Advanced DBMS Developer Meetings
→ Tuesdays @ 12:00pm (GHC 8115)
→ https://github.com/cmu-db/terrier
Databases
DATABASE
Organized collection of inter-related data that
models some aspect of the real-world.
Databases are the core component of most
computer applications.
DATABASE EXAMPLE
Create a database that models a digital music store
to keep track of artists and albums.
Things we need to store:
→ Information about Artists
→ What Albums those Artists released
FLAT FILE STRAWMAN
Store our database as comma-separated value
(CSV) files that we manage in our own code.
→ Use a separate file per entity.
→ The application has to parse the files each time it wants
to read/update records.
FLAT FILE STRAWMAN
Create a database that models a digital music store.
Artist(name, year, country)
"Wu Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"Ice Cube",1989,"USA"

Album(name, artist, year)
"Enter the Wu Tang","Wu Tang Clan",1993
"St.Ides Mix Tape","Wu Tang Clan",1994
"AmeriKKKa's Most Wanted","Ice Cube",1990
FLAT FILE STRAWMAN
Example: Get the year that Ice Cube went solo.
Artist(name, year, country)
"Wu Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"Ice Cube",1989,"USA"

for line in file:
    record = parse(line)
    if "Ice Cube" == record[0]:
        print(int(record[1]))
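The flat-file lookup can be made runnable as a minimal sketch. It assumes the artist file holds exactly the three records shown on the slide and uses Python's standard `csv` module in place of the slide's unspecified `parse` function. The names `ARTIST_CSV` and `find_year` are hypothetical.

```python
import csv
import io

# Assumed file contents, copied from the Artist(name, year, country) example.
ARTIST_CSV = '''"Wu Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"Ice Cube",1989,"USA"
'''


def find_year(csv_text, artist_name):
    """Scan records one by one; return the artist's year, or None if absent."""
    for record in csv.reader(io.StringIO(csv_text)):
        if record and record[0] == artist_name:
            return int(record[1])
    return None


print(find_year(ARTIST_CSV, "Ice Cube"))
```

Every lookup re-parses the file and walks it from the top, which is the cost the strawman is meant to expose.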
FLAT FILES: DATA INTEGRITY
How do we ensure that the artist is the same for
each album entry?
What if somebody overwrites the album year with
an invalid string?
How do we store that there are multiple artists on
an album?
FLAT FILES: IMPLEMENTATION
How do you find a particular record?
What if we now want to create a new application
that uses the same database?
What if two threads try to write to the same file at
the same time?
FLAT FILES: DURABILITY
What if the machine crashes while our program is
updating a record?
What if we want to replicate the database on
multiple machines for high availability?
DATABASE MANAGEMENT SYSTEM
A DBMS is software that allows applications to
store and analyze information in a database.
A general-purpose DBMS is designed to allow the
definition, creation, querying, update, and
administration of databases.
EARLY DBMSs
Database applications were difficult to
build and maintain.
Tight coupling between logical and
physical layers.
You had to (roughly) know what
queries your app would execute
before you deployed the database.
Edgar F. Codd
RELATIONAL MODEL
Proposed in 1970 by Ted Codd.
Database abstraction to avoid this
maintenance:
→ Store database in simple data structures.
→ Access data through high-level language.
→ Physical storage left up to
implementation.
Edgar F. Codd
DATA MODELS
A data model is a collection of concepts for
describing the data in a database.
A schema is a description of a particular collection
of data, using a given data model.
DATA MODEL
Relational ← Most DBMSs, This Course
Key/Value ← NoSQL
Graph ← NoSQL
Document ← NoSQL
Column-family ← NoSQL
Array / Matrix ← Machine Learning
Hierarchical ← Obsolete / Rare
Network ← Obsolete / Rare
RELATIONAL MODEL
Structure: The definition of relations and their
contents.
Integrity: Ensure the database’s contents satisfy
constraints.
Manipulation: How to access and modify a
database’s contents.
RELATIONAL MODEL
A relation is an unordered set that
contains the relationship of attributes
that represent entities.
A tuple is a set of attribute values (also
known as its domain) in the relation.
→ Values are (normally) atomic/scalar.
→ The special value NULL is a member of
every domain.
n-ary Relation = Table with n columns
Artist(name, year, country)
name           year  country
Wu Tang Clan   1992  USA
Notorious BIG  1992  USA
Ice Cube       1989  USA
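One way to picture these definitions is a minimal Python sketch that models a relation as a set of tuples. This is an illustration only, not how a DBMS stores data. The `Artist` type is hypothetical, and `None` stands in for NULL.

```python
from typing import NamedTuple, Optional


class Artist(NamedTuple):
    """One tuple of the Artist(name, year, country) relation."""
    name: str
    year: Optional[int]      # None plays the role of NULL
    country: Optional[str]   # None plays the role of NULL


# A relation is an unordered set of tuples.
artist = {
    Artist("Wu Tang Clan", 1992, "USA"),
    Artist("Notorious BIG", 1992, "USA"),
    Artist("Ice Cube", 1989, "USA"),
}

# Adding a tuple that is already present leaves the set unchanged.
artist.add(Artist("Ice Cube", 1989, "USA"))

arity = len(Artist._fields)  # n in "n-ary relation"
print(len(artist), arity)
```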
RELATIONAL MODEL: PRIMARY KEYS
A relation’s primary key uniquely
identifies a single tuple.
Some DBMSs automatically create an
internal primary key if you don't
define one.
Auto-generation of unique integer
primary keys:
→ SEQUENCE (SQL:2003)
→ AUTO_INCREMENT (MySQL)
Artist(name, year, country)
name           year  country
Wu Tang Clan   1992  USA
Notorious BIG  1992  USA
Ice Cube       1989  USA

Artist(id, name, year, country)
id   name           year  country
123  Wu Tang Clan   1992  USA
456  Notorious BIG  1992  USA
789  Ice Cube       1989  USA
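Auto-generated integer keys can be tried out with Python's built-in `sqlite3` module. This is a sketch under one loud assumption: SQLite is used only because it ships with Python, and it relies on neither SEQUENCE nor AUTO_INCREMENT. In SQLite, an `INTEGER PRIMARY KEY` column is assigned a unique integer when the insert omits it. The lowercase table name `artist` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE artist (
        id      INTEGER PRIMARY KEY,  -- SQLite fills this in when omitted
        name    TEXT NOT NULL,
        year    INTEGER,
        country TEXT
    )
    """
)

rows = [
    ("Wu Tang Clan", 1992, "USA"),
    ("Notorious BIG", 1992, "USA"),
    ("Ice Cube", 1989, "USA"),
]
conn.executemany(
    "INSERT INTO artist (name, year, country) VALUES (?, ?, ?)", rows
)

ids = [r[0] for r in conn.execute("SELECT id FROM artist ORDER BY id")]

# Reusing an existing key value violates the primary key constraint.
try:
    conn.execute("INSERT INTO artist (id, name) VALUES (?, ?)", (ids[0], "Dup"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(ids, duplicate_rejected)
```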
RELATIONAL MODEL: FOREIGN KEYS
A foreign key specifies that an attribute from one
relation has to map to a tuple in another relation.
Artist(id, name, year, country)
id   name           year  country
123  Wu Tang Clan   1992  USA
456  Notorious BIG  1992  USA
789  Ice Cube       1989  USA

Album(id, name, artists, year)
id  name                     artists  year
11  Enter the Wu Tang        123      1993
22  St.Ides Mix Tape         ???      1994
33  AmeriKKKa's Most Wanted  789      1990

ArtistAlbum(artist_id, album_id)
artist_id  album_id
123        11
123        22
789        22
456        22

Album(id, name, year)
id  name                     year
11  Enter the Wu Tang        1993
22  St.Ides Mix Tape         1994
33  AmeriKKKa's Most Wanted  1990
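The ArtistAlbum cross-reference table can be sketched with Python's built-in `sqlite3` module. The assumptions are that SQLite stands in for a generic DBMS, that the lowercase table names are hypothetical, and that the rows are copied from the tables above. SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript(
    """
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, country TEXT);
    CREATE TABLE album  (id INTEGER PRIMARY KEY, name TEXT, year INTEGER);
    CREATE TABLE artist_album (
        artist_id INTEGER REFERENCES artist(id),
        album_id  INTEGER REFERENCES album(id),
        PRIMARY KEY (artist_id, album_id)
    );
    """
)
conn.executemany("INSERT INTO artist VALUES (?, ?, ?, ?)", [
    (123, "Wu Tang Clan", 1992, "USA"),
    (456, "Notorious BIG", 1992, "USA"),
    (789, "Ice Cube", 1989, "USA"),
])
conn.executemany("INSERT INTO album VALUES (?, ?, ?)", [
    (11, "Enter the Wu Tang", 1993),
    (22, "St.Ides Mix Tape", 1994),
    (33, "AmeriKKKa's Most Wanted", 1990),
])
conn.executemany("INSERT INTO artist_album VALUES (?, ?)", [
    (123, 11), (123, 22), (789, 22), (456, 22),
])

# Album 22 now maps to several artists without a multi-valued column.
artists_on_22 = [r[0] for r in conn.execute(
    "SELECT a.name FROM artist a JOIN artist_album x ON a.id = x.artist_id "
    "WHERE x.album_id = 22 ORDER BY a.name"
)]

# A reference to a non-existent artist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO artist_album VALUES (999, 22)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

print(artists_on_22, fk_rejected)
```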
DATA MANIPULATION LANGUAGES (DML)
How to store and retrieve information from a
database.
Procedural: ← Relational Algebra
→ The query specifies the (high-level) strategy
the DBMS should use to find the desired result.
Non-Procedural: ← Relational Calculus
→ The query specifies only what data is wanted
and not how to find it.
RELATIONAL ALGEBRA
Fundamental operations to retrieve
and manipulate tuples in a relation.
→ Based on set algebra.
Each operator takes one or more
relations as its inputs and outputs a
new relation.
→ We can “chain” operators together to create
more complex operations.
σ Select
Π Projection
∪ Union
∩ Intersection
– Difference
× Product
⋈ Join
RELATIONAL ALGEBRA: SELECT
Choose a subset of the tuples from a
relation that satisfies a selection
predicate.
→ Predicate acts as a filter to retain only
tuples that fulfill its qualifying
requirement.
→ Can combine multiple predicates using
conjunctions / disjunctions.
Syntax: σ_{predicate}(R)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

σ_{a_id='a2'}(R)
a_id  b_id
a2    102
a2    103

σ_{a_id='a2' ∧ b_id>102}(R)
a_id  b_id
a2    103

SELECT * FROM R
 WHERE a_id='a2' AND b_id>102;
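The select operator can be sketched in a few lines of Python, assuming a relation is held as a list of `(a_id, b_id)` tuples and a predicate is an ordinary function. The helper name `select` is hypothetical.

```python
# R(a_id, b_id), copied from the example relation.
R = [("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)]


def select(relation, predicate):
    """Return the tuples of `relation` for which `predicate` holds."""
    return [t for t in relation if predicate(t)]


# Single predicate: a_id = 'a2'
only_a2 = select(R, lambda t: t[0] == "a2")

# Conjunction of two predicates: a_id = 'a2' AND b_id > 102
a2_over_102 = select(R, lambda t: t[0] == "a2" and t[1] > 102)

print(only_a2)
print(a2_over_102)
```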
RELATIONAL ALGEBRA: PROJECTION
Generate a relation with tuples that
contain only the specified attributes.
→ Can rearrange attributes’ ordering.
→ Can manipulate the values.
Syntax: Π_{A1,A2,…,An}(R)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

Π_{b_id-100,a_id}(σ_{a_id='a2'}(R))
b_id-100  a_id
2         a2
3         a2

SELECT b_id-100, a_id
  FROM R WHERE a_id = 'a2';
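Projection can be sketched the same way, assuming each output attribute is given as a function over the input tuple. The helper names `select` and `project` are hypothetical. Chaining the two mirrors the expression in the example.

```python
# R(a_id, b_id), copied from the example relation.
R = [("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)]


def select(relation, predicate):
    """Keep only the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


def project(relation, *expressions):
    """Build one output tuple per input tuple from the given expressions."""
    return [tuple(expr(t) for expr in expressions) for t in relation]


# Projection of (b_id - 100, a_id) over the selection a_id = 'a2':
# attributes are reordered and one value is computed.
result = project(
    select(R, lambda t: t[0] == "a2"),
    lambda t: t[1] - 100,
    lambda t: t[0],
)
print(result)
```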
RELATIONAL ALGEBRA: UNION
Generate a relation that contains all
tuples that appear in either only one
or both input relations.
Syntax: (R ∪ S)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R ∪ S)
a_id  b_id
a1    101
a2    102
a3    103
a3    103
a4    104
a5    105

(SELECT * FROM R)
  UNION ALL
(SELECT * FROM S);
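The result shown keeps (a3, 103) twice, which matches `UNION ALL` (bag semantics). SQL's plain `UNION` would keep it once. A minimal Python sketch of both behaviours, assuming relations are lists of tuples:

```python
R = [("a1", 101), ("a2", 102), ("a3", 103)]
S = [("a3", 103), ("a4", 104), ("a5", 105)]

# Bag union (UNION ALL): concatenate and keep duplicates.
union_all = R + S

# Set union (UNION): duplicates collapse; sorted only for stable display.
union_distinct = sorted(set(R) | set(S))

print(len(union_all), len(union_distinct))
```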
RELATIONAL ALGEBRA: INTERSECTION
Generate a relation that contains only
the tuples that appear in both of the
input relations.
Syntax: (R ∩ S)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R ∩ S)
a_id  b_id
a3    103

(SELECT * FROM R)
  INTERSECT
(SELECT * FROM S);
RELATIONAL ALGEBRA: DIFFERENCE
Generate a relation that contains only
the tuples that appear in the first and
not the second of the input relations.
Syntax: (R – S)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R – S)
a_id  b_id
a1    101
a2    102

(SELECT * FROM R)
  EXCEPT
(SELECT * FROM S);
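Intersection and difference map directly onto Python's set operators. This is a sketch assuming relations are sets of `(a_id, b_id)` tuples. Results are sorted only to make the output stable.

```python
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}

# Tuples present in both inputs (INTERSECT).
intersection = sorted(R & S)

# Tuples present in R but not in S (EXCEPT). Order of operands matters.
difference = sorted(R - S)
reverse_difference = sorted(S - R)

print(intersection)
print(difference)
print(reverse_difference)
```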
RELATIONAL ALGEBRA: PRODUCT
Generate a relation that contains all
possible combinations of tuples from
the input relations.
Syntax: (R × S)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R × S)
R.a_id  R.b_id  S.a_id  S.b_id
a1      101     a3      103
a1      101     a4      104
a1      101     a5      105
a2      102     a3      103
a2      102     a4      104
a2      102     a5      105
a3      103     a3      103
a3      103     a4      104
a3      103     a5      105

SELECT * FROM R CROSS JOIN S;
SELECT * FROM R, S;
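The product can be sketched with `itertools.product`, assuming relations are lists of tuples. Each output tuple is the concatenation of one tuple from R and one from S, so the result has |R| × |S| rows.

```python
from itertools import product

R = [("a1", 101), ("a2", 102), ("a3", 103)]
S = [("a3", 103), ("a4", 104), ("a5", 105)]

# Every pairing of an R tuple with an S tuple, flattened into
# (R.a_id, R.b_id, S.a_id, S.b_id).
cross = [r + s for r, s in product(R, S)]

for row in cross:
    print(row)
```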
RELATIONAL ALGEBRA: JOIN
Generate a relation that contains all
tuples that are a combination of two
tuples (one from each input relation)
with a common value(s) for one or
more attributes.
Syntax: (R ⋈ S)
R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a3    103

S(a_id,b_id)
a_id  b_id
a3    103
a4    104
a5    105

(R ⋈ S)
a_id  b_id
a3    103

SELECT * FROM R NATURAL JOIN S;
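A natural join can be sketched as a nested loop, assuming each tuple is a dict keyed by attribute name so that shared attributes can be found by name. The function name `natural_join` is hypothetical, and a real DBMS would typically pick a more efficient join algorithm.

```python
R = [
    {"a_id": "a1", "b_id": 101},
    {"a_id": "a2", "b_id": 102},
    {"a_id": "a3", "b_id": 103},
]
S = [
    {"a_id": "a3", "b_id": 103},
    {"a_id": "a4", "b_id": 104},
    {"a_id": "a5", "b_id": 105},
]


def natural_join(left, right):
    """Pair tuples that agree on every attribute name the inputs share."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[k] == r[k] for k in shared):
                # Shared attributes appear once in the output tuple.
                out.append({**l, **r})
    return out


joined = natural_join(R, S)
print(joined)
```

Because R and S share both `a_id` and `b_id`, only tuples that match on both attributes survive.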
RELATIONAL ALGEBRA: EXTRA OPERATORS
Rename (ρ)
Assignment (R←S)
Duplicate Elimination (δ)
Aggregation (γ)
Sorting (τ)
Division (R÷S)
OBSERVATION
Relational algebra still defines the high-level steps
of how to compute a query.
→ σ_{b_id=102}(R⋈S) vs. (R⋈(σ_{b_id=102}(S)))
A better approach is to state the high-level answer
that you want the DBMS to compute.
→ Retrieve the joined tuples from R and S where b_id
equals 102.
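The two orderings return the same tuples but do different amounts of work. The sketch below uses hypothetical relations (not the ones from the earlier examples) that share the schema `(a_id, b_id)`, and counts the tuple pairs a naive nested-loop join would compare under each plan.

```python
# Hypothetical data, chosen so that the filter b_id = 102 matches something.
R = [("a1", 101), ("a2", 102), ("a3", 103)]
S = [("a2", 102), ("a3", 103), ("a4", 104)]


def natural_join(left, right):
    """Both inputs share (a_id, b_id), so tuples join when they are equal."""
    return [l for l in left for r in right if l == r]


# Plan A: join first, then filter on b_id = 102.
plan_a = [t for t in natural_join(R, S) if t[1] == 102]
pairs_a = len(R) * len(S)          # pairs compared by the nested loop

# Plan B: filter S first, then join the smaller input.
S_filtered = [t for t in S if t[1] == 102]
plan_b = natural_join(R, S_filtered)
pairs_b = len(R) * len(S_filtered)

print(plan_a, pairs_a)
print(plan_b, pairs_b)
```

With a declarative query, choosing between such plans is left to the DBMS rather than to whoever writes the query.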
RELATIONAL MODEL: QUERIES
The relational model is independent of any query
language implementation.
SQL is the de facto standard.
for line in file:
    record = parse(line)
    if "Ice Cube" == record[0]:
        print(int(record[1]))

SELECT year FROM artists
 WHERE name = 'Ice Cube';
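The SQL version can be run end to end with Python's built-in `sqlite3` module. This is a sketch assuming an in-memory SQLite database and an `artists` table loaded with the three example rows. The column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT, year INTEGER, country TEXT)")
conn.executemany("INSERT INTO artists VALUES (?, ?, ?)", [
    ("Wu Tang Clan", 1992, "USA"),
    ("Notorious BIG", 1992, "USA"),
    ("Ice Cube", 1989, "USA"),
])

# The query states what is wanted; how to find it is up to the DBMS.
row = conn.execute(
    "SELECT year FROM artists WHERE name = ?", ("Ice Cube",)
).fetchone()
year = row[0]
print(year)
```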
CONCLUSION
Databases are ubiquitous.
Relational algebra defines the primitives for
processing queries on a relational database.
We will see relational algebra again when we talk
about query optimization + execution.