01-relationalmodel

The document outlines the course logistics for Database Systems (15-445/645) taught by Prof. Andy Pavlo in Fall 2024, including policies, schedules, and communication channels. It introduces key concepts of databases, focusing on the relational model and its components such as data integrity, primary keys, and foreign keys. The course aims to provide a comprehensive understanding of database management systems and various data models.

Database Systems
Relational Model & Algebra
15-445/645 FALL 2024 PROF. ANDY PAVLO

#1 ⮕ KB + BD
#2 ⮕ DBs

COURSE LOGISTICS
Course Policies + Schedule: Course Web Page
Discussion + Announcements: Piazza
Homeworks + Projects: Gradescope
Final Grades: Canvas
Waitlist: Six open seats (as of 12pm today)

Non-CMU students can complete all assignments
using Gradescope (Code: WWWJZ5).
→ Do not post your solutions on GitHub.
→ Do not email instructors / TAs for help.
→ Discord Channel: https://discord.gg/YF7dMCg
15-445/645 (Fall 2024)


TODAY’S AGENDA
Database Systems Background
Relational Model
Relational Algebra
Alternative Data Models
Q&A Session

Databases

DATABASE
Organized collection of inter-related data that
models some aspect of the real-world.

Databases are the core component of most
computer applications.


DATABASE EXAMPLE
Create a database that models a digital music store
to keep track of artists and albums.

Information we need to keep track of in our store:
→ Information about Artists
→ The Albums those Artists released


FLAT FILE STRAWMAN

Store our database as comma-separated value (CSV)
files that we manage ourselves in application code.
→ Use a separate file per entity.
→ The application must parse the files each time it wants to
read/update records.

Artist(name, year, country)
"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"

Album(name, artist, year)
"Enter the Wu-Tang","Wu-Tang Clan",1993
"St.Ides Mix Tape","Wu-Tang Clan",1994
"Liquid Swords","GZA",1995

FLAT FILE STRAWMAN

Example: Get the year that GZA went solo.

Artist(name, year, country)
"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"

for line in file.readlines():
    record = parse(line)
    if record[0] == "GZA":
        print(int(record[1]))
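The loop in the flat-file example leaves `file` and `parse` undefined. A minimal runnable sketch of the same linear scan is shown here, assuming the standard `csv` module does the parsing and an in-memory string stands in for the Artist file. The names `ARTIST_CSV` and `find_artist_year` are hypothetical, introduced only for illustration.

```python
import csv
import io

# Hypothetical in-memory stand-in for the Artist CSV file shown on the slide.
ARTIST_CSV = '''"Wu-Tang Clan",1992,"USA"
"Notorious BIG",1992,"USA"
"GZA",1990,"USA"
'''


def find_artist_year(csv_file, artist_name):
    """Scan every record in order; return the artist's year, or None if absent."""
    for record in csv.reader(csv_file):
        if record and record[0] == artist_name:
            return int(record[1])
    return None


gza_year = find_artist_year(io.StringIO(ARTIST_CSV), "GZA")
print(gza_year)
```

Every lookup reads the file from the top, and the `int(...)` call fails as soon as someone stores a non-numeric year, which is the kind of problem the following slides raise.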

FLAT FILES: DATA INTEGRITY


How do we ensure that the artist is the same for
each album entry?

What if somebody overwrites the album year with
an invalid string?

What if there are multiple artists on an album?

What happens if we delete an artist that has
albums?


FLAT FILES: IMPLEMENTATION


How do you find a particular record?

What if we now want to create a new application
that uses the same database? What if that
application is running on a different machine?

What if two threads try to write to the same file at
the same time?


FLAT FILES: DURABILITY


What if the machine crashes while our program is
updating a record?

What if we want to replicate the database on
multiple machines for high availability?


DATABASE MANAGEMENT SYSTEM


A database management system (DBMS) is
software that allows applications to store and
analyze information in a database.

A general-purpose DBMS supports the definition,
creation, querying, update, and administration of
databases in accordance with some data model.


DATA MODELS
A data model is a collection of concepts for
describing the data in a database.

A schema is a description of a particular collection
of data, using a given data model.
→ This defines the structure of data for a data model.
→ Otherwise, you have random bits with no meaning.


DATA MODELS
Relational ← Most DBMSs / This Course
Key/Value ← Simple Apps / Caching
Graph, Document / JSON / XML / Object, Wide-Column / Column-family ← NoSQL
Array (Vector, Matrix, Tensor) ← ML / Science
Hierarchical, Network, Semantic, Entity-Relationship ← Obsolete / Legacy / Rare

EARLY DBMSs
Early database applications were difficult to build
and maintain on available DBMSs in the 1960s.
→ Examples: IDS, IMS, CODASYL
→ Computers were expensive, humans were cheap.

Tight coupling between logical and physical layers.

Programmers had to (roughly) know what queries
the application would execute before they could
deploy the database.


EARLY DBMSs
Ted Codd was a mathematician at
IBM Research in the late 1960s.

Codd saw IBM’s developers rewriting
database programs every time the
database’s schema or layout changed.

Devised the relational model in 1969.

Edgar F. Codd

CODASYL
The Differences and Similarities
Between the Data Base Set and
Relational Views of Data.
→ ACM SIGFIDET Workshop on Data
Description, Access, and Control in Ann
Arbor, Michigan, held 1–3 May 1974

Codd, Bachman, Gray, Stonebraker



RELATIONAL MODEL
The relational model defines a database
abstraction based on relations to avoid maintenance
overhead.

Key tenets:
→ Store database in simple data structures (relations).
→ Physical storage left up to the DBMS implementation.
→ Access data through high-level language, DBMS figures
out best execution strategy.


RELATIONAL MODEL
Structure: The definition of the database’s relations
and their contents independent of their physical
representation.

Integrity: Ensure the database’s contents satisfy
constraints.

Manipulation: Programming interface for
accessing and modifying a database's contents.


DATA INDEPENDENCE
Isolate the user/application from low-level
data representation.
→ The user only worries about high-level
application logic.
→ DBMS optimizes the layout according
to operating environment, database
contents, and workload.
→ DBMS can then re-optimize the
database if/when these factors change.

[Diagram, top to bottom]
Application
External Schema: Views (SQL)
    ↕ Logical Data Independence
Logical Schema: Schema, Constraints… (SQL)
    ↕ Physical Data Independence
Physical Schema: Pages, Files, Extents…
Database Storage

RELATIONAL MODEL
A relation is an unordered set that
contains the relationship of attributes
that represent entities.

A tuple is a set of attribute values
(aka its domain) in the relation.
→ Values are (normally) atomic/scalar.
→ The special value NULL is a member of
every domain (if allowed).

n-ary Relation = Table with n columns

Artist(name, year, country)
name           year  country
Wu-Tang Clan   1992  USA
Notorious BIG  1992  USA
GZA            1990  USA


RELATIONAL MODEL: PRIMARY KEYS

A relation's primary key uniquely
identifies a single tuple.

Some DBMSs automatically create an
internal primary key if a table does
not define one.

DBMS can auto-generate unique
primary keys via an identity column:
→ IDENTITY (SQL Standard)
→ SEQUENCE (PostgreSQL / Oracle)
→ AUTO_INCREMENT (MySQL)

Artist(name, year, country)
name           year  country
Wu-Tang Clan   1992  USA
Notorious BIG  1992  USA
GZA            1990  USA

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA
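As a rough illustration of an auto-generated key, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is not one of the systems named above; it is an assumption chosen because it ships with Python, and its `INTEGER PRIMARY KEY AUTOINCREMENT` column plays a similar role to the identity mechanisms listed. Its ids start at 1 rather than at the 101 shown in the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Artist (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- SQLite's identity-style column
        name    TEXT NOT NULL,
        year    INTEGER,
        country TEXT
    )
""")

# The application never supplies an id; the DBMS assigns one per inserted row.
rows = [
    ("Wu-Tang Clan", 1992, "USA"),
    ("Notorious BIG", 1992, "USA"),
    ("GZA", 1990, "USA"),
]
conn.executemany("INSERT INTO Artist (name, year, country) VALUES (?, ?, ?)", rows)

artist_ids = [row[0] for row in conn.execute("SELECT id FROM Artist ORDER BY id")]
print(artist_ids)
```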


RELATIONAL MODEL: FOREIGN KEYS


A foreign key specifies that an
attribute from one relation maps to a
tuple in another relation.


RELATIONAL MODEL: FOREIGN KEYS

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA

Album(id, name, artists, year)
id  name               artists  year
11  Enter the Wu-Tang  101      1993
22  St.Ides Mix Tape   ???      1994
33  Liquid Swords      103      1995



RELATIONAL MODEL: FOREIGN KEYS

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA

ArtistAlbum(artist_id, album_id)
artist_id  album_id
101        11
101        22
103        22
102        22

Album(id, name, year)
id  name               year
11  Enter the Wu-Tang  1993
22  St.Ides Mix Tape   1994
33  Liquid Swords      1995
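The three-table layout above can be tried out with a small sketch, assuming SQLite via Python's `sqlite3` module (an illustrative choice, not something the slides prescribe). It declares `ArtistAlbum` with foreign keys into `Artist` and `Album`, then shows that a row pointing at a non-existent artist is rejected. The artist id 999 is a made-up value used only to trigger the violation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.executescript("""
    CREATE TABLE Artist (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                         year INTEGER, country TEXT);
    CREATE TABLE Album  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, year INTEGER);
    CREATE TABLE ArtistAlbum (
        artist_id INTEGER REFERENCES Artist(id),
        album_id  INTEGER REFERENCES Album(id),
        PRIMARY KEY (artist_id, album_id)
    );
    INSERT INTO Artist VALUES (101, 'Wu-Tang Clan', 1992, 'USA'),
                              (102, 'Notorious BIG', 1992, 'USA'),
                              (103, 'GZA', 1990, 'USA');
    INSERT INTO Album VALUES (11, 'Enter the Wu-Tang', 1993),
                             (22, 'St.Ides Mix Tape', 1994),
                             (33, 'Liquid Swords', 1995);
    INSERT INTO ArtistAlbum VALUES (101, 11), (101, 22), (103, 22), (102, 22);
""")

# A reference to an artist that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO ArtistAlbum VALUES (999, 22)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The cross-reference table resolves the "multiple artists on an album" case.
mixtape_artists = [row[0] for row in conn.execute("""
    SELECT Artist.name
      FROM ArtistAlbum JOIN Artist ON Artist.id = ArtistAlbum.artist_id
     WHERE ArtistAlbum.album_id = 22
     ORDER BY Artist.id
""")]
print(rejected, mixtape_artists)
```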



RELATIONAL MODEL: CONSTRAINTS

User-defined conditions that must
hold for any instance of the database.
→ Can validate data within a single tuple
or across entire relation(s).
→ DBMS prevents modifications that
violate any constraint.

Unique key and referential (fkey)
constraints are the most common.

SQL:92 supports global asserts but
these are rarely used (too slow).

Artist(id, name, year, country)
id   name           year  country
101  Wu-Tang Clan   1992  USA
102  Notorious BIG  1992  USA
103  GZA            1990  USA

CREATE TABLE Artist (
    name VARCHAR NOT NULL,
    year INT,
    country CHAR(60),
    CHECK (year > 1900)
);

CREATE ASSERTION myAssert
    CHECK ( <SQL> );
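To see the DBMS refuse a modification that violates a constraint, the `CREATE TABLE` statement above can be run as-is in SQLite through Python's `sqlite3` module. This is a sketch under that assumption; other systems report violations differently. The helper `try_insert` and the sample rows are hypothetical. `CREATE ASSERTION` is left out because SQLite does not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Artist (
        name    VARCHAR NOT NULL,
        year    INT,
        country CHAR(60),
        CHECK (year > 1900)
    )
""")


def try_insert(name, year, country):
    """Return True if the DBMS accepts the row, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO Artist VALUES (?, ?, ?)", (name, year, country))
        return True
    except sqlite3.IntegrityError:
        return False


accepted_valid = try_insert("GZA", 1990, "USA")
accepted_bad_year = try_insert("Too Early", 1850, "USA")  # violates CHECK
accepted_null_name = try_insert(None, 1995, "USA")        # violates NOT NULL
row_count = conn.execute("SELECT COUNT(*) FROM Artist").fetchone()[0]
print(accepted_valid, accepted_bad_year, accepted_null_name, row_count)
```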

DATA MANIPULATION LANGUAGES (DML)

The API that a DBMS exposes to applications to
store and retrieve information from a database.

Procedural: ← Relational Algebra
→ The query specifies the (high-level) strategy to find
the desired result based on sets / bags.

Non-Procedural (Declarative): ← Relational Calculus
→ The query specifies only what data is wanted and
not how to find it.


RELATIONAL ALGEBRA
Fundamental operations to retrieve
and manipulate tuples in a relation.
→ Based on set algebra (unordered lists with
no duplicates).

Each operator takes one or more
relations as its inputs and outputs a
new relation.
→ We can “chain” operators together to
create more complex operations.

σ Select
π Projection
∪ Union
∩ Intersection
– Difference
× Product
⋈ Join


RELATIONAL ALGEBRA: SELECT

Choose a subset of the tuples from a
relation that satisfies a selection
predicate.
→ Predicate acts as a filter to retain only
tuples that fulfill its qualifying
requirement.
→ Can combine multiple predicates using
conjunctions / disjunctions.

Syntax: σpredicate(R)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

σa_id='a2'(R)
a_id  b_id
a2    102
a2    103

σa_id='a2'∧ b_id>102(R)
a_id  b_id
a2    103

SELECT * FROM R
 WHERE a_id='a2' AND b_id>102;
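One way to make the select operator concrete is to model a relation as a Python set of tuples and the predicate as a function. This is only a teaching sketch; the helper name `select` is an assumption and says nothing about how a DBMS implements filtering internally.

```python
# R(a_id, b_id) from the slide, modeled as a set of tuples.
R = {("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)}


def select(relation, predicate):
    """sigma_predicate(relation): keep only the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}


only_a2 = select(R, lambda t: t[0] == "a2")
a2_above_102 = select(R, lambda t: t[0] == "a2" and t[1] > 102)
print(sorted(only_a2), sorted(a2_above_102))
```

Because the output is again a set of tuples, it can be fed straight into another operator, which is the "chaining" property mentioned earlier.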


RELATIONAL ALGEBRA: PROJECTION

Generate a relation with tuples that
contains only the specified attributes.
→ Rearrange attributes’ ordering.
→ Remove unwanted attributes.
→ Manipulate values to create derived
attributes.

Syntax: ΠA1,A2,…,An(R)

R(a_id,b_id)
a_id  b_id
a1    101
a2    102
a2    103
a3    104

Πb_id-100,a_id(σa_id='a2'(R))
b_id-100  a_id
2         a2
3         a2

SELECT b_id-100, a_id
  FROM R WHERE a_id = 'a2';
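The projection example can be sketched in the same set-of-tuples style, with each output attribute given as a small function over the input tuple. The helper name `project` is hypothetical; the derived attribute mirrors `b_id-100` from the slide.

```python
# R(a_id, b_id) from the slide, modeled as a set of tuples.
R = {("a1", 101), ("a2", 102), ("a2", 103), ("a3", 104)}


def project(relation, *expressions):
    """Pi_expressions(relation): build one output tuple per input tuple."""
    return {tuple(expr(t) for expr in expressions) for t in relation}


# Inner select first (a_id = 'a2'), then project (b_id - 100, a_id).
selected = {t for t in R if t[0] == "a2"}
projected = project(selected, lambda t: t[1] - 100, lambda t: t[0])
print(sorted(projected))
```

Since the result is a set, any duplicate output tuples would collapse, which matches the set semantics of relational algebra rather than SQL's default bag semantics.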


RELATIONAL ALGEBRA: UNION

Generate a relation that contains all
tuples that appear in either only one
or both input relations.

Syntax: (R ∪ S)

R(a_id,b_id)    S(a_id,b_id)
a_id  b_id      a_id  b_id
a1    101       a3    103
a2    102       a4    104
a3    103       a5    105

(R ∪ S)
a_id  b_id
a1    101
a2    102
a3    103
a4    104
a5    105

(SELECT * FROM R)
  UNION
(SELECT * FROM S);


RELATIONAL ALGEBRA: INTERSECTION

Generate a relation that contains only
the tuples that appear in both of the
input relations.

Syntax: (R ∩ S)

R(a_id,b_id)    S(a_id,b_id)
a_id  b_id      a_id  b_id
a1    101       a3    103
a2    102       a4    104
a3    103       a5    105

(R ∩ S)
a_id  b_id
a3    103

(SELECT * FROM R)
  INTERSECT
(SELECT * FROM S);


RELATIONAL ALGEBRA: DIFFERENCE

Generate a relation that contains only
the tuples that appear in the first and
not the second of the input relations.

Syntax: (R – S)

R(a_id,b_id)    S(a_id,b_id)
a_id  b_id      a_id  b_id
a1    101       a3    103
a2    102       a4    104
a3    103       a5    105

(R – S)
a_id  b_id
a1    101
a2    102

(SELECT * FROM R)
  EXCEPT
(SELECT * FROM S);
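Union, intersection, and difference map directly onto Python's built-in set operators when relations are modeled as sets of tuples. The short sketch below reuses the R and S instances from these slides; it assumes both relations have the same attributes, as the operators require.

```python
# R(a_id, b_id) and S(a_id, b_id) from the slides, modeled as sets of tuples.
R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103), ("a4", 104), ("a5", 105)}

union = R | S         # (R ∪ S): tuples in either input, duplicates collapsed
intersection = R & S  # (R ∩ S): tuples present in both inputs
difference = R - S    # (R – S): tuples in R that are not in S

print(sorted(union))
print(sorted(intersection))
print(sorted(difference))
```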


RELATIONAL ALGEBRA: PRODUCT

Generate a relation that contains all
possible combinations of tuples from
the input relations.

Syntax: (R × S)

R(a_id,b_id)    S(a_id,b_id)
a_id  b_id      a_id  b_id
a1    101       a3    103
a2    102       a4    104
a3    103       a5    105

(R × S)
R.a_id  R.b_id  S.a_id  S.b_id
a1      101     a3      103
a1      101     a4      104
a1      101     a5      105
a2      102     a3      103
a2      102     a4      104
a2      102     a5      105
a3      103     a3      103
a3      103     a4      104
a3      103     a5      105

SELECT * FROM R CROSS JOIN S;

SELECT * FROM R, S;
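The product can be sketched with `itertools.product` from the Python standard library, concatenating one tuple from each input. The variable name `r_cross_s` is just an illustrative label.

```python
from itertools import product

# R(a_id, b_id) and S(a_id, b_id) from the slide.
R = [("a1", 101), ("a2", 102), ("a3", 103)]
S = [("a3", 103), ("a4", 104), ("a5", 105)]

# (R × S): every tuple of R paired with every tuple of S.
r_cross_s = {r + s for r, s in product(R, S)}

print(len(r_cross_s))
for row in sorted(r_cross_s):
    print(row)
```

The output size is |R| × |S|, so the product grows quickly with the inputs.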

RELATIONAL ALGEBRA: JOIN

Generate a relation that contains all
tuples that are a combination of two
tuples (one from each input relation)
with a common value(s) for one or
more attributes.

Syntax: (R ⋈ S)

R(a_id,b_id)    S(a_id,b_id,val)
a_id  b_id      a_id  b_id  val
a1    101       a3    103   XXX
a2    102       a4    104   YYY
a3    103       a5    105   ZZZ

Matching tuples before the shared attributes are merged:
R.a_id  R.b_id  S.a_id  S.b_id  S.val
a3      103     a3      103     XXX

(R ⋈ S)
a_id  b_id  val
a3    103   XXX

SELECT * FROM R NATURAL JOIN S;

SELECT * FROM R JOIN S USING (a_id, b_id);

SELECT * FROM R JOIN S
    ON R.a_id = S.a_id AND R.b_id = S.b_id;
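A natural join can be sketched as a nested loop that keeps a pair of tuples only when they agree on every shared attribute, then emits the shared attributes once. The function `natural_join` and its parameter names are hypothetical, and the nested loop is chosen for clarity; real systems pick among several join algorithms.

```python
def natural_join(r_rows, r_attrs, s_rows, s_attrs):
    """Return (attribute list, set of tuples) for the natural join of R and S."""
    common = [a for a in r_attrs if a in s_attrs]
    s_extra = [a for a in s_attrs if a not in common]
    result = set()
    for r in r_rows:
        r_rec = dict(zip(r_attrs, r))
        for s in s_rows:
            s_rec = dict(zip(s_attrs, s))
            # Keep the pair only if all shared attributes match.
            if all(r_rec[a] == s_rec[a] for a in common):
                result.add(r + tuple(s_rec[a] for a in s_extra))
    return list(r_attrs) + s_extra, result


R = {("a1", 101), ("a2", 102), ("a3", 103)}
S = {("a3", 103, "XXX"), ("a4", 104, "YYY"), ("a5", 105, "ZZZ")}

joined_attrs, joined_rows = natural_join(R, ["a_id", "b_id"], S, ["a_id", "b_id", "val"])
print(joined_attrs, sorted(joined_rows))
```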

RELATIONAL ALGEBRA: EXTRA OPERATORS


Rename (ρ)
Assignment (R←S)
Duplicate Elimination (δ)
Aggregation (γ)
Sorting (τ)
Division (R÷S)


OBSERVATION
Relational algebra defines an ordering of the high-
level steps of how to compute a query.
→ Example: σb_id=102(R⋈S) vs. (R⋈(σb_id=102(S)))

A better approach is to state the high-level answer
that you want the DBMS to compute.
→ Example: Retrieve the joined tuples from R and S where
b_id equals 102.
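The two orderings in the example return the same answer but do different amounts of work. The sketch below compares them on toy data. The extra S tuple `("a2", 102, "WWW")` is a hypothetical addition so that the result is non-empty, and the `join` helper is a simplified natural join written only for this two-relation case.

```python
R = {("a1", 101), ("a2", 102), ("a3", 103)}
# The ("a2", 102, "WWW") tuple is hypothetical, added so the query returns a row.
S = {("a2", 102, "WWW"), ("a3", 103, "XXX"), ("a4", 104, "YYY"), ("a5", 105, "ZZZ")}


def join(r, s):
    """Natural join of R(a_id, b_id) with S(a_id, b_id, val)."""
    return {(ra, rb, val) for (ra, rb) in r for (sa, sb, val) in s
            if (ra, rb) == (sa, sb)}


# Plan A: join everything first, then filter on b_id = 102.
plan_a = {t for t in join(R, S) if t[1] == 102}

# Plan B: filter S first, then join the smaller input.
filtered_s = {t for t in S if t[1] == 102}
plan_b = join(R, filtered_s)

print(plan_a == plan_b, len(S), len(filtered_s))
```

Plan B hands the join one S tuple instead of four. With a declarative query, choosing between such plans is the DBMS's job rather than the programmer's.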


RELATIONAL MODEL: QUERIES


The relational model is independent of any query
language implementation.

SQL is the de facto standard (many dialects).

for line in file.readlines():
    record = parse(line)
    if record[0] == "GZA":
        print(int(record[1]))

SELECT year FROM artists
 WHERE name = 'GZA';
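For comparison with the hand-written loop, the SQL query can be executed against a small in-memory database. This sketch assumes SQLite via Python's `sqlite3` module and a table named `artists`, matching the query text; the column types are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (name TEXT, year INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO artists VALUES (?, ?, ?)",
    [("Wu-Tang Clan", 1992, "USA"), ("Notorious BIG", 1992, "USA"), ("GZA", 1990, "USA")],
)

# The query states what is wanted; the DBMS decides how to find it.
row = conn.execute("SELECT year FROM artists WHERE name = 'GZA'").fetchone()
gza_year = row[0]
print(gza_year)
```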


DATA MODELS
Relational ← This Course
Key/Value
Graph
Document / JSON / XML / Object ← Leading Alternative
Wide-Column / Column-family
Array (Vector, Matrix, Tensor) ← New Hotness
Hierarchical
Network
Semantic
Entity-Relationship

DOCUMENT DATA MODEL


A collection of record documents containing a
hierarchy of named field/value pairs.
→ A field’s value can be either a scalar type, an array of values,
or another document.
→ Modern implementations use JSON. Older systems use
XML or custom object representations.
Avoid “relational-object impedance mismatch” by
tightly coupling objects and database.


DOCUMENT DATA MODEL

Artist R1(id,…)
ArtistAlbum R2(artist_id,album_id)
Album R3(id,…)


DOCUMENT DATA MODEL

Application Code

class Artist {
    int id;
    String name;
    int year;
    Album albums[];
}
class Album {
    int id;
    String name;
    int year;
}

Artist document with embedded Album objects

{
    "name": "GZA",
    "year": 1990,
    "albums": [
        {
            "name": "Liquid Swords",
            "year": 1995
        },
        {
            "name": "Beneath the Surface",
            "year": 1999
        }
    ]
}
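As a small illustration of why embedding is convenient for application code, the sketch below loads the artist document with Python's `json` module and reads the nested albums without any join. It then flattens the document back into relational-style rows to show what the embedding hides. The variable names are hypothetical.

```python
import json

artist_doc = json.loads("""
{
    "name": "GZA",
    "year": 1990,
    "albums": [
        {"name": "Liquid Swords", "year": 1995},
        {"name": "Beneath the Surface", "year": 1999}
    ]
}
""")

# Nested access: the albums travel with the artist, so no join is needed.
album_names = [album["name"] for album in artist_doc["albums"]]

# Flattened view: one (artist, album, album year) row per embedded album.
album_rows = [
    (artist_doc["name"], album["name"], album["year"])
    for album in artist_doc["albums"]
]
print(album_names)
print(album_rows)
```

The trade-off is that an album shared by several artists would have to be copied into each artist's document, which brings back the integrity questions raised for flat files.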


VECTOR DATA MODEL


One-dimensional arrays used for nearest-neighbor
search (exact or approximate).
→ Used for semantic search on embeddings generated by
ML-trained transformer models (think ChatGPT).
→ Native integration with modern ML tools and APIs (e.g.,
LangChain, OpenAI).

At their core, these systems use specialized indexes
to perform NN searches quickly.


VECTOR DATA MODEL

Album(id, name, year)
id  name               year
11  Enter the Wu-Tang  1993
22  St.Ides Mix Tape   1994
33  Liquid Swords      1995

Transformer → Embeddings
Id1 → [0.32, 0.78, 0.30, ...]
Id2 → [0.99, 0.19, 0.81, ...]
Id3 → [0.01, 0.18, 0.85, ...]

Query: Find albums similar to "Liquid Swords"
Query vector: [0.02, 0.10, 0.24, ...]
→ Vector Index (HNSW, IVFFlat; Meta Faiss, Spotify Annoy)
→ Ranked List of Ids
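The ranking step can be sketched as an exact, brute-force nearest-neighbor search using cosine similarity. Several assumptions apply: the vectors on the slide are truncated, so only their first three visible components are used here as toy three-dimensional embeddings; cosine similarity is one common choice of metric, not something the slide specifies; and the function names are hypothetical. Indexes such as HNSW or IVFFlat exist to avoid this full scan and are not shown.

```python
import math

# Toy embeddings: only the first three components visible on the slide.
embeddings = {
    "Id1": [0.32, 0.78, 0.30],
    "Id2": [0.99, 0.19, 0.81],
    "Id3": [0.01, 0.18, 0.85],
}
query_vector = [0.02, 0.10, 0.24]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, vectors, k):
    """Exact k-nearest-neighbor search: score every vector, return the top k ids."""
    ranked = sorted(vectors, key=lambda i: cosine_similarity(query, vectors[i]),
                    reverse=True)
    return ranked[:k]


ranked_ids = nearest(query_vector, embeddings, k=3)
print(ranked_ids)
```

The scan touches every stored vector, so its cost grows linearly with the collection. That cost is what the specialized indexes trade against exactness.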



CONCLUSION
Databases are ubiquitous.

Relational algebra defines the primitives for
processing queries on a relational database.

We will see relational algebra again when we talk
about query optimization + execution.


NEXT CLASS
Modern SQL
→ Make sure you understand basic SQL before the lecture.


ASK ANDY ANYTHING


Questions about database industry?
Questions about database jobs?
Questions about database systems?
