Image: Mathias Krumbholz (Wikimedia Commons)
Plan for Today
Recap: Unix System 5 File System
Creating a File
Better File Systems: ZFS, RAID
Flash Memory
1
PS4 is due
11:59pm
Sunday, 6 April
Exam 2 Redo: posted on
course site, due 11:59pm
2
[Figure: the inode's diskmap has entries 0, 1, 2, …, 9, 10, 11, 12. Direct entries point to Disk Blocks (1K bytes each). An Indirect Disk Block (1K bytes) holds block pointers, 4 bytes for each = 256 pointers. A Double Indirect Disk Block points to Indirect Disk Blocks.]
Diskmap
(Unix System 5)
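A minimal sketch of how a byte offset maps onto the diskmap. It assumes the classic layout in which entries 0-9 are direct, 10 is single indirect, 11 is double indirect, and 12 is triple indirect; the slide itself only labels the indirect and double indirect blocks, and the name `locate` is hypothetical.

```python
BLOCK_SIZE = 1024                    # 1K-byte disk blocks, as on the slide
PTRS_PER_BLOCK = BLOCK_SIZE // 4     # 4 bytes per pointer = 256 pointers
NUM_DIRECT = 10                      # assumption: entries 0-9 are direct


def locate(offset):
    """Return (level, index path) for the block holding byte `offset`."""
    block = offset // BLOCK_SIZE
    if block < NUM_DIRECT:
        return ("direct", [block])
    block -= NUM_DIRECT
    if block < PTRS_PER_BLOCK:
        return ("indirect", [10, block])
    block -= PTRS_PER_BLOCK
    if block < PTRS_PER_BLOCK ** 2:
        return ("double indirect",
                [11, block // PTRS_PER_BLOCK, block % PTRS_PER_BLOCK])
    block -= PTRS_PER_BLOCK ** 2
    if block < PTRS_PER_BLOCK ** 3:
        return ("triple indirect",
                [12,
                 block // PTRS_PER_BLOCK ** 2,
                 (block // PTRS_PER_BLOCK) % PTRS_PER_BLOCK,
                 block % PTRS_PER_BLOCK])
    raise ValueError("offset beyond maximum file size")


# Largest file, in blocks, under the assumed layout.
MAX_BLOCKS = (NUM_DIRECT + PTRS_PER_BLOCK
              + PTRS_PER_BLOCK ** 2 + PTRS_PER_BLOCK ** 3)
```

Under these assumptions the first 10K bytes need no extra disk reads beyond the inode, while each level of indirection adds one more block read per access.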
Directories are Files Too!
3
Filename Inode
. 494211
.. 494205
.DS_Store 494212
class0 6565946
class1 6565826
class10 1467012
class11 2252968
… …
class16 5649155
class2 494218
… …
ls -ali
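A small sketch of what `ls -ali` reports: a directory is a file whose contents map names to inode numbers. The helper name `list_inodes` is hypothetical; `os.scandir` and `DirEntry.inode()` are standard-library calls.

```python
import os


def list_inodes(path="."):
    """Return sorted (name, inode number) pairs for the entries in `path`."""
    entries = [
        (".", os.stat(path).st_ino),
        ("..", os.stat(os.path.join(path, os.pardir)).st_ino),
    ]
    with os.scandir(path) as it:
        for entry in it:
            entries.append((entry.name, entry.inode()))
    return sorted(entries)


for name, inode in list_inodes("."):
    print(f"{name:<12} {inode}")
```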
How do you create a new file?
4
Finding a Free Block
5
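[Figure, not to scale: the disk is laid out as Boot block, Superblock, I-List (inodes), Data. The superblock holds a List of free disk blocks with entries 0, 1, …, 98, 99; the figure draws two such lists.]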
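A toy model of the free-block list. It assumes the classic System V scheme, in which the superblock caches up to 100 free block numbers and entry 0 names a block that stores the next batch; the slide shows the two lists but does not spell out the chaining, and the class and method names here are hypothetical.

```python
LIST_SIZE = 100   # entries 0..99, as in the figure


class FreeList:
    """Superblock cache of free block numbers, chained through free blocks."""

    def __init__(self, free_blocks):
        self.disk = {}      # block number -> list of free blocks stored in it
        self.cache = []     # the list kept in the superblock
        for b in free_blocks:
            self.free(b)

    def free(self, block):
        if len(self.cache) == LIST_SIZE:
            # Cache full: spill it into the block being freed, which
            # becomes entry 0 (the link) of the new cache.
            self.disk[block] = self.cache
            self.cache = [block]
        else:
            self.cache.append(block)

    def alloc(self):
        if not self.cache:
            raise RuntimeError("no free blocks")
        block = self.cache.pop()
        if not self.cache and block in self.disk:
            # We just took the link block: reload the cache from it.
            self.cache = self.disk.pop(block)
        return block
```

Allocation is usually a single pop from the in-memory list; a disk read is needed only once per 100 allocations, when the link block is consumed.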
Finding a Free inode
6
[Figure, not to scale: the disk is laid out as Boot block, Superblock, I-List (inodes), Data. The I-List is drawn as a two-column table with rows 0 0, 1 1, 2 0, 3 0, …]
Superblock keeps a cache of free inodes
Lots more to do!
Need to select disk blocks, update directory, etc.
Read the OSTEP chapter.
Modern File Systems
8
IBM 350 Disk Storage (1956): 118,000 in³, 5MB, 600ms seek
Seagate HDD (2013): 23 in³, 4TB (4M MB), 5ms seek
What should a modern file system do
that Unix S5FS doesn’t?
9
10
11
ZFS
Developed for Solaris, 2005
Now open source:
http://open-zfs.org/
12
“MacZFS is free data storage and protection software
for all Mac OS users. It’s for people who have Mac OS,
who have any data, and who really like their data.
Whether on a single-drive laptop or on a massive
server, it’ll store your petabytes with ragingly redundant
RAID reliability, and it’ll keep the bit-rotted bleeps and
bloops out of your iTunes library.”
Handling Failures
13
Block Checksums
14
S5FS: diskmap entries 0, 1, 2, …, 9, 10, 11, 12, each pointing to a Disk Block (1K bytes)
ZFS: each block has a checksum (SHA-256)
Block | Checksum (SHA-256)
0 | 40a3dc…
1 | 2c5829d…
2 | 955d253…
… | …
How do you check the checksums?
Hashing the Hashes
15
Block 1 Block 2 Block 3 Block 4
Hash(B1) Hash(B2) Hash(B3) Hash(B4)
Merkle Tree
16
Ralph Merkle
Block 1 Block 2 Block 3 Block 4
Hash(B1) Hash(B2) Hash(B3) Hash(B4)
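A minimal sketch of hashing the hashes into a single root. SHA-256 matches the checksum named on the earlier slide; duplicating the last hash when a level has an odd count is an assumption made here for simplicity, and `merkle_root` is a hypothetical name.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Hash each block, then hash pairs of hashes until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])      # assumption: pad odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


root = merkle_root([b"Block 1", b"Block 2", b"Block 3", b"Block 4"])
print(root.hex())
```

Only the root has to be stored somewhere trusted: changing any block changes its hash, which changes every hash on the path up to the root.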
Recovery
17
copies = 2
One Copy
Copy 1
Copy 2
Keep 2 copies of every block: if
checksum fails for first copy
read, try reading second copy.
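The recovery rule above can be sketched as a read that verifies each copy against the stored checksum. This is an illustration only, not ZFS code; `read_block` and its arguments are hypothetical.

```python
import hashlib


def read_block(copies, expected_checksum):
    """Return the first copy whose SHA-256 matches the stored checksum."""
    for data in copies:
        if hashlib.sha256(data).hexdigest() == expected_checksum:
            return data
    raise IOError("all copies failed checksum verification")


good = b"file contents"
checksum = hashlib.sha256(good).hexdigest()
corrupted = b"file c0ntents"                 # simulated bit rot in copy 1
print(read_block([corrupted, good], checksum))
```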
18
copies = 3
One Copy
Copy 1
Copy 2
Copy 3
For the truly paranoid…
RAID
19
For the fairly paranoid but cheap…
Redundant Arrays of Inexpensive Disks
Case for RAID, ACM SIGMOD 1988
whitehouse.gov
20
21
Redundancy
22
23
Improving Performance
24
Cache (64MB DRAM)
Adaptive Replacement Cache
Adaptive Replacement Cache
25
Blocks in Cache:
T1: Recent Cache Entries
T2: Frequently-Used Blocks (an entry in T1 moves here when Accessed Again)
Size of T1 adapts
“Ghost” Entries:
B1: Evicted from T1 (LRU)
B2: Evicted from T2 (LRU)
How should relative size of T1 and T2 be adjusted?
Hit in B1: should increase size of T1, drop entry from T2 to B2
Hit in B2: should increase size of T2, drop entry from T1 to B1
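A simplified sketch of the adaptation rule, not the full published ARC algorithm: the target size of T1 moves by a fixed step of 1 on each ghost hit, and the class name `SimpleARC` and its fields are hypothetical.

```python
from collections import OrderedDict


class SimpleARC:
    """T1/T2 hold cached keys; B1/B2 hold ghost keys (no data)."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                                  # target size of T1
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()

    def _evict(self, prefer_t1):
        # Move the LRU entry of T1 or T2 into the matching ghost list.
        if self.t1 and (prefer_t1 or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None
        for ghost in (self.b1, self.b2):            # keep ghost lists bounded
            while len(ghost) > self.c:
                ghost.popitem(last=False)

    def _full(self):
        return len(self.t1) + len(self.t2) >= self.c

    def access(self, key):
        """Touch `key`; return True on a cache hit."""
        if key in self.t1:                          # accessed again: T1 -> T2
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                          # ghost hit: grow T1
            self.p = min(self.c, self.p + 1)
            del self.b1[key]
            if self._full():
                self._evict(prefer_t1=False)        # drop entry from T2 to B2
            self.t2[key] = None
            return False
        if key in self.b2:                          # ghost hit: grow T2
            self.p = max(0, self.p - 1)
            del self.b2[key]
            if self._full():
                self._evict(prefer_t1=True)         # drop entry from T1 to B1
            self.t2[key] = None
            return False
        if self._full():                            # plain miss
            self._evict(prefer_t1=len(self.t1) > self.p)
        self.t1[key] = None
        return False
```

The ghost lists cost little memory because they store only keys, yet a hit in one of them is direct evidence that the corresponding real list was too small.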
27
IBM Almaden Research Center
Do you actually have
a disk like this on
your EC2 node/main
computing device?
28
Cache (64MB DRAM)
Flash Memory
29
Solid State Drive
30
Fujio Masuoka
How NAND Flash Works
31
[Figure: a flash cell. Labels: Word Line, Control gate, Oxide Layer, Floating gate (stores electrons), Source, Drain, Bit Line. Uncharged State: 1]
Adapted from http://computer.howstuffworks.com/flash-memory1.htm
How NAND Flash Works
32
[Figure: the same cell with electrons on the floating gate. Charged State: 0]
Adapted from http://computer.howstuffworks.com/flash-memory1.htm
Flash Memory
Non-volatile: preserves state without any power
Solid State: no moving parts larger than electrons
Fast (compared to disk): random read time ~10,000ns
33
Summary: Storage Systems
34
Device | Example | Time to Access | Cost per Bit
Mercury (Gin) Delay Line | UNIVAC (1951) | 220,000ns (average) | $0.38 (1968) (a bazillion n$)
DRAM | Kingston KVR16N11/4 4GB DDR3 ($40) | 13.75ns | 1.16 n$
SSD | Samsung 500GB ($300) | ~10,000ns (for random read) | 0.075 n$
Disk Drive | Seagate Desktop HDD 4TB SATA 6Gb/s NCQ 64MB | 5,000,000ns | 0.0046 n$
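The cost-per-bit column can be reproduced from the listed prices. This is a worked check under stated assumptions: the DRAM figure matches 4GB taken as 4 × 2^30 bytes, and the SSD figure matches 500GB taken as 500 × 10^9 bytes; the table gives no price for the disk drive, so that row is not recomputed.

```python
def nanodollars_per_bit(price_dollars, capacity_bytes):
    """Price divided by capacity in bits, expressed in n$ (1e-9 dollars)."""
    return price_dollars / (capacity_bytes * 8) * 1e9


dram = nanodollars_per_bit(40, 4 * 2**30)       # assumption: binary GB
ssd = nanodollars_per_bit(300, 500 * 10**9)     # assumption: decimal GB
print(f"DRAM: {dram:.2f} n$/bit, SSD: {ssd:.3f} n$/bit")

# Access-time ratio from the table: disk seek vs. SSD random read.
print(5_000_000 / 10_000)
```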
Challenges of Flash
Writing (1 → 0) is expensive
Erasing (0 → 1) is super expensive:
  Apply electric field to release charge
  Can only erase a full block (often 128K) at a time
  Cells wear out after 10,000-1M erasings
Reading disturbs nearby cells
  Cannot read same cell too many times
35
But: no seek time – time to access every cell is the same!
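A toy model of the write and erase asymmetry, assuming a 16-byte block instead of a realistic 128K one. The class `FlashBlock` and its methods are hypothetical and ignore read disturb.

```python
class FlashBlock:
    """Bits start at 1 (uncharged) and can only be programmed toward 0;
    getting any bit back to 1 requires erasing the whole block."""

    def __init__(self, size=16):
        self.cells = bytearray([0xFF] * size)    # erased state: all 1s
        self.erase_count = 0                     # cells wear out with erases

    def program(self, index, value):
        # Reject writes that would need a 0 -> 1 transition.
        if value & ~self.cells[index] & 0xFF:
            raise ValueError("needs 0 -> 1 transition: erase the block first")
        self.cells[index] &= value

    def erase(self):
        self.cells = bytearray([0xFF] * len(self.cells))
        self.erase_count += 1


block = FlashBlock()
block.program(0, 0xF0)     # 1 -> 0 on the low four bits
block.program(0, 0x30)     # more 1 -> 0 transitions: still allowed
```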
How should we design a file
system for flash memory?
36
37
UVa Mathematics (1984)
Berkeley CS PhD
Stanford Professor
Log-Structured File System
38
Write sequentially: never overwrite data
Disk: File 1, File 2, Updated File 1
April Fool’s? What’s wrong with this picture?
Where does the meta-data go?
39
Disk: Block 0, Block 1, Block 2, InodeA
When should we do the writes?
40
Disk: Block 0, Block 1, Block 2, InodeA
When should we do the writes?
41
Disk: Block 0, Block 1, Block 2, InodeA
In-Memory Buffer: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB
Updating a File
43
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB
Suppose the contents of Block 1 are modified?
Updating a File
44
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB, Block 1 - update
Updating a File
45
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB, Block 1 - update, InodeA’
Finding an Inode
46
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB, Block 1 - update, InodeA’
Recap: how did we do this for S5FS?
47
Filename Inode
. 494211
.. 494205
.DS_Store 494212
class0 6565946
class1 6565826
… …
class16 5649155
class2 494218
… …
Finding an Inode
49
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB, Block 1 - update, InodeA’
50
imap (entries 0, 1, 2): pointer to most recent version of inode.
51
Where should we store the imap?
52
At the end of each write (when necessary)! It’s small (4 bytes * number of inodes), and sequential writes are cheap!
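A toy sketch of the log with an imap, following the slides: data is never overwritten, an update appends the new block and a new inode, and a copy of the imap is appended at the end of each write. The class `LogFS` and its record format are hypothetical.

```python
class LogFS:
    """The disk is an append-only list of records."""

    def __init__(self):
        self.log = []      # records are appended, never overwritten
        self.imap = {}     # inode number -> log address of newest inode

    def _append(self, record):
        self.log.append(record)
        return len(self.log) - 1

    def write_file(self, inum, blocks):
        """Append data blocks, then the inode, then a copy of the imap."""
        addrs = [self._append(("data", b)) for b in blocks]
        self.imap[inum] = self._append(("inode", addrs))
        self._append(("imap", dict(self.imap)))

    def update_block(self, inum, index, data):
        """Append the new block and a new inode version pointing to it."""
        _, old_addrs = self.log[self.imap[inum]]
        addrs = list(old_addrs)
        addrs[index] = self._append(("data", data))
        self.imap[inum] = self._append(("inode", addrs))   # e.g. InodeA'
        self._append(("imap", dict(self.imap)))

    def read_file(self, inum):
        _, addrs = self.log[self.imap[inum]]
        return [self.log[a][1] for a in addrs]


fs = LogFS()
fs.write_file(0, [b"Block 0", b"Block 1", b"Block 2"])
fs.update_block(0, 1, b"Block 1 - update")
```

The old Block 1 and the old inode are still in the log after the update; nothing points to them any more, which is the "old junk" that garbage collection has to reclaim.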
53
Disk: Block 0, Block 1, Block 2, InodeA
Disk, continued: Block 3, Block 4, Block 5, Block 6, Block 7, InodeB, Block 1 - update, InodeA’, imap, Block 8, Block 0 - update, …, Block 5 - update, InodeA’, InodeB’, imap
Won’t the disk fill up with lots of old junk?
54
Class 8:
Garbage Collection in LSFS
55
Disk: Block 0, Block 1, Block 2, InodeA, Block 3, Block 4, Block 5
Disk, continued: Block 6, Block 7, InodeB, Block 1 - update, InodeA’, imap, Block 8, Block 0 - update, …, Block 5 - update, InodeA’, InodeB’, imap
Garbage Collection in LSFS
56
[Same log, with one Segment marked.]
Garbage Collection in LSFS
58
A full clean segment!
Disk, continued: Block 6, Block 7, InodeB, Block 1 - update, InodeA’, imap, Block 8, Block 0 - update, …, Block 5 - update, InodeA’, InodeB’, imap, Block 2, Block 3, Block 4, InodeA’, InodeB’, imap, …
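A minimal sketch of cleaning one segment: live blocks are copied to the end of the log and the inodes are repointed, after which the whole segment is free. The data layout (`log` as a list of data blocks, `inodes` as a dict of address lists) and the name `clean_segment` are hypothetical stand-ins; inode and imap records are left out of the log for brevity.

```python
def clean_segment(log, inodes, start, end):
    """Copy live blocks out of log[start:end], then free the segment."""
    for addrs in inodes.values():
        for i, addr in enumerate(addrs):
            if start <= addr < end:          # live block inside the segment
                log.append(log[addr])        # rewrite it at the end of the log
                addrs[i] = len(log) - 1      # the inode now points at the copy
    for addr in range(start, end):
        log[addr] = None                     # segment is clean


# Mirrors the slides: Blocks 0, 1, and 5 were updated, so only
# Blocks 2, 3, and 4 in the first segment are still live.
log = ["B0", "B1", "B2", "B3", "B4", "B5", "B6", "B7", "B1'", "B0'", "B5'"]
inodes = {"A": [9, 8, 2], "B": [3, 4, 10, 6, 7]}
clean_segment(log, inodes, 0, 6)
```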
59
SOSP 1991
1987
60
http://www.jcmit.com/flash2013.htm
2003: $0.25/MB
2006: $0.02/MB
2010: $0.002/MB
2013: $0.0005/MB
< $1/GB
Differences with Flash
No need for sequential writes
  Just need to find unused blocks
  Can do 1 → 0 rewrites!
  Maintain a bitmap of used blocks at fixed block
Lots of complexities:
  Bits wear out, read disruption, etc.
61
Who should deal with those complexities?
62
2GB microSD card
Andrew “bunnie” Huang
63
2GB microSD card
Andrew “bunnie” Huang
ARM Processor!
64
Summary: Storage Systems
65
Device | Example | Time to Access | Cost per Bit
Mercury (Gin) Delay Line | UNIVAC (1951) | 220,000ns (average) | $0.38 (1968) (a bazillion n$)
DRAM | Kingston KVR16N11/4 4GB DDR3 ($40) | 13.75ns | 1.16 n$
SSD | Samsung 500GB ($300) | ~10,000ns (for random read) | 0.075 n$
Disk Drive | Seagate Desktop HDD 4TB SATA 6Gb/s NCQ 64MB | 5,000,000ns | 0.0046 n$
Modern Hard Drive
Relevance to PS4?
66
Not expected to implement any of this
– a very simple filesystem in memory is
fine (but feel free to surprise us!)
Your filesystem is in memory: no need to deal with
complexities of interfacing with persistent media
(but doing this could be a good post-PS4 project!).
FlashKernel?
67
by shamserg
PS4 Due
Sunday,
11:59pm

Flash! (Modern File Systems)
