DATA 228
Big Data Technologies and Applications (Fall 2024)
Sangjin Lee
Hadoop: distributed filesystems
& HDFS
Chapter 3, “Hadoop: The Definitive Guide” 4th Edition, Tom White
What is a distributed filesystem?
“A distributed filesystem is a filesystem that enables clients to access file storage from multiple
hosts through a computer network as if the user was accessing local storage.”
What is a distributed filesystem?
Key phrases
• A filesystem
• Multiple hosts
• Through a computer network
• As if the user was accessing local storage
What is a distributed filesystem?
More attributes
• Semantics of a filesystem
• Paths, directories, access control, timestamps, etc.
• POSIX compliance?
• Resiliency and fault tolerance: more important than for local filesystems
• Transient network failures
• Data losses
Examples of distributed filesystems
• More traditional: SMB, NFS
• Big-data-driven: HDFS, GFS, MapR File System
• Storage-derived: CephFS, GlusterFS
• Cloud solutions (block-based): EBS (AWS), PD (GCP)
• Cloud solutions (object-based)*: S3 (AWS), GCS (GCP)
• Other vendor solutions: NetApp, Nutanix, Cohesity, …
* Not all object storage systems are filesystems.
Hadoop’s distributed filesystem
• Hadoop provides an abstract (distributed) filesystem API
• Clients of distributed filesystems can interact with them at the abstract level (via URIs)
• HDFS is only one implementation provided by Hadoop out of the box
• Examples
• file:// (local files), hdfs:// (HDFS), s3n:// (S3 “native”), gs:// (GCS), …
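The idea of selecting an implementation from the URI scheme can be sketched in a few lines of Python. This is only an illustration of the dispatch idea, not Hadoop's actual mechanism: the registry `SCHEME_TO_FS`, the function names, and the example host `namenode:8020` are hypothetical, and the labels are not Hadoop class names.

```python
from urllib.parse import urlparse

# Hypothetical registry for illustration: URI scheme -> implementation label.
# The schemes come from the slide; the labels are NOT Hadoop class names.
SCHEME_TO_FS = {
    "file": "local filesystem",
    "hdfs": "HDFS",
    "s3n": "S3 (native)",
    "gs": "GCS",
}


def resolve_filesystem(uri: str) -> str:
    """Pick a filesystem implementation based only on the URI scheme."""
    scheme = urlparse(uri).scheme
    if scheme not in SCHEME_TO_FS:
        raise ValueError(f"no filesystem registered for scheme {scheme!r}")
    return SCHEME_TO_FS[scheme]


def describe(uri: str) -> str:
    """Show which implementation, authority, and path a URI maps to."""
    parsed = urlparse(uri)
    return (f"{resolve_filesystem(uri)}: "
            f"authority={parsed.netloc!r} path={parsed.path!r}")


if __name__ == "__main__":
    # The host name and port below are made up for the example.
    for uri in ("file:///tmp/data.txt", "hdfs://namenode:8020/user/data.txt"):
        print(describe(uri))
```

The client code stays the same for every scheme; only the URI changes.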
HDFS
Design
• Geared towards very large files: GBs or TBs
• Streaming data access
• Write-once and read-many-times
• Reading whole files over random seeks
• Commodity hardware
• Highly resilient to individual node failures: multiple replicas, block repairs, rebalancing
HDFS
What HDFS is NOT so good at
• Low-latency data access
• Trade-off between throughput and latency
• Lots of small files: trade-off from architectural and scale considerations
• Multiple writers
• Arbitrary file modifications
• Doesn’t provide full POSIX compliance
HDFS
Core concepts
• Blocks
• Blocks are a useful concept in filesystem implementations
• Local filesystem blocks: commonly 512 B - 8 KB
• Hadoop’s default block size: 128 MB (often much larger in real clusters)
• Implications for small files
• Replication: 3 by default (erasure coding can reduce it)
• Compression: up to users
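A short Python sketch of the block arithmetic, using the defaults from the slide (128 MB blocks, replication 3). The function names are made up for illustration. It assumes plain replication (no erasure coding) and that a block smaller than the block size occupies only its actual bytes on disk.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB   # default block size from the slide
REPLICATION = 3         # default replication factor from the slide


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes is split into."""
    return -(-file_size // block_size)  # integer ceiling division


def raw_storage_bytes(file_size: int, replication: int = REPLICATION) -> int:
    """Disk space used across the cluster, assuming plain replication and
    that a partially filled block takes only its actual size on disk."""
    return file_size * replication


if __name__ == "__main__":
    one_gb = 1024 * MB
    print(block_count(one_gb))         # 8 blocks for a single 1 GB file
    print(1024 * block_count(1 * MB))  # 1024 blocks for the same data as 1 MB files
    print(raw_storage_bytes(one_gb) // MB, "MB of raw storage")
```

The same 1 GB of data needs 8 blocks as one file but 1024 blocks as 1 MB files, which is why lots of small files put pressure on block metadata.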
HDFS
Architecture
• Namenodes and datanodes
HDFS
Namenode
• “One” for a single cluster
• Namenode manages metadata: metadata for files and directories
• Block locations are reported by datanodes (not persisted by namenode)
• Namenode requires a large amount of memory
• Namenode data (namespace image and the edit log) are written to disk in several locations
• Namenode can be a scalability bottleneck
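A rough Python estimate of why the namenode needs a lot of memory. It assumes the commonly quoted rule of thumb of about 150 bytes of namenode memory per file, directory, or block; that figure is an approximation, not a measured value, and the function name is made up for illustration.

```python
# Assumption: roughly 150 bytes of namenode heap per namespace object
# (file, directory, or block). This is a rule of thumb, not a measurement.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int,
                          blocks_per_file: int = 1,
                          num_dirs: int = 0) -> int:
    """Estimate namenode memory for the given number of namespace objects."""
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks + num_dirs) * BYTES_PER_OBJECT


if __name__ == "__main__":
    # One million single-block files: 2 million objects in total.
    estimate = namenode_memory_bytes(1_000_000)
    print(estimate / 1_000_000, "MB")
```

Under this assumption memory grows with the number of files and blocks, not with the bytes stored, so a billion small files would be far more costly than the same data in large files.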
HDFS
Namenode high availability (HA)
• Redundant storage of the filesystem metadata
• Secondary namenode
• Gets periodic updates from the primary namenode and retains the state
• It can run as a hot standby
• Failover via ZooKeeper
• Fencing
• Client handles it via client library
HDFS
Datanodes
• One per node
• Stores and retrieves blocks (asked by clients and the namenode)
• Verifies blocks’ checksums periodically
• Reports the block list to the namenode
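Periodic checksum verification can be illustrated with a small Python sketch. It uses `zlib.crc32` and a 512-byte chunk size as stand-ins; the checksum algorithm and chunk size HDFS actually uses are configuration details not covered on the slide, and the function names are made up for illustration.

```python
import zlib

# Assumption for illustration: one checksum per 512-byte chunk of a block.
CHUNK_SIZE = 512


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Compute one CRC-32 value per chunk of the block data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data: bytes, expected: list,
                        chunk_size: int = CHUNK_SIZE) -> list:
    """Return indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


if __name__ == "__main__":
    block = bytes(range(256)) * 8          # 2048 bytes -> 4 chunks
    stored = chunk_checksums(block)        # checksums saved at write time

    damaged = bytearray(block)
    damaged[1100] ^= 0xFF                  # flip one byte inside chunk 2
    print(find_corrupt_chunks(bytes(damaged), stored))
```

A datanode that finds a mismatch can report the bad block so that a healthy replica is copied elsewhere (the "block repairs" mentioned under Design).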
HDFS
Data flow: reads
• DistributedFileSystem returns the block locations from the namenode
• Actual reads are done via FSDataInputStream
• Reads go directly to datanodes (not through namenode)
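The read path can be sketched as a toy in-memory model in Python. `ToyNamenode`, `ToyDatanode`, and `read_file` are hypothetical names, not Hadoop classes. The model always reads from the first listed datanode, whereas the real client library chooses a replica for you, and it ignores failures and retries.

```python
class ToyNamenode:
    """Holds metadata only: path -> ordered list of (block_id, [datanode names])."""

    def __init__(self):
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


class ToyDatanode:
    """Holds the actual block bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Ask the namenode where blocks live, then fetch bytes from datanodes."""
    data = bytearray()
    for block_id, holders in namenode.get_block_locations(path):
        datanode = datanodes[holders[0]]   # simplification: first replica
        data += datanode.read_block(block_id)
    return bytes(data)


if __name__ == "__main__":
    dn1, dn2 = ToyDatanode("dn1"), ToyDatanode("dn2")
    dn1.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_2"] = b"world"
    nn = ToyNamenode()
    nn.block_map["/demo.txt"] = [("blk_1", ["dn1"]), ("blk_2", ["dn2"])]
    print(read_file(nn, {"dn1": dn1, "dn2": dn2}, "/demo.txt"))
```

The namenode never handles file bytes in this model, which is the point of the last bullet: data traffic is spread across datanodes.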
HDFS
Data flow: writes
• Client makes a request to write a new file via DistributedFileSystem
• Namenode creates a record of the new file
• Datanodes form a pipeline of writes: blocking operation
• Datanodes report block locations to the namenode
• Replica placement
• Rack diversity: same node as client → off-rack → same rack as the second replica (different node)
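The rack-diversity rule for three replicas can be sketched in Python. It assumes the client runs on a datanode, that "same rack" means the rack of the second replica, and that some remote rack has at least two nodes. `place_replicas` and the topology format are hypothetical, and the real placement policy weighs more factors (load, free space) than this does.

```python
import random


def place_replicas(client_node, topology, rng=None):
    """Choose 3 nodes: the client's node, a node on another rack,
    then a different node on that same remote rack.

    topology: dict mapping rack name -> list of node names.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if client_node not in rack_of:
        raise ValueError("client is assumed to run on a datanode")

    local_rack = rack_of[client_node]
    candidates = [rack for rack, nodes in topology.items()
                  if rack != local_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("need a remote rack with at least two nodes")

    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [client_node, second, third]


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas("n1", racks))
```

Losing a whole rack still leaves at least one replica, and only one of the two pipeline hops crosses racks.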
HDFS
Replica placement
HDFS
Coherency model
• A file is guaranteed to exist after create()
• File content may not be visible even after the stream is flushed (via flush())
• File content is guaranteed to be visible after hflush()
• File renames or directory renames are NOT atomic
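The visibility guarantees for create(), flush(), and hflush() can be mimicked with a toy Python model. `ToyHdfsFile` is a hypothetical class, not a Hadoop API. It treats "may not be visible" as "not visible", which is the pessimistic reading; a real cluster may expose some data earlier.

```python
class ToyHdfsFile:
    """Toy model of what a NEW reader can see while a writer is active."""

    def __init__(self):
        # After create(), the file exists in the namespace.
        self.exists = True
        self._pending = bytearray()   # written, not yet guaranteed visible
        self._visible = bytearray()   # guaranteed visible to new readers

    def write(self, data: bytes) -> None:
        self._pending += data

    def flush(self) -> None:
        # No visibility guarantee in this model: pending data stays pending.
        pass

    def hflush(self) -> None:
        # Everything written so far becomes visible to new readers.
        self._visible += self._pending
        self._pending.clear()

    def read_as_new_reader(self) -> bytes:
        return bytes(self._visible)


if __name__ == "__main__":
    f = ToyHdfsFile()
    f.write(b"content")
    f.flush()
    print(f.exists, f.read_as_new_reader())   # file exists, content not yet guaranteed
    f.hflush()
    print(f.read_as_new_reader())
```

A writer that needs readers to see progress calls hflush() at the points that matter, at the cost of extra round trips.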
HDFS
Demo
Exploring filesystem APIs