Vidyalekhani
DATE
PAGE
TYPES OF DATA
1. Structured
2. Semi-structured
3. Unstructured
Structured data examples: databases such as Oracle, DB2, MySQL, etc.; spreadsheets; transaction processing systems.
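1. STRUCTURED DATA
Structured data can be defined as the data that resides in fixed fields within a record.
It is the type of data that is most familiar to us in everyday life, e.g. birthday, address.
Structured data is queried with Structured Query Language (SQL).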
DRAWBACKS
Structured data can be used only in cases of predefined functionality. This means that structured data has limited flexibility and is suitable only for certain specific use cases.
Structured data is stored in a data warehouse with strict constraints and a defined schema, so it cannot easily process different kinds of data.
2. SEMI-STRUCTURED DATA
Semi-structured data is not bound by any rigid schema for data storage. It has some features, like key-value pairs, which are used to help in differentiating entities from each other.
In semi-structured data, NoSQL is used. Data serialization languages are convenient for storing metadata about the business process.
This type of information typically comes from external sources such as social media platforms or other web-based data feeds.
Examples of semi-structured data: email, JSON, data serialization and other markup languages, key-value stores, NoSQL.
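The difference between a fixed schema and key-value records can be seen in a small sketch. This is only an illustration in Python using the standard `json` module; the field names and values are made up and are not from the notes.

```python
import json

# Structured record: fixed columns in a fixed order, as in an RDBMS row.
COLUMNS = ("name", "birthday", "address")
row = ("Asha", "2001-04-12", "Pune")

# Semi-structured records: key-value pairs with no rigid schema.
# All field names and values here are hypothetical.
records = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "tags": ["student", "hadoop"], "address": {"city": "Pune"}},
]

# Data serialization: turn the records into JSON text and back again.
text = json.dumps(records)
restored = json.loads(text)

# Each record may carry a different set of keys.
keys_per_record = [sorted(r.keys()) for r in restored]
print(keys_per_record)
```

Every structured row must supply exactly the columns in `COLUMNS`, while each JSON record brings its own keys, which is what "not bound by any rigid schema" means in practice.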
3. UNSTRUCTURED DATA
It is the kind of data that doesn't have a defined schema or set of rules. Its arrangement is unplanned.
Ex: texts, photos, log files.
It is additionally known as dark data, because it cannot be analyzed without the proper software tools.
Examples of unstructured data: audio, videos, text messages, chats, social media data, free-form text.
HISTORY OF HADOOP
Hadoop is an open-source framework overseen by the Apache Software Foundation. It is written in Java and is used for the processing of huge datasets on clusters of commodity hardware.
Hadoop was started by Doug Cutting and Mike Cafarella in the year 2002, when they both started to work on the Apache Nutch project.
After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which is very expensive.
In 2003, they came across a paper that described the architecture of Google's distributed file system, called GFS (Google File System), which was half of the solution to the problem.
In 2004, Google published one more paper, on the technique MapReduce. This paper was the other half of the solution to the problem.
Doug Cutting and Mike Cafarella then applied both techniques (GFS and MapReduce) in their Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of 20 to 40 nodes, because only two engineers were working on the project.
In 2006, Cutting joined Yahoo with the Nutch project, and it was renamed Hadoop.
The name Hadoop comes from the name of a yellow toy elephant.
RDBMS vs HADOOP
RDBMS | Hadoop
Traditional row-column based database, basically used for data storage, manipulation and retrieval. | An open-source s/w used for storing data and running applications or processes concurrently.
Structured data is mostly processed. | Both structured and unstructured data are processed.
Best suited for OLTP (online transaction processing). | Best suited for big data.
Less scalable than Hadoop. | Highly scalable.
Data normalization is required. | Data normalization is not required.
Stores transformed and aggregated data. | Stores a huge volume of data.
Schema is of static type. | Schema is of dynamic type.
High data integrity available. | Lower data integrity available.
Cost is applicable for licensed software. | No cost, as it is open-source s/w.
Follows the ACID properties. | Does not follow the ACID properties; uses MapReduce.
BIG DATA ANALYTICS
In this new digital world, data is generated in enormous amounts, which opens new paradigms. As we have high computing power and a large amount of data, we can use this data to help us make data-driven decisions.
Types of analytics:
1) Predictive (forecasting)
2) Descriptive
3) Prescriptive (optimization, simulation)
4) Diagnostic
Descriptive Analytics | Diagnostic Analytics | Predictive Analytics | Prescriptive Analytics
Deals with what happened in the past | Deals with why it happened | Deals with what will happen in the future | How can we make it happen
1. Predictive Analytics
Uses data to determine the probable outcome of an event.
Techniques that are used for predictive analytics: linear regression, time series analysis and forecasting, data mining.
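As an illustration of the first technique, linear regression, here is a minimal sketch in plain Python. The function name `fit_line` and the monthly sales figures are made up for the example; a real analysis would use real data and usually a library.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, perfectly linear data: sales grow by 2 units every month.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)

# Predict the probable outcome for month 6 from the past data.
forecast = slope * 6 + intercept
print(slope, intercept, forecast)
```

The fitted line is then used to forecast a future value, which is the "probable outcome" that predictive analytics is after.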
2. Descriptive Analytics
Looks at data and analyzes past events for insight as to how to approach future events.
Common examples: data queries, reports, descriptive statistics, data dashboards.
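Descriptive statistics can be sketched with Python's standard `statistics` module. The list `daily_orders` is made-up data used only to show the idea of summarising what happened in the past.

```python
import statistics

# Hypothetical past data: number of orders on six days.
daily_orders = [12, 15, 11, 15, 20, 17]

# A small descriptive summary, the kind of figures a report or dashboard shows.
summary = {
    "count": len(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
    "mode": statistics.mode(daily_orders),
    "min": min(daily_orders),
    "max": max(daily_orders),
}
print(summary)
```

Nothing here predicts anything; the summary only describes the past events.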
3. Prescriptive Analytics
Synthesizes big data, mathematical science, business rules and machine learning to make a prediction, and then suggests decision options to take advantage of the prediction.
Ex: healthcare strategic planning by using analytics.
4. Diagnostic Analytics
We generally use historical data over other data to answer any question or to solve any problem.
Common techniques: data discovery, data mining, correlations.
STEPS IN DATA ANALYSIS
Step 1: Define data requirements
Step 2: Data collection
Step 3: Data organisation
Step 4: Data cleaning
FUTURE SCOPE OF DATA ANALYTICS
Retail
Healthcare
Finance
Transportation
To process and store big data:
HDFS [HADOOP DISTRIBUTED FILE SYSTEM]
CHARACTERISTICS
Can store petabytes of data.
Highly scalable.
Runs on commodity x86 servers and open-source s/w.
Splits computation across each server.
Treats failure as inevitable (negligible).
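ARCHITECTURE
[Figure: HDFS architecture. NameNode (holds metadata: name, replicas, ...: /home/...), JobTracker and Secondary NameNode; clients send metadata operations to the NameNode and read from and write to the DataNodes; the DataNodes are grouped in racks and carry out block operations and replication.]
Replication factor: the number of copies of each block.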
HDFS IMPORTANT ASPECTS
Client:
- Interface between the user and the file system (the NameNode and DataNode system).
- Communicates with the NameNode (NN) for the namespace and with the DataNodes (DN) for data access.
NameNode:
- Master server; there is only one NameNode.
- Manages the namespace (unique identification).
- Regulates access to files by users and programs (clients).
- Opening, closing, renaming files and directories.
- Mapping of blocks to DataNodes (assigns blocks).
DataNode:
- One per node in the cluster.
- Manages the storage attached to the node.
- Creating blocks of data, deletion, replication.
- Serves read and write requests from clients.
Secondary NameNode:
- Helper to the NameNode.
- Saves the metadata in case of failure.
- Replica of the metadata storage.
- Status: the heartbeat of each DataNode shows whether it is alive.
DATA STORAGE AND REPLICATION
Files are stored as a sequence of blocks.
All blocks are of the same size except the last block.
DataNodes report their blocks; from the block report, blocks are re-replicated if a DataNode fails or the replication factor changes.
Block size: 64 MB to 128 MB (Hadoop).
[Figure: Hadoop cluster layout. A switch connects three top-of-rack (ToR) switches; the racks hold the NameNode, the JobTracker and the DataNodes, and the DataNodes store the file blocks.]
FILE BLOCKS AND REPLICATION FACTOR
Whenever you import any file into the Hadoop Distributed File System, that file gets divided into blocks of some size, and then these blocks are stored on various slave (data) nodes. By default, these blocks are 64 MB in size in Hadoop 1 and 128 MB in size in Hadoop 2.
Ex: Suppose we have uploaded a file of 400 MB. This file gets divided into blocks of 128 + 128 + 128 + 16 MB.
By default the replication factor is 3 (configurable).
In the above example we have 4 file blocks, which means that 3 replicas (copies) of each block are made, i.e. a total of 4 x 3 = 12 blocks are made.
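The arithmetic of the 400 MB example can be checked with a few lines of Python. This is only a sketch of the calculation; `split_into_blocks` is a made-up helper name and not part of Hadoop.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is divided into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds what is left over.
        blocks.append(remainder)
    return blocks


blocks = split_into_blocks(400)                      # Hadoop 2 block size
replication_factor = 3                               # default from the notes
total_block_copies = len(blocks) * replication_factor

blocks_hadoop1 = split_into_blocks(400, block_size_mb=64)
print(blocks, total_block_copies, len(blocks_hadoop1))
```

With the Hadoop 1 block size of 64 MB the same file would need more, smaller blocks, which is why the block size matters for the amount of metadata the NameNode keeps.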
ADVANTAGE
Fault tolerance: we can make copies of the blocks for backup purposes (across the DataNodes of the cluster).
RACK AWARENESS (reduces n/w traffic)
Rack: a physical collection of nodes in the Hadoop cluster. A Hadoop cluster consists of many racks.
With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing the read or write, which reduces the network traffic.
[Figure: rack awareness. Replicas of blocks B1, B2 and B3 are spread over the DataNodes (slaves) in racks R1, R2 and R3; the NameNode (master) keeps the file name and block metadata.]
Hadoop has some rack awareness policies:
1) There should not be more than 1 replica on the same DataNode.
2) More than two replicas of a single block are not allowed on the same rack.
3) The number of racks used inside the Hadoop cluster must be smaller than the number of replicas.
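The three policies can be written as a small check on one block's replica placement. This is a toy sketch, not Hadoop's own placement code; the function name `placement_ok` and the rack and DataNode names are assumptions made for the example.

```python
from collections import Counter


def placement_ok(replicas, max_per_rack=2):
    """Check the replica placement of ONE block against the three policies.

    replicas: list of (rack, datanode) pairs, one pair per replica.
    """
    per_node = Counter(replicas)
    per_rack = Counter(rack for rack, _ in replicas)

    # 1) Not more than one replica on the same DataNode.
    one_per_node = all(count == 1 for count in per_node.values())
    # 2) Not more than two replicas of the block on the same rack.
    rack_limit = all(count <= max_per_rack for count in per_rack.values())
    # 3) Number of racks used is smaller than the number of replicas.
    fewer_racks = len(per_rack) < len(replicas)

    return one_per_node and rack_limit and fewer_racks


good = placement_ok([("R1", "dn1"), ("R2", "dn3"), ("R2", "dn4")])
same_node = placement_ok([("R1", "dn1"), ("R1", "dn1"), ("R2", "dn3")])
one_rack = placement_ok([("R1", "dn1"), ("R1", "dn2"), ("R1", "dn3")])
print(good, same_node, one_rack)
```

The accepted placement puts one replica on one rack and two on another, so a whole rack can fail without losing the block while most replica traffic stays inside a rack.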
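READ OPERATION IN HDFS
[Figure: HDFS read. The HDFS client, running in the client JVM on the client node, (1) opens the file through DistributedFileSystem, which (2) gets the block locations from the NameNode; the client then (3) reads through FSDataInputStream, which (4), (5) reads from the DataNodes, and finally (6) closes the stream.]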
HDFS works on the streaming data access pattern, which means it supports write-once, read-many access.
Whenever a client sends a request to HDFS to read something from HDFS, access to the data, or to the DataNode where the actual data is stored, is not directly granted to the client, because the client doesn't have information about the data, i.e. on which DataNode the data is stored or where the replicas of the data are made on the DataNodes.
So, that's why the client first sends a request to the NameNode, since the NameNode contains the metadata.
Once the request is received by the NN, it responds and sends all the information: the DataNodes, the location where the replicas are made, the number of replicas, the number of data blocks and their locations.
Now the client can read the data with all this information.
The client reads the data in parallel, since replicas of the same data are available on the cluster. Once the whole data is read, it combines all the blocks into the original file.
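The read flow can be imitated with a toy in-memory model. None of the names below are real Hadoop APIs; the dictionaries `namenode` and `datanodes`, the file path and the block contents are all invented for illustration, and the blocks are read one after another here instead of in parallel.

```python
# Toy model: each DataNode stores some block replicas (hypothetical contents).
datanodes = {
    "dn1": {"blk_1": b"big ", "blk_2": b"data "},
    "dn2": {"blk_1": b"big ", "blk_3": b"notes"},
    "dn3": {"blk_2": b"data ", "blk_3": b"notes"},
}

# NameNode metadata: file -> ordered list of (block id, replica locations).
namenode = {
    "/notes.txt": [
        ("blk_1", ["dn1", "dn2"]),
        ("blk_2", ["dn1", "dn3"]),
        ("blk_3", ["dn2", "dn3"]),
    ]
}


def get_block_locations(path):
    """Step 1: the client asks the NameNode, which only returns metadata."""
    return namenode[path]


def read_file(path, dead=()):
    """Step 2: read every block from a live replica, then combine them."""
    parts = []
    for block_id, locations in get_block_locations(path):
        live = [dn for dn in locations if dn not in dead]
        if not live:
            raise IOError(f"no live replica for {block_id}")
        parts.append(datanodes[live[0]][block_id])
    return b"".join(parts)


content = read_file("/notes.txt")
content_after_failure = read_file("/notes.txt", dead=("dn1",))
print(content, content_after_failure)
```

The NameNode never hands out file data, only locations, and the second call shows why replicas matter: the file is still readable when one DataNode is down.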
HDFS WRITE OPERATION
1. The client interacts with the HDFS NameNode.
To write a file inside HDFS, the client first interacts with the NameNode. The NameNode first checks the client's privileges to write a file.
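[Figure: HDFS write. The HDFS client, running in the client JVM on the client node, (1) creates the file through DistributedFileSystem and the NameNode, then (2) writes through FSDataOutputStream, which keeps a data queue and an ack queue; (4) packets are written along the pipeline of DataNodes and (5) acknowledgements come back.]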
If the client has sufficient privileges and there is no file existing with the same name, the NameNode then creates a record of the new file.
The NameNode then provides the addresses of all DataNodes and a security token.
If the file already exists, then file creation fails and the client receives an exception.
2. The client interacts with the DataNodes.
After receiving the list of the DataNodes and the file write permission, the client starts writing data directly to the first DataNode in the list.
After finishing the writing of data, the DataNode starts making replicas of the blocks on other DataNodes, depending upon the replication factor.
NOTE (IMP)
What happens if a DataNode fails while writing a file in HDFS?
Actions:
- The pipeline gets closed; the packets in the ack queue are then added to the front of the data queue, so that the DataNodes downstream from the failed node do not miss any packet.
- Then the current block on the alive DataNodes gets a new identity.
- The failed DataNode gets removed from the pipeline, and a new pipeline gets constructed from the two alive DataNodes.
- The NameNode observes that the block is under-replicated and arranges for a further copy on another DataNode.
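The queue handling in this recovery can be sketched with two Python deques. This is a simplified model and not the real HDFS client code; the function name `recover_from_failure`, the packet names and the DataNode names are assumptions made for the example.

```python
from collections import deque


def recover_from_failure(data_queue, ack_queue, pipeline, failed):
    """Return (new_data_queue, new_ack_queue, new_pipeline) after a failure."""
    # Packets that were sent but not yet acknowledged go back to the FRONT
    # of the data queue, in their original order, so none of them is lost.
    new_data_queue = deque(list(ack_queue) + list(data_queue))
    # The failed DataNode is dropped; the alive ones form the new pipeline.
    new_pipeline = [dn for dn in pipeline if dn != failed]
    return new_data_queue, deque(), new_pipeline


data_q = deque(["p4", "p5"])   # packets still waiting to be sent
ack_q = deque(["p2", "p3"])    # packets sent, acknowledgement pending

new_data_q, new_ack_q, new_pipeline = recover_from_failure(
    data_q, ack_q, ["dn1", "dn2", "dn3"], failed="dn2"
)

replication_factor = 3
# Fewer live copies than the replication factor: the NameNode would later
# arrange a further copy on another DataNode.
under_replicated = len(new_pipeline) < replication_factor
print(list(new_data_q), new_pipeline, under_replicated)
```

Putting the unacknowledged packets in front keeps the packet order intact, so the two remaining DataNodes receive the block exactly as if the failure had not happened.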