NAME:
ROLLNO:
YEAR/SEM: I YEAR / II SEM
BRANCH: [Link] IN COMPUTER SCIENCE & ENGINEERING
SUBJECT: BIG DATA ANALYTICS LAB
EXERCISE-1:-
AIM:-
Implement the following Data Structures in Java
a) Linked Lists  b) Stacks  c) Queues  d) Set  e) Map
DESCRIPTION:
The java.util package contains all the classes and interfaces of the Collection framework.
Methods of the Collection interface
There are many methods declared in the Collection interface. They are as follows:
No. | Method | Description
1 | public boolean add(Object element) | is used to insert an element in this collection.
2 | public boolean addAll(Collection c) | is used to insert the specified collection elements in the invoking collection.
3 | public boolean remove(Object element) | is used to delete an element from this collection.
4 | public boolean removeAll(Collection c) | is used to delete all the elements of the specified collection from the invoking collection.
5 | public boolean retainAll(Collection c) | is used to delete all the elements of the invoking collection except the specified collection.
6 | public int size() | returns the total number of elements in the collection.
7 | public void clear() | removes all the elements from the collection.
8 | public boolean contains(Object element) | is used to search an element.
9 | public boolean containsAll(Collection c) | is used to search the specified collection in this collection.
10 | public Iterator iterator() | returns an iterator.
11 | public Object[] toArray() | converts the collection into an array.
12 | public boolean isEmpty() | checks if the collection is empty.
The Collection interface is declared as follows:
public interface Collection<E> extends Iterable<E> {
    int size();
    boolean isEmpty();
    boolean contains(Object o);
    Iterator<E> iterator();
    Object[] toArray();
    <T> T[] toArray(T[] a);
    boolean add(E e);
    boolean remove(Object o);
    boolean addAll(Collection<? extends E> c);
    boolean removeAll(Collection<?> c);
    boolean retainAll(Collection<?> c);
    void clear();
    boolean equals(Object o);
    int hashCode();
}
ALGORITHM for All Collection Data Structures:-
Steps for Creation of a Collection
1. Create an Object of Generic Type E, T, K or V.
2. Create a Model class or Plain Old Java Object (POJO) of that type.
3. Generate Setters and Getters.
4. Create a Collection Object of type Set, List, Map or Queue.
5. Add Objects to the collection: boolean add(E e)
6. Add a Collection to the Collection: boolean addAll(Collection c)
7. Remove or retain data from the Collection: removeAll(Collection), retainAll(Collection)
8. Iterate Objects using Enumeration, Iterator or ListIterator: Iterator listIterator()
9. Display Objects from the Collection (a minimal Java sketch follows these steps).
10. END
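The following is a minimal, illustrative Java sketch of these steps. The Employee POJO is simplified (only a few fields with illustrative values; the full attribute list and designations of the sample data set are not reproduced here), and the class and variable names are assumptions for this example only.

import java.util.*;

// Simplified Employee POJO (assumed field names) and the collection types named in the AIM.
class Employee {
    private String id, name, designation, dept;
    public Employee(String id, String name, String designation, String dept) {
        this.id = id; this.name = name; this.designation = designation; this.dept = dept;
    }
    public String getId() { return id; }
    public String getName() { return name; }
    @Override public String toString() { return id + "," + name + "," + designation + "," + dept; }
}

public class CollectionDemo {
    public static void main(String[] args) {
        Employee e1 = new Employee("e100", "james", "asst.prof", "cse");
        Employee e2 = new Employee("e101", "jack", "asst.prof", "cse");

        List<Employee> list = new LinkedList<>();      // a) Linked List
        list.add(e1); list.add(e2);

        Deque<Employee> stack = new ArrayDeque<>();    // b) Stack (LIFO)
        stack.push(e1); stack.push(e2);

        Queue<Employee> queue = new LinkedList<>();    // c) Queue (FIFO)
        queue.offer(e1); queue.offer(e2);

        Set<String> names = new HashSet<>();           // d) Set (no duplicates)
        names.add(e1.getName()); names.add(e2.getName());

        Map<String, Employee> byId = new HashMap<>();  // e) Map (key -> value)
        byId.put(e1.getId(), e1); byId.put(e2.getId(), e2);

        // Iterate and display objects from the collections (steps 8 and 9).
        for (Iterator<Employee> it = list.iterator(); it.hasNext(); ) {
            System.out.println(it.next());
        }
        System.out.println(stack.pop());
        System.out.println(queue.poll());
        System.out.println(names);
        System.out.println(byId.get("e100"));
    }
}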
SAMPLE INPUT:
Sample Employee Data Set:
e100,james,[Link],cse,8000,16000,4000,8.7
e101,jack,[Link],cse,8350,17000,4500,9.2
e102,jane,[Link],cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,[Link],cse,16500,33000,8600,6.9
e106,david,[Link],cse,18000,36000,9500,8.3
e107,daniel,[Link],cse,9400,19000,5000,7.9
e108,ramu,[Link],cse,17000,34000,9000,6.8
e109,rani,[Link],cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3
EXPECTED OUTPUT:-
Prints the information of each employee with all its attributes.
EXERCISE-2:-
AIM:-
i) Perform setting up and installing Hadoop in its three operating modes:
Standalone
Pseudo-Distributed
Fully Distributed
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform, but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development. Windows is only supported as a development platform, and additionally requires Cygwin to run. During the Cygwin installation process, you should include the openssh package if you plan to run Hadoop in pseudo-distributed mode.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-
1. Command for installing ssh is "sudo apt-get install ssh".
2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into authorized keys with the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java by using the command tar xvfz [Link].
5. [Link]
6. Extract Hadoop by using the command tar xvfz [Link].
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in [Link].
8. Export the Java path and Hadoop path in ./bashrc.
9. Check whether the installation is successful by checking the Java version and the Hadoop version.
10. Check whether the Hadoop instance in standalone mode is working correctly by running the built-in Hadoop example jar wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode is installed successfully.
ALGORITHM
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO-DISTRIBUTED MODE:-
1. In order to install pseudo-distributed mode we need to configure the Hadoop configuration files residing in the directory /home/ACEIT/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the Java path.
3. Configure core-site.xml; it contains the property fs.default.name with the value hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh, which start the daemons NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all running daemons. Create a directory in Hadoop by using the command hdfs dfs -mkdir /csedir, enter some data into a text file using the command nano, copy it from the local directory to Hadoop using the command hdfs dfs -copyFromLocal <file> /csedir/, and run the sample jar file wordcount to check whether pseudo-distributed mode is working or not.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
FULLY DISTRIBUTED MODE INSTALLATION:
ALGORITHM
1. Stop all single-node clusters
   $ stop-all.sh
2. Decide one machine as the NameNode (Master) and the remaining ones as DataNodes (Slaves).
3. Copy the public key to all three hosts to get password-less SSH access
   $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub ACEIT@l5sys24
4. Configure all configuration files to name the Master and Slave nodes.
   $ cd $HADOOP_HOME/etc/hadoop
   $ [Link]
   $ [Link]
5. Add hostnames to the file slaves and save it.
   $ nano slaves
6. Configure $ [Link]
7. Do in the Master Node:
   $ hdfs namenode -format
   $ start-dfs.sh
   $ start-yarn.sh
8. Format the NameNode.
9. Daemons start in the Master and Slave nodes.
10. END
INPUT
ubuntu@localhost> jps
OUTPUT:
DataNode, NameNode, SecondaryNameNode, NodeManager, ResourceManager
II) Using Web-Based Tools to Manage Hadoop Set-up
DESCRIPTION
A Hadoop set-up can be managed by different web-based tools, which make it easy for the user to identify the running daemons. A few of the tools used in the real world are:
a) Apache Ambari
b) HortonWorks
c) Apache Spark
LIST OF CLUSTERS IN HADOOP
Apache Hadoop running at localhost
AMBARI Admin Page for managing Hadoop clusters
AMBARI Admin Page for viewing Hadoop MapReduce jobs
HortonWorks tool for managing MapReduce jobs in Apache Pig
Running MapReduce jobs in HortonWorks for a Pig Latin script
EXERCISE-3:-
AIM:-
Implement the following file management tasks in Hadoop:
Adding files and directories
Retrieving files
Deleting files
DESCRIPTION:-
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where the data resides in a network by associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain the data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work similarly to the Linux file commands, and serve as your primary interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include:
Adding files and directories to HDFS
Retrieving files from HDFS to the local filesystem
Deleting files from HDFS
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt /user/chuck
Step-2
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve example.txt, we can run the following command:
hadoop fs -cat example.txt
Step-3
Deleting Files from HDFS
hadoop fs -rm example.txt
Command for creating a directory in HDFS is "hdfs dfs -mkdir /ACEITcse".
Adding a directory is done through the command "hdfs dfs -put ACEIT_english /".
Step-4
Copying Data from NFS to HDFS
Command for copying from a local directory is "hdfs dfs -copyFromLocal /home/ACEIT/Desktop/shakes/glossary /ACEITcse/"
View the file by using the command "hdfs dfs -cat /ACEIT_english/glossary"
Command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
Command for deleting files is "hdfs dfs -rmr /kartheek".
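The same add, retrieve and delete tasks can also be performed programmatically through the Hadoop FileSystem Java API. The sketch below is illustrative only; it assumes a pseudo-distributed cluster at hdfs://localhost:9000 and example paths under /user/chuck.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; matches the pseudo-distributed set-up above.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Adding: create a directory and copy a local file into it.
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("/home/chuck/example.txt"),
                             new Path("/user/chuck/example.txt"));

        // Retrieving: read the file back and print it (like hadoop fs -cat).
        try (FSDataInputStream in = fs.open(new Path("/user/chuck/example.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Deleting: remove the file (second argument = recursive delete).
        fs.delete(new Path("/user/chuck/example.txt"), false);
        fs.close();
    }
}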
SAMPLE INPUT:
Input as any data format: structured, unstructured or semi-structured
EXPECTED OUTPUT:
EXERCISE-4:-
AIM:-
Run a basic WordCount MapReduce program to understand the MapReduce paradigm.
DESCRIPTION:-
MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class org.apache.hadoop.mapreduce.Mapper, which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the WordCount Map task will be a line of text from the input data file, and the key will be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map(key, value) {
    for each word x in value:
        output.collect(x, 1);
}
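A runnable Java rendering of this map pseudo-code might look like the following sketch (class and variable names such as WordCountMapper are illustrative, not prescribed by this manual):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in each input line, mirroring the pseudo-code above.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // output.collect(x, 1) in the pseudo-code
        }
    }
}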
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs as <word, occurrence>.
Pseudo-code
void Reduce(keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
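A matching Java sketch of the reduce pseudo-code (again with illustrative names) is shown below:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted by the mapper for each word, mirroring the pseudo-code above.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);    // final_output.collect(keyword, sum)
    }
}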
Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Job Name: name of this Job
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: class which overrides the "map" function; here, Map.
Reducer Class: class which overrides the "reduce" function; here, Reduce.
Output Key: type of output key; here, Text.
Output Value: type of output value; here, IntWritable.
File Input Path
File Output Path
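A minimal driver sketch, assuming the WordCountMapper and WordCountReducer classes sketched above, could wire these settings together as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires together the mapper and reducer sketched above and sets the
// job name, output types and input/output paths listed in the steps.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // Job Name
        job.setJarByClass(WordCount.class);                           // Executable (Jar) Class
        job.setMapperClass(WordCountMapper.class);                    // Mapper Class
        job.setCombinerClass(WordCountReducer.class);                 // optional combiner
        job.setReducerClass(WordCountReducer.class);                  // Reducer Class
        job.setOutputKeyClass(Text.class);                            // Output Key type
        job.setOutputValueClass(IntWritable.class);                   // Output Value type
        FileInputFormat.addInputPath(job, new Path(args[0]));         // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));       // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}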
INPUT:-
Set of data related to Shakespeare: comedies, glossary, poems
OUTPUT:-
EXERCISE-5:-
AIM:-
Write a MapReduce program that mines weather data.
DESCRIPTION:
Climate change has been attracting a lot of attention for a long time. The adverse effects of the changing climate are being felt in every part of the earth. There are many examples of this, such as rising sea levels, less rainfall, and increasing humidity. The proposed system overcomes some of the issues that occur with other techniques. In this project we use the concepts of Big Data and Hadoop. In the proposed architecture we are able to process offline data stored by the National Climatic Data Centre (NCDC). Through this we are able to find out the maximum temperature and minimum temperature of a year, and to predict the future weather forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month of the particular year to visualize the temperature. Based on the previous years' weather data, the weather of the coming year is predicted.
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Main program
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class org.apache.hadoop.mapreduce.Mapper, which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the Map task will be a line of text from the input data file, and the key will be the line number <line_number, line_of_text>. The Map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map(key, value) {
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map(key, value) {
    for each min_temp x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program will sum up the occurrences of each key into pairs as <key, occurrence>.
Pseudo-code
void Reduce(max_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce(min_temp, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
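The pseudo-code above counts temperature occurrences in the WordCount style; in practice the weather job is usually keyed by year so that the reducer can pick the maximum (or minimum) directly. The sketch below is one possible Java rendering, under the assumption that each input line has already been simplified to "year,temperature" (real NCDC records need their own field parsing); a minimum-temperature job is identical except that it uses Math.min.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits <year, temperature> for each record (assumed "year,temperature" layout).
class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        context.write(new Text(fields[0]),
                      new IntWritable(Integer.parseInt(fields[1].trim())));
    }
}

// Reducer: keeps the maximum temperature seen for each year.
class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new IntWritable(max));
    }
}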
Step-3. Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Job Name: name of this Job
Executable (Jar) Class: the main executable class.
Mapper Class: class which overrides the "map" function.
Reducer Class: class which overrides the "reduce" function.
Output Key: type of output key; here, Text.
Output Value: type of output value; here, IntWritable.
File Input Path
File Output Path
INPUT:-
Set of weather data over the years
OUTPUT:-
EXERCISE-6:-
AIM:-
Write a MapReduce program that implements matrix multiplication.
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix can be represented as a record (i, j, value). As an example let us consider such a matrix and its representation. It is important to understand that this relation is a very inefficient relation if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30 values. But if you consider the above relation, we are storing 30 row_ids, 30 col_ids and 30 values; in other words we are tripling the data. So a natural question arises: why do we need to store it in this format? In practice most matrices are sparse matrices. In sparse matrices not all cells have values, so we don't have to store those cells in the DB. So this turns out to be very efficient for storing such matrices.
MapReduce Logic
The logic is to send the calculation part of each output cell of the result matrix to a reducer. So in matrix multiplication the first cell of the output, (0,0), is the multiplication and summation of the elements from row 0 of matrix A and the elements from column 0 of matrix B. To do the computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of matrix A and column 0 of matrix B. So in this algorithm the output from the map phase should be a <key, value> pair, where the key represents the output cell location, (0,0), (0,1), etc., and the value is the list of all values required for the reducer to do the computation. Let us take as an example the calculation of the value at output cell (0,0). Here we need to collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.
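A compact Java sketch of this per-output-cell approach (not the blocked strategy given in the ALGORITHM below) is shown here; the input layout "A,i,k,value" / "B,k,j,value" and the hard-coded dimensions are assumptions made only for illustration.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: replicate each element of A and B to every output cell that needs it.
class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int I = 2, J = 2;   // rows of A, columns of B (assumed)

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] t = value.toString().split(",");
        if (t[0].equals("A")) {               // A(i,k) is needed by output cells (i, 0..J-1)
            for (int j = 0; j < J; j++)
                ctx.write(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
        } else {                              // B(k,j) is needed by output cells (0..I-1, j)
            for (int i = 0; i < I; i++)
                ctx.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
        }
    }
}

// Reducer: one reduce call per output cell (i,j); pair A(i,k) with B(k,j) on k and sum.
class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text cell, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        Map<Integer, Double> a = new HashMap<>(), b = new HashMap<>();
        for (Text v : values) {
            String[] t = v.toString().split(",");
            (t[0].equals("A") ? a : b).put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
        }
        double sum = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet())
            sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        ctx.write(cell, new Text(Double.toString(sum)));
    }
}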
ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i, j) and each value is the corresponding matrix element value. The output files for the matrix C = A*B are in the same format.
We have the following input parameters:
The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.
In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code, for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse matrices, since we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal. However, our focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps
1. setup()
2.   var NIB = (I-1)/IB + 1
3.   var NKB = (K-1)/KB + 1
4.   var NJB = (J-1)/JB + 1
5. map(key, value)
6.   if from matrix A with key = (i,k) and value = a(i,k)
7.     for 0 <= jb < NJB
8.       emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9.   if from matrix B with key = (k,j) and value = b(k,j)
10.    for 0 <= ib < NIB
         emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order, first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
reduce(key, valueList)
17. if key is (ib, kb, jb, 0)
18.   // Save the A block.
19.   sib = ib
20.   skb = kb
21.   Zero matrix A
22.   for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24.   if ib != sib or kb != skb return   // A[ib,kb] must be zero!
25.   // Build the B block.
26.   Zero matrix B
27.   for each value = (k, j, v) in valueList: B(k,j) = v
28.   // Multiply the blocks and emit the result.
29.   ibase = ib*IB
30.   jbase = jb*JB
31.   for 0 <= i < row dimension of A
32.     for 0 <= j < column dimension of B
33.       sum = 0
34.       for 0 <= k < column dimension of A = row dimension of B
            sum += A(i,k) * B(k,j)
35.       if sum != 0 emit (ibase+i, jbase+j), sum
INPUT:-
Sets of data over different clusters are taken as rows and columns
OUTPUT:-
EXERCISE-7:-
AIM:-
Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.
DESCRIPTION
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using User Defined Functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
Pig Latin is procedural and fits very naturally in the pipeline paradigm, while SQL is instead declarative. In SQL, users can specify that data from two tables must be joined, but not what join implementation to use (you can specify the implementation of JOIN in SQL, thus "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.
ALGORITHM
STEPS FOR INSTALLING APACHE PIG
1) Download the Apache Pig tarball and move it to the home directory.
2) Set the environment of PIG in the bashrc file.
3) Pig can run in two modes: Local Mode and Hadoop Mode
   pig -x local and pig
4) Grunt Shell
   grunt>
5) LOADING data into the Grunt shell
   DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) as (ATTRIBUTE : DataType1, ATTRIBUTE : DataType2, ...)
6) Describe data
   DESCRIBE DATA;
7) DUMP data
   DUMP DATA;
8) FILTER data
   FDATA = FILTER DATA BY ATTRIBUTE = VALUE;
9) GROUP data
   GDATA = GROUP DATA BY ATTRIBUTE;
10) Iterating data
   FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN, ATTRIBUTE = <VALUE>
11) Sorting data
   SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;
12) LIMIT data
   LIMIT_DATA = LIMIT DATA COUNT;
13) JOIN data
   JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN)
INPUT:
Input as website click count data
OUTPUT:
EXERCISE-8:-
AIM:-
Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.
DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
ALGORITHM:
APACHE HIVE INSTALLATION STEPS
1) Install MySQL Server
   sudo apt-get install mysql-server
2) Configure the MySQL user name and password
3) Create a user and grant all privileges
   mysql -u root -proot
   CREATE USER <USER_NAME> IDENTIFIED BY <PASSWORD>
4) Extract and configure Apache Hive
   tar xvfz [Link]
5) Move Apache Hive from the local directory to the home directory
6) Set CLASSPATH in bashrc
   export HIVE_HOME=/home/apache-hive
   export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-site.xml by adding the MySQL server credentials
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>
8) Copy the MySQL connector jar to the $HIVE_HOME/lib directory.
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping a Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/[Link]' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping a View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...
Dropping a View
Syntax:
DROP VIEW view_name
Functions in HIVE
String Functions: round(), ceil(), substr(), upper(), reg_exp() etc.
Date and Time Functions: year(), month(), day(), to_date() etc.
Aggregate Functions: sum(), min(), max(), count(), avg() etc.
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
  [ROW FORMAT ...] STORED AS ...
  | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating an Index
CREATE INDEX index_ip ON TABLE log_data(ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
Altering and Inserting an Index
ALTER INDEX index_ip_address ON log_data REBUILD;
Storing Index Data in the Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping an Index
DROP INDEX INDEX_NAME ON TABLE_NAME;
INPUT
Input as web server log data
OUTPUT
EXERCISE-10:-
AIM: Write a program to implement combining and partitioning in Hadoop, i.e. to implement a custom Partitioner and Combiner.
DESCRIPTION:
COMBINERS:
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. One can think of combiners as mini-reducers that take place on the output of the mappers, prior to the shuffle and sort phase. Each combiner operates in isolation and therefore does not have access to intermediate output from other mappers. The combiner is provided keys and values associated with each key (the same types as the mapper output keys and values). Critically, one cannot assume that a combiner will have the opportunity to process all values associated with the same key. The combiner can emit any number of key-value pairs, but the keys and values must be of the same type as the mapper output (same as the reducer input). In cases where an operation is both associative and commutative (e.g., addition or multiplication), reducers can directly serve as combiners.
PARTITIONERS:
A common misconception of first-time MapReduce users is that only a single reducer is used. This constraint is nonsense: using more than one reducer is most of the time necessary, or else the map/reduce concept would not be very useful. With multiple reducers, we need some way to determine the appropriate one to send a (key/value) pair outputted by a mapper to; the default is to hash the key to determine the reducer. The partitioning phase takes place after the map phase and before the reduce phase, where the data gets partitioned across the reducers according to the partitioning function. This approach improves the overall performance and allows mappers to operate completely independently. For all of their emitted key/value pairs, each mapper determines which reducer will receive them. Because all the mappers are using the same partitioning for any key, regardless of which mapper instance generated it, the destination partition is the same. Hadoop uses an interface called Partitioner to determine which partition a key/value pair will go to. A single partition refers to all key/value pairs that will be sent to a single reduce task. You can configure the number of reducers in a job driver by setting the number of reducers on the job object (job.setNumReduceTasks). Hadoop comes with a default partitioner implementation, HashPartitioner, which hashes a record's key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job. When the map function starts producing output, it is not simply written to disk. Each map task has a circular memory buffer that it writes the output to. When the contents of the buffer reach a certain threshold size, a background thread will start to spill the contents to disk. Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer fills up during this time, the map will block until the spill is complete. Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort.
ALGORITHM:
COMBINING
1) Divide the data source (the data files) into fragments or blocks, which are sent to a mapper. These are known as input splits.
2) These splits are further divided into records, and these records are provided one at a time to the mapper for processing. This is achieved through a class called RecordReader.
   Create a class and extend from the TextInputFormat class to create our own NLinesInputFormat.
   Then create our own RecordReader class called NLinesRecordReader, where we will implement the logic of feeding 3 lines/records at a time.
   Make a change in the driver program to use the new NLinesInputFormat class.
   To prove that we are really getting 3 lines at a time, instead of actually counting words (which we already know how to do), emit the number of lines we get in the input at a time as the key and 1 as the value, which after going through the reducer will give the frequency of each unique number of lines given to the mappers.
PARTITIONING
1. First, the key appearing most often will be sent to one partition.
2. Second, all other keys will be sent to partitions according to their hashCode().
3. Now suppose your hashCode() method does not uniformly distribute the other keys' data over the partition range. Then the data is not evenly distributed across partitions and reducers: some partitions will have more data than others, and the other reducers will wait for the one reducer (the one with the user-defined keys) because of the work load it carries. A sketch of such a custom Partitioner is given after these steps.
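A minimal Java sketch of such a custom Partitioner is given below; the "hot" key, the class names and the reducer count are assumptions for illustration only. The combiner and the partitioner are plugged into the job driver with job.setCombinerClass(...), job.setPartitionerClass(...) and job.setNumReduceTasks(...), which is what produces the part-r-* files shown in the OUTPUT below.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the idea in the steps above: one "hot" key gets its own partition,
// all other keys are spread over the remaining partitions by hashCode().
public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "the";   // assumed frequently occurring key

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) return 0;
        if (key.toString().equals(HOT_KEY)) {
            return 0;                               // hot key -> its own reducer
        }
        // remaining keys -> partitions 1 .. numPartitions-1
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// In the job driver (illustrative, reusing the WordCount classes sketched earlier):
//   job.setCombinerClass(WordCountReducer.class);   // combiner = mini-reducer on map output
//   job.setPartitionerClass(HotKeyPartitioner.class);
//   job.setNumReduceTasks(3);                       // produces part-r-00000 .. part-r-00002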
INPUT:
Datasets from different sources as input
OUTPUT:
hduser@ACEIT-3446:/usr/local/hadoop/bin$ hadoop fs -ls /partitionerOutput/
14/12/01 [Link] WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hduser supergroup    0  2014-12-01 17:49  /partitionerOutput/_SUCCESS
-rw-r--r--   1 hduser supergroup   10  2014-12-01 17:48  /partitionerOutput/part-r-00000
-rw-r--r--   1 hduser supergroup   10  2014-12-01 17:48  /partitionerOutput/part-r-00001
-rw-r--r--   1 hduser supergroup    9  2014-12-01 17:49  /partitionerOutput/part-r-00002