BDA Lab Manual R22

The document is a Big Data Analytics Lab Manual for Jawaharlal Nehru Technological University Hyderabad, outlining course objectives, outcomes, and a list of experiments involving Hadoop, R, and data analytics tools. It includes detailed instructions for installing Hadoop on Windows 10, configuring environment variables, and running a simple Word Count MapReduce program. Additionally, it covers processing big data in HBase and provides references for further reading on big data analytics.


BIG DATA ANALYTICS LAB MANUAL

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD


III Year B.Tech.CSE. II – Sem L T P C
Course Code: 0 0 2 1
BIG DATA ANALYTICS LAB MANUAL
Course Objectives
1. The purpose of this course is to provide the students with the knowledge of Big data
Analytics principles and techniques.
2. This course is also designed to give exposure to the frontiers of Big Data Analytics.

Course Outcomes
1. Use Excel as an analytical and visualization tool.
2. Ability to program using Hadoop and MapReduce.
3. Ability to perform data analytics using ML in R.
4. Use Cassandra to perform social media analytics.

List of Experiments
1. Create a Hadoop cluster
2. Implement a simple map-reduce job that builds an inverted index on the set of input
documents (Hadoop)
3. Process big data in HBase
4. Store and retrieve data in Pig
5. Perform data analysis using MongoDB
6. Using Power Pivot (Excel) Perform the following on any dataset
a) Big Data Analytics
b) Big Data Charting
7. Use R-Project to carry out statistical analysis of big data
8. Use R-Project for data visualization
TEXT BOOKS:
1. Big Data Analytics, Seema Acharya, Subhashini Chellappan, Wiley 2015.
2. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's
Business, Michael Minelli, Michele Chambers, Ambiga Dhiraj, 1st Edition, Wiley CIO
Series, 2013.
3. Hadoop: The Definitive Guide, Tom White, 3rd Edition, O'Reilly Media, 2012.
4. Big Data Analytics: Disruptive Technologies for Changing the Game, Arvind Sathi, 1st
Edition,
IBM Corporation, 2012.
REFERENCES:
1. Big Data and Business Analytics, Jay Liebowitz, Auerbach Publications, CRC press (2013).
2. Using R to Unlock the Value of Big Data: Big Data Analytics with Oracle R Enterprise
and Oracle R Connector for Hadoop, Tom Plunkett, Mark Hornick, McGraw-Hill/Osborne
Media (2013), Oracle press.
3. Professional Hadoop Solutions, Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich,
Wiley, ISBN: 9788126551071, 2015.
4. Understanding Big Data, Chris Eaton, Dirk deRoos et al., McGraw Hill, 2012.
5. Intelligent Data Analysis, Michael Berthold, David J. Hand, Springer, 2007.
PROGRAM 1
INSTALLING HADOOP IN WINDOWS 10

Preparations
A. Make sure that you are using Windows 10 and are logged in as admin.

B. Download Java jdk1.8.0 from


https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

C. Accept Licence Agreement [1] and download the exe-file [2]

D. Download Hadoop 2.8.0 from


http://archive.apache.org/dist/hadoop/core//hadoop-2.8.0/hadoop-2.8.0.tar.gz

E. Download Notepad++ from


https://notepad-plus-plus.org (current version for Windows)
F. Navigate to C:\ [1], make a New folder [2] and name it Java [3]

G. Run the Java installation file jdk-8u191-windows-x64. Install it directly into the folder C:\Java, or
move the items from the folder jdk1.8.0 to the folder C:\Java. It should look like this:

H. Extract Hadoop 2.8.0 right under C:\ (so that you get the folder C:\hadoop-2.8.0) like this:

I. Install Notepad++ anywhere


Setup Environment variables
A. Use the search-function to find the environment variables.

In System properties, click the button Environment Variables...

A new window will open with two tables and buttons. The upper table is for User variables
and the lower for System variables.
B. Make a New User variable [1] named JAVA_HOME and set it to the Java bin-folder. Make another
New User variable, name it HADOOP_HOME, and set it to the hadoop-2.8.0 bin-folder [2]. Click OK [3].
C. Now add Java and Hadoop to the System variables path: Go to Path [1] and click Edit [2]. The
editor window opens. Choose New [3] and add the address C:\Java\bin [4]. Choose New again
[5] and add the address C:\hadoop-2.8.0\bin [6]. Click OK [7] in the editor window and OK
[8] to save the System variables.
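
Before moving on, you can sanity-check the variables from a new Command Prompt (a quick check, not part of the original guide; hadoop version will only respond once the configuration below is completed):

echo %JAVA_HOME%
echo %HADOOP_HOME%
javac -version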
Configuration
A. Go to the file C:\hadoop-2.8.0\etc\hadoop\core-site.xml [1]. Right-click on the file
and edit it with Notepad++ [2].

B. At the end of the file you will find two configuration tags.


<configuration>
</configuration>
Paste the following code between the two tags and save (spacing doesn’t matter):

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

It should look like this in Notepad++:


C. Rename C:\Hadoop-2.8.0\etc\hadoop\mapred-site.xml.template to mapred-site.xml and
edit this file with Notepad++. Paste the following code between the configuration tags and
save:

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

D. Under C:\Hadoop-2.8.0 create a folder named data [1] with two subfolders, “datanode” and
“namenode” [2].

E. Edit the file C:\hadoop-2.8.0\etc\hadoop\hdfs-site.xml with Notepad++. Paste the following
code between the configuration tags and save:

<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
F. Edit the file C:\Hadoop-2.8.0\etc\hadoop\yarn-site.xml with Notepad++. Paste the following
code between the configuration tags and save:

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

G. Edit the file C:\hadoop-2.8.0\etc\hadoop\hadoop-env.cmd with Notepad++.


Write @rem in front of “set JAVA_HOME=%JAVA_HOME%”.
Write set JAVA_HOME=C:\Java at the next row.

It should look like this in Notepad++:
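
The screenshot is not reproduced here; after the edit, the two relevant lines in hadoop-env.cmd read:

@rem set JAVA_HOME=%JAVA_HOME%
set JAVA_HOME=C:\Java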

Don’t forget to save.


Bravo, configuration done!
Replace the bin folder
Before we can start testing we must replace a folder in the Hadoop installation.

A. Download Hadoop Configuration.zip


https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob
/master/Hadoop%20Configuration.zip and unzip it to get the new bin folder.

B. Delete the bin folder C:\hadoop-2.8.0\bin [1, 2] and replace it with the new
bin folder from Hadoop Configuration.zip [3].

Testing

A. Search for cmd [1] and open the Command Prompt [2]. Type
hdfs namenode -format [3] and press Enter.

If this first test works, the Command Prompt will print a lot of information. That is a good sign!

B. Now you must change directory in the Command Prompt. Type cd C:\hadoop-2.8.0\sbin and
press Enter. In the sbin folder, type start-all.cmd and press Enter.
If the configuration is right, four apps will start running and it will look something like this:

C. Now open a browser, type localhost:8088 in the address field and press Enter.
Can you see the little Hadoop elephant? Then you have done a really good job!

D. Last test - try to write localhost:50070 instead.


If you can see the overview you have implemented Hadoop on your PC.
Congratulations, you did it!!!

***********************
To close the running programs, run "stop-all.cmd" in the command prompt.
1. Write the steps for Hadoop installation in Windows 10

The step-by-step method


The step-by-step method uses headlines and letters to keep track of the successive flow. It has five main
steps with substeps labeled by letters. The substeps are sometimes illustrated, and the illustrations are often
numbered in order to show the linear process in detail.

Preparations
A. Check that you are logged in as admin.
B. Download Java jdk1.8.0
C. Download Hadoop 2.8.0
D. Download Notepad++
E. Create a folder C:\Java
F. Install Java jdk1.8.0 in the folder C:\Java
G. Install Hadoop 2.8.0 right under C:\
H. Install Notepad++
I. If the Windows firewall is activated open ports 8088 and 50070

Setup Environment variables


A. Set the JAVA_HOME Environment variable to Java bin-folder.
B. Set the HADOOP_HOME Environment variable to Hadoop bin-folder.
C. Add Java and Hadoop to the bin directory path.

Configuration
A. Edit the file C:\Hadoop\Hadoop-2.8.0\etc\hadoop\core-site.xml.
B. Rename C:\Hadoop-2.8.0\etc\hadoop\mapred-site.xml.template to mapred-site.xml and
edit.
C. Under C:\Hadoop-2.8.0, create a folder named "data" with subfolders "datanode" and
“namenode”.
D. Edit the file C:\Hadoop-2.8.0\etc\hadoop\hdfs-site.xml.
E. Edit the file C:\Hadoop-2.8.0\etc\hadoop\yarn-site.xml.
F. Edit the file C:\Hadoop-2.8.0\etc\hadoop\hadoop-env.cmd

Replace the bin folder


A. Download Hadoop Configuration.zip and extract the bin folder.
B. Replace C:\Hadoop-2.8.0\bin with the bin folder from Hadoop
Configuration.zip.
Testing
A. Run the command "hdfs namenode -format".
B. Change directory in cmd: cd C:\hadoop-2.8.0\sbin and run start-all.cmd.
C. Open browser and write localhost:8088
D. Open browser and write localhost:50070
PROGRAM 2:
Write a program for a simple Word Count MapReduce job with Hadoop using Java in
Windows 10

Aim: To write a program for a simple Word Count MapReduce job with Hadoop using Java in
Windows 10

Prerequisites:
This program deals with a basic MapReduce program using the Apache Hadoop framework
on Windows computers. The Hadoop framework installed is Pseudo-Distributed (single node).

This tutorial deals with running the MapReduce program on Windows. The Hadoop single-node
framework and the JDK with Eclipse are already installed.

Java JDK version "1.8.0_291" and Hadoop version "3.3.0" are used here; the steps are
similar for other versions as well.
Create a text file with some content for the word count.
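
For illustration (this sample file is an assumption, not taken from the original manual), the text file could contain:

input.txt:
hello hadoop
hello world
hadoop runs mapreduce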

Procedure:
Initially, open the Eclipse IDE and create a new project. Here the project name is
map_reduce_example.

Now use Configure Build Path in your project and add the following external Hadoop jar files
to the project as shown below.

After successfully adding the jar files, Eclipse will show the items under Referenced Libraries, as
shown below.
Next, add source code for the word-count program.

Copy the code below into the WordCount.java class.

Program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

After adding source code, create a jar file of map_reduce_example using the export option in
eclipse IDE.

1. Open cmd prompt with administrator status.


2. Move to sbin folder of Hadoop using cd command (cd C:\hadoop-3.3.0\sbin)
3. Start the daemons by giving the command start-all, or better, use (start-dfs followed by start-yarn)
for specific initialization

4. After starting, check if all the daemons (namenode, datanode, resourcemanager,
nodemanager) are working using the command (jps)
5. Make an input directory in the Hadoop cluster using the command (hadoop fs -mkdir
/input_directory).
6. Now add the required text files to the input directory using the command (hadoop fs -
put file_path/file_name.txt /input_directory) as shown below. test_wordcnt and
test_wordcnt_2 are my input files containing some words separated by spaces.

7. Now run the map_reduce jar file exported previously using the Hadoop
command (hadoop jar jarpath/jar_name.jar /input_directory
/output_directory) --- leave a space between the input directory and output
directory as shown.

8. Finally check the output using the command (hadoop fs -cat
/output_dir/*). By this, we have successfully executed the word count
MapReduce program in Windows.
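
As a consolidated sketch of steps 2-8 (the directory names, jar path and counts below are hypothetical examples, not taken from the original screenshots), the command sequence could look like this:

cd C:\hadoop-3.3.0\sbin
start-dfs
start-yarn
jps
hadoop fs -mkdir /input_directory
hadoop fs -put C:\data\input.txt /input_directory
hadoop jar C:\jars\map_reduce_example.jar /input_directory /output_directory
hadoop fs -cat /output_directory/*

(If the jar was exported without a main class in its manifest, add the class name, e.g. WordCount, right after the jar path.) For the sample input file shown earlier, the word counts printed by the last command would resemble:

hadoop	2
hello	2
mapreduce	1
runs	1
world	1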
Output:

Experiment 3. Process big data in HBase

Aim: To create a table and process big data in HBase

Resources: Hadoop, Oracle VirtualBox, HBase

Theory:
HBase is an open-source, sorted-map data store built on Hadoop. It is column-oriented and
horizontally scalable.
It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is
well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs
enabling development in practically any programming language. It is a part of the Hadoop
ecosystem that provides random real-time read/write access to data in the Hadoop File System.
Compared with an RDBMS:
 RDBMSs get exponentially slower as the data becomes large
 An RDBMS expects data to be highly structured, i.e. able to fit in a well-defined schema
 Any change in schema might require a downtime
 For sparse datasets, there is too much overhead in maintaining NULL values

Features of HBase
 Horizontally scalable: You can add any number of columns anytime.
 Automatic failover: Automatic failover is a resource that allows a system administrator to
automatically switch data handling to a standby system in the event of a system compromise.
 Integration with the MapReduce framework: All the commands and Java code internally
implement MapReduce to do the task, and it is built over the Hadoop Distributed File
System.
 It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by
row key, column key, and timestamp.
 It is often referred to as a key-value store or column-family-oriented database, or as storing versioned
maps of maps.
 Fundamentally, it's a platform for storing and retrieving data with random access.
 It doesn't care about datatypes (storing an integer in one row and a string in another for
the same column).
 It doesn't enforce relationships within your data.
 It is designed to run on a cluster of computers, built using commodity hardware.

Cloudera VM is recommended as it has HBase preinstalled on it.

Starting HBase: Type hbase shell in the terminal to start the HBase shell.

HBase commands
Step 1: First go to the terminal and type StartCDH.sh
Step 2: Next type the jps command in the terminal

Step 3: Type hbase shell

Step 4: hbase(main):001:0> list

list will give you the list of tables in HBase

Step 5: hbase(main):001:0> version

version will give you the version of HBase


Create Table Syntax

create 'name_space:table_name', 'column_family'

hbase(main):011:0> create 'newtbl','knowledge'
hbase(main):011:0> describe 'newtbl'
hbase(main):011:0> status
1 servers, 0 dead, 15.0000 average load

HBase – Using PUT to Insert data to Table


To insert data into an HBase table, use the PUT command; this is similar to an insert statement
in an RDBMS, but the syntax is completely different. Below we describe how to insert data
into an HBase table with examples using the PUT command from the HBase shell.

HBase PUT command syntax


Below is the syntax of PUT command which is used to insert data (rows and columns) into a
HBase table.

put '<name_space:table_name>', '<row_key>', '<cf:column_name>', '<value>'
hbase(main):015:0> put 'newtbl','r1','knowledge:sports','cricket'
0 row(s) in 0.0150 seconds

hbase(main):016:0> put 'newtbl','r1','knowledge:science','chemistry'


0 row(s) in 0.0040 seconds

hbase(main):017:0> put 'newtbl','r1','knowledge:science','physics'


0 row(s) in 0.0030 seconds

hbase(main):018:0> put 'newtbl','r2','knowledge:economics','macroeconomics'


0 row(s) in 0.0030 seconds

hbase(main):019:0> put 'newtbl','r2','knowledge:music','songs'


0 row(s) in 0.0170 seconds
hbase(main):020:0> scan 'newtbl'
ROW                COLUMN+CELL
 r1                column=knowledge:science, timestamp=1678807827189, value=physics
 r1                column=knowledge:sports, timestamp=1678807791753, value=cricket
 r2                column=knowledge:economics, timestamp=1678807854590, value=macroeconomics
 r2                column=knowledge:music, timestamp=1678807877340, value=songs
2 row(s) in 0.0250 seconds

To retrieve only the row r1 data:

hbase(main):023:0> get 'newtbl', 'r1'


output
COLUMN CELL
knowledge:science timestamp=1678807827189, value=physics
knowledge:sports timestamp=1678807791753, value=cricket
2 row(s) in 0.0150 seconds.
hbase(main):025:0> disable 'newtbl'
0 row(s) in 1.2760 seconds

Verification
After disabling the table, you can still verify its existence
through the list and exists commands. You cannot scan it. It will give you the following
error.
hbase(main):028:0> scan 'newtbl'
ROW COLUMN + CELL
ERROR: newtbl is disabled.

is_disabled
This command is used to find whether a table is disabled. Its syntax is as follows.
hbase> is_disabled 'table name'

hbase(main):031:0> is_disabled 'newtbl'


true
0 row(s) in 0.0440 seconds

disable_all
This command is used to disable all the tables matching the given regex. The syntax
for disable_all command is given below.
hbase> disable_all 'r.*'

Suppose there are 5 tables in HBase, namely raja, rajani, rajendra, rajesh, and raju. The
following code will disable all the tables starting with raj.
hbase(main):002:07> disable_all 'raj.*'
raja
rajani
rajendra
rajesh
raju
Disable the above 5 tables (y/n)?
y
5 tables successfully disabled

Enabling a Table using HBase Shell


Syntax to enable a table:
enable 'newtbl'
Example
Given below is an example to enable a table.

hbase(main):005:0> enable 'newtbl'


0 row(s) in 0.4580 seconds

Verification
After enabling the table, scan it. If you can see the schema, your table is successfully
enabled.

hbase(main):006:0> scan 'newtbl'

is_enabled
This command is used to find whether a table is enabled. Its syntax is as follows:
hbase> is_enabled 'table name'

The following code verifies whether the table named newtbl is enabled. If it is enabled, it
will return true and if not, it will return false.
hbase(main):031:0> is_enabled 'newtbl'
true
0 row(s) in 0.0440 seconds

describe
This command returns the description of the table. Its syntax is as follows:
hbase> describe 'table name'

hbase(main):006:0> describe 'newtbl'


DESCRIPTION                                                        ENABLED

Experiment 4: Store and retrieve data in Pig

Aim: To perform storing and retrieval of big data using Apache Pig
Resources: Apache Pig
Theory:
Pig is a platform that works with large data sets for the purpose of analysis. The
Pig dialect is called Pig Latin, and the Pig Latin commands get compiled into
MapReduce jobs that can be run on a suitable platform, like Hadoop.
Apache Pig is a platform for analyzing large data sets that consists of a high-
level language for expressing data analysis programs, coupled with
infrastructure for evaluating these programs. The salient property of Pig
programs is that their structure is amenable to substantial parallelization, which
in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that
produces sequences of Map-Reduce programs, for which large-scale parallel
implementations already exist (e.g., the Hadoop subproject). Pig's language
layer currently consists of a textual language called Pig Latin, which has the
following key properties:

 Ease of programming. It is trivial to achieve parallel execution of
simple, "embarrassingly parallel" data analysis tasks. Complex tasks
comprised of multiple interrelated data transformations are explicitly
encoded as data flow sequences, making them easy to write, understand,
and maintain.
 Optimization opportunities. The way in which tasks are encoded
permits the system to optimize their execution automatically, allowing
the user to focus on semantics rather than efficiency.
 Extensibility. Users can create their own functions to do special-purpose
processing.

Pig Latin – Relational Operations
The following table describes the relational operators of Pig Latin.

Operator              Description

Loading and Storing
LOAD                  To load the data from the file system (local/HDFS) into a relation.
STORE                 To save a relation to the file system (local/HDFS).

Filtering
FILTER                To remove unwanted rows from a relation.
DISTINCT              To remove duplicate rows from a relation.
FOREACH, GENERATE     To generate data transformations based on columns of data.
STREAM                To transform a relation using an external program.

Grouping and Joining
JOIN                  To join two or more relations.
COGROUP               To group the data in two or more relations.
GROUP                 To group the data in a single relation.
CROSS                 To create the cross product of two or more relations.

Sorting
ORDER                 To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT                 To get a limited number of tuples from a relation.

Combining and Splitting
UNION                 To combine two or more relations into a single relation.
SPLIT                 To split a single relation into two or more relations.

Diagnostic Operators
DUMP                  To print the contents of a relation on the console.
DESCRIBE              To describe the schema of a relation.
EXPLAIN               To view the logical, physical, or MapReduce execution plans to compute a relation.
ILLUSTRATE            To view the step-by-step execution of a series of statements.

For the given Student dataset and Employee dataset, perform Relational
operations like Loading, Storing and Diagnostic Operations (Dump, Describe,
Illustrate & Explain) in the Hadoop Pig framework using Cloudera.
Student ID First Name Age City CGPA
001 Jagruthi 21 Hyderabad 9.1
002 Praneeth 22 Chennai 8.6
003 Sujith 22 Mumbai 7.8
004 Sreeja 21 Bengaluru 9.2
005 Mahesh 24 Hyderabad 8.8
006 Rohit 22 Chennai 7.8
007 Sindhu 23 Mumbai 8.3

Employee ID Name      Age City
001         Angelina  22  LosAngeles
002         Jackie    23  Beijing
003         Deepika   22  Mumbai
004         Pawan     24  Hyderabad
005         Rajani    21  Chennai
006         Amitabh   22  Mumbai

Step-1: Create a directory in HDFS with the name pigdir in the required path using mkdir:
$ hdfs dfs -mkdir /bdalab/pigdir
Step-2: The input file of Pig contains each tuple/record on an individual line, with the entities
separated by a delimiter (",").
In the local file system, create an input file student_data.txt containing the data shown below.
Similarly, create an input file employee_data.txt containing the data shown below.

student_data.txt:
001,Jagruthi,21,Hyderabad,9.1
002,Praneeth,22,Chennai,8.6
003,Sujith,22,Mumbai,7.8
004,Sreeja,21,Bengaluru,9.2
005,Mahesh,24,Hyderabad,8.8
006,Rohit,22,Chennai,7.8
007,Sindhu,23,Mumbai,8.3

employee_data.txt:
001,Angelina,22,LosAngeles
002,Jackie,23,Beijing
003,Deepika,22,Mumbai
004,Pawan,24,Hyderabad
005,Rajani,21,Chennai
006,Amitabh,22,Mumbai

Step-3: Move the files from the local file system to HDFS using the put (or) copyFromLocal
command and verify using the -cat command.
To get the path of the file student_data.txt, type the command below:
readlink -f student_data.txt
$ hdfs dfs -put /home/hadoop/Desktop/student_data.txt /bdalab/pigdir/
$ hdfs dfs -cat /bdalab/pigdir/student_data.txt

$ hdfs dfs -put /home/hadoop/Desktop/employee_data.txt /bdalab/pigdir/
Step-4: Apply Relational Operator – LOAD to load the data from the file student_data.txt
into Pig by executing the following Pig Latin statement in the Grunt shell.
Relational Operators are NOT case sensitive.
$ pig => will take you to the grunt> shell
grunt> student = LOAD '/bdalab/pigdir/student_data.txt' USING PigStorage(',') as (
id:int, name:chararray, age:int, city:chararray, cgpa:double );
grunt> employee = LOAD '/bdalab/pigdir/employee_data.txt' USING
PigStorage(',') as ( id:int, name:chararray, age:int, city:chararray );

Step-5: Apply Relational Operator – STORE to store the relation in the HDFS directory
"/pig_output/" as shown below.

grunt> STORE student INTO '/bdalab/pigdir/pig_output/' USING PigStorage(',');

grunt> STORE employee INTO '/bdalab/pigdir/pig_output/' USING PigStorage(',');

Step-6: Verify the stored data as shown below

$ hdfs dfs -ls /bdalab/pigdir/pig_output/

$ hdfs dfs -cat /bdalab/pigdir/pig_output/part-m-00000

Step-7: Apply Relational Operator – Diagnostic Operator – DUMP to print the contents of
the relation.

grunt> Dump student

grunt> Dump employee

Step-8: Apply Relational Operator – Diagnostic Operator – DESCRIBE to view the schema
of a relation.
grunt> Describe student
grunt> Describe employee

Step-9: Apply Relational Operator – Diagnostic Operator – EXPLAIN to display the logical,
physical, and MapReduce execution plans of a relation using the Explain operator.
grunt> Explain student
grunt> Explain employee

Step-10: Apply Relational Operator – Diagnostic Operator – ILLUSTRATE to give the step-
by-step execution of a sequence of statements.
grunt> Illustrate student
grunt> Illustrate employee
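
The operator table above also lists FILTER, GROUP and ORDER, which the steps do not exercise. As a minimal sketch (not part of the original procedure, assuming the student relation loaded in Step-4), they could be tried in the Grunt shell as follows:

grunt> hyd_students = FILTER student BY city == 'Hyderabad';
grunt> by_city = GROUP student BY city;
grunt> avg_cgpa = FOREACH by_city GENERATE group, AVG(student.cgpa);
grunt> sorted = ORDER avg_cgpa BY $1 DESC;
grunt> Dump sorted;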
PROGRAM 5:
Perform data analysis using MongoDB
MongoDB-
MongoDB is a source-available, cross-platform, document-oriented
database program. Classified as a NoSQL database product, MongoDB uses JSON-like
documents with optional schemas.

MongoDB with Hadoop


1. Set Up the Environment:
• Ensure that MongoDB, Hadoop, and the MongoDB Hadoop Connector are installed
and configured properly.
2. Import Data from MongoDB to Hadoop:
• Use the MongoDB Hadoop Connector to transfer data from MongoDB to Hadoop
Distributed File System (HDFS).
• For example, the connector enables you to read data stored in MongoDB collections
into Hadoop's MapReduce jobs or Apache Spark applications.
3. Process Data in Hadoop:
• You can write MapReduce jobs or use frameworks like Apache Spark to process the
imported data within Hadoop.
• Tools like Hive or Pig can also be used to perform SQL-like queries or
transformations on the data in HDFS.
4. Export Results Back to MongoDB (if needed):
• Once the data is processed, the results can be written back into MongoDB for further
use or visualization.

Example of Using MongoDB Hadoop Connector


Here is an example process:
• Configure the Hadoop job to use MongoInputFormat to read data from MongoDB and
MongoOutputFormat to write results back.
• Develop a MapReduce job to perform the required analysis.

Tools and Technologies Involved


• MongoDB: As the primary data source.
• Hadoop: For distributed data processing.
• MongoDB Hadoop Connector: To bridge MongoDB and Hadoop.
• Apache Spark (optional): For more efficient in-memory processing compared to
MapReduce.
• Hive/Pig (optional): For SQL-like querying in Hadoop.
Example Code: MongoDB to Hadoop MapReduce
1. Setup MongoDB Hadoop Connector Configuration
Ensure that the MongoDB Hadoop Connector jar is included in your project. For Maven, add the
dependency:
<dependency>
<groupId>org.mongodb.hadoop</groupId>
<artifactId>mongo-hadoop-core</artifactId>
<version>2.0.2</version>
</dependency>

2. Input MongoDB Collection


Let's assume we have a MongoDB collection named sales with documents structured like this:
{
"_id": ObjectId("..."),
"item": "Laptop",
"price": 1000,
"quantity": 2
}

3. MapReduce Code
Below is the Java MapReduce code to calculate the total revenue for each item.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class SalesAnalysis {

    // Mapper: reads one BSON document per call and emits (item, revenue)
    public static class SalesMapper
            extends org.apache.hadoop.mapreduce.Mapper<Object, org.bson.BSONObject, Text, IntWritable> {
        public void map(Object key, org.bson.BSONObject value, Context context)
                throws IOException, InterruptedException {
            String item = value.get("item").toString();
            int price = (Integer) value.get("price");
            int quantity = (Integer) value.get("quantity");
            int revenue = price * quantity; // revenue of this sale
            context.write(new Text(item), new IntWritable(revenue));
        }
    }

    // Reducer: sums the revenue of all sales of the same item
    public static class SalesReducer
            extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int totalRevenue = 0;
            for (IntWritable val : values) {
                totalRevenue += val.get();
            }
            context.write(key, new IntWritable(totalRevenue));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost:27017/yourDatabase.sales");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/yourDatabase.result");

        Job job = Job.getInstance(conf, "Sales Analysis");
        job.setJarByClass(SalesAnalysis.class);
        job.setMapperClass(SalesMapper.class);
        job.setCombinerClass(SalesReducer.class);
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

How It Works
1. MongoInputFormat: Reads documents from the MongoDB collection (sales).
2. SalesMapper: Computes revenue for each item.
3. SalesReducer: Aggregates the revenue for each item across all records.
4. MongoOutputFormat: Saves the processed results back into a MongoDB collection
(result).

4. Output in MongoDB result Collection


After running the job, the result collection will contain aggregated total revenue for each item:
{ "_id": "Laptop", "value": 2000 }
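
To inspect the processed results, you can also query the result collection directly from the mongo shell (assuming the database name yourDatabase used in the job configuration above):

use yourDatabase
db.result.find()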

Experiment 6: Using Power Pivot (Excel), perform the
following on any data set

Aim: To perform the big data analytics using power pivot in Excel

Resources: Microsoft Excel

Theory: Power Pivot is an Excel add-in you can use to perform powerful data analysis and create
sophisticated data models. With Power Pivot, you can mash up large volumes of data from
various sources, perform information analysis rapidly, and share insights easily.

In both Excel and in Power Pivot, you can create a Data Model, a collection of tables with
relationships. The data model you see in a workbook in Excel is the same data model you see in
the Power Pivot window. Any data you import into Excel is available in Power Pivot, and vice
versa.

Procedure:

Open Microsoft Excel, go to the Data menu and click Get Data.

Import the Twitter data set and click the Load To button.

Now the data will start importing into Excel.


Next click Create Connection and tick the check box Add this data to the Data Model.

Next click Manage Data Model and see that all the
Twitter data is loaded into the model, then close the Power
Pivot window.

Save the Excel file as sample.xls

Click the Diagram View and define the relationships between the tables.

Go to the Insert menu and click PivotTable.

Select the columns; you can perform drill-down
and roll-up operations using the pivot table.

We can also load 10 million rows of data from
multiple sources.

Experiment 6: Using Power Pivot, perform the following on any data set

B) Big Data Charting

Aim: To create a variety of charts using Excel for the given data

Resources:Microsoft Excel

Theory:

When your data sets are big, you can use Excel Power Pivot that can handle
hundreds of millions of rows of data. The data can be in external data
sources and Excel Power Pivot builds a Data Model that works on a memory
optimization mode. You can perform the calculations, analyze the data and
arrive at a report to draw conclusions and decisions. The report can be either
as a Power PivotTable or Power PivotChart or a combination of both.
You can utilize Power Pivot as an ad hoc reporting and analytics solution.
Thus, it is possible for a person with hands-on experience in Excel
to perform high-end data analysis and decision making in a matter of a few
minutes, and the results are a great asset to be included in dashboards.

Uses of Power Pivot


You can use Power Pivot for the following −
 To perform powerful data analysis and create sophisticated Data Models.
 To mash-up large volumes of data from several different sources quickly.
 To perform information analysis and share the insights interactively.
 To create Key Performance Indicators (KPIs).
 To create Power PivotTables.
 To create Power PivotCharts.

Differences between PivotTable and Power PivotTable

Power PivotTable resembles PivotTable in its layout, with the following
differences −
 PivotTable is based on Excel tables, whereas Power PivotTable is based
on data tables that are part of the Data Model.
 PivotTable is based on a single Excel table or data range, whereas
Power PivotTable can be based on multiple data tables, provided they
are added to the Data Model.
 PivotTable is created from the Excel window, whereas Power PivotTable is
created from the Power Pivot window.

Creating a Power PivotTable


Suppose you have two data tables – Salesperson and Sales in the Data Model. To create
a Power PivotTable from these two data tables, proceed as follows −
 Click on the Home tab on the Ribbon in PowerPivot window.
 Click on PivotTable on the Ribbon.
 Click on PivotTable in the dropdown list.

Create PivotTable dialog box appears. Click on New Worksheet.

Click the OK button. New worksheet gets created in Excel window and an empty Power
PivotTable appears.
As you can observe, the layout of the Power PivotTable is similar to that of PivotTable.
The PivotTable Fields List appears on the right side of the worksheet. Here, you will find
some differences from PivotTable. The Power PivotTable Fields list has two tabs − ACTIVE
and ALL, that appear below the title and above the fields list. The ALL tab is highlighted. The ALL
tab displays all the data tables in the Data Model and the ACTIVE tab displays all the data tables
that are chosen for the Power PivotTable at hand.
 Click the table names in the PivotTable Fields list
under ALL. The corresponding fields with check boxes will
appear.

 Each table name will have the symbol on the left side.
 If you place the cursor on this symbol, the Data Source and the Model Table Name of
that data table will be displayed.

 Drag Salesperson from Salesperson table to ROWS area.


 Click on the ACTIVE tab.
The field Salesperson appears in the Power PivotTable and the table Salesperson appears
under the ACTIVE tab.

 Click on the ALL tab.


 Click on Month and Order Amount in the Sales table.
 Click on the ACTIVE tab.
Both the tables – Sales and Salesperson appear under the ACTIVE tab.

 Drag Month to COLUMNS area.


 Drag Region to FILTERS area.

 Click on the arrow next to ALL in the Region filter box.


 Click on Select Multiple Items.
 Click on North and South.
 Click the OK button. Sort the column labels in ascending order.

Power PivotTable can be modified dynamically to explore and report data.

Creating a Power PivotChart


A Power PivotChart is a PivotChart that is based on Data Model and created from the Power
Pivot window. Though it has some features similar to Excel PivotChart, there are other features
that make it more powerful.
Suppose you want to create a Power PivotChart based on the following Data Model.

 Click on the Home tab on the Ribbon in the Power Pivot window.
 Click on PivotTable.
 Click on PivotChart in the dropdown list.

Create PivotChart dialog box appears. Click New Worksheet.



 Click the OK button. An empty PivotChart gets created on a new worksheet in the
Excel window. In this chapter, when we say PivotChart, we are referring to Power
PivotChart.

As you can observe, all the tables in the data model are displayed in the PivotChart Fields list.

 Click on the Salesperson table in the PivotChart Fields list.


 Drag the fields – Salesperson and Region to the AXIS area.
Two field buttons for the two selected fields appear on the PivotChart. These are the Axis
field buttons. The use of field buttons is to filter the data that is displayed on the PivotChart.

 Drag TotalSalesAmount from each of the 4 tables – East_Sales, North_Sales,


South_Sales and West_Sales to ∑ VALUES area.

As you can observe, the following appear on the worksheet −

 In the PivotChart, column chart is displayed by default.


 In the LEGEND area, ∑ VALUES gets added.
 The Values appear in the Legend in the PivotChart, with title Values.
 The Value Field Buttons appear on the PivotChart.
You can remove the legend and the value field buttons for a tidier look of the PivotChart.

 Click on the button at the top right corner of the PivotChart.

 Deselect Legend in the Chart Elements.

 Right click on the value field buttons.

 Click on Hide Value Field Buttons on Chart in the
dropdown list. The value field buttons on the chart will be
hidden.

Note that display of Field Buttons and/or Legend depends on the context of the PivotChart.
You need to decide what is required to be displayed.
As in the case of Power PivotTable, Power PivotChart Fields list also contains two tabs −
ACTIVE and ALL. Further, there are 4 areas −

 AXIS (Categories)

 LEGEND (Series)
 ∑ VALUES
 FILTERS
As you can observe, Legend gets populated with ∑ Values. Further, Field Buttons get added
to the PivotChart for the ease of filtering the data that is being displayed. You can click on
the arrow on a Field Button and select/deselect values to be displayed in the Power
PivotChart.

Table and Chart Combinations


Power Pivot provides you with different combinations of Power PivotTable and Power
PivotChart for data exploration, visualization and reporting.
Consider the following Data Model in Power Pivot that we will use for illustrations −

You can have the following Table and Chart Combinations in Power Pivot.
 Chart and Table (Horizontal) - you can create a Power PivotChart and a Power
PivotTable, one next to another horizontally in the same worksheet.

 Chart and Table (Vertical) - you can create a Power PivotChart and a Power PivotTable, one
below another vertically in the same worksheet.

These combinations and some more are available in the dropdown list that appears when
you click on PivotTable on the Ribbon in the Power Pivot window.

Click on the pivot chart and can develop multiple variety of charts

Output:

Experiment 7: Using R project to carry out statistical analysis of
big data

Aim: To perform statistical analysis of big data using R

Theory: Statistics is the science of analyzing, reviewing and
drawing conclusions from data.
Some basic statistical numbers include:
 Mean, median and mode
 Minimum and maximum value
 Percentiles
 Variance and Standard Deviation
 Covariance and Correlation
 Probability distributions
The R language was developed by two statisticians. It has many built-in functionalities, in addition
to libraries for the exact purpose of statistical analysis.

Procedure:

Installation of R and Rstudio


step 1:
sudo apt-get update
sudo apt-get install r-base
step 2:
Installation of R studio

https://posit.co/download/rstudio-desktop/#download

step 1: download RStudio for Ubuntu

step 2: wget -c
https://download1.rstudio.org/desktop/jammy/amd64/rstudio
-2022.07.2-576-amd64.deb

step 3: sudo dpkg -i rstudio-2022.07.2-576-amd64.deb

step 4: sudo apt install -f

step 5: rstudio

launch RStudio

procedure:
-->install.packages("gapminder")
-->library(gapminder)

-->data(gapminder)
output:

A tibble: 1,704 × 6

country continent year lifeExp pop gdpPercap

<fct> <fct> <int> <dbl> <int> <dbl>

1 Afghanistan Asia 1952 28.8 8425333 779.

2 Afghanistan Asia 1957 30.3 9240934 821.

3 Afghanistan Asia 1962 32.0 10267083 853.

4 Afghanistan Asia 1967 34.0 11537966 836.

5 Afghanistan Asia 1972 36.1 13079460 740.

6 Afghanistan Asia 1977 38.4 14880372 786.

7 Afghanistan Asia 1982 39.9 12881816 978.

8 Afghanistan Asia 1987 40.8 13867957 852.

9 Afghanistan Asia 1992 41.7 16317921 649.

10 Afghanistan Asia 1997 41.8 22227415 635.

# … with 1,694 more rows

-->summary(gapminder)

output:

     country       continent        year
 Afghanistan:  12   Africa  :624   Min.   :1952
 Albania    :  12   Americas:300   1st Qu.:1966
 Algeria    :  12   Asia    :396   Median :1980
 Angola     :  12   Europe  :360   Mean   :1980
 Argentina  :  12   Oceania : 24   3rd Qu.:1993
 Australia  :  12                  Max.   :2007
 (Other)    :1632

lifeExp pop gdpPercap

Min. :23.60 Min. :6.001e+04 Min. : 241.2

1st Qu.:48.20 1st Qu.:2.794e+06 1st Qu.: 1202.1

Median :60.71 Median :7.024e+06 Median : 3531.8

Mean :59.47 Mean :2.960e+07 Mean : 7215.3

3rd Qu.:70.85 3rd Qu.:1.959e+07 3rd Qu.: 9325.5

Max. :82.60 Max. :1.319e+09 Max. :113523.1

-->x<-mean(gapminder$gdpPercap)

Type x to get the mean value of gdpPercap

-->x

output:[1] 7215.327

-->attach(gapminder)

-->median(pop)

output:[1] 7023596

-->hist(lifeExp)

-->boxplot(lifeExp)
will plot the below images

-->plot(lifeExp ~ gdpPercap)

-->install.packages("dplyr")
-->library(dplyr)

-->gapminder %>%
+ filter(year == 2007) %>%
+ group_by(continent) %>%
+ summarise(lifeExp = median(lifeExp))

output:
# A tibble: 5 × 2
continent lifeExp
<fct> <dbl>
1 Africa 52.9
2 Americas 72.9
3 Asia 72.4
4 Europe 78.6
5 Oceania 80.7

-->install.packages("ggplot2")
--> library("ggplot2")
-->ggplot(gapminder, aes(x = continent, y = lifeExp)) +
geom_boxplot(outlier.colour = "hotpink") +
geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1/4)

output:

-->head(country_colors, 4)

output:
Nigeria Egypt Ethiopia
"#7F3B08" "#833D07" "#873F07"
Congo, Dem. Rep.
"#8B4107"
-->head(continent_colors)

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

> Data_Cars <- mtcars



> dim(Data_Cars)
[1] 32 11
> names(Data_Cars)
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
> Data_Cars <- mtcars
> Data_Cars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> Data_Cars <- mtcars
> sort(Data_Cars$cyl)
[1] 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8
> Data_Cars <- mtcars
>
> summary(Data_Cars)

      mpg             cyl             disp             hp             drat
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
       wt             qsec             vs               am              gear
 Min.   :1.513   Min.   :14.50   Min.   :0.0000   Min.   :0.0000   Min.   :3.000
 1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000
 Median :3.325   Median :17.71   Median :0.0000   Median :0.0000   Median :4.000
 Mean   :3.217   Mean   :17.85   Mean   :0.4375   Mean   :0.4062   Mean   :3.688
 3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000
 Max.   :5.424   Max.   :22.90   Max.   :1.0000   Max.   :1.0000   Max.   :5.000
      carb
 Min.   :1.000
 1st Qu.:2.000
 Median :2.000
 Mean   :2.812
 3rd Qu.:4.000
 Max.   :8.000
> Data_Cars <- mtcars
>
> max(Data_Cars$hp)
[1] 335
> min(Data_Cars$hp)
[1] 52
> Data_Cars <- mtcars
>
> which.max(Data_Cars$hp)
[1] 31
> which.min(Data_Cars$hp)
[1] 19
> Data_Cars <- mtcars
> rownames(Data_Cars)[which.max(Data_Cars$hp)]
[1] "Maserati Bora"
> rownames(Data_Cars)[which.min(Data_Cars$hp)]
[1] "Honda Civic"
> median(Data_Cars$wt)
[1] 3.325
> names(sort(-table(Data_Cars$wt)))[1]
[1] "3.44"

> Data_Cars <- mtcars
>
> mean(Data_Cars$wt)
[1] 3.21725

Data_Cars <- mtcars

median(Data_Cars$wt)
[1] 3.325

Data_Cars <- mtcars

names(sort(-table(Data_Cars$wt)))[1]

Data_Cars <- mtcars

# c() specifies which percentile you want
quantile(Data_Cars$wt, c(0.75))
 75%
3.61

Data_Cars <- mtcars


>
> quantile(Data_Cars$wt)
0% 25% 50% 75% 100%
1.51300 2.58125 3.32500 3.61000 5.42400

Regression analysis using R


Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables. One of these variable is called predictor variable whose value is
gathered through experiments. The other variable is called response variable whose
value is derived from the predictor variable.
In Linear Regression these two variables are related through an equation, where
exponent (power) of both these variables is 1. Mathematically a linear relationship
represents a straight line when plotted as a graph. A non-linear relationship where the
exponent of any variable is not equal to 1 creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
 y is the response variable.
 x is the predictor variable.
 a and b are constants which are called the coefficients.

Steps to Establish a Regression


A simple example of regression is predicting weight of a person when his height is
known. To do this we need to have the relationship between height and weight of a
person.
The steps to create the relationship is −
 Carry out the experiment of gathering a sample of observed values of
height and corresponding weight.
 Create a relationship model using the lm() functions in R.
 Find the coefficients from the model created and create the mathematical
equation using these.

 Get a summary of the relationship model to know the average error in
prediction. This is also called residuals.
 To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response vari-
able.

Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
 formula is a symbol presenting the relation between x and y.
 data is the vector on which the formula will be applied.
Create Relationship Model & get the Coefficient
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(relation)

Result:
Call:
lm(formula = y ~ x)

Coefficients:
(Intercept) x
-38.4551 0.6746

To get the summary of the relationship:


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

print(summary(relation))

Result:

Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06

predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
 object is the formula which is already created using the lm() function.
 newdata is the vector containing the new value for predictor variable.

Predict the weight of new persons


# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.


y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.


relation <- lm(y~x)

# Find weight of a person with height 170.


a <- data.frame(x = 170)
result <- predict(relation,a)
print(result)

Result:
1
76.22869

Visualize the Regression Graphically

# Create the predictor and response variable.


x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.


png(file = "linearregression.png")

# Plot the chart.


plot(y,x,col = "blue",main = "Height & Weight Regression",
abline(lm(x~y)),cex = 1.3,pch = 16,xlab = "Weight in Kg",ylab = "Height in cm")

# Save the file.
dev.off()

Experiment 8: Using R project for data visualization

Aim: To perform data visualization using R programming
Theory:
Data visualization is the technique used to deliver insights in data using visual cues such as
graphs, charts, maps, and many others. This is useful as it helps in intuitive and easy
understanding of the large quantities of data and thereby make better decisions regarding it.
Data Visualization in R Programming Language
The popular data visualization tools that are available are Tableau, Plotly, R, Google Charts,
Infogram, and Kibana. The various data visualization platforms have different capabilities,
functionality, and use cases. They also require a different skill set. This article discusses the
use of R for data visualization.
R is a language that is designed for statistical computing, graphical data analysis, and
scientific research. It is usually preferred for data visualization as it offers flexibility and
minimum required coding through its packages.
Types of Data Visualizations
Some of the various types of visualizations offered by R are:

Bar Plot

There are two types of bar plots- horizontal and vertical which represent data points as
horizontal or vertical bars of certain lengths proportional to the value of the data item. They
are generally used for continuous and categorical variable plotting. By setting
the horiz parameter to true and false, we can get horizontal and vertical bar plots
respectively.

Bar plots are used for the following scenarios:


 To perform a comparative study between the various data categories in the data
set.
 To analyze the change of a variable over time in months or years.
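
A minimal sketch using the built-in mtcars data set (an illustrative example, not part of the original manual):

# Vertical and horizontal bar plots of the number of cars per cylinder count
counts <- table(mtcars$cyl)
barplot(counts, main = "Cars by cylinder count", xlab = "Cylinders", ylab = "Number of cars")
barplot(counts, horiz = TRUE, main = "Cars by cylinder count", xlab = "Number of cars")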

Histogram

A histogram is like a bar chart as it uses bars of varying height to represent data distribution.
However, in a histogram values are grouped into consecutive intervals called bins. In a
Histogram, continuous values are grouped and displayed in these bins whose size can be
varied.
For a histogram, the parameter xlim can be used to specify the interval within which all
values are to be displayed.
Another parameter freq when set to TRUE denotes the frequency of the various values in the
histogram and when set to FALSE, the probability densities are represented on the y-axis such
that the total area of the histogram adds up to one.
Histograms are used in the following scenarios:
 To verify an equal and symmetric distribution of the data.

 To identify deviations from expected values.
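
A small illustrative example (assuming the built-in mtcars data set) showing the xlim and freq parameters described above:

# freq = FALSE plots probability densities instead of counts; xlim fixes the x-range
hist(mtcars$mpg, xlim = c(10, 35), freq = FALSE,
     main = "Histogram of mpg", xlab = "Miles per gallon")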

Box Plot

The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data point, the median value, first and
third quartile, and interquartile range.
Box Plots are used for:
 To give a comprehensive statistical description of the data through a visual cue.
 To identify the outlier points that do not lie in the inter-quartile range of data.
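
An illustrative sketch (again assuming mtcars as sample data):

# Boxplot of mileage grouped by cylinder count; outliers show up as individual points
boxplot(mpg ~ cyl, data = mtcars,
        main = "Mileage by cylinder count", xlab = "Cylinders", ylab = "mpg")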

Scatter Plot

A scatter plot is composed of many points on a Cartesian plane. Each point denotes the value
taken by two parameters and helps us easily identify the relationship between them.
Scatter Plots are used in the following scenarios:
 To show whether an association exists between bivariate data.
 To measure the strength and direction of such a relationship.
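
A short illustrative sketch (assuming mtcars as sample data):

# Scatter plot of weight against mileage to inspect the relationship between them
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs mileage", xlab = "Weight (1000 lbs)", ylab = "mpg", pch = 16)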

Heat Map

Heatmap is defined as a graphical representation of data using colors to visualize the value
of the matrix. heatmap() function is used to plot heatmap.
Syntax: heatmap(data)
Parameters: data: It represent matrix data, such as values of rows and columns
Return: This function draws a heatmap.
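
An illustrative sketch of heatmap() (assuming mtcars as sample data; not part of the original manual):

# heatmap() expects a numeric matrix; scale = "column" standardizes each column
heatmap(as.matrix(mtcars), scale = "column")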

Procedure:
Step I : Facebook Developer Registration
Go to https://developers.facebook.com and register yourself by
clicking on the Get Started button at the top right of the page (see the
snapshot below). After that, it will open a registration form which
you need to fill in to get yourself registered.
Step 2: Click on Tools

Step 3: Click on Graph API Explorer

Step 4: Copy the access token

Go to RStudio and write this script:

install.packages("httpuv")
install.packages("Rfacebook")
install.packages("RColorBrewer")
install.packages("RCurl")
install.packages("rjson")
install.packages("httr")

library(Rfacebook)
library(httpuv)
library(RColorBrewer)
acess_token="EAATgfMOrIRoBAOR9XUl3VGzbLMuWGb9FqGkTK3PFBuRyUVZA
WAL7ZBw0xN3AijCsPiZBylucovck4YUhU昀欀WLMZBo640k2ZAupKgsaKog9736lec
P8E52qkl5de8M963oKG8KOCVUXqqLiRcI7yIbEONeQt0eyLI6LdoeZA65Hyxf8so1
UMbywAdZCZAQBpNiZAPPj7G3UX5jZAvUpRLZCQ5SIG"
options(RCurlOptions=list(verbose=FALSE, capath=system.file("CurlSSL","cacert.pem", package="RCurl"), ssl.verifypeer=FALSE))

me<-getUsers("me",token=acess_token)
View(me)
myFriends<-getFriends(acess_token,simplify = FALSE)
table(myFriends)
pie(table(myFriends$gender))
output
