Statistical Lab Using R-Programming Lab Manual and Workbook: Department of Mathematics
Subject Code :
U18MAR0001 – Statistical Foundations for Data Science
Regulation : R2018
Year : II
Department of Mathematics
1. Introduction to R Programming
8. Regression
S.No.  Date  Name of the experiment            Program (10)  Execution (10)  Viva (10)  Total marks (30)  Faculty sign
1            Introduction to R Programming
5            Mean, Median, Mode
8            Regression
9            ANOVA – One-way classification
10           ANOVA – Two-way classification
STEP 1: INTRODUCTION
STEP 2: ACQUISITION
I. INTRODUCTION TO R PROGRAMMING
R is a programming language and software environment for statistical analysis,
graphical representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand.
The language was named R after the first letter of the first names of its
two authors (Robert Gentleman and Ross Ihaka).
R allows integration with the procedures written in the C, C++, .Net, Python or
FORTRAN languages for efficiency.
R is an extremely flexible and customizable language.
R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the
computer screen or as printed output.
R is one of the most widely used programming languages for statistics.
R has a large number of built-in packages, functions and operators.
[Figure: screenshot of the R working environment with the panes labelled Command History, Editor (Environment), Console Instructions, Packages and Plot]
Functions:
A function is a group of statements or commands that performs a specific task. As in other
languages, a function in R is called by writing its name followed by parentheses (), which
enclose its arguments. Functions may be either user-defined or built-in, as provided by the language.
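As a brief illustration of the two kinds of function, the sketch below calls a built-in function and then defines and calls a user-defined one. The name square and its argument are hypothetical, chosen only for this example.

```r
# Built-in function: sqrt() is provided by R.
root <- sqrt(16)
print(root)

# User-defined function: `square` is a hypothetical name used for illustration.
square <- function(x) {
  x * x   # the value of the last expression is returned
}
sq <- square(5)
print(sq)
```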
R-objects:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
Creating vectors in R
A vector is a sequence of data elements of the same type. In R, the c() function is most
commonly used to create a vector.
R Code:
c(1,2,3,4)
Output:
## [1] 1 2 3 4
R Code:
c(1:4)
Output:
## [1] 1 2 3 4
The simplest of these objects is the vector object. R language has five frequently used atomic
vector types (Carpentries 2018):
• Logical – TRUE, FALSE
• Integer – 2L, 4L, etc. (the L tells R to store this as an integer)
• Numeric – 5, 12.5, etc.
• Complex – 1 + 4i, 2 + 1i, etc. (the imaginary part must be written with a number, e.g. 1i)
• Character – "Hello", "R", etc.
i) testData = FALSE
print(class(testData))
## [1] "logical"
In order to declare an integer in R language, we need to add an L suffix.
ii) testData = 5L
print(class(testData))
## [1] "integer"
iii) testData = as.integer(5)
print(class(testData))
## [1] "integer"
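The remaining atomic types can be checked in the same way. The following is a minimal sketch; the values are arbitrary examples.

```r
# Numeric (double) is the default type for numbers.
testData = 12.5
print(class(testData))

# Complex numbers need an explicit imaginary part such as 4i or 1i.
testData = 1 + 4i
print(class(testData))

# Character data is written in quotes.
testData = "Hello"
print(class(testData))
```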
Output : 1 2 3 4 5 6 7 8 9 10
Output: 1 3 5 7 9 11 13 15
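The commands that produced the two outputs above are not shown. One possible way to generate such sequences, given here only as an assumption about what was intended, is the colon operator and the seq() function:

```r
# Integers from 1 to 10 using the colon operator.
s1 = 1:10
print(s1)

# Odd numbers from 1 to 15 using seq() with a step of 2.
s2 = seq(from = 1, to = 15, by = 2)
print(s2)
```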
> round(1.414214,digits=3) # Rounding the value 1.414214 to three places after the decimal point.
Output: 1.414
R-code:
a=c(1,2,3,4)
a*a
sum(a*a)
Output:
a*a
[1] 1 4 9 16
sum(a*a)
[1] 30
R-code:
a=c(1,2,3,4)
b=c(2,4,6,8)
sum(a*b)
Output:
sum(a*b)
[1] 60
x=c("name","age","email")
rm(x)
testData = seq(from=1,to=10,by=0.5)
paste("The length of testData is", length(testData))
## [1] "The length of testData is 19"
The function gl (generate levels) is very useful because it generates regular series of factors.
The usage of this function is gl(k, n) where k is the number of levels (or classes), and n is the
number of replications in each level.
Two options may be used: length to specify the number of data produced, and labels to
specify the names of the levels of the factor.
Examples:
> gl(3, 5)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(3, 5, length=30)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
Levels: 1 2 3
> gl(2, 10)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
> gl(2, 1, length=20)
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
Levels: 1 2
> gl(2, 2, length=20)
[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2
Levels: 1 2
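The labels option mentioned above can be sketched as follows; the level names "Low" and "High" are hypothetical.

```r
# Two levels, three replications each, with descriptive level names.
g = gl(2, 3, labels = c("Low", "High"))
print(g)
print(levels(g))
```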
A matrix can be created using the matrix function. The dimensions of the matrix are defined by
passing appropriate values for the arguments nrow and ncol.
matrix(data=1:9,nrow=3,ncol=3)
The numbers from 1 to 9 are arranged column-wise. If we want the numbers (1 to 9) to
be arranged row-wise, the command is as follows:
matrix(data=1:9,nrow=3,ncol=3,byrow=TRUE)
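A minimal sketch of the difference between the two fill orders, using the same 3 x 3 example; the object names m_col and m_row are hypothetical.

```r
m_col = matrix(data = 1:9, nrow = 3, ncol = 3)                # filled column by column
m_row = matrix(data = 1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row
print(m_col)
print(m_row)

# The first row differs: 1 4 7 for column-wise filling, 1 2 3 for row-wise filling.
print(m_col[1, ])
print(m_row[1, ])
```

Filling by row gives the transpose of the matrix obtained by filling by column.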
u=c(9:12)
v=c(13:16)
u; v
## [1] 9 10 11 12
## [1] 13 14 15 16
First, create a single vector by joining the vectors u and v as given below:
w =c(u,v)
w
## [1] 9 10 11 12 13 14 15 16
Now
mat= matrix(data=w,nrow=4,ncol=2)
mat
##      [,1] [,2]
## [1,]    9   13
## [2,]   10   14
## [3,]   11   15
## [4,]   12   16
rowSums(mat)
## [1] 22 24 26 28
colSums(mat)
## [1] 42 58
dim(mat)
## [1] 4 2
mat=cbind(mat,17:20)
mat
##      [,1] [,2] [,3]
## [1,]    9   13   17
## [2,]   10   14   18
## [3,]   11   15   19
## [4,]   12   16   20
dim(mat)
## [1] 4 3
Matrix operations with built in data sets:
R Code:
library(help=datasets)
datasets::swiss
nrow(swiss)
ncol(swiss)
dim(swiss)
head(swiss)
head(swiss,5) ## (gives only the first five rows of the data set)
swiss[3,2] ## (the element in the third row and second column)
swiss[1] ## (the first column of the data set)
swiss[2:5,] ## (extracts the second to fifth rows only)
swiss[1,] ## (extracts the first row of the data set)
Output:
## datasets::swiss
## head(swiss)
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6
## head(swiss,5)
## swiss[3,2]
[1] 39.7
#
swiss[1]
Fertility
Courtelary 80.2
Delemont 83.1
Franches-Mnt 92.5
Moutier 85.8
Neuveville 76.9
Porrentruy 76.1
Broye 83.8
Glane 92.4
Gruyere 82.4
Sarine 82.9
Veveyse 87.1
Aigle 64.1
Aubonne 66.9
Avenches 68.9
Cossonay 61.7
Echallens 68.3
Grandson 71.7
Lausanne 55.7
La Vallee 54.3
Lavaux 65.1
Morges 65.5
Moudon 65.0
Nyone 56.6
Orbe 57.4
Oron 72.5
Payerne 74.2
Paysd'enhaut 72.0
Rolle 60.5
Vevey 58.3
Yverdon 65.4
Conthey 75.5
Entremont 69.3
Herens 77.3
Martigwy 70.5
Monthey 79.4
St Maurice 65.0
Sierre 92.2
Sion 79.3
Boudry 70.4
La Chauxdfnd 65.7
Le Locle 72.7
Neuchatel 64.4
Val de Ruz 77.6
ValdeTravers 67.6
V. De Geneve 35.0
Rive Droite 44.7
Rive Gauche 42.8
#
swiss[2:5,]
Fertility Agriculture Examination Education Catholic Infant.Mortality
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
swiss[1,]
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17 15 12 9.96 22.2
Experiment Number: 03
Lab Code : U18MAR0001
Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Data presentation methods - Bar Chart,
Pie Chart
STEP 1: INTRODUCTION
OBJECTIVE OF THE EXPERIMENT
To create Bar charts and Pie charts
STEP 2: ACQUISITION
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.
A pie chart is a circular statistical graph which is divided into slices to illustrate numerical
proportion. The various observations or components are represented by the sectors of a circle
and the whole circle represents the sum of the values of all components. The arc length of
each slice (and consequently its central angle and area), is proportional to the quantity it
represents.
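$\text{Angle of sector} = \dfrac{\text{Frequency of data}}{\text{Total frequency}} \times 360^{\circ}$
$\text{Percentage share of sector} = \dfrac{\text{Frequency of data}}{\text{Total frequency}} \times 100$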
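To make the proportion concrete, the sketch below converts a set of frequencies into percentage shares and central angles (a full circle being 360 degrees). The frequencies are hypothetical values chosen for illustration.

```r
freq = c(20, 30, 50)        # hypothetical frequencies
share = freq / sum(freq)    # proportion of the total
percent = share * 100       # percentage share of each slice
angle = share * 360         # central angle of each sector in degrees
print(percent)
print(angle)
```

The angles always add up to 360 and the percentages to 100, which is a quick check on the arithmetic.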
Bar Chart:
Example
The following list gives the export quantity of raw cotton (in million kg.) for five
consecutive years 2012-2013 to 2016-17:
1945.63, 1864.69, 1093.11, 1297.27, 918.15.
Plot a bar chart.
R-code:
export=c(1945.63, 1864.69, 1093.11, 1297.27, 918.15)
year=c("2012-13","2013-14","2014-15","2015-16","2016-17")
barplot(export, names.arg=year, xlab="Financial Year", ylab="Export Quantity", col="darkgreen", main="Export quantity of raw cotton (in million kg.)")
Output:
Task 1
Data regarding India’s Article market size (US$billion) for various years is given in the
following table. Plot a bar chart.
R Code:
Output:
Pie Chart
Example:
Draw a pie diagram to represent the following data giving the monthly expenditure of a
family:
Type of expenditure Amount in Rs. ‘000
Food 10
Rent 15
Clothes 2
Education 10
Miscellaneous 3
Savings 8
Total 48
R code
exp = c(10, 15, 2, 10, 3,8)
lbls= c("Food", "Rent", "Clothes", "Education", "Miscellaneous","Savings")
percent = round(exp/sum(exp)*100)
lbls = paste(lbls, percent)
lbls = paste(lbls,"%",sep ="")
pie(exp,labels = lbls, col=rainbow(length(lbls)), main=" Monthly expenditure of a family ")
Output
Note: Save the pie chart as image with file name “Monthly expenditure of a family”. It will
get saved in your working directory
Task 1
Elements that make up the earth’s crust are as follows. Represent them as a pie diagram.
Sl.No. Elements Percentage
1. Aluminum 9%
2. Silicon 23%
3. Iron 14%
4. Oxygen 39%
5. Calcium 11%
6. Others 4%
R Code:
data=c(9,23,14,39,11,4)
labels=c('Aluminium','Silicon','Iron','Oxygen','Calcium','Others')
labels =paste(labels, data, sep='-')
labels =paste(labels,'%', sep='')
pie (data,labels=labels,col=rainbow(length(labels)),main='Elements of Earth\'s Crust')
Output:
Task 2
The following table shows the area in millions of square kilometers of the
oceans of the world:
Ocean Area (million sq. km.)
Pacific 70.8
Indian 28.5
Atlantic 41.2
Antarctic 7.6
Arctic 4.8
R Code:
Output:
LABORATORY MANUAL
Experiment Number: 04
STEP 1: INTRODUCTION
OBJECTIVES OF THE EXPERIMENT
Example
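R code:
A = data.frame(name=c("A","B","C"), gender=c("Male","Male","Female"),
height=c(152,171.5,165), weight=c(81,93,78), age=c(42,38,26))
A
(OR)
name=c("A","B","C")
gender=c("Male","Male","Female")
height=c(152,171.5,165)
weight=c(81,93,78)
age=c(42,38,26)
A=data.frame(name,gender,height,weight,age)
A
Output
name gender height weight age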
1 A Male 152.0 81 42
2 B Male 171.5 93 38
3 C Female 165.0 78 26
Task 1
Create a data frame from the following details regarding babies’ frocks (Given: size, season,
material, decoration, pattern type, price)
R Code:
size = c("L","M","M","M","L")
season=c("Spring","Summer","Summer","Winter","Autumn")
material=c("silk","chiffon","cotton","cotton","linen")
decoration=c("Embroidery","Bow","Null","Null","Ruffles")
patterntype=c("dot","print","animal","patchwork","animal")
price=c(650,275,380,450,420)
D=data.frame(size,season,material,decoration,patterntype,price)
D
Output:
To import data from Excel sheet ‘abc’, first save the file as .csv(comma delimited) in
the current working directory. Then execute the following command
data = read.csv("abc.csv")
data
3. $ symbol is used to extract a specific field.
Example:
To import data from Excel sheet ‘Testmarks’, first save the file as .csv (comma delimited) in
the current working directory. Then execute the following command
data = read.csv("Testmarks.csv")
data
Output:
Sl.No Name IT.1 IT.II
1 1 A 26 32
2 2 B 25 25
3 3 C 19 31
4 4 D 14 26
5 5 E 25 28
6 6 F 32 32
7 7 G 29 42
8 8 H 25 26
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36
To get the list of students who have secured 30 or more marks in both tests
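One way to obtain this list, sketched under the assumption that the columns are named IT.1 and IT.II as in the printed output, is subset() with a combined condition. The data frame is rebuilt inline here so that the example does not depend on Testmarks.csv; the name pass is only illustrative.

```r
# Test marks copied from the table above.
data = data.frame(
  Sl.No = 1:12,
  Name  = LETTERS[1:12],
  IT.1  = c(26, 25, 19, 14, 25, 32, 29, 25, 31, 35, 33, 35),
  IT.II = c(32, 25, 31, 26, 28, 32, 42, 26, 38, 39, 31, 36)
)

# Keep only the rows where both test marks are 30 or more.
pass = subset(data, IT.1 >= 30 & IT.II >= 30)
print(pass)
```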
Output
Sl.No. Name IT.1 IT.II
6 6 F 32 32
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36
Task 2
Import the Excel file ‘Studentdetails1’ from your directory to create a data frame.
details = read.csv("Studentdetails1.csv")
details
Output:
X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS
1 1 A1 Male MQ TIRUVANNAMALAI 967 177.00
2 2 A2 Female MQ COIMBATORE 1097 183.75
3 3 A3 Female MQ COIMBATORE 1096 183.50
4 4 A4 Female MQ COIMBATORE 1085 187.50
5 5 A5 Male MQ COIMBATORE 1056 179.00
6 6 A6 Male MQ OOTY 1091 184.00
7 7 A7 Female MQ KARUR 1088 180.75
8 8 A8 Male MQ COIMBATORE 1009 171.75
Experiment Number: 05
STEP 1: INTRODUCTION
STEP 2: ACQUISITION
To create a data frame using given data
data
3. $ symbol is used to extract a specific field.
4. mean(data$-----)
5. median(data$----)
6. Mode :
mode = function(x) {
ux = unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x = data$ ------
# Calculate the mode using the user function.
v = mode(x)
print(v)
7. summary(data)
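The steps above can be sketched on a small vector. The marks below are the IT.1 column of the Testmarks example; the function is called get_mode here (a hypothetical name) so that it does not mask R's built-in mode() function, which reports the storage type of an object rather than the most frequent value.

```r
marks = c(26, 25, 19, 14, 25, 32, 29, 25, 31, 35, 33, 35)

m  = mean(marks)     # arithmetic mean
md = median(marks)   # middle value of the sorted data

# Most frequent value (the first one found if several values tie).
get_mode = function(x) {
  ux = unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
mo = get_mode(marks)

print(m)
print(md)
print(mo)
```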
Example:
To import data from Excel sheet ‘Test marks’, first save the file as .csv(comma delimited) in
the current working directory. Then execute the following command
data = read.csv("Testmarks.csv")
data
Output:
Sl.No. Name IT.1 IT.II
1 1 A 26 32
2 2 B 25 25
3 3 C 19 31
4 4 D 14 26
5 5 E 25 28
6 6 F 32 32
7 7 G 29 42
8 8 H 25 26
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36
To get the list of students who have secured 30 or more marks in both tests
print(pass)
Output:
R Code:
mean(data$IT.1)
Output:
27.41667
Task 2
Import the excel file ‘Studentdetails1’from your directory to create a data frame and
find:
R Code:
data
Output:
X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS
1 1 A1 Male MQ TIRUVANNAMALAI 967 177.00
2 2 A2 Female MQ COIMBATORE 1097 183.75
3 3 A3 Female MQ COIMBATORE 1096 183.50
4 4 A4 Female MQ COIMBATORE 1085 187.50
5 5 A5 Male MQ COIMBATORE 1056 179.00
6 6 A6 Male MQ OOTY 1091 184.00
7 7 A7 Female MQ KARUR 1088 180.75
8 8 A8 Male MQ COIMBATORE 1009 171.75
9 9 A9 Male MQ COIMBATORE 906 145.50
10 10 A10 Male MQ SALEM 977 159.25
11 11 A11 Male MQ COIMBATORE 1052 168.25
12 12 A12 Female MQ THE NILGIRIS 1125 190.50
13 13 A13 Male MQ BANGALORE 391 68.25
14 14 A14 Female MQ COIMBATORE 1003 158.00
15 15 A15 Male MQ SALEM 959 168.00
16 16 A16 Male MQ TIRUCHIRAPPALLI 1140 188.00
17 17 A17 Female MQ COIMBATORE 963 162.00
18 18 A18 Female GQ COIMBATORE 1135 195.50
19 19 A19 Male GQ ERODE 1139 195.75
20 20 A20 Female GQ TIRUPPUR 1158 195.25
21 21 A21 Male GQ PATTUKOTTAI 1153 195.25
22 22 A22 Female MQ COONOOR 1115 189.50
23 23 A23 Female GQ TRICHIRAPPALLI 1114 192.25
24 24 A24 Male GQ TIRUNELVELI 1145 195.75
25 25 A25 Male GQ SALEM 1164 196.00
26 26 A26 Male GQ ERODE 1152 195.75
27 27 A27 Female GQ DHARAPURAM 1159 194.50
28 28 A28 Female GQ COIMBATORE 1112 195.50
29 29 A29 Male GQ TIRUPPUR 1112 179.50
30 30 A30 Male GQ SIVAGANGAI 1147 194.25
31 31 A31 Female GQ CUDDALORE 1127 195.00
32 32 A32 Female GQ TIRUCHIRAPPALLI 1152 192.75
33 33 A33 Female GQ ERODE 1135 194.50
34 34 A34 Male GQ KRISHNAGIRI 1143 192.50
35 35 A35 Male GQ THIRUVALLUR 1125 195.00
36 36 A36 Male GQ VIRUDHUNAGAR 1143 195.75
37 37 A37 Male GQ MADURAI 1041 181.25
38 38 A38 Male MQ COIMBATORE 1047 170.75
39 39 A39 Female MQ SIVAGANGAI 1038 168.00
40 40 A40 Male GQ METTUPALAYAM 1094 187.00
41 41 A41 Male MQ TIRUPPUR 1084 186.50
42 42 A42 Male MQ KANYAKUMARI 1043 177.50
43 43 A43 Male MQ CHAMRAJNAGAR 433 68.00
44 44 A44 Male MQ SIVAGANGA 884 139.00
45 45 A45 Male MQ SALEM 926 152.25
46 46 A46 Male MQ COIMBATORE 936 153.00
47 47 A47 Male GQ COIMBATORE 1129 193.00
48 48 A48 Female GQ DINDIGUL 1101 188.75
49 49 A49 Female GQ THE NILGIRIS 1134 192.75
50 50 A50 Female GQ ERODE 1117 193.75
51 51 A51 Male GQ CUDDALORE 1129 188.75
CODING:
195.75
Standard Deviation:
sd(data$CUTOFFMARKS)
Output:
26.0761
Output:
55 character character
55 character character
55 character character
55 character character
R Code:
mean(data$TOTALMARKS)
Output:
1059.836
Standard Deviation:
sd(data$TOTALMARKS)
Output:
145.9606
4. To find the city from which maximum number of students have come.
data = read.csv("Studentdetails1.csv")
data
mode = function(x) {
ux = unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x = data$CITYNAME
v = mode(x)
print(v)
Output
"COIMBATORE"
Output:
R Code:
a = mean(p$CUTOFFMARKS)
a
Output:
192.6638
6.To get the list of girl students and their average cutoff.
R Code:
p = subset(data,GENDER=="Female")
print(p)
Output:
R Code:
a = mean(p$CUTOFFMARKS)
a
Output:
187.0119
Experiment Number: 06
STEP 1: INTRODUCTION
OBJECTIVE OF THE EXPERIMENT
1. To compute Standard Deviation
2. To compute the Five-Number Summary
3. Draw and interpret Box plots
STEP 2: ACQUISITION
1. The five-number summary is an exploratory data analysis technique that summarizes the
data with five numbers: the minimum value, first quartile (Q1), median, third
quartile (Q3) and maximum value.
A box-and-whisker plot or boxplot is a diagram based on the five-number summary
of a data set.
The boxplot is a visual representation of the distribution of the data.
The rectangular box is constructed with one end at Q1 and the other end at Q3,
with a line segment inside the box at the median value.
Two line segments, called the whiskers, extend from the ends of the box: one to the
minimum value and one to the maximum value.
The difference between the third and first quartiles (Q3 - Q1) is called the interquartile range (IQR).
When outliers are plotted separately, the ends of the whiskers show the highest and lowest values excluding the outliers.
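A minimal sketch of these quantities on a small hypothetical data set. The 1.5 x IQR rule used for the fences is a common convention and is stated here as an assumption about how outliers are to be identified; boxplot() uses the same multiplier by default, applied to the hinges, which can differ slightly from the quartiles returned by quantile().

```r
v = c(2, 4, 5, 7, 8, 9, 11, 30)   # hypothetical observations

five = fivenum(v)                 # minimum, lower hinge, median, upper hinge, maximum
q1 = unname(quantile(v, 0.25))    # first quartile
q3 = unname(quantile(v, 0.75))    # third quartile
iqr = IQR(v)                      # interquartile range, equal to q3 - q1

# Values lying more than 1.5 * IQR beyond the quartiles are flagged as potential outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = v[v < lower | v > upper]

print(five)
print(iqr)
print(outliers)
```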
1. To import data from a given MS-Excel file and to find standard deviation from a
data frame
To import data from Excel sheet ‘abc’, first save the file as .csv(comma
delimited) in the current working directory. Then execute the following
command
data = read.csv("abc.csv")
data
3. $ symbol is used to extract a specific field.
4. Standard Deviation
z = sd (data$ ----)
z
Example:
To import data from Excel sheet ‘Test marks’, first save the file as .csv (comma delimited) in
the current working directory. Then execute the following command
data = read.csv("Testmarks.csv")
data
Output:
Sl.No. Name IT.1 IT.II
1 1 A 26 32
2 2 B 25 25
3 3 C 19 31
4 4 D 14 26
5 5 E 25 28
6 6 F 32 32
7 7 G 29 42
8 8 H 25 26
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36
data=c(x,y,z,….)
min(data)
max(data)
quantile(data)
quantile(data,0.25)
quantile(data,0.75)
fivenum(data)
summary(data)
boxplot(data,range=0.0,horizontal=FALSE,varwidth=TRUE,notch=FALSE,
outline=TRUE, boxwex = 0.5,border=c("blue"), col=c("green"), xlab="A",
ylab="B", main="M")
Example 1.
Compute the five-point summary (Minimum, Maximum, Median, 1st Quartile, 3rd Quartile)
of a set of observations giving the test scores of 9 students in Mathematics, and visualize the
summary statistics using a box plot.
Test Scores:78, 93, 68, 84, 90, 74, 64, 55, 80.
R- code:
scores = c (78, 93, 68, 84, 90, 74, 64, 55, 80)
min(scores)
max(scores)
quantile(scores)
quantile(scores,0.25)
quantile(scores,0.75)
fivenum(scores)
summary(scores)
boxplot(scores,range=0.0,horizontal=FALSE,varwidth=TRUE,notch=FALSE,outline=TRUE,
boxwex=0.5,border=c("blue"),col=c("green"),xlab="Mathematics",ylab="Students Marks",
main="Summary Statistics of Mathematics Marks")
Output:
scores = c (78, 93, 68, 84, 90, 74, 64, 55, 80)
min(scores)
55
max(scores)
93
quantile(scores)
0% 25% 50% 75% 100%
55 68 78 84 93
fivenum(scores)
55 68 78 84 93
summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.
55.00 68.00 78.00 76.22 84.00 93.00
Example 2:
The analysis of data obtained from a cloud seeding experiment is represented below:
A cloud was deemed “seedable” if it satisfied certain criteria; for each seedable cloud a
decision was made at random whether to actually seed. The nonseeded clouds are referred to
as control clouds. The following table presents the rainfall from 26 seeded and 26 control
clouds.
Seeded Clouds: 129.6, 31.4,2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5, 200.7, 274.7,
274.7, 7.7, 1656,978,198.6,703.4,1697.8,334.1,118.3,255,115.3,242.5,32.7,40.6
Control Clouds:
26.1,26.3,87,95,372.4,0.01,17.3,24.4,11.5,321.2,68.5,81.2,47.3,28.6,830.1,345.5,1202.6,36.6,
4.9,4.9,41.1,29,163,244.3,147.8,21.7
Sol:
R Code:
x=c(129.6,31.4,2745.6,489.1,430,302.8,119,4.1,92.4,17.5,200.7,274.7,274.7,7.7,1656,978,198.6,
703.4, 1697.8,334.1,118.3,255,115.3,242.5,32.7,40.6)
y=c(26.1,26.3,87,95,372.4,0.01,17.3,24.4,11.5,321.2,68.5,81.2,47.3,28.6,830.1,345.5,1202.6,
36.6,4.9,4.9,41.1,29,163,244.3,147.8,21.7)
min(x)
max(x)
quantile(x)
quantile(x,0.25)
quantile(x,0.75)
fivenum(x)
summary(x)
min(y)
max(y)
quantile(y)
quantile(y,0.25)
quantile(y,0.75)
fivenum(y)
summary(y)
boxplot(x,y,names=c("Seeded Clouds"," Control Clouds"),horizontal=FALSE,
varwidth=TRUE, notch=FALSE, outline=TRUE, boxwex=0.8, border=c("blue"),
col=c("pink"), xlab="Clouds", ylab="Rainfall",main="Comparative Study")
Output:
a)
min(x)
4.1
> max(x)
2745.6
>quantile(x)
0% 25% 50% 75% 100%
4.100 98.125 221.600 406.025 2745.600
>quantile(x,0.25)
25%
98.125
>quantile(x,0.75)
75%
406.025
>fivenum(x)
4.1 92.4 221.6 430.0 2745.6
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.10 98.12 221.60 441.98 406.02 2745.60
> min(y)
0.01
> max(y)
1202.6
>quantile(y)
0% 25% 50% 75% 100%
0.010 24.825 44.200 159.200 1202.600
>quantile(y,0.25)
25%
24.825
>quantile(y,0.75)
75%
159.2
>fivenum(y)
0.01 24.40 44.20 163.00 1202.60
> summary(y)
b) Yes. Four outliers exist in the seeded clouds and three outliers exist in the control clouds.
c) The seeded clouds have the greater median and greater variability (spread) in their rainfall;
the seeded clouds distribution is more symmetric, while the control clouds distribution is skewed right.
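The outlier counts in (b) can be cross-checked without drawing the plot. The sketch below uses boxplot.stats(), which by default applies the same 1.5 x IQR rule as boxplot(); the vectors repeat the data of this example and the names out_x and out_y are illustrative.

```r
x = c(129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5, 200.7, 274.7, 274.7,
      7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1, 118.3, 255, 115.3, 242.5, 32.7, 40.6)
y = c(26.1, 26.3, 87, 95, 372.4, 0.01, 17.3, 24.4, 11.5, 321.2, 68.5, 81.2, 47.3, 28.6,
      830.1, 345.5, 1202.6, 36.6, 4.9, 4.9, 41.1, 29, 163, 244.3, 147.8, 21.7)

# $out holds the observations lying beyond the whiskers.
out_x = boxplot.stats(x)$out
out_y = boxplot.stats(y)$out

print(out_x)
print(length(out_x))
print(out_y)
print(length(out_y))
```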
Task 1:
Import the excel file ‘Cotton prices-International and Domestic.xlsx’ from your
directory to create a data frame and find:
The standard deviation of
Cotlook.A.Minimum, Cotlook.A.Maximum, Cotlook.A...Average, Shankar.6.Maximum,
Shankar.6.Minimum, Shankar.6.Average.
R Code:
dat=read.csv("Cotton prices.csv")
dat
Output:
Cotlook.A.Minimum Cotlook.A.Maximum Range Cotlook.A...Average
1 79.85 85.30 5.45 81.95
2 79.40 82.20 2.80 80.87
3 81.85 84.80 2.95 83.37
4 83.10 90.35 7.25 85.51
5 88.80 90.90 2.10 89.71
R-CODE:
ct=sd(dat$Cotlook.A.Minimum)
ct1=sd(dat$Cotlook.A.Maximum)
ca=sd(dat$Cotlook.A...Average)
smax=sd(dat$Shankar.6.Maximum)
smin=sd(dat$Shankar.6.Minimum)
avg=sd(dat$Shankar.6.Average)
ct
ct1
ca
smax
smin
avg
Output:
ct
8.927676
ct1
9.65965
ca
9.470624
smax
4970.638
smin
4234.153
avg
4585.738
Experiment Number: 07
Lab Code : U18MAR0001
Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Scatter diagram, Correlation
STEP 1: INTRODUCTION
1. To construct the scatter plot and to visualize the relationship between two quantitative
variables.
2. To find the correlation between two variables in a data set.
3. To find the coefficient of rank correlation between two variables in a data set by
Spearman’s method.
STEP 2: ACQUISITION
2. x=c(a,b,....)
y=c(l,m,....)
r=cor(x,y)
r
Example
Construct the scatter plot and also find the coefficient of correlation and Spearman’s
rank correlation coefficient between the ends per inch (X) and picks per inch (Y).
x 23 27 28 28 29 30 31 33 35 36
y 18 20 22 27 21 29 27 29 28 29
R code:
x=c(23,27,28,28,29,30,31,33,35,36)
y=c(18,20,22,27,21,29,27,29,28,29)
plot(x,y,xlab="ends per inch",ylab="picks per inch",xlim=c(0,50),ylim=c(0,40),
col=c("green"),main="scatter plot of end and picks per inch")
endspicks=cor(x,y)
endspicks
rank=cor(x,y,method="spearman")
rank
Scatter Plot:
Output:
Correlation Coefficient = 0.8176052
Spearman correlation coefficient = 0.8427 (approx.)
Conclusion:
There is a strong positive correlation between ends per inch (X) and picks per inch (Y).
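Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, with tied values receiving average ranks. The following sketch checks this equivalence on the data of the example; the object names rho and rho_ranks are illustrative.

```r
x = c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y = c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

# Spearman's rank correlation computed directly.
rho = cor(x, y, method = "spearman")

# Pearson's correlation applied to the ranks gives the same value.
rho_ranks = cor(rank(x), rank(y))

print(rho)
print(rho_ranks)
```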
Task 1
Calculate the coefficient of correlation from the following figures relating to the
consumption of fertilizer and the output of food grains in a district X:
Chemical fertilizer used (in metric
tonnes):100,110,120,130,140,150,160,170,180,190,200,210,220,230
Output of food (in metric
tonnes):1000,1050,1080,1150,1200,1220,1300,1360,1420,1500,1600,1650,1650,1650
Also draw the scatter plot diagram for the above data and justify the result.
R-Code:
input=c(100,110,120,130,140,150,160,170,180,190,200,210,220,230)
output=c(1000,1050,1080,1150,1200,1220,1300,1360,1420,1500,1600,1650,1650,1650)
plot(input,output,xlab="Chemical fertilizers used",ylab="Output of food",
xlim=c(0,250),ylim=c(0,2000),col=c("green"),
main="Consumption of fertilizers and the output of food grains in a district X")
picks=cor(input,output)
picks
Output: 0.991053
Experiment Number: 08
STEP 1: INTRODUCTION
1. To determine the equations of the regression lines for variables and to predict the
value of one variable when the value of the other variable is given.
2. To construct the regression plot for the given variables.
STEP 2: ACQUISITION
Note:
i) abline(lm(y~x)) --- adds regression line to plot
ii) plot(y~x) --- creates a scatterplot of y versus x
iii) regmodel = lm(y~x) --- fit a regression model
Example
1. Find the coefficient of correlation between the ends/inch (X) and picks per inch (Y).
Also find the two regression lines. Estimate the value of y when x = 26.
x 23 27 28 28 29 30 31 33 35 36
y 18 20 22 27 21 29 27 29 28 29
Also construct the regression plot of y on x and the plot of the two regression lines.
R code:
x = c(23, 27,28,28,29,30,31,33,35,36)
y = c(18,20,22,27,21,29,27,29,28,29)
To find the regression line of y on x
regyx=lm(y~x)
regyx
Output
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-1.7391 0.8913
ie, regression line of y on x is y= -1.7391+0.8913x
To find the regression line of x on y:
regxy=lm(x~y)
regxy
Output:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
11.25 0.75
ie, regression line of x on y is x = 11.25+0.75y
To find y when x=26
y1= - 1.7391+0.8913*26
y1
[1] 21.4347
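The same estimate can be obtained from the fitted model with predict(), which avoids retyping rounded coefficients. A minimal sketch, with the illustrative object names new_x and y_hat:

```r
x = c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y = c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

# Fit the regression line of y on x.
regyx = lm(y ~ x)

# predict() expects a data frame whose column name matches the predictor.
new_x = data.frame(x = 26)
y_hat = predict(regyx, newdata = new_x)
print(y_hat)
```

Because predict() uses the unrounded coefficients, the result can differ from the hand calculation in the last decimal places.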
Regression plot of y on x
R Code:
plot(x,y)
abline(lm(y ~ x),col="dark green")
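Output:
Plot of the two regression lines
R Code:
y = -1.7391+0.8913*x # regression line of y on x
z = -15+1.33*x # regression line of x on y (x = 11.25+0.75y) solved for y
plot (x,y,type="l",col="blue",lwd=5, xlab="x", ylab="y")
lines (x, z, col="red", lwd=2)
title ("2 Regression lines")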
Experiment Number: 09
STEP 1: INTRODUCTION
STEP 2: ACQUISITION
Analysis of variance refers to the separation of variance ascribable to one group of causes
from the variance ascribable to the other group. It is used to test the homogeneity of several
means.
1. Treatments
2. Environmental
3. Residual or Error
1. aov(response~factor,data=data_name)
Example
A drug company tested three formulations of a pain relief medicine for migraine
headache sufferers. For the experiment 27 volunteers were selected and 9 were
randomly assigned to one of three drug formulations. The subjects were instructed to
take the drug during their next migraine headache episode and to report their pain on a
scale of 1 to 10 (10 being maximum pain).
Drug A 4 5 4 3 2 4 3 4 4
Drug B 6 8 4 5 4 6 5 8 6
Drug C 6 7 6 6 7 5 6 5 5
R-code:
pain=c(4,5,4,3,2,4,3,4,4,6,8,4,5,4,6,5,8,6,6,7,6,6,7,5,6,5,5)
drug=c(rep("A",9),rep("B",9),rep("C",9))
data=data.frame(pain,drug)
data
results=aov(pain~drug,data=data)
summary(results)
Output:
pain drug
1 4 A
2 5 A
3 4 A
4 3 A
5 2 A
6 4 A
7 3 A
8 4 A
9 4 A
10 6 B
11 8 B
12 4 B
13 5 B
14 4 B
15 6 B
16 5 B
17 8 B
18 6 B
19 6 C
20 7 C
21 6 C
22 6 C
23 7 C
24 5 C
25 6 C
26 5 C
27 5 C
From the F-table, $F_{\alpha} = 3.40$ (at the 5% level with 2 and 24 degrees of freedom). Since $F > F_{\alpha}$,
we reject the null hypothesis and conclude that the means of the
three drug groups are different.
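The comparison above can also be carried out in R. The sketch below, given under the assumption of a 5% significance level, extracts the F statistic from the ANOVA table and compares it with the critical value from qf(); the object names tab, f_value and f_crit are illustrative.

```r
pain = c(4, 5, 4, 3, 2, 4, 3, 4, 4,
         6, 8, 4, 5, 4, 6, 5, 8, 6,
         6, 7, 6, 6, 7, 5, 6, 5, 5)
drug = factor(c(rep("A", 9), rep("B", 9), rep("C", 9)))

results = aov(pain ~ drug)
tab = summary(results)[[1]]            # ANOVA table as a data frame

f_value = tab[["F value"]][1]          # observed F for the drug factor
f_crit = qf(0.95, df1 = 2, df2 = 24)   # critical value at the 5% level

print(f_value)
print(f_crit)
print(f_value > f_crit)                # TRUE means the null hypothesis is rejected
```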
Task 1
Three machines A, B and C gave the following production of pieces over 4 days. Is there a
significant difference between the machines?
A 17 16 14 13
B 15 12 19 18
C 20 8 11 17
R-CODE:
production=c(17,16,14,13,15,12,19,18,20,8,11,17)
machine=c(rep("A",4),rep("B",4),rep("C",4))
data=data.frame(production,machine)
result=aov(production~machine,data=data)
summary(result)
OUTPUT
Experiment Number: 10
STEP 1: INTRODUCTION
STEP 2: ACQUISITION
The data collected from experiments with randomised block design form a two-way
classification, classified according to two factors – blocks and treatments. The two-way table
has k rows and r columns – ie, N=kr entries.
Example
The following data represents the number of units of loom crank bushes produced per day
turned out by different workers using four different types of machines.
            Machine Type
Workers   A    B    C    D
1         44   38   47   36
2         46   40   52   43
3         34   36   44   32
4         43   38   46   33
5         38   42   49   39
Test whether the 5 workers differ with respect to mean productivity and test whether the mean
productivity is the same for the four different machine types.
R-code:
a=c(44,46,34,43,38,38,40,36,38,42,47,52,44,46,49,36,43,32,33,39)
f=c("w1","w2","w3","w4","w5")
k=5
r=4
worker=gl(k,1,r*k,labels=f)
worker
machine=gl(r,k,k*r)
machine
av = aov(a ~ worker+machine)
summary(av)
Output:
a=c(44,46,34,43,38,38,40,36,38,42,47,52,44,46,49,36,43,32,33,39)
f=c("w1","w2","w3","w4","w5")
k=5
r=4
worker=gl(k,1,r*k,labels=f)
worker
[1] w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5
Levels: w1 w2 w3 w4 w5
machine=gl(r,k,k*r)
machine
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Levels: 1 2 3 4
From the F-table,
$F_{0.05}(4, 12) = 3.26$
$F_{0.05}(3, 12) = 3.49$
$F_1 = 6.574 > F_{0.05}(4, 12) = 3.26$, hence we reject $H_{01}$ and conclude that the 5 workers differ with respect to mean productivity.
$F_2 = 18.388 > F_{0.05}(3, 12) = 3.49$, hence we reject $H_{02}$ and conclude that the 4 machines differ with respect to mean productivity.
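The two F ratios and the table values can be reproduced in R. The following sketch assumes a 5% significance level; the object names tab, f_worker, f_machine, crit_worker and crit_machine are illustrative.

```r
a = c(44, 46, 34, 43, 38, 38, 40, 36, 38, 42, 47, 52, 44, 46, 49, 36, 43, 32, 33, 39)
worker = gl(5, 1, 20, labels = c("w1", "w2", "w3", "w4", "w5"))  # worker of each observation
machine = gl(4, 5, 20)                                           # machine type of each observation

tab = summary(aov(a ~ worker + machine))[[1]]   # ANOVA table as a data frame
f_worker = tab[["F value"]][1]                  # F ratio for workers
f_machine = tab[["F value"]][2]                 # F ratio for machines

crit_worker = qf(0.95, df1 = 4, df2 = 12)       # table value for workers
crit_machine = qf(0.95, df1 = 3, df2 = 12)      # table value for machines

print(c(f_worker, crit_worker))
print(c(f_machine, crit_machine))
```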
Task 1
A company appoints 4 salesmen A,B,C,D and observes their sales in 3 seasons: summer,
winter and monsoon. The figures (in lakhs of Rs.) are given in the following table:
Salesmen
Season A B C D
Summer 45 40 38 37
Winter 43 41 45 38
Monsoon 39 39 41 41
Carry out an analysis of variance.
R-code
a=c(45,43,39,40,41,39,38,45,41,37,38,41)
f=c("Summer","Winter","Monsoon")
k=3
r=4
season=gl(k,1,r*k,labels=f)
season
salesman=gl(r,k,k*r)
salesman
av=aov(a~season+salesman)
summary(av)
output: