R notes

Uploaded by

Mohan Rj

print("this is the first output in R")


x=5 # value assumed here; the notes use x without defining it first
y=10
x+y

#how to create a numeric vector


age <- c(3.2,4.5,2.4,4.4,5.6)
age

#how to create a character vector


name <-c("a","b","c","d")
name

#how to create a logical vector


logi<- c(TRUE,FALSE) # no quotes: c("TRUE","FALSE") would be a character vector
logi
logic<- c(T,F) # T and F are shorthand for TRUE and FALSE
logic

# to extract different ages or likewise


age[1]
# to extract age of first and third child
age[c(1,3)]
# to extract age except that of second child
age[-2]
# to extract age except that of second and fifth

age[-c(2,5)]

# some specialities of R
0/0
1/0
a<- c(Inf,-Inf,Inf)
a
1/a
0/a
Inf/a
Inf-Inf
z<- 2+3i
z

# some mathematical operations


2+2
5-3
1000000*4
25^2
35/3
35%%3

# some relational operations


x=5
y=3
x>y
x<y
x==y

#operations on vectors
log(age)
log(age,2)
log(age^2)

# some more operations


x<-c(10,12,14)
y<-c(-20,30,40)
abs(y)
x+y
x^2+y^2
sqrt(x)
x^(1/3)

# to include a new value in y


y[4]=2
y
x+y
# creating a vector using sequence


a<-1:10
a

seq(1,10,2) # here 2 represents the common difference


age[3:4]

# end of class on 23.11.2020

27.11.2020

# rounding sqrt of x to two decimal places


round(sqrt(x),2)
# other functions like ceiling and floor
ceiling(sqrt(x))
floor(sqrt(x))

# sorting in ascending order


x<- c(10,5,12,45,1,63,14)
sort(x)

# how to find median in different ways


median(x)
# without using median function when n is odd
# (1)
z<- length(x)
m=(z+1)/2
sort(x)[m]
# or in a simpler way
sort(x)[(length(x)+1)/2]
# to find the median when n is even, an if statement has to be used
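
The even-n case mentioned above can be sketched as follows. This is only an illustration: the name my_median is made up here and does not appear in the notes.

```r
# my_median is a hypothetical helper name, not part of the original notes
my_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: mean of the two middle values
  }
}
my_median(c(10,5,12,45,1,63,14))  # odd n
my_median(c(10,5,12,45,1,63))     # even n
```

Both calls should agree with the built-in median() on the same vectors.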

# How to create vectors using ":" and "seq" functions


x=1:10
seq(1.2,4.8,.2)

# relational operators
x=c(10,5,12,45,1,63,14)
x>15
# Methods to find no of elements greater than 15
# (1)
sum(x>15)
# (2)
length(x[x>15])
# (3)
length(which(x>15))

# How to find the sum of odd positions in x


seq(1,length(x),2)
x[seq(1,length(x),2)]
sum(x[seq(1,length(x),2)])

# Another way of creating vectors


rep(10,4)
rep(c(10,12,1,16,4),4)
rep(c(10,12,1,16,4),each=4)
rep(c("m","f"),c(45,55))

PROBLEM: Create a vector of the values of (exp(-lambda)*lambda^x)/factorial(x)


at x=0,1,2,3,4,5,6,7 where lambda=3.7
Ans:
lambda=3.7
x=0:7
z=exp(-lambda)
q=lambda^x
r=factorial(x)
(z*q)/r
(exp(-lambda)*lambda^x)/(factorial (x))
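
The expression in this problem is the Poisson probability mass function, so the result can be cross-checked against the built-in dpois(). A small sketch; the variable names manual and builtin are chosen here for illustration only.

```r
lambda <- 3.7
x <- 0:7
manual <- (exp(-lambda) * lambda^x) / factorial(x)  # formula from the problem
builtin <- dpois(x, lambda)                         # Poisson probabilities from R
all.equal(manual, builtin)
```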

30.11.2020

#PROBLEMS
?How to create the following vector
(a) (0.1^3*0.2^1, 0.1^6*0.2^4, ........, 0.1^36*0.2^34)
solution:
x= (0.1)^(seq(3,36,3))*(.2)^(seq(1,34,3))

(b) (2,2^2/2,2^3/3,.....2^25/25)
solution
x= (2^(seq(1,25,1)))/seq(1,25,1)

(c) summation i=10 to 100 (i^3+4*i^2)


solution
i <- seq(10,100,1)
sum(i^3+4*i^2) # write 4*i^2: in R, 4i is a complex-number literal

(d) summation i=1 to 25 (2^i/i+ 3^i/i^2)


solution
i<- seq(1,25,1)
sum((2^i/i)+(3^i/i^2))

#NOT WRITING THE ENTIRE QUESTION


set.seed(50)
xvec<- sample(0:999,250,replace=T)
yvec<-sample(0:999,250,replace=T)
Suppose x=(x1,x2,....xn) denotes the vector xvec and y=(y1,y2,....yn) denotes the
vector yvec.
That is
x<-xvec
y<-yvec
a) Create the vector (y2-x1,....yn-x(n-1))
ans
y[2:250]-x[1:249] # or
y[-1]-x[-250]

b) Create the vector (sin(y1)/cos(x2),sin(y2)/cos(x3),....sin(y(n-1))/cos(xn))


ans
sin(y[1:249])/cos(x[2:250]) # or
sin(y[-250])/cos(x[-1]) # or
p=y[-250]
q=x[-1]
sin(p)/cos(q)

c) Create the vector (x1+2x2-x3,x2+2x3-x4,......x(n-2)+2x(n-1)-xn)


ans
n<- length(x)
p=x[1:(n-2)]+2*x[2:(n-1)]-x[3:n]

d) Calculate the sum over i=1 to n-1 of exp(-x(i+1))/(xi+10)


ans
sum((exp(-x[2:n]))/(x[1:(n-1)]+10))
exp(-x[2:n])/(x[1:(n-1)]+10)

?Pick the values in y greater than 600


ans
y[y>600]
y[which(y>600)]

?How many nos in x are divisible by 2


sum(x%%2==0)
length(x[x%%2==0])

?Form the vector (|x1-xbar|^(1/2), |x2-xbar|^(1/2), ....., |xn-xbar|^(1/2))


ans
sqrt(abs(x-mean(x)))
# step by step
x-mean(x)
m=abs(x-mean(x))
sqrt(m)

2.12.2020
# matrix-rectangular arrangement of values
# keyword-"matrix"
# how to create a matrix
x<-matrix(0,3,3)
x
# when all the elements are not the same
x1<-matrix(1:9,3,3)
x1
# here the elements are successive and are filled column by column
# to fill across rows
x2<-matrix(1:9,3,3,byrow=T)
x2
# Another way
x1<-matrix(1:9,nrow=3,ncol=3)
x1
# to create a matrix with arbitrary elements
y<-matrix(c(10,4,2,12,5,7,15,8,0),nrow=3,ncol=3)
class(y)
dim(y)

# to extract elements from matrix


# to extract an element
y[1,2] # u will get the element of first row and second column
y[2,2]
y[1,] # u will get all the elements of first row
y[3,]
y[,2] # u will get the second column
y[,3]
# to get the second element of y[,3] use y[,3][2]
y[,3][2]

# to extract submatrix
# to extract a submatrix from consecutive rows and columns
y[1:2,1:2] #u will get a matrix from first two rows and first two columns
y[2:3,2:3]
y[1:3,2:3]
y[1:2,]

# if u need to extract elements from non consecutive rows and columns


y[c(1,3),c(1,3)] # u will get elements from first row, third row, first and third columns
y[c(1,3),]

# how to give row and column names


rownames(y)<-c("f1","f2","f3")
colnames(y)<-c("a","b","c")

4.12.2020

# to find the sum of the rows


sum(y[1,]) # to extract the sum of first row
sum(y[2,])
rowSums(y) # to extract the sum of rows
sum(rowSums(y)) # to extract total sum of rows
# Another way
sum(y[1:3,])
# to find sum of columns
colSums(y) # to extract sums of columns
sum(y[,1]) # to extract sum of first column

rowMeans(y) # gives the mean of each row


colMeans(y) # gives the mean of each column

# to subtract the first row of matrix by its rowmeans


y[1,]-mean(y[1,]) # or
y[1,]-rowMeans(y)[1]
# to standardize the value
standard=(y[1,]-rowMeans(y)[1])/sd(y[1,])

# to add a column to the existing matrix


cbind(y,c(12,3,4))
cbind(y,colMeans(y)) # gives a matrix with the new column as colmeans

# to add a row
rbind(y,c(5,6,7))
rbind(y,rowMeans(y))

log(y) # to get the log of each element


# matrix multiplication
y%*%y

# to find the transpose


t(y)

# to verify a matrix is symmetric


y==t(y) # if all the matrix values are true the matrix is symmetric
sum(y==t(y))==nrow(y)*ncol(y) # returns the value TRUE if matrix is symmetric

# to check if a matrix is idempotent


y%*%y==y #return a matrix with true value if idempotent

# to create an identity matrix (a diagonal matrix of ones)


#trivial way
x<-matrix(c(1,0,0,0,1,0,0,0,1),3,3)
#Alternative
x<-matrix(0,3,3)
diag(x)<-1

#to check whether a given matrix is orthogonal


y%*%t(y)==diag(nrow(y)) # all TRUE if y is orthogonal, i.e. y times its transpose is the identity
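
Checks with == compare every element exactly, which can fail for non-integer matrices purely because of rounding. A hedged sketch of a tolerance-based check follows; the matrices q and s are made-up examples, not taken from the notes.

```r
# q is a hypothetical example: a rotation by 30 degrees, which is orthogonal
theta <- pi / 6
q <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)

exact <- all(q %*% t(q) == diag(2))                  # may be FALSE purely from rounding
tolerant <- isTRUE(all.equal(q %*% t(q), diag(2)))   # TRUE within numerical tolerance
tolerant

# the same idea for symmetry: isSymmetric() also compares with a tolerance
s <- matrix(c(2, 1, 1, 3), 2, 2)
isSymmetric(s)
```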

# to find the trace of a matrix


diag(y)
sum(diag(y))

# to find the product of diagonal elements


prod(diag(y))

# to find factorial using prod function


prod(1:10)

# to check whether a matrix is singular


det(y)==0 # if the value is returned as true the matrix is singular

# to find the inverse of a matrix


solve(y)

row(y) # returns the row values


col(y) #returns the column values

y[col(y)==row(y)]<-1 # returns matrix with diagonal value=1

# to construct an upper triangular matrix (zeros below the diagonal)


y[row(y)>col(y)]<-0

#to construct a lower triangular matrix (zeros above the diagonal)


y[row(y)<col(y)]<-0

#QUESTIONS
A and B are two matrices of order 3x3 where A has elements 1,5,15,4,6,17,3,10,21
and B has 2,10,3,17,3,5,1,7,6 .Obtain (AB)^-1.
ANS
A=matrix(c(1,5,15,4,6,17,3,10,21),3,3,byrow=T)
B= matrix(c(2,10,3,17,3,5,1,7,6),3,3,byrow=T)
First check that AB is not singular
det(A%*%B)==0 # must be FALSE for the inverse to exist
solve(A%*%B)

Solving the system of linear equations


Let a be the coefficient matrix
a<-matrix(c(1,2,3,4,5,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,5,4,3,2,1),5,5,byrow=T)
b<-matrix(c(7,-1,-3,5,17),5,1)
we have ax=b
we need x=a^-1*b
x=solve(a)%*%b
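
As a side note, solve() can also take the right-hand side directly, which avoids forming the inverse. The sketch below reuses a and b from the notes; x1 and x2 are names introduced here only for the comparison.

```r
a <- matrix(c(1,2,3,4,5,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,5,4,3,2,1),5,5,byrow=TRUE)
b <- matrix(c(7,-1,-3,5,17),5,1)

x1 <- solve(a) %*% b    # via the explicit inverse, as in the notes
x2 <- solve(a, b)       # solves a x = b directly
all.equal(x1, x2)
all.equal(a %*% x2, b)  # substituting the solution back recovers b
```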

Calculate double summation i^4/(3+j) where i=1,...,20 and j=1,...,5


[1] Using vector
i<-c(1:20)
j<-c(1:5)
x=sum(i^4)
sum((x)/(3+j)) # or
sum(i^4)*sum(1/(3+j))

[2] Using matrices


Create a matrix with 20 rows and 5 columns .
p<- matrix(rep(1:20,5),20,5)
q<-matrix(rep(1:5,20),20,5,byrow=T)
y=p^4/(3+q)
sum(y)

Calculate double sum i^4/(3+ij).


p
q
sum(p^4/(3+(p*q)))

Calculate double sum i^4/(3+ij) where i=1,...,10 and j=1,...,i


i<- matrix(rep(1:10,10),10,10)
n<-matrix(rep(1:10,10),10,10,byrow=T)
n[row(n)<col(n)]<-0
a<-(i^4)/(3+i*n)
a[row(a)>=col(a)]
sum(a[row(a)>=col(a)])
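
One way to gain confidence in the matrix approach is to compare it with an explicit double loop. This is a sketch for checking only; the names matrix_sum, loop_sum, ii and jj are introduced here.

```r
# matrix approach from the notes
i <- matrix(rep(1:10,10),10,10)
n <- matrix(rep(1:10,10),10,10,byrow=TRUE)
a <- (i^4)/(3+i*n)
matrix_sum <- sum(a[row(a)>=col(a)])   # keep only the cells with j <= i

# explicit double loop over i = 1..10 and j = 1..i
loop_sum <- 0
for (ii in 1:10) {
  for (jj in 1:ii) {
    loop_sum <- loop_sum + ii^4/(3 + ii*jj)
  }
}
all.equal(matrix_sum, loop_sum)
```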

# builds the 5x5 matrix with entries 1+|i-j| band by band
y<- matrix(0,5,5)
y[1,]=y[,1]<-c(1,2,3,4,5)
diag(y)<-1
diag(y[2:5,1:4])=diag(y[1:4,2:5])<-2
diag(y[3:5,1:3])=diag(y[1:3,3:5])<-3
diag(y[1:2,4:5])=diag(y[4:5,1:2])<-4

Construct the matrix A.Check whether a^3=0 and replace the third column of A by the
sum of second and third column.
a<-matrix(c(1,1,3,5,2,6,-2,-1,-3),3,3,byrow=T)
z<-a%*%a
z%*%a
z%*%a==0
x=(a[,2]+a[,3])
a[,3]<-x

Create a matrix B with 15 rows each row with elements 10,-10,10 and find B^TB

b<-matrix(rep(c(10,-10,10),times=15),15,3,byrow=T)
t(b)%*%b

# Order and sort function


vec<-c(10,-2,45,17,6)
sort(vec)
order(vec) # returns the original positions of the sorted values
sort(vec,decreasing=T) # sorts in descending order

# Outer function
y=c(2,4)
vec%o%vec # each element of the first vector is multiplied by every element of the second and a matrix is formed
vec%o%y
# Alternative way
outer(vec,y,"^") # each value of vec raised to each power in y
outer(vec,y,"*")

Find the mean of a vector after omitting minimum and maximum


vec
y<- sort(vec)
x= y[-c(1,length(y))]
sum(x)
mean(x)
eg
a<-seq(2,55,3)
mean(a[-c(1,length(a))])

eg
z<-c(1,5,6,9,10,45,68,98)
t<- sort(z)
p=t[-c(1,length(t))]
sum(p)
mean(p)

11.12.2020

?Create matrix without typing the elements.


1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
a<- matrix(0,5,5)
a<-1+abs(row(a)-col(a))

?Create the following patterned matrices


(a)  1  2  3  4  5  6  7  8  9  10
     2  4  6  8 10 12 ...
     3  6  9 12 ......
     ...........
    10 20 30 40 50 60 70 80 90 100
x<-1:10
y<-1:10
outer(x,y,"*")

(b) 0 1 2 3 4
    1 2 3 4 5
    2 3 4 5 6
    3 4 5 6 7
    4 5 6 7 8
x<-0:4
y<-0:4
outer(x,y,"+")

(c) 0 1 2 3 4
    1 2 3 4 0
    2 3 4 0 1
    3 4 0 1 2
    4 0 1 2 3
x<-0:4
y<-0:4
z=outer(x,y,"+")
z%%5

? Create a 6x10 matrix of random numbers from 1-10 with replacement.

a<-matrix(sample(1:10,60,replace=T),6,10)

a)Find the number of entries in each row which are greater than 4.
length(which(a[1,]>4))
length(which(a[2,]>4))
length(which(a[3,]>4))
length(which(a[4,]>4))
length(which(a[5,]>4))
length(which(a[6,]>4))
# OR
sum(a[1,]>4)
sum(a[2,]>4) # ..... and so on for the remaining rows
# THE METHOD TO BE USED
rowSums(a>4)

b)Which rows contain exactly two occurrences of the number 7?


rowSums(a==7) # gives the no of 7 in each row
rowSums(a==7)==2 # gives TRUE for rows which satisfy the condition and FALSE for others
which(rowSums(a==7)==2) # gives the row no which satisfies the condition

#How to use which in a matrix


which(a==10,arr.ind=T) # the argument is arr.ind; it returns the row and column of each match

14.12.2020

#apply function,list,lapply,sapply

#apply
Consider matrix a. To sort the elements in a column we use sort


sort(a[,1])
To repeat this for every column we would need a loop. The functions
apply, lapply, sapply etc. help to perform such operations without writing loops.

Arguments: apply(matrix name,MARGIN,FUN)


#MARGIN: 1 for row op., 2 for col op. and 1:2 for each element (both row and col)
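
To make the comparison concrete, here is a small sketch that sorts every column once with a loop and once with apply. The seed and the names sorted_loop and sorted_apply are arbitrary choices made for this illustration.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
a <- matrix(sample(1:10,60,replace=TRUE),6,10)

# with an explicit loop over the columns
sorted_loop <- a
for (k in 1:ncol(a)) {
  sorted_loop[,k] <- sort(a[,k])
}

# the same with apply: MARGIN=2 applies sort to each column
sorted_apply <- apply(a,2,sort)

all(sorted_loop == sorted_apply)
```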

#Use the apply function to find the median of the columns of matrix a.
apply(a,2,median)
# to find sd of each column
apply(a,2,sd)
# to find the maximum value in each row
apply(a,1,max)
# to find the log of each element in a
apply(a,1:2,log)

# when some functions are not in-built


#to add 3 to each element of a
a+3
# using apply function
apply(a,1:2,function(x)x+3)

# Raising each element in a to 4


apply(a,1:2,function(x)x^4)

# the range function returns the min and max, not the difference between them
# To obtain the range of each column of a
apply(a,2,function(x)max(x)-min(x))

x<-c(12,9,3,5,7,15)
median(x) #second quartile
quantile(x,0.25) #Q1
quantile(x,0.5) #Q2
quantile(x,0.75) #Q3
quantile(x,0.8) #80 th percentile

#to obtain interquartile range of each column of a


a
apply(a,2,function(x)quantile(x,0.75)-quantile(x,0.25))
OR
apply(a,2,IQR)

#display the min and max of each column


apply(a,2,range)
OR
apply(a,2,function(x)c(min(x),max(x)))

#display the mean and sd of each column of a


m=apply(a,2,function(x)c(mean(x),sd(x)))
rownames(m)<-c("mean","sd")

?Transform the column values of a to lie between 0 and 1.


#divide each by the max value of a.
apply(a,2,function(x)x/max(x))

LISTS-Creating lists and operations on lists

a<-c("India","Japan","US","UK")
b<-c(153.5,170.7,170.2,167.5)
d<-c(TRUE,TRUE,FALSE,FALSE)
#to create a list
list.obj<-list(a,b,d)

list.obj[[2]][4] #to extract the element 167.5


list.obj[[3]][1]

#to arrange the values in b in decreasing order


sort(list.obj[[2]],decreasing=T)

?Create a list containing numeric vector from 1:10 and matrix of order 4x4
having numbers from 1 to 16.
list.var<-list(1:10,matrix(1:16,4,4))
#to get the element 12


list.var[[2]][4,3]
#to get rowsums
rowSums(list.var[[2]])

#Naming the list elements


list.obj
list.obj<-list(country=a,height=b,yesno=d)

list.obj$country #extracts the list country


length(list.obj$yesno)
list.obj$country[3] #extracts the third element of country

#using lapply
lapply(list.obj,length) #to find length

lapply(list.obj[-c(1,3)],mean) #to drop two list elements and find the mean of height.

lapply(list.var,max)

sapply(list.var,max) #to obtain the result in form of vector

15.12.2020

#CONDITIONAL STATEMENTS,LOOPS
#ifelse,if
#ifelse

x<- -10:10
ifelse(x>0,sqrt(x),x^2) # sqrt is evaluated for every x, so R warns about NaNs for the negative values
ifelse(x>0,sqrt(x),"negative")

#using more than one statement in true and false part


ifelse(x>0,{z<-sqrt(x);x+z},{z<-x^2;x+z})

ifelse(x>0,x,-x) #returns absolute value of x

#if
if(x[14]>0)
{
print(log(x[14],2)) #prints log (base 2) of the 14th element if it is >0
}

if(x[20]>0)
{
print(log(x[20],2))
}

#Using if and else


if(x[5]>0)
{
print(sqrt(x[5]))
}else
{
print("negative")
}

#LOOPING

for(i in c(-10,-9,3,6,7))
{
if(i<0)
{
print(i^2)
}else
{
print(log(i,2))
}
}

for(i in 1:5)
{
for(j in 5:1)
print(i+j)
}

? You have two numeric vectors say xvec,yvec each of size 10 sampled from integers
1 to 100 without replacement.Display the total number of xvec less than yvec
values.

xvec<-sample(1:100,10)
yvec<-sample(1:100,10)
sum=0
for(i in 1:10)
{
for(j in 1:10)
{
if(xvec[i]<yvec[j])
{
sum=sum+1
}
}
}
print(sum)
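
The same count can be obtained without loops. A sketch using outer(), with an arbitrary seed and a counter called count (used here instead of sum so that the built-in sum() function is not masked):

```r
set.seed(2)  # arbitrary seed so the example is reproducible
xvec <- sample(1:100,10)
yvec <- sample(1:100,10)

# loop version, as in the notes
count <- 0
for (i in 1:10) {
  for (j in 1:10) {
    if (xvec[i] < yvec[j]) count <- count + 1
  }
}

# outer builds the 10x10 logical matrix of all comparisons xvec[i] < yvec[j]
count_outer <- sum(outer(xvec,yvec,"<"))
count == count_outer
```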

18.12.2020

#while

x<-1 #starting with one, the square of each number up to 10 is printed


while(x<=10)
{
print(x*x)
x=x+1
}

x<-5
while(x<=10)
{
print(x-1)
x=x+2
}

#function in R- user defined or inbuilt

[1]
quadratic<-function(x) #here x is a local variable
{
a=2
b=4
d=7
print(a+b*x+d*x^2) # now give values to x
}
quadratic(6)
quadratic(c(2,4))
quadratic(2:7)

we can also give a default value for x


quadratic<-function(x=6) #here 6 is the default value of x
{
a=2
b=4
d=7
print(a+b*x+d*x^2)
}
quadratic()

[2]
add<-function(x,y)
{
print(x+y)
}
add(c(2,4),c(5,6))
add(x=3,y=4)
add(10000,1000000)
#giving default value


add<-function(x=5,y=10)
{
print(x+y)
}
add()
add(,1)
add(x=2)

[3]
power<-function(x,y)
{
a=2
b=3
print(a*x+b*y)
}
power(2,3)
power(1,1)
power(3,2)

[4]
#to check symmetry

mat<-function(x)
{
ifelse((t(x)==x),"symm","asymm")
}
mat(matrix(c(1,0,0,0,1,0,0,0,1),3,3,byrow=T))

#using if
mat<-function(x)
{
if(all(t(x)==x)) # if needs a single TRUE/FALSE, so the comparison is collapsed with all()
{
print("yes")
}else
{
print("no")
}
}
mat(matrix(c(1,0,0,0,1,0,0,0,1),3,3,byrow=T))

#using return

power<-function(x,y)
{
z<-x^y
return(z)
}
power(3,4)

#returning more than one variable

mathop<-function(x=5,y=10)
{
a=x+y
b=x-y
d=x*y
e=x/y
z<-list(a,b,d,e)
return(z)
}
mathop()

?Write a function of the name "descriptive" that accepts a vector of numeric values
and returns the mean,sd,median,max,min,range of the vector.

descriptive<-function(x)
{
a=mean(x)
b=sd(x)
d=median(x)
e=max(x)
f=min(x)
g=max(x)-min(x)
z<-list(mean=a,sd=b,median=d,max=e,min=f,range=g)
return(z)
}
des<-descriptive(c(5,6,7,8))
des$median

21.12.2020

?Write functions tmpFn1 and tmpFn2 such that if xVec is the vector (x1,x2,x3,..
xn) then tmpFn1(xVec) returns the vector (x1,x2^2,x3^3....) and tmpFn2(xVec) returns
the vector (x1,x2^2/2,x3^3/3......)
Ans
tmpFn1<-function(xvec)
{
z<-xvec^c(1:length(xvec))
return(z)
}
xvec<-c(1:5)
tmpFn1(xvec)
OR

tmpFn1<-function(xvec)
{
for(i in 1:length(xvec))
{
xvec[i]=xvec[i]^i
}
return(xvec)
}
xvec<-c(1:5)
tmpFn1(xvec)

OR
xvec<-c(1:5)
tmpFn1=function(xvec)
{
i=1:5
z=xvec^i
return(z)
}
tmpFn1(xvec)
OR

tmpFn1=function(xvec)
{
i=1:length(xvec)
z=xvec^i
return(z)
}
xvec<-c(1:10)
tmpFn1(xvec)

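tmpfn1<-function(x,n=1)
{
z=x^n
return(z) # nothing placed after return() is ever executed
}
tmpfn1(1:6,1:6) # the powers are passed explicitly as the second argument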

tmpFn1<-function(x)
{
a=seq(1,5,1)
z=x^a
return(z)
}
tmpFn1(1:5)

b)
tmpFn2<-function(xvec)
{
i=1:length(xvec)
z=xvec^i/i
return(z)
}
xvec<-1:10
tmpFn2(xvec)
OR

tmpFn2<-function(xvec)
{
for(i in 1:length(xvec))
{
xvec[i]=(xvec[i]^i)/i
}
return(xvec)
}
xvec<-1:10
tmpFn2(xvec)

tmpFn2<-function(x)
{
a=seq(1,5,1)
z=(x^a)/a
return(z)
}
tmpFn2(1:5)

tmpfn2<-function(x,n)
{
z=(x^n)/n
return(z)
}
tmpfn2(1:6,1:6)

?Write a function tmpfn3 which takes 2 arguments x and n where x is a single


number and n is a positive integer. The function should return the value of
1+x/1+x^2/2+......+x^n/n

tmpfn3<-function(x,n)
{
i=1:n
m<- (x^i)/i
z<-1+sum(m)
return(z)
}
tmpfn3(2,5)

?Write a function tmpFn(xvec) such that if xvec is the vector x=(x1,x2....xn)


then tmpFn(xvec) returns vector of moving averages:
(x1+x2+x3)/3,(x2+x3+x4)/3,.......(x(n-2)+x(n-1)+xn)/3
Try out your function; for example, try tmpFn(c(1:5,6:1))

tmpFn<-function(xvec)
{
i=3:length(xvec)
z=(xvec[i-2]+xvec[i-1]+xvec[i])/3
return(z)
}
xvec<-c(1:5,6:1)
tmpFn(xvec)
OR
tmpFn<-function(xvec)
{
i=3:length(xvec)
z=(xvec[i-2]+xvec[i-1]+xvec[i])/3
return(z)
}
xvec<-c(1:5)
tmpFn(xvec)
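
For comparison, the same three-point moving averages can be obtained from stats::filter(). This is a sketch rather than part of the original answer; ma and ma_trimmed are names introduced here.

```r
tmpFn <- function(xvec) {
  i <- 3:length(xvec)
  (xvec[i-2] + xvec[i-1] + xvec[i]) / 3
}
xvec <- c(1:5,6:1)

# three equal weights give centred moving averages; the first and last
# positions have no complete window and come back as NA
ma <- stats::filter(xvec, rep(1/3,3), sides=2)
ma_trimmed <- as.numeric(ma[!is.na(ma)])

all.equal(tmpFn(xvec), ma_trimmed)
```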
?Consider the continuous function


f(x)= x^2+2x+3 if x<0
= x+3 if 0<=x<2
= x^2+4x-7 if 2<=x
Write a function tmpFn which takes a single argument xvec.The function should
return the vector of values of the function f(x) evaluated at the values in
xvec.

tmpFn<-function(xvec){
n=1:length(xvec)
for(i in n){
if(xvec[i]<0){
y=xvec[i]
print((y^2)+(2*y)+3)
}
if((xvec[i]>=0)&(xvec[i]<2)){
y=xvec[i]
print(y+3)
}
if(xvec[i]>=2){
y=xvec[i]
print((y^2)+(4*y)-7)
}
}}
tmpFn(seq(-3,3,0.1))

tmpFn<-function(x)
{
ifelse(x < 0, x^2 + 2*x + 3, ifelse(x < 2, x+3, x^2 + 4*x - 7))
}
tmpFn(seq(-3,3,0.1))

?Write a function which takes a single argument that is a matrix. The function


should return a matrix which is the same as the function argument but every odd number
is doubled.
Hence the result of using the function on matrix
 1  1  3
 5  2  6
-2 -1 -3
should be:
 2  2  6
10  2  6
-2 -2 -6

mat<-function(x)
{
z<-ifelse(x%%2==1,2*x,x) # test every element: odd ones are doubled, the rest are kept
return(z)
}
mat(matrix(c(1,1,3,5,2,6,-2,-1,-3),3,3,byrow=T))

mat<-function(x)
{
x[x%%2==1]<- 2*x[x%%2==1]
print(x)
}
mat(matrix(c(1,1,3,5,2,6,-2,-1,-3),3,3,byrow=T))
mat(x)

Q.Suppose x0=1 and x1=2 and xj=x(j-1)+2/(x(j-1)) for j=2,3,....


Write a function testLoop which takes the single argument n and returns the first
n-1 values of the sequence {xj}, j>=0, that means the values of x0,x1,x2,...x(n-2).
Now write a function testLoop2 which takes a single argument yvec which is
a vector. The function should return
sum(e^j), j=1:n, where n is the length of yvec.

a)
testLoop1<-function(n)
{
xvec<-rep(NA,n-1)
xvec[1]=1
xvec[2]=2
print(xvec[1])
print(xvec[2])
for(j in 3:(n-1))
{
xvec[j]=xvec[j-1]+ (2/xvec[j-1])
print(xvec[j])
}
}
testLoop1(10)
OR
testLoop1<-function(n)
{
xvec<-numeric(n-1) # xvec must exist inside the function before its elements are assigned
x0=xvec[1]=1
x1=xvec[2]=2
print(x0)
print(x1)
for(j in 3:(n-1))
{
xvec[j]=xvec[j-1]+ (2/xvec[j-1])
print(xvec[j])
}}
testLoop1(10)

b)
testLoop2<-function(yvec)
{
m=0
for(j in 1:length(yvec))
{
m<-m+ (exp(j))
}
return(m)
}
testLoop2(c(2,3,4))

OR
testLoop2<-function(yvec)
{
n<-length(yvec)
j=1:n
z=sum(exp(j))
print(z)
}

Q. Given a vector x = (x1, . . . , xn), the sample autocorrelation of lag k
is defined to be
r_k = [sum over i=k+1 to n of (x_i - xbar)*(x_(i-k) - xbar)] / [sum over i=1 to n of (x_i - xbar)^2]

a)Write a function tmpFn1(xVec) which takes a single argument xVec which
is a vector and returns a scalar r1. In particular, for the vector X = (2,5,8,
...,53,56) compute tmpFn1(X).

tmpFn1<-function(xvec)
{
n<-length(xvec)
t=mean(xvec)
p=(xvec-t)
r1=sum(p[2:n]*p[1:(n-1)])/sum(p^2)
r2=sum(p[3:n]*p[1:(n-2)])/sum(p^2)
z<-list(r1,r2) # returns both the lag 1 and the lag 2 autocorrelation
return(z)
}
tmpFn1(seq(2,56,3))
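
The lag can also be made an argument, and the result checked against R's acf(). A minimal sketch: the function name autocorr is hypothetical, and the comparison assumes acf() with its default settings (mean removed, correlation type).

```r
# autocorr is a hypothetical name for a general lag-k version
autocorr <- function(xvec, k) {
  n <- length(xvec)
  p <- xvec - mean(xvec)
  sum(p[(k+1):n] * p[1:(n-k)]) / sum(p^2)
}
X <- seq(2,56,3)
r1 <- autocorr(X,1)
r2 <- autocorr(X,2)

# acf() stores lag 0 first, so lags 1 and 2 are elements 2 and 3
builtin <- acf(X, lag.max=2, plot=FALSE)$acf[2:3]
all.equal(c(r1,r2), builtin)
```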
28.12.2020

letters
LETTERS
tolower(LETTERS)
toupper(letters)

Q.Display AaBbCc.......Zz.Use the inbuilt function letters and LETTERS.


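paste0(LETTERS,letters,collapse="") # gives the single string "AaBbCc...Zz"
paste(LETTERS,letters) # gives "A a" "B b" ... with a space inside each pair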

I<-LETTERS
for(i in 1:length(I))
{
print(c(I[i],tolower(I[i])))
}

OR
Z<-LETTERS
for(i in 1:length(Z))
{
print(Z[i])
print(tolower(Z[i]))
}

#ANONYMOUS FUNCTIONS

(function(x) {x*x})(-5:5)
(function(x,y) {x+y}) (2:5,3:6)
(function(x,y=5) {x+y}) (1:3)
(function(x,y){z<-x+y;z^2}) (1:3,2:4)

30.12.2020

Q.Display all the prime nos between 2 and 100

d=0
primeno<-c(2:100)
for(i in primeno)
{
for(j in 1 :i)
{
if(i%%j==0)
{
d=d+1
}
}
if(d==2)
{
print(i)
}
d<-0
}
OR

primeVec <- c(2)

for(i in 3:100)
{
if(sum(i%%(2:(i-1))==0)==0)
{
primeVec <- c(primeVec, i)
}

}
primeVec

for(num in 3:100) #WRONG: the if has no braces, so break always runs on the first pass and every num is printed
{
for(j in 2:(num-1))
{
if(num%%j==0)
print("not prime")
break
}
print(num)
}
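
One possible way to repair the loop marked WRONG is to put the break inside braces and remember whether a divisor was found. This is a sketch; the names primes and is_prime are introduced here.

```r
primes <- c(2)
for (num in 3:100) {
  is_prime <- TRUE                # assume prime until a divisor is found
  for (j in 2:(num-1)) {
    if (num %% j == 0) {
      is_prime <- FALSE
      break                       # leave the inner loop only when a divisor exists
    }
  }
  if (is_prime) primes <- c(primes, num)
}
primes
```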

#ways to create a table

tabulate(c(2,5,3,2,3,3,2,5,3,1)) #has several disadvantages

#preferred way to create a table-use table() fun


y<-table(c(2,5,3,2,3,3,2,5,3,1))
y

z<-table(c("a","z","c","a","c","d"))
class(z)

names(y) #gives the names of elements of the table y

y[1] #gives the frequency at first position


y[3]

sum(y) #gives the sum of all frequencies


which(y==max(y))

#Finding the modal value using a table


which(y==max(y)) #gives the position of maximum frequency
names(which(y==max(y))) #gives the name of that position
# to convert in to numeric
as.numeric(names(which(y==max(y))))

#Checking if a value is numeric


x=8
is.numeric(8)
x="e"
is.numeric(x)

#table function can also be used to construct cross tabulations

age<-c(8,8,9,10,9,8,10,9,9)
height<-c(100,105,100,102,100,100,105,102,100)
kids<-table(age,height)
kids

# finding proportion across rows


prop.table(kids,1)
# finding proportion across columns
prop.table(kids,2)
#finding relative frequency table
prop.table(kids)

is.matrix(kids) #output is TRUE since the table is a matrix

rowSums(kids) #marginal dist of age


colSums(kids) #marginal dist of height

Q.Form a frequency table of 1000 independent tosses of a die.


d<-c(sample(1:6,1000,replace=T))
table(d)

4.1.2021
#DATAFRAMES

age<-c(20,22,21,21,20)
state<-c("D","M","C","B","K")
gender<-c("M","M","F","F","F")
cgpa<-c(8.5,9.7,6.8,8.9,5.3)
major<-c("S","M","M","S","S")
stud.details<-data.frame(age,state,cgpa,gender,major)
class(stud.details)

# Another way to create a data frame

dd<-data.frame()
fix(dd) # opens a spreadsheet-like editor in which the values are typed (interactive sessions only)

#construct a matrix
e<-matrix(1:9,3,3)
rownames(e)<-c("stud1","stud2","stud3")
colnames(e)<-c("present","absent","halfday")
#converting the matrix to a dataframe
as.data.frame(e)

#to capture certain columns/rows in the dataframe


stud.details[,1] #captures age
stud.details[,3] #captures cgpa
stud.details[,4] #captures gender
factor(stud.details[,4]) #shows the level
a<-c("Y","Y","N","Y","N")
factor(a)

#to print details of 3rd student


stud.details[3,]
#to print details of 3rd,4th and 5th student
stud.details[3:5,]

Q.Print the age, state and major of 3rd and 5th student.
stud.details[c(3,5),c(1,2,5)]

summary(stud.details) #gives min,max,Q1,Q2,Q3,mean

stud.details$age #will also give age

#conditional selection of rows based on columns

Q.Print the details of all female students


stud.details[stud.details$gender=="F",]

Q.Print details of all students whose age is above 21.


stud.details[stud.details$age>21,]

Q.Print details of all students whose major is statistics and whose cgpa is
above 8.
stud.details[(stud.details$major=="S")& (stud.details$cgpa>8),]

Q.Find the mean,sd of cgpa of male students who majored in statistics.


s<-stud.details[(stud.details$gender=="M")&(stud.details$major=="S"),]
s
mean(s$cgpa)
sd(s$cgpa)

Q.Find the summary stats of male and female students separately.


summary(stud.details[stud.details$gender=="M",])
summary(stud.details[stud.details$gender=="F",])

#Preparing a table of gender against age

table(stud.details$age,stud.details$gender)

#Another way to create a table using xtabs


xtabs(formula=~age+gender,data=stud.details)

The second argument can be avoided if we use attach function first


attach(stud.details)
y<-xtabs(formula=~age+gender)
class(y)
prop.table(y,1)

6.1.2021

#QUESTIONS(only the answers are given)

1.
Age<-c(21,20,21,22,21,20,22,21)
Gender<-c("M","M","F","F","M","F","M","M")
freq.table<-table(Age,Gender)
freq.table

#to find marginal frequency

age_margin<-margin.table(freq.table,1)
#or
rowSums(freq.table)

gender_margin<-margin.table(freq.table,2)
#or
colSums(freq.table)

# for relative freq


prop.table(freq.table)

2.
gender<-c("M","F","F","M","F","M","M","F","M","M")
qualification<-c("UG","PG","BTECH","BTECH","UG","UG","PG","PG","BTECH","UG")
age<-c(24,27,28,25,21,34,26,25,34,27)
marks<-c(68,78,67,77,86,56,89,90,55,67)

#creating dataframe
dd<-data.frame(gender,qualification,age,marks)

#no of males and females


sum(dd$gender=="M")
sum(dd$gender=="F")

#displaying details of those who qualified BTECH


dd[dd$qualification=="BTECH",]

#standardizing marks and adding it as a new column marks1


marks1<-(dd$marks-mean(dd$marks))/sd(dd$marks)
dd$marks1<-marks1
dd
OR

cbind(dd,marks1)

Q.Arrange the data according to increasing order of age.


dd[order(dd$age),]

Q.Arrange the data according to increasing order of age and marks.


dd[order(dd$age,dd$marks),]
#this code will arrange the data first wrt age and then within age, marks will be
#arranged in increasing order. This is like a conditional ordering

8.1.2021

#non inclusion of NA
#how to omit NAs in the data
# let
x<-c(40,29,31,NA,45,NA)
# To perform operations like sum(x),max(x) ... the NA values first need to be
# omitted.
sum(na.omit(x))
mean(na.omit(x))
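
Many summary functions also accept na.rm=TRUE, which gives the same result as calling na.omit() first. A short sketch on the same vector; s1, s2, m1 and m2 are names used only here.

```r
x <- c(40,29,31,NA,45,NA)
sum(x)                      # NA: any missing value makes the result missing
s1 <- sum(na.omit(x))
s2 <- sum(x, na.rm=TRUE)    # drops the NAs inside the call
m1 <- mean(na.omit(x))
m2 <- mean(x, na.rm=TRUE)
c(s1 == s2, m1 == m2)
```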

#in-built dataframes

to see the inbuilt dataframes in R use


data()
examples
Titanic (in built array)
iris (in built dataset that talks about 3 flower species)
airquality
Orange

#reading data from data files


Method 1:Reading data from a file stored in a location

read.table("pathname",header,sep)
eg of pathname: C:/Users/Program files/File name

If the data set is comma separated use sep="," and if it is colon separated use
sep=":"

Eg
age_height<-read.table("C:/Users/acer/Desktop/age_height.txt",header=T)

'header' takes either TRUE or FALSE. If header=TRUE, the first line of the file will be
treated as column names

Method 2:
read.table(file.choose(),header=T)
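
To try read.table() without depending on a file on the desktop, a small file can be written to a temporary location first. The data values below are made up for the illustration, and path is a name introduced here.

```r
# write a small comma separated file to a temporary location (hypothetical data)
path <- tempfile(fileext=".txt")
writeLines(c("age,height","8,100","9,105","10,102"), path)

# header=TRUE uses the first line as column names; sep="," splits on commas
age_height <- read.table(path, header=TRUE, sep=",")
age_height
str(age_height)
```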

#
str(iris) #will give information on columns of data frame

?Display the 5th row of iris


iris[5,]

?Display the 5th,10th ...50th rows of iris.


iris[seq(5,50,5),]

?Display the first 20 rows of iris


head(iris,20)
head(iris) #by default this command gives the first 6 rows

?Display the last 20 rows of iris.


tail(iris,20)

?.Display 10 randomly selected obs from iris.


iris[sample(1:150,10),]

attach(iris)
iris$Petal.Length[which(iris$Petal.Length>4)]

Consider the dataset airquality

#find the summary of airquality after omitting NA values.


summary(na.omit(airquality))

#how to apply some in built math/stat function in columns of data frame.


#function is apply

?Find the mean values of the columns of air quality


apply(na.omit(airquality),2,mean)

?Find the sd values of the first 4 columns of airquality after omitting NA values
apply(na.omit(airquality)[,1:4],2,sd)

11.1.2021

iris

#creating a data frame

sample1<-iris[sample(1:50,10),]
sample2<-iris[sample(51:100,10),]
sample3<-iris[sample(101:150,10),]

sampleiris<-rbind(sample1,sample2,sample3)

#mean
apply(sampleiris[1:10,c(2,4)],2,mean)
apply(sampleiris[11:20,c(2,4)],2,mean)
apply(sampleiris[21:30,c(2,4)],2,mean)

b)
attach(sampleiris)
#summary
summary(Sepal.Length)
summary(Petal.Length)
OR
summary(sampleiris[,c(1,3)])

c)
#order
sampleiris[order(Sepal.Length,Petal.Length),]

d)
#ratio
ratio1<-Sepal.Length/Sepal.Width
ratio2<-Petal.Length/Petal.Width
sampleiris$ratio1<-ratio1
sampleiris$ratio2<-ratio2
sampleiris

OR
ratio1=Sepal.Length/Sepal.Width
ratio2=Petal.Length/Petal.Width
cbind(sampleiris,ratio1,ratio2)

e)
m1<-mean(sampleiris[1:10,3])
sum(sampleiris[1:10,3]>m1)

m2<-mean(sampleiris[11:20,3])
sum(sampleiris[11:20,3]>m2)

m3<-mean(sampleiris[21:30,3])
sum(sampleiris[21:30,3]>m3)

2.#reading a file
read.table("C:/Users/SUMA P P/Documents/disease.status.txt",header=T,sep="\t")
read.table(file.choose(),header=T,sep="\t")

13.01.2021

#subset

?Display the rows in sample iris for which sepal length is more than 5.
attach(sampleiris)
sampleiris[Sepal.Length>5,]

#This can also be done using the subset function


#subset function selects those rows satisfying the condition
#usage:subset(dataframe,condition,select)
#condition refers to either a single or multiple conditions on columns and
#select is to select some columns of interest

subset(sampleiris,Sepal.Length>5)

? Display the rows of sampleiris with petal length and petal width for which
sepal length is above 5.
subset(sampleiris,Sepal.Length>5,select=c(3,4))
subset(sampleiris,Sepal.Length>5,select=c(Petal.Length,Petal.Width))
subset(sampleiris,Sepal.Length>5,select=-c(1,2,5))

? Display the rows of sampleiris with petal length and petal width for which
sepal length is above 5 and petal length>4.
subset(sampleiris,Sepal.Length>5 & Petal.Length>4,select=c(3,4))

?Display mean of petal length and petal width based on rows of sampleiris
satisfying above condition.
a<-subset(sampleiris,Sepal.Length>5 & Petal.Length>4,select=c(3,4))
apply(a,2,mean)
OR
mean(a$Petal.Length)
mean(a$Petal.Width)

?Display mean and sd of the above.


apply(a,2,function(x)c(mean(x),sd(x)))
apply(a,2,function(x)x/max(x)) #normalization: each column divided by its maximum
Consider airquality
subset(airquality,Wind>9) #NA values will also be shown
subset(na.omit(airquality),Wind>9)

#transform function
?Attach a new column in sampleiris as the sqrt of Sepal length
attach(sampleiris)
new.sepal.length<-sqrt(Sepal.Length)
cbind(sampleiris,new.sepal.length)

#This can be done using transform function


#usage:transform(dataframe,transformation on column)

?Do a square root transform of sepal length


transform(sampleiris,new=sqrt(Sepal.Length))

?Do square root transform of both petal length and sepal length.
transform(sampleiris,new1=sqrt(Sepal.Length),new2=sqrt(Petal.Length))

18.1.2021

#within function
#within-we can transform the variables and the transformed variables can be
#used for further operations or manipulations

?Do a square root transformation on sepal length and petal length in sample
iris and find the difference btw the transformed variables.

attach(sampleiris)
dd<-within(sampleiris,{
sqrtseplen<-sqrt(Sepal.Length)
sqrtpetlen<-sqrt(Petal.Length)
dif<-sqrtseplen-sqrtpetlen})

Q.Create the following text file in your desktop


studname marks1 marks2
aaa 18 20
bbb 17 16
ccc 15 14
ddd 20 11
eee 19 20

1)Save as marks.txt

2)Read the file in R using the pathname and save it as studmarks

studmarks<-read.table("C:/Users/SUMA P P/Documents/marks.txt",header=T)

3)Create two new variables marks1_100 and marks2_100 where they are obtained by
rescaling the marks1 and marks2 values (out of 20) to marks out of 100.

transform(studmarks,marks1_100=marks1*5,marks2_100=marks2*5)
OR
transform(studmarks,marks1_100=(marks1/20)*100,marks2_100=(marks2/20)*100)

4)Do Q3 using within command.Also obtain average marks of each student based
on marks1_100 and marks2_100.Save the output in the object dd.Name the average
marks as avgmark

dd<-within(studmarks,{
marks1_100<-marks1*5
marks2_100<-marks2*5
avgmark<-(marks1_100+marks2_100)/2})
OR
dd<-within(studmarks,{
marks1_100<-(marks1/20)*100
marks2_100<-(marks2/20)*100
avgmark<-(marks1_100+marks2_100)/2})

5)Export the dataframe dd to text file and excel file named "newmarks" and
store in desktop.

#text file
write.table(dd,"C:/Users/SUMA P P/Desktop/newmarks.txt",row.names=F)

#excel file (a tab separated text file saved with the .xls extension, which Excel can open)

write.table(dd,"C:/Users/SUMA P P/Desktop/newmarks.xls",row.names=F,sep="\t")

20.01.2021

1.#first 20
head(airquality,20)

2.#colnames
colnames(airquality)

3.#order
attach(airquality)
head(airquality[order(Temp),],20)

4.
airquality[order(Temp,Solar.R),]

5.
attach(airquality)
na.omit(airquality[order(Temp,Solar.R),])

6.
summary(na.omit(airquality)[,-c(5,6)])

7.
apply(na.omit(airquality)[,-5],2,function(x)c(sd(x),max(x)-min(x)))

8.
attach(airquality)
apply(airquality[Month==5,c(3,4)],2,mean)
apply(airquality[Month==6,c(3,4)],2,mean)
apply(airquality[Month==7,c(3,4)],2,mean)
apply(airquality[Month==8,c(3,4)],2,mean)
apply(airquality[Month==9,c(3,4)],2,mean)
OR
tapply(Wind,Month,mean,na.rm=T)
tapply(Temp,Month,mean,na.rm=T)
OR
for(i in 5:9)
{
x<-apply(subset(airquality,Month==i,c(3,4)),2,function(x)c(mean(x),sd(x)))
print(x)
}
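
The month-wise means in question 8 can also be obtained in a single call. A sketch using aggregate with a formula; the name month_means is illustrative only.

```r
# Mean of Wind and Temp for each Month in one call (no attach() needed)
month_means <- aggregate(cbind(Wind, Temp) ~ Month, data = airquality, FUN = mean)
month_means
```

The result is a data frame with one row per month, which is convenient for export with write.table.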

9.
attach(airquality)
transform(airquality,log_ozone=log(Ozone,2),log_temp=log(Temp,2))

10.1)
within(na.omit(airquality),{
log_ozone<-log(Ozone,2)
diff1<- Ozone-log_ozone})
2)
within(na.omit(airquality),{
log_temp<-log2(Temp)
diff2<- Temp-log_temp})

11.
attach(airquality)
subset(airquality,Ozone>20)

12.
subset(airquality,Ozone>20,select=c(Solar.R,Wind))

13.
m<-subset(na.omit(airquality),Ozone>20,c(Solar.R,Wind))
apply(m,2,function(x)c(mean(x),sd(x)))
14.
attach(airquality)
Solar.complete<-Solar.R
Solar.complete[is.na(Solar.complete)]<-mean(na.omit(Solar.complete))
cbind(airquality,Solar.complete)
OR
airquality$Solar.complete<-ifelse(is.na(Solar.R),mean(Solar.R,na.rm=T),Solar.R)

15.
airquality.subset<-na.omit(subset(airquality,Ozone>20,c(Solar.R,Wind)))
write.table(airquality.subset,"C:/Users/SUMA P P/Desktop/airquality.xls",row.names=F,sep="\t")

22.1.2021

#split and merge

#merge
a<-iris[1:10,]
b<-iris[51:60,]
#rowwise merge
rbind(a,b)

a<-iris[1:10,1:2]
b<-iris[1:10,3:4]
#columnwise merge
cbind(a,b)

#split
#categorical variables can be split

attach(iris)
g<-split(iris,Species)
class(g) # class of g is list
str(g) #each category i.e,each Species is a dataframe

To extract list elements

g[[1]] OR
g$setosa
#both return the data frame for setosa; g[1] returns a list of length one

Q.Display the mean of first four columns of the species setosa.


apply(g$setosa[,1:4],2,mean)
OR
lapply(g$setosa[,1:4],mean)

Q.Display the mean and sd of first four columns of the species setosa.
apply(g$setosa[,1:4],2,function(x)c(mean(x),sd(x)))

Q.Use the Orange data set in R and display the mean and sd of age and circumference
for each tree type:
1)Using subset function
attach(Orange)
apply(subset(Orange,Tree==1,select=c(2,3)),2,function(x)c(mean(x),sd(x)))
apply(subset(Orange,Tree==2,select=c(2,3)),2,function(x)c(mean(x),sd(x)))
apply(subset(Orange,Tree==3,select=c(2,3)),2,function(x)c(mean(x),sd(x)))
apply(subset(Orange,Tree==4,select=c(2,3)),2,function(x)c(mean(x),sd(x)))
apply(subset(Orange,Tree==5,select=c(2,3)),2,function(x)c(mean(x),sd(x)))

2) Using split and apply/lapply


t<-split(Orange,Tree)
apply(t$'1'[,2:3],2,function(x)c(mean(x),sd(x)))
apply(t$'2'[,2:3],2,function(x)c(mean(x),sd(x)))
apply(t$'3'[,2:3],2,function(x)c(mean(x),sd(x)))
apply(t$'4'[,2:3],2,function(x)c(mean(x),sd(x)))
apply(t$'5'[,2:3],2,function(x)c(mean(x),sd(x)))

OR
s<-c(type1=t$'1'[,2:3],type2=t$'2'[,2:3],type3=t$'3'[,2:3],type4=t$'4'[,2:3],type5=t$'5'[,2:3])
lapply(s,function(x)c(MEAN=mean(x),SD=sd(x)))
OR
lapply(split(Orange, Orange$Tree), function(x){apply(x[,2:3],2,function(x)c(Mean=mean(x),SD=sd(x)))})
OR
for(i in 1:5)
{
b<-subset(Orange,Tree==i)
print(apply(b[,-1],2,function(x)c(MEAN=mean(x),SD=sd(x))))
}
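
A compact variant of the split approach that avoids attach() and the repeated calls per tree. This is only a sketch; by_tree and tree_stats are illustrative names.

```r
# Split the two numeric columns of Orange by tree
by_tree <- split(Orange[, c("age", "circumference")], Orange$Tree)

# For each tree, compute mean and sd of every column (a 2 x 2 matrix per tree)
tree_stats <- lapply(by_tree, function(d)
  sapply(d, function(v) c(Mean = mean(v), SD = sd(v))))

tree_stats[["1"]]   # results for tree type 1
```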

25.1.2021

#cut function
#It helps to categorise a continuous variable
g<-c(2,1,3,2,4,5,1,2,5,2,4,6)
cut(g,3)
Each value in g is assigned to an interval. The width of each interval is found
as the range divided by the number of intervals. That is,
(6-1)/3=1.667
#Here the interval is right closed.To make it right open we use:
cut(g,3,right=F)
#Instead of specifying number of intervals we can specify the breaks.
cut(g,breaks=c(0,3,6))
#Including labels for the interval:
b<-cut(g,breaks=c(0,3,6),labels=c("I","K"))
class(b) #"factor"
table(b)
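
The interval width used by cut(g,3) can be checked by hand. A small sketch; width and inner_breaks are illustrative names.

```r
g <- c(2, 1, 3, 2, 4, 5, 1, 2, 5, 2, 4, 6)

# Interval width: range divided by the number of intervals
width <- diff(range(g)) / 3            # (6 - 1) / 3
inner_breaks <- min(g) + width * 1:2   # the two interior cut points

# cut() widens the two outer limits slightly so min(g) and max(g) are included
table(cut(g, 3))
```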

Q.Consider the salary of 10 employees as below:


15000,10000,5000,24000,32000,3000,45000,26000,18000,9000
Display the number of employees under "low","middle" and "high" income
categories where the categories are defined as below:
low category - income below 10,000
middle category - income between 10,000 and 25,000
high category - income above 25,000

a<-c(15000,10000,5000,24000,32000,3000,45000,26000,18000,9000)
s<-cut(a,breaks=c(0,10000,25000,50000),labels=c("low","middle","high"),right=F)
table(s)
plot(s)

#PLOTTING IN R
#The traditional way of plotting is available in the graphics package
#grid approach to plotting: ggplot2 and lattice packages

#to draw plots


Most basic plots are drawn with the generic function plot(). It accepts many
arguments and it is not necessary to use all of them.
# Some or most of these arguments are called graphical parameters

#To add colors


plot(s,col=c("red","green","yellow"))

#To add title and subtitle


plot(s,col=c("red","green","yellow"),main="Bar plot",
sub="barplot of income level")

Q.Practice question:
Consider the following set of random values of some characteristic interest:
123.5,142.7,155.3,120.4,112.8,110.9,152.6,147.2
Take a sample of size n=8 with replacement from the above values and give it as
argument to the function named 'boot'. The function should find the mean and
variance of the sampled observations and the results are added as a row to
a data frame named "sam".
The above process has to be repeated 100 times. Display the contents of
sam.

boot<-function(x)
{
samp<-matrix(x,100,8)
Mean<-apply(samp,1,mean)
SD<-apply(samp,1,sd)
sam<-data.frame(Mean,SD)
print(sam)
}
x<-sample(c(123.5,142.7,155.3,120.4,112.8,110.9,152.6,147.2),800,replace=T)
boot(x)
x<-sample(c(115,123,146,234,134,145,167,156),800,replace=T)
boot(x)

OR

sam<-data.frame(Mean=rep(0,100),Variance=rep(0,100))
xvec<-c(123.5,142.7,155.3,120.4,112.8,110.9,152.6,147.2)

#define the function once, outside the loop
boot<-function(x){
m=mean(x)
v=var(x)
return(c(m,v))
}
#each iteration draws one sample of size 8 and stores its mean and variance
for(i in 1:100){
sam[i,]=boot(sample(xvec,replace = T))
}
sam

OR
sam<-data.frame()
boot<-function(x){
for(i in 1:100){
y=sample(x,8,replace=TRUE)
temp<-c(mean(y),sd(y))
sam<-rbind(sam,temp)
}
colnames(sam)<-c("Mean","SD")
sam
}
boot(c(123.5,142.7,155.3,120.4,112.8,110.9,152.6,147.2))
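
The same exercise can be sketched without an explicit loop using replicate. The seed is an assumption added only to make the run reproducible.

```r
xvec <- c(123.5, 142.7, 155.3, 120.4, 112.8, 110.9, 152.6, 147.2)

# boot(): mean and variance of one sample
boot <- function(x) c(Mean = mean(x), Variance = var(x))

set.seed(1)  # assumption: fixed seed for reproducibility only
# 100 resamples of size 8 with replacement; t() puts one resample per row
sam <- as.data.frame(t(replicate(100, boot(sample(xvec, 8, replace = TRUE)))))
head(sam)
```

replicate returns a 2 x 100 matrix here, so transposing gives the required 100 rows with columns Mean and Variance.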

29.01.2021
#PLOTTING BARPLOTS

emp<-(c(rep("unemp",10),rep("emp",5)))
#Since the given data has only characters,extract the frequencies using table().
a<-table(emp)
barplot(a)

#Arguments to the function barplot

#To get a horizontal orientation of bars


barplot(a,horiz=T)

#To change the widths of the bars


barplot(a,width=c(2,1))

#To label the bars


barplot(a,names=c("employed","unemployed"))

#To draw lines over the bar


barplot(a,names=c("employed","unemployed"),density=5)
barplot(a,names=c("employed","unemployed"),density=c(3,5)) #gives a different density of lines to each bar.

#To give colors


barplot(a,names=c("employed","unemployed"),col=c("red","yellow"),density=c(3,5))
barplot(a,names=c("employed","unemployed"),col=rainbow(2),density=c(3,5)) #another way of giving colors

#Legend(not very beneficial in a simple bar diagram)


barplot(a,legend=c("employed","unemployed"),names=c("employed","unemployed"),
col=rainbow(2),density=c(3,5))

#Giving main title and sub title


barplot(a,main="barplot of the employment counts",sub="village details",
names=c("employed","unemployed"),col=rainbow(2))

#To increase the limit of y axis and label the y axis


barplot(a,main="barplot of the employment counts",sub="village details",names=c("employed","unemployed"),
col=rainbow(2),ylab="counts",ylim=c(0,15))
#ylim=c(0,15) #will increase the y axis limit to 15
#xlab- used to label the x axis
#Multiple bar diagram

#using a table

emp<-(c(rep("unemp",10),rep("emp",5)))
gender<-c(rep("m",2),rep("f",3),rep("m",6),rep("f",4))
z<-table(emp,gender)

barplot(z) #this gives a stacked/sub-divided bar diagram

#To get a side by side barplot


barplot(z,beside=T)

#To plot a bar diagram after reversing order of gender and emp
barplot(t(z),beside=T)

Here a legend makes more sense


barplot(z,beside=T,legend=c("employed","unemployed"))

#Using data frame

year<-c(2015:2020)
admissions<-c(67,75,60,72,66,72)
#Here we do not have a categorical data hence using a table doesn't make much sense

barplot(admissions,year) #trivial way of plotting above data; the second positional argument of barplot is width, so year only sets the bar widths here


The above bar diagram has no labels on the x axis. So to label the x axis:
barplot(admissions,year,ylim=c(0,100),names=c("2015","2016","2017","2018","2019","2020"))

#Better way to do this is by using data frame


stats<-data.frame(year,admissions)
barplot(admissions~year,data=stats,ylim=c(0,100))
#The tilde operator ~ is used to define a formula in R. admissions~year indicates that admissions
#is the numeric response and year is the categorical (grouping) variable.

Consider:
year<-c(2015:2020)
admissions<-c(67,75,60,72,66,72)
dropouts<-c(10,3,12,10,5,12)

#To draw a barplot with above three vectors

stats<-data.frame(year,admissions,dropouts)
barplot(cbind(admissions,dropouts)~year,data=stats,ylim=c(0,100))

#To draw a multiple bar diagram


barplot(cbind(admissions,dropouts)~year,data=stats,ylim=c(0,100),beside=T)

1.02.2021

?Create a dataframe using given data:


age_group<-c(1:6)
males<-c(30,36,40,45,24,12)
females<-c(27,34,56,50,24,10)
dd<-data.frame(age_group,males,females)

a)Draw a multiple barplot of total males and total females across age groups and
color the bars. Also give a proper legend.
Give the title as 'Barplot of counts'.

barplot(cbind(males,females)~age_group,data=dd,beside=T)
barplot(cbind(males,females)~age_group,data=dd,beside=T,legend=c("males","females"),
col=c("red","green"),main="Barplot of counts")

b) Find the odds of males and females and draw the multiple barplot and give
the title as barplot of odds.
odd_male<-dd$males/dd$females
odd_female<-dd$females/dd$males
ff<-data.frame(age_group,odd_male,odd_female)
barplot(cbind(odd_male,odd_female)~age_group,data=ff,beside=T,main="Barplot of odds")
#histogram
marks<-c(35,38,42,47,30,56,67,63,71,79,83,94,58)
hist(marks) #the data should be continuous
hist(marks,breaks=3) #specifying the number of bins to be produced
#In the above case we still get 4 bins, since a single number given to breaks is only
#a suggestion and R chooses "pretty" break points. To force R to produce 3 bins:
hist(marks,breaks=c(30,50,70,100))
#Here we get density in y axis.To convert it into frequency and to include
#the lower limits
hist(marks,breaks=c(30,50,70,100),freq=T,include.lowest=T)

#To color
hist(marks,breaks=c(30,50,70,100),freq=T,include.lowest=T,col=rainbow(3))

Histogram gives an idea on the location of mean,spread of data,skewness,
kurtosis and also on the distribution.We can also draw a density curve based
on histogram.

p=hist(marks,col=rainbow(3),density=3)
names(p)
p$breaks #gives the break points
p$counts #gives the count/frequency of each interval
p$density #gives the density of each interval, i.e. relative frequency divided by the interval width
p$mids #gives the mid value
p$xname
p$equidist

plot(p)
text(p$mids,p$counts)

#Line plots
plot(marks) #by default we get points plotted against their index; use type="l" for a line plot
plot(log(marks)) #transformation on marks
plot(marks,main="line plot of marks",type="l")

types of line plots


"p"- points
"l"-line through points
"o"-line over the points
"b"-draw both points and line but do not overlap
"h"-vertical bars
"c"-just like type "b" but without points
"s"/"S"-stair case type going either horizontally or vertically

plot(marks,type="c",lty=2)
lty-line type

3.2.2021
math_marks<-c(12,19,9,16,18)
stat_marks<-c(14,20,7,12,14)

plot(math_marks,type="l",lty=4,col="red",main="line plot of marks",
ylab="marks of maths")

#to draw a multiple line plot


plot(math_marks,type="l",lty=4,col="red",main="line plot of marks",
ylab="marks",ylim=c(0,20))
lines(stat_marks,type="l",lty=2,col="blue")
legend("topleft",legend=c("math","stat"),lty=c(4,2),col=c("red","blue"))

#using dataframe in legend


marks<-data.frame(math_marks,stat_marks)
colnames(marks)

plot(math_marks,type="l",lty=4,col="red",main="line plot of marks",
ylab="marks",ylim=c(0,20))
lines(stat_marks,type="l",lty=2,col="blue")
legend("topright",legend=colnames(marks),lty=c(4,2),col=c("red","blue"))

#drawing multiple line plot using for loop

Q.Draw the line plot of Sepal length,sepal width,petal length and petal width
using a single plot.Give suitable title to the plot.Label the y axis and give
proper legends.Write inference.

plot(iris[,1],type="l",ylim=c(0,8),main="iris data",ylab="length and width")

for(i in 2:4)
{
lines(iris[,i],type="l",lty=i,col=i) #ylim is set once in plot(); lines() does not take it
}
legend("topright",legend=colnames(iris)[1:4],lty=c(1:4),col=c(1:4)) #only the four plotted columns

Instead of using for loop

iris
attach(iris)
plot(iris[,1],type="l",lty=4,col="red",main="Line plot",ylim=c(0,10),ylab="iris")
lines(iris[,2],type="l",lty=2,col="blue")
lines(iris[,3],type="l",lty=2,col="green")
lines(iris[,4],type="l",lty=2,col="black")
legend("topright",legend=colnames(iris[,-5]),lty=c(4,2,2,2),
col=c("red","blue","green","black"))

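Inference: Sepal length, petal length and petal width increase across the three
species (rows 1-50 setosa, 51-100 versicolor, 101-150 virginica); sepal width
does not show this pattern.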

Q.Refer to question 12 in word doc.Draw the line plot of the function at
0.5,1.2,2.4,3.5,0.7,2.6.

tmpFn<-function(x)
{
ifelse(x < 0, x^2 + 2*x + 3, ifelse(x < 2, x+3, x^2 + 4*x - 7))
}
x<-c(.5,.7,1.2,2.4,2.6,3.5)
y<-tmpFn(x)

plot(x,y,type="l",lty=4,col="red",ylab="tmpFn",main="tmpFn values")

a<-c(0.5,1.2,2.4,3.5,0.7,2.6)
tmpFn<-function(x)
{
ifelse(x < 0, x^2 + 2*x + 3, ifelse(x < 2, x+3, x^2 + 4*x - 7)) }
x<-tmpFn(a)
plot(x,type="b",lty=2,col="blue",ylim=c(3,20))

5.2.2021

#scatter plot
Scatter plot makes sense only for bivariate distribution
height<-c(123,149,95,116,168)
weight<-c(45,60,34,52,64)
plot(height,weight,pch=10,col="green",main="scatterplot of height/weight")

?Use iris data set and:


1.Draw the scatter plot of Sepal Length against Petal Length

attach(iris)
plot(Sepal.Length,Petal.Length,pch=20,col="black")

2.Draw the scatter plot of Sepal length vs Petal length for each of the
species.
plot(iris[1:50,1],iris[1:50,3],pch=20,main="Setosa",xlab="Sepal length",
ylab="Petal length")
plot(iris[51:100,1],iris[51:100,3],pch=20,main="Versicolor",xlab="Sepal length",
ylab="Petal length")
plot(iris[101:150,1],iris[101:150,3],pch=20,main="Virginica",xlab="Sepal length",
ylab="Petal length")

OR

a=subset(iris, Species == "setosa")


a
plot(a[,c(1,3)], main="setosa")

b=subset(iris, Species == "versicolor")


b
plot(b[,c(1,3)], main="versicolor")

d=subset(iris, Species == "virginica")


d
plot(d[,c(1,3)], main="virginica")

#partitioning a plot window


par(mfrow=c(2,2))
plot(iris[1:50,1],iris[1:50,3],pch=20,main="Setosa",xlab="Sepal length",
ylab="Petal length")
plot(iris[51:100,1],iris[51:100,3],pch=20,main="Versicolor",xlab="Sepal length",
ylab="Petal length")
plot(iris[101:150,1],iris[101:150,3],pch=20,main="Virginica",xlab="Sepal length",
ylab="Petal length")

height<-c(123,149,95,116,168)
weight<-c(45,60,34,52,64)
par(mfrow=c(1,2))
plot(height,type="l",col="red")
plot(weight,type="l",col="green")

#pairwise plots
#If a data frame is plotted, each column in the data frame is plotted against every other column
plot(iris[,-5])
or
pairs(iris[,1:4])
#pairs function can be used to check collinearity that is whether the independent
#variables are actually independent.

#To check collinearity of certain columns only:


pairs(~Sepal.Length+Petal.Length+Petal.Width,main="Scatter plot")

#drawing a horizontal or vertical line in a plot


height<-c(123,149,95,116,168)
plot(height,type="l")
abline(h=120)

10.2.2021

#boxplot
Boxplots are mainly used to know whether distribution from which the samples
are drawn is skewed and whether outliers are present in the sample.
Q3+Q1-2Q2=0 for a symmetric distribution (this is the numerator of Bowley's
coefficient of skewness, (Q3+Q1-2Q2)/(Q3-Q1))

x<-c(5,10,12,3,14,28,18,10,16,22)
boxplot(x)
summary(x)
Boxplot is also called box and whisker.
The bold line is the median.
If Q3-Q2>Q2-Q1 -positive skewness
If Q3-Q2<Q2-Q1 -negative skewness

#Further operations on a boxplot


boxplot(x,main="Boxplot",col=rainbow(1))

x<-c(5,10,12,3,14,28,18,10,16,22,80)
boxplot(x,main="Boxplot",col=rainbow(1))
Outlier is indicated by a dot.
Boxplot is used to make inferences about the data.
From the boxplot constructed we can infer that:
The data is almost symmetric and there is a presence of outlier in the data.

12.2.2021
#Boxplot of mpg corresponding to cylinders in data "mtcars"
boxplot(mpg~cyl,data=mtcars,main="car mileage data",xlab="no of cylinders",
ylab="miles per gallon")
dbinom(5,20,0.2)

#Boxplot corresponding to the columns of iris across species


par(mfrow=c(2,2))
boxplot(Sepal.Length~Species,data=iris,ylim=c(0,8))
boxplot(Petal.Length~Species,data=iris,ylim=c(0,8))
boxplot(Sepal.Width~Species,data=iris,ylim=c(0,8))
boxplot(Petal.Width~Species,data=iris,ylim=c(0,8))

#generating 50 random nos where n=20 and p=.8
x<-rbinom(50,20,0.8)

#to find their prob values (same n and p as used for generation)
dbinom(x,20,0.8)

#finding the distribution function, here P[X<=75] for n=100 and p=.5
pbinom(75,100,.5)

Generate 100 random numbers & calculate the density of these numbers
y<-rbinom(100,50,.7)
dbinom(y,50,.7)

15.2.2021

#plotting boxplot of age corresponding to different tree types


boxplot(age~Tree,data=Orange,col=rainbow(5))
summary(Orange)

Example:
height<-c(123,140,127,135,153,165,172,181,162,163)
gender<-c("m","m","f","m","f","f","f","m","m","f")
dd<-data.frame(height,gender)
boxplot(height~gender,data=dd)

Inference:
For females Q3-Q2<Q2-Q1 ,hence negatively skewed
For males Q3-Q2>Q2-Q1 ,hence positively skewed
The spread of height of males is larger than that of females.

Probability distributions

Prefixes to remember:
d-density,p-distribution function (cdf),q-quantile,r-random number generation
Keywords for each distribution:
binom,pois,exp,norm,gamma,unif

?Find P[X=3] where X~B(5,0.5)


Method1: Without using function
choose(5,3)*(.5)^3*(.5)^2
But doing this manually is cumbersome

Method 2: Use function


dbinom(3,5,0.5)

?Find P[X<=3]
Method 1:
sum(dbinom(0,5,.5)+dbinom(1,5,.5)+dbinom(2,5,.5)+dbinom(3,5,.5))
Method 2:
pbinom(3,5,.5)

?Find P[X>3]
P[X>3]=1-P[X<=3]
1-pbinom(3,5,.5)
OR
pbinom(3,5,0.5,lower.tail=F)

?Find P[X<3]
sum(dbinom(0:2,5,.5))

?Find P[3<=X<=5]
sum(dbinom(3:5,5,.5))
or
pbinom(5,5,.5)-pbinom(3,5,.5)+dbinom(3,5,.5)
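
For a discrete X the interval probability can be written as P[a<=X<=b]=F(b)-F(a-1). A short sketch checking that the approaches above agree; p1, p2 and p3 are illustrative names.

```r
n <- 5; p <- 0.5

# P[3 <= X <= 5] computed three ways
p1 <- sum(dbinom(3:5, n, p))                               # sum of point probabilities
p2 <- pbinom(5, n, p) - pbinom(2, n, p)                    # F(5) - F(2)
p3 <- pbinom(5, n, p) - pbinom(3, n, p) + dbinom(3, n, p)  # F(5) - F(3) + P[X = 3]
c(p1, p2, p3)
```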

#Quantile values
Given a probability, if we want to find the value of x that satisfies the
prob value, we use the quantile function

?Obtain x such that P[X<=x]=.25


qbinom(.25,5,.5)
x=2 is the output (for a discrete distribution, the smallest x with P[X<=x]>=.25)


This is used in hypothesis testing

p=c(.25,.5,.75).Find the corresponding x values


qbinom(c(.25,.5,.75),5,.5)

#Generation of random numbers

1) Inverse transformation method


W.K.T the cdf F is non decreasing and takes values in the interval [0,1],
and for a continuous X, F(X) is uniform in (0,1)

Choose a random value btw [0,1].Call this value as u.


Equate F(x)=u and solve for x.This is inverse transformation.

Example;
Let X~exp(theta)
F(x)=1-exp(-theta*x)
Equate F(x)=u i.e, 1-exp(-theta*x)=u
exp(-theta*x)=1-u
(-theta*x)=ln(1-u)
x= -ln(1-u)/theta
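
The derivation above can be sketched in R as follows. The rate theta=1.3, the sample size and the seed are values chosen only for illustration.

```r
theta <- 1.3   # rate parameter (illustrative value)
n <- 10000     # number of observations to generate (illustrative value)

set.seed(42)                 # fixed seed for reproducibility only
u <- runif(n)                # step 1: u chosen at random from (0,1)
x <- -log(1 - u) / theta     # step 2: solve F(x) = u for x

# The sample mean should be close to the theoretical mean 1/theta
c(sample_mean = mean(x), theoretical = 1 / theta)
```

A histogram of x with curve(dexp(x,theta),add=T) overlaid gives a visual check that the generated values behave like rexp(n,theta).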

?Generate 10 random obs from binomial(5,.5)


x<-rbinom(10,5,.5)
table(x)

QUESTIONS
Let X follow binomial distribution with n=10,p=.3.
a)Evaluate the binomial probs for x=2,4,6,8
dbinom(seq(2,8,2),10,.3)

b)Evaluate:
i) P[X<=4]
pbinom(4,10,.3)

ii) P[3<=X<=7]
pbinom(7,10,.3)-pbinom(3,10,.3)+dbinom(3,10,.3)

iii) P[2<X<5]
sum(dbinom(3:4,10,.3))

iv) P[3<X<=8]
pbinom(8,10,.3)-pbinom(3,10,.3)

v) P[X>=7]
1-pbinom(6,10,.3)
OR
pbinom(6,10,.3,lower.tail=F)

c)Obtain the x values corresponding to the 25th, 50th and 75th percentiles


qbinom(c(.25,.5,.75),10,.3)

d)Generate 50 random obsvs from X.


x<-rbinom(50,10,.3)

e)Tabulate the frequencies and draw the barplot.


y<-table(x)
barplot(y)

17.2.2021
#Poisson distribution

?Let X~P(lambda=4.5) then find:

a)P[X=5]
dpois(5,4.5)

b) P[X>4]
ppois(4,4.5,lower.tail=F)
OR
1-ppois(4,4.5)

c)x such that we have the 10th percentile


qpois(.1,4.5)

d) Generate 100 random numbers


rpois(100,4.5)

?Let X~P(lambda=1.2)

a) Find P[X<=3]
ppois(3,1.2)

b) Find P[X>5]
ppois(5,1.2,lower.tail=F)

c) Find the median


qpois(.5,1.2)

d) Find the 90th percentile


qpois(.9,1.2)

Question

Let X follow Poisson distribution and has the following frequencies.


X : 0 1 2 3 4 5 6 and above
f: 53 45 38 24 17 8 4
Find the expected frequencies.

Sol:
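x<-c(0,1,2,3,4,5,6)
freq<-c(53,45,38,24,17,8,4)
lambda<-sum(x*freq)/sum(freq) #estimate of lambda (sample mean)
#P[X=0],...,P[X=5] come from dpois; the last class "6 and above" is P[X>=6]
prob<-c(dpois(0:5,lambda),ppois(5,lambda,lower.tail=F))
exp_freq<-prob*sum(freq)
exp_freq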

?If lambda=1.2 find


a) density at .3,2.4
dpois(round(c(.3,2.4)),1.2)

b)cdf at 3.5
ppois(3.5,1.2)

#Exponential distribution
It is also called lifetime distribution.

dexp(5,1.3)
pexp(5,1.3)
qexp(.5,1.3)
rexp(10,1.3)

To plot exponential values we need to use a histogram instead of barplot as
the distribution is continuous.

Q.Generate 100 obs from exponential distribution with theta =1.3 and obtain
the histogram and density plot of exp distribution.

y=rexp(100,1.3)
#plotting a histogram

hist(y,freq=F,col="green")
#to plot against density values

#plotting the density values that is f(x)


z<-dexp(sort(y),1/mean(y))
#y terms are sorted as unsorted data will not give a correct plot.
plot(sort(y),z,type="l",col="yellow",ylab="f(x)")

Q. The lifetime of a certain model of electric bulb follows an exponential dist
with mean theta=90.3(hrs). Suppose a random sample of 10 bulbs resulted in the
following lifetimes. Obtain the plots of density and cdf in a single window.
lifetime: 67.3,78.1,87.3,90.4,95.1,101.8,107.4,89.2,97.3,75.2

par(mfrow=c(1,2))
life<-c(67.3,78.1,87.3,90.4,95.1,101.8,107.4,89.2,97.3,75.2)
m<-dexp(sort(life),1/90.3)
plot(sort(life),m,type="l",main="density",col=rainbow(1))
s<-pexp(sort(life),1/90.3)
plot(sort(life),s,type="l",main="cdf",col=rainbow(2))

19.2.2021
#adding a curve on the histogram

y<-rexp(100,1.3) #same rate as the density curve drawn below
hist(y,freq=F,col="purple")
curve(dexp(y,1.3),col="green",add=T,xname="y")

Superimposing a density curve on a histogram will give an idea about the
distribution. If the curve follows the heights of the histogram bars closely,
the distribution is said to fit the data.

Q.It is assumed that the following random obs of X are from exponential
distribution with theta=.8
X:1.2,1.6,.6,1.5,2.7,2.9,3.1,.7,.58,3.1,4.6,4.9
Draw the histogram of X and add the exponential density curve in the histogram.
Does the plot reveal whether X is exponentially distributed.

x<-c(1.2,1.6,.6,1.5,2.7,2.9,3.1,.7,.58,3.1,4.6,4.9)
hist(x,freq=F,col="red")
curve(dexp(x,.8),add=T,col="yellow")
# if we use any variable name other than "x", we need to include the
# argument "xname"
# eg: curve(dexp(y,.8),add=T,col="yellow",xname="y")
The data do not appear to be exponentially distributed, as the curve does not
follow the shape of the histogram.

We can test several characterization results such as:


Lack of memory property of the exponential distribution
The sum of independent exponential variables follows a gamma distribution
(To be worked out on your own)

#Normal distribution
If parameters are not specified,R considers it as standard normal dist.

?Suppose Z has standard normal distribution.Find:

a)P[Z<=3]
pnorm(3)

b)P[-3<Z<3]
pnorm(3)-pnorm(-3)
#This can also be found directly by 3-sigma rule

c)f(0) and F(0)


dnorm(0)
pnorm(0)

d)f(1.3) and f(-1.3)


dnorm(1.3)
dnorm(-1.3)

Q.Generate 1000 random obs from standard normal dist.Obtain the histogram and
plot the density curve on the histogram.
x<-rnorm(1000)
hist(x,freq=F,col="yellow")
curve(dnorm(x),add=T,col="green")

Q.Suppose X is normally distributed with mean=1.2 and variance=4.
Compute:

a)P[X<=2]
pnorm(2,mean=1.2,sd=sqrt(4))

b)P[-2<X<0]
pnorm(0,1.2,2)-pnorm(-2,1.2,2)

c)70th percentile of X
qnorm(.7,1.2,2)

Q.The marks obtained in statistics(X) by students of a class are assumed to be
normally distributed with mean 72 and variance 2.3. How many students have
obtained marks below 60, marks above 80 and marks btw 40 and 60.Assume that
the total no of students who took the test is 500.

Sol.#Obtain the corresponding prob and multiply with the total no of students
#below 60
pnorm(60,72,sqrt(2.3))*500

#above 80
500*pnorm(80,72,sqrt(2.3),lower.tail=F)

#btw 40 and 60
500*(pnorm(60,72,sqrt(2.3))-pnorm(40,72,sqrt(2.3)))

Q.1.Select 1000 random nos from normal distribution with mean 5 and sd 2.
2.Calculate the mean of the generated obs
3.Repeat steps 1 and 2 200 times.
4.Plot the histogram of sample mean
5.Does the shape look like a bell shape? Why or why not?

Solution.
1.
y<-rnorm(1000,5,2)
2.
mean(y)
3.
x<-rep(NA,200)
for(i in 1:200)
{
ran<-rnorm(1000,5,2)
x[i]<-mean(ran)
}

4.
hist(x,main="Histogram of sample mean",col="blue")
5.
The shape is bell shaped.Since the generated values are normal, their sample
means would also be normal
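
Steps 1 to 3 can also be sketched with replicate instead of a for loop, together with a numerical check of the sampling distribution. The seed and the name xbar are assumptions for illustration.

```r
set.seed(123)  # fixed seed for reproducibility only
# 200 sample means, each from 1000 draws of a normal dist with mean 5 and sd 2
xbar <- replicate(200, mean(rnorm(1000, 5, 2)))

# Theory: the sample mean is normal with mean 5 and sd 2/sqrt(1000)
c(mean = mean(xbar), sd = sd(xbar), theoretical_sd = 2 / sqrt(1000))
```

hist(xbar) then gives the same kind of histogram as in step 4.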

22.2.2021

#Quantile-quantile plot(QQ PLOT)

#How to know whether the given random observations are generated from a
specified or assumed distribution?

?Let X(continuous rv) take sample values 123.4,11.4,127.8,132.6,143.7,110.5,
108.9,109.4,106.3.Based on the sample observations can it be concluded that the
distribution of X is normal with mean 110 and sd 1.3?

This is called fitting of a dist or goodness of fit.


Hypothesis H: X has normal distribution with mean 110 and sd 1.3
We need to verify the validity of H based on sample data.There are two
approaches for this:

Mathematical approach: Using tests like chi-square and Kolmogorov-Smirnov
Graphical approach: Using QQ plot

#How QQ plot is constructed?


x axis: quantiles based on sample data
y axis: quantiles obtained from the assumed distribution under H
Plot the two sets of quantiles against each other; if the plot lies close to
the straight line y=x then the sample is consistent with the hypothesis.

#Code for the above question:


x<-c(123.4,11.4,127.8,132.6,143.7,110.5,108.9,109.4,106.3)
qsample<-quantile(x,seq(.01,.99,.01))
qdist<-qnorm(seq(.01,.99,.01),110,1.3)
plot(qsample,qdist,type="l")
Interpretation: Based on QQ plot it is observed that the given random sample
does not come from normal distribution with mean 110 and sd 1.3.
(It could be normal but with different means)
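
A sketch of the same QQ plot with the reference line y=x added, which makes the departure from H easier to judge. The axis labels and the name probs are illustrative additions.

```r
x <- c(123.4, 11.4, 127.8, 132.6, 143.7, 110.5, 108.9, 109.4, 106.3)
probs <- seq(.01, .99, .01)

qsample <- quantile(x, probs)     # sample quantiles
qdist <- qnorm(probs, 110, 1.3)   # quantiles of the distribution assumed under H

plot(qsample, qdist, type = "l",
     xlab = "sample quantiles", ylab = "quantiles under H")
abline(0, 1, lty = 2)   # reference line y = x
```

Here the sample quantiles span a far wider range than the quantiles under H, so the plotted line departs strongly from the reference line.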

Question:
Generate 1000 random numbers from an exponential dist with theta 1.5.Obtain the
QQ plot.

x<-rexp(1000,1.5)
sample<-quantile(x,seq(.01,.99,.01))
dist<-qexp(seq(.01,.99,.01),1.5)
plot(sample,dist,type="l")
Interpretation:Based on QQ plot it is observed that the random sample comes
from an exponential distribution with theta 1.5

#HYPOTHESIS TESTING

Hypothesis testing is done to:


1.Test for specified mean,equality of means(dependent and independent samples)
2.Test for ratio of variances(independent samples)
3.Test for proportions,equality of proportions
4.ANOVA- one way, two way

Test for specified mean (small sample)-one sample t test
Assumption: X follows normal dist with mean mu and sd sigma.
Null hypothesis(H):mu=mu_0
Alternative (K): mu!=mu_0 or mu>mu_0 or mu<mu_0
Test statistic t= (xbar-mu_0)/(s/sqrt(n)) which follows t dist with n-1 df.

?.The following data on X have been generated from a normal dist with unknown
mean and unknown sd.
X: 123.4,11.4,127.8,132.6,143.7,110.5,108.9,109.4,106.3
Stating the hypothesis ,verify whether the population mean of X is 110.3.

Solution:
Let mu denote the pop mean of X.The hypothesis to be tested is:
H : mu=110.3 against K: mu!=110.3
The test used in testing H against K is one sample t test.
The test statistic is given as:t= (xbar-mu_0)/(s/sqrt(n)) which follows t dist
with n-1 df where n is the sample size.

x<-c(123.4,11.4,127.8,132.6,143.7,110.5,108.9,109.4,106.3)
Function:
t.test(x,mu=110.3)
If the test is one tailed, the extra argument 'alternative' ("less" or "greater")
should be given; the default is two sided.
If mu is not specified R assumes the value to be zero.

#How the decision on whether to accept or reject is made?


Decision will be based on the p value.If the p value is larger than the chosen
level of significance (say 5% or 1%) then we do not reject the null hypothesis
at that level.If it is smaller we reject the null hypothesis.

#Output:t = -0.16195, df = 8, p-value = 0.8754

#Interpretation:
Since p value is .8754 which is much larger than 5% we conclude that there is
no sample evidence to reject the null hypothesis at 5%.Thus we do not reject the
null hypothesis; the sample obsvs are consistent with
a normal dist with mean 110.3.
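
The test statistic and p value reported by t.test can be reproduced from the formula t=(xbar-mu_0)/(s/sqrt(n)). A sketch; t_stat and p_val are illustrative names.

```r
x <- c(123.4, 11.4, 127.8, 132.6, 143.7, 110.5, 108.9, 109.4, 106.3)
mu0 <- 110.3
n <- length(x)

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # test statistic
p_val <- 2 * pt(-abs(t_stat), df = n - 1)       # two-sided p value

c(t = t_stat, df = n - 1, p = p_val)
```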

Question
Use iris data set. Test whether the population mean of sepal length is 4.5
against mean is greater than 4.5.

Solution:
Let mu denote the pop mean of sepal length.The hypothesis to be tested is:
H : mu=4.5 against K: mu>4.5
The test used in testing H against K is one sample t test.
The test statistic is given as:t= (xbar-mu_0)/(s/sqrt(n)) which follows t dist
with n-1 df where n is the sample size.

#code
attach(iris)
t.test(Sepal.Length,mu=4.5,alternative="greater")

#Output:t = 19.868, df = 149, p-value < 2.2e-16

#Interpretation:
Since p value is less than 2.2e-16 which is much less than 5% we conclude that
there is strong sample evidence against the null hypothesis at 5%.Thus we
reject the null hypothesis and conclude that the population mean of sepal length
is greater than 4.5.

thick<-c(7.5,7.60,7.65,7.70,7.55,7.55,7.40,7.40,7.50,7.50)
# let mu be the mean thickness in hundredths of an inch for pieces of gum
t.test(thick,mu=7.5)
# interpretation: We fail to reject the null hypothesis as the p value
# 0.2848 is not less than 0.01

y<-c(44,31,52,48,46,39,43,36,41,49)
t.test(y,mu=44)

#TEST FOR EQUALITY OF MEANS OF TWO INDEPENDENT SAMPLES

Let mu_1 and mu_2 denote resp the pop means of two independent samples say X
and Y.The hypothesis to be tested is:
H: mu_1=mu_2 against K: mu_1(!=,<,>)mu_2
The test statistic is t= (x1bar-x2bar)/(S*sqrt(1/n1+1/n2)), where S is the pooled
sd, which follows t dist with (n1+n2-2) df.

Question
Let X equal the weight in grams of Low fat strawberry kudo and Y the weight of
Low Fat Blueberry Kudo. Assume the distribution follow normal distribution.
Let 21.7,21.0,21.2,20.7,20.4,21.9,20.2,21.6,20.6 be n=9 observations of X and
let 21.5,20.5,20.3,21.6,21.7,21.3,23.0,21.3,18.9,20.0,20.4,20.8,20.3 be m=13
observations of Y.Does the data support the claim that mean weight of X is
smaller than that of Y. Test at 5% level of significance. Assume the
population variances to be equal.

Solution
Let mu_1 and mu_2 denote resp the pop means of two independent samples say X
and Y.The hypothesis to be tested is:
H: mu_1=mu_2 against K: mu_1<mu_2
The test statistic is t= (x1bar-x2bar)/(S*sqrt(1/n1+1/n2)), where S is the pooled
sd, which follows t dist with (n1+n2-2) df
x<-c(21.7,21,21.2,20.7,20.4,21.9,20.2,21.6,20.6)
y<-c(21.5,20.5,20.3,21.6,21.7,21.3,23,21.3,18.9,20,20.4,20.8,20.3)

#code
t.test(x,y,alternative="less",var.equal=T)

#output:t = 0.37417, df = 20, p-value = 0.6439

#Interpretation:
Since p value is greater than .05 we do not reject the null hypothesis; there is
no sample evidence that the pop mean of X is smaller than that of Y.

While doing this test we assume that variances of X and Y are equal.If the
variances cannot be assumed equal then we need to give the argument as var.equal=F
(the default).This will give the Welch test rather than the pooled t test.
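
The pooled two sample statistic can be computed by hand to see what t.test(...,var.equal=T) does. A sketch; S, t_stat and p_val are illustrative names.

```r
x <- c(21.7, 21, 21.2, 20.7, 20.4, 21.9, 20.2, 21.6, 20.6)
y <- c(21.5, 20.5, 20.3, 21.6, 21.7, 21.3, 23, 21.3, 18.9, 20, 20.4, 20.8, 20.3)
n1 <- length(x); n2 <- length(y)

# Pooled standard deviation S
S <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))

t_stat <- (mean(x) - mean(y)) / (S * sqrt(1 / n1 + 1 / n2))
p_val <- pt(t_stat, df = n1 + n2 - 2)   # lower tail, since K: mu_1 < mu_2

c(t = t_stat, df = n1 + n2 - 2, p = p_val)
```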

#Alternate way of doing the above test:USING DATAFRAME

Create a data frame first.


1.Combine x and y in z.
z<-c(x,y)
2.Introduce grouping
g<-c(rep("1",9),rep("2",13))
thickness<-data.frame(weight=z,g)

#Code
attach(thickness)
t.test(weight~g,alternative="less",var.equal=T)

#Output:t = 0.37417, df = 20, p-value = 0.6439

24.2.2021

#EQUALITY OF VARIANCES OF TWO INDEPENDENT NORMAL POPULATIONS

Consider the above data where:


x<-c(21.7,21,21.2,20.7,20.4,21.9,20.2,21.6,20.6)
y<-c(21.5,20.5,20.3,21.6,21.7,21.3,23,21.3,18.9,20,20.4,20.8,20.3)

Test statistic F= S1^2/S2^2, where S1^2 and S2^2 are the sample variances of x and y
(with divisors m-1 and n-1), which follows F dist with df (m-1,n-1).
Hypothesis H:sigma1^2=sigma2^2 (ratio of pop variance is 1)
against K : ratio (!=,<,>) 1

#Code
var.test(x,y)
#using data frame
attach(thickness)
var.test(weight~g)

#Output:F = 0.36239, num df = 8, denom df = 12, p-value = 0.1573

#Interpretation:
Based on p value we conclude that H is not rejected at 5%.
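
The F statistic and the two-sided p value reported by var.test can be reproduced by hand. A sketch; F_stat, p_lower and p_val are illustrative names.

```r
x <- c(21.7, 21, 21.2, 20.7, 20.4, 21.9, 20.2, 21.6, 20.6)
y <- c(21.5, 20.5, 20.3, 21.6, 21.7, 21.3, 23, 21.3, 18.9, 20, 20.4, 20.8, 20.3)

F_stat <- var(x) / var(y)                 # ratio of sample variances
df1 <- length(x) - 1; df2 <- length(y) - 1

# Two-sided p value: twice the smaller tail probability
p_lower <- pf(F_stat, df1, df2)
p_val <- 2 * min(p_lower, 1 - p_lower)

c(F = F_stat, num_df = df1, denom_df = df2, p = p_val)
```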

Before doing a t test the test for equality of variances should be done.
If the variances are not equal based on the test then use argument var.equal=F.

#ONE WAY ANOVA

Here our interest is in the test of equality of means of more than 2 pop
simultaneously assuming that the populations are independent and normally
distributed.
If there are 'k' populations with means mu1,mu2,...muk then assuming variances
are equal hypothesis to be tested is
H:mu1=mu2=.....muk against K:H is false

ANOVA Table
Source      df    Sum of squares(SS)   Mean SS          F ratio     p value
Treatment   k-1   SST                  MSST=SST/(k-1)   MSST/MSSE
Errors      n-k   SSE                  MSSE=SSE/(n-k)
Total       n-1
Total variability is split into variability due to two sources (treatment and error).

Question
Consider 3 treatments A,B,C with sample observations of their yield.
A:10.3,12.2,14.5,11.6,10.7
B:20.4,27.1,28.2,29.4,26.9,32.1,20.8
C:30.1,33.2,38.9,40.1,42.6,37.5
Test whether the yield of the 3 treatments are equal at 5% level.

Solution:
H: Yield of 3 treatments are equal vs K: H is false

Create a data frame with yield values and corresponding group variables.

a<-c(10.3,12.2,14.5,11.6,10.7)
b<-c(20.4,27.1,28.2,29.4,26.9,32.1,20.8)
d<-c(30.1,33.2,38.9,40.1,42.6,37.5)
y<-c(a,b,d)
g<-c(rep("1",5),rep("2",7),rep("3",6))
treat=data.frame(yield=y,g)

#code
attach(treat)
h<-aov(yield~g)
summary(h)

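#Output:
          Df Sum Sq Mean Sq   F value Pr(>F)
g          2  10.94   5.472 2.993e+31 <2e-16 ***
Residuals 15   0.00   0.000

Note: a residual sum of squares of 0 cannot arise from the yields above. These
figures are what aov gives when the response is the group code (1,2,3) instead
of yield. For the yield data, hand calculation gives approximately treatment
SS = 1735.6, error SS = 230.6 and F = 56.4 on (2,15) df, so H is still rejected.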

#Interpretation:
Since the p value is less than 5% we reject the null hypothesis.
Thus the sample gives evidence that the mean yields of the three treatments
are not all equal.
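
The entries of the ANOVA table (SST, SSE, mean SS and F ratio) can be computed directly from the yield data and compared with aov. This is a sketch: the names yield, grp, sst, sse, msst, msse, f_ratio and p_value are introduced here for illustration.

```r
# Yield data of the three treatments
a <- c(10.3,12.2,14.5,11.6,10.7)
b <- c(20.4,27.1,28.2,29.4,26.9,32.1,20.8)
d <- c(30.1,33.2,38.9,40.1,42.6,37.5)
yield <- c(a, b, d)
grp <- factor(rep(c("1","2","3"), times = c(5,7,6)))

n <- length(yield)   # total number of observations
k <- nlevels(grp)    # number of treatments

# Treatment SS: squared distance of each group mean from the grand mean,
# counted once per observation (ave gives each observation its group mean)
sst <- sum((ave(yield, grp) - mean(yield))^2)

# Error SS: squared distance of each observation from its own group mean
sse <- sum((yield - ave(yield, grp))^2)

msst <- sst / (k - 1)
msse <- sse / (n - k)
f_ratio <- msst / msse
p_value <- pf(f_ratio, k - 1, n - k, lower.tail = FALSE)

c(SST = sst, SSE = sse, F = f_ratio, p = p_value)

# Built-in one way ANOVA for comparison
tab <- summary(aov(yield ~ grp))[[1]]
tab
```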

Question
Use the Orange data set and test whether the mean circumferences of the trees
are equal.

Solution:
H: Mean circumference of the trees are equal
K:H is false

#code
attach(Orange)
s<-aov(circumference~Tree)
summary(s)

#output
Df Sum Sq Mean Sq F value Pr(>F)
Tree 4 11841 2960 0.883 0.486
Residuals 30 100525 3351

#Interpretation
Since p value is greater than .05 we do not reject the null hypothesis. There
is no evidence that the mean circumferences of the trees differ.
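
The same test can be run without attach by passing the data frame through the data argument, and the p value can be read from the summary table. This is a sketch: the names fit, tab and p_value are introduced here for illustration.

```r
# Orange is a built-in data set: 5 trees with 7 measurements each
fit <- aov(circumference ~ Tree, data = Orange)

# summary() returns a list whose first element is the ANOVA table
tab <- summary(fit)[[1]]
p_value <- tab[1, "Pr(>F)"]
p_value

# Mean circumference of each tree, for reference
tapply(Orange$circumference, Orange$Tree, mean)
```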

#TWO WAY ANOVA


There are two sources of variation say row and column or treatment and block.

#Hypotheses
H1: Treatment effects are the same
against K1: H1 is false
H2: Block effects are the same
against K2: H2 is false

Example:
Let M denote Machine and O denote operator

M1 M2 M3
O1 4.5 3.6 5.7
O2 10.5 6.3 4.7
O3 3.3 4.7 8.1

The values denote the time (in hrs) to manufacture a product. Test whether the
mean manufacturing times of the operators are the same and whether the mean
manufacturing times of the machines are the same.

Solution:
#Hypothesis
H1: mean manufacturing time of operators are same
against K1: H1 is false
H2: mean manufacturing time of machines are same
against K2: H2 is false

#code
time<-c(4.5,3.6,5.7,10.5,6.3,4.7,3.3,4.7,8.1)
machine<-c(rep(c("m1","m2","m3"),3))
operator<-c(rep(c("o1","o2","o3"),each=3))
s<-data.frame(time,machine,operator)

attach(s)
h<-aov(time~machine+operator)
summary(h)

#Output
Df Sum Sq Mean Sq F value Pr(>F)
machine 2 3.216 1.608 0.221 0.811
operator 2 10.416 5.208 0.715 0.543
Residuals 4 29.138 7.284

#Interpretation
Since both p values are above 5%, neither H1 nor H2 is rejected.
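
The sums of squares in the two way table can be checked by hand from the operator (row) and machine (column) means. This is a sketch: the matrix times and the names ss_operator, ss_machine, ss_error, f_machine and f_operator are introduced here for illustration.

```r
# Rows are operators, columns are machines
times <- matrix(c(4.5,3.6,5.7,
                  10.5,6.3,4.7,
                  3.3,4.7,8.1),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("o1","o2","o3"), c("m1","m2","m3")))

grand <- mean(times)
r <- nrow(times)   # number of operators
m <- ncol(times)   # number of machines

# Each operator mean is based on m values, each machine mean on r values
ss_operator <- m * sum((rowMeans(times) - grand)^2)
ss_machine  <- r * sum((colMeans(times) - grand)^2)

# Error SS is what remains of the total SS
ss_total <- sum((times - grand)^2)
ss_error <- ss_total - ss_operator - ss_machine

# F ratios: mean SS of each factor divided by the error mean SS
ms_error <- ss_error / ((r - 1) * (m - 1))
f_machine  <- (ss_machine / (m - 1)) / ms_error
f_operator <- (ss_operator / (r - 1)) / ms_error

c(machine = ss_machine, operator = ss_operator, error = ss_error)
c(F_machine = f_machine, F_operator = f_operator)
```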
