Statistical Lab Using R-Programming Lab Manual and Workbook: Department of Mathematics

This document is a lab manual and workbook for a statistical lab using R programming. It provides instructions and exercises for students to complete 10 experiments involving basic R programming skills and statistical analyses, including introduction to R, basic data representation, data visualization with charts, importing data from Excel, calculating measures of central tendency, standard deviation and box plots, scatter plots and correlation, regression, and one-way and two-way ANOVA. The manual includes objectives, steps, code examples, and expected outputs for students to practice and demonstrate their understanding of key R programming and statistical techniques. It will be used by students in a statistical foundations course to learn skills for data science.

Uploaded by

Eswar

Statistical Lab using R-Programming

LAB MANUAL AND WORKBOOK

Year : 2020 – 2021

Subject Code :
U18MAR0001 – Statistical Foundations for Data Science

Regulation : R2018

Year : II

Department of Mathematics

School of Foundational Sciences


Certificate

This is to certify that this is a bona fide record of practical work done by Vijay M,
bearing Roll No. 20BME123, 2nd year, Mechanical branch, in the Statistical Foundations
for Data Science laboratory during the academic year 2021-22 under our supervision.

Faculty In-Charge Internal Examiner External Examiner


TABLE OF CONTENTS

S.No LIST OF EXPERIMENTS Page No

1. Introduction to R Programming

2. Basic Data Representation

3. Data Presentation Methods - Bar Chart, Pie Chart

4. Importing data from MS-Excel

5. Mean, Median, Mode

6. Standard Deviation, Five Number Summary, Box Plot

7. Scatter diagram, Correlation

8. Regression

9. ANOVA – One-way classification

10. ANOVA – Two-way classification


STATISTICAL LAB USING R-PROGRAMMING - MARKS BREAK UP STATEMENT

S.No.  Date  Name of the experiment                              Program (10)  Execution (10)  Viva (10)  Total marks (30)  Faculty sign
1            Introduction to R Programming
2            Basic Data Representation
3            Data Presentation Methods - Bar Chart, Pie Chart
4            Importing data from MS-Excel
5            Mean, Median, Mode
6            Standard Deviation, Five Number Summary, Box Plot
7            Scatter diagram, Correlation
8            Regression
9            ANOVA – One-way classification
10           ANOVA – Two-way classification

KUMARAGURU COLLEGE OF TECHNOLOGY


LABORATORY MANUAL

Experiment Number: 01, 02

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Introduction to R Programming, Basic Data
Representation

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT

1. To understand the basics of R-Programming and R Studio

2. To understand the representations of basic data

STEP 2: ACQUISITION
I. INTRODUCTION TO R PROGRAMMING
• R is a programming language and software environment for statistical analysis,
graphics representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand.
• The language was named R after the first letter of the first names of its two
authors (Robert Gentleman and Ross Ihaka).
• R allows integration with procedures written in the C, C++, .NET, Python or
FORTRAN languages for efficiency.
• R is an extremely flexible and customizable language.
• R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on
screen or printed on paper.
• R is one of the most widely used programming languages for statistics.
• R has a large number of built-in packages, functions and operators.

[Figure: a schematic view of how R works, showing the four windows: Editor; Environment and Command History; Console (instructions); Packages and Plot.]

Important note: R is case sensitive


Functions:

A function is a group of statements or commands that performs a task. As in other
languages, a function in R is called by writing its name followed by parentheses ( ), which
enclose its arguments. Functions may be either user-defined or built-in as provided by the
language.
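
As a small illustration of the two kinds of function, the sketch below calls the built-in sqrt( ) and then defines a user-defined function. The name square_plus_one is hypothetical and chosen only for this example.

```r
# Built-in function: called by writing its name followed by parentheses
sqrt(16)
## [1] 4

# User-defined function (hypothetical name, for illustration only)
square_plus_one = function(x) {
  x^2 + 1   # the value of the last expression is returned
}
square_plus_one(3)
## [1] 10
```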

R-objects:

There are many types of R-objects.

• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames

Creating vectors in R

A vector is a sequence of data elements of the same type. In R, the c( ) function is most
commonly used to create a vector.
R Code:
c(1,2,3,4)
Output:
## [1] 1 2 3 4
R Code:
c(1:4)
Output:
## [1] 1 2 3 4

Creating an empty vector using the vector function:

R Code:
vector("character", length = 3)
Output:
## [1] "" "" ""

The simplest of these objects is the vector object. R language has five frequently used atomic
vector types (Carpentries 2018):
• Logical – TRUE, FALSE
• Integer – 2L, 4L, etc. (the L tells R to store this as an integer)
• Numeric – 5, 12.5, etc.
• Complex – 1 + 4i, 2 + 1i, etc.
• Character – "Hello", "R", etc.

A sixth atomic type, raw, stores bytes; it is shown in example vi) below.

The logical type has two values, TRUE and FALSE.


i) testData = FALSE
print(class(testData))
## [1] "logical"

In order to declare an integer in R, we need to add an L suffix.

ii) testData = 5L
print(class(testData))
## [1] "integer"

iii) testData = as.integer(5)
print(class(testData))
## [1] "integer"

iv) testData = 12.5
print(class(testData))
## [1] "numeric"

v) testData = "R Programming Language"
print(class(testData))
## [1] "character"

vi) testData = charToRaw("Hello")
print(testData)
## [1] 48 65 6c 6c 6f
print(class(testData))
## [1] "raw"

Some basic commands

1. To generate a sequence with common difference 1

R code: seq(1,10)

Output: 1 2 3 4 5 6 7 8 9 10

2. To generate a sequence with common difference 2

R code: seq(1,15,by=2)

Output: 1 3 5 7 9 11 13 15

3. To find the square root of a number

R code:
# square root
sqrt(2)
(or)
y=2
x=sqrt(y)
x
Output:
[1] 1.414214

To round the value 1.414214 to three places after the decimal point:

R code: round(1.414214,digits=3)

Output: 1.414

4. To perform addition of two numbers

R-code:
a=2
b=3
c=a+b
c
(or)
c=2+3
c
Output:
[1] 5

Element-wise product of vectors and the sum of the products:

R-code:
a=c(1,2,3,4)
a*a
sum(a*a)
Output:
a*a
[1] 1 4 9 16
sum(a*a)
[1] 30

R-code:
a=c(1,2,3,4)
b=c(2,4,6,8)
sum(a*b)
Output:
sum(a*b)
[1] 60

5. To delete objects in memory.

x=c("name","age","email")
rm(x)

rm(x) deletes the object x from memory.
rm(list=ls()) deletes all the objects in memory.

6. To find the length of the vector

testData = seq(from=1,to=10,by=0.5)
paste("The length of testData is", length(testData))
## [1] "The length of testData is 19"

7. The function gl():

The function gl (generate levels) is very useful because it generates regular series of factors.
The usage of this function is gl(k, n), where k is the number of levels (or classes) and n is the
number of replications in each level.
Two options may be used: length to specify the number of data produced, and labels to
specify the names of the levels of the factor.

Examples:

> gl(3, 5)

[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Levels: 1 2 3

> gl(3, 5, length=30)

[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

Levels: 1 2 3

> gl(2, 6, labels=c("Male", "Female"))

[1] Male Male Male Male Male Male Female Female Female Female Female Female

Levels: Male Female

> gl(2, 10)

[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Levels: 1 2

> gl(2, 1, length=20)

[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

Levels: 1 2

> gl(2, 2, length=20)

[1] 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2

Levels: 1 2

8. The on-line help in R

R Code: help("*")

Output: opens the help page "Arithmetic Operators" (topic Arithmetic, package base) in the R documentation.

Creating matrices using vectors:

A matrix can be created using the matrix function. The dimensions of the matrix are defined by
passing appropriate values for the arguments nrow and ncol.

Creating a 3X3 matrix:

matrix(data=1:9,nrow=3,ncol=3)

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

The numbers from 1 to 9 have been arranged column-wise. If we want the numbers (1 to 9) to
be arranged row-wise, the command is as follows:

matrix(data=1:9,nrow=3,ncol=3,byrow=TRUE)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Consider two different vectors as given below

u=c(9:12)
v=c(13:16)
u; v

## [1] 9 10 11 12
## [1] 13 14 15 16

To create a 4X2 matrix with the vectors u, v:

First create a single vector by joining the vectors u and v as given below:
w =c(u,v)
w

## [1]  9 10 11 12 13 14 15 16

Now
mat= matrix(data=w,nrow=4,ncol=2)
mat

##      [,1] [,2]
## [1,]    9   13
## [2,]   10   14
## [3,]   11   15
## [4,]   12   16

rowSums(mat)
## [1] 22 24 26 28
colSums(mat)
## [1] 42 58
dim(mat)
## [1] 4 2

Adding row(s) or column(s) to a matrix


Suppose we want to add a vector (with elements 17, 18, 19 and 20) as a column to the matrix mat.
For this, we use the cbind function.

mat=cbind(mat,17:20)
mat
##
     [,1] [,2] [,3]
[1,]    9   13   17
[2,]   10   14   18
[3,]   11   15   19
[4,]   12   16   20

dim(mat)
##
[1] 4 3
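
The heading above also mentions rows; a row can be appended in the same way with rbind( ). A minimal sketch, rebuilding the 4X3 matrix mat so that the block runs on its own (the name mat2 and the new row values are introduced here only for illustration):

```r
# Rebuild the 4X3 matrix from the text: columns 9:12, 13:16 and 17:20
mat = cbind(matrix(data = 9:16, nrow = 4, ncol = 2), 17:20)

# Append one row of three elements (hypothetical values)
mat2 = rbind(mat, c(21, 22, 23))
dim(mat2)
## [1] 5 3
mat2[5, ]
## [1] 21 22 23
```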
Matrix operations with built-in data sets:

R Code:

library(help=datasets)   ## lists the data sets in the datasets package
datasets::swiss          ## prints the whole swiss data set
nrow(swiss)
ncol(swiss)
dim(swiss)
head(swiss)              ## first six rows of the data set
head(swiss,5)            ## first five rows only
swiss[3,2]               ## element in the third row and second column
swiss[1]                 ## first column
swiss[2:5,]              ## rows two to five only
swiss[1,]                ## first row of the data set

Output:
##
datasets::swiss
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6
Broye 83.8 70.2 16 7 92.85 23.6
Glane 92.4 67.8 14 8 97.16 24.9
Gruyere 82.4 53.3 12 7 97.67 21.0
Sarine 82.9 45.2 16 13 91.38 24.4
Veveyse 87.1 64.5 14 6 98.61 24.5
Aigle 64.1 62.0 21 12 8.52 16.5
Aubonne 66.9 67.5 14 7 2.27 19.1
Avenches 68.9 60.7 19 12 4.43 22.7
Cossonay 61.7 69.3 22 5 2.82 18.7
Echallens 68.3 72.6 18 2 24.20 21.2
Grandson 71.7 34.0 17 8 3.30 20.0
Lausanne 55.7 19.4 26 28 12.11 20.2
La Vallee 54.3 15.2 31 20 2.15 10.8
Lavaux 65.1 73.0 19 9 2.84 20.0
Morges 65.5 59.8 22 10 5.23 18.0
Moudon 65.0 55.1 14 3 4.52 22.4
Nyone 56.6 50.9 22 12 15.14 16.7
Orbe 57.4 54.1 20 6 4.20 15.3
Oron 72.5 71.2 12 1 2.40 21.0
Payerne 74.2 58.1 14 8 5.23 23.8
Paysd'enhaut 72.0 63.5 6 3 2.56 18.0
Rolle 60.5 60.8 16 10 7.72 16.3
Vevey 58.3 26.8 25 19 18.46 20.9
Yverdon 65.4 49.5 15 8 6.10 22.5
Conthey 75.5 85.9 3 2 99.71 15.1
Entremont 69.3 84.9 7 6 99.68 19.8
Herens 77.3 89.7 5 2 100.00 18.3
Martigwy 70.5 78.2 12 6 98.96 19.4
Monthey 79.4 64.9 7 3 98.22 20.2
St Maurice 65.0 75.9 9 9 99.06 17.8
Sierre 92.2 84.6 3 3 99.46 16.3
Sion 79.3 63.1 13 13 96.83 18.1
Boudry 70.4 38.4 26 12 5.62 20.3
La Chauxdfnd 65.7 7.7 29 11 13.79 20.5
Le Locle 72.7 16.7 22 13 11.22 18.9
Neuchatel 64.4 17.6 35 32 16.92 23.0
Val de Ruz 77.6 37.6 15 7 4.97 20.0
ValdeTravers 67.6 18.7 25 7 8.65 19.5
V. De Geneve 35.0 1.2 37 53 42.34 18.0
Rive Droite 44.7 46.6 16 29 50.43 18.2
Rive Gauche 42.8 27.7 22 29 58.33 19.3
##
nrow(swiss)
[1] 47
##
> ncol(swiss)
[1] 6
##
> dim(swiss)
[1] 47 6

##
head(swiss)
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6

#head(swiss,5)

Fertility Agriculture Examination Education Catholic Infant.Mortality


Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6

#
swiss[3,2]
[1] 39.7
#
swiss[1]
Fertility
Courtelary 80.2
Delemont 83.1
Franches-Mnt 92.5
Moutier 85.8
Neuveville 76.9
Porrentruy 76.1
Broye 83.8
Glane 92.4
Gruyere 82.4
Sarine 82.9
Veveyse 87.1
Aigle 64.1
Aubonne 66.9
Avenches 68.9
Cossonay 61.7
Echallens 68.3
Grandson 71.7
Lausanne 55.7
La Vallee 54.3
Lavaux 65.1
Morges 65.5
Moudon 65.0
Nyone 56.6
Orbe 57.4
Oron 72.5
Payerne 74.2
Paysd'enhaut 72.0
Rolle 60.5
Vevey 58.3
Yverdon 65.4
Conthey 75.5
Entremont 69.3
Herens 77.3
Martigwy 70.5
Monthey 79.4
St Maurice 65.0
Sierre 92.2
Sion 79.3
Boudry 70.4
La Chauxdfnd 65.7
Le Locle 72.7
Neuchatel 64.4
Val de Ruz 77.6
ValdeTravers 67.6
V. De Geneve 35.0
Rive Droite 44.7
Rive Gauche 42.8

#
swiss[2:5,]
Fertility Agriculture Examination Education Catholic Infant.Mortality
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6

swiss[1,]
Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17 15 12 9.96 22.2

KUMARAGURU COLLEGE OF TECHNOLOGY


LABORATORY MANUAL

Experiment Number: 03
Lab Code : U18MAR0001
Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Data presentation methods - Bar Chart,
Pie Chart
STEP 1: INTRODUCTION
OBJECTIVE OF THE EXPERIMENT
To create Bar charts and Pie charts
STEP 2: ACQUISITION
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.

Procedure for doing the Experiment:

1. To draw a bar chart representing variables x, y, z, …… with values a, b, c, …… respectively
v=c(a, b, c, ……)
labels=c("x", "y", "z", ……)
barplot(v, names.arg=labels, xlab="A", ylab="B", col="C", main="Sample title")

A pie chart is a circular statistical graph which is divided into slices to illustrate numerical
proportion. The various observations or components are represented by the sectors of a circle
and the whole circle represents the sum of the values of all components. The arc length of
each slice (and consequently its central angle and area), is proportional to the quantity it
represents.

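The formula to determine the angle of a sector in a pie chart is:

$$\text{Angle of sector} = \frac{\text{Frequency of data}}{\text{Total frequency}} \times 360^{\circ}$$

The percentage share shown in the labels uses the same ratio multiplied by 100.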
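
A short worked sketch of the sector calculation: each value's share of the total is scaled to 360 degrees for the central angle and to 100 for the percentage label. The vector v below is made-up data for illustration.

```r
v = c(10, 20, 30, 40)          # hypothetical frequencies
angle = v / sum(v) * 360       # central angle of each sector, in degrees
percent = v / sum(v) * 100     # percentage share of each sector
angle
## [1]  36  72 108 144
percent
## [1] 10 20 30 40
sum(angle)                     # the sectors together fill the whole circle
## [1] 360
```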

Procedure for doing the Experiment:

1. To draw a pie chart representing variables x, y, z, …… with values a, b, c, …… respectively
v=c(a,b,c,……)
lbls=c("x", "y", "z", ……)
percent = round(v/sum(v)*100)
lbls = paste(lbls, percent) # add percents to labels
lbls = paste(lbls,"%",sep="") # add % to labels
pie(v,labels = lbls, col=rainbow(length(lbls)),
main="Sample title")

Bar Chart:

Example
The following list gives the export quantity of raw cotton (in million kg.) for five
consecutive years, 2012-13 to 2016-17:
1945.63, 1864.69, 1093.11, 1297.27, 918.15.
Plot a bar chart.

R-code:
export=c(1945.63, 1864.69, 1093.11, 1297.27, 918.15)
year=c("2012-13","2013-14","2014-15","2015-16","2016-17")
barplot(export, names.arg=year, xlab="Financial Year", ylab="Export Quantity",
col="darkgreen", main="Export quantity of raw cotton (in million kg.)")

Output:

Task 1
Data regarding India’s Article market size (US$ billion) for various years is given in the
following table. Plot a bar chart.

Year          2009   2010   2011   2014   2015    2016    2023 (Expected)
Market size   70.0   78.0   89.0   99.0   106.5   137.0   226.0

R Code:

export = c(70, 78, 89, 99, 106.5, 137, 226)
y = c("2009", "2010", "2011", "2014", "2015", "2016", "2023 (Expected)")
barplot(export, names.arg=y, xlab="Year", ylab="Market Size", col="darkgreen",
main="India's Article Market Size (US$ billion)")

Output:

Pie Chart
Example:
Draw a pie diagram to represent the following data giving the monthly expenditure of a
family:

Type of expenditure   Amount in Rs. ‘000
Food                  10
Rent                  15
Clothes               2
Education             10
Miscellaneous         3
Savings               8
Total                 48

R code
exp = c(10, 15, 2, 10, 3,8)
lbls= c("Food", "Rent", "Clothes", "Education", "Miscellaneous","Savings")
percent = round(exp/sum(exp)*100)
lbls = paste(lbls, percent)
lbls = paste(lbls,"%",sep ="")
pie(exp,labels = lbls, col=rainbow(length(lbls)), main=" Monthly expenditure of a family ")

Output

Note: Save the pie chart as image with file name “Monthly expenditure of a family”. It will
get saved in your working directory
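
One way to save the chart from code is the png( ) graphics device, sketched below under some assumptions. The file is written to a temporary folder here so the block runs anywhere; passing just the file name to png( ) would save it in the working directory instead. The names expense and outfile are hypothetical.

```r
expense = c(10, 15, 2, 10, 3, 8)
lbls = c("Food", "Rent", "Clothes", "Education", "Miscellaneous", "Savings")

# Assumed output location: a temporary folder (use a plain file name
# to save into the current working directory instead)
outfile = file.path(tempdir(), "Monthly expenditure of a family.png")

png(filename = outfile)      # open a PNG graphics device
pie(expense, labels = lbls, col = rainbow(length(lbls)),
    main = "Monthly expenditure of a family")
dev.off()                    # close the device so the file is written
file.exists(outfile)         # check that the image file was created
```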

Task 1
Elements that make up the earth’s crust are as follows. Represent the data as a pie diagram.

Sl.No.  Elements   Percentage
1.      Aluminum   9%
2.      Silicon    23%
3.      Iron       14%
4.      Oxygen     39%
5.      Calcium    11%
6.      Others     4%

R Code:

data=c(9,23,14,39,11,4)
labels=c('Aluminum','Silicon','Iron','Oxygen','Calcium','Others')
labels =paste(labels, data, sep='-')
labels =paste(labels,'%', sep='')
pie(data,labels=labels,col=rainbow(length(labels)),main='Elements of Earth\'s Crust')

Output:

Task 2

The following table shows the area in millions of square kilometers of the
oceans of the world:
Ocean Area (million sq. km.)
Pacific 70.8
Indian 28.5
Atlantic 41.2
Antarctic 7.6
Arctic 4.8

Draw a pie diagram to represent the data.

R Code:

data = c(70.8, 28.5, 41.2, 7.6, 4.8)
labels = c("Pacific", "Indian", "Atlantic", "Antarctic", "Arctic")
percent = round(data/sum(data)*100)
labels = paste(labels, percent)
labels = paste(labels, "%", sep='')
pie(data,labels=labels,col=rainbow(length(labels)),
main='Area in million square kilometers of the oceans of the world')

Output:

KUMARAGURU COLLEGE OF TECHNOLOGY

LABORATORY MANUAL

Experiment Number: 04

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Importing data from MS-Excel

STEP 1: INTRODUCTION
OBJECTIVES OF THE EXPERIMENT

1. To create a data frame using given data


2. To import data from a given MS-Excel file
STEP 2: ACQUISITION
1. To create a data frame using given data

Procedure for doing the Experiment:

1. Represent the various columns of the data frame by the vectors x, y, z ……


2. A=data.frame(x,y,z,…..) creates the data frame.

Example
R code:

A = data.frame(name = c("A","B","C"),
gender = c("Male","Male","Female"),
height = c(152, 171.5, 165),
weight = c(81, 93, 78),
age = c(42, 38, 26))

(OR)

name = c("A","B","C")
gender = c("Male","Male","Female")
height = c(152, 171.5, 165)
weight = c(81, 93, 78)
age = c(42, 38, 26)
A = data.frame(name, gender, height, weight, age)

Output
  name gender height weight age
1    A   Male  152.0     81  42
2    B   Male  171.5     93  38
3    C Female  165.0     78  26

Task 1

Create a data frame from the following details regarding babies’ frocks (Given: size, season,
material, decoration, pattern type, price)

1. L, spring, silk, embroidery, dot, 650
2. M, summer, chiffon, bow, print, 275
3. M, summer, cotton, null, animal, 380
4. M, winter, cotton, null, patchwork, 450
5. L, autumn, linen, ruffles, animal, 420

R Code:

size = c("L","M","M","M","L")
season=c("Spring","Summer","Summer","Winter","Autumn")
material=c("silk","chiffon","cotton","cotton","linen")
decoration=c("Embroidery","Bow","Null","Null","Ruffles")
patterntype=c("dot","print","animal","patchwork","animal")
price=c(650,275,380,450,420)
D=data.frame(size,season,material,decoration,patterntype,price)
D

Output:

  size season material decoration patterntype price
1    L Spring     silk Embroidery         dot   650
2    M Summer  chiffon        Bow       print   275
3    M Summer   cotton       Null      animal   380
4    M Winter   cotton       Null   patchwork   450
5    L Autumn    linen    Ruffles      animal   420

2. To import data from a given MS-Excel file


1. To locate current working directory
# Get and print current working directory.
print(getwd( ))
2. To import data from Excel sheet

To import data from Excel sheet ‘abc’, first save the file as .csv(comma delimited) in
the current working directory. Then execute the following command
data = read.csv("abc.csv")
data
3. $ symbol is used to extract a specific field.
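
To try steps 2 and 3 without an Excel file at hand, the sketch below writes a small made-up table to a temporary .csv file, reads it back with read.csv( ), and extracts one field with $. The column names Name and Marks and all the values are hypothetical.

```r
# Made-up data, written to a temporary .csv file for the demonstration
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(Name = c("A", "B", "C"), Marks = c(26, 25, 19)),
          tmp, row.names = FALSE)

data = read.csv(tmp)    # import the .csv file as a data frame
data$Marks              # $ extracts the field (column) called Marks
## [1] 26 25 19
nrow(data)
## [1] 3
```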

Example:

To import data from the Excel sheet ‘Testmarks’, first save the file as .csv (comma delimited) in
the current working directory. Then execute the following command

data = read.csv("Testmarks.csv")
data
Output:
   Sl.No. Name IT.1 IT.II
1       1    A   26    32
2       2    B   25    25
3       3    C   19    31
4       4    D   14    26
5       5    E   25    28
6       6    F   32    32
7       7    G   29    42
8       8    H   25    26
9       9    I   31    38
10     10    J   35    39
11     11    K   33    31
12     12    L   35    36

To get the list of students who have passed in Internal test 1 (25 marks or more)

pass = subset(data, IT.1 >= 25)
print(pass)
Output:
   Sl.No. Name IT.1 IT.II
1       1    A   26    32
2       2    B   25    25
5       5    E   25    28
6       6    F   32    32
7       7    G   29    42
8       8    H   25    26
9       9    I   31    38
10     10    J   35    39
11     11    K   33    31
12     12    L   35    36

To get the list of students who have secured 30 or more marks in both tests

Good = subset(data, IT.1 >= 30 & IT.II >= 30)
Good

Output
   Sl.No. Name IT.1 IT.II
6       6    F   32    32
9       9    I   31    38
10     10    J   35    39
11     11    K   33    31
12     12    L   35    36

Task 2

Import the Excel file ‘Studentdetails1’ (saved as .csv) from your directory to create a data frame.

details = read.csv("Studentdetails1.csv")

details

Output:
X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS
1 1 A1 Male MQ TIRUVANNAMALAI 967 177.00
2 2 A2 Female MQ COIMBATORE 1097 183.75
3 3 A3 Female MQ COIMBATORE 1096 183.50
4 4 A4 Female MQ COIMBATORE 1085 187.50
5 5 A5 Male MQ COIMBATORE 1056 179.00
6 6 A6 Male MQ OOTY 1091 184.00
7 7 A7 Female MQ KARUR 1088 180.75
8 8 A8 Male MQ COIMBATORE 1009 171.75
9 9 A9 Male MQ COIMBATORE 906 145.50


10 10 A10 Male MQ SALEM 977 159.25
11 11 A11 Male MQ COIMBATORE 1052 168.25
12 12 A12 Female MQ THE NILGIRIS 1125 190.50
13 13 A13 Male MQ BANGALORE 391 68.25
14 14 A14 Female MQ COIMBATORE 1003 158.00
15 15 A15 Male MQ SALEM 959 168.00
16 16 A16 Male MQ TIRUCHIRAPPALLI 1140 188.00
17 17 A17 Female MQ COIMBATORE 963 162.00
18 18 A18 Female GQ COIMBATORE 1135 195.50
19 19 A19 Male GQ ERODE 1139 195.75
20 20 A20 Female GQ TIRUPPUR 1158 195.25
21 21 A21 Male GQ PATTUKOTTAI 1153 195.25
22 22 A22 Female MQ COONOOR 1115 189.50
23 23 A23 Female GQ TRICHIRAPPALLI 1114 192.25
24 24 A24 Male GQ TIRUNELVELI 1145 195.75
25 25 A25 Male GQ SALEM 1164 196.00
26 26 A26 Male GQ ERODE 1152 195.75
27 27 A27 Female GQ DHARAPURAM 1159 194.50
28 28 A28 Female GQ COIMBATORE 1112 195.50
29 29 A29 Male GQ TIRUPPUR 1112 179.50
30 30 A30 Male GQ SIVAGANGAI 1147 194.25
31 31 A31 Female GQ CUDDALORE 1127 195.00
32 32 A32 Female GQ TIRUCHIRAPPALLI 1152 192.75
33 33 A33 Female GQ ERODE 1135 194.50
34 34 A34 Male GQ KRISHNAGIRI 1143 192.50
35 35 A35 Male GQ THIRUVALLUR 1125 195.00
36 36 A36 Male GQ VIRUDHUNAGAR 1143 195.75
37 37 A37 Male GQ MADURAI 1041 181.25
38 38 A38 Male MQ COIMBATORE 1047 170.75
39 39 A39 Female MQ SIVAGANGAI 1038 168.00
40 40 A40 Male GQ METTUPALAYAM 1094 187.00
41 41 A41 Male MQ TIRUPPUR 1084 186.50
42 42 A42 Male MQ KANYAKUMARI 1043 177.50
43 43 A43 Male MQ CHAMRAJNAGAR 433 68.00
44 44 A44 Male MQ SIVAGANGA 884 139.00
45 45 A45 Male MQ SALEM 926 152.25
46 46 A46 Male MQ COIMBATORE 936 153.00
47 47 A47 Male GQ COIMBATORE 1129 193.00
48 48 A48 Female GQ DINDIGUL 1101 188.75
49 49 A49 Female GQ THE NILGIRIS 1134 192.75
50 50 A50 Female GQ ERODE 1117 193.75
51 51 A51 Male GQ CUDDALORE 1129 188.75
52 52 A52 Male GQ RAMANATHAPURAM 1137 194.25
53 53 A53 Female GQ THENI 1115 193.25
54 54 A54 Male GQ ERODE 1144 190.25
55 55 A55 Male GQ VELLORE 1124 193.50

KUMARAGURU COLLEGE OF TECHNOLOGY

LABORATORY MANUAL

Experiment Number: 05

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Mean, Median, Mode

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT


To import data from a given MS-Excel file and to find arithmetic mean, median, mode and
standard deviation from a data frame.

STEP 2: ACQUISITION
To create a data frame using given data

Procedure for doing the Experiment:


To import data from a given MS-Excel file and to find arithmetic mean, median, mode
and standard deviation from a data frame.

1. To locate current working directory
# Get and print current working directory.
print(getwd( ))
2. To import data from Excel sheet
To import data from Excel sheet ‘abc’, first save the file as .csv (comma delimited)
in the current working directory. Then execute the following command
data = read.csv("abc.csv")
data
3. $ symbol is used to extract a specific field.

4. mean(data$-----)

5. median(data$----)

6. Mode: base R has no built-in function for the statistical mode, so a user-defined function is used.
(Naming it mode masks R's built-in mode( ), which returns the storage mode of an object.)
mode = function(x) {
ux = unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x = data$ ------
# Calculate the mode using the user function.
v = mode(x)
print(v)
7. summary(data)
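
A small worked sketch of steps 4 to 6 on a made-up vector (the values are hypothetical). The helper is named get_mode here rather than mode; that is an assumption made for this example so that it does not mask R's built-in mode( ) function.

```r
get_mode = function(x) {
  ux = unique(x)                          # distinct values
  ux[which.max(tabulate(match(x, ux)))]   # value with the highest count
}

marks = c(25, 30, 25, 35, 40, 25, 30)     # hypothetical data
mean(marks)
## [1] 30
median(marks)
## [1] 30
get_mode(marks)
## [1] 25
```

If two values are tied for the highest count, which.max( ) picks the first, so this helper returns the value that appears earliest in the data.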

Example:

To import data from the Excel sheet ‘Testmarks’, first save the file as .csv (comma delimited) in
the current working directory. Then execute the following command

data = read.csv("Testmarks.csv")

data

Output:
Sl.No. Name IT.1 IT.II
1 1 A 26 32
2 2 B 25 25
3 3 C 19 31
4 4 D 14 26
5 5 E 25 28
6 6 F 32 32
7 7 G 29 42
8 8 H 25 26
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36

To get the list of students who have passed in Internal test 1

pass= subset(data, IT.1 >= 25 )

print(pass)

Output:

   Sl.No. Name IT.1 IT.II
1       1    A   26    32
2       2    B   25    25
5       5    E   25    28
6       6    F   32    32
7       7    G   29    42
8       8    H   25    26
9       9    I   31    38
10     10    J   35    39
11     11    K   33    31
12     12    L   35    36

To get the list of students who have secured 30 or more marks in both tests

Good = subset(data, IT.1 >= 30 & IT.II >= 30)

Good

Output

   Sl.No. Name IT.1 IT.II
6       6    F   32    32
9       9    I   31    38
10     10    J   35    39
11     11    K   33    31
12     12    L   35    36

To find the average marks of all students in Internal test 1

mean(data$IT.1)

Output:

27.41667

Task 2
Import the Excel file ‘Studentdetails1’ from your directory to create a data frame and
find:

1. The mean, median, mode of cut-off marks
2. The summary of all details in the file.
3. The mean of total marks
4. The city from which maximum number of students have come.
5. The list of GQ students and their mean cutoff marks
6. The list of girl students and their average cutoff.

R Code:

data = read.csv("Studentdetails1.csv")

data
Output:
X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS
1 1 A1 Male MQ TIRUVANNAMALAI 967 177.00
2 2 A2 Female MQ COIMBATORE 1097 183.75
3 3 A3 Female MQ COIMBATORE 1096 183.50
4 4 A4 Female MQ COIMBATORE 1085 187.50
5 5 A5 Male MQ COIMBATORE 1056 179.00
6 6 A6 Male MQ OOTY 1091 184.00
7 7 A7 Female MQ KARUR 1088 180.75
8 8 A8 Male MQ COIMBATORE 1009 171.75
9 9 A9 Male MQ COIMBATORE 906 145.50
10 10 A10 Male MQ SALEM 977 159.25
11 11 A11 Male MQ COIMBATORE 1052 168.25
12 12 A12 Female MQ THE NILGIRIS 1125 190.50
13 13 A13 Male MQ BANGALORE 391 68.25
14 14 A14 Female MQ COIMBATORE 1003 158.00
15 15 A15 Male MQ SALEM 959 168.00
16 16 A16 Male MQ TIRUCHIRAPPALLI 1140 188.00
17 17 A17 Female MQ COIMBATORE 963 162.00
18 18 A18 Female GQ COIMBATORE 1135 195.50
19 19 A19 Male GQ ERODE 1139 195.75
20 20 A20 Female GQ TIRUPPUR 1158 195.25
21 21 A21 Male GQ PATTUKOTTAI 1153 195.25
22 22 A22 Female MQ COONOOR 1115 189.50
23 23 A23 Female GQ TRICHIRAPPALLI 1114 192.25
24 24 A24 Male GQ TIRUNELVELI 1145 195.75
25 25 A25 Male GQ SALEM 1164 196.00
26 26 A26 Male GQ ERODE 1152 195.75
27 27 A27 Female GQ DHARAPURAM 1159 194.50
28 28 A28 Female GQ COIMBATORE 1112 195.50
29 29 A29 Male GQ TIRUPPUR 1112 179.50
30 30 A30 Male GQ SIVAGANGAI 1147 194.25
31 31 A31 Female GQ CUDDALORE 1127 195.00
32 32 A32 Female GQ TIRUCHIRAPPALLI 1152 192.75
33 33 A33 Female GQ ERODE 1135 194.50
34 34 A34 Male GQ KRISHNAGIRI 1143 192.50
35 35 A35 Male GQ THIRUVALLUR 1125 195.00
36 36 A36 Male GQ VIRUDHUNAGAR 1143 195.75
37 37 A37 Male GQ MADURAI 1041 181.25
38 38 A38 Male MQ COIMBATORE 1047 170.75
39 39 A39 Female MQ SIVAGANGAI 1038 168.00
40 40 A40 Male GQ METTUPALAYAM 1094 187.00
41 41 A41 Male MQ TIRUPPUR 1084 186.50
42 42 A42 Male MQ KANYAKUMARI 1043 177.50
43 43 A43 Male MQ CHAMRAJNAGAR 433 68.00
44 44 A44 Male MQ SIVAGANGA 884 139.00
45 45 A45 Male MQ SALEM 926 152.25
46 46 A46 Male MQ COIMBATORE 936 153.00
47 47 A47 Male GQ COIMBATORE 1129 193.00
48 48 A48 Female GQ DINDIGUL 1101 188.75
49 49 A49 Female GQ THE NILGIRIS 1134 192.75
50 50 A50 Female GQ ERODE 1117 193.75
51 51 A51 Male GQ CUDDALORE 1129 188.75
52 52 A52 Male GQ RAMANATHAPURAM 1137 194.25


53 53 A53 Female GQ THENI 1115 193.25
54 54 A54 Male GQ ERODE 1144 190.25
55 55 A55 Male GQ VELLORE 1124 193.50

CODING:

1. To find mean , median, mode of cutoff marks


Mean:
mean(data$CUTOFFMARKS)
Output:
179.0318
Median:
median(data$CUTOFFMARKS)
Output:
188.75
Mode:
mode = function(x){
ux = unique(x)
ux[which.max(tabulate(match(x,ux)))]
}
x = data$CUTOFFMARKS
v = mode(x)
v
Output:

195.75

Standard Deviation:
sd(data$CUTOFFMARKS)

Output:
26.0761

2. Summary of all details:


R code:
summary(data)

Output:

X
Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
1.0    14.5     28.0    28.0   41.5     55.0

STUDENTNAME
Length  Class      Mode
55      character  character

GENDER
Length  Class      Mode
55      character  character

SEATCATG
Length  Class      Mode
55      character  character

CITYNAME
Length  Class      Mode
55      character  character

TOTALMARKS
Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
391    1042     1112    1060   1136     1164

CUTOFFMARKS
Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
68.0   174.4    188.8   179.0  194.2    196.0

3. The mean of total marks

R Code:
mean(data$TOTALMARKS)

Output:
1059.836
Standard Deviation:
sd(data$TOTALMARKS)

Output:
145.9606

4. To find the city from which maximum number of students have come.
data = read.csv("Studentdetails1.csv")
data
mode = function(x) {
ux = unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x = data$CITYNAME
v = mode(x)
print(v)

Output
"COIMBATORE"
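
An alternative sketch for the same question uses table( ), which counts how often each value occurs. The city vector below is made up for illustration and is not taken from the file.

```r
# Hypothetical city names, for illustration only
city = c("SALEM", "COIMBATORE", "ERODE", "COIMBATORE", "SALEM", "COIMBATORE")

counts = table(city)                # frequency of each city
counts[["COIMBATORE"]]
## [1] 3
names(counts)[which.max(counts)]    # city with the highest count
## [1] "COIMBATORE"
```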

5. The list of GQ students and their mean cutoff marks

R Code:
data = read.csv("Studentdetails1.csv")
data
p = subset(data,SEATCATG=="GQ")
print(p)

Output:

X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS


18 18 A18 Female GQ COIMBATORE 1135 195.50
19 19 A19 Male GQ ERODE 1139 195.75
20 20 A20 Female GQ TIRUPPUR 1158 195.25
21 21 A21 Male GQ PATTUKOTTAI 1153 195.25
23 23 A23 Female GQ TRICHIRAPPALLI 1114 192.25
24 24 A24 Male GQ TIRUNELVELI 1145 195.75
25 25 A25 Male GQ SALEM 1164 196.00
26 26 A26 Male GQ ERODE 1152 195.75
27 27 A27 Female GQ DHARAPURAM 1159 194.50
28 28 A28 Female GQ COIMBATORE 1112 195.50
29 29 A29 Male GQ TIRUPPUR 1112 179.50
30 30 A30 Male GQ SIVAGANGAI 1147 194.25
31 31 A31 Female GQ CUDDALORE 1127 195.00
32 32 A32 Female GQ TIRUCHIRAPPALLI 1152 192.75
33 33 A33 Female GQ ERODE 1135 194.50
34 34 A34 Male GQ KRISHNAGIRI 1143 192.50
35 35 A35 Male GQ THIRUVALLUR 1125 195.00
36 36 A36 Male GQ VIRUDHUNAGAR 1143 195.75
37 37 A37 Male GQ MADURAI 1041 181.25
40 40 A40 Male GQ METTUPALAYAM 1094 187.00
47 47 A47 Male GQ COIMBATORE 1129 193.00
48 48 A48 Female GQ DINDIGUL 1101 188.75
49 49 A49 Female GQ THE NILGIRIS 1134 192.75
50 50 A50 Female GQ ERODE 1117 193.75
51 51 A51 Male GQ CUDDALORE 1129 188.75
52 52 A52 Male GQ RAMANATHAPURAM 1137 194.25
53 53 A53 Female GQ THENI 1115 193.25
54 54 A54 Male GQ ERODE 1144 190.25
55 55 A55 Male GQ VELLORE 1124 193.50

R Code:
a = mean(p$CUTOFFMARKS)
a

Output:
192.6638

6.To get the list of girl students and their average cutoff.

R Code:
p = subset(data,GENDER=="Female")
print(p)

Output:

X STUDENTNAME GENDER SEATCATG CITYNAME TOTALMARKS CUTOFFMARKS


2 2 A2 Female MQ COIMBATORE 1097 183.75
3 3 A3 Female MQ COIMBATORE 1096 183.50
4 4 A4 Female MQ COIMBATORE 1085 187.50
7 7 A7 Female MQ KARUR 1088 180.75
12 12 A12 Female MQ THE NILGIRIS 1125 190.50
14 14 A14 Female MQ COIMBATORE 1003 158.00
17 17 A17 Female MQ COIMBATORE 963 162.00
18 18 A18 Female GQ COIMBATORE 1135 195.50
20 20 A20 Female GQ TIRUPPUR 1158 195.25
22 22 A22 Female MQ COONOOR 1115 189.50
23 23 A23 Female GQ TRICHIRAPPALLI 1114 192.25
27 27 A27 Female GQ DHARAPURAM 1159 194.50
28 28 A28 Female GQ COIMBATORE 1112 195.50
31 31 A31 Female GQ CUDDALORE 1127 195.00
32 32 A32 Female GQ TIRUCHIRAPPALLI 1152 192.75
33 33 A33 Female GQ ERODE 1135 194.50
39 39 A39 Female MQ SIVAGANGAI 1038 168.00
48 48 A48 Female GQ DINDIGUL 1101 188.75
49 49 A49 Female GQ THE NILGIRIS 1134 192.75
50 50 A50 Female GQ ERODE 1117 193.75
53 53 A53 Female GQ THENI 1115 193.25

R Code:
a = mean(p$CUTOFFMARKS)
a
Output:
187.0119
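
Group means like the ones in items 5 and 6 can also be obtained in a single call with tapply(), without creating a subset first. The sketch below is an illustration only: it rebuilds four rows from the listings above as a small data frame (demo is a hypothetical name) instead of reading the CSV file, so its results describe those four rows and not the full data set.

```r
# Four rows copied from the listings above (A2, A18, A19, A21)
demo <- data.frame(
  STUDENTNAME = c("A2", "A18", "A19", "A21"),
  GENDER      = c("Female", "Female", "Male", "Male"),
  SEATCATG    = c("MQ", "GQ", "GQ", "GQ"),
  CUTOFFMARKS = c(183.75, 195.50, 195.75, 195.25)
)

# Mean cutoff mark for each gender
by_gender <- tapply(demo$CUTOFFMARKS, demo$GENDER, mean)
print(by_gender)

# Mean cutoff mark for each seat category
by_seat <- tapply(demo$CUTOFFMARKS, demo$SEATCATG, mean)
print(by_seat)
```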

KUMARAGURU COLLEGE OF TECHNOLOGY


LABORATORY MANUAL

Experiment Number: 06

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Standard deviation, Five Number Summary,
Box Plot

STEP 1: INTRODUCTION
OBJECTIVES OF THE EXPERIMENT
1. To compute the standard deviation
2. To compute the five-number summary
3. To draw and interpret box plots
STEP 2: ACQUISITION
1. The five-number summary is an exploratory data analysis technique that summarizes
a data set with five numbers: the minimum value, first quartile (Q1), median, third
quartile (Q3) and maximum value.
• A box-and-whisker plot, or boxplot, is a diagram based on the five-number summary
of a data set.
• The boxplot is a visual representation of the distribution of the data.
• The rectangular box is drawn with one end at Q1 and the other end at Q3, with a
line segment across the box at the median value.
• Two line segments, called the whiskers, extend from the box: one down to the
minimum value and one up to the maximum value.
• The difference between Q3 and Q1 is called the interquartile range (IQR).
• When outliers are drawn separately, the ends of the whiskers show the highest and
lowest values excluding the outliers.

Procedure for doing the Experiment:

1. To import data from a given MS-Excel file and to find standard deviation from a
data frame

1. To locate current working directory


# Get and print current working directory.
print(getwd( ))
2. To import data from Excel sheet

To import data from Excel sheet ‘abc’, first save the file as .csv(comma
delimited) in the current working directory. Then execute the following
command
data = read.csv("abc.csv")
data
3. $ symbol is used to extract a specific field.

4. Standard Deviation
z = sd (data$ ----)
z

Example:
To import data from Excel sheet

To import data from Excel sheet ‘Test marks’, first save the file as .csv (comma delimited) in
the current working directory. Then execute the following command

data = read.csv("Testmarks.csv")

data

Output:
Sl.No. Name IT.1 IT.II
1 1 A 26 32
2 2 B 25 25
3 3 C 19 31
4 4 D 14 26
5 5 E 25 28
6 6 F 32 32
7 7 G 29 42
8 8 H 25 26
9 9 I 31 38
10 10 J 35 39
11 11 K 33 31
12 12 L 35 36

To get Standard Deviation in Internal test 1


std = sd(data$IT.1)
std
Output
6.416716
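
sd() uses the sample formula, with divisor n - 1. As a check under that definition, the sketch below recomputes the standard deviation of the IT.1 marks by hand. The vector it1 is typed in from the Testmarks table above instead of being read from Testmarks.csv, and the names manual_var and manual_sd are illustrative.

```r
# IT.1 marks copied from the Testmarks table above
it1 <- c(26, 25, 19, 14, 25, 32, 29, 25, 31, 35, 33, 35)

n <- length(it1)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((it1 - mean(it1))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

print(manual_sd)
print(sd(it1))   # built-in result, agrees with manual_sd
```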

2. To compute the five-number summary and draw a box plot

data=c(x,y,z,….)
min(data)
max(data)
quantile(data)
quantile(data,0.25)
quantile(data,0.75)
fivenum(data)
summary(data)
boxplot(data,range=0.0,horizontal=FALSE,varwidth=TRUE,notch=FALSE,
outline=TRUE, boxwex = 0.5,border=c("blue"), col=c("green"), xlab="A",
ylab="B", main="M")


The important parameters of the boxplot function

x - Data in the form of a numeric vector, a list of vectors or a data frame.
range - A number that decides how far the whiskers extend. A value of zero makes the
whiskers extend to the extreme data points on both sides. A positive value m extends each
whisker to the most extreme data point that lies within m times the interquartile range
from the box. Points outside this range are marked as outliers.
width - A vector giving the relative widths of the boxes making up the plot.
varwidth - A logical value that decides whether the width of the box is related to the data size.
If varwidth=TRUE, the box width will be proportional to the square root of the number of
observations in the data.
If varwidth=FALSE, the width of the box will not depend on the data size.
notch - If notch is TRUE, a notch is drawn on each side of the boxes.
outline - This controls the display of outliers.
If outline=FALSE, outliers are not drawn.
If outline=TRUE, outliers are drawn as points.
names - A vector of strings to be printed as names under each box.
boxwex - A scale factor to be applied to all boxes. When there are only a few groups, the
appearance of the plot can be improved by making the boxes narrower.
horizontal - A logical value that decides whether the box and whiskers are drawn horizontally
or vertically.
horizontal=TRUE creates horizontal boxes.
horizontal=FALSE creates vertical boxes.
col - Colour used to fill the bodies of the boxes.
By default, the inside of the boxes is painted with the background colour.
na.action - A function which indicates the action to be taken when the data has NA's.
By default, missing values are ignored in the plot.


For a comprehensive list of all commands, type help(boxplot) in R prompt.

Example 1.
Compute the five-number summary (minimum, 1st quartile, median, 3rd quartile, maximum)
for the test scores of 9 students in Mathematics and visualize the summary statistics
using a box plot.
Test Scores: 78, 93, 68, 84, 90, 74, 64, 55, 80.

R- code:
scores = c (78, 93, 68, 84, 90, 74, 64, 55, 80)
min(scores)
max(scores)
quantile(scores)
quantile(scores,0.25)
quantile(scores,0.75)
fivenum(scores)
summary(scores)
boxplot(scores,range=0.0,horizontal=FALSE,varwidth=TRUE,notch=FALSE,outline=TRUE,
boxwex=0.5,border=c("blue"),col=c("green"),xlab="Mathematics",ylab="Students Marks",
main="Summary Statistics of Mathematics Marks")

Output:
scores = c (78, 93, 68, 84, 90, 74, 64, 55, 80)
min(scores)
55
max(scores)
93
quantile(scores)
0% 25% 50% 75% 100%
55 68 78 84 93
fivenum(scores)
55 68 78 84 93
summary(scores)
Min. 1st Qu. Median Mean 3rd Qu. Max.

55.00 68.00 78.00 76.22 84.00 93.00
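
As a follow-up to the summary above, the interquartile range and the whisker limits that boxplot() applies when range is left at its default of 1.5 can be computed directly. This is a sketch; the names lower_fence and upper_fence are illustrative and do not come from the manual.

```r
scores <- c(78, 93, 68, 84, 90, 74, 64, 55, 80)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
fn <- fivenum(scores)
iqr <- fn[4] - fn[2]            # spread of the middle half of the data

# Points beyond these limits would be drawn as outliers when range = 1.5
lower_fence <- fn[2] - 1.5 * iqr
upper_fence <- fn[4] + 1.5 * iqr

print(c(iqr = iqr, lower = lower_fence, upper = upper_fence))

# Scores lying outside the limits (none for this data set)
print(scores[scores < lower_fence | scores > upper_fence])
```

Because every score lies inside the limits, the box plot for this example looks the same with range=0.0 as with the default range.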



Example 2:

The analysis of data obtained from a cloud seeding experiment is represented below:

A cloud was deemed “seedable” if it satisfied certain criteria; for each seedable cloud a
decision was made at random whether to actually seed. The nonseeded clouds are referred to
as control clouds. The following table presents the rainfall from 26 seeded and 26 control
clouds.

Seeded Clouds: 129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5, 200.7, 274.7,
274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1, 118.3, 255, 115.3, 242.5, 32.7, 40.6

Control Clouds: 26.1, 26.3, 87, 95, 372.4, 0.01, 17.3, 24.4, 11.5, 321.2, 68.5, 81.2, 47.3,
28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9, 41.1, 29, 163, 244.3, 147.8, 21.7

a. Make a five-number summary of the seeded clouds and control clouds.
b. Identify whether any extreme outliers exist in either group of clouds.
c. Compare the distributions of rainfall from seeded and control clouds. Justify your answer.

Sol:

R Code:
x=c(129.6,31.4,2745.6,489.1,430,302.8,119,4.1,92.4,17.5,200.7,274.7,274.7,7.7,1656,978,198.6,
703.4, 1697.8,334.1,118.3,255,115.3,242.5,32.7,40.6)
y=c(26.1,26.3,87,95,372.4,0.01,17.3,24.4,11.5,321.2,68.5,81.2,47.3,28.6,830.1,345.5,1202.6,
36.6,4.9,4.9,41.1,29,163,244.3,147.8,21.7)
min(x)
max(x)
quantile(x)
quantile(x,0.25)

quantile(x,0.75)
fivenum(x)
summary(x)
min(y)
max(y)
quantile(y)
quantile(y,0.25)
quantile(y,0.75)
fivenum(y)
summary(y)
boxplot(x,y,names=c("Seeded Clouds"," Control Clouds"),horizontal=FALSE,
varwidth=TRUE, notch=FALSE, outline=TRUE, boxwex=0.8, border=c("blue"),
col=c("pink"), xlab="Clouds", ylab="Rainfall",main="Comparative Study")

Output:
a)
min(x)
4.1
> max(x)
2745.6
>quantile(x)
0% 25% 50% 75% 100%
4.100 98.125 221.600 406.025 2745.600
>quantile(x,0.25)
25%
98.125
>quantile(x,0.75)
75%
406.025
>fivenum(x)
4.1 92.4 221.6 430.0 2745.6
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.10 98.12 221.60 441.98 406.02 2745.60
> min(y)
0.01
> max(y)
1202.6
>quantile(y)
0% 25% 50% 75% 100%
0.010 24.825 44.200 159.200 1202.600
>quantile(y,0.25)
25%
24.825
>quantile(y,0.75)
75%
159.2
>fivenum(y)
0.01 24.40 44.20 163.00 1202.60
> summary(y)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.01 24.82 44.20 164.55 159.20 1202.60

b) Yes. Four outliers exist in the seeded clouds and three outliers exist in the control clouds.

c) The seeded clouds have a greater median and greater variability (spread) in their rainfall.
In both groups the mean is well above the median (441.98 against 221.60 for the seeded clouds,
164.55 against 44.20 for the control clouds), so both distributions are skewed to the right.
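
The outliers in part (b) can be listed without reading them off the plot. The sketch below uses boxplot.stats(), which applies the 1.5 times IQR rule (based on the hinges returned by fivenum()) that boxplot() uses when range is left at its default. It also prints the mean and median of each group as a quick indicator of skewness. The names seeded_out and control_out are illustrative.

```r
x <- c(129.6, 31.4, 2745.6, 489.1, 430, 302.8, 119, 4.1, 92.4, 17.5, 200.7,
       274.7, 274.7, 7.7, 1656, 978, 198.6, 703.4, 1697.8, 334.1, 118.3, 255,
       115.3, 242.5, 32.7, 40.6)
y <- c(26.1, 26.3, 87, 95, 372.4, 0.01, 17.3, 24.4, 11.5, 321.2, 68.5, 81.2,
       47.3, 28.6, 830.1, 345.5, 1202.6, 36.6, 4.9, 4.9, 41.1, 29, 163, 244.3,
       147.8, 21.7)

# $out holds the points lying beyond the whiskers
seeded_out  <- boxplot.stats(x)$out
control_out <- boxplot.stats(y)$out

print(sort(seeded_out))
print(sort(control_out))

# A mean well above the median suggests a right-skewed distribution
print(c(mean = mean(x), median = median(x)))
print(c(mean = mean(y), median = median(y)))
```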

Task 1:

Import the Excel file ‘Cotton prices-International and Domestic.xlsx’ (saved as
‘Cotton prices.csv’, as in the R code below) from your directory to create a data frame
and find the standard deviation of
Cotlook.A.Minimum, Cotlook.A.Maximum, Cotlook.A...Average, Shankar.6.Maximum,
Shankar.6.Minimum and Shankar.6.Average.
R Code:
dat=read.csv("Cotton prices.csv")
dat
Output:
Cotlook.A.Minimum Cotlook.A.Maximum Range Cotlook.A...Average
1 79.85 85.30 5.45 81.95
2 79.40 82.20 2.80 80.87
3 81.85 84.80 2.95 83.37
4 83.10 90.35 7.25 85.51
5 88.80 90.90 2.10 89.71
6 91.40 98.85 7.45 94.45


7 90.60 95.70 5.10 92.68
8 89.40 95.10 5.70 92.74
9 88.80 96.65 7.85 93.08
10 91.15 93.95 2.80 92.60
11 89.15 97.35 8.20 92.59
12 88.35 91.45 3.10 89.95
13 85.40 93.15 7.75 89.33
14 83.75 85.60 1.85 84.64
15 85.15 89.70 4.55 87.49
16 88.05 94.45 6.40 90.96
17 91.95 95.75 3.80 94.05
18 93.30 98.90 5.60 96.93
19 92.20 97.75 5.55 94.20
20 89.40 95.80 6.40 92.71
21 89.30 93.70 4.40 90.90
22 79.60 88.40 8.80 83.84
23 72.15 76.05 3.90 74.04
24 69.95 76.15 6.20 73.38
25 69.65 71.45 1.80 70.35
26 65.90 70.00 4.10 67.53
27 66.00 70.25 4.25 68.38
28 65.30 68.75 3.45 67.35
29 67.05 71.75 4.70 69.84
30 67.20 71.25 4.05 69.35
31 69.55 73.95 4.40 71.72
32 71.05 74.70 3.65 72.86
33 71.25 74.35 3.10 72.36
34 70.65 74.80 4.15 72.35
35 69.85 74.10 4.25 71.82
36 66.40 70.25 3.85 68.74
37 66.65 70.85 4.20 69.03
38 68.30 70.55 2.25 69.22
39 69.50 71.70 2.20 70.39
40 67.70 69.95 2.25 68.75
41 65.05 68.95 3.90 66.57
42 64.05 66.50 2.45 65.46
43 66.40 71.70 5.30 69.28
44 68.80 72.95 4.15 70.28
45 71.80 76.15 4.35 74.10
46 74.85 85.39 10.54 81.07
47 75.70 85.85 10.15 80.26
48 75.00 80.65 5.65 77.87
49 76.55 80.35 3.80 78.52
50 76.95 81.15 4.20 78.92
51 78.20 80.70 2.50 79.53
52 79.65 84.25 4.60 82.33
53 84.10 86.80 2.70 85.16
54 85.75 88.10 2.35 86.84
55 84.60 88.80 4.20 87.04
56 86.40 94.90 8.50 88.64
57 82.60 87.70 5.10 84.66
58 82.20 85.05 2.85 84.09
59 77.40 81.35 3.95 79.36
60 78.55 84.70 6.15 80.59
61 77.60 80.40 2.80 78.60


62 79.00 81.60 81.60 61.80
Shankar.6.Minimum Shankar.6.Maximum Range.1 Shankar.6.Average
1 32900 34400 1500 33450
2 33000 33800 800 33564
3 33300 34200 900 33764
4 33600 34300 700 33771
5 33900 37200 3300 35013
6 37000 39300 2300 38275
7 36700 39400 2700 38139
8 37000 38600 1600 37742
9 38500 41500 3000 39892
10 41000 43200 2200 42370
11 42400 49000 6600 45968
12 46900 48900 2000 47805
13 41000 48500 7500 44776
14 38800 40900 2100 39935
15 38500 40400 1900 39284
16 40200 42800 2600 42015
17 41800 43200 1400 42565
18 41500 42400 900 41943
19 41400 42900 1500 42038
20 40700 43200 2500 42065
21 41200 42900 1700 42044
22 39500 42900 3400 41542
23 39000 40500 1500 39835
24 34700 39900 5200 38360
25 32700 34000 1300 33448
26 32400 33200 800 32812
27 32900 33300 400 33146
28 29800 32900 3100 31300
29 30100 31300 1200 30678
30 30700 32600 1900 31122
31 32200 34200 2000 33296
32 34200 35500 1300 34922
33 33200 35000 1800 34232
34 33800 34600 800 34293
35 33500 34700 1200 33992
36 33000 35500 2500 34672
37 31800 32900 1100 32472
38 32000 32500 500 32209
39 32400 34000 1600 33223
40 33400 34000 600 33672
41 33100 33800 700 33452
42 32100 33200 1100 32676
43 32800 34700 1900 33975
44 34700 36800 2100 35315
45 36700 42700 6000 39456
46 42700 48500 5800 45896
47 43900 47800 3900 46269
48 43000 48000 5000 45125
49 37700 44500 6800 41233
50 37700 40000 2300 38728
51 38600 39600 1000 39007
52 40000 42600 2600 41256
53 41900 43000 1100 42482


54 42600 43700 1100 43085
55 42100 44000 1900 42967
56 41600 43000 1400 42396
57 42300 43100 800 42642
58 41800 43300 1500 42362
59 42200 42600 400 42323
60 38700 42300 3600 40829
61 37800 39000 1200 38468
62 37200 38100 900 35861

R-CODE:

ct=sd(dat$Cotlook.A.Minimum)
ct1=sd(dat$Cotlook.A.Maximum)
ca=sd(dat$Cotlook.A...Average)
smax=sd(dat$Shankar.6.Maximum)
smin=sd(dat$Shankar.6.Minimum)
avg=sd(dat$Shankar.6.Average)
ct
ct1
ca
smax
smin
avg

Output:

ct
8.927676
ct1
9.65965
ca
9.470624
smax
4970.638
smin
4234.153
avg
4585.738
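
Calling sd() once per column works, but the same results can be collected in one step with sapply(). A minimal sketch, assuming a data frame with the column names shown above: only the first five rows of two columns are typed in here for illustration (dat_small is a hypothetical name), so the numbers it prints differ from the full-file output.

```r
# First five rows of two columns, copied from the listing above
dat_small <- data.frame(
  Cotlook.A.Minimum = c(79.85, 79.40, 81.85, 83.10, 88.80),
  Shankar.6.Minimum = c(32900, 33000, 33300, 33600, 33900)
)

# Columns whose standard deviation is wanted
cols <- c("Cotlook.A.Minimum", "Shankar.6.Minimum")

# Apply sd() to each selected column; the result is a named numeric vector
sds <- sapply(dat_small[cols], sd)
print(sds)
```

With the full data frame, listing all six column names in cols would return the six standard deviations in a single named vector.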

KUMARAGURU COLLEGE OF TECHNOLOGY

LABORATORY MANUAL

Experiment Number: 07
Lab Code : U18MAR0001
Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Scatter diagram, Correlation

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT

1. To construct the scatter plot and to visualize the relationship between two quantitative
variables.
2. To find the correlation between two variables in a data set.
3. To find the coefficient of rank correlation between two variables in a data set by
Spearman’s method.

STEP 2: ACQUISITION

Procedure for doing the Experiment:

1. To construct the scatter plot with the variables x and y

x=c(a,b,....)
y=c(l,m,....)
plot(x,y,xlab="....",ylab="....",xlim=c(0,10),ylim=c(0,25),col=c("...."),main="....")

2. To find the correlation between x and y

x=c(a,b,....)
y=c(l,m,....)
r=cor(x,y)
r

3. To find the Spearman’s rank correlation coefficient between x and y

x=c(a,b,....)
y=c(l,m,....)
r=cor(x,y,method="spearman")
r

Example

Construct the scatter plot and find the coefficient of correlation and Spearman’s rank
correlation coefficient between the ends per inch (X) and picks per inch (Y).
x 23 27 28 28 29 30 31 33 35 36
y 18 20 22 27 21 29 27 29 28 29
R code:

x=c(23,27,28,28,29,30,31,33,35,36)
y=c(18,20,22,27,21,29,27,29,28,29)
plot(x,y,xlab="ends per inch",ylab="picks per inch",xlim=c(0,50),ylim=c(0,40),
col=c("green"),main="scatter plot of end and picks per inch")
endspicks=cor(x,y)
endspicks
rank=cor(x,y,method="spearman")
rank

Scatter Plot:

Output:
Correlation coefficient = 0.8176052
Spearman correlation coefficient = 0.8426568

Conclusion:
There is a strong positive correlation between ends per inch (X) and picks per inch (Y).
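
To see what cor() computes, the sketch below recomputes both coefficients for the example data from their definitions: Pearson's r from the sums of squares and products, and Spearman's coefficient as Pearson's r applied to the ranks. The names r_manual and rho_manual are illustrative.

```r
x <- c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y <- c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

# Pearson's r: sum of cross-products over the root of the two sums of squares
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Spearman's coefficient is Pearson's r on the ranks;
# rank() gives tied values the average of their ranks
rho_manual <- cor(rank(x), rank(y))

# Each pair should agree
print(c(manual = r_manual, builtin = cor(x, y)))
print(c(manual = rho_manual, builtin = cor(x, y, method = "spearman")))
```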
Task 1

Calculate the coefficient of correlation from the following figures relating to the
consumption of fertilizer and the output of food grains in a district X:
Chemical fertilizer used (in metric
tonnes):100,110,120,130,140,150,160,170,180,190,200,210,220,230
Output of food (in metric
tonnes):1000,1050,1080,1150,1200,1220,1300,1360,1420,1500,1600,1650,1650,1650
Also draw the scatter plot diagram for the above data and justify the result.

R-Code:

input=c(100,110,120,130,140,150,160,170,180,190,200,210,220,230)
output=c(1000,1050,1080,1150,1200,1220,1300,1360,1420,1500,1600,1650,1650,1650)
plot(input,output,xlab="Chemical fertilizers used",ylab="Output of food",
xlim=c(0,250),ylim=c(0,2000),col=c("green"),
main="Consumption of fertilizers and the output of food grains in a district X")
picks=cor(input,output)
picks

Output: 0.991053

KUMARAGURU COLLEGE OF TECHNOLOGY

LABORATORY MANUAL

Experiment Number: 08

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : Regression

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT



1. To determine the equations of the regression lines for variables and to predict the
value of one variable when the value of the other variable is given.
2. To construct the regression plot for the given variables.

STEP 2: ACQUISITION

Procedure for doing the Experiment:

1. To find regression line of y on x

regyx=lm(y~x) #lm stands for linear model
regyx

2. To find regression line of x on y

regxy=lm(x~y)
regxy

3. To construct the regression plot of y on x

plot(x,y)
abline(lm(y ~ x),col="---")

Note:
i) abline(lm(y~x)) --- adds the regression line to the plot
ii) plot(y~x) --- creates a scatterplot of y versus x
iii) regmodel = lm(y~x) --- fits a regression model

Example

1. Find the coefficient of correlation between the ends per inch (X) and picks per inch (Y).
Also find the two regression lines. Estimate the value of y when x = 26.
x 23 27 28 28 29 30 31 33 35 36
y 18 20 22 27 21 29 27 29 28 29
Also construct the regression plot of y on x and a plot showing both regression lines.
R code:
x = c(23, 27,28,28,29,30,31,33,35,36)
y = c(18,20,22,27,21,29,27,29,28,29)
To find the regression line of y on x
regyx=lm(y~x)
regyx
Output
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-1.7391 0.8913
ie, regression line of y on x is y= -1.7391+0.8913x
To find the regression line of x on y:
regxy=lm(x~y)
regxy
Output:
Call:
lm(formula = x ~ y)
Coefficients:
(Intercept) y
11.25 0.75
ie, regression line of x on y is x = 11.25+0.75y
To find y when x=26
y1= - 1.7391+0.8913*26
y1
[1] 21.4347
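
Typing the rounded coefficients by hand works for one value, but the fitted model can also be evaluated directly. The sketch below uses coef() and predict() on the same data; y_at_26 is an illustrative name. Because predict() uses the unrounded coefficients, its result can differ from the hand calculation in the last decimal places.

```r
x <- c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y <- c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

regyx <- lm(y ~ x)

# Intercept and slope at full precision
print(coef(regyx))

# Evaluate the fitted line at x = 26; newdata must use the predictor's name
y_at_26 <- predict(regyx, newdata = data.frame(x = 26))
print(y_at_26)
```

Passing a longer vector, for example data.frame(x = c(26, 30, 34)), returns one predicted y per x value.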

Regression plot of y on x
R Code:
plot(x,y)
abline(lm(y ~ x),col="dark green")
Output:

# The regression line of y on x is y = -1.7391+0.8913x


# The regression line of x on y is x = 11.25+0.75y which can be rewritten as y =-15+1.33x
#To find two regression lines
x = 0:40
y = -1.7391+0.8913*x
z = -15+1.33*x
plot (x,y,type="l",col="blue",lwd=5, xlab="x", ylab="y")
lines (x, z, col="red", lwd=2)
title ("2 Regression lines")
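
Two standard properties link the pair of regression lines to the correlation coefficient: the product of the two slopes equals r squared, and both lines pass through the point of means. The sketch below checks both for the example data; byx and bxy are illustrative names for the slopes.

```r
x <- c(23, 27, 28, 28, 29, 30, 31, 33, 35, 36)
y <- c(18, 20, 22, 27, 21, 29, 27, 29, 28, 29)

byx <- coef(lm(y ~ x))[["x"]]   # slope of the regression of y on x
bxy <- coef(lm(x ~ y))[["y"]]   # slope of the regression of x on y

# Product of the slopes compared with the squared correlation coefficient
print(c(product = byx * bxy, r_squared = cor(x, y)^2))

# The two lines intersect at (mean of x, mean of y)
print(c(mean_x = mean(x), mean_y = mean(y)))
```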

KUMARAGURU COLLEGE OF TECHNOLOGY


LABORATORY MANUAL

Experiment Number: 09

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : ANOVA – one way classification

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT



To perform analysis of variance for a completely randomized design

STEP 2: ACQUISITION
Analysis of variance refers to the separation of variance ascribable to one group of causes
from the variance ascribable to the other group. It is used to test the homogeneity of several
means.

Three types of variation are present in data:

1. Treatments
2. Environmental
3. Residual or Error

Assumptions for ANOVA test

1. The observations are independent.


2. The parent population is normal
3. Various treatment and environmental effects are additive in nature.
4. The samples have been randomly selected from the population

Null Hypothesis: All the population means are equal

Alternative Hypothesis: Some of the means are not equal.

Three important designs of experiments:

1. Completely Randomised Design (CRD) – One-way classification


2. Randomised Block Design (RBD) – Two-way classification
3. Latin Square Design (LSD) – Three-way classification

Procedure for doing the Experiment:

1. aov(response~factor,data=data_name)

Example

A drug company tested three formulations of a pain relief medicine for migraine
headache sufferers. For the experiment, 27 volunteers were selected and 9 were
randomly assigned to each of the three drug formulations. The subjects were instructed to
take the drug during their next migraine headache episode and to report their pain on a
scale of 1 to 10 (10 being maximum pain).
Drug A 4 5 4 3 2 4 3 4 4
Drug B 6 8 4 5 4 6 5 8 6
Drug C 6 7 6 6 7 5 6 5 5
R-code:
pain=c(4,5,4,3,2,4,3,4,4,6,8,4,5,4,6,5,8,6,6,7,6,6,7,5,6,5,5)

drug=c(rep("A",9),rep("B",9),rep("C",9))

data=data.frame(pain,drug)

data

results=aov(pain~drug,data=data)
summary(results)
Output:

pain drug
1 4 A
2 5 A
3 4 A
4 3 A
5 2 A
6 4 A
7 3 A
8 4 A
9 4 A
10 6 B
11 8 B
12 4 B
13 5 B
14 4 B
15 6 B
16 5 B
17 8 B
18 6 B
19 6 C
20 7 C
21 6 C
22 6 C
23 7 C
24 5 C
25 6 C
26 5 C
27 5 C

Df Sum Sq Mean Sq F value Pr(>F)


drug 2 28.22 14.111 11.91 0.000256 ***
Residuals 24 28.44 1.185
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the F-table, F0.05(2, 24) = 3.40. Since F = 11.91 > 3.40, we reject the null hypothesis
and conclude that the means of the three drug groups are not all equal.
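
The tabulated critical value can also be obtained in R with qf(), the quantile function of the F distribution. The sketch below assumes a 5% significance level, refits the example and compares the observed F with the critical value; the names f_value and f_crit are illustrative.

```r
pain <- c(4, 5, 4, 3, 2, 4, 3, 4, 4,
          6, 8, 4, 5, 4, 6, 5, 8, 6,
          6, 7, 6, 6, 7, 5, 6, 5, 5)
drug <- factor(rep(c("A", "B", "C"), each = 9))

fit <- aov(pain ~ drug)
tab <- summary(fit)[[1]]          # the ANOVA table as a data frame

f_value <- tab[["F value"]][1]    # observed F for the drug factor

# 5% critical value with 2 and 24 degrees of freedom (alpha = 0.05 is assumed)
f_crit <- qf(0.95, df1 = 2, df2 = 24)

print(c(F = f_value, critical = f_crit))
print(f_value > f_crit)           # TRUE means the null hypothesis is rejected
```

Comparing the p-value in the Pr(>F) column with 0.05 leads to the same decision.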

Task 1

Three machines A, B and C gave the following production of pieces on 4 days. Is there a
significant difference between the machines?
A 17 16 14 13
B 15 12 19 18
C 20 8 11 17

R-CODE:

production=c(17,16,14,13,15,12,19,18,20,8,11,17)

machine=c(rep("A",4),rep("B",4),rep("C",4))

data=data.frame(production,machine)

result=aov(production~machine,data=data)

summary(result)

OUTPUT

Df Sum Sq Mean Sq F value Pr(>F)

machine 2 8 4.00 0.277 0.764

Residuals 9 130 14.44

KUMARAGURU COLLEGE OF TECHNOLOGY


LABORATORY MANUAL

Experiment Number: 10

Lab Code : U18MAR0001


Lab : Statistical Foundations for Data Science
Year : II
Title of the Experiment : ANOVA – two way classification

STEP 1: INTRODUCTION

OBJECTIVES OF THE EXPERIMENT

To perform analysis of variance for a Randomised Block Design.

STEP 2: ACQUISITION

The data collected from experiments with a randomised block design form a two-way
classification, classified according to two factors: blocks and treatments. The two-way table
has k rows and r columns, i.e. N = kr entries.

Consider an agricultural experiment in which we wish to test the effect of k fertilising
treatments on the yield of a crop. We divide the plots into r blocks, according to soil fertility,
each block containing k plots. The plots in each block will be of homogeneous fertility. In
each block, the k treatments are given to the k plots in a random manner in such a way that
each treatment occurs only once in each block. The same k treatments are repeated from
block to block.

H01 : There is no difference in the yield of crop due to treatments

H02 : There is no difference in the yield of crop due to blocks

Procedure for doing the Experiment:

Consider a two way table with k rows and r columns

1. a=c(a1 , a2 ,………) (entries entered columnwise)


f=c("row1","row2","row3","row4","row5")
k=5
r=4
A=gl(k,1,r*k,factor(f))
A
B=gl(r,k,k*r)
B
av = aov(a ~ A+B)
summary(av)
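
The gl() calls above generate the row and column labels that match entries entered column-wise. As a small illustration with made-up values for a table of 3 rows and 2 columns (the names rows, cols and value are hypothetical), the sketch below shows the pattern each call produces.

```r
# gl(n, k, length) builds a factor with n levels, each repeated k times,
# recycled until the requested length is reached
rows <- gl(3, 1, 6, labels = c("row1", "row2", "row3"))
cols <- gl(2, 3, 6)

print(rows)   # row1 row2 row3 row1 row2 row3
print(cols)   # 1 1 1 2 2 2

# Pairing the factors with column-wise entries shows the cell of each value
value <- c(11, 21, 31, 12, 22, 32)
print(data.frame(value, row = rows, col = cols))
```

The row factor cycles fastest because the entries run down each column before moving to the next one.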

Example

The following data represent the number of units of loom crank bushes produced per day
by different workers using four different types of machines.

               Machine Type
Workers     A     B     C     D
   1       44    38    47    36
   2       46    40    52    43
   3       34    36    44    32
   4       43    38    46    33
   5       38    42    49    39

Test whether the 5 workers differ with respect to mean productivity and test whether the mean
productivity is the same for the four different machine types.

R-code:

a=c(44,46,34,43,38,38,40,36,38,42,47,52,44,46,49,36,43,32,33,39)

f=c("w1","w2","w3","w4","w5")

k=5

r=4

worker=gl(k,1,r*k,factor(f))

worker

machine=gl(r,k,k*r)

machine

av = aov(a ~ worker+machine)

summary(av)

Output:

a=c(44,46,34,43,38,38,40,36,38,42,47,52,44,46,49,36,43,32,33,39)
f=c("w1","w2","w3","w4","w5")
k=5
r=4
worker=gl(k,1,r*k,factor(f))
worker
[1] w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5 w1 w2 w3 w4 w5
Levels: w1 w2 w3 w4 w5
machine=gl(r,k,k*r)
machine
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Levels: 1 2 3 4
>av = aov(a ~ worker+machine)
>summary(av)
Df Sum Sq Mean Sq F value Pr(>F)
worker 4 161.5 40.37 6.574 0.00485 **
machine 3 338.8 112.93 18.388 8.78e-05 ***
Residuals 12 73.7 6.14
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion:

From the F-table, F0.05(4, 12) = 3.26 and F0.05(3, 12) = 3.49.

F1 = 6.574 > F0.05(4, 12) = 3.26, hence we reject H01 and conclude that the 5 workers differ
with respect to mean productivity.

F2 = 18.388 > F0.05(3, 12) = 3.49, hence we reject H02 and conclude that the 4 machines differ
with respect to mean productivity.
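
The two critical values can be reproduced with qf() instead of being read from a printed table. The sketch below assumes a 5% significance level, refits the example and compares both observed F values with their critical values; f_obs and f_crit are illustrative names, and labels is passed to gl() by name for readability.

```r
a <- c(44, 46, 34, 43, 38, 38, 40, 36, 38, 42,
       47, 52, 44, 46, 49, 36, 43, 32, 33, 39)

worker  <- gl(5, 1, 20, labels = paste0("w", 1:5))   # rows, cycling fastest
machine <- gl(4, 5, 20)                              # columns, in blocks of 5

tab <- summary(aov(a ~ worker + machine))[[1]]

# Observed F values for worker and machine (first two rows of the table)
f_obs <- tab[["F value"]][1:2]

# 5% critical values: (4, 12) df for workers, (3, 12) df for machines
f_crit <- c(qf(0.95, 4, 12), qf(0.95, 3, 12))

print(round(cbind(f_obs, f_crit), 3))
print(f_obs > f_crit)   # TRUE means the corresponding hypothesis is rejected
```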

Task 1

A company appoints 4 salesmen A,B,C,D and observes their sales in 3 seasons: summer,
winter and monsoon. The figures (in lakhs of Rs.) are given in the following table:

Salesmen
Season A B C D
Summer 45 40 38 37
Winter 43 41 45 38
Monsoon 39 39 41 41
Carry out an analysis of variance.

R-code

a=c(45,43,39,40,41,39,38,45,41,37,38,41)

f=c("Summer","Winter","Monsoon")

k=3

r=4

season=gl(k,1,r*k,factor(f))

season

salesman=gl(r,k,k*r)

salesman

av=aov(a~season+salesman)

summary(av)

output:

Df Sum Sq Mean Sq F value Pr(>F)

season 2 8.17 4.083 0.535 0.611

salesman 3 22.92 7.639 1.000 0.455

Residuals 6 45.83 7.639

