Data Science UNIT 3

Unit 3 covers Graphical Analysis and Hypothesis Testing, focusing on data visualization techniques using matplotlib, including bar charts, histograms, and box plots. It also discusses hypothesis testing methods such as t-tests and correlation coefficients. The unit emphasizes understanding data distributions and comparisons through various graphical representations.

UNIT-3

Graphical Analysis & Hypothesis Testing


Syllabus
Unit – III : Graphical Analysis & Hypothesis Testing

Visualizing Data: matplotlib, bar charts, histograms and frequency polygons,
box plots, quartiles, scatter plots, heat maps.

Simple hypothesis testing

Student's t-test: 1) two-sample t-test with unequal variance 2) two-sample
t-test with equal variance 3) one-sample t-test
The Wilcoxon U-test: 1) two-sample U-test 2) one-sample U-test, paired t-test
and U-test

Correlation: 1) Pearson correlation coefficient 2) Spearman's correlation
coefficient, tests for association

Reference Books
Joel Grus, "Data Science from Scratch", O'Reilly
Mark Gardener, "Beginning R: The Statistical Programming Language", Wiley
Graphical Analysis
3.1 Visualizing data : matplotlib
• A wide variety of tools exist for visualization. "matplotlib" is a Python library
that helps in visualizing and analyzing data for better understanding. It is widely
used for producing simple bar charts, line charts, and scatterplots through its
matplotlib.pyplot module.

• pyplot maintains an internal state in which you build up a visualization
step by step.

• Use the savefig() function to save a graph and the show() function to display it.
• Most of the matplotlib utilities lie under the pyplot submodule, and are
usually imported under the plt alias:
import matplotlib.pyplot as plt
Functions in matplotlib

plot() - to create the plot
show() - to display the created plots
xlabel() - to label the x-axis
ylabel() - to label the y-axis
title() - to give a title to the graph
xticks() - to decide how the markings are made on the x-axis
yticks() - to decide how the markings are made on the y-axis
Creating a Simple Plot

Program 1 :

# importing the required module
import matplotlib.pyplot as plt

# x axis values
x = [2, 4, 6]
# corresponding y axis values
y = [5, 2, 7]

# plotting the points
plt.plot(x, y)

# naming the x axis
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')

# giving a title to the graph
plt.title('A simple plot!')
# function to show the plot
plt.show()
Program 2 :

import matplotlib.pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
Types of Plots:
The most commonly used plots in Matplotlib are:
1. Bar Chart
2. Histogram
3. Frequency polygon
4. Quartiles
5. Box-plot
6. Scatter Plot
7. Heat maps
3.2 Bar Chart

1) Bar Chart:
A bar chart is a graph that represents categories of data with rectangular bars. A bar chart
describes comparisons between discrete categories. One axis of the plot represents
the specific categories being compared, while the other axis represents the measured values
corresponding to those categories. The bar() function is used to plot a bar chart.

Syntax:
bar(x, height, width, bottom, align)

The function creates a bar plot bounded with a rectangle depending on the given parameters.
x - x coordinates of the bars
height - scalar or sequence of scalars representing the height(s) of the bars
width - optional parameter - the width(s) of the bars, default 0.8
bottom - optional parameter - the y coordinate(s) of the bar bases, default None
align - optional parameter - {'center', 'edge'}, default 'center'


Program :
import matplotlib.pyplot as plt

x = ['A', 'B', 'C', 'D']
y1 = [10, 20, 10, 30]
plt.bar(x, y1, color='r')
plt.xlabel("X values")
plt.ylabel("Y values")
plt.title("Barplot using Matplotlib")
plt.show()
Program 2:

import matplotlib.pyplot as plt

movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi",
          "West Side Story"]
num_oscars = [5, 11, 3, 8, 10]

# x-coordinates 0, 1, 2, ...; with the default align='center',
# each bar is drawn centered on its x-coordinate
xs = range(len(movies))

# plot bars with x-coordinates [xs], heights [num_oscars]
plt.bar(xs, num_oscars)
plt.ylabel("# of Academy Awards")
plt.title("My Favorite Movies")

# label x-axis with movie names at bar centers
plt.xticks(xs, movies)
plt.show()
3.3 Histograms
 A histogram is an accurate representation of the distribution of numerical data. It is a
kind of bar graph. A histogram shows the frequency distribution (shape) of a set of
continuous data; each bar in a histogram represents the number of values present
in that range. In Python the hist() function is used to plot a histogram.

Syntax:
hist(x, bins, range, align, color, label)

x : This parameter is the sequence of data.
bins : optional parameter; an integer, sequence, or string.
range : optional parameter; the lower and upper range of the bins.
align : optional parameter; controls how the histogram is plotted. {'left', 'mid', 'right'}
color : optional parameter; a color spec.
label : optional parameter; a string (or sequence of strings) to match datasets.
To construct a histogram, follow these steps :
1) Bin the range of values.
2) Divide the entire range of values into a series of intervals.
3) Count how many values fall into each interval.
Program:
from matplotlib import pyplot as plt
import numpy as np

# Creating dataset
a = np.array([5, 5, 6, 7, 8, 10, 15, 15, 18, 20, 25, 27, 30, 50, 55,
              65, 65, 68, 88, 89, 90, 92, 94, 96, 99])

plt.hist(a, bins=[0, 20, 40, 60, 80, 100])
plt.title("histogram of result")
plt.xticks([0, 25, 50, 75, 100])
plt.xlabel('marks')
plt.ylabel('no. of students')

plt.show()
Differences between Bar chart and Histogram

A histogram is similar to a bar chart, but the two differ in

1) the type of data that is presented
2) the way they are drawn.

 Bar graphs are usually used to display "categorical data", that is, data that fits
into categories.
 Histograms, on the other hand, are usually used to present "continuous data",
that is, data representing a measured quantity where the numbers can take on
any value in a certain range. A good example is weight.

(Figure: a bar chart and a histogram, side by side.)

3.4 Frequency polygons

A frequency polygon shows the overall distribution of a data set. It looks like a line graph,
but the points on the graph can be plotted using data from a histogram or a frequency
table.

Frequency polygons are especially useful for comparing two data sets.

(Figures: a frequency polygon plotted from a histogram, and one plotted from a frequency table.)
 Frequency polygons are very similar to histograms, except histograms have bars,
and frequency polygons have dots and lines connecting the frequencies of each
class interval.

 To create a frequency polygon, start just as for histograms, by choosing a class
interval.

 Then draw an X-axis representing the values of the scores in your data.

 Mark the middle of each class interval with a tick mark, and label it with the middle
value represented by the class.

 Draw the Y-axis to indicate the frequency of each class. Place a point in the middle
of each class interval at the height corresponding to its frequency.

 Finally, connect the points. You should include one class interval below the lowest
value in your data and one above the highest value.

 The graph will then touch the X-axis on both sides.
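The steps above can be sketched with NumPy and matplotlib. This is a minimal illustration (not from the slides), reusing the marks data from the earlier histogram program; np.histogram() does the binning and counting, and the counts are plotted at the bin midpoints:

```python
import numpy as np
import matplotlib.pyplot as plt

# same marks data and bins as the histogram program
marks = [5, 5, 6, 7, 8, 10, 15, 15, 18, 20, 25, 27, 30, 50, 55,
         65, 65, 68, 88, 89, 90, 92, 94, 96, 99]
bins = [0, 20, 40, 60, 80, 100]

# count how many values fall into each interval
counts, edges = np.histogram(marks, bins=bins)

# midpoints of the class intervals
mids = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]

# pad with an empty class on each side so the polygon touches the x-axis
xs = [mids[0] - 20] + mids + [mids[-1] + 20]
ys = [0] + list(counts) + [0]

plt.plot(xs, ys, marker='o')
plt.title("Frequency polygon of marks")
plt.xlabel("marks")
plt.ylabel("no. of students")
plt.show()
```

Overlaying two such plt.plot() calls on the same axes gives the frequency-polygon comparison of two data sets described above.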


Lower     Upper     Middle of Class       Count    Cumulative
Limit     Limit     Interval (rounded)             Count
29.5      39.5      35                    0        0
39.5      49.5      45                    3        3
49.5      59.5      55                    10       13
59.5      69.5      65                    53       66
69.5      79.5      75                    107      173
79.5      89.5      85                    147      320
89.5      99.5      95                    130      450
99.5      109.5     105                   78       528
109.5     119.5     115                   59       587
119.5     129.5     125                   36       623
129.5     139.5     135                   11       634
139.5     149.5     145                   6        640
149.5     159.5     155                   1        641
159.5     169.5     165                   1        642
169.5     179.5     175                   0        642

642 psychology test scores, from the frequency table


Frequency polygon

Frequency polygon for the psychology test scores.

It is easy to discern the shape of the distribution from the figure. Most of the scores
are between 65 and 115. The distribution is clearly not symmetric, since good scores
(to the right) trail off more gradually than poor scores (to the left). The
distribution is skewed.
Cumulative Frequency polygon

Cumulative frequency polygon for the psychology test scores

A cumulative frequency polygon for the same test scores is shown in the figure. The
graph is the same as before except that the Y value for each point is the number
of students in the corresponding class interval plus all numbers in lower
intervals.
Uses of Frequency polygon

Frequency polygons are useful for comparing distributions. This is achieved by


overlaying the frequency polygons drawn for different data sets.

Overlaid cumulative frequency polygons.


3.5 Box-plot
• A box plot, also known as a whisker plot, displays the summary of a set of
data values: the minimum, first quartile, median, third quartile and maximum.
• In the box plot, a box is drawn from the first quartile to the third quartile,
with a line through the box at the median.
• For a vertical box plot the x-axis denotes the data being plotted while the
y-axis shows the scale of the data values.
Ex: Draw a box-and-whisker plot for the data set {3, 7, 8, 5, 12, 14, 21, 15, 18, 14}.

Arrange data in ascending order ={3,5,7,8,12, 14,14,15,18,21}

From our Example, we had the five-number summary:

Minimum: 3, Q1: 7, Median: 13, Q3: 15, and Maximum: 21
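As a quick check of this five-number summary (a sketch, not from the slides), the median-of-halves method can be written out with Python's statistics module:

```python
import statistics

data = [3, 7, 8, 5, 12, 14, 21, 15, 18, 14]
s = sorted(data)                       # [3, 5, 7, 8, 12, 14, 14, 15, 18, 21]

n = len(s)
median = statistics.median(s)          # (12 + 14) / 2 = 13
lower_half = s[:n // 2]                # [3, 5, 7, 8, 12]
upper_half = s[(n + 1) // 2:]          # [14, 14, 15, 18, 21]

q1 = statistics.median(lower_half)     # 7
q3 = statistics.median(upper_half)     # 15

print(min(s), q1, median, q3, max(s))
```

Note that other quartile conventions exist; this median-of-halves method matches the summary quoted above, but boxplot() may place the box edges slightly differently on small samples.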


The matplotlib.pyplot module of the matplotlib library provides the boxplot() function.
Syntax:

boxplot(data, notch, vert, widths)

data - array or sequence of arrays to be plotted
notch - optional parameter; accepts boolean values
vert - optional parameter; accepts boolean values, False for a horizontal and
True for a vertical plot
widths - optional parameter; accepts an array and sets the widths of the boxes
Program:
import matplotlib.pyplot as plt

# Creating dataset
data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

# Creating plot
plt.boxplot(data)

# show plot
plt.show()
Program:
import matplotlib.pyplot as plt

# Creating dataset
data = [1,3,3,4,5,6,6,7,8,8]

# Creating plot
plt.boxplot(data, notch=True)

# show plot
plt.show()
Uses of Boxplots
1) To compare and analyze two data sets
2) To identify outliers

1) To compare and analyze two data sets

Program:
import matplotlib.pyplot as plt

# Creating dataset
data1 = [1,3,3,4,5,6,6,7,8,8]
data2 = [2,4,4,5,7,7,8,8,8,9]

data = [data1, data2]

# Creating plot
plt.boxplot(data, vert = False)

# show plot
plt.show()
 It is also useful in comparing the distribution of data across data sets by drawing
box plots for each of them.

Ex: box-and-whisker plots below represent quiz scores out of 25 points for Quiz
1 and Quiz 2 for the same class

These box-and-whisker plots show that the lowest score, highest score,
and Q3 are all the same for both exams, so performance on the two exams was
quite similar.
However, the movement of Q1 up from a score of 6 to a score of 9 indicates that
there was an overall improvement.
On the first test, approximately 75% of the students scored at or above a score of
6. On the second test, the same proportion of students (75%) scored at or above a
score of 9.
2) To Identify outliers

Outliers : In a box plot, an outlier is a data point located outside the whiskers
of the box plot, that is, more than 1.5 times the interquartile range above the
upper quartile or below the lower quartile:

outlier < (Q1 - 1.5 * IQR) or outlier > (Q3 + 1.5 * IQR)

Here IQR is the interquartile range = Q3 - Q1

Program:
# with outliers
import matplotlib.pyplot as plt

# Creating dataset
data = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70,
        72, 73, 75, 75, 76, 76, 78, 79, 89]

# Creating plot
plt.boxplot(data)
# show plot
plt.show()
The round circles represent outliers. In this case the maximum and minimum values are outliers.
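The outlier rule can be checked numerically. This sketch (not from the slides) computes the 1.5 * IQR fences for the same dataset with NumPy:

```python
import numpy as np

data = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70,
        72, 73, 75, 75, 76, 76, 78, 79, 89]

q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1                       # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# any point outside the fences is an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, outliers)
```

matplotlib's boxplot() uses the same whis=1.5 rule by default, so the points this finds (52 and 89) are exactly the circles drawn on the plot.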
3.6 Quartiles

• Quartiles are measures of variation, which describe how spread out the data is.

• Quartiles are a type of quantile: values that divide the data into four
equal parts (quarters).

• The first quartile, or lower quartile, is the value that cuts off the first 25% of
the data when it is sorted in ascending order.
• The second quartile, or median, is the value that cuts off the first 50%.
• The third quartile, or upper quartile, is the value that cuts off the first 75%.
Procedure :
Arrange the data in ascending order, then start by finding the median or
middle value. The median splits the data values into halves. Then find the
middle value of each of these two halves.

Example:
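As a sketch of this procedure (not from the slides), using Python's statistics module on the same eight values as the NumPy program below:

```python
import statistics

values = [13, 21, 21, 40, 42, 48, 55, 72]   # already in ascending order

n = len(values)
q2 = statistics.median(values)              # (40 + 42) / 2 = 41
lower_half = values[:n // 2]                # [13, 21, 21, 40]
upper_half = values[(n + 1) // 2:]          # [42, 48, 55, 72]

q1 = statistics.median(lower_half)          # (21 + 21) / 2 = 21
q3 = statistics.median(upper_half)          # (48 + 55) / 2 = 51.5

print(q1, q2, q3)
```

Note that Q3 here (51.5) differs from the 49.75 that numpy.quantile() reports below: the median-splitting procedure corresponds to NumPy's 'midpoint' method, while numpy.quantile() defaults to linear interpolation. Several quartile conventions exist and they disagree on small samples.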

In Python, use the NumPy quantile() method to find the quartiles.
Syntax:
quantile(x, q)

x : This parameter is the sequence of data.
q : list of quantiles - 0.25 for Q1, 0.5 for Q2, 0.75 for Q3, plus 0 for the
minimum and 1 for the maximum value
Program:
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0,0.25,0.5,0.75,1])
print(x)

Output:
[13. 21. 41. 49.75 72. ]

Here Q0(Minimum) = 13, Q1(First Quartile) = 21, Q2(Second Quartile) = 41,


Q3(Third Quartile) = 49.75, Q4(Maximum) = 72

Program: only first quartile

import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [.25])
print(x)

Output:
[21.]
3.7 Scatter Plot

• A scatterplot is used for visualizing the relationship between two paired sets of
data.

• Scatter plots are widely used to represent relation among variables and how change
in one affects the other.

• Data are displayed as a collection of points, each having the value of one variable
determining the position on the horizontal axis and the value of the other variable
determining the position on the vertical axis.

• A scatter plot can suggest various kinds of correlations between variables with a
certain confidence interval.

Ex: weight and height, weight would be on y axis and height would be on the x
axis.
• Correlations may be positive (rising), negative (falling), or null (uncorrelated).
• If the pattern of dots slopes from lower left to upper right, it indicates a
positive correlation between the variables being studied. If the pattern of dots
slopes from upper left to lower right, it indicates a negative correlation.

• A line of best fit (alternatively called a 'trendline') can be drawn in order to study
the relationship between the variables.

• The scatter() method in the matplotlib library is used to draw a scatter plot.
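A line of best fit is not drawn by scatter() itself; one common approach (a sketch, not from the slides, with made-up data) is a least-squares fit with numpy.polyfit:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2, 4, 6, 8, 10, 12])
y = np.array([5, 9, 14, 17, 21, 26])   # roughly linear, made-up data

# fit y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y)
plt.plot(x, slope * x + intercept, color='red')  # the trendline
plt.title("Scatter plot with a line of best fit")
plt.show()
```

A rising trendline (positive slope) corresponds to the positive correlation described above; a falling one to negative correlation.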
Syntax:

scatter(x_axis_data, y_axis_data, s, c, marker, linewidths, edgecolors)

x_axis_data - an array containing x-axis data
y_axis_data - an array containing y-axis data
s - optional, marker size (can be a scalar or an array of size equal to the size of x or y)
c - optional, color or sequence of colors for the markers
marker - optional, marker style
linewidths - optional, width of the marker border
edgecolors - optional, marker border color
Program:

import matplotlib.pyplot as plt
import numpy as np

height = np.array([99, 86, 88, 111, 132, 103, 125, 94, 78, 77, 129, 116])
weight = np.array([65, 57, 59, 69, 77, 72, 75, 63, 45, 43, 81, 71])

plt.scatter(height, weight)

plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Relationship between Height & Weight')
plt.show()
Program:
Illustrates the relationship between the number of friends the users have and the number of minutes
they spend on the site every day:

import matplotlib.pyplot as plt

friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)

# label each point
for label, f_count, m_count in zip(labels, friends, minutes):
    plt.annotate(label, xy=(f_count, m_count))

plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
In a scatter plot matrix (pair plot), each variable is paired up with each of the
remaining variables, and a scatter plot is drawn for each pair.

Ex: Take the mtcars dataset and examine the correlation between each variable and
the other variables.
3.8 Heat Maps
• A heat map is a graphical representation of data where the individual values
contained in a matrix are represented as colors, i.e. a colored representation of a
matrix.
• It is a great tool for visualizing data across a surface.
• It is often used to see the correlation between columns of a dataset, using a
darker color for columns having a high correlation.

Ex: The figure below shows the analysis of the same data using bar charts and heat maps.
It is easy to understand the relationships in the data by using heatmaps.
Matplotlib heatmaps can be created using functions such as imshow()

Syntax:
imshow(X, cmap, alpha)

X :- the input data matrix which is to be displayed
cmap :- optional parameter - the colormap used to display the heatmap;
the default is the Viridis colour palette
alpha :- optional parameter - specifies the transparency of the heatmap
Method 1: Using matplotlib.pyplot.imshow() Function
Program:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.random((12, 12))
plt.imshow(data)

plt.title("2-D Heat Map")
plt.show()
Adding Colorbar to Heatmap Using Matplotlib:

Add a colorbar to the heatmap using plt.colorbar(). colorbar shows the weight of
color relatively between a certain range.

Program:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.random((12, 12))


plt.imshow(data, cmap='autumn')

# Add colorbar
plt.colorbar()

plt.title("Heatmap with color bar")


plt.show()
Method 2: Using the Seaborn Library: Seaborn
has a function called seaborn.heatmap() to
create heatmaps

Program:

import numpy as np
import seaborn as sns
import matplotlib.pylab as plt

data_set = np.random.rand(10, 10)

ax = sns.heatmap(data_set, linewidth=0.5, cmap='coolwarm')
plt.title("2-D Heat Map")

plt.show()
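To tie heatmaps back to correlation (a sketch, not from the slides, with made-up columns): a correlation matrix from np.corrcoef() can be shown with imshow(), so strongly related columns stand out:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# three made-up columns: b is strongly related to a, c is independent noise
a = rng.normal(size=100)
b = a + 0.1 * rng.normal(size=100)
c = rng.normal(size=100)

corr = np.corrcoef([a, b, c])     # 3x3 correlation matrix

plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar()
plt.xticks(range(3), ['a', 'b', 'c'])
plt.yticks(range(3), ['a', 'b', 'c'])
plt.title("Correlation heatmap")
plt.show()
```

The a-b cell comes out near 1 (dark), while the cells involving c stay near 0, which is the "darker color for high correlation" idea described above.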
Hypothesis testing
3.9 Simple Hypothesis testing
Hypothesis
 A parameter can be estimated from sample data.
 Instead of estimating the value of a parameter, we often need to decide whether
to accept or reject a statement about the parameter.
 This statement is called a Hypothesis.
 The decision-making procedure about the hypothesis is called Hypothesis
testing.
 It is also called significance testing, and is used to test a claim about a
parameter using evidence (data in a sample).

Statistical hypothesis
Def : An assumption about population parameters, made on the basis of sample
information in order to take a decision about the population.
• A statistical hypothesis may or may not be true.
Test of Hypothesis

The procedure which enables us to decide on the basis of sample results whether a
hypothesis is true or not is called a Test of Hypothesis.

 Many statistical analyses are concerned with testing hypotheses.

 There are methods for testing some simple hypotheses using standard and classic
tests.

Student's t-test

 It is a method for comparing two samples.
 It looks at the means to determine if the samples are different.
 This is a parametric test.
 The data should be normally distributed.
 R can handle this using the t.test() command.
 It can deal with two- and one-sample tests as well as paired tests.

Different t-Tests:

1. Two-Sample t-Test with Unequal variance


2. Two-Sample t-Test with equal variance
3. One Sample t-test
4. Using directional hypotheses
Options available in the t.test() command
1) t.test(data1,data2)
Apply t-test to compare two vectors of numerical data.
2) mu = 0
If a one-sample test is carried out, mu indicates the mean against which the
sample should be tested.
3) alternative = "two.sided"
Sets the alternative hypothesis. The default is "two.sided" but you can
specify "greater" or "less".
4) conf.level = 0.95
Sets the confidence level of the interval (default = 0.95).
5) paired = FALSE
If set to TRUE, a matched pair t-test is carried out.
6) var.equal = FALSE
If var.equal is set to TRUE, the variances are considered to be equal and the
standard test is carried out.
If var.equal is set to FALSE, the variances are considered to be unequal and
the Welch two-sample test is carried out.
1) Two-Sample t-Test with Unequal variance

 Use the t.test() command to compare two vectors of numerical values.

 The default form of t.test() assumes that the samples have unequal variance.

Ex : t.test(data2, data3)

Welch Two Sample t-test

data: data2 and data3
t = -2.8151, df = 24.564, p-value = 0.009462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.5366789 -0.5466544
sample estimates: mean of x mean of y
                   5.125000  7.166667
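The slides use R's t.test(); the same Welch test is available in Python via scipy.stats.ttest_ind with equal_var=False. The vectors below are not listed in the slides; they follow the examples in Gardener's book and reproduce the quoted means and t value:

```python
from scipy import stats

# assumed sample vectors, consistent with the quoted output
# (mean of data2 = 5.125, mean of data3 = 7.1667)
data2 = [3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4]
data3 = [6, 7, 8, 7, 6, 3, 8, 9, 10, 7, 6, 9]

# equal_var=False gives the Welch version, matching R's default t.test()
t, p = stats.ttest_ind(data2, data3, equal_var=False)
print(t, p)
```

Passing equal_var=True instead reproduces the pooled-variance test of the next subsection.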
2) Two-Sample t-Test with equal variance

 The t.test() command to compare two vectors of numerical values with equal
variance by adding the var.equal = TRUE

Ex : t.test(data2, data3, var.equal = TRUE)

Two Sample t-test

data: data2 and data3
t = -2.7908, df = 26, p-value = 0.009718
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -3.5454233 -0.5379101
sample estimates: mean of x mean of y
                   5.125000  7.166667

The calculation of the t-value uses pooled variance and the degrees of freedom are
unmodified; as a result, the p-value is slightly different from the Welch version.
3) One Sample t-test

 For one-sample t-testing, supply the name of a single vector and the mean to
compare it to (this defaults to 0):

Ex: t.test(data2, mu = 5)

One Sample t-test

data: data2
t = 0.2548, df = 15, p-value = 0.8023
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval: 4.079448 6.170552
sample estimates: mean of x 5.125
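A Python equivalent (a sketch; data2 is the same assumed vector as above, not listed in the slides) uses scipy.stats.ttest_1samp:

```python
from scipy import stats

# assumed sample vector with mean 5.125, as in the quoted output
data2 = [3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4]

# test H0: true mean equals 5 (two-sided by default)
t, p = stats.ttest_1samp(data2, popmean=5)
print(t, p)
```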
4) Using directional hypotheses

 It is possible to specify a "direction" for the hypothesis.

 Simple testing checks whether the means of two samples are different.

 To ask whether a sample mean is lower than another sample mean (or greater),
use the alternative = instruction to switch from a two-sided test (the default)
to a one-sided test; use "less" or "greater" (abbreviations are allowed).

Ex: t.test(data2, mu = 5, alternative = 'greater')

One Sample t-test

data: data2
t = 0.2548, df = 15, p-value = 0.4012
alternative hypothesis: true mean is greater than 5
95 percent confidence interval: 4.265067 Inf
sample estimates: mean of x 5.125
Formula Syntax and Subsetting Samples in the t-test
The t-test is designed to compare two samples (or one sample with a “standard”).

> grass
rich graze
1 12 mow
2 15 mow
3 17 mow
4 11 mow
5 15 mow
6 8 unmow
7 9 unmow
8 7 unmow
9 9 unmow
 R makes it possible to set out the data in a more sensible and flexible way using
"formula syntax".

 Create a formula using the tilde (~) symbol. The response variable goes on
the left of the ~ and the predictor goes on the right.

Ex: > t.test(rich ~ graze, data = grass)
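In Python (a sketch, not from the slides), the same comparison can be made by splitting the grass data on the grouping column and running a Welch test:

```python
import pandas as pd
from scipy import stats

# the grass data shown above
grass = pd.DataFrame({
    'rich':  [12, 15, 17, 11, 15, 8, 9, 7, 9],
    'graze': ['mow', 'mow', 'mow', 'mow', 'mow',
              'unmow', 'unmow', 'unmow', 'unmow'],
})

mow = grass.loc[grass['graze'] == 'mow', 'rich']
unmow = grass.loc[grass['graze'] == 'unmow', 'rich']

# Welch test, matching R's default t.test(rich ~ graze, data = grass)
t, p = stats.ttest_ind(mow, unmow, equal_var=False)
print(t, p)
```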


Subsetting Samples in the t-test

 If the predictor column contains more than two items, the t-test cannot be used.

 However, you can still carry out a test by subsetting this predictor column and
specifying which two samples you want to compare.

 Use the subset = instruction as part of the t.test() command.

Ex: t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))

 First specify which column you want to take your subset from (graze in this case).

 Then type %in%; this tells the command that the list that follows is contained in
the graze column. Put the levels in quotes.

 Here "mow" and "unmow" are compared.

U-Test
 To compare non-parametric data use the U-test.
 Other names for the U-test are the Mann-Whitney U-test and the Wilcoxon rank-sum test.
 Use the wilcox.test() command to perform an analysis similar to t.test().
 The wilcox.test() command can conduct two-sample or one-sample tests.

Basic two-sample U-test on the numerical vectors to compare:

data1
3 5 7 5 3 2 6 8 5 6 9
data2
3 5 7 5 3 2 6 8 5 6 9 4 5 7 3 4

wilcox.test(data1, data2)

Wilcoxon rank sum test with continuity correction

data: data1 and data2
W = 94.5, p-value = 0.7639
alternative hypothesis: true location shift is not equal to 0

Warning message: In wilcox.test.default(data1, data2) : cannot compute exact
p-value with ties

By default the confidence intervals are not calculated and the p-value is
adjusted using the "continuity correction".
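The Python counterpart (a sketch, not from the slides) is scipy.stats.mannwhitneyu, which reproduces the W statistic and the continuity-corrected p-value:

```python
from scipy import stats

data1 = [3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9]
data2 = [3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4]

# asymptotic normal approximation with continuity correction,
# the same approach R falls back to when ties are present
u, p = stats.mannwhitneyu(data1, data2, alternative='two-sided',
                          method='asymptotic')
print(u, p)
```

The U statistic for data1 equals R's W here (94.5).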
Paired t- and U-Tests

 To test paired data, use the matched-pair versions of the t-test and the U-test
with a simple extra instruction: paired = TRUE.

 It does not matter if the data are in two separate sample columns or are
represented as response and predictor.

Ex: t.test(count ~ trap, data = mpd.s, paired = TRUE, mu = 1, conf.level = 0.99)

wilcox.test(mpd$white, mpd$yellow, exact = FALSE, paired = TRUE)


3.10 CORRELATION AND COVARIANCE

 Correlation finds the relation between two continuous variables.

 In R the cor() command determines correlations between two vectors, all the
columns of a data frame (or matrix), or two data frames (or matrix objects).

 The cov() command examines covariance.

 By default the Pearson product moment (the regular parametric correlation) is used.

 We can explicitly specify other, non-parametric, correlation methods:

• Spearman (rho)
• Kendall (tau)

 The cor.test() command carries out a test of significance of the correlation.


Instructions available to these commands are:
1. cor(x, y = NULL)
Carries out a basic correlation between x and y. If x is a matrix or data frame, y
can be omitted.
2. cov(x, y = NULL)
Determines covariance between x and y. If x is a matrix or data frame, y can be
omitted.
3. method =
The default is "pearson", but "spearman" or "kendall" can be specified as the
method for correlation or covariance.
4. var(x, y = NULL)
Determines the variance of x. If x is a matrix or data frame or y is specified, the
covariance is also determined.
5. alternative = "two.sided"
The default is for a two-sided test but the alternative hypothesis can be given as
"two.sided", "greater", or "less"; abbreviations are permitted.
6. cov2cor(V)
Takes a covariance matrix V and calculates the correlations.
3.10.1 Simple correlation
Simple correlations are between two continuous variables and you can use the cor()
command to obtain a correlation coefficient.
Ex: > count = c(9, 25, 15, 2, 14, 25, 24, 47)
> speed = c(2, 3, 5, 9, 14, 24, 29, 34)

> cor(count, speed)
[1] 0.7237206

Pearson correlation is the default in R; specify other correlations using the
method = instruction.
> cor(count, speed, method = 'spearman')
[1] 0.5269556
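The same coefficients can be reproduced in Python (a sketch, not from the slides) with numpy.corrcoef and scipy.stats.spearmanr:

```python
import numpy as np
from scipy import stats

count = [9, 25, 15, 2, 14, 25, 24, 47]
speed = [2, 3, 5, 9, 14, 24, 29, 34]

# Pearson product-moment correlation
pearson_r = np.corrcoef(count, speed)[0, 1]

# Spearman's rank correlation (the tied 25s get average ranks)
spearman_rho, p_value = stats.spearmanr(count, speed)

print(round(pearson_r, 7), round(spearman_rho, 7))
```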
As another example, look at the women data frame; this comes as example data
with the R distribution.
> data(women)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
 Use attach() or with() commands to allow R to “read inside” the data frame and
access the variables within and also use the $ syntax to access the variables.

Ex : cor(women$height, women$weight)
[1] 0.9954948

cor() command has calculated the Pearson correlation coefficient between the height
and weight variables contained in the women data frame.

 Directly use cor() command on a data frame (or matrix). The correlation matrix
shows all combinations of the variables in the data frame.

cor(women)

height weight

height 1.0000000 0.9954948


weight 0.9954948 1.0000000
Another example contains five columns of the data

> head(mf)

Length Speed Algae NO3 BOD

1 20 12 40 2.25 200
2 21 14 45 2.15 180
3 22 12 45 1.75 135
4 23 16 80 1.95 120
5 21 20 75 1.95 110
6 20 21 65 2.75 120

Task : select a single variable and compare it to all the others

cor(mf$Length, mf)

Length Speed Algae NO3 BOD
[1,] 1 -0.3432297 0.7650757 0.4547609 -0.8055507
3.10.2 Covariance

The cov() command uses syntax similar to the cor() command to examine covariance.

The women data are used with the cov() command in the following example:
Ex : > cov(women$height, women$weight)

[1] 69
 Directly use the cov() command on a data frame (or matrix)
Ex : > cov(women)

height weight
height 20 69.0000
weight 69 240.2095

The cov2cor() command is used to determine the correlations from a covariance
matrix.
Ex: > women.cv = cov(women)
> cov2cor(women.cv)
height weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
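The same numbers fall out of NumPy (a sketch, not from the slides). The vectors below are R's built-in women dataset, whose weight column is abbreviated with "..." above:

```python
import numpy as np

# R's built-in "women" dataset
height = [58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]
weight = [115, 117, 120, 123, 126, 129, 132, 135, 139, 142,
          146, 150, 154, 159, 164]

# sample covariance matrix (np.cov uses n-1 denominators, like R)
cv = np.cov(height, weight)

# scale the covariance matrix to a correlation matrix, like R's cov2cor()
d = np.sqrt(np.diag(cv))
corr = cv / np.outer(d, d)

print(cv[0, 1], corr[0, 1])
```

cv[0, 1] gives the covariance of 69 and corr[0, 1] the correlation of 0.9954948 quoted above.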
3.10.3 Significance testing in correlation tests

Apply a significance test to correlations using the cor.test() command to compare
only two vectors at a time.

cor.test(women$height, women$weight)

Output:

Pearson's product-moment correlation

data: women$height and women$weight
t = 37.8553, df = 13, p-value = 1.088e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.9860970 0.9985447
sample estimates: cor 0.9954948

Here the Pearson correlation has been carried out between height and weight in the
women data and the result also shows the statistical significance of the correlation.
3.10.4 Formula Syntax

If the data are contained in a data frame, using the attach() or with() commands is
easier than the $ syntax.

A formula syntax is available as an alternative, used to test for association between
paired samples.

> data(cars)
> cor.test(~ speed + dist, data = cars, method = 'spearman', exact = F)

Spearman's rank correlation rho

data: speed and dist
S = 3532.819, p-value = 8.825e-14
alternative hypothesis: true rho is not equal to 0
sample estimates: rho 0.8303568

This examines the cars data, which is built into R.

In the formula, both variables are specified to the right of the ~.

It is possible to give the name of the data as a separate instruction.

If the data contain a separate grouping column, you can specify the samples to use
from it using an instruction along the following lines:

subset = grouping %in% "sample"


3.11 TESTS FOR ASSOCIATION
If the dataset have categorical data to look for associations between categories
use the chisquared test. For this R uses the chisq.test() command.

The additional instructions to the basic command are

1. chisq.test(x, y = NULL)
A basic chi-squared test is carried out on a matrix or data frame . If x is provided
as a vector, a second vector can be supplied . If x is a single vector and y is not
given, a goodness of fit test is carried out .

2. correct = TRUE
If the data form a 2 x 2 contingency table, Yates’ correction is applied.
3. p=
A vector of probabilities for use with a goodness of fit test. If p is not given, the
goodness of fit test uses equal probabilities for all categories.

4. rescale.p = FALSE
If TRUE, p is rescaled to sum to 1. For use with goodness of fit tests.

5. simulate.p.value = FALSE
If set to TRUE, a Monte Carlo simulation is used to calculate p-values.
B = 2000 (B specifies the number of replicates used in the Monte Carlo test)
3.11.1 Multiple Categories: Chi-Squared Tests

The most common use for a chi-squared test is where you have multiple categories
and want to see if associations exist between them.

In the following example some categorical data set out in a data frame bird.df.

> bird.df

Garden Hedgerow Parkland Pasture Woodland

Blackbird 47 10 40 2 2
Chaffinch 19 3 5 0 2
Great Tit 50 0 10 7 0
House Sparrow 46 16 8 4 0
Robin 9 3 0 0 2
Song Thrush 4 0 6 0 0

In the data each cell represents a unique combination of the two categories and have
several habitats and several species.
Run the chisq.test() command simply by giving the name of the data to the command:

> bird.cs = chisq.test(bird.df)

Warning message: In chisq.test(bird.df) : Chi-squared approximation may be incorrect

To examine the result in more detail, give it a name so that it is stored as a new object.

> bird.cs

Pearson's Chi-squared test


data: bird.df X-squared = 78.2736, df = 20, p-value = 7.694e-09

 In this example you get a warning message because of some small values in the observed
data; the expected values will probably include some that are smaller than 5.

When you issue the name of the result object you see a very brief result that contains
the salient points.
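The result object holds more than the printed summary; its components can be inspected by name (a sketch, using the bird.cs object created above):

```r
# Components stored in the chi-squared result object
names(bird.cs)      # list the available components
bird.cs$expected    # the expected values (check for any below 5)
bird.cs$observed    # the original observed counts
bird.cs$residuals   # Pearson residuals: (observed - expected) / sqrt(expected)
```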
3.11.2 Monte Carlo Simulation

Use Monte Carlo simulation to determine the p-value by adding an extra
instruction to the chisq.test() command, simulate.p.value = TRUE, like so:

 chisq.test(bird.df, simulate.p.value = TRUE, B = 2500)

Pearson's Chi-squared test with simulated p-value (based on 2500 replicates)

data: bird.df X-squared = 78.2736, df = NA, p-value = 0.0003998

The default is that simulate.p.value = FALSE and that B = 2000. The latter is the number
of replicates to use in the Monte Carlo test, which is set to 2500 for this example.
3.11.3 Yates’ Correction for 2 x 2 Tables

In statistics, a contingency table is a type of table in a matrix format that displays the
frequency distribution of the variables: one variable in rows and another in columns,
used to study the association between the two variables.
For a 2 x 2 contingency table it is common to apply Yates’ correction.
By default this is used if the contingency table has two rows and two columns.
You can turn off the correction using the correct = FALSE instruction in the
command.
> nd
Urt.dio.y Urt.dio.n
Rum.obt.y 96 41
Rum.obt.n 26 57

> chisq.test(nd)
Pearson's Chi-squared test with Yates' continuity correction
data: nd X-squared = 29.8653, df = 1, p-value = 4.631e-08
> chisq.test(nd, correct = FALSE)

Pearson's Chi-squared test

data: nd X-squared = 31.4143, df = 1, p-value = 2.084e-08

• At the top you see the data, and when you run the chisq.test() command you see that Yates’
correction is applied automatically.
• In the second example you force the command not to apply the correction by
setting correct = FALSE.
• Yates’ correction is applied only when the matrix is 2 x 2, and even if you tell R to
apply the correction explicitly it will do so only if the table is 2 x 2.
Goodness of Fit Tests
• The goodness of fit test is used to test whether sample data fit a distribution from a certain
population (e.g. a population with a normal distribution).
• In other words, it tells if your sample data belongs to a certain population.
• Goodness of fit tests commonly used in statistics are the chi-square, Kolmogorov-Smirnov,
Anderson-Darling, and Shapiro-Wilk tests.

Chi Square Goodness of Fit Test


• Chi-square test requires a sufficient sample size in order for the chi-square approximation
to be valid.
• There is another type of chi-square test, called the chi-square test for independence. The
two are sometimes confused but they are quite different.
• The chi-square test for independence compares two sets of data to see if there is a
relationship.
• The chi-square goodness of fit test checks whether one categorical variable fits a specified distribution.
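The chi-square statistic itself is the sum of (observed − expected)² / expected over the categories, and can be computed by hand to match the chisq.test() result; a sketch with made-up counts:

```r
obs <- c(26, 14, 10)       # hypothetical observed counts
p   <- c(0.5, 0.3, 0.2)    # expected proportions (sum to 1)
exp <- sum(obs) * p        # expected counts under those proportions

# Chi-squared statistic computed by hand
X2 <- sum((obs - exp)^2 / exp)
X2

# The built-in test reports the same statistic
chisq.test(obs, p = p)$statistic
```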
3.11.4 Single category: goodness of Fit tests

 In general use the chisq.test() command to carry out a goodness of fit test.

 The data should contain two vectors of numerical values, one representing the
observed values and the other representing the expected ratio of values.

 The goodness of fit tests the data against the ratios (probabilities) you specified.

 If you do not specify any, the data are tested against equal probability.

Consider a data frame containing two columns: the first column contains values relating to an old
survey, and the second column contains values relating to a new survey. You want to see
if the proportions of the new survey match the old one, so you perform a goodness of
fit test:
> survey
old new
woody 23 19
shrubby 34 30
tall 132 111
short 98 101
grassy 45 52
mossy 53 26
To run the test use the chisq.test() command, but you need to specify the test data as a
single vector and also point to the vector that contains the probabilities:

> survey.cs = chisq.test(survey$new, p = survey$old, rescale.p = TRUE)

> survey.cs

Chi-squared test for given probabilities


data: survey$new X-squared = 15.8389, df = 5, p-value = 0.00732

In this example you did not have the probabilities as true probabilities but as
frequencies; you use the rescale.p = TRUE instruction to make sure that these are
converted to probabilities (this instruction is set to FALSE by default).
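The rescaling that rescale.p = TRUE performs can also be done by hand: divide the frequencies by their total so they sum to 1. A sketch using the same survey data:

```r
# Convert the old-survey frequencies to probabilities manually
p.manual <- survey$old / sum(survey$old)
sum(p.manual)    # now sums to 1

# This gives the same result as using rescale.p = TRUE
chisq.test(survey$new, p = p.manual)
```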

The result contains all the usual items for a chi-squared result object, but if you
display the expected values, for example, you do not automatically get to see the
row names, even though they are present in the data:

> survey.cs$exp

[1] 20.25195 29.93766 116.22857 86.29091 39.62338 46.66753


To get the row names from the original data, use the row.names() command:

> names(survey.cs$expected) = row.names(survey)
> survey.cs$exp

woody shrubby tall short grassy mossy


20.25195 29.93766 116.22857 86.29091 39.62338 46.66753
