Data Science UNIT 3
Reference Books
Joel Grus, “Data Science from Scratch”, O’Reilly.
Mark Gardener, “Beginning R: The Statistical Programming Language”, Wiley.
Graphical Analysis
3.1 Visualizing data : matplotlib
• A wide variety of tools exist for visualization. matplotlib is a Python library
that helps in visualizing and analyzing data for better understanding. It is
widely used for producing simple bar charts, line charts, and scatterplots
through the matplotlib.pyplot module.
• For saving a graph use the savefig() function; to display it use the show() function.
• Most of the matplotlib utilities lie under the pyplot submodule, and are
usually imported under the plt alias:
import matplotlib.pyplot as plt
Functions in matplotlib
Program 1:
# x axis values
x = [2, 4, 6]
# corresponding y axis values
y = [5, 2, 7]
# plot a line chart
plt.plot(x, y, linestyle='solid')
# add a title
plt.title("Nominal GDP")
# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()
Types of Plots:
The most commonly used plots in Matplotlib are:
1. Bar Chart
2. Histogram
3. Frequency polygon
4. Quartiles
5. Box-plot
6. Scatter Plot
7. Heat maps
3.2 Bar Chart
1) Bar Chart:
A bar chart is a graph that represents categories of data with rectangular bars. A bar chart
shows comparisons between discrete categories. One axis of the plot represents
the specific categories being compared, while the other axis represents the measured values
corresponding to those categories. The bar() function is used to plot a bar chart.
Syntax:
bar(x, height, width, bottom, align)
The function creates a bar plot with rectangular bars according to the given parameters.
x : x coordinates of the bars.
height : heights of the bars.
width : widths of the bars (default 0.8).
bottom : y coordinates of the bases of the bars (default 0).
align : alignment of the bars to the x coordinates, 'center' or 'edge'.
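No bar chart program appears in this section, so here is a minimal sketch; the movie titles and award counts are made-up illustrative data:

```python
import matplotlib.pyplot as plt

# Made-up categories and measured values (illustrative only)
movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi"]
num_oscars = [4, 11, 3, 8]

# bar(x, height): one rectangular bar per category
plt.bar(movies, num_oscars, width=0.8, align='center')
plt.title("My Favourite Movies")
plt.ylabel("# of Academy Awards")
plt.show()
```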
3.3 Histogram
A histogram represents the distribution of numerical data: the values are grouped
into ranges (bins) and a bar is drawn for each bin. The hist() function is used to
plot a histogram.
Syntax:
hist(x, bins, range, align, color, label)
Program:
import numpy as np
import matplotlib.pyplot as plt
# Creating dataset
a = np.array([5, 5, 6, 7, 8, 10, 15, 15, 18, 20, 25, 27, 30, 50, 55,
              65, 65, 68, 88, 89, 90, 92, 94, 96, 99])
# Creating histogram
plt.hist(a, bins=[0, 20, 40, 60, 80, 100])
plt.show()
Differences between Bar chart and Histogram
Bar graphs are usually used to display "categorical data", that is, data that fits
into categories.
Histograms, on the other hand, are usually used to present "continuous data",
that is, data representing a measured quantity where the numbers can take on
any value in a certain range. A good example is weight.
A frequency polygon shows the overall distribution of a data set. It looks like a line graph,
but the points on the graph can be plotted using data from a histogram or a frequency
table.
Frequency polygons are especially useful for comparing two data sets
Frequency polygon plotted using data from a histogram
Frequency polygon plotted using data from the Frequency table
Frequency polygons are very similar to histograms, except histograms have bars,
and frequency polygons have dots and lines connecting the frequencies of each
class interval.
To construct a frequency polygon, start by choosing a class interval. Then draw an
X-axis representing the values of the scores in your data. Mark the middle of each
class interval with a tick mark, and label it with the middle value represented by
the class.
Draw the Y-axis to indicate the frequency of each class. Place a point in the middle
of each class interval at the height corresponding to its frequency.
Finally, connect the points. You should include one class interval below the lowest
value in your data and one above the highest value.
It is easy to discern the shape of the distribution from the figure. Most of the scores
are between 65 and 115. It is clear that the distribution is not symmetric, inasmuch
as good scores (to the right) trail off more gradually than poor scores (to the
left). The distribution is skewed.
Cumulative Frequency polygon
A cumulative frequency polygon for the same test scores is shown in the figure. The
graph is the same as before except that the Y value for each point is the number
of students in the corresponding class interval plus all numbers in lower
intervals.
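The construction steps above can be sketched in matplotlib; the test scores and class intervals below are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical test scores and class intervals (illustrative only)
scores = np.array([45, 52, 58, 61, 64, 67, 69, 71, 74, 75,
                   78, 81, 83, 85, 88, 91, 95, 102, 108, 114])
edges = np.array([40, 55, 70, 85, 100, 115])

# Bin the scores as a histogram would, then find the class midpoints
counts, _ = np.histogram(scores, bins=edges)
midpoints = (edges[:-1] + edges[1:]) / 2

# Frequency polygon: a point at each midpoint, joined by lines
plt.plot(midpoints, counts, marker='o', label='Frequency polygon')

# Cumulative frequency polygon: each Y value adds all lower intervals
plt.plot(midpoints, np.cumsum(counts), marker='s', label='Cumulative')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```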
Uses of Frequency polygon
1) To understand the shape of a distribution
2) To compare two or more data sets
3.5 Box-plot
A box plot (box-and-whisker plot) summarizes a data set through its minimum, lower
quartile, median, upper quartile, and maximum. The boxplot() function is used to
draw it. Its optional parameters include:
vert : accepts boolean values False and True for a horizontal or vertical
plot respectively
widths : accepts an array and sets the width of the boxes
Program:
import matplotlib.pyplot as plt
# Creating dataset
data = [1,3,3,4,5,6,6,7,8,8]
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
Program:
import matplotlib.pyplot as plt
# Creating dataset
data = [1,3,3,4,5,6,6,7,8,8]
# Creating plot
plt.boxplot(data, notch=True)
# show plot
plt.show()
Uses of Boxplots
1) To compare and analyze two data sets
2) To identify outliers
Program:
import matplotlib.pyplot as plt
# Creating datasets
data1 = [1,3,3,4,5,6,6,7,8,8]
data2 = [2,4,4,5,7,7,8,8,8,9]
# Creating plot: one horizontal box plot per dataset
plt.boxplot([data1, data2], vert = False)
# show plot
plt.show()
It is also useful in comparing the distribution of data across data sets by drawing
box plots for each of them.
Ex: box-and-whisker plots below represent quiz scores out of 25 points for Quiz
1 and Quiz 2 for the same class
These box-and-whisker plots show that the lowest score, highest score,
and Q3 are all the same for both exams, so performance on the two exams was
quite similar.
However, the movement of Q1 up from a score of 6 to a score of 9 indicates that
there was an overall improvement.
On the first test, approximately 75% of the students scored at or above a score of
6. On the second test, the same number of students (75%) scored at or above a
score of 9.
2) To Identify outliers
Outliers: In a box plot, an outlier is a data point located outside the whiskers
of the box plot, that is, more than 1.5 times the interquartile range above the upper
quartile or below the lower quartile.
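The 1.5 × IQR rule can be reproduced directly. This sketch uses the same data as the program that follows, with NumPy's default linear-interpolation quartiles (also what matplotlib uses):

```python
import numpy as np

# Same dataset as the outlier boxplot program in this section
data = [52, 57, 57, 58, 63, 66, 66, 67, 67, 68, 69, 70, 70, 70, 70,
        72, 73, 75, 75, 76, 76, 78, 79, 89]

# Lower and upper quartiles, then the interquartile range
q1, q3 = np.quantile(data, [0.25, 0.75])
iqr = q3 - q1

# Whisker fences: 1.5 * IQR beyond each quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as the round outlier circles
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # → [52, 89]
```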
Program:
# with outliers
import matplotlib.pyplot as plt
# Creating dataset
data = [52, 57, 57, 58, 63, 66, 66,
67, 67, 68, 69, 70, 70, 70, 70, 72,
73, 75, 75, 76, 76, 78, 79, 89]
# Creating plot
plt.boxplot(data)
# show plot
plt.show()
The round circles represent outliers. In this case the maximum and minimum values are outliers.
3.6 Quartiles
• Quartiles are measures of variation, which describe how spread out the data is.
• Quartiles are a type of quantile: values that separate the data into four
equal parts (quarters).
• The first quartile, or lower quartile, is the value that cuts off the first 25% of
the data when it is sorted in ascending order.
• The second quartile, or median, is the value that cuts off the first 50%.
• The third quartile, or upper quartile, is the value that cuts off the first 75%.
Procedure:
Arrange the data in ascending order, then find the median or middle value.
The median splits the data values into two halves. Then find the middle
value of each of these two halves.
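The median-splitting procedure can be sketched with Python's statistics module, using the same values as the NumPy example in this section:

```python
import statistics

# Same values as the NumPy quantile() example
values = sorted([13, 21, 21, 40, 42, 48, 55, 72])

# Step 1: the median splits the data into two halves
median = statistics.median(values)       # (40 + 42) / 2 = 41.0

# Step 2: the median of each half gives the lower/upper quartile
half = len(values) // 2
q1 = statistics.median(values[:half])    # median of [13, 21, 21, 40]
q3 = statistics.median(values[half:])    # median of [42, 48, 55, 72]
print(q1, median, q3)  # → 21.0 41.0 51.5
```

Note that this convention gives Q3 = 51.5, while NumPy's interpolation method reports 49.75; different quartile conventions can disagree slightly on small samples.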
Example:
With Python, use the NumPy quantile() method to find the quartiles.
Syntax:
quantile(x, q)
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [0, 0.25, 0.5, 0.75, 1])
print(x)
Output:
[13. 21. 41. 49.75 72. ]
import numpy
values = [13,21,21,40,42,48,55,72]
x = numpy.quantile(values, [.25])
print(x)
Output:
[21.]
3.7 Scatter Plot
• A scatterplot is used for visualizing the relationship between two paired sets of
data.
• Scatter plots are widely used to represent relation among variables and how change
in one affects the other.
• Data are displayed as a collection of points, each having the value of one variable
determining the position on the horizontal axis and the value of the other variable
determining the position on the vertical axis.
• A scatter plot can suggest various kinds of correlations between variables with a
certain confidence interval.
Ex: weight and height, weight would be on y axis and height would be on the x
axis.
• Correlations may be positive (rising), negative (falling), or null (uncorrelated).
• If the pattern of dots slopes from lower left to upper right, it indicates a
positive correlation between the variables being studied. If the pattern of dots
slopes from upper left to lower right, it indicates a negative correlation.
• A line of best fit (alternatively called 'trendline') can be drawn in order to study
the relationship between the variables.
• The scatter() method in the matplotlib library is used to draw a scatter plot.
Program:
import matplotlib.pyplot as plt
import numpy as np
height = np.array([99,86,88,111,132,103,125,94,78,77,129,116])
weight = np.array([65,57,59,69,77,72,75,63,45,43,81,71])
plt.scatter(height, weight)
plt.xlabel('Height')
plt.ylabel('Weight')
plt.title('Relationship between Height & Weight')
plt.show()
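A line of best fit (trendline) can be added to a scatter plot like the one above. This sketch fits it with np.polyfit, using the same height and weight arrays:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same arrays as the scatter plot program above
height = np.array([99, 86, 88, 111, 132, 103, 125, 94, 78, 77, 129, 116])
weight = np.array([65, 57, 59, 69, 77, 72, 75, 63, 45, 43, 81, 71])

# Fit a degree-1 polynomial: the least-squares line of best fit
slope, intercept = np.polyfit(height, weight, 1)

plt.scatter(height, weight)
plt.plot(height, slope * height + intercept, color='red')  # trendline
plt.xlabel('Height')
plt.ylabel('Weight')
plt.show()
```

The rising trendline (positive slope) reflects the positive correlation between the two variables.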
Program:
Illustrates the relationship between the number of friends the users have and the number of minutes
they spend on the site every day:
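The program itself does not appear in these notes; a minimal sketch of such a labeled scatter plot, with entirely made-up user data, might look like this:

```python
import matplotlib.pyplot as plt

# Made-up users: friend counts and daily minutes are illustrative only
friends = [70, 65, 72, 63, 71, 64, 60, 64, 67]
minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190]
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

plt.scatter(friends, minutes)
# label each point with its user
for label, x, y in zip(labels, friends, minutes):
    plt.annotate(label, xy=(x, y), xytext=(5, -5),
                 textcoords='offset points')
plt.title("Daily Minutes vs. Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes spent on the site")
plt.show()
```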
Ex: Take the mtcars dataset and find the correlation between one variable and
the other variables.
3.8 Heat Maps
• A heat map is a graphical representation of data where the individual values
contained in a matrix are represented as colors, i.e., a colored representation of a
matrix.
• It is a great tool for visualizing data across a surface.
• It is used to see the correlation between columns of a dataset, where a darker
color can be used for columns having a high correlation.
Ex: Below figure shows the analysis of the data by using bar charts and heat maps. It
is easy to understand the relationship between data by using heatmaps.
Matplotlib Heatmaps can be created using functions such as imshow()
Syntax:
imshow(X, cmap, alpha)
Add a colorbar to the heatmap using plt.colorbar(). The colorbar shows the relative
weight of the colors within a certain range.
Program:
import numpy as np
import matplotlib.pyplot as plt
# Creating a random dataset (values between 0 and 1)
data = np.random.random((10, 10))
# Draw the heatmap
plt.imshow(data)
# Add colorbar
plt.colorbar()
plt.show()
Program:
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
data_set = np.random.rand(10, 10)
ax = sns.heatmap(data_set, linewidth=0.5, cmap='coolwarm')
plt.title("2-D Heat Map")
plt.show()
Hypothesis testing
3.9 Simple Hypothesis testing
Hypothesis
A parameter can be estimated from sample data.
Instead of estimating the value of a parameter, we may need to decide whether to
accept or reject a statement about the parameter.
This statement is called a hypothesis.
The decision-making procedure about the hypothesis is called hypothesis
testing.
It is also called significance testing, and it is used to test a claim about a
parameter using evidence (data in a sample).
Statistical hypothesis
Def: A statistical hypothesis is an assumption about population parameters, made
on the basis of sample information in order to take a decision about the population.
• A statistical hypothesis may or may not be true.
Test of Hypothesis
The procedure which enables us to decide on the basis of sample results whether a
hypothesis is true or not is called a test of hypothesis.
There are methods for testing some simple hypothesis using standard and classic
tests.
Student’s t-test
Different t-Tests:
1) Two-Sample t-Test with unequal variance
The default form of t.test() assumes that the samples have unequal variance, and
carries out the Welch two-sample t-test.
Ex: t.test(data2, data3)
Welch Two Sample t-test
data: data2 and data3
t = -2.8151, df = 24.564, p-value = 0.009462
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.5366789 -0.5466544
sample estimates:
mean of x mean of y
 5.125000  7.166667
2) Two-Sample t-Test with equal variance
The t.test() command can compare two vectors of numerical values assuming equal
variance by adding the var.equal = TRUE instruction.
The calculation of the t-value uses pooled variance and the degrees of freedom are
unmodified; as a result, the p-value is slightly different from the Welch version
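The difference between the two calculations can be reproduced outside R. This Python sketch computes both t statistics by hand on hypothetical samples (they merely stand in for data2 and data3, so the printed values will not match the R output above):

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical samples standing in for data2 and data3
a = [3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9, 4, 5, 7, 3, 4]
b = [6, 7, 8, 7, 6, 6, 8, 9, 10, 6, 9, 4]

na, nb = len(a), len(b)
va, vb = variance(a), variance(b)

# Welch version: each sample contributes its own variance
t_welch = (mean(a) - mean(b)) / sqrt(va / na + vb / nb)

# Equal-variance version: pooled variance, df = na + nb - 2
sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
t_pooled = (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

print(t_welch, t_pooled)  # slightly different, hence different p-values
```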
3) One Sample t-test
In one-sample t-testing, supply the name of a single vector and the mean to
compare it to (this defaults to 0):
Ex: t.test(data2, mu = 5)
One Sample t-test
data: data2
t = 0.2548, df = 15, p-value = 0.8023
alternative hypothesis: true mean is not equal to 5
95 percent confidence interval:
 4.079448 6.170552
sample estimates:
mean of x
    5.125
4) Using directional hypotheses
To find out whether a sample mean is lower (or greater) than another sample mean,
use the alternative = instruction to switch from a two-sided test (the default)
to a one-sided test, giving "less" or "greater" as appropriate.
> grass
rich graze
1 12 mow
2 15 mow
3 17 mow
4 11 mow
5 15 mow
6 8 unmow
7 9 unmow
8 7 unmow
9 9 unmow
R provides a more sensible and flexible way to set out the data, using
"formula syntax".
A formula is created using the tilde (~) symbol: the response variable goes on
the left of the ~ and the predictor goes on the right.
If the predictor column contains more than two items, the t-test cannot be used.
However, you can still carry out a test by subsetting this predictor column and
specifying which two samples you want to compare.
Ex: t.test(rich ~ graze, data = grass, subset = graze %in% c('mow', 'unmow'))
First specify which column you want to take your subset from (graze in this case)
Then type %in%; tells the command that the list that follows is contained in the
graze column and put the levels in quotes.
The U-test is carried out with the wilcox.test() command:
Ex: wilcox.test(data1, data2)
To test paired data, use matched pair versions of the t-test and the U-test with a
simple extra instruction paired = TRUE.
It does not matter if the data are in two separate sample columns or are
represented as response and predictor
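Under the hood, a matched-pair t-test is just a one-sample t-test on the within-pair differences. A Python sketch with hypothetical before/after data:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical before/after measurements on the same eight subjects
before = [12, 15, 17, 11, 15, 14, 13, 16]
after  = [ 8,  9,  7,  9, 11, 10,  9, 12]

# Pair the data and take the differences
diffs = [b - a for b, a in zip(before, after)]

# One-sample t-test of the differences against a mean of 0
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
df = len(diffs) - 1
print(round(t, 3), df)  # → 5.655 7
```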
3.10.1 Correlation
In R the cor() command determines correlations between two vectors, all the
columns of a data frame (or matrix), or two data frames (or matrix objects).
By default it calculates the Pearson product moment correlation; two other
methods are available:
• Spearman (rho)
• Kendall (tau)
Ex : cor(women$height, women$weight)
[1] 0.9954948
cor() command has calculated the Pearson correlation coefficient between the height
and weight variables contained in the women data frame.
Directly use cor() command on a data frame (or matrix). The correlation matrix
shows all combinations of the variables in the data frame.
cor(women)
          height    weight
height 1.0000000 0.9954948
weight 0.9954948 1.0000000
> head(mf)
  Length Speed Algae  NO3 BOD
1     20    12    40 2.25 200
2     21    14    45 2.15 180
3     22    12    45 1.75 135
4     23    16    80 1.95 120
5     21    20    75 1.95 110
6     20    21    65 2.75 120
cor(mf$Length, mf)
Length Speed Algae NO3 BOD
[1,] 1 -0.3432297 0.7650757 0.4547609 -0.8055507
3.10.2 Covariance
The cov() command uses syntax similar to the cor() command to examine covariance.
The women data are used with the cov() command in the following example:
Ex : > cov(women$height, women$weight)
[1] 69
Directly use the cov() command on a data frame (or matrix)
Ex : > cov(women)
height weight
height 20 69.0000
weight 69 240.2095
3.10.3 Significance Testing
The cor.test() command carries out a significance test of the correlation:
cor.test(women$height, women$weight)
Output:
Here the Pearson correlation has been carried out between height and weight in the
women data and the result also shows the statistical significance of the correlation.
3.10.4 Formula Syntax
If the data are contained in a data frame, the formula syntax is easier than
using the $ syntax or the attach() and with() commands.
> data(cars)
> cor.test(~ speed + dist, data = cars, method = 'spearman', exact = F)
If the data contain a separate grouping column, you can specify the samples to use
from it with an instruction along the same lines as the subset and %in% example
shown for the t-test.
3.11 Chi-Squared Tests
The chisq.test() command carries out chi-squared tests; its main instructions are:
1. chisq.test(x, y = NULL)
A basic chi-squared test is carried out on a matrix or data frame. If x is provided
as a vector, a second vector can be supplied. If x is a single vector and y is not
given, a goodness of fit test is carried out.
2. correct = TRUE
If the data form a 2 x 2 contingency table, the Yates’ correction is applied.
3. p =
A vector of probabilities for use with a goodness of fit test. If p is not given, the
goodness of fit tests that the probabilities are all equal.
4. rescale.p = FALSE
If TRUE, p is rescaled to sum to 1. For use with goodness of fit tests.
5. simulate.p.value = FALSE
If set to TRUE, a Monte Carlo simulation is used to calculate p-values.
6. B = 2000
B is used to specify the number of replicates used in the Monte Carlo test.
3.11.1 Multiple Categories: Chi-Squared Tests
The most common use for a chi-squared test is where you have multiple categories
and want to see if associations exist between them.
In the following example some categorical data set out in a data frame bird.df.
> bird.df
              Garden Hedgerow Parkland Pasture Woodland
Blackbird         47       10       40       2        2
Chaffinch         19        3        5       0        2
Great Tit         50        0       10       7        0
House Sparrow     46       16        8       4        0
Robin              9        3        0       0        2
Song Thrush        4        0        6       0        0
In the data each cell represents a unique combination of the two categories:
there are several habitats and several species.
Run the chisq.test() command simply by giving the name of the data to the command:
To examine the result in more detail, give it a name and set it up as a new object.
> bird.cs
In this example you get a warning message because of some small values in the
observed data; the expected values will probably include some that are smaller than 5.
When you issue the name of the result object you see a very brief result that contains
the salient points.
3.11.2 Monte Carlo Simulation
The default is that simulate.p.value = FALSE and that B = 2000. The latter is the number
of replicates to use in the Monte Carlo test, which is set to 2500 for this example.
3.11.3 Yates’ Correction for 2 x 2 Tables
In statistics, a contingency table is a type of table in a matrix format that
displays the frequency distribution of the variables.
It shows the distribution of one variable in rows and another in columns, and
is used to study the correlation between the two variables.
For a 2 x 2 contingency table it is common to apply the Yates’ correction.
By default this is used if the contingency table has two rows and two columns.
You can turn off the correction using the correct = FALSE instruction in the
command.
> nd
Urt.dio.y Urt.dio.n
Rum.obt.y 96 41
Rum.obt.n 26 57
> chisq.test(nd)
Pearson's Chi-squared test with Yates' continuity correction
data: nd X-squared = 29.8653, df = 1, p-value = 4.631e-08
chisq.test(nd, correct = FALSE)
• At the top you see the data; when you run the chisq.test() command you see
that Yates’ correction is applied automatically.
• In the second example you force the command not to apply the correction by
setting correct = FALSE.
• Yates’ correction is applied only when the matrix is 2 x 2, and even if you tell R to
apply the correction explicitly it will do so only if the table is 2 x 2.
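The corrected statistic can be verified by hand. This Python sketch recomputes the chi-squared value for the nd table above, with and without Yates' correction:

```python
# Observed counts from the nd contingency table above
obs = [[96, 41],
       [26, 57]]

row = [sum(r) for r in obs]          # row totals
col = [sum(c) for c in zip(*obs)]    # column totals
n = sum(row)                         # grand total

chi2 = 0.0        # uncorrected statistic
chi2_yates = 0.0  # with Yates' continuity correction
for i in range(2):
    for j in range(2):
        exp = row[i] * col[j] / n            # expected count
        d = abs(obs[i][j] - exp)
        chi2 += d ** 2 / exp
        chi2_yates += (d - 0.5) ** 2 / exp   # subtract 0.5 from |O - E|

print(round(chi2_yates, 4))  # → 29.8653, matching the X-squared value above
```

The correction shrinks each |O - E| by 0.5, so the corrected statistic is always slightly smaller than the uncorrected one.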
Goodness of Fit Tests
• The goodness of fit test is used to test whether sample data fits a distribution from
a certain population (e.g., a population with a normal distribution).
• In other words, it tells you whether your sample data belongs to a certain population.
• Goodness of fit tests commonly used in statistics are chi-square, Kolmogorov-Smirnov,
Anderson-Darling, and Shapiro-Wilk.
In general use the chisq.test() command to carry out a goodness of fit test.
If the data contain two vectors of numerical values, one representing the
observed values and the other representing the expected ratio of values,
the goodness of fit tests the data against the ratios (probabilities) you specified.
If you do not specify any, the data are tested against equal probability.
In the following example a data frame contains two columns; the first column contains
values relating to an old survey, and the second column contains values relating to a
new survey. You want to see if the proportions of the new survey match the old one,
so you perform a goodness of fit test:
> survey
old new
woody 23 19
shrubby 34 30
tall 132 111
short 98 101
grassy 45 52
mossy 53 26
To run the test use the chisq.test() command, but you need to specify the test data
as a single vector and also point to the vector that contains the probabilities:
> survey.cs = chisq.test(survey$new, p = survey$old, rescale.p = TRUE)
> survey.cs
In this example you did not have the probabilities as true probabilities but as
frequencies; you use the rescale.p = TRUE instruction to make sure that these are
converted to probabilities (this instruction is set to FALSE by default).
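The same goodness of fit statistic can be computed by hand. This Python sketch rescales the old-survey frequencies to probabilities (the effect of rescale.p = TRUE) and tests the new survey against them:

```python
# Observed new-survey values and old-survey frequencies from above
new = [19, 30, 111, 101, 52, 26]   # observed
old = [23, 34, 132, 98, 45, 53]    # expected ratios (frequencies)

# rescale.p = TRUE: convert the old frequencies to probabilities
p = [x / sum(old) for x in old]

# Expected counts for the new survey under those probabilities
expected = [pi * sum(new) for pi in p]

# Chi-squared goodness of fit statistic, df = categories - 1
chi2 = sum((o - e) ** 2 / e for o, e in zip(new, expected))
df = len(new) - 1
print(round(chi2, 4), df)  # → 15.8389 5
```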
The result contains all the usual items for a chi-squared result object, but if you
display the expected values, for example, you do not automatically get to see the
row names, even though they are present in the data:
> survey.cs$exp
>names(survey.cs$expected) = row.names(survey)
> survey.cs$exp