
Statistics and Its Applications in Data Science

Lecture 5
Data Science: Inference + Computation
• Foundation of Data Science
• Mathematics & Statistics
• Computer Science (data structures + programming)

• Statistical Thinking
• Basis for modeling a big data project

• Mathematical background
• Linear algebra (lines, planes, vectors, matrices, linear equations, …)
• Representation of data and relations

• Computation and Visualization
• Tools to process statistics
Statistics: Introduction
• Statistics refers to the mathematics & techniques used for understanding
data.
• A rich, enormous field
• Statistical analysis is very important to big data analytics
• In this lecture only the basics are covered; more is offered in STA courses
• Many tools available
• Python Statistics module
• Provides functions for calculating mathematical statistics of numeric data
• Documentation
• https://docs.python.org/3/library/statistics.html
• Pandas statistics
• Introduction to Pandas (supplement)
Basic Statistical Concepts
• Population
• A population is a collection of objects about which information is sought
• Sample
• A sample is a part of the population that is observed
• Example

import pandas as pd

df = pd.read_csv('scores.csv', index_col=0)  # millions of scores
print(df.sample(100))

# df.sample(100) returns a sample of 100 items from a collection
# of score objects (which is a population)
Population vs Sample
• Population: all members of a group in a study
• The average height of men
• The average height of living males ≥ 18 yr in the USA between 2001 and 2010
• The average height of all male students ≥ 18 yr registered in Fall 2021
• Sample: a subset of the members in the population
• Most studies choose to sample the population due to cost/time or other factors
• i.e. most analyses are performed on samples
• Example: election polls
• Each sample is only one of many possible subsets of the population
• May or may not be representative of the whole population
• Need to choose sample carefully to draw valid/reasonable conclusion
• Sample size and sampling procedure are important
Descriptive Statistics

• Descriptive statistics
• produce statistics that summarize the data concisely (e.g. mean/average,
standard deviation)
• Chapter 5
• Chapter 6 (probability) and Chapter 7 (Hypothesis and Inference)
• As optional materials
• Exploratory data analysis
• look for patterns, differences, and other features
Inferential Statistics
• Descriptive statistics describes data (for example, a chart or graph)
and inferential statistics allows you to make predictions (“inferences”)
from that data.
• With inferential statistics, you take data from samples and make
generalizations about a population.
• Two areas of inferential statistics
• Estimating parameters, e.g. use sample mean as population mean
• Hypothesis testing – use sample data to answer research questions, e.g.
• Does eating breakfast help children perform better in school?
• Is a new cancer treatment drug effective?
Statistical terms of a single data set
• Information (numbers) that give a quick and simple
description of the data
• Maximum value
• Minimum value
• Range
• Mean
• Median
• Mode
• Quantile/Quartile/Percentile
• Variance
• Standard deviation

Q1: What statistics course did you take?
Q2: Do you still remember these terms?
Example: Test scores
(Figures: score statistics, Blackboard statistics, visualization)
Describing a Single Set of Data
• Given a list of values, we’d get some information from it (or even
apply some operation on it)
• Example: A list of exam scores
• What is the highest, lowest, median, mean, …; Add bonus to everyone?
• Percentiles, quantiles, …
• My score is 76, am I in the top 25%?
• Mode
• On a 5-point quiz, it is interesting that a lot of students got a score of 4.
• What is the standard deviation? Curve it?
•…
Maximum, Minimum and Range
• Very familiar concepts
• Maximum: greatest/largest element of a sample
• Minimum: least/smallest element of a sample
• Range: the difference between the minimum and maximum (max – min)

• Example: find the maximum, minimum and range for the following list of values
13, 18, 13, 14, 13, 16, 14, 21, 13
Minimum:
Maximum:
Range:
Maximum, Minimum and Range
• Very familiar concepts
• Maximum: greatest/largest element of a sample
• Minimum: least/smallest element of a sample
• Range: the difference between the minimum and maximum (max – min)

• Example: find the maximum, minimum and range for the following list of values
13, 18, 13, 14, 13, 16, 14, 21, 13
Minimum: 13
Maximum: 21
Range: 21-13 = 8
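As a quick check, here is a minimal sketch computing the same three statistics with Python built-ins (using the list from this example):

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]
print(max(values))                 # 21
print(min(values))                 # 13
print(max(values) - min(values))   # range = 8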
Central Tendencies
• notion of where our data is centered
• mean, median, mode

mean([1, 10, 2, 9, 5]) = (1+10+2+9+5)/5 = 5.4

median([1, 10, 2, 9, 5]) => sort the list [1, 2, 5, 9, 10] => pick the middle one: 5

median([1, 9, 2, 10]) = (2 + 9) / 2 = 5.5


Definitions: Mean, Median, Mode
• Mean (also called average, or arithmetic average)
• For a sample of n values x_i (i = 1, …, n), the mean µ is the sum of the
values divided by the number of values: µ = (x_1 + x_2 + … + x_n) / n
• Median: The median is a simple measure of
central tendency.
• To find the median, we arrange the observations
in order from smallest to largest value (i.e. sorted)
• If there is an odd number of observations, the
median is the middle value.
• If there is an even number of observations, the
median is the average of the two middle values.
• Lo-median
• Hi-median
• Mode: the most frequently occurring
number found in a set of numbers.
• The mode is found by collecting and
organizing data in order to count the
frequency of each result.
• The result with the highest number of
occurrences is the mode of the set.
Mode
• Example:
• 10 students took a quiz of 5 points. The scores are:
5, 4, 3, 4, 4, 5, 4, 4, 2, 4
• What score was received by “most” people?
• The mode for the data list is 4

• The mode of a data set is the number that occurs most often.
• E.g. from a survey of pizza prices, we find that the mode of New York pizza prices
is 3 dollars.
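A minimal sketch verifying the quiz-score example with Python's statistics module (values taken from the slide above):

from statistics import mean, median, mode, multimode

scores = [5, 4, 3, 4, 4, 5, 4, 4, 2, 4]
print(mean(scores))       # 3.9
print(median(scores))     # 4.0
print(mode(scores))       # 4, the most frequent score
print(multimode(scores))  # [4], all modes in case of ties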
Practice

(1) Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13, 18

(2) Should I curve the grades?
Compared with my standard grading rubric, will a curve help if your score is 72, 64, or 44,
respectively?
Problem with mean/median
• Assume my class has only 5 students, test result: [100, 30, 40, 98, 95]
mean score = 72.6 (the class is doing fine!)
median score = 95 (the class is doing great!)

• Practical example: average salary survey for a group of people


• What if Michael Jordan is in your group?

• Average vs. Mean: Average can be ambiguous


• The average household income in this community is $60,000
• The average (mean) income for households in this community is $60,000
• The income for an average household in this community is $60,000
Methods in Python statistics module
mean()            Arithmetic mean ("average") of data.
fmean()           Fast, floating-point arithmetic mean.
geometric_mean()  Geometric mean of data.
harmonic_mean()   Harmonic mean of data.
median()          Median (middle value) of data.
median_low()      Low median of data.
median_high()     High median of data.
median_grouped()  Median, or 50th percentile, of grouped data.
mode()            Single mode (most common value) of discrete or nominal data.
multimode()       List of modes (most common values) of discrete or nominal data.
quantiles()       Divide data into intervals with equal probability.
Python statistics Module
from statistics import *

Raw dataset:
quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,100,95,90,46,83,97,94,91,81,88,78,86]
How to keep two digits after the decimal point?
In [25]: round(variance(quiz1),2)
Out[25]: 165.12

round(fnum, k) : round a real number to k digits after decimal point

Or, you can use a print format, e.g. .2f, to keep two digits after the
decimal point.
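A small sketch showing the two approaches side by side (the 165.12 value is the one shown above):

from statistics import variance

quiz1 = [95, 51, 92, 99, 82, 90, 86, 89, 92, 92, 96, 91, 92, 100, 95, 90,
         46, 83, 97, 94, 91, 81, 88, 78, 86]

print(round(variance(quiz1), 2))   # round to 2 digits: 165.12
print(f"{variance(quiz1):.2f}")    # format with .2f:    165.12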
Arithmetic, Geometric, Harmonic Mean

>>> from statistics import *
>>> L = [12, 43, 42]
>>> mean(L)
32.333333333333336
>>> fmean(L)
32.333333333333336
>>> geometric_mean(L)
27.880442608624076
>>> harmonic_mean(L)
23.006369426751593
>>>
median, median_low, median_high
>>> L = [12, 43, 42, 21]
>>> median(L)
31.5
>>> median_low(L)
21
>>> median_high(L)
42
>>> median_grouped(L)
41.5
>>>
Finding the Median of Grouped Data
Test #1: 17 students took the test; the score distribution is as follows.
What is the median score of the students?

Marks out of 50    Frequency    Cumulative frequency
0-10               2            2
10-20              4            6
20-30              5            11
30-40              4            15
40-50              2            17
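A hedged worked calculation using the standard grouped-median formula, median = L + ((n/2 - cf)/f) * h, where L is the lower boundary of the median class, cf the cumulative frequency below it, f its frequency, and h the class width (variable names here are only for illustration):

n = 17            # total number of students
# n/2 = 8.5, so the median class is 20-30 (the first class whose cumulative frequency, 11, is >= 8.5)
L = 20            # lower boundary of the median class
cf = 6            # cumulative frequency below the median class
f = 5             # frequency of the median class
h = 10            # class width
median = L + ((n / 2 - cf) / f) * h
print(median)     # 25.0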
Q: Have you learned all these formulas/calculations in the STA class?
Recap
• What we’ve discussed
• Describing a single set of data
• Maximum, minimum, range
• Mean (arithmetic, geometric, harmonic), median (median, median-low, median-high,
median-grouped), mode

Q: which measure(s) describe central tendency?

• Scores used as a sample data set


• Discussion: Are there any other data set(s) that could fit the above analysis/description?
Quantile
• A generalization of the median is the quantile, which represents the value
under which a certain percentile of the data lies
• Quantiles are the cut-point values
• The median represents the value under which 50% of the data lies
• The 75% quantile is the value which 75% of the data (of the population or sample) are less
than or equal to
• e.g. SAT scores
• Example: given a score list [78, 45, 55, 98, 86, 72, 65, 12, 9, 22]
0.1 quantile of scores is 9
0.25 quantile of scores is 22
0.9 quantile of scores is 86
Quantiles: Python Statistics
# Decile cut points for empirically sampled data
#A frequency distribution into equal groups
#Returns a list of n - 1 cut points separating the intervals.
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100, 75, 105, 103, 109, 76, 119, 99,
91, 103, 129, 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87, 102, 121,
111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104] #total 50 data elements
distributionLst =[round(q, 1) for q in quantiles(data, n=10) ] #a list of n-1 points
print(distributionLst)
[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

Q: my score is 105, about what percentile?

# Interpretation of the result: these 9 cut points divide the whole population into 10
# equal-sized groups. If your score is above 111.0 you are in the top group (group 1); if it is
# between 106.0 and 109.8 you are in the 3rd group from the top; if it is below 81.0 you are in
# the bottom (10th) group.

Decile group:    <=10%  10-20  20-30  30-40  40-50  50-60  60-70  70-80  80-90  90-100%
Upper cut point:  81.0   86.2   89.0   99.4  102.5  103.6  106.0  109.8  111.0


Quartiles and Percentiles (vs. Quantiles)
• Set n to 4 for quartiles (the default).
quantiles(data, n=4) #or, quantiles(data)
[87.0, 102.5, 108.25]

• Set n to 10 for deciles (see the previous slide)
quantiles(data, n=10)
[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

• Set n to 100 for percentiles, i.e. 99 cut points that separate the data into 100 equal-sized groups.
• e.g. SAT test score: 75 percentile
quantiles(data, n=100)
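To answer a question like "my score is 105, about what percentile?", one rough approach (a sketch, not the only method) is to count how many decile cut points the score exceeds:

from statistics import quantiles

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100, 75, 105, 103,
        109, 76, 119, 99, 91, 103, 129, 106, 101, 84, 111, 74, 87, 86, 103,
        103, 106, 86, 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95, 103,
        107, 101, 81, 109, 104]

cuts = [round(q, 1) for q in quantiles(data, n=10)]   # 9 decile cut points
score = 105
below = sum(score > c for c in cuts)                  # cut points the score exceeds
print(f"A score of {score} falls roughly in the {below * 10}-{(below + 1) * 10}th percentile range.")
# For this dataset: roughly the 60-70th percentile range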
Quantile vs. Quartile vs. Percentile

Quartile: each of four equal groups into which a population can be divided according to the distribution of values of a particular variable.
Recap
• Given three values a, b, c, what does (a·b·c)^(1/3) (the cube root of abc) represent?
A. Harmonic mean
B. Median high
C. Geometric mean
D. Multimode

• From a list of quiz scores (max 25) we calculated quantiles as follows. If your score
is 21, which of the following is most likely to be your standing in the class?
Quantiles = [4, 5, 8, 11, 13, 14, 17, 19, 23]
A. 85%
B. 95%
C. 75%
D. 60%
Box plot
Boxplot: Example
>>> import matplotlib.pyplot as plt
>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92,
110, 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75,
87, 102, 121,111, 88, 89, 101, 106, 95, 103, 107, 101, 81,
109, 104] #total 50 data elements

>>> plt.boxplot(data)

>>> plt.show()
Max= ?
Min = ?
Median (50 percentile) = ?
Q1 (25 percentile) = ?
Q3: (75 percentile) = ?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Given [0.1, 0.2, 0.25, 0.3, 0.4, 0.4, 0.6, 0.7, 0.8, 0.95], which graph does it likely resemble: A, B, C, D, or E?
Practice

1. Given the dataset [0.1, 0.95, 0.4, 0.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.4],
which of the box plots shown on the previous slide will look like (i.e. be
close to) the box plot for this dataset? (Select one without actually
drawing the box plot for the above dataset.)
(1) A (2) B (3) C (4) D (5) E

2. Use matplotlib or Pandas to draw the box plot for the above
dataset.
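For part 2, a minimal sketch reusing the plt.boxplot() call from the earlier example (an equivalent Pandas df.plot.box() would also work):

import matplotlib.pyplot as plt

data = [0.1, 0.95, 0.4, 0.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.4]
plt.boxplot(data)
plt.show()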
Practice
Boxplot: Example

From the boxplot we may estimate:
Max= 128 or 129 (top line)
Min = 74 or 75 (bottom line)
Median (50 percentile) = 102-104 (red line)
Outliers
• Outliers are values that are far from the central tendency.
• Question: which term describes the central tendency?
• Outliers might be caused by errors in collecting or
processing the data, or they might be correct but unusual
measurements.
• It is always a good idea to check for outliers, and
sometimes it is useful and appropriate to discard them.
Plotting Outliers
Dataset: [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45 ]

Outlier: 1.7
(as shown on box plot)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45]), columns=['A'])
df.plot.box()
How to determine outliers
• IQR: Interquartile Range

Lower Outlier = Q1 – (1.5 * IQR)
Higher Outlier = Q3 + (1.5 * IQR)

• Q1: first quartile


• Q3: third quartile

• How to calculate IQR?


IQR = Q3 – Q1
Example: Outliers
Dataset (sorted) 1,2,5,6,7,9,12,15,18,19,38
Step 1: Find the median. median = 9
Step 2: Find Q1 and Q3. Q1 = 5 Q3 = 18
Step 3: Subtract Q1 from Q3. IQR = 18-5=13
Step 4: Lower and higher “fences”/”boundaries”.
Lower Outlier = Q1 – (1.5 * IQR) = 5 - 1.5*13= -14.5
Higher Outlier= Q3 + (1.5 * IQR) = 18 + 1.5 *13 = 37.5

Conclusion: in above dataset, 38 is the only outlier
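A small sketch of the same steps in Python; note that statistics.quantiles() with its default method happens to reproduce the Q1 and Q3 used above for this dataset (other quartile conventions can give slightly different values):

from statistics import quantiles

data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 38]
q1, _, q3 = quantiles(data, n=4)          # quartile cut points: 5.0 and 18.0
iqr = q3 - q1                             # 18 - 5 = 13
lower_fence = q1 - 1.5 * iqr              # -14.5
upper_fence = q3 + 1.5 * iqr              # 37.5
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)                           # [38]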


Practice
• Check that 1.7 is an outlier for the following dataset

Dataset: [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45 ]

Outlier: 1.7
Statistics question: Does range include outliers?
• According to some online statistical references:
• The outlier is a piece of data that is distant from all other observations, and
therefore should be excluded from the data set.
• However, the mean would be affected by outliers.
• Advice from a reputable site https://www.itl.nist.gov/div898/handbook/ (section 7.1.6)
• Outliers should be investigated carefully. Often, they contain valuable
information about the process under investigation or the data gathering and
recording process.
• Before considering the possible elimination of these points from the data, one
should try to understand why they appeared and whether it is likely similar
values will continue to appear.
Dispersion: Measure of Spread
• Dispersion refers to measures of how spread out our data is.
• Typically, they're statistics for which values near zero signify not spread out at
all and for which large values signify very spread out
• A very simple measure of Dispersion is range
range = maximum – minimum
note: Python already uses "range" for another purpose (the built-in range() function).

• A more complicated measure is variance


• Variance
• Standard deviation
Variance
• Let’s look at two examples
• Apples are all pretty much the same size (at least the ones sold in
supermarkets). So if you buy 6 apples and the total weight is 3 lbs, would it be
a reasonable summary to say they are about 0.5lb each?

• But pumpkins are more diverse. Suppose there are several varieties in a
garden, three decorative pumpkins that are 1 lb each, two pie pumpkins that
are 3 lbs each, and one Giant pumpkin that weighs 591 lbs. The mean is 100
lbs. However, if we say “The average pumpkin in my garden is 100 lbs” that is
misleading, at least not clear.
[1, 1, 1, 3, 3, 591] => average is 600/6 = 100
Mean and Variance
• If there is no single number that summarizes pumpkin weights, we
can do a little better with two numbers: mean and variance.
• mean is intended to describe the central tendency,
• variance is intended to describe the spread.

Deviation from the mean: x_i − µ
Variance: σ² (the mean of the squared deviations)
Standard deviation: σ (the square root of the variance)
Variance and Standard Deviation
• Standard deviation looks at how spread out a group of numbers is
from the mean
• Standard deviation equals the square root of the variance.
• The variance measures the average degree to which each point
differs from the mean (the average of all data points).
• The two concepts are useful and significant for certain applications
• E.g. traders who use them to measure market volatility.

Deviation from the mean: x_i − µ
Variance: σ²
Standard deviation: σ (the square root of the variance)
Population Variance vs. Sample Variance

Population variance: σ² = (1/N) Σ (x_i − µ)², where µ is the population mean and N the population size.

Sample variance: s² = (1/(n−1)) Σ (x_i − x̄)², where x̄ is the sample mean and n the sample size.

https://22vignesh97.medium.com/statistics-for-data-science-e1327584209a
Why does the Sample Variance divide by n – 1?

When we're dealing with a sample from a larger population, x_bar (x̄) is only an estimate
of the actual mean, which means that on average (x_i - x_bar) ** 2 is an underestimate of
x_i's squared deviation from the mean; this is why we divide by n - 1 instead of n. See Wikipedia.
Measures of Spread

pstdev() Population standard deviation of data.

pvariance() Population variance of data.

stdev() Sample standard deviation of data.

variance() Sample variance of data.

(1) Which function should we use to calculate the standard deviation of the Test 1 scores for the class?
(2) Give an example where pstdev() would differ from stdev().
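A small sketch illustrating the population vs. sample difference on a toy list (the decimal values are approximate):

from statistics import pvariance, variance, pstdev, stdev

data = [12, 43, 42, 21]
print(pvariance(data), variance(data))   # 179.25 vs 239   (divide by n vs n - 1)
print(pstdev(data), stdev(data))         # ~13.39 vs ~15.46 (sample version is larger)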
Standard Deviation Illustration

Standard Deviation Example (figure)
Question: in this example, standard deviation = ? mean = ?
Practice
1. Calculate the variance and standard deviation of
(1) The sequence of 9 numbers 13, 18, 13, 14, 13, 16, 14, 21, 13
(2) The 6 pumpkins weighing 1, 1, 1, 3, 3, 591
Question: which sequence has larger variance?

2. Assume the mean score of Exam 1 is 70 (which is a C). In which of the following
case(s) are you more likely to get a C grade? In each case, what grade do you
estimate/expect to receive?
(1) Your score is 66 and the standard deviation is 2
(2) Your score is 66 and the standard deviation is 5
(3) Your score is 78 and the standard deviation is 1
Standard deviation graph

Q: If a curve is flatter and more spread out (e.g. the left one), does this show a larger or
smaller standard deviation?
Histogram Data Distribution
• Describes how often each value
appears.
• The most common representation of
a distribution is a histogram
Simple Distributions – bar charts
import matplotlib.pyplot as plt

plt.bar([1, 3, 5, 7, 9], [5, 2, 7, 8, 2], label="Quiz1")
plt.bar([2, 4, 6, 8, 10], [8, 6, 2, 5, 6], label="Quiz2", color='g')
plt.legend()
plt.xlabel('quiz score')      # x positions are the scores
plt.ylabel('# of students')   # bar heights are the student counts
plt.title('Quiz score comparison')
plt.show()
Histogram
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, color='g')
plt.show()
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, bins=5, color='g')
plt.show()
Histogram vs. Bar Graph
• Histograms visualize quantitative
data or numerical data, whereas
bar charts display categorical
variables.
• The numerical data in a histogram
may be continuous (having infinite
values)
• If we use a bar graph, attempting to
display all possible values of a
continuous variable along an axis
would be impractical
• Unless we pre-summarize data into
certain categories, i.e. [60, 65) as
category 1, [65,70) category 2, …
Statistical Definition of Histogram
• A histogram is a display of statistical information that uses rectangles
to show the frequency of data items in successive numerical
intervals of equal size.
• In the most common form of histogram, the independent variable is
plotted along the horizontal axis and the dependent variable is
plotted along the vertical axis.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 100, 20
x = mu + sigma*np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='green', alpha=0.75)

# add a 'best fit' line
y = norm.pdf(bins, mu, sigma)
l = plt.plot(bins, y, 'r--', linewidth=1)

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=20$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

(Figure: Histogram of IQ)
Histogram for score distribution
import matplotlib.pyplot as plt

quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,
100,95,90,46,83,97,94,91,81,88,78,86]
plt.hist(quiz1, bins=5, color='g')
plt.show()
Case Study: Social Media friend counts
• Assume a dataset contains millions of users’ friend counts
e.g. num_friends = [75, 25, 41, 40, 25, ... ], a very large list
note: num_friends[0] = 75 means Person_0 has 75 friends.
• A real example: twitter’s data https://www.kaggle.com/hwassner/TwitterFriends

• What (summary, tendency, …) information could we find out from the dataset?
• Central tendency
• Average, median number of friends?
• Mode? Quantiles?
• Dispersion/Spread
• Range? Highest, lowest number of friends? -- outliers affecting the average?
• Standard deviation? …
• Assume all users have 0 to 100 friends, then suddenly the friendliest user has 200 friends; how would
that affect the standard deviation?
• Histogram
• …
Correlation
• We discussed a case study analyzing social media friend counts
• One hypothesis
• People who spend more time on social media must have more friends
i.e. is the amount of time people spend on the site related to the number of
friends they have on the site?
• How to investigate the relationship between these two metrics?
• Visualization -- preliminary method
• Statistically: covariance
• Covariance: the paired analogue of variance.
• Whereas variance measures how a single variable deviates from its mean, covariance
measures how two variables vary in tandem from their means
Covariance Formula
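The formula image is not reproduced here; as a reference, the sample covariance (the convention NumPy's np.cov uses by default) is:

cov(x, y) = ( Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) ) / (n − 1)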
Numpy covariance matrix
• Example
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23] #number of friends
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7] #number of daily hours spent on social network
>>> print(np.cov(a,b))
[[75.5  12.9 ]
 [12.9   2.84]]

The 2x2 array returned by np.cov(a,b) has elements equal to
[[cov(a,a)  cov(a,b)]
 [cov(a,b)  cov(b,b)]]

Thus, cov(a,b) = 12.9
But, what does this mean?
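A minimal from-scratch check of the off-diagonal entry (a sketch, using the same a and b as above):

from statistics import mean

a = [10, 4, 19, 24, 23]
b = [2.3, 0.5, 4.3, 3.2, 4.7]
a_bar, b_bar = mean(a), mean(b)
# sample covariance: divide by n - 1, matching np.cov's default
cov_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (len(a) - 1)
print(cov_ab)   # 12.9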
Correlation
• It is hard to interpret the covariance number
• It’s more common to look at the correlation, which divides out the standard deviations of
both variables
def correlation(xs, ys):
    # Measures how much xs and ys vary in tandem about their means
    # (assumes helper functions standard_deviation() and covariance() are defined)
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    else:
        return 0  # if there is no variation, correlation is zero
Correlation Formula
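As with the covariance slide, the formula image is not reproduced here; the correlation coefficient divides the covariance by both standard deviations:

corr(x, y) = cov(x, y) / (σ_x · σ_y)

For the example above, 12.9 / (sqrt(75.5) · sqrt(2.84)) ≈ 0.881, matching the np.corrcoef output below.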
Numpy calculation of correlation
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23]
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7]
>>> print(np.cov(a,b))
[[75.5  12.9 ]
 [12.9   2.84]]
>>> print(np.corrcoef(a,b))
[[1.         0.88096177]
 [0.88096177 1.        ]]

Thus, correlation(a, b) = 0.88096177

The correlation is unitless and always lies between –1 (perfect anticorrelation) and
1 (perfect correlation).
A number like 0.25 represents a relatively weak positive correlation.
A number like 0.75 represents a relatively strong positive correlation.
(See the previous linear regression discussion.)
Correlation (Coefficient) Matrix
The 2x2 array returned by np.corrcoef(a,b) has elements equal to

[[corr(a,a)  corr(a,b)]
 [corr(a,b)  corr(b,b)]]

The correlation matrix may represent the mutual relation among a set of
variables, e.g. automobile features.
Variance, Co-Variance, and Correlation
• Variance
• Tells us magnitude
• Co-Variance
• Tells us direction
• Correlation
• Tells us direction as well as magnitude
Summary

• An important module for data science: analyzing value distributions and spread
• Advanced statistical terms are left for a statistics course to explain
• Visualization of data helps us understand the data distribution, but statistical
analysis may reveal more accurate information
