Stats Lect
• Statistical Thinking
• Basis for modeling a big data project
• Mathematical background
• Linear algebra (lines, planes, vectors, matrices, linear equations, …)
• Representation of data and relations
• Descriptive statistics
• produce statistics that summarize the data concisely (e.g. mean, median,
standard deviation)
• Chapter 5
• Chapter 6 (probability) and Chapter 7 (Hypothesis and Inference)
• As optional materials
• Exploratory data analysis
• look for patterns, differences, and other features
Inferential Statistics
• Descriptive statistics describes data (for example, a chart or graph)
and inferential statistics allows you to make predictions (“inferences”)
from that data.
• With inferential statistics, you take data from samples and make
generalizations about a population.
• Two areas of inferential statistics
• Estimating parameters, e.g. use the sample mean to estimate the population mean
• Hypothesis testing – use sample data to answer research questions, e.g.
• Does eating breakfast help children perform better in school?
• Is a new cancer treatment drug effective?
Statistical terms of a single data set
• Information (numbers) that give a quick and simple
description of the data
• Maximum value
• Minimum value
• Range
• Mean
• Median
• Mode
• Quantile/Quartile/Percentile
• Variance
• Standard deviation
Q1: What statistics course did you take?
Q2: Do you still remember these terms?
Example: Test scores
[Figure: score statistics and Blackboard statistics visualization]
Describing a Single Set of Data
• Given a list of values, we want to get some information from it (or even
apply some operation to it)
• Example: A list of exam scores
• What is the highest, lowest, median, mean, …; Add bonus to everyone?
• Percentiles, quantiles, …
• My score is 76, am I in the top 25%?
• Mode
• A quiz of 5 points, interesting that a lot of students get a score of 4.
• What is the standard deviation? Curve it?
•…
Maximum, Minimum and Range
• Very familiar concepts
• Maximum: greatest/largest element of a sample
• Minimum: least/smallest element of a sample
• Range: the difference between the minimum and maximum (max – min)
• Example: find the maximum, minimum and range for the following list of values
13, 18, 13, 14, 13, 16, 14, 21, 13
Minimum: 13
Maximum: 21
Range: 21-13 = 8
Central Tendencies
• notion of where our data is centered
• mean, median, mode
median([1, 10, 2, 9, 5]) => sort the list [1, 2, 5, 9, 10] => pick up middle one: 5
• The mode of a data set is the number that occurs most often.
• E.g. From a survey of pizza prices, we find that the mode of New York pizza prices
is 3 dollars.
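The three central tendencies above can be checked with Python's standard `statistics` module. This is a minimal sketch; the list `values` reuses the numbers from the Maximum/Minimum/Range example and is only an illustration.

```python
from statistics import mean, median, mode

# Same nine values used in the Maximum, Minimum and Range example
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

print(mean(values))    # arithmetic mean: sum of values / number of values
print(median(values))  # middle value of the sorted list
print(mode(values))    # value that occurs most often

# The median example from the slide: sort, then pick the middle element
print(median([1, 10, 2, 9, 5]))
```

For a list with an even number of elements, `median` averages the two middle values.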
Practice
Raw dataset:
quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,100,95,90,46,83,97,94,91,81,88,78,86]
How to keep two digits after the decimal point?
In [25]: round(variance(quiz1),2)
Out[25]: 165.12
Or, you can use a print format: .2f keeps two digits after the
decimal point.
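Both approaches can be sketched as follows, assuming `variance` comes from the standard `statistics` module (the slide does not show its import). `round` returns a number, while the `.2f` format produces a string for display.

```python
from statistics import variance

quiz1 = [95, 51, 92, 99, 82, 90, 86, 89, 92, 92, 96, 91, 92,
         100, 95, 90, 46, 83, 97, 94, 91, 81, 88, 78, 86]

v = variance(quiz1)        # sample variance (divides by n - 1)

rounded = round(v, 2)      # numeric result rounded to 2 decimal places
formatted = f"{v:.2f}"     # string with exactly 2 digits after the point

print(rounded)
print(formatted)
```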
Arithmetic, Geometric, Harmonic Mean
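The formulas for this slide did not survive extraction. As a hedged illustration only, the three means named in the heading are available in Python's `statistics` module (`geometric_mean` requires Python 3.8 or later); the sample list `xs` is hypothetical.

```python
from statistics import mean, geometric_mean, harmonic_mean

xs = [1, 2, 4]  # hypothetical sample, chosen so the results are easy to check

arithmetic = mean(xs)            # (1 + 2 + 4) / 3
geometric = geometric_mean(xs)   # cube root of 1 * 2 * 4
harmonic = harmonic_mean(xs)     # 3 / (1/1 + 1/2 + 1/4)

print(arithmetic, geometric, harmonic)
```

For positive data that are not all equal, the harmonic mean is the smallest of the three and the arithmetic mean is the largest.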
• Set n to 100 for percentiles, i.e. 99 cut points that separate the data into 100 equal-sized groups.
• e.g. SAT test score: 75th percentile
quantiles(data, n=100)
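A small sketch of how `n` controls the number of cut points, assuming `quantiles` is `statistics.quantiles` (Python 3.8+) and reusing the `quiz1` scores from the earlier practice slide as sample data.

```python
from statistics import quantiles

quiz1 = [95, 51, 92, 99, 82, 90, 86, 89, 92, 92, 96, 91, 92,
         100, 95, 90, 46, 83, 97, 94, 91, 81, 88, 78, 86]

quartiles = quantiles(quiz1, n=4)      # 3 cut points: Q1, median, Q3
deciles = quantiles(quiz1, n=10)       # 9 cut points
percentiles = quantiles(quiz1, n=100)  # 99 cut points

print(len(quartiles), len(deciles), len(percentiles))

# The k-th percentile is stored at index k - 1
p75 = percentiles[74]
print(p75)
```

With the default method, the 75th percentile and the third quartile are the same cut point.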
Quantile vs. Quartile vs. Percentile
• From a list of quiz scores (max 25) we calculated the quantiles below. If your score
is 21, which of the following is most likely your standing in the class?
Quantiles = [4, 5, 8, 11, 13, 14, 17, 19, 23]
A. 85%
B. 95%
C. 75%
D. 60%
Box plot
Boxplot: Example
>>> import matplotlib.pyplot as plt
>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92,
...         110, 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
...         106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75,
...         87, 102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81,
...         109, 104]  # total 50 data elements
>>> plt.boxplot(data)
>>> plt.show()
Max = ?
Min = ?
Median (50th percentile) = ?
Q1 (25th percentile) = ?
Q3 (75th percentile) = ?
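One way to check your readings from the box plot is to compute the five-number summary directly. This is a sketch using the standard `statistics` module; `method="inclusive"` is an assumption chosen because it interpolates linearly between data points, and other quartile conventions give slightly different Q1 and Q3.

```python
from statistics import median, quantiles

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92,
        110, 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75,
        87, 102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81,
        109, 104]

# Quartiles are the three cut points that split the data into four groups
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

summary = {
    "min": min(data),
    "Q1": q1,
    "median": median(data),
    "Q3": q3,
    "max": max(data),
}
print(summary)
```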
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Sorted data [0.1, 0.2, 0.25, 0.3, 0.4, 0.4, 0.6, 0.7, 0.8, 0.95]: which box plot is the likely match, A, B, C, D, or E?
Practice
1. Given the dataset [0.1, 0.95, 0.4, 0.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.4],
which of the box plots shown on the previous slide is closest to the
box plot for this dataset? (Select one without actually drawing the
box plot.)
(1) A (2) B (3) C (4) D (5) E
2. Use matplotlib or Pandas to draw the box plot for the above
dataset.
Practice
Boxplot: Example
Outlier: 1.7
(as shown on box plot)
df.plot.box()
How to determine outliers
• IQR: Interquartile Range
Dataset: [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45 ]
Outlier: 1.7
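The slide's formula did not survive extraction, so the following is a sketch under a stated assumption: it uses the common 1.5 × IQR convention, which flags values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. Quartiles are computed with `method="inclusive"` (linear interpolation between data points); a different quartile convention can move the fences.

```python
from statistics import quantiles

data = [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45]

# Assumption: linear-interpolation quartiles and the 1.5 * IQR rule
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                    # interquartile range

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers)
```

With these choices the only flagged value is 1.7, which agrees with the box plot on the slide.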
Statistics question: Does range include outliers?
• According to some online statistical references:
• The outlier is a piece of data that is distant from all other observations, and
therefore should be excluded from the data set.
• However, mean would be affected by outliers.
• Advice from a reputable site https://www.itl.nist.gov/div898/handbook/ (section 7.1.6)
• Outliers should be investigated carefully. Often, they contain valuable
information about the process under investigation or the data gathering and
recording process.
• Before considering the possible elimination of these points from the data, one
should try to understand why they appeared and whether it is likely similar
values will continue to appear.
Dispersion: Measure of Spread
• Dispersion refers to measures of how spread out our data is.
• Typically, they’re statistics for which values near zero signify not spread out at
all and for which large values signify very spread out
• A very simple measure of Dispersion is range
range = maximum – minimum
note: Python already uses the name “range” for a built-in, so use a different name in code.
• But pumpkins are more diverse. Suppose there are several varieties in a
garden, three decorative pumpkins that are 1 lb each, two pie pumpkins that
are 3 lbs each, and one giant pumpkin that weighs 591 lbs. The mean is 100
lbs. However, if we say “The average pumpkin in my garden is 100 lbs” that is
misleading, or at least unclear.
[1, 1, 1, 3, 3, 591] => average is 600/6 = 100
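A short sketch of the range and the pumpkin example. The function name `data_range` is an assumption, chosen to avoid shadowing Python's built-in `range`.

```python
from statistics import mean, median

def data_range(xs):
    """Spread of the data: largest value minus smallest value."""
    return max(xs) - min(xs)

pumpkins = [1, 1, 1, 3, 3, 591]   # weights in lbs, from the slide

print(data_range(pumpkins))  # dominated by the single giant pumpkin
print(mean(pumpkins))        # 600 / 6
print(median(pumpkins))      # much closer to a "typical" pumpkin
```

Comparing the mean with the median shows why the mean alone is a misleading summary here.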
Mean and Variance
• If there is no single number that summarizes pumpkin weights, we
can do a little better with two numbers: mean and variance.
• mean is intended to describe the central tendency,
• variance is intended to describe the spread.
Population Variance
Sample Variance
https://22vignesh97.medium.com/statistics-for-data-science-e1327584209a
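The formula images for these two headings were lost in extraction. The standard definitions are given here for reference, with notation chosen as an assumption: $N$ and $\mu$ are the population size and mean, and $n$ and $\bar{x}$ are the sample size and mean.

```latex
% Population variance: average squared deviation from the population mean
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2

% Sample variance: divide by n - 1 instead of n
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

The standard deviation is the square root of the corresponding variance.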
Why Sample Variance divide by n – 1?
When we’re dealing with a sample from a larger population, x_bar (the sample mean)
is only an estimate of the actual mean, which means that on average
(x_i - x_bar) ** 2 is an underestimate of x_i’s squared deviation
from the mean, which is why we divide by n - 1 instead of n. See Wikipedia.
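The difference between dividing by n and by n - 1 can be seen on the pumpkin weights. This sketch computes both by hand and compares them with `pvariance` and `variance` from the standard `statistics` module; the helper name `sum_sq_dev` is hypothetical.

```python
from statistics import mean, pvariance, variance

def sum_sq_dev(xs):
    """Sum of squared deviations from the mean of xs."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs)

pumpkins = [1, 1, 1, 3, 3, 591]
n = len(pumpkins)
ss = sum_sq_dev(pumpkins)

pop_var = ss / n          # population variance: divide by n
samp_var = ss / (n - 1)   # sample variance: divide by n - 1

print(pop_var, pvariance(pumpkins))
print(samp_var, variance(pumpkins))
```

The sample variance is always the larger of the two, and the gap shrinks as n grows.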
Measures of Spread
(1) Which formula should we use to calculate the standard deviation of the Test 1 scores for the class?
(2) Give an example where pstdev() would differ from stdev().
Standard Deviation Illustration
Standard Deviation Example
Question: in this example, mean = ? standard deviation = ?
Practice
1. Calculate the variance and standard deviation of
(1) The sequence of 9 numbers 13, 18, 13, 14, 13, 16, 14, 21, 13
(2) The 6 pumpkins weighing 1, 1, 1, 3, 3, 591
Question: which sequence has larger variance?
2. Assume the mean score of Exam 1 is 70 (which is a C). In which of the following
case(s) are you more likely to get a C grade? In each case, what grade do you
estimate/expect to receive?
(1) Your score is 66 and the standard deviation is 2
(2) Your score is 66 and the standard deviation is 5
(3) Your score is 78 and the standard deviation is 1
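A sketch for checking your answers. Part 1 uses the sample versions (`variance`, `stdev`) from the `statistics` module. For part 2, one way to reason (an assumption, not necessarily the intended grading rule) is to measure how many standard deviations a score lies from the mean; the helper `z_score` is a hypothetical name.

```python
from statistics import stdev, variance

seq = [13, 18, 13, 14, 13, 16, 14, 21, 13]
pumpkins = [1, 1, 1, 3, 3, 591]

# Part 1: compare the spread of the two sequences
for name, xs in [("sequence", seq), ("pumpkins", pumpkins)]:
    print(name, variance(xs), stdev(xs))

# Part 2: distance from the mean in units of standard deviation
def z_score(score, mean_score, sd):
    return (score - mean_score) / sd

cases = [(66, 2), (66, 5), (78, 1)]
z_values = [z_score(score, 70, sd) for score, sd in cases]
print(z_values)
```

A score within about one standard deviation of the mean is close to the typical score; a score several standard deviations away is far from it.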
Standard deviation graph
Q: If a curve is flatter and more spread out (e.g. the left one), does this show a larger or
smaller standard deviation?
Histogram Data Distribution
• Describes how often each value
appears.
• The most common representation of
a distribution is a histogram
Simple Distributions – bar charts
import matplotlib.pyplot as plt
plt.bar([1,3,5,7,9],[5,2,7,8,2], label="Quiz1")
plt.bar([2,4,6,8,10],[8,6,2,5,6], label="Quiz2", color='g')
plt.legend()
plt.xlabel('# of students')
plt.ylabel('quiz score')
plt.show()
Histogram
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, color='g')
plt.show()
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, bins=5, color='g')
plt.show()
Histogram vs. Bar Graph
• Histograms visualize quantitative
data or numerical data, whereas
bar charts display categorical
variables.
• The numerical data in a histogram
may be continuous (taking infinitely
many possible values)
• If we use a bar graph, attempting to
display all possible values of a
continuous variable along an axis
would be impractical
• Unless we pre-summarize data into
certain categories, i.e. [60, 65) as
category 1, [65,70) category 2, …
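Pre-summarizing into categories can be sketched with a small binning helper. The scores below are hypothetical, and `bin_label` (with its `start` and `width` parameters) is an illustrative name, not part of any library.

```python
from collections import Counter

def bin_label(score, start=60, width=5):
    """Return the half-open interval [low, high) that contains score."""
    low = start + int((score - start) // width) * width
    return f"[{low}, {low + width})"

# Hypothetical continuous scores
scores = [61, 64, 65, 68.5, 69.9, 72, 74, 60]

# Count how many scores fall into each category
counts = Counter(bin_label(s) for s in scores)

for label in sorted(counts):
    print(label, counts[label])
```

Once the data are grouped this way, each interval acts as a category and a bar chart of `counts` looks like a histogram.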
Statistical Definition of Histogram
• A histogram is a display of statistical information that uses rectangles
to show the frequency of data items in successive numerical
intervals of equal size.
• In the most common form of histogram, the independent variable is
plotted along the horizontal axis and the dependent variable is
plotted along the vertical axis.
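import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Assumption: the slide omits the data and the hist call; mu and sigma are
# taken from the title, while the sample size and bin count are illustrative
mu, sigma = 100, 20
x = mu + sigma * np.random.randn(10000)
plt.hist(x, bins=50, density=True, color='g')
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=20$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()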
Histogram for score distribution
import matplotlib.pyplot as plt
quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,
100,95,90,46,83,97,94,91,81,88,78,86]
plt.hist(quiz1, bins=5, color='g')
plt.show()
Case Study: Social Media friend counts
• Assume a dataset contains millions of users’ friend counts
e.g. num_friends = [75, 25, 41, 40, 25, ... ], a very large list
note: num_friends[0] = 75 means Person_0 has 75 friends.
• A real example: Twitter’s data https://www.kaggle.com/hwassner/TwitterFriends
• What (summary, tendency, …) information could we find out from the dataset?
• Central tendency
• Average, median number of friends?
• Mode? Quantiles?
• Dispersion/Spread
• Range? Highest, lowest number of friends? -- outliers affecting the average?
• Standard deviation? …
• Assume all users have 0 to 100 friends, then suddenly one friendliest user has 200 friends, how would
that affect standard deviation?
• Histogram
• …
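The questions above can be explored on a small stand-in list. The values in `num_friends` below are hypothetical (the real dataset is far larger), and the sketch uses only the standard `statistics` module.

```python
from statistics import mean, median, mode, pstdev

# Hypothetical friend counts, all between 0 and 100
num_friends = [75, 25, 41, 40, 25, 10, 3, 60, 25, 99]

print("mean:", mean(num_friends))
print("median:", median(num_friends))
print("mode:", mode(num_friends))
print("range:", max(num_friends) - min(num_friends))

# Effect of one very friendly user on the spread
with_outlier = num_friends + [200]
sd_before = pstdev(num_friends)
sd_after = pstdev(with_outlier)
print("stdev before:", sd_before)
print("stdev after:", sd_after)
```

A single extreme value pulls the mean upward and increases the standard deviation, while the median barely moves.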
Correlation
• We discussed a case study analyzing social media friend counts
• One hypothesis
• People who spend more time on social media must have more friends
i.e. the amount of time people spend on the site is related to the number of
friends they have on the site?
• How to investigate the relationship between these two metrics?
• Visualization -- preliminary method
• Statistically: covariance
• Covariance: the paired analogue of variance.
• Whereas variance measures how a single variable deviates from its mean, covariance
measures how two variables vary in tandem from their means
Covariance Formula
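The formula image was lost in extraction. A minimal sketch of the sample covariance, assuming the n - 1 convention (which matches NumPy's default for `np.cov`): multiply the paired deviations from each mean, sum them, and divide by n - 1.

```python
from statistics import mean, variance

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    assert len(xs) == len(ys), "xs and ys must have the same length"
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

a = [10, 4, 19, 24, 23]          # number of friends
b = [2.3, 0.5, 4.3, 3.2, 4.7]    # daily hours on the social network

print(covariance(a, b))
print(covariance(a, a), variance(a))  # covariance with itself is the variance
```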
Numpy covariance matrix
• Example
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23]  # number of friends
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7]  # number of daily hours spent on social network
>>> print(np.cov(a,b))
[[75.5  12.9 ]
 [12.9   2.84]]
The 2x2 array returned by np.cov(a,b) has elements equal to
[[cov(a,a)  cov(a,b)]
 [cov(a,b)  cov(b,b)]]
Thus, cov(a,b) = 12.9
But, what does this mean?
Correlation
• It is hard to interpret the covariance number
• It’s more common to look at the correlation, which divides out the standard deviations of
both variables
def correlation(xs, ys):
    # Measures how much xs and ys vary in tandem about their means
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    else:
        return 0  # if no variation, correlation is zero
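The function above relies on `standard_deviation` and `covariance` helpers that are not shown on the slide. A self-contained sketch, assuming both use the sample (n - 1) convention and substituting `statistics.stdev` for `standard_deviation`:

```python
from statistics import mean, stdev

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    stdev_x, stdev_y = stdev(xs), stdev(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    return 0  # if either variable has no variation, report zero

a = [10, 4, 19, 24, 23]          # number of friends
b = [2.3, 0.5, 4.3, 3.2, 4.7]    # daily hours on the social network

print(correlation(a, b))
print(correlation(a, a))
```

Because the standard deviations are divided out, the result has no units and does not change if either variable is rescaled by a positive factor.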
Correlation Formula
Numpy calculation of correlation
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23]
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7]
>>> print(np.cov(a,b))
[[75.5  12.9 ]
 [12.9   2.84]]
>>> print(np.corrcoef(a,b))
[[1.         0.88096177]
 [0.88096177 1.        ]]
Thus, correlation(a, b) = 0.88096177
• The correlation is unitless and always lies between -1 (perfect anticorrelation) and 1 (perfect correlation).
• A number like 0.25 represents a relatively weak positive correlation.
• A number like 0.75 represents a relatively strong positive correlation.
(see previous linear regression discussion.)
Correlation (Coefficient) Matrix
The 2x2 array returned by np.corrcoef(a,b) has
elements equal to