Stats Lect

This document provides an overview of key concepts in statistics and data science. It discusses the foundations of data science including mathematics, statistics, and computer science. It introduces statistical thinking and modeling for big data projects. It also covers basic statistical concepts such as populations, samples, descriptive statistics, inferential statistics, and terms related to single data sets. Examples of statistical calculations and visualizations are presented using Python libraries like Pandas and the statistics module.

Uploaded by

Christopher

Statistics and Its

Applications in Data Science

Lecture 5
Data Science: Inference + Computation
• Foundation of Data Science
• Mathematics & Statistics
• Computer Science (data structures + programming)

• Statistical Thinking
• Basis for modeling a big data project

• Mathematical background
• Linear algebra (lines, planes, vectors, matrices, linear equations, …)
• Representation of data and relations

• Computation and Visualization

• Tools to process statistics
Statistics: Introduction
• Statistics refers to the mathematics & techniques used for understanding
data.
• A rich, enormous field
• Statistical analysis is very important to big data analytics
• This lecture covers only the basics; courses offered in STA go into more depth
• Many tools available
• Python Statistics module
• Provides functions for calculating mathematical statistics of numeric (real-valued) data
• Documentation
• https://docs.python.org/3/library/statistics.html
• Pandas statistics
• Introduction to Pandas (supplement)
Basic Statistical Concepts
• Population
• A population is a collection of objects about which information is sought
• Sample
• A sample is a part of the population that is observed
• Example

import pandas as pd

df = pd.read_csv('scores.csv', index_col=0)  # millions of scores

print(df.sample(100))

# df.sample(100) returns a random sample of 100 rows from the collection
# of score objects (which is the population)
Population vs Sample
• Population: all members of a group in a study
• The average height of men
• The average height of living males ≥ 18 yr in the USA between 2001 and 2010
• The average height of all male students ≥ 18 yr registered in Fall 2021
• Sample: a subset of the members in the population
• Most studies sample the population because of cost, time, or other factors
• i.e. most analysis is performed on samples
• Example: election polls
• Each sample is only one of many possible subsets of the population
• May or may not be representative of the whole population
• Need to choose the sample carefully to draw valid/reasonable conclusions
• Sample size and sampling procedure are important
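A minimal sketch of why sample size matters, assuming a synthetic population of scores (the numbers are generated for illustration, not real data). Larger random samples tend to give a mean closer to the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic population: 100,000 made-up scores centered near 70 (an assumption)
population = [random.gauss(70, 10) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Draw simple random samples of increasing size and compare their means
sample_means = {}
for size in (10, 100, 10_000):
    sample = random.sample(population, size)
    sample_means[size] = statistics.fmean(sample)
    print(f"n={size:>6}: sample mean = {sample_means[size]:.2f}, "
          f"population mean = {population_mean:.2f}")
```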
Descriptive Statistics

• Descriptive statistics
• produce statistics that summarize the data concisely (e.g. mean, median,
standard deviation)
• Chapter 5
• Chapter 6 (probability) and Chapter 7 (Hypothesis and Inference)
• As optional materials
• Exploratory data analysis
• look for patterns, differences, and other features
Inferential Statistics
• Descriptive statistics describes data (for example, a chart or graph)
and inferential statistics allows you to make predictions (“inferences”)
from that data.
• With inferential statistics, you take data from samples and make
generalizations about a population.
• Two areas of inferential statistics
• Estimating parameters, e.g. use sample mean as population mean
• Hypothesis testing – use sample data to answer research questions, e.g.
• Does eating breakfast help children perform better in school?
• Is a new cancer treatment drug effective?
Statistical terms of a single data set
• Information (numbers) that give a quick and simple
description of the data
• Maximum value
• Minimum value
• Range
• Mean
• Median
• Mode
• Quantile/Quartile/Percentile
• Variance
• Standard deviation
Q1: Which statistics course did you take?
Q2: Do you still remember these terms?
Example: Test scores
[Figure: score statistics (Blackboard statistics) and a visualization of the scores]
Describing a Single Set of Data
• Given a list of values, we’d like to get some information from it (or even
apply some operation to it)
• Example: A list of exam scores
• What is the highest, lowest, median, mean, …? Add a bonus to everyone?
• Percentiles, quantiles, …
• My score is 76; am I in the top 25%?
• Mode
• On a 5-point quiz, it is interesting that a lot of students get a score of 4.
• What is the standard deviation? Curve it?
•…
Maximum, Minimum and Range
• Very familiar concepts
• Maximum: greatest/largest element of a sample
• Minimum: least/smallest element of a sample
• Range: the difference between the minimum and maximum (max – min)

• Example: find the maximum, minimum and range for the following list of values
13, 18, 13, 14, 13, 16, 14, 21, 13
Minimum: 13
Maximum: 21
Range: 21-13 = 8
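The same answers can be checked with Python built-ins. This is a small sketch; the variable names are only illustrative.

```python
values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

minimum = min(values)
maximum = max(values)
# Named value_range so the built-in range() is not shadowed
value_range = maximum - minimum

print(minimum, maximum, value_range)  # 13 21 8
```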
Central Tendencies
• notion of where our data is centered
• mean, median, mode

mean([1, 10, 2, 9, 5]) = (1+10+2+9+5)/5 = 27/5 = 5.4

median([1, 10, 2, 9, 5]) => sort the list [1, 2, 5, 9, 10] => pick up middle one: 5

median([1, 9, 2, 10]) = (2 + 9) / 2 = 5.5

Definitions: Mean, Median, Mode
• Mean (also called average, or arithmetic average)
• For a sample of n values, xi (i=1,…,n), the mean, µ, is the sum of the
values divided by the number of values: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Median: The median is a simple measure of
central tendency.
• To find the median, we arrange the observations
in order from smallest to largest value (i.e. sorted)
• If there is an odd number of observations, the
median is the middle value.
• If there is an even number of observations, the
median is the average of the two middle values.
• Lo-median: the smaller of the two middle values
• Hi-median: the larger of the two middle values
• Mode: the most frequently occurring
number found in a set of numbers.
• The mode is found by collecting and
organizing data in order to count the
frequency of each result.
• The result with the highest number of
occurrences is the mode of the set.
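As a quick check of the mean and median definitions above, here is a short sketch using the standard-library statistics module on the two small lists from the Central Tendencies slide.

```python
from statistics import mean, median, median_low, median_high

odd = [1, 10, 2, 9, 5]   # odd number of observations
even = [1, 9, 2, 10]     # even number of observations

print(mean(odd))          # 5.4 (27 / 5)
print(median(odd))        # 5, the middle of [1, 2, 5, 9, 10]
print(median(even))       # 5.5, the average of 2 and 9
print(median_low(even))   # 2, the smaller middle value
print(median_high(even))  # 9, the larger middle value
```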
Mode
• Example:
• 10 students took a quiz of 5 points. The scores are:
5, 4, 3, 4, 4, 5, 4, 4, 2, 4
• What score was received by the most people?
• The mode for the data list is 4

• The mode of a data set is the number that occurs most often.
• E.g. From a survey of pizza prices, we find that the mode of New York pizza prices
is 3 dollars.
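The counting procedure for the mode can be sketched with collections.Counter and compared with statistics.mode. The second list is a made-up example that shows what happens with a tie.

```python
from collections import Counter
from statistics import mode, multimode

scores = [5, 4, 3, 4, 4, 5, 4, 4, 2, 4]

frequency = Counter(scores)        # how often each score occurs
print(frequency.most_common(1))    # [(4, 6)]: the score 4 occurs 6 times
print(mode(scores))                # 4

# multimode() returns every value tied for the highest count
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```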
Practice

(1) Find the mean, median, mode, and range for the following list of values:
13, 18, 13, 14, 13, 16, 14, 21, 13, 18

(2) Should I curve the grades?

Compared with my standard grading rubric,
will a curve help if your score is 72, 64, or 44,
respectively?
Problem with mean/median
• Assume my class has only 5 students, test result: [100, 30, 40, 98, 95]
mean score = 72.6 (the class is doing fine!)
median score = 95 (the class is doing great!)

• Practical example: average salary survey for a group of people

• What if Michael Jordan is in your group?

• Average vs. Mean: Average can be ambiguous

• The average household income in this community is $60,000
• The average (mean) income for households in this community is $60,000
• The income for an average household in this community is $60,000
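The five-student example can be reproduced with the statistics module. The second list is a hypothetical variation (the lowest score replaced by 0), added only to show that one extreme value moves the mean but not the median.

```python
from statistics import mean, median

scores = [100, 30, 40, 98, 95]
print(mean(scores))    # 72.6
print(median(scores))  # 95

# Hypothetical variation: the lowest score becomes 0
changed = [100, 0, 40, 98, 95]
print(mean(changed))    # 66.6, pulled down by the extreme value
print(median(changed))  # 95, unchanged
```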
Methods in Python statistics module
mean()              Arithmetic mean (“average”) of data.
fmean()             Fast, floating point arithmetic mean.
geometric_mean()    Geometric mean of data.
harmonic_mean()     Harmonic mean of data.
median()            Median (middle value) of data.
median_low()        Low median of data.
median_high()       High median of data.
median_grouped()    Median, or 50th percentile, of grouped data.
mode()              Single mode (most common value) of discrete or nominal data.
multimode()         List of modes (most common values) of discrete or nominal data.
quantiles()         Divide data into intervals with equal probability.
Python statistics Module
from statistics import *

Raw dataset:
quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,100,95,90,46,83,97,94,91,81,88,78,86]
How to keep two digits after the decimal point?
In [25]: round(variance(quiz1),2)
Out[25]: 165.12

round(fnum, k): rounds a real number to k digits after the decimal point

Or, you can use a print format, with .2f keeping two digits after the
decimal point.
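Both options can be sketched side by side. Note that round() returns a number, while the .2f format produces a string for display.

```python
from statistics import variance

quiz1 = [95, 51, 92, 99, 82, 90, 86, 89, 92, 92, 96, 91, 92,
         100, 95, 90, 46, 83, 97, 94, 91, 81, 88, 78, 86]

v = variance(quiz1)   # sample variance
rounded = round(v, 2)     # a float rounded to two digits
formatted = f"{v:.2f}"    # a string with exactly two digits after the point

print(rounded)    # 165.12
print(formatted)  # 165.12
```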
Arithmetic, Geometric, Harmonic Mean

>>> from statistics import *

>>> L = [12, 43, 42]
>>> mean(L)
32.333333333333336
>>> fmean(L)
32.333333333333336
>>> geometric_mean(L)
27.880442608624076
>>> harmonic_mean(L)
23.006369426751593
>>>
median, median_low, median_high
>>> L = [12, 43, 42, 21]
>>> median(L)
31.5
>>> median_low(L)
21
>>> median_high(L)
42
>>> median_grouped(L)
41.5
>>>
Finding the Median of Grouped Data
Test #1: 17 students took the test, scores distribution as follows.
What is the median score of the students?

Marks out of 50    Frequency    Cumulative frequency
0-10               2            2
10-20              4            6
20-30              5            11
30-40              4            15
40-50              2            17

Q: Have you learned all these formulas/calculations in the STA class?
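The slide does not state which formula it has in mind. A common textbook choice, assumed here, interpolates inside the median class: median = L + ((n/2 - cf) / f) * w, where L is the lower bound of the median class, cf the cumulative frequency before it, f its frequency, and w its width. The helper name grouped_median is hypothetical.

```python
# Each class is (lower bound, upper bound, frequency), taken from the table above
classes = [(0, 10, 2), (10, 20, 4), (20, 30, 5), (30, 40, 4), (40, 50, 2)]

def grouped_median(classes):
    n = sum(freq for _, _, freq in classes)
    half = n / 2
    cumulative = 0  # cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= half:
            # The median falls in this class: interpolate linearly inside it
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq

result = grouped_median(classes)
print(result)  # 25.0
```

With n = 17, half the count is 8.5, which falls in the 20-30 class (cumulative frequency 6 before it, frequency 5), giving 20 + (2.5 / 5) * 10 = 25.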
Recap
• What we’ve discussed
• Describing a single set of data
• Maximum, minimum, range
• Mean (arithmetic, geometric, harmonic), median (median, median-low, median-high,
median-grouped), mode

Q: which measure(s) describe central tendency?

• Scores used as a sample data set

• Discussion: Any other data set(s) could fit the above analysis/description?
Quantile
• A generalization of the median is the quantile, which represents the value
under which a certain percentage of the data lies
• Quantiles are the cut-point values
• The median represents the value under which 50% of the data lies
• The 75% quantile is the value that 75% of the data (of a population or sample) are less than or
equal to
• e.g. SAT scores
• Example: given a score list [78, 45, 55, 98, 86, 72, 65, 12, 9, 22]
0.1 quantile of scores is 9
0.25 quantile of scores is 22
0.9 quantile of scores is 86
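The three values above are consistent with one simple rule, assumed here: take the smallest data value such that at least a fraction p of the data is less than or equal to it. Other conventions (for example the interpolation used by statistics.quantiles) can give different numbers. The function name quantile is only illustrative.

```python
import math

def quantile(values, p):
    """Smallest value with at least a fraction p of the data <= it."""
    ordered = sorted(values)
    # 1-based rank of the cut point; p * n is a float product, so unusual
    # values of p may need extra care with rounding
    k = max(math.ceil(p * len(ordered)), 1)
    return ordered[k - 1]

scores = [78, 45, 55, 98, 86, 72, 65, 12, 9, 22]
print(quantile(scores, 0.1))   # 9
print(quantile(scores, 0.25))  # 22
print(quantile(scores, 0.9))   # 86
```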
Quantiles: Python Statistics
from statistics import quantiles

# Decile cut points for empirically sampled data
# Divides a frequency distribution into equal groups
# Returns a list of n - 1 cut points separating the intervals.
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100, 75, 105, 103, 109, 76, 119, 99,
        91, 103, 129, 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87, 102, 121,
        111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]  # total 50 data elements
distributionLst = [round(q, 1) for q in quantiles(data, n=10)]  # a list of n-1 cut points
print(distributionLst)
# Output: [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

# Interpretation of the result: based on the score data above, if we divide the whole population
# into 10 groups, a score > 111.0 puts you in the top group, say group 1; a score between 106.0
# and 109.8 puts you in the 3rd group; a score lower than 81.0 puts you in the bottom (10th) group.

Cut points between groups:
<=10% | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80 | 80-90 | 90-100%
     81.0    86.2    89.0    99.4   102.5   103.6   106.0   109.8   111.0

Q: my score is 105, about what %?
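One way to approach the question about a score of 105 is sketched below: count the share of scores at or below it, and locate it among the decile cut points. The variable names are illustrative, and treating "standing" as the share of scores less than or equal to yours is an assumption.

```python
from bisect import bisect_left
from statistics import quantiles

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100, 75, 105, 103, 109, 76, 119, 99,
        91, 103, 129, 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87, 102, 121,
        111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
my_score = 105

# Fraction of scores less than or equal to mine
share_at_or_below = sum(1 for x in data if x <= my_score) / len(data)
print(share_at_or_below)  # 0.66

# Number of decile cut points strictly below my score (0 = bottom group, 9 = top group)
deciles = quantiles(data, n=10)
band = bisect_left(deciles, my_score)
print(band)  # 6, i.e. between the 60% and 70% cut points
```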

Quartiles and Percentiles (vs. Quantiles)
• Set n to 4 for quartiles (the default).
quantiles(data, n=4) #or, quantiles(data)
[87.0, 102.5, 108.25]

• Set n to 10 for deciles (say last slide)

quantiles(data, n=10)
[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

• Set n to 100 for percentiles, i.e. 99 cut points that separate the data into 100 equal-sized groups.
• e.g. SAT test score: 75th percentile
quantiles(data, n=100)
Quantile vs. Quartile vs. Percentile

Quartile: each of four equal groups into which a population can be divided
according to the distribution of values of a particular variable.
Recap
• Given three values a, b, c, what does $\sqrt[3]{abc}$ represent?
A. Harmonic mean
B. Median high
C. Geometric mean
D. Multimode

• From a list of quiz scores (max 25) we calculated the quantiles below. If your score
is 21, which of the following is most likely to be your standing in the class?
Quantiles = [4, 5, 8, 11, 13, 14, 17, 19, 23]
A. 85%
B. 95%
C. 75%
D. 60%
Box plot
Boxplot: Example
>>> import matplotlib.pyplot as plt
>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92,
110, 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75,
87, 102, 121,111, 88, 89, 101, 106, 95, 103, 107, 101, 81,
109, 104] #total 50 data elements

>>> plt.boxplot(data)

>>> plt.show()
Max = ?
Min = ?
Median (50th percentile) = ?
Q1 (25th percentile) = ?
Q3 (75th percentile) = ?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box()
Given [0.1, 0.2, 0.25, 0.3, 0.4, 0.4, 0.6, 0.7, 0.8, 0.95], which graph is the likely match: A, B, C, D, or E?
Practice

1. Given the dataset [0.1, 0.95, 0.4, 0.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.4],
which of the box plots shown on the previous slide looks closest
to the box plot for this dataset? (Select one without
actually drawing the box plot for the dataset.)
(1) A (2) B (3) C (4) D (5) E

2. Use matplotlib or Pandas to draw the box plot for the above
dataset.
Practice
Boxplot: Example

From the boxplot we may estimate:

Max = 128 or 129 (top line)
Min = 74 or 75 (bottom line)
Median (50th percentile) = 102-104 (red line)
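The visual estimates can be compared with computed values. This sketch uses the statistics module; matplotlib may compute quartiles with a slightly different interpolation rule, so the box edges in the plot need not match quantiles() exactly.

```python
from statistics import median, quantiles

data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92,
        110, 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
        106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75,
        87, 102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81,
        109, 104]  # total 50 data elements

print(min(data), max(data))  # 74 129
print(median(data))          # 102.5
quartiles = quantiles(data, n=4)
print(quartiles)             # [87.0, 102.5, 108.25]
```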
Outliers
• Outliers are values that are far from the central tendency.
• Question: which term describes the central tendency?
• Outliers might be caused by errors in collecting or
processing the data, or they might be correct but unusual
measurements.
• It is always a good idea to check for outliers, and
sometimes it is useful and appropriate to discard them.
Plotting Outliers
Dataset: [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45 ]

Outlier: 1.7
(as shown on box plot)

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([0.1, 0.95, 0.4, 1.7,
                            0.8, 0.2, 0.3, 0.25, 0.6, 0.45]), columns=['A'])

df.plot.box()
How to determine outliers
• IQR: Interquartile Range

Lower fence = Q1 – (1.5 * IQR); values below it are outliers

Upper fence = Q3 + (1.5 * IQR); values above it are outliers

• Q1: first quartile

• Q3: third quartile

• How to calculate IQR?

IQR = Q3 – Q1
Example: Outliers
Dataset (sorted) 1,2,5,6,7,9,12,15,18,19,38
Step 1: Find the median. median = 9
Step 2: Find Q1 and Q3. Q1 = 5 Q3 = 18
Step 3: Subtract Q1 from Q3. IQR = 18-5=13
Step 4: Find the lower and upper “fences”/“boundaries”.
Lower fence = Q1 – (1.5 * IQR) = 5 - 1.5*13 = -14.5
Upper fence = Q3 + (1.5 * IQR) = 18 + 1.5*13 = 37.5

Conclusion: in above dataset, 38 is the only outlier
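The four steps can be sketched with statistics.quantiles. For this dataset its default ("exclusive") method happens to give the same Q1 and Q3 as the hand calculation; for other datasets different quartile conventions may disagree slightly.

```python
from statistics import quantiles

data = [1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 38]

q1, q2, q3 = quantiles(data, n=4)  # quartile cut points; q2 is the median
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3)                # 5.0 9.0 18.0
print(iqr)                       # 13.0
print(lower_fence, upper_fence)  # -14.5 37.5
print(outliers)                  # [38]
```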

Practice
• Check that 1.7 is an outlier for the following dataset

Dataset: [0.1, 0.95, 0.4, 1.7, 0.8, 0.2, 0.3, 0.25, 0.6, 0.45 ]

Outlier: 1.7
Statistics question: Does range include outliers?
• According to some online statistical references:
• The outlier is a piece of data that is distant from all other observations, and
therefore should be excluded from the data set.
• However, mean would be affected by outliers.
• Advice from a reputable site https://www.itl.nist.gov/div898/handbook/ (section 7.1.6)
• Outliers should be investigated carefully. Often, they contain valuable
information about the process under investigation or the data gathering and
recording process.
• Before considering the possible elimination of these points from the data, one
should try to understand why they appeared and whether it is likely similar
values will continue to appear.
Dispersion: Measure of Spread
• Dispersion refers to measures of how spread out our data is.
• Typically, these are statistics for which values near zero signify data that is not spread out at
all and for which large values signify data that is very spread out
• A very simple measure of dispersion is the range
range = maximum – minimum
note: Python already uses the name “range” for a built-in, so choose another variable name.

• A more complicated measure is variance

• Variance
• Standard deviation
Variance
• Let’s look at two examples
• Apples are all pretty much the same size (at least the ones sold in
supermarkets). So if you buy 6 apples and the total weight is 3 lbs, would it be
a reasonable summary to say they are about 0.5lb each?

• But pumpkins are more diverse. Suppose there are several varieties in a
garden, three decorative pumpkins that are 1 lb each, two pie pumpkins that
are 3 lbs each, and one Giant pumpkin that weighs 591 lbs. The mean is 100
lbs. However, saying “The average pumpkin in my garden is 100 lbs” is
misleading, or at least unclear.
[1, 1, 1, 3, 3, 591] => average is 600/6 = 100
Mean and Variance
• If there is no single number that summarizes pumpkin weights, we
can do a little better with two numbers: mean and variance.
• mean is intended to describe the central tendency,
• variance is intended to describe the spread.

Deviation from the mean: xi-µ

Variance: σ² (the average squared deviation from the mean)
Standard deviation: σ (the square root of the variance)
Variance and Standard Deviation
• Standard deviation looks at how spread out a group of numbers is
from the mean
• Standard deviation equals the square root of the variance.
• The variance measures the average squared distance of each point
from the mean (the average of all data points).
• The two concepts are useful and significant for certain applications
• E.g. traders who use them to measure market volatility.

Deviation from the mean: xi-µ

Variance: σ² (the average squared deviation from the mean)
Standard deviation: σ (the square root of the variance)
Population Variance vs. Sample Variance

Population Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
https://22vignesh97.medium.com/statistics-for-data-science-e1327584209a
Why Sample Variance divide by n – 1?

When we’re dealing with a sample from a larger population, x_bar (the sample mean)
is only an estimate of the actual mean, which means that on average
(x_i - x_bar) ** 2 is an underestimate of x_i’s squared deviation
from the mean, which is why we divide by n - 1 instead of n. See Wikipedia.
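The underestimate can be seen exactly on a tiny case. This sketch assumes a toy population of three values and enumerates every sample of size 2 drawn with replacement; Fraction is used so the averages are exact rather than rounded.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, pvariance, variance

# Toy population chosen for illustration
population = [Fraction(1), Fraction(2), Fraction(3)]
true_variance = pvariance(population)

# Every possible ordered sample of size 2, drawn with replacement (9 samples)
samples = list(product(population, repeat=2))

# Average the two estimators over all samples
avg_divide_by_n = mean(pvariance(s) for s in samples)          # divides by n
avg_divide_by_n_minus_1 = mean(variance(s) for s in samples)   # divides by n - 1

print(true_variance)            # 2/3
print(avg_divide_by_n)          # 1/3, an underestimate on average
print(avg_divide_by_n_minus_1)  # 2/3, matches the population variance
```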
Measures of Spread

pstdev() Population standard deviation of data.

pvariance() Population variance of data.

stdev() Sample standard deviation of data.

variance() Sample variance of data.

(1) Which formula should we use when we calculate the standard deviation of the Test 1 scores for the class?
(2) Give an example where pstdev() would differ from stdev().
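A small sketch of how the four functions differ, reusing the ten quiz scores from the Mode slide. The population versions divide by n and the sample versions by n - 1, so they differ whenever the data has any spread.

```python
from statistics import pstdev, pvariance, stdev, variance

scores = [5, 4, 3, 4, 4, 5, 4, 4, 2, 4]  # mean 3.9, squared deviations sum to 6.9

pop_var = pvariance(scores)   # 6.9 / 10
samp_var = variance(scores)   # 6.9 / 9
print(pop_var)                # 0.69
print(samp_var)               # about 0.767
print(pstdev(scores))         # about 0.83
print(stdev(scores))          # about 0.88
```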
Standard Deviation Illustration
Standard Deviation Example
Question: in this example, standard deviation = ? mean = ?
Practice
1. Calculate the variance and standard deviation of
(1) The sequence of 9 numbers 13, 18, 13, 14, 13, 16, 14, 21, 13
(2) The 6 pumpkins weighing 1, 1, 1, 3, 3, 591
Question: which sequence has larger variance?

2. Assume the mean score of Exam 1 is 70 (which is a C). In which of the following
case(s) are you more likely to get a C grade? In each case, what grade do you
estimate/expect to receive?
(1) Your score is 66 and the standard deviation is 2
(2) Your score is 66 and the standard deviation is 5
(3) Your score is 78 and the standard deviation is 1
Standard deviation graph

Q: If a curve is flatter and more spread out (e.g. the left one), does this show a larger or
smaller standard deviation?
Histogram Data Distribution
• Describes how often each value
appears.
• The most common representation of
a distribution is a histogram
Simple Distributions – bar charts
import matplotlib.pyplot as plt
plt.bar([1,3,5,7,9],[5,2,7,8,2], label="Quiz1")
plt.bar([2,4,6,8,10],[8,6,2,5,6], label="Quiz2", color='g')
plt.legend()
plt.xlabel('# of students')
plt.ylabel('quiz score')

plt.title('Quiz score comparison')

plt.show()
Histogram
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, color='g')
plt.show()
import matplotlib.pyplot as plt
data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110, 100,
75, 105, 103, 109, 76, 119, 99, 91, 103, 129, 106, 101,
84, 111, 74, 87, 86, 103, 103, 106, 86, 111, 75, 87,
102, 121, 111, 88, 89, 101, 106, 95, 103, 107, 101, 81, 109, 104]
plt.hist(data, bins=5, color='g')
plt.show()
Histogram vs. Bar Graph
• Histograms visualize quantitative
data or numerical data, whereas
bar charts display categorical
variables.
• The numerical data in a histogram
may be continuous (having infinitely
many possible values)
• If we use bar graph, attempting to
display all possible values of a
continuous variable along an axis
would be foolish
• Unless we pre-summarize data into
certain categories, i.e. [60, 65) as
category 1, [65,70) category 2, …
Statistical Definition of Histogram
• A histogram is a display of statistical information that uses rectangles
to show the frequency of data items in successive numerical
intervals of equal size.
• In the most common form of histogram, the independent variable is
plotted along the horizontal axis and the dependent variable is
plotted along the vertical axis.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mu, sigma = 100, 20

x = mu + sigma*np.random.randn(10000)

# the histogram of the data

n, bins, patches = plt.hist(x, 50, density=True,
facecolor='green', alpha=0.75)

# add a 'best fit' line

y = norm.pdf(bins, mu, sigma)
l = plt.plot(bins, y, 'r--', linewidth=1)

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=20$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
Histogram for score distribution
import matplotlib.pyplot as plt

quiz1=[95,51,92,99,82,90,86,89,92,92,96,91,92,
100,95,90,46,83,97,94,91,81,88,78,86]
plt.hist(quiz1, bins=5, color='g')
plt.show()
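The counts behind a histogram can also be tabulated without plotting. This sketch assumes intervals of width 10 starting at multiples of 10, which is a choice made for illustration; plt.hist(quiz1, bins=5) picks its own five equal-width intervals between the minimum and maximum.

```python
from collections import Counter

quiz1 = [95, 51, 92, 99, 82, 90, 86, 89, 92, 92, 96, 91, 92,
         100, 95, 90, 46, 83, 97, 94, 91, 81, 88, 78, 86]
width = 10  # interval width (an assumption for this sketch)

# Map each score to the lower edge of its interval, e.g. 86 -> 80 for [80, 90)
counts = Counter((score // width) * width for score in quiz1)

for lower in sorted(counts):
    print(f"[{lower}, {lower + width}): {counts[lower]}")
```

Most of the 25 scores fall in the [80, 90) and [90, 100) intervals, which is the same shape the plotted histogram shows.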
Case Study: Social Media friend counts
• Assume a dataset contains millions of users’ friend counts
e.g. num_friends = [75, 25, 41, 40, 25, ... ], a very large list
note: num_friends[0] = 75 means Person_0 has 75 friends.
• A real example: Twitter data https://www.kaggle.com/hwassner/TwitterFriends

• What (summary, tendency, …) information could we find out from the dataset?
• Central tendency
• Average, median number of friends?
• Mode? Quantiles?
• Dispersion/Spread
• Range? Highest, lowest number of friends? -- outliers affecting the average?
• Standard deviation? …
• Assume all users have 0 to 100 friends, then suddenly one friendliest user has 200 friends, how would
that affect standard deviation?
• Histogram
• …
Correlation
• We discussed a case study analyzing social media friend counts
• One hypothesis
• People who spend more time on social media must have more friends
i.e. the amount of time people spend on the site is related to the number of
friends they have on the site?
• How to investigate the relationship between these two metrics?
• Visualization -- preliminary method
• Statistically: covariance
• Covariance: the paired analogue of variance.
• Whereas variance measures how a single variable deviates from its mean, covariance
measures how two variables vary in tandem from their means
Covariance Formula
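The formula on this slide did not survive extraction. A common definition, and the one that agrees with the default np.cov output in the NumPy example, is the sample covariance, where $\bar{x}$ and $\bar{y}$ are the sample means and n is the number of pairs:

```latex
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

Setting y = x recovers the sample variance, which is why covariance is called the paired analogue of variance.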
Numpy covariance matrix
• Example
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23] #number of friends
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7] #number of daily hours spent on social network
>>> print(np.cov(a,b))
[[75.5 12.9 ]
 [12.9 2.84]]
The 2x2 array returned by np.cov(a,b) has elements equal to
cov(a,a) cov(a,b)
cov(a,b) cov(b,b)
Thus, cov(a,b) = 12.9
But, what does this mean?
Correlation
• It is hard to interpret the covariance number
• It’s more common to look at the correlation, which divides out the standard deviations of
both variables
def correlation(xs, ys):
    # Measures how much xs and ys vary in tandem about their means.
    # standard_deviation() and covariance() are helper functions assumed to be defined elsewhere.
    stdev_x = standard_deviation(xs)
    stdev_y = standard_deviation(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    else:
        return 0  # if no variation, correlation is zero
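The function above relies on two helpers that the slide does not show. A self-contained sketch, assuming the sample (n - 1) versions of both covariance and standard deviation, could look like this; the helper bodies are my assumption, not the lecture's code.

```python
from statistics import mean, stdev

def covariance(xs, ys):
    # Sample covariance (divides by n - 1), the paired analogue of sample variance
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    # Covariance with the standard deviations of both variables divided out
    stdev_x, stdev_y = stdev(xs), stdev(ys)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(xs, ys) / (stdev_x * stdev_y)
    return 0  # no variation in at least one variable

a = [10, 4, 19, 24, 23]         # number of friends
b = [2.3, 0.5, 4.3, 3.2, 4.7]   # daily hours spent on the social network
print(round(covariance(a, b), 2))
print(round(correlation(a, b), 4))
```

On these two lists the results agree with the NumPy values quoted in this lecture (covariance 12.9, correlation about 0.881).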
Correlation Formula
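The formula on this slide did not survive extraction. The standard (Pearson) form, consistent with the correlation function defined earlier, divides the covariance by the two standard deviations, written here as $s_x$ and $s_y$:

```latex
\operatorname{corr}(x, y) = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
```

For the friend-count example this gives $12.9 / \sqrt{75.5 \times 2.84} \approx 0.881$.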
Numpy calculation of correlation
>>> import numpy as np
>>> a = [10, 4, 19, 24, 23]
>>> b = [2.3, 0.5, 4.3, 3.2, 4.7]
>>> print(np.cov(a,b))
[[75.5 12.9 ]
 [12.9 2.84]]
>>> print(np.corrcoef(a,b))
[[1. 0.88096177]
 [0.88096177 1. ]]
Thus, correlation(a, b) = 0.88096177

The correlation is unitless and always lies between –1 (perfect anticorrelation) and 1 (perfect correlation).
A number like 0.25 represents a relatively weak positive correlation.
A number like 0.75 represents a relatively strong positive correlation.
(see previous linear regression discussion.)
Correlation (Coefficient) Matrix
The 2x2 array returned by np.corrcoef(a,b) has
elements equal to

corr(a,a) corr(a,b)
corr(a,b) corr(b,b)

The correlation matrix may represent the mutual relation among a set of
variables, e.g. automobile features.
Variance, Co-Variance, and Correlation
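• Variance
• Tells us the magnitude of the spread of a single variable
• Co-Variance
• Tells us the direction of the relationship between two variables
• Correlation
• Tells us the direction as well as the strength of that relationship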
Summary

• An important module for data science to analyze value distributions and spread

• Advanced statistical terms are left for statistics courses to explain

• Visualization of data helps us understand the data distribution, but statistical analysis may reveal more accurate information

No ratings yet
Grerat Story Series 17jan24d
11 pages
A48970353 - 28750 - 6 - 2023 - QTT201 Ca3
No ratings yet
A48970353 - 28750 - 6 - 2023 - QTT201 Ca3
4 pages
Validity
No ratings yet
Validity
32 pages
Bigfoot reading
No ratings yet
Bigfoot reading
5 pages

• Computation and Visualization

df = pd.read_csv('scores.csv', index_col=0)  # millions of scores

# df.sample(100) returns a random sample of 100 items from the collection
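
The idea of drawing a sample from a large collection can be sketched without pandas. The block below is a rough standard-library analogue of df.sample(100), not the lecture's own code: the population of scores is randomly generated and purely hypothetical, and random.sample stands in for the DataFrame method.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 scores centred near 70.
population = [random.gauss(70, 10) for _ in range(100_000)]

# Draw 100 items without replacement, similar in spirit to df.sample(100).
sample = random.sample(population, 100)

population_mean = mean(population)
sample_mean = mean(sample)
print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean should land close to the population mean, which is why a small sample can be used to estimate a property of millions of scores.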

mean([1, 10, 2, 9, 5]) = (1+10+2+9+5)/5 = 5.4

median([1, 9, 2, 10]) = (2 + 9) / 2 = 5.5
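
The two calculations above can be checked with Python's statistics module. This is a minimal sketch; the list passed to mode is an extra illustrative input, not taken from the slides.

```python
from statistics import mean, median, mode

mean_value = mean([1, 10, 2, 9, 5])      # (1 + 10 + 2 + 9 + 5) / 5
even_median = median([1, 9, 2, 10])      # average of the two middle values, 2 and 9
odd_median = median([1, 10, 2, 9, 5])    # odd count: the single middle value
most_common = mode([1, 2, 2, 3])         # illustrative list; 2 occurs most often

print(mean_value, even_median, odd_median, most_common)
```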

(1) Find the mean, median, mode,

(2) Should I curve the grades?

• Practical example: average salary survey for a group of people

• Average vs. Mean: Average can be ambiguous
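
A salary survey shows why "average" can mislead. In the sketch below the salary figures are hypothetical (in thousands); a single very high earner pulls the mean far above what most people earn, while the median stays near the typical salary.

```python
from statistics import mean, median

# Hypothetical annual salaries in thousands; one value is far above the rest.
salaries = [30, 32, 35, 38, 40, 45, 500]

mean_salary = mean(salaries)
median_salary = median(salaries)
below_mean = sum(1 for s in salaries if s < mean_salary)

print(f"mean:   {mean_salary:.1f}")
print(f"median: {median_salary}")
print(f"{below_mean} of {len(salaries)} people earn less than the mean")
```

Reporting which measure is meant (mean, median, or mode) removes the ambiguity.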

round(fnum, k): round a real number to k digits after the decimal point

>>> from statistics import *

Marks out of 50 | Frequency | Cumulative frequency
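
A table with these three columns can be built from raw marks as follows. This is a sketch only: the marks list is hypothetical, since the slide's own data is not reproduced here.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical marks out of 50 for ten students.
marks = [32, 45, 38, 45, 41, 32, 45, 38, 29, 41]

frequency = Counter(marks)                   # mark -> number of students
distinct_marks = sorted(frequency)           # table rows in ascending order
frequencies = [frequency[m] for m in distinct_marks]
cumulative = list(accumulate(frequencies))   # running total of frequencies

print("Mark  Freq  CumFreq")
for m, f, c in zip(distinct_marks, frequencies, cumulative):
    print(f"{m:>4}  {f:>4}  {c:>7}")
```

The last cumulative frequency always equals the number of observations, which is a quick check on the table.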

Q: which measure(s) describe central tendency?

• Scores used as a sample data set

[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]

• Set n to 10 for deciles (see last slide)
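
A short sketch of how the n parameter of statistics.quantiles selects quartiles or deciles. The scores list is hypothetical, and the default method ("exclusive") is assumed.

```python
from statistics import median, quantiles

# Hypothetical scores, chosen so the cut points are easy to verify by hand.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

quartiles = quantiles(scores, n=4)   # 3 cut points -> 4 equal-sized groups
deciles = quantiles(scores, n=10)    # 9 cut points -> 10 equal-sized groups

print("quartiles:", quartiles)
print("deciles:  ", deciles)
print("median:   ", median(scores))  # equals the middle quartile
```

quantiles(data, n) always returns n - 1 cut points, so n=4 gives Q1, Q2 (the median), and Q3.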

Quartile: each of four equal

From a boxplot we may estimate:

df = pd.DataFrame(np.array([0.1, 0.95, 0.4, 1.7,

Lower outlier bound = Q1 – (1.5 * IQR)

• Q1: first quartile

• How to calculate IQR?

Conclusion: in the above dataset, 38 is the only outlier
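
The IQR rule can be sketched as below. The dataset is hypothetical (it is not the slide's dataset, only one that also contains the value 38), the helper name iqr_outliers is made up for illustration, and the upper bound Q3 + 1.5 * IQR is assumed as the usual counterpart of the lower bound.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(data, n=4)   # default 'exclusive' method
    iqr = q3 - q1                      # IQR = Q3 - Q1
    lower = q1 - 1.5 * iqr             # values below this are outliers
    upper = q3 + 1.5 * iqr             # values above this are outliers
    outliers = [x for x in data if x < lower or x > upper]
    return q1, q3, iqr, outliers

# Hypothetical data with one value far from the rest.
data = [10, 12, 11, 13, 12, 14, 11, 38, 13, 12]
q1, q3, iqr, outliers = iqr_outliers(data)
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```

Different quartile conventions (for example pandas or NumPy defaults) can give slightly different Q1 and Q3, so the bounds may shift a little between tools.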

• A more complicated measure is variance

Deviation from the mean: x_i - µ
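
A small sketch of why variance squares the deviations: the raw deviations from the mean always sum to zero, so they cannot measure spread on their own. The data reuses the list from the mean example; the variable names are illustrative.

```python
from statistics import mean

data = [1, 10, 2, 9, 5]
mu = mean(data)                          # 5.4

deviations = [x - mu for x in data]      # x_i - mu for each value
sum_of_deviations = sum(deviations)      # cancels out to (almost exactly) zero

squared = [d ** 2 for d in deviations]   # squaring removes the sign
population_variance = sum(squared) / len(data)

print("deviations:", [round(d, 1) for d in deviations])
print("sum of deviations:", round(sum_of_deviations, 10))
print("population variance:", round(population_variance, 2))
```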

pstdev() Population standard deviation of data.

pvariance() Population variance of data.

stdev() Sample standard deviation of data.

variance() Sample variance of data.
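
The four functions differ only in the divisor: the population versions divide the sum of squared deviations by n, the sample versions by n - 1. A minimal sketch on a hypothetical data set:

```python
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data with mean 5 and sum of squared deviations 32.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

pop_var = pvariance(data)    # 32 / n
pop_sd = pstdev(data)        # square root of the population variance
samp_var = variance(data)    # 32 / (n - 1)
samp_sd = stdev(data)        # square root of the sample variance

print(f"pvariance={pop_var:.3f}  pstdev={pop_sd:.3f}")
print(f"variance={samp_var:.3f}   stdev={samp_sd:.3f}")
```

Use pvariance and pstdev when the data is the whole population, and variance and stdev when it is a sample used to estimate the population's spread.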

plt.title('Quiz score comparison')

mu, sigma = 100, 20

# the histogram of the data

# add a 'best fit' line
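
The "best fit" idea can be sketched with the standard library alone, leaving the plotting aside. The block below is an assumption-laden stand-in for the slide's plotting code: it draws random values with mu = 100 and sigma = 20, fits a normal distribution to them with statistics.NormalDist, and evaluates the fitted density at a few points where a plotted curve would pass.

```python
import random
from statistics import NormalDist

random.seed(42)  # fixed seed so the sketch is repeatable

mu, sigma = 100, 20
samples = [random.gauss(mu, sigma) for _ in range(10_000)]

# Estimate the mean and standard deviation from the samples.
fitted = NormalDist.from_samples(samples)
print(f"fitted mean:  {fitted.mean:.2f}")
print(f"fitted stdev: {fitted.stdev:.2f}")

# Points on the fitted density curve (the 'best fit' line of a histogram).
curve = [(x, fitted.pdf(x)) for x in range(40, 161, 20)]
for x, density in curve:
    print(f"x={x:>3}  density={density:.5f}")
```

The density is highest near the mean and falls off symmetrically, which is the bell shape the histogram should approximate.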

corr(a, a)   corr(a, b)

The correlation matrix may represent
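
A correlation matrix for two variables can be sketched as follows. The pearson helper and the two data columns are illustrative assumptions; the Pearson coefficient is computed by hand here rather than through a library call.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical columns.
columns = {"a": [1, 2, 3, 4, 5], "b": [2, 1, 4, 3, 5]}

# matrix[i][j] holds corr(column i, column j).
names = list(columns)
matrix = [[pearson(columns[r], columns[c]) for c in names] for r in names]

for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

The diagonal entries corr(a, a) and corr(b, b) are always 1, and the matrix is symmetric because corr(a, b) equals corr(b, a).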

• An important module for data science to analyze value distributions

• Advanced statistical terms are left for a statistics course to explain

• Visualization of data helps understand data distribution but statistical
