
Lect 1

The document provides an overview of statistics, defining key concepts such as population, sample, and variable, and differentiating between descriptive and inferential statistics. It discusses various sampling methods, the importance of data collection, and the distinction between qualitative and quantitative variables. Additionally, it covers data presentation techniques and the statistical description of data, including frequency distributions and graphical representations.


Statistics and probability

University of Baghdad
College of science
Computer science department
First semester - Master ( 2024-2025)
Second semester
Lecture(1)
Dr. Basad Al-Sarray
1. Basics of Statistics
Definition: Science of collection, presentation, analysis, and
reasonable interpretation of data.

Statistics presents a rigorous scientific method for gaining insight into data.
For example, suppose we measure the weight of 100 patients in a study. With so many measurements, simply looking at the data fails to provide an informative account.

However, statistics can give an instant overall picture of the data through graphical presentation or numerical summarization, irrespective of the number of data points.

Besides data summarization, another important task of statistics is to make inferences and predict relations between variables.
What is Statistics?

Statistics: The science of collecting, describing, and interpreting data.
Two areas of statistics:

Descriptive Statistics: collection, presentation, and description of sample data.

Inferential Statistics: making decisions and drawing conclusions about populations.
Introduction to Basic Terms

Population: A collection, or set, of individuals or objects or events whose properties are to be analyzed.

Two kinds of populations: finite or infinite.

Sample: A subset of the population.

[Diagram with the labels: Population, Sampling, Sample, Descriptive Statistics, Statistical Inference]
Example: A college dean is interested in learning about the average age of
faculty. Identify the basic terms in this situation.

The population is the age of all faculty members at the college.
A sample is any subset of that population. For example, we might select 10 faculty members and determine their age.
The variable is the “age” of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the sample
and determining the actual age of each faculty member in the sample.
The parameter of interest is the “average” age of all faculty at the college.
The statistic is the “average” age for all faculty in the sample.
Variable: A characteristic about each individual element of a population or sample.
Data (singular): The value of the variable associated with
one element of a population or sample. This value may be
a number, a word, or a symbol.
Data (plural): The set of values collected for the variable
from each of the elements belonging to the sample.
Experiment: A planned activity whose results yield a set of
data.
Parameter: A numerical value summarizing all the data of
an entire population.
Statistic: A numerical value summarizing the sample data.
Types of data
One-way Table
Two-way Table (also known as Double Table)
Three-way Table (also known as Treble Table)
Binary data (Yes = 1, No = 0)

Responses:
     Coca cola   Redpoll   Juice   Coffee   Tea
A    no          yes       yes     no       yes
B    yes         yes       yes     no       no
C    yes         no        yes     yes      no

Coded:
     Coca cola   Redpoll   Juice   Coffee   Tea
A    0           1         1       0        1
B    1           1         1       0        0
C    1           0         1       1        0

Modalities table
How often do you drink these beverages?

Responses:
     Coca cola   Redpoll   Juice    Coffee   Tea
A    never       some      always   some     always
B    always      some      always   never    never
C    always      some      some     always   never

Coded:
     Coca cola   Redpoll   Juice   Coffee   Tea
A    0           2         3       2        3
B    3           2         3       0        0
C    3           2         2       3        0
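The coding used in the two tables can be sketched in Python. This is only an illustration, not part of the lecture: the variable and function names are hypothetical, and the code values (yes = 1, no = 0; never = 0, some = 2, always = 3) are copied from the tables above.

```python
# Hypothetical names; the responses and code values are copied from the tables.
BINARY_CODES = {"no": 0, "yes": 1}
MODALITY_CODES = {"never": 0, "some": 2, "always": 3}

binary_answers = {
    "A": ["no", "yes", "yes", "no", "yes"],
    "B": ["yes", "yes", "yes", "no", "no"],
    "C": ["yes", "no", "yes", "yes", "no"],
}
modality_answers = {
    "A": ["never", "some", "always", "some", "always"],
    "B": ["always", "some", "always", "never", "never"],
    "C": ["always", "some", "some", "always", "never"],
}

def encode(table, codes):
    """Replace every word in the table by its numeric code."""
    return {person: [codes[answer] for answer in row] for person, row in table.items()}

binary_coded = encode(binary_answers, BINARY_CODES)
modality_coded = encode(modality_answers, MODALITY_CODES)
```

The binary codes are labels only, while the modality codes preserve the order never < some < always.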


Two kinds of variables:
Qualitative, or Attribute, or Categorical, Variable: A variable that categorizes or describes an element of a population.

Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable.

Quantitative, or Numerical, Variable: A variable that quantifies an element of a population.

Note: Arithmetic operations, such as addition and averaging, are meaningful for data resulting from a quantitative variable.
Example: Identify each of the following examples as attribute (qualitative) or
numerical (quantitative) variables.
1. The residence hall for each student in a statistics class. (Attribute)
2. The amount of gasoline pumped by the next 10 customers at the local
Unimart. (Numerical)
3. The amount of radon in the basement of each of 25 homes in a new
development. (Numerical)
4. The color of the baseball cap worn by each of 20 students. (Attribute)
5. The length of time to complete a mathematics homework assignment.
(Numerical)
6. The state in which each truck is registered when stopped and inspected at a
weigh station. (Attribute)
Variables
  Qualitative: Nominal, Ordinal
  Quantitative: Discrete, Continuous

Nominal Variable: A qualitative variable that categorizes (or describes, or names) an element
of a population.
Ordinal Variable: A qualitative variable that incorporates an ordered position, or ranking.
Discrete Variable: A quantitative variable that can assume a countable number of values.
Intuitively, a discrete variable can assume values corresponding to isolated points along a line
interval. That is, there is a gap between any two values.
Continuous Variable: A quantitative variable that can assume an uncountable number of
values.
Intuitively, a continuous variable can assume any value along a line interval, including every
possible value between any two values.
Note:
In many cases, a discrete and continuous variable may be distinguished
by determining whether the variables are related to a count or a
measurement.
Discrete variables are usually associated with counting. If the variable
cannot be further subdivided, it is a clue that you are probably dealing with
a discrete variable.
Continuous variables are usually associated with measurements. The values of continuous variables are only limited by your ability to measure them.

Identify each of the following as examples of qualitative or numerical variables:
1. The temperature in Barrow, Alaska at 12:00 pm on any given day.
2. The make of automobile driven by each faculty member.
3. Whether or not a 6 volt lantern battery is defective.
4. The weight of a lead pencil.
5. The length of time billed for a long distance telephone call.
6. The brand of cereal children eat for breakfast.
7. The type of book taken out of the library by an adult.
Identify each of the following as examples of (1) nominal,
(2) ordinal, (3) discrete, or (4) continuous variables:
1. The length of time until a pain reliever begins to work.
2. The number of chocolate chips in a cookie.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer’s hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.
Data Collection
• First problem a statistician faces: how to obtain the data.
• It is important to obtain good, or representative, data.
• Inferences are made based on statistics obtained from the data.
• Inferences can only be as good as the data.

Biased Sampling Method: A sampling method that produces data which systematically differs from the sampled population. An unbiased sampling method is one that is not biased.

Sampling methods that often result in biased samples:
1. Convenience sample: sample selected from elements of a population that are easily accessible.
2. Volunteer sample: sample collected from those elements of the population which chose to contribute the needed information on their own initiative.


Process of data collection:
1. Define the objectives of the survey or experiment.
Example: Estimate the average life of an electronic component.
2. Define the variable and population of interest.
Example: Length of time for anesthesia to wear off after surgery.
3. Define the data-collection and data-measuring schemes. This includes sampling procedures, sample size, and the data-measuring device (questionnaire, scale, ruler, etc.).
4. Determine the appropriate descriptive or inferential data analysis techniques.

Methods used to collect data:
Experiment: The investigator controls or modifies the environment and observes the effect on the variable under study.
Survey: Data are obtained by sampling some of the population of interest. The investigator does not modify the environment.
Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and time-consuming to compile, and expensive.


Sampling Frame: A list of the elements belonging to the population from which
the sample will be drawn.
Note: It is important that the sampling frame be representative of the
population.
Sample Design: The process of selecting sample elements from the sampling
frame.
Note: There are many different types of sample designs.
Usually they all fit into two categories:
judgment samples and probability samples.

Judgment Samples: Samples that are selected on the basis of being “typical.” Items are selected that are representative of the population. The validity of the results from a judgment sample reflects the soundness of the collector’s judgment.

Probability Samples: Samples in which the elements to be selected are drawn on the basis of probability. Each element in a population has a certain probability of being selected as part of the sample.
Random Samples: A sample selected in such a way that every element in the
population has an equal probability of being chosen. Equivalently, all samples
of size n have an equal chance of being selected.
Random samples are obtained either by sampling with replacement from a
finite population or by sampling without replacement from an infinite
population.
Note:
1. Inherent in the concept of randomness: the next result (or occurrence) is
not predictable.
2. Proper procedure for selecting a random sample: use a random number
generator or a table of random numbers.

Example: An employer is interested in the time it takes each employee to commute to work each morning.
A random sample of 35 employees will be selected and their commuting time will be recorded.
There are 2712 employees. Each employee is numbered: 0001, 0002, 0003, etc. up to 2712.
Using four-digit random numbers, a sample is identified: 1315, 0987, 1125, etc.
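The selection in this example can be sketched with a random number generator in place of a table of random numbers. This is a minimal sketch, not the lecture's procedure: the function name and the `seed` parameter are hypothetical, and it assumes that no employee may be picked twice.

```python
import random

def draw_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct employee numbers from 1..population_size."""
    rng = random.Random(seed)  # seed is optional, used only for reproducibility
    numbers = rng.sample(range(1, population_size + 1), sample_size)
    # Format as four-digit identifiers such as 0987, as in the example.
    return [f"{n:04d}" for n in numbers]

sample_ids = draw_random_sample(2712, 35, seed=1)
```

`random.sample` draws without replacement, so each employee number appears at most once in `sample_ids`.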
Systematic Sample: A sample in which every kth item of the sampling frame is selected,
starting from the first element which is randomly selected from the first k elements.
Note: The systematic technique is easy to execute. However, it has some inherent dangers
when the sampling frame is repetitive or cyclical in nature. In these situations the results
may not approximate a simple random sample.
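A systematic sample can be sketched as follows. The function name, the example frame of 100 numbered items, and the `seed` parameter are assumptions made for illustration.

```python
import random

def systematic_sample(frame, k, seed=None):
    """Take every k-th item, starting at a random position among the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index of the randomly chosen first element
    return frame[start::k]

frame = list(range(1, 101))  # hypothetical sampling frame: items numbered 1..100
sample = systematic_sample(frame, k=10, seed=0)
```

If the frame had a cycle of length 10 (for example, every tenth item coming from the same machine), this sample would contain only one phase of that cycle, which is the danger noted above.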
Stratified Random Sample: A sample obtained by stratifying the sampling frame and then
selecting a fixed number of items from each of the strata by means of a simple random
sampling technique.
Proportional Sample (or Quota Sample): A sample obtained by stratifying the sampling
frame and then selecting a number of items in proportion to the size of the strata (or by
quota) from each stratum by means of a simple random sampling technique.
Cluster Sample: A sample obtained by stratifying the sampling frame and then selecting
some or all of the items from some of, but not all, the strata.
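The difference between a fixed number per stratum and a proportional allocation can be sketched as below. The strata, their sizes, and all names are hypothetical. Rounding the proportional counts is an assumption of this sketch, and in general the rounded counts need not add up exactly to the requested total.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Simple random sample of a fixed size from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(items, n_per_stratum) for name, items in strata.items()}

def proportional_sample(strata, total_n, seed=None):
    """Simple random sample from every stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    population_size = sum(len(items) for items in strata.values())
    sample = {}
    for name, items in strata.items():
        # Rounded proportional allocation (an assumption of this sketch).
        n = round(total_n * len(items) / population_size)
        sample[name] = rng.sample(items, n)
    return sample

# Hypothetical sampling frame with three strata of sizes 60, 30, and 10.
strata = {
    "first_year": list(range(0, 60)),
    "second_year": list(range(60, 90)),
    "third_year": list(range(90, 100)),
}
fixed = stratified_sample(strata, 2, seed=0)
proportional = proportional_sample(strata, 10, seed=0)
```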
Comparison of Probability and Statistics
Probability: Properties of the population are assumed known.
Answer questions about the sample based on these properties.
Statistics: Use information in the sample to draw a conclusion about
the population.

Example: A jar of M&M’s contains 100 candy pieces, 15 are red. A handful of 10
is selected.
Probability question: What is the probability that 3 of the 10 selected are red?
Example: A handful of 10 M&M’s is selected from a jar containing 1000 candy
pieces. Three M&M’s in the handful are red.
Statistics question: What is the proportion of red M&M’s in the entire jar?
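The two questions can be sketched numerically. For the probability question the jar is known, so the chance of 3 red pieces in a handful of 10 follows the hypergeometric distribution (drawing without replacement). For the statistics question, only the handful is known, and the sample proportion is the natural point estimate. Naming the hypergeometric model and the function name are additions of this sketch, not taken from the slides.

```python
from math import comb

def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes) when drawing without replacement."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))

# Probability question: 100 pieces, 15 red, handful of 10, exactly 3 red.
p_three_red = hypergeom_pmf(3, population=100, successes=15, draws=10)

# Statistics question: 3 red in a handful of 10 gives the estimate 3/10.
estimated_proportion = 3 / 10
```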
Statistical Description of Data

Statistics describes a numeric set of data by its
• Center
• Variability
• Shape
Statistics describes a categorical set of data by
• Frequency, percentage or proportion of each category
Some Definitions
Variable - any characteristic of an individual or entity. A variable can take different values for
different individuals. Variables can be categorical or quantitative. Per S. S. Stevens…
Nominal - Categorical variables with no inherent order or ranking sequence, such as names
or classes (e.g., gender). Values may be numerals, but they carry no numerical meaning (e.g., I, II,
III). The only operation that can be applied to Nominal variables is enumeration.
Ordinal - Variables with an inherent rank or order, e.g. mild, moderate, severe. Values can be
compared for equality, or for greater or less, but not for how much greater or less.
Interval - Values of the variable are ordered as in Ordinal and, additionally, differences
between values are meaningful; however, the scale is not absolutely anchored. Calendar
dates and temperatures on the Fahrenheit scale are examples. Addition and subtraction, but
not multiplication and division, are meaningful operations.
Ratio - Variables with all properties of Interval plus an absolute, non-arbitrary zero point,
e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are
all meaningful operations.

Distribution - (of a variable) tells us what values the variable takes and how often it takes
these values.
Unimodal - having a single peak
Bimodal - having two distinct peaks
Symmetric - left and right half are mirror images.
[Figure: A Taxonomy of Statistics]
Frequency Distribution
Consider a data set of 26 children aged 1-6 years. The frequency distribution of the
variable ‘age’ can be tabulated as follows:

Frequency Distribution of Age

Age          1    2    3    4    5    6
Frequency    5    3    7    5    4    2

Grouped Frequency Distribution of Age

Age Group    1-2    3-4    5-6
Frequency    8      12     6
Cumulative Frequency
Cumulative frequency of the data above:

Age                     1    2    3    4    5    6
Frequency               5    3    7    5    4    2
Cumulative Frequency    5    8    15   20   24   26

Age Group               1-2    3-4    5-6
Frequency               8      12     6
Cumulative Frequency    8      20     26
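The frequency, grouped frequency, and cumulative frequency tables can be reproduced with a few lines of Python. This is a minimal sketch: the list `ages` is rebuilt from the tabulated frequencies, since the raw data are not given, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Raw ages rebuilt from the frequency table (the original data are not listed).
ages = [1] * 5 + [2] * 3 + [3] * 7 + [4] * 5 + [5] * 4 + [6] * 2

counts = Counter(ages)
values = sorted(counts)                      # [1, 2, 3, 4, 5, 6]
frequency = [counts[v] for v in values]
cumulative = list(accumulate(frequency))     # running total of the frequencies

# Grouped distribution with the age groups used in the table.
groups = {"1-2": (1, 2), "3-4": (3, 4), "5-6": (5, 6)}
grouped = {label: sum(counts[a] for a in range(lo, hi + 1))
           for label, (lo, hi) in groups.items()}
grouped_cumulative = list(accumulate(grouped.values()))
```

The last cumulative frequency always equals the number of observations, here 26.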
Data Presentation
Two types of statistical presentation of data - graphical and numerical.

Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier.

Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
Data Presentation –Categorical Variable
Bar Diagram: Lists the categories and presents the percent or count of individuals who fall
in each category.

Figure 1: Bar Chart of Subjects in Treatment Groups (horizontal axis: Treatment Group; vertical axis: Number of Subjects)

Treatment Group    Frequency    Proportion         Percent (%)
1                  15           (15/60) = 0.250    25.0
2                  25           (25/60) = 0.417    41.7
3                  20           (20/60) = 0.333    33.3
Total              60           1.00               100
Data Presentation –Categorical Variable
Pie Chart: Lists the categories and presents the percent or count of individuals who fall in
each category.

Figure 2: Pie Chart of Subjects in Treatment Groups (slices: group 1, 25%; group 2, 42%; group 3, 33%)

Treatment Group    Frequency    Proportion         Percent (%)
1                  15           (15/60) = 0.250    25.0
2                  25           (25/60) = 0.417    41.7
3                  20           (20/60) = 0.333    33.3
Total              60           1.00               100
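Recomputing the proportions and percentages from the frequencies is a quick check on a table like this one. The sketch below uses only the three group frequencies (15, 25, 20); the variable names are hypothetical.

```python
# Frequencies of the three treatment groups, taken from the table.
frequency = {1: 15, 2: 25, 3: 20}

total = sum(frequency.values())
proportion = {group: count / total for group, count in frequency.items()}
percent = {group: 100 * p for group, p in proportion.items()}

for group in frequency:
    print(group, frequency[group], round(proportion[group], 3), round(percent[group], 1))
```

Proportions must sum to 1 and percentages to 100, which makes a swapped or mistyped entry easy to spot.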
Graphical Presentation –Numerical Variable

Histogram: The overall pattern can be described by its shape, center, and spread. The
following age distribution is right skewed. The center lies between 80 and 100. There are no
outliers.

Figure 3: Age Distribution (histogram; horizontal axis: Age in Month; vertical axis: Number of Subjects)

Mean                  90.41666667
Standard Error        3.902649518
Median                84
Mode                  84
Standard Deviation    30.22979318
Sample Variance       913.8403955
Kurtosis              -1.183899591
Skewness              0.389872725
Range                 95
Minimum               48
Maximum               143
Sum                   5425
Count                 60
Graphical Presentation –Numerical Variable

Box-Plot: Describes the five-number summary.

Figure 4: Distribution of Age (box plot marking min, q1, median, q3, and max; vertical axis from 0 to 160)
Numerical Presentation

A fundamental concept in summary statistics is that of a central value for a set
of observations and the extent to which the central value characterizes the
whole set of data. Measures of central value such as the mean or median must
be coupled with measures of data dispersion (e.g., average distance from the
mean) to indicate how well the central value characterizes the data as a whole.

To understand how well a central value characterizes a set of observations, let
us consider the following two sets of data:
A: 30, 50, 70
B: 40, 50, 60
The mean of both data sets is 50. But the distance of the observations from
the mean in data set A is larger than in data set B. Thus, the mean of data
set B is a better representation of its data set than is the case for set A.
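The comparison of A and B can be made concrete with the average distance from the mean mentioned above. This is a small sketch with hypothetical function names.

```python
def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)

def mean_abs_deviation(xs):
    """Average absolute distance of the observations from their mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

A = [30, 50, 70]
B = [40, 50, 60]

print(mean(A), mean_abs_deviation(A))  # same mean as B, larger spread
print(mean(B), mean_abs_deviation(B))
```

Both sets have mean 50, but the average distance from the mean is 40/3 for A and 20/3 for B, so the mean summarizes B more faithfully.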
Methods of Center Measurement
Center measurement is a summary measure of the overall level of a dataset.

Commonly used methods are the mean, median, mode, geometric mean, etc.

Mean: The sum of all the observations divided by the number of observations. The mean
of 20, 30, 40 is (20+30+40)/3 = 30.

Notation: Let x_1, x_2, ..., x_n be n observations of a variable x. Then the mean of this variable is

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n} \]
Median: The middle value in an ordered sequence of observations. That is, to find
the median we need to order the data set and then find the middle value. In case of
an even number of observations the average of the two middle most values is the
median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations
is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle
values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.

Mode: The value that is observed most frequently. The mode is undefined for
sequences in which no observation is repeated.

Mean or Median

The median is less sensitive to outliers (extreme scores) than the mean and is thus a
better measure than the mean for highly skewed distributions, e.g. family income. For
example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these
four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40.
So, the mean 270 really fails to give a realistic picture of the major part of the data. It
is influenced by the extreme value 990.
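The worked examples for the mean, median, and mode can be checked with Python's standard `statistics` module. The helper `mode_or_none` is a hypothetical function written for this sketch so that the mode is reported as undefined (`None`) when no observation is repeated, as stated above. If several values tie for most frequent, it returns the one encountered first.

```python
from collections import Counter
from statistics import mean, median

def mode_or_none(xs):
    """Most frequent value, or None when no observation is repeated."""
    value, count = Counter(xs).most_common(1)[0]
    return value if count > 1 else None

print(mean([20, 30, 40]))            # mean of the first example
print(median([9, 3, 6, 7, 5]))       # odd number of observations
print(median([9, 3, 6, 7, 5, 2]))    # even number: average of the two middle values
print(mode_or_none([9, 3, 6, 7, 5])) # no repeated value, so the mode is undefined

# Effect of one extreme value on the mean and on the median.
incomes = [20, 30, 40, 990]
print(mean(incomes), median(incomes))
```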
Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods: range, variance, standard deviation, interquartile range, coefficient of variation, etc.

Range: The difference between the largest and the smallest observations. The range of
10, 5, 2, 100 is (100-2)=98. It’s a crude measure of variability.
Variance: The variance of a set of observations is, roughly, the average of the squares of the
deviations of the observations from their mean (the sum of squared deviations divided by n - 1). In symbols, the variance of the n
observations x_1, x_2, ..., x_n is

\[ S^2 = \frac{(x_1 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} \]

Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

\[ \frac{(5 - 5)^2 + (3 - 5)^2 + (7 - 5)^2}{3 - 1} = 4 \]

Standard Deviation: Square root of the variance. The standard deviation of the above
example is 2.
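The variance and standard deviation of the example can be computed directly from the definition with the n - 1 denominator. The function names below are hypothetical.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Square root of the sample variance."""
    return sqrt(sample_variance(xs))

data = [5, 7, 3]
print(sample_variance(data), sample_std(data))
```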
Quartiles: Data can be divided into four regions that cover the total range of observed
values. Cut points for these regions are known as quartiles.
In notation, the qth quartile of a data set is the (q(n+1)/4)th observation of the ordered data, where q is
the desired quartile and n is the number of observations.
The first quartile (Q1) is the cut point below which the first 25% of the data lie. The second quartile (Q2) is the
cut point at 50% of the data, that is, the median.
The third quartile (Q3) is the cut point at 75% of the data, so that 25% of the data lie between the median and Q3.

Q1 is the median of the first half of the ordered observations and Q3 is the median of
the second half of the ordered observations.
In the following example, Q1 is the ((15+1)/4)×1 = 4th observation of the ordered data. The 4th observation is
11, so Q1 of this data is 11.
An example with 15 numbers:
3 6 7 11 13 22 30 40 44 50 52 61 68 80 94
(Q1 is the 4th value, Q2 is the 8th value, and Q3 is the 12th value.)
The first quartile is Q1=11. The second quartile is Q2=40 (this is also the median). The
third quartile is Q3=61.

Inter-quartile Range: Difference between Q3 and Q1. The inter-quartile range of the previous
example is 61 - 11 = 50. The middle half of the ordered data lie between 11 and 61.
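The position rule q(n+1)/4 can be sketched in Python. The function name is hypothetical. When the position is not a whole number, the sketch interpolates linearly between the two neighbouring observations; that choice is an assumption, since the slides only show a case where the positions are whole numbers.

```python
def quartile(data, q):
    """q-th quartile (q = 1, 2, 3) using the position q * (n + 1) / 4."""
    xs = sorted(data)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 observations for this rule")
    pos = q * (n + 1) / 4        # 1-based position in the ordered data
    lower = int(pos)             # whole part of the position
    frac = pos - lower           # fractional part of the position
    if frac == 0:
        return xs[lower - 1]
    # Assumption of this sketch: linear interpolation between neighbours.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
q1, q2, q3 = (quartile(data, q) for q in (1, 2, 3))
iqr = q3 - q1
```

For these 15 numbers the positions are 4, 8, and 12, so no interpolation is needed and the inter-quartile range is Q3 - Q1.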
Deciles and Percentiles
Deciles: If data is ordered and divided into 10 parts, then the cut points are called deciles.

Percentiles: If data is ordered and divided into 100 parts, then the cut points are called
percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th
percentile of the data is Q3.

In notation, the pth percentile of a data set is the (p(n+1)/100)th observation of the ordered data, where p
is the desired percentile and n is the number of observations.

Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually
expressed in percent.

\[ \text{Coefficient of Variation} = \frac{S}{\bar{x}} \times 100 \]
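As a small illustration, the coefficient of variation of the earlier example 5, 7, 3 (mean 5, standard deviation 2) can be computed with the standard `statistics` module. The function name is hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample standard deviation divided by the mean, in percent."""
    return 100 * stdev(xs) / mean(xs)

cv = coefficient_of_variation([5, 7, 3])
print(cv)
```

Because the result is a ratio, it has no unit, which makes it usable for comparing the spread of variables measured on different scales.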
Five Number Summary

Five Number Summary: The five number summary of a distribution consists of the
smallest (Minimum) observation, the first quartile (Q1), the median (Q2), the third quartile (Q3),
and the largest (Maximum) observation, written in order from smallest to largest.

Box Plot: A box plot is a graph of the five number summary. The central box spans
the quartiles. A line within the box marks the median. Lines extending above and
below the box mark the smallest and the largest observations (i.e., the range).
Outlying samples may be additionally plotted outside the range.
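A five number summary can be sketched with `statistics.quantiles` from the Python standard library. Its default `"exclusive"` method places the quartiles at positions q(n+1)/4 in the ordered data. The function name and the reuse of the 15-number example are choices made for this sketch.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return (min(data), q1, q2, q3, max(data))

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]
summary = five_number_summary(data)
print(summary)
```

These five values are exactly what a box plot draws: the ends of the lines, the edges of the box, and the line inside it.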
[Figure: Boxplot, Distribution of Age in Month, marking min, q1, median, q3, and max; vertical axis from 0 to 160]
Choosing a Summary
The five number summary is usually better than the mean and standard deviation for
describing a skewed distribution or a distribution with extreme outliers. The mean and
standard deviation are reasonable for symmetric distributions that are free of outliers.

In real life we can’t always expect symmetry of the data. It is common practice to report the
number of observations (n), mean, median, standard deviation, and range when
summarizing data. Other summary statistics, such as Q1, Q3, and the
coefficient of variation, can be included if they are considered important for describing the data.
Shape of Data

• Shape of data is measured by
– Skewness
– Kurtosis
Skewness
• Measures asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail

Let x_1, x_2, ..., x_n be n observations. Then,

\[ \text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}} \]
Kurtosis
• Measures peakedness of the distribution of data. The
kurtosis of the normal distribution is 0.

Let x_1, x_2, ..., x_n be n observations. Then,

\[ \text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} - 3 \]
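Both shape measures can be sketched in Python. The sketch assumes the moment-based definitions, with the square root of n in the numerator of the skewness and with 3 subtracted from the kurtosis so that a normal distribution scores 0. Spreadsheet tools often apply additional small-sample corrections, so their output can differ from these values. The function names and the test data are hypothetical.

```python
from math import sqrt

def _deviations(xs):
    """Deviations of the observations from their mean."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def skewness(xs):
    """sqrt(n) * sum(d^3) / (sum(d^2))^(3/2), with d the deviations from the mean."""
    d = _deviations(xs)
    n = len(xs)
    return sqrt(n) * sum(v ** 3 for v in d) / sum(v ** 2 for v in d) ** 1.5

def kurtosis(xs):
    """n * sum(d^4) / (sum(d^2))^2 - 3 (excess kurtosis)."""
    d = _deviations(xs)
    n = len(xs)
    return n * sum(v ** 4 for v in d) / sum(v ** 2 for v in d) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(skewness(symmetric), skewness(right_skewed), kurtosis(symmetric))
```

A symmetric data set has skewness 0, and a long right tail gives a positive value.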
Summary of the Variable ‘Age’ in the given data set

Mean                  90.41666667
Standard Error        3.902649518
Median                84
Mode                  84
Standard Deviation    30.22979318
Sample Variance       913.8403955
Kurtosis              -1.183899591
Skewness              0.389872725
Range                 95
Minimum               48
Maximum               143
Sum                   5425
Count                 60

[Figure: Histogram of Age (horizontal axis: Age in Month; vertical axis: Number of Subjects)]

[Figure: Boxplot of Age in Month (vertical axis: Age (month))]
