Probability and statistics

Uploaded by

hgull8490
Copyright
© © All Rights Reserved

Chapter 1 Introduction to Statistics

Statistics
Statistics is the art of learning from data. It is concerned with the
collection of data, its description, and its analysis, which often leads to
the drawing of conclusions.

Data Collection & Descriptive Statistics
Sometimes a statistical analysis begins with a given set of data: For instance, the
government regularly collects and publicizes data concerning earthquake occurrences, the
unemployment rate and the rate of inflation. Statistics can be used to describe,
summarize, and analyze these data.
In some situations, data are not yet available; in such cases statistical theory can be
used to design an appropriate experiment to generate data.
For instance, suppose that an instructor is interested in determining which of two
different methods for teaching computer programming to beginners is more effective. To
study this question, the instructor might divide the students into two groups, and use a
different teaching method for each group. At the end of the class the students can be
tested and the scores of the members of the different groups compared. If the data,
consisting of the test scores of members of each group, are significantly higher in one of
the groups, then it might seem reasonable to suppose that the teaching method used for
that group is superior.
It is important to note, however, that in order to be able to draw a valid conclusion from
the data, it is essential that the students were divided into groups in such a manner that
neither group was more likely to have the students with greater natural aptitude for
programming. For instance, the instructor should not have let the male class members be
one group and the females the other. For if so, then even if the women scored
significantly higher than the men, it would not be clear whether this was due to the
method used to teach them, or to the fact that women may be inherently better than men
at learning programming
skills. The accepted way of avoiding this pitfall is to divide the class members into the
two groups “at random.” This term means that the division is done in such a manner
that all possible choices of the members of a group are equally likely.
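
As a rough sketch of dividing a class "at random", the snippet below shuffles a roster and cuts it in half, so that every possible choice of group members is equally likely. The roster names, the function name `split_at_random`, and the fixed seed are illustrative assumptions, not part of the original example.

```python
import random

def split_at_random(members, seed=None):
    """Shuffle the members and cut the list in half.

    A uniform shuffle makes every possible first group equally likely.
    """
    rng = random.Random(seed)      # seed only for reproducibility
    shuffled = list(members)       # copy so the input is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical roster of 10 students
students = [f"student_{i}" for i in range(1, 11)]
group_a, group_b = split_at_random(students, seed=0)
print(group_a)
print(group_b)
```
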
At the end of the experiment, the data should be described. For instance, the scores of
the two groups should be presented. In addition, summary measures such as the average
score of members of each of the groups should be presented. This part of statistics,
concerned with the description and summarization of data, is called descriptive statistics.

Inferential Statistics & Probability Models
After the preceding experiment is completed and the data are described and summarized,
we hope to be able to draw a conclusion about which teaching method is superior. This
part of statistics, concerned with the drawing of conclusions, is called inferential statistics.
To be able to draw a conclusion from the data, we must take into account the possibility of
chance. For instance, suppose that the average score of members of the first group is quite
a bit higher than that of the second. Can we conclude that this increase is due to the
teaching method used? Or is it possible that the teaching method was not responsible for
the increased scores but rather that the higher scores of the first group were just a chance
occurrence? For instance, the fact that a coin comes up heads 7 times in 10 flips does not
necessarily mean that the coin is more likely to come up heads than tails in future flips.
Indeed, it could be a perfectly ordinary coin that, by chance, just happened to land heads 7
times out of the total of 10 flips. (On the other hand, if the coin had landed heads 47 times
out of 50 flips, then we would be quite certain that it was not an ordinary coin.)
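
To put numbers on the coin example, one possible sketch assumes a fair coin (probability 0.5 of heads) with independent flips, which is a binomial probability model, and computes the chance of seeing at least that many heads. The function name `prob_at_least` is an illustrative choice.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(at least k heads in n independent flips), heads probability p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

p_7_of_10 = prob_at_least(7, 10)    # 176/1024
p_47_of_50 = prob_at_least(47, 50)  # 20876/2**50

print(f"{p_7_of_10:.4f}")
print(f"{p_47_of_50:.2e}")
```

Under these assumptions, 7 or more heads in 10 flips happens about 17% of the time with an ordinary coin, while 47 or more heads in 50 flips has a probability on the order of 10^-11.
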
To be able to draw logical conclusions from data, we usually make some assumptions
about the chances (or probabilities) of obtaining the different data values. The totality of
these assumptions is referred to as a probability model for the data.

Population & Samples
In statistics, we are interested in obtaining information about a total collection of elements,
which we will refer to as the population. The population is often too large for us to
examine each of its members. For instance, we might have all the residents of a given state,
or all the television sets produced in the last year by a particular manufacturer, or all the
households in a given community. In such cases, we try to learn about the population by
choosing and then examining a subgroup of its elements. This subgroup of a population is
called a sample.
For the sample to be informative about the population, it must be representative of that population.

Chapter 2 Descriptive Statistics

Describing Data Sets

The numerical findings of a study should be presented clearly, concisely,
and in such a manner that an observer can quickly obtain a feel for the
essential characteristics of the data. Over the years it has been found that
tables and graphs are particularly useful ways of presenting data, often
revealing important features such as the range and symmetry of the data.

Frequency Tables & Graphs
A data set having a relatively small number of distinct values can be conveniently
presented in a frequency table. For instance, Table 1 is a frequency table for a data
set consisting of the starting yearly salaries (to the nearest thousand dollars) of 42
recently graduated students with B.S. degrees in computer science. Table 1 tells us,
among other things, that the lowest starting salary of $47,000 was received by four
of the graduates, whereas the highest salary of $60,000 was received by a single
student. The most common starting salary was $52,000, and was received by 10 of
the students.
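
A frequency table of this kind can be built by counting how often each distinct value occurs. The sketch below uses a small made-up list of salaries (in thousands of dollars); it is not the data of Table 1.

```python
from collections import Counter

# Hypothetical starting salaries, in thousands of dollars (not Table 1)
salaries = [47, 47, 50, 51, 51, 52, 52, 52, 54, 60]

freq = Counter(salaries)  # maps each distinct value to its frequency

print("Salary  Frequency")
for value in sorted(freq):
    print(f"{value:>6}  {freq[value]:>9}")

lowest, highest = min(freq), max(freq)
most_common_value, most_common_count = freq.most_common(1)[0]
print(lowest, highest, most_common_value, most_common_count)
```
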

Table 1

Data from a frequency table can be graphically represented by a line graph that plots the
distinct data values on the horizontal axis and indicates their frequencies by the heights of
vertical lines. A line graph of the data presented in Table 1 is shown in Figure 1.

Figure 1

When the lines in a line graph are given added thickness, the graph is called a bar graph.
Figure 2 presents a bar graph.

Figure 2

Another type of graph used to represent a frequency table is the frequency polygon, which
plots the frequencies of the different data values on the vertical axis, and then connects
the plotted points with straight lines. Figure 3 presents a frequency polygon for the data of
Table 1.

Figure 3

Relative Frequency Tables & Graphs
Consider a data set consisting of 𝑛 values. If 𝑓 is the frequency of a particular value,
then the ratio 𝑓/𝑛 is called its relative frequency. That is, the relative frequency of a
data value is the proportion of the data that have that value. The relative frequencies
can be represented graphically by a relative frequency line or bar graph or by a
relative frequency polygon. Indeed, these relative frequency graphs will look like the
corresponding graphs of the absolute frequencies except that the labels on the
vertical axis are now the old labels (that gave the frequencies) divided by the total
number of data points.
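
The ratio 𝑓/𝑛 can be computed directly from a frequency count. The following is a minimal sketch on made-up data; exact fractions are used so that the relative frequencies visibly sum to 1.

```python
from collections import Counter
from fractions import Fraction

data = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical data set
n = len(data)

# relative frequency of each value = its frequency f divided by n
rel_freq = {v: Fraction(f, n) for v, f in sorted(Counter(data).items())}

for value, r in rel_freq.items():
    print(value, r, float(r))
```
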

Table 4 is a relative frequency table for the data of Table 1. The relative frequencies
are obtained by dividing the corresponding frequencies of Table 1 by 42, the size of
the data set.

Figure 4

Pie Chart
We can construct a pie chart by dividing a circle into various sections or slices. It
should be used when we want to compare individual categories with the whole. If
you want to compare the values of categories with each other, a bar chart may be
more useful.
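
One common way to size the slices is to give each category a central angle equal to its share of the total times 360 degrees. The sketch below applies that rule to a made-up budget; the categories and amounts are assumptions for illustration only.

```python
# Hypothetical yearly budget by category (illustrative amounts)
budget = {"Food": 4000, "Rent": 6000, "Education": 2000}

total = sum(budget.values())

# central angle of each slice = (category amount / total) * 360 degrees
angles = {category: 360 * amount / total for category, amount in budget.items()}

for category, angle in angles.items():
    print(f"{category}: {angle:.1f} degrees")
```
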

Problem

The following table shows the yearly budget of a family

Draw a pie chart to represent the above information.

Solution

From the table, we obtain the required pie chart as shown below.

Grouped data, histograms, ogives, and stem and leaf plots
Using a line or a bar graph to plot the frequencies of data values is often an effective
way of portraying a data set. However, for some data sets the number of distinct
values is too large to utilize this approach. Instead, in such cases, it is useful to
divide the values into groupings, or class intervals, and then plot the number of data
values falling in each class interval. The number of class intervals chosen should be
a trade-off between (1) choosing too few classes at a cost of losing too much
information about the actual data values in a class and (2) choosing too many
classes, which will result in the frequencies of each class being too small for a
pattern to be discernible.
The endpoints of a class interval are called the class boundaries. We will adopt the
left end inclusion convention, which stipulates that a class interval contains its left-
end but not its right-end boundary point. Thus, for instance, the class interval 20–30
contains all values that are both greater than or equal to 20 and less than 30.
Table 2 (on next slide) presents the lifetimes of 200 incandescent lamps. A class
frequency table for the data of Table 2 is presented in Table 3. The class intervals
are of length 100, with the first one starting at 500.
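
The left-end inclusion convention can be sketched as follows: each value is assigned to the interval whose left boundary is the largest boundary not exceeding it. The lifetimes below are made up (they are not the data of Table 2), and the function name `class_frequencies` is an illustrative choice.

```python
from collections import Counter

def class_frequencies(values, start, width):
    """Count values per class interval [left, left + width).

    Left-end inclusion: a value equal to a boundary goes into the
    interval that starts at that boundary.
    """
    counts = Counter()
    for v in values:
        left = start + ((v - start) // width) * width
        counts[(left, left + width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical lamp lifetimes in hours (not Table 2)
lifetimes = [521, 600, 699, 700, 845, 899, 900, 1102]
table = class_frequencies(lifetimes, start=500, width=100)

for (left, right), count in table.items():
    print(f"{left}-{right}: {count}")
```

Note that 600 falls in the class 600-700 and 700 falls in 700-800, as the convention requires. Classes with no values are simply not listed in this sketch.
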

Table 2

Table 3

A bar graph plot of class data, with the bars placed adjacent to each other, is called
a histogram. The vertical axis of a histogram can represent either the class
frequency or the relative class frequency; in the former case the graph is called a
frequency histogram and in the latter a relative frequency histogram.

An efficient way of organizing a small- to moderate-sized data set is to utilize a
stem and leaf plot. Such a plot is obtained by first dividing each data value into two
parts: its stem and its leaf. For instance, if the data are all two-digit numbers, then
we could let the stem part of a data value be its tens digit and let the leaf be its ones
digit. Thus, for instance, the value 62 is expressed as

Stem | Leaf
   6 | 2

and the two data values 62 and 67 can be represented as

Stem | Leaf
   6 | 2, 7

Example

Table 4 (on next slide) presents the per capita personal income for each of the 50
states and the District of Columbia. We can represent the data by a stem and leaf
plot.

Table 4

The data presented in Table 4 are represented in the following stem-and-leaf plot.
Note that the values of the leaves are put in the plot in increasing order.
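
A stem and leaf plot along these lines can be sketched in a few lines of code. The example assumes non-negative integer data, takes the stem to be everything except the ones digit, and uses made-up values rather than the data of Table 4; the function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves}, where leaf = ones digit, stem = the rest.

    Stems with no leaves are kept, with an empty list, so gaps stay visible.
    """
    stems = defaultdict(list)
    for v in sorted(values):           # sorting puts leaves in increasing order
        stems[v // 10].append(v % 10)
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

data = [62, 67, 58, 71, 75, 75, 93]    # hypothetical two-digit values
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {', '.join(str(leaf) for leaf in leaves)}")
```
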

Example

The following stem-and-leaf plot represents the weights of 80 attendees at a
sporting convention. The stem represents the hundreds and tens digits, and
the leaves are the ones digit.

The numbers in parentheses on the right represent the number of values in each
stem class. These summary numbers are often useful. They tell us, for instance, that
there are 10 values having stem 16; that is, 10 individuals have weights between 160
and 169. Note that a stem without any leaves (such as stem value 23) indicates that
there are no occurrences in that class.

Summarizing data sets

Modern-day experiments often deal with huge sets of data. To obtain a feel for such
a large amount of data, it is useful to be able to summarize it by some suitably
chosen measures.

Sample mean, sample median, and sample mode
Let’s introduce some statistics that are useful for describing the center of a set of data values.

Note: The term statistic is used for any numerical quantity computed from a data set.

Consider the data

A physical interpretation of the sample mean demonstrates how it
assesses the center of a sample. Think of each dot in the dotplot as
representing a 1 kg weight. Then a fulcrum placed with its tip on the
horizontal axis will balance precisely when it is located at x̄. So the
sample mean can be regarded as the balance point of the distribution of
observations.

Problem
The numbers of iPhones sold daily by a small company over the past 6 days have been
arranged in the following frequency table:

What is the sample mean?

Solution
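
The frequency table for this problem is not reproduced here, so the sketch below only illustrates the usual pattern on made-up numbers: when data are given as a frequency table, the sample mean is the sum of each value times its frequency, divided by the total number of observations. The table contents and the function name are assumptions.

```python
def mean_from_frequencies(freq_table):
    """Sample mean from a {value: frequency} table."""
    n = sum(freq_table.values())  # total number of observations
    return sum(value * f for value, f in freq_table.items()) / n

# Hypothetical table: units sold -> number of days (6 days in total)
sales = {3: 1, 4: 3, 5: 2}
x_bar = mean_from_frequencies(sales)
print(round(x_bar, 3))
```
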

Another statistic used to indicate the center of a data set is the sample median;
loosely speaking, it is the middle value when the data set is arranged in increasing
order.

Order the data values from smallest to largest. If the number of data values is odd, then
the sample median is the middle value in the ordered list; if it is even, then the sample
median is the average of the two middle values.

Problem

The following data represent the number of weeks it took seven individuals to
obtain their driver’s licenses. Find the sample median.

2, 110, 5, 7, 6, 7, 3
Solution
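
One way to work this out is to apply the ordering rule stated earlier directly to the seven values. The function name `sample_median` is an illustrative choice.

```python
def sample_median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

weeks = [2, 110, 5, 7, 6, 7, 3]
print(sorted(weeks))         # [2, 3, 5, 6, 7, 7, 110]
print(sample_median(weeks))  # 6
```

The ordered list has seven values, so the sample median is the fourth one, which is 6 weeks.
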

Problem

Solution

Comparison of Mean with Median
The sample mean and sample median are both useful statistics for describing
the central tendency of a data set. The sample mean, being the arithmetic
average, makes use of all the data values. The sample median, which makes
use of only one or two middle values, is not affected by extreme values.
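
The effect of an extreme value can be seen by reusing the driver's license data given earlier (2, 110, 5, 7, 6, 7, 3) and comparing both statistics with and without the value 110. This is only an illustrative check using Python's standard `statistics` module.

```python
from statistics import mean, median

weeks = [2, 110, 5, 7, 6, 7, 3]
without_extreme = [w for w in weeks if w != 110]  # drop the extreme value

mean_all, median_all = mean(weeks), median(weeks)
mean_trimmed, median_trimmed = mean(without_extreme), median(without_extreme)

print(mean_all, median_all)          # mean 20, median 6
print(mean_trimmed, median_trimmed)  # mean 5, median 5.5
```

Removing the single extreme value moves the mean from 20 to 5, while the median only moves from 6 to 5.5.
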

Mode

Problem

Solution
