Probability Density Functions
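[Figure: PDF f(x) versus x, with the f(x) axis running from 0 to 0.03; the area under the curve between x = a and x = b represents $P(a < x \le b)$.]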
o If $a \to -\infty$ and $b \to +\infty$, the probability must equal 1 (100%), i.e.,
$P(-\infty < x < \infty) = \int_{x=-\infty}^{x=\infty} f(x)\,dx = 1$.
In other words, the probability that x lies between $-\infty$ and $+\infty$ is 100% (a fact that should be obvious,
since there are no other possibilities for real number x).
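The normalization condition can be checked numerically. The sketch below is only an illustration: the PDF `f` is a hypothetical choice (a normal PDF with an assumed mean of 10.0 and standard deviation of 3.0), and the helper `integrate` is a simple trapezoidal rule, not something taken from these notes. Finite limits of about ten standard deviations on either side stand in for the infinite limits.

```python
import math

def f(x, mu=10.0, sigma=3.0):
    """Hypothetical PDF used only for illustration (normal PDF, assumed mu and sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, a, b, n=10_000):
    """Trapezoidal-rule approximation of the integral of g from a to b."""
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * h)
    return total * h

# Limits of mu -/+ 10 sigma stand in for -infinity and +infinity.
total_probability = integrate(f, -20.0, 40.0)

# Probability that x lies in a finite range, here a = 4 and b = 6.
p_4_to_6 = integrate(f, 4.0, 6.0)

print(f"integral of f(x) over all x = {total_probability:.6f}")
print(f"P(4 < x <= 6)              = {p_4_to_6:.6f}")
```

Any valid PDF should give a total of 1 in the first integral; only the probability over a finite range depends on the particular PDF chosen.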
o Once we have defined the probability density function f(x), we leave the system of discrete random
variables and enter the system of continuous random variables, for which we make some more formal
definitions:
Expected value is defined in terms of the probability density function as the mean of all possible x
values in the continuous system. Namely, $\mu = \text{expected value} = E(x) = \int_{-\infty}^{\infty} x f(x)\,dx$. In an ideal
situation in which f(x) exactly represents the population, $\mu$ is the mean of the entire population of x
values, and that is why it is called the expected value. It is therefore also called the population
mean. In general, $\bar{x} \ne \mu$, but $\bar{x} \to \mu$ when n is large, i.e., the sample mean approaches the
expected value when n is large. $\bar{x}$ and $\mu$ are often used interchangeably, but this should be done
only if n is large.
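The difference between the expected value and the sample mean can be illustrated with a short sketch. Everything specific here is an assumption made for the illustration: the population is taken to follow a normal PDF with mean `MU` = 10.0 and standard deviation `SIGMA` = 3.0, the expected value is approximated with a midpoint rule over finite limits, and the samples are drawn with a fixed random seed.

```python
import math
import random

MU, SIGMA = 10.0, 3.0  # hypothetical population parameters (assumed)

def f(x):
    """Hypothetical PDF for illustration: normal PDF with mean MU and std SIGMA."""
    return math.exp(-0.5 * ((x - MU) / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def expected_value(pdf, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of x * pdf(x) from a to b."""
    h = (b - a) / n
    return sum((a + (i + 0.5) * h) * pdf(a + (i + 0.5) * h) for i in range(n)) * h

# Expected value from the PDF (finite limits stand in for +/- infinity).
mu = expected_value(f, MU - 10 * SIGMA, MU + 10 * SIGMA)

# Sample means for increasing sample size n.
random.seed(1)  # fixed seed so the run is repeatable
sample_means = {}
for n in (10, 1_000, 100_000):
    sample = [random.gauss(MU, SIGMA) for _ in range(n)]
    sample_means[n] = sum(sample) / n

for n, x_bar in sample_means.items():
    print(f"n = {n:>6}: sample mean = {x_bar:.3f}, |x_bar - mu| = {abs(x_bar - mu):.3f}")
```

The gap between the sample mean and the expected value generally shrinks as n grows, although for any single sample it need not decrease monotonically.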
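Standard deviation is defined in terms of the PDF as
$\sigma = \text{standard deviation} = \sqrt{\int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx}$.
o A normalized PDF is defined through the transformations
$z = \dfrac{x - \mu}{\sigma}$ and $f(z) = \sigma f(x)$.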
o The above transformations accomplish two things:
The first transformation normalizes the abscissa such that the PDF is centered around z = 0.
The second transformation normalizes the ordinate such that the PDF is spread out in similar fashion
regardless of the value of standard deviation.
o When normalized in this way, the normalized PDF can be directly compared to standard PDFs, which we
discuss in a later learning module.
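A small numerical sketch may help show what the normalization does. It assumes a hypothetical PDF (normal, with mean 10.0 and standard deviation 3.0, chosen only for illustration), tabulates it on a uniform grid, computes the mean and standard deviation from the integral definitions with a midpoint rule, and then applies z = (x - mean)/std and f(z) = std * f(x). The variable names are illustrative, not from the notes.

```python
import math

MU_TRUE, SIGMA_TRUE = 10.0, 3.0  # hypothetical parameters (assumed for illustration)

def f(x):
    """Hypothetical PDF: normal PDF with the assumed mean and standard deviation."""
    return math.exp(-0.5 * ((x - MU_TRUE) / SIGMA_TRUE) ** 2) / (
        SIGMA_TRUE * math.sqrt(2.0 * math.pi)
    )

# Tabulate f(x) at the midpoints of a uniform grid (finite limits replace +/- infinity).
n = 6000
x_lo, x_hi = -20.0, 40.0
dx = (x_hi - x_lo) / n
xs = [x_lo + (i + 0.5) * dx for i in range(n)]
fx = [f(x) for x in xs]

# Mean and standard deviation from the integral definitions (midpoint rule).
mu = sum(x * p for x, p in zip(xs, fx)) * dx
sigma = math.sqrt(sum((x - mu) ** 2 * p for x, p in zip(xs, fx)) * dx)

# The two transformations: abscissa, then ordinate.
zs = [(x - mu) / sigma for x in xs]
fz = [sigma * p for p in fx]
dz = dx / sigma  # grid spacing in z

area_z = sum(fz) * dz
mean_z = sum(z * p for z, p in zip(zs, fz)) * dz
std_z = math.sqrt(sum((z - mean_z) ** 2 * p for z, p in zip(zs, fz)) * dz)

print(f"mu = {mu:.4f}, sigma = {sigma:.4f}")
print(f"area under f(z) = {area_z:.4f}, mean of z = {mean_z:.4f}, std of z = {std_z:.4f}")
```

Under these assumptions the normalized PDF keeps unit area, is centered at z = 0, and has unit standard deviation, which is what makes it comparable across data sets.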
o To summarize, here are several steps used in Excel to generate a normalized PDF of experimental data:
1. Generate the histogram with Excel as discussed in the histogram learning module. Excel generates a
table called a frequency table. The table contains two columns, bin and frequency. Bin is the
maximum value of the range of each bin, and frequency is the number of data points in that bin range.
(For example, suppose there are 200 data points total, the mean value of x is 10.0, and the standard
deviation of the data set is 3.0. Also suppose that 8 of those data points lie in the bin with x between
4 and 6 ($4 < x \le 6$). Thus, for this bin, Bin = 6 and Frequency = 8.)
2. Create a new column called probability in which you divide each frequency by the total number of
data points. This gives the probability that a data point lies in that bin, i.e., probability = frequency/n.
(In the example here, probability = 8/200 = 0.040 or 4.0%.)
3. Create a new column called $x_\text{mid}$ in which you list the mid value of each bin:
$x_\text{mid} = (x_\text{min} + x_\text{max})/2$.
(In the example here, the mid value of the sample bin is (4 + 6)/2 = 5.0.)
4. Create a new column called f(x) in which you divide each probability by the appropriate bin width,
i.e., $f(x) = \text{probability}/\Delta x$.
(In the example here, the bin width of the sample bin is $\Delta x$ = 6 − 4 = 2, and f(x) = 0.04/2 = 0.02 at
x = $x_\text{mid}$ = 5.0.) A smoothed plot of f(x) versus x is the PDF.
5. Create a new column called z in which you normalize the x values into nondimensional z values.
This is accomplished by converting each mid value of x into z: $z = (x - \mu)/\sigma$.
(In the example here, z for the sample bin is z = (5.0 − 10.0)/3.0 = −1.667.)
6. Create a new column called f(z) in which you normalize the PDF into the f(z) values. This is
accomplished by converting each f(x) into f(z): $f(z) = \sigma f(x)$.
(In the example here, f(z) of the sample bin is f(z) = 0.02*3.0 = 0.060 at z = −1.667.)
7. Finally, a plot of f(z) vs. z can be generated. A smooth curve through these data represents the
normalized PDF.
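The column-by-column procedure above can also be sketched outside Excel. The function below is a hypothetical helper (its name and arguments are not from these notes) and it assumes a uniform bin width; it is applied only to the single sample bin from the example (Bin = 6, Frequency = 8, n = 200, mean 10.0, standard deviation 3.0), since the other bins are not given.

```python
def normalized_pdf_rows(bin_max, frequency, bin_width, n_total, mean, std):
    """Apply steps 2 through 6 to each row of a frequency table.

    bin_max   : maximum x value of each bin (the "Bin" column)
    frequency : number of data points in each bin
    bin_width : width of every bin (assumed uniform)
    n_total   : total number of data points in the data set
    mean, std : sample mean and standard deviation of the data set
    """
    rows = []
    for x_max, freq in zip(bin_max, frequency):
        probability = freq / n_total           # step 2
        x_mid = x_max - bin_width / 2.0        # step 3, same as (x_min + x_max)/2
        f_x = probability / bin_width          # step 4
        z = (x_mid - mean) / std               # step 5
        f_z = std * f_x                        # step 6
        rows.append(
            {"bin": x_max, "frequency": freq, "probability": probability,
             "x_mid": x_mid, "f_x": f_x, "z": z, "f_z": f_z}
        )
    return rows

# The sample bin from the example above.
rows = normalized_pdf_rows(
    bin_max=[6.0], frequency=[8], bin_width=2.0, n_total=200, mean=10.0, std=3.0
)
sample = rows[0]
for name, value in sample.items():
    print(f"{name:>12}: {value}")
```

Passing the full Bin and Frequency columns as lists would produce one row per bin, ready for a plot of f(z) versus z (step 7).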
Example:
Given: The same 1000 temperature measurements used in a previous example for generating a histogram.
The data are provided in an Excel spreadsheet (Temperature_data_analysis.xls) on the website.
To do: Generate a PDF of these data. Normalize the PDF.
Solution:
o In a previous example (see the Histogram learning module), we generated a histogram of the temperature
data. We begin with the bin and frequency data generated in Excel.
o To generate the PDF, we follow the step-by-step instructions provided above. This will be shown in class
in Excel. The vertically normalized PDF is shown below (left side).
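[Figure: vertically normalized PDF, f(x) versus x (left), transformed to the fully normalized PDF, f(z) versus z (right).]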
o Finally, we transform to normalized variables; the fully normalized PDF is shown above (right side).
Notice that the shape is the same, but the variable transformation to f(z) is nondimensional, making it
more useful for comparison with other probability density distributions.
o The final PDF should be continuous, not discrete. Because of scatter, it is difficult to get Excel to draw a
smooth curve through these data. For lack of a better method at this point, we sketch the smooth curve
by eye below:
Discussion:
o The peak in the vertically normalized PDF occurs at x ≈ 31, which is very close to the sample mean. This
peak transforms to z ≈ 0 in the fully normalized PDF; this is a useful feature of the normalization.
o We can estimate the area under the f(x) curve by eye by counting squares; the area is indeed
approximately 1.0 or 100%, as it must be.
o We can also estimate the area under the f(z) curve by eye; it is approximately 1.0 or 100%, as it also
must be.
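The area estimates can also be checked by summing f(x)·Δx over the bins rather than counting squares. The sketch below does not use the temperature spreadsheet; it generates synthetic stand-in data (1000 values from `random.gauss` with an assumed mean of 31.0 and an assumed standard deviation of 1.0) and an assumed bin width of 0.5, purely to illustrate the check.

```python
import math
import random
import statistics

# Synthetic stand-in data (NOT the temperature measurements from the spreadsheet).
random.seed(0)
data = [random.gauss(31.0, 1.0) for _ in range(1000)]

n = len(data)
mean = statistics.mean(data)
std = statistics.stdev(data)  # sample standard deviation

# Uniform bins covering the data; each bin holds x_min < x <= x_max.
width = 0.5  # assumed bin width
lo = math.floor(min(data))
hi = math.ceil(max(data))
n_bins = int(round((hi - lo) / width))

frequency = [0] * n_bins
for x in data:
    i = math.ceil((x - lo) / width) - 1
    i = min(max(i, 0), n_bins - 1)  # guard the edge case x == lo
    frequency[i] += 1

# PDF columns: f(x) = probability / bin width, f(z) = std * f(x).
f_x = [freq / n / width for freq in frequency]
f_z = [std * v for v in f_x]
dz = width / std  # bin width expressed in z

area_x = sum(v * width for v in f_x)
area_z = sum(v * dz for v in f_z)

print(f"area under f(x) = {area_x:.6f}")
print(f"area under f(z) = {area_z:.6f}")
```

Because every data point falls in exactly one bin, both sums reduce to the sum of the bin probabilities, so they equal 1 up to rounding regardless of the data used.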
There are several standard PDFs discussed in statistics literature. Of these, the normal PDF is the most
common, and will be discussed next. We will also compare the above results with the normal PDF.