1.2 Histograms
Histograms are density estimates. A density estimate gives a good impression of the
distribution of the data. In contrast to boxplots, density estimates show possible
multimodality of the data. The idea is to locally represent the data density by
counting the number of observations in a sequence of consecutive intervals (bins)
with origin $x_0$. Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a
bin grid starting at $x_0$:
$$B_j(x_0, h) = [x_0 + (j-1)h,\; x_0 + jh), \qquad j \in \mathbb{Z},$$
where $[\cdot, \cdot)$ denotes a left closed and right open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample
with density $f$, the histogram is defined as follows:
$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \qquad (1.7)$$
In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ (see Symbols and
Notation in Chap. 21) counts the number of observations falling into bin $B_j(x_0, h)$.
The second indicator function is responsible for “localising” the counts around x.
The parameter h is a smoothing or localising parameter and controls the width of
the histogram bins. An h that is too large leads to very big blocks and thus to a
very unstructured histogram. On the other hand, an h that is too small gives a very
variable estimate with many unimportant peaks.
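The definition (1.7) can be sketched in a few lines of Python. This is a minimal illustration, not the book's MVAhisbank1 code. It assumes the bins are indexed so that $B_j(x_0, h) = [x_0 + (j-1)h,\; x_0 + jh)$. The function names and the toy sample are hypothetical.

```python
import math

def bin_index(x, x0, h):
    """Index j of the bin B_j(x0, h) = [x0 + (j-1)h, x0 + jh) that contains x."""
    return math.floor((x - x0) / h) + 1

def histogram_estimate(x, data, x0, h):
    """Evaluate (1.7) at x: observations sharing a bin with x, scaled by 1/(n*h)."""
    j = bin_index(x, x0, h)
    count = sum(1 for xi in data if bin_index(xi, x0, h) == j)
    return count / (len(data) * h)

# Made-up toy sample (assumption for illustration, not the bank note data).
sample = [0.1, 0.2, 0.25, 0.7, 1.1, 1.3, 1.35, 1.9]

# A small h keeps two separate peaks; a large h merges everything into one block.
for h in (0.5, 2.0):
    heights = [histogram_estimate(x, sample, 0.0, h) for x in (0.3, 0.6, 1.2, 1.9)]
    print(h, heights)
```

With h = 0.5 the estimate alternates between high and low bins, while h = 2.0 puts all eight observations into a single bin, so every evaluation point receives the same height.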
The effect of $h$ is illustrated in detail in Fig. 1.6. It contains the histogram (upper
left) for the diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum
of these observations) and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same
origin, $x_0 = 137.8$, results in the histogram shown in the lower left of the figure.
This density histogram is somewhat smoother due to the larger $h$. The binwidth is
next set to $h = 0.3$ (upper right). From this histogram, one has the impression that
the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9.
Fig. 1.6 Diagonal of counterfeit bank notes. Histograms with $x_0 = 137.8$ and $h = 0.1$ (upper
left), $h = 0.2$ (lower left), $h = 0.3$ (upper right), $h = 0.4$ (lower right) MVAhisbank1
The detection of modes requires fine tuning of the binwidth. Using methods from
smoothing methodology (Härdle, Müller, Sperlich, & Werwatz, 2004) one can find
an “optimal” binwidth h for n observations:
$$h_{opt} = \left( \frac{24\sqrt{\pi}}{n} \right)^{1/3}.$$
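The binwidth rule is easy to evaluate numerically. The sketch below reads the rule as $h_{opt} = (24\sqrt{\pi}/n)^{1/3}$. The function name and the sample sizes are illustrative assumptions, not values taken from the text.

```python
import math

def optimal_binwidth(n):
    """Binwidth rule h_opt = (24 * sqrt(pi) / n) ** (1/3) for n observations."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return (24 * math.sqrt(math.pi) / n) ** (1 / 3)

# Illustrative sample sizes (assumptions): h_opt shrinks slowly as n grows.
for n in (50, 100, 200, 800):
    print(n, round(optimal_binwidth(n), 3))
```

Because the rule scales like $n^{-1/3}$, the sample size has to grow eightfold before the suggested binwidth is halved.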
Unfortunately, the binwidth $h$ is not the only parameter determining the shape of $\hat{f}$.
In Fig. 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$
(lower left), $x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All
the graphs have been scaled equally on the y-axis to allow comparison. One sees
that, despite the fixed binwidth $h$, the interpretation is not facilitated. The shift
of the origin $x_0$ (to four different locations) created four different histograms. This
Fig. 1.7 Diagonal of counterfeit bank notes. Histogram with $h = 0.4$ and origins $x_0 = 137.65$
(upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), $x_0 = 137.95$ (lower right)
MVAhisbank2
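The dependence on the origin can be reproduced on a small scale. The following sketch is not the book's MVAhisbank2 code: it uses a made-up toy sample and a hypothetical helper `bin_counts`, and it assumes bins of the form $[x_0 + (j-1)h,\; x_0 + jh)$. It tabulates the bin counts for one fixed binwidth and two origins.

```python
import math
from collections import Counter

def bin_counts(data, x0, h):
    """Count observations per bin index j, with bins [x0 + (j-1)h, x0 + jh)."""
    return Counter(math.floor((xi - x0) / h) + 1 for xi in data)

# Made-up toy sample (assumption for illustration, not the bank note data).
sample = [0.1, 0.2, 0.25, 0.7, 1.1, 1.3, 1.35, 1.9]
h = 0.5  # fixed binwidth; only the origin changes

for x0 in (0.0, -0.25):
    counts = bin_counts(sample, x0, h)
    # Convert counts to density heights by dividing by n * h.
    heights = {j: c / (len(sample) * h) for j, c in sorted(counts.items())}
    print("origin", x0, "heights", heights)
```

With origin 0.0 the counts per bin are 3, 1, 3, 1, which suggests two modes. With origin -0.25 the same observations fall into five bins with counts 2, 2, 1, 2, 1, and the two peaks are much less pronounced.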
Fig. 1.8 Averaged shifted histograms based on all (counterfeit and genuine) Swiss bank notes:
there are 2 shifts (upper left), 4 shifts (lower left), 8 shifts (upper right) and 16 shifts (lower right)
MVAashbank
Summary
↪ Modes of the density are detected with a histogram.
↪ There is an “optimal” $h = (24\sqrt{\pi}/n)^{1/3}$.