Study Guide - Describing data
_______________________________________________________________________
This guide covers the core ideas, but do use it in conjunction with wider reading to ensure you
gain a fuller understanding of these critical concepts:
Kirkwood, B.R. & Sterne, J.A.C. (2003) Essential medical statistics (2nd edn), Oxford: Blackwell Science.
Chapters 1-3: Basics, defining & describing data
Chapters 4-5: Means, standard deviations; Normal distribution
Bland, M. (2015) An introduction to medical statistics (4th edn), Oxford: Oxford University Press
Chapters 4-5: Summarising & presenting data
Peacock, J.L. & Peacock, P.J. (2011) Oxford handbook of medical statistics, Oxford: Oxford University Press
Chapter 6: Summarising data
Field, A. (2009) Discovering statistics using SPSS (3rd edn), London: Sage
Chapter 1: Why is my evil lecturer forcing me to learn statistics?
1. Approaches to statistical analysis
If you are new to statistics, you may feel bewildered by its scope and not know where to begin. A useful starting point is to understand the broad structure of the discipline, so this guide begins with an overview of the main branches of statistics.
There are two major branches with differing goals: descriptive and inferential statistics.
Descriptive statistics
Descriptive statistics is concerned with summarising data. This is usually the starting point of all statistical investigations: with a descriptive approach the analyst focuses on summarising and looking for patterns within the raw data collected in the study. This includes characterising the data, measuring the central location or ‘average’ of the data, and measuring the spread or variation in the data, focusing on how individuals (people or other data units) in the dataset differ. The data will usually be described and summarised both in numerical form, using statistics such as the mean or standard deviation, and graphically, using various charts, histograms or plots. In descriptive statistical analysis, the researcher remains within the confines of the data collected and cannot reach conclusions beyond these data.
Inferential statistics
In contrast, in the other main branch, inferential statistics, the goal is to make inferences about a general population based on a smaller sample, or about the relationships between populations by comparing samples. Inferential statistics is concerned with, and applied to, populations. The focus is on collecting samples that are representative of the total population and examining the properties of these samples (the sample ‘statistics’). Interest then turns to the population and its properties (the population ‘parameters’).
There are two broad methods in inferential statistics: estimation and hypothesis testing.
Estimation: The objective of estimation is to estimate a specific quantity of interest. This may be the average height of a population (a mean), or the prevalence of a disease (a proportion). It may be the way two groups or treatments differ (a difference), or the linear relationship between two variables of interest (a regression line).
The goal is to determine an estimate of this measure for the population, based on the properties of a random sample. The outcome is a point estimate together with a margin of error either side of it that is likely to contain the true population value. This range either side of the point estimate is referred to as a confidence interval.
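For illustration, here is a minimal Python sketch of estimation: a point estimate (the sample mean) with its 95% confidence interval. The eight readings are invented purely for demonstration.

```python
# Sketch: a point estimate (the sample mean) and its 95% confidence interval.
# The eight readings are invented purely for illustration.
import numpy as np
from scipy import stats

sample = np.array([72, 68, 75, 80, 71, 69, 77, 74])

mean = sample.mean()                              # point estimate
se = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # two-sided 95% critical value

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"point estimate = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```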
Hypothesis testing: The other category of inferential statistics involves testing hypotheses, usually to identify the scale of any differences between groups of data and whether any differences are significant in a statistical sense. Traditionally the focus has been on expressing this significance as a probability, or ‘p’ value, and reporting significance as a ‘yes / no’ decision depending on whether the p value crosses a threshold, conventionally p<0.05. Increasingly, emphasis on p values is downplayed in favour of effect sizes and confidence intervals.
Improved computing power has seen a trend toward more advanced regression modeling approaches.
The terms ‘data’ and ‘variable’ are used interchangeably to describe information we collect. Data is the
information obtained, but because there are usually different groups and forms of data we refer to the specific
measures as variables.
Categorical variables will comprise qualitative values. So for example, for the categorical variable ‘eye colour’, the
values are attributes labelled ‘brown’, ‘blue’, ‘green’ and so on. Information provided for categorical variables is
usually in the form of frequencies or counts: the number of instances recorded for each value.
Binary (or dichotomous) variable: A categorical variable with only two options. These are frequently used, as decision-making often involves determining whether a condition is present or not.
Ordinal (ranked) variable: A categorical variable that does have a logical, natural order based on a relative ranking, although there is no numerical relationship, as the categories may differ in scale.
For treatment in statistical testing using software (e.g. IBM SPSS), categorical data are often reclassified to a number. For example, for the variable ‘Species’, the values could be reclassified as: Anopheles darlingi = 1, Anopheles gambiae = 2, Anopheles stephensi = 3. However, these numbers are simply labels – there is no order.
Statistical tests that use categorical data are usually based around the count data recorded for specific value units
of the variable.
‘Dummy variables’: In some analyses, such as regression, categorical variables with more than two categories are
often reclassified as numerical ‘dummy variables’, assigned a combination of values of 0 or 1. For example a
variable with 4 categories A-D could be recoded as [A = 1,0,0], [B = 0,1,0], [C = 0,0,1] and [D = 0,0,0]. This enables
the categorical information to be included in the regression model.
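As an illustration, a short Python sketch of this recoding using pandas; the variable name and values here are hypothetical.

```python
# Sketch: recoding a four-level categorical variable into three 0/1 dummy
# variables, with category D as the reference (all zeros), as described above.
# The variable name and values are hypothetical.
import pandas as pd

group = pd.Series(["A", "B", "D", "C", "A", "D"], name="group")

# Create one 0/1 column per category, then drop the column for D so that
# D becomes the reference category represented by zeros in all three columns.
dummies = pd.get_dummies(group, dtype=int).drop(columns="D")
print(dummies)
```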
Discrete data: If the numbers are restricted to particular values, such as whole numbers, and cannot be
meaningfully expressed as a fraction, we call this data ‘discrete’. In most cases this will be count or
frequency data, such as the number of children in a particular family, or the number of snails caught in a
net when sampling. These can only be whole numbers; 3.5 children in a family, or 8.67 snails in one catch
is meaningless. Discrete frequency data are usually summarised as a table or bar chart.
Continuous data: This is where the numbers are meaningfully divisible into fractions or decimal places.
Typically this is measurement data such as a dosage (100mg), weight (76.654 kg), height (176.2 cm) or
temperature (22.6°C). The range of possible values cannot be counted, so for summarisation purposes the data are usually ‘discretised’ by grouping values into range ‘bins’. So heights could be represented in bins of 0-10 cm, 10-20 cm and so on. In this way, frequency distributions of continuous-scale data are represented graphically as histograms.
Why the discrete – continuous form is important: When analysing numerical data, we treat discrete data and
continuous ‘measurement’ data differently in statistical modelling. Continuous data is commonly modelled
against a theoretical normal distribution and is analysed using particular statistical methods. Discrete count data
on the other hand are often described according to different theoretical distribution forms (e.g. Poisson
distribution). More on this can be found in the Study Guides on Discrete and Continuous distributions.
Is the distinction obvious? There is overlap. Whilst discrete data such as the number of snails caught in a net is a
numerical variable, discrete data can also be associated with categorical variables. The number of people with
blue eyes is discrete data associated with one of the values of the categorical variable ‘eye colour’.
Also, some variables may technically be continuous but in practical terms are treated as discrete. Your age is technically continuous data, as it could be subdivided from years down to days, minutes or seconds if need be. However, this is impractical; in reality there is a finite set of values we would use to express age, so it is usually treated as a form of discrete data, often expressed as an integer such as 28 years old.
How about ratio & interval data? Some textbooks describe numerical data as interval or ratio. For interval data,
a zero on a scale is just another number and has no real meaning, as the measurement can just as easily go below
or above this. So temperature measured in Celsius is interval data, as the temperature can be above or below
zero. In contrast, for ratio data an absolute zero exists and means the absence of that unit. With ratio data the ratios of values are meaningful, so a family with six children has twice as many children as a family with three; zero children would mean the absence of children. In this case the data are discrete, but ratio data can be continuous too: weight is an example of a continuous ratio variable, as you cannot weigh less than zero kg. Other examples include volume, height, per capita income and rates (e.g. number per 10,000).
Choosing a level of measurement
When designing a research project, you will need to consider what level of unit measure you will record.
Information can often be measured in many different ways, resulting in data in different forms.
Let’s imagine you are designing a survey. How could you collect and report age data?
Age is technically a continuous variable, but we usually treat it as discrete. It can be ‘discretised’ in many ways.
Why would you do this? Well, in survey design you may decide that you only need the age data at a crude level, and for speed of coding in the field you may want to simplify how age is recorded, perhaps as a simple ordinal category (child, adult, pension-age). If precision is needed, age in years will be more appropriate. The choice is important as it influences the choices of statistical testing available at a later stage in the analysis. We want data that provide the most information at the least expense. As a general rule it is better to aim for higher-resolution data when collecting, as this can always be reclassified to a cruder level at a later stage.
Now, however, let’s introduce a variable such as a disease. We are particularly interested to see if the other
variables might be associated with this ‘outcome’ disease variable, as it may help us identify an underlying cause.
Now we have a relationship between potential ‘exposures’ (age, gender, time spent outside, weight etc) and an
outcome (presence of disease).
With this potential dependency relationship, we often label our groups depending on which side of this
connection they lie. The disease will be described as the dependent variable, or outcome variable. The other
variables of interest are termed independent variables, or exposure variables. In fact there are many terms
researchers use, which depend on subject traditions, study designs or statistical methods used. These are
grouped below, but most terms are variations on the same principle.
The variable being manipulated to see if it affects the outcome | The outcome or focus of the study (what we are trying to understand) | Usage
Independent variable | Dependent variable | General science
Exposure, risk factor, experimental variable, explanatory variable | Outcome variable | Medical sciences
Predictor variable, factor (if nominal), covariate (if continuous) | Response or dependent variable | Regression modelling
As well as the exposure-outcome relationship examined within a study, we also need to consider variables that are not themselves examined but that may significantly influence the exposure-outcome relationship. These are termed extraneous or confounding variables.
Any study of data should start with a screening process, the simplest way usually being to represent the data
graphically as a scatter-plot, bar-chart or histogram to identify the general distribution shape and locate any
unusual or extreme values. Raw data will usually be disorganised, so the first step is often to create an array, in which the data are ordered numerically from the lowest value to the highest (the range).
Categorical data
The simplest way to summarise categorical data is to present the counts for each value in a frequency table. In the example below, the categorical variable is ‘Household material’ and the values are the different types of material observed. Often it is easier to compare frequencies between the category groups using a bar chart, which shows the count for each category group as the length of its bar. The lower table and bar chart represent the same data as proportions, or relative frequency.
A relative frequency represents the data as a proportion of 1.00, or it could be expressed as a percentage. Area diagrams such as pie charts can sometimes be an effective way to represent proportions for categorical data, particularly if the information needs to be conveyed quickly, such as in a presentation or on a poster.
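As an illustration of how such a frequency table might be produced in software, here is a small Python sketch; the material names and counts are invented, not the survey data from the example.

```python
# Sketch: frequency and relative frequency for a categorical variable.
# The material names and counts are invented, not the survey data above.
import pandas as pd

material = pd.Series(["brick", "mud", "brick", "wood", "mud", "brick",
                      "mud", "mud", "wood", "brick"])

counts = material.value_counts()                      # frequency
proportions = material.value_counts(normalize=True)   # relative frequency (sums to 1.00)

summary = pd.DataFrame({"frequency": counts,
                        "relative frequency": proportions.round(2)})
print(summary)
```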
For ordered categorical data as well as the count and proportion (i.e. frequency and relative frequency), it is
valuable to include the cumulative frequency, where the count from each new category in the ordered grouping
is added to that of the previous level. In the example below, categories of infestation are arranged from high to
low – the primary interest being higher levels of infestation. Measuring the cumulative frequencies and
proportions allows us to say, for instance, that of a total of 276 households, 76 had moderate to severe
infestation. We can do the same for the proportions, to give a cumulative relative frequency. This allows us to see easily that 27.4% of cases had moderate to severe levels of infestation.
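The cumulative columns can be produced in the same way in software. In the sketch below the individual category counts are assumed; they are chosen only so that the totals match those quoted above (276 households in all, 76 of them moderate to severe).

```python
# Sketch: cumulative frequency and cumulative relative frequency for an
# ordered categorical variable. The category counts are assumed, chosen only
# so the totals match those quoted in the text.
import pandas as pd

infestation = pd.Series({"severe": 20, "moderate": 56, "light": 110, "none": 90})

table = infestation.to_frame("frequency")
table["relative frequency"] = table["frequency"] / table["frequency"].sum()
table["cumulative frequency"] = table["frequency"].cumsum()
table["cumulative relative frequency"] = table["relative frequency"].cumsum()
print(table)
```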
Continuous data
Raw measurement data usually represents various points on a scale. To represent the frequency distribution we
need to divide these measurement readings into interval groups for the purpose of visualising the spread of data.
In the table below, the raw data comprised 91 pulse-rate readings. These continuous data have been divided into equally spaced intervals, ordered into an array from low to high values (column 1).
The frequency of observed data falling into each of these intervals is recorded in column 2. Columns 1 and 2 can be represented graphically as a histogram. Column 3 shows the cumulative frequency, with the values for each category added as the table is completed. The relative cumulative frequency, the percentage of observations falling in or below each interval, is shown in column 4. These values are derived by dividing each cumulative frequency value by the total number of observations (91) to determine the proportion, then multiplying by 100 to present it as a percentage.
Histograms: For continuous data, the simplest way to represent graphically the frequency distribution is as a
histogram, in which the horizontal x-axis represents interval groupings or ‘bins’ that contain values within a pre-
determined range.
The main difference with a histogram is that the area of the bar, not its height, represents the frequency, and the bins can be of different widths. Therefore the vertical y-axis should be labelled frequency density, rather than frequency. The frequency density is calculated as frequency ÷ class width. However, where the bins are all the same width, the frequency density is effectively also the frequency of data in that interval bin.
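As an illustration, here is a short Python sketch that groups a small set of invented pulse-rate readings into interval bins and computes the frequency, cumulative frequency, relative cumulative frequency and frequency density (these are not the 91 readings from the table above).

```python
# Sketch: grouping continuous readings into interval bins. The 20 pulse-rate
# values are invented; they are not the 91 readings from the table above.
import pandas as pd

pulse = pd.Series([62, 71, 74, 68, 80, 77, 73, 85, 90, 76,
                   72, 66, 79, 83, 74, 70, 88, 75, 69, 81])

bins = [60, 70, 80, 90, 100]                       # interval 'bins' of width 10
groups = pd.cut(pulse, bins=bins, right=False)     # [60,70), [70,80), ...

table = groups.value_counts().sort_index().to_frame("frequency")
table["cumulative frequency"] = table["frequency"].cumsum()
table["relative cumulative %"] = 100 * table["cumulative frequency"] / len(pulse)
table["frequency density"] = table["frequency"] / 10   # frequency / class width
print(table)
```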
The histogram below is for the pulse-rate data shown in the frequency distribution table above.
Note that, compared to a bar chart, which has gaps between the bars, with a histogram the bins touch one another. This is because we are dealing with continuous data. Any ‘gaps’ between bins are simply interval bins containing no data.
Histograms allow us to assess the general shape of the distribution of our continuous data. In this case we can see
that it tends to centre round a mid-point value and is broadly symmetrical, but with a little more ‘weight’ to the
upper values on the right hand side.
Assessing normality: The Normal curve is a theoretical data distribution shape used to model continuous data.
This is explained in detail in the separate guide to Continuous Data Distributions (pages 2-11). It is usually valuable
to determine if our data frequency distribution is a close fit (or not) to the theoretical Normal distribution. If it is,
it opens up options to generalise from our sample and utilise a wide range of statistical methods. (If not there are
many alternatives too).
In the first instance we ‘eyeball’ the distribution shape and make a visual assessment of whether it follows the
classic bell-shaped symmetrical shape of the Normal distribution. To be more precise there are a range of graphs
such as the Normal plot (p17) and statistical tests for assessing the ‘goodness of fit’ to a Normal distribution,
mostly using statistical software (see the separate notes ‘Determining normality of distributions’).
Positive and negative skew: The data may be asymmetrical. It was often thought that symmetry was the natural
form for most data, but we have increasingly come to realise that much data is naturally skewed. Skewness is often present in small samples because sampling error is high; it tends to reduce, and the sample distribution approaches normality, as the sample size increases. If not, the data could be
transformed, say using the log, square root or reciprocal of each value, towards a normal distribution shape to
allow statistical testing prior to back-transforming the results. For more on transformations and alternative
distributions see the guides on ‘Continuous data distributions’.
If the distribution has values weighted at the low values (left side of the histogram) with a long tail to the right,
we say it is positively skewed. Negative skew, which is usually less common, is when the greater weighting of
values is at the upper value end (right-hand side of histogram).
4. Summarising data – locating the centre and spread of the distribution
The most accurate way to represent continuous data is to show all the raw data. However this is almost never
convenient, so we have to summarise the data to a form that can be communicated. The most important qualities
to summarise are the central value of the data, and a measure of the spread or dispersion of the data.
Example:
A clinician is studying the number of times a sample of nine children (0-5 years) have been treated for malaria. The recorded numbers of treatments for the nine children are: 2, 1, 2, 2, 5, 1, 2, 1, 20.
Mode: The mode is the value that occurs most frequently. In the scenario above, the most frequent
number of treatments is 2. The mode is usually the main measure of central tendency for categorical
data.
Median: To determine the median we rearrange the values from lowest to highest and take the physical centre of the data. For our example, the central value (the fifth of the nine) is 2. If there is an even number of values, we take the middle two, add them and divide by two.
1 1 1 2 2 2 2 5 20
Arithmetic mean: When we talk of the mean, we usually refer to the arithmetic mean. Here we sum up
the values of the data and divide by the number of observations (in this case 9 children).
(2 + 1 + 2 + 2 + 5 + 1 + 2 + 1 + 20) / 9 = 36 / 9 = 4
Note that here our mean is 4, which is quite different to the median (physical centre) of 2. This is because
we have an outlier in the data – a value that is substantially different from the rest of the data (Child 9,
who had received treatment 20 times). The arithmetic mean is sensitive to the influence of outliers so
data need to be carefully screened first to identify outliers and to establish if they are genuine and
meaningful (to retain for analysis), or a potential error in measurement or recording (in which case
discard).
If we excluded Child 9 from our data analysis and recalculated the arithmetic mean, it would align with the median value.
(2 + 1 + 2 + 2 + 5 + 1 + 2 + 1) / 8 = 16 / 8 = 2
Note that the arithmetic mean is a theoretical number derived from a calculation. None of the actual
raw values may in fact correspond with the calculated mean.
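These calculations are easy to check in software. For example, in Python:

```python
# Sketch: the three measures of central tendency for the nine children's
# malaria-treatment counts used in the example above.
import statistics

treatments = [2, 1, 2, 2, 5, 1, 2, 1, 20]

print(statistics.mode(treatments))    # 2: the most frequent value
print(statistics.median(treatments))  # 2: the middle value once sorted
print(statistics.mean(treatments))    # 4: 36 / 9, pulled upwards by the outlier 20
```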
Geometric mean: The arithmetic mean is calculated directly from the actual values in the data, but with some data there is a multiplicative dimension, for example when the data represent a growth rate or percentage growth; here the geometric mean is more appropriate. It can also be used as a way to reduce the effect of outliers in the data, as it prevents undue weight being assigned to such values.
To calculate the geometric mean we multiply the values together, then take the ‘nth’ root (so for 9 data points we take the 9th root). In our scenario:
(2 × 1 × 2 × 2 × 5 × 1 × 2 × 1 × 20)^(1/9) = 1600^(1/9) = 2.27
Compare this to the arithmetic mean of 4, which was inflated by our outlier. The value of 2.27 is much closer to the other measures of the centre, so it is a better summary statistic in this scenario.
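Again, this can be checked in software, either with a built-in function or from the product of the values:

```python
# Sketch: geometric mean of the same nine values, using scipy and the
# equivalent "product to the power 1/n" calculation.
from scipy.stats import gmean

treatments = [2, 1, 2, 2, 5, 1, 2, 1, 20]

print(gmean(treatments))    # about 2.27
print(1600 ** (1 / 9))      # the product of the nine values is 1600
```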
Weighted mean: In some cases we decide that certain values should carry more ‘weight’ than others. For example, in a course we may have three tests, of which the final test is weighted three times as much as each of the first two.
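For illustration, a weighted mean in Python; the marks used are invented.

```python
# Sketch: a weighted mean where the final test counts three times as much as
# each of the first two. The marks (60, 70, 80) are invented for illustration.
import numpy as np

marks = [60, 70, 80]
weights = [1, 1, 3]

print(np.average(marks, weights=weights))   # (60 + 70 + 3*80) / 5 = 74
```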
However, if the data distribution is skewed, the three measures of centrality will not coincide. The mean in
particular is susceptible to distortion with strong skew, or extreme value outliers, as we have already seen.
Therefore the mean should not be used for skewed data as it is likely to be a poor representation of the middle.
The median is much more reliable when we have outliers or skewed data as it represents the physical mid-point
of the distribution of values.
For a negatively skewed distribution, the mean will typically be lower than the median, so it will place the centre of the distribution at a lower value than is realistic. Conversely, for a positively skewed distribution, the mean will indicate a higher value than is realistic for the centre.
Measures of dispersion (variation)
There are many ways that the spread of the data can be described, but the two most important are the variance and the standard deviation. The simplest way to think of these is as the average amount by which an observation differs from the mean.
3 3 5 6 8 11 12 12 12 16 18 20 21 24 25
The arithmetic mean is 13.07, the median is 12 and the mode is 12. These measures of the centre are close together, suggesting a roughly symmetrical distribution.
To calculate the variance: subtract the mean from each value, then square the result. Then work out the arithmetic mean of these squared differences (i.e. sum them and divide by the number of observations).
The variance is therefore the average of the squared differences from the mean.
Why do we square the values? Try calculating without squaring: values below the mean give negative differences and values above give positive differences, so they would largely cancel each other out. Squaring makes all the differences positive, allowing them to be summed meaningfully.
However, because we squared the values, the variance is expressed in squared units and is difficult to interpret in the context of the raw data. To remove the squaring effect, we simply take the square root of the variance, returning the figure to the same units as the original data.
This is the Standard Deviation. In simple terms it represents the average difference of a value from the mean.
Thus, the larger the standard deviation value, the wider the spread of data. A small standard deviation means that
the distribution is compact around the central value.
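For example, the variance and standard deviation of the 15 values above can be computed step by step (here treating them as a complete population):

```python
# Sketch: variance and standard deviation of the 15 values above, following
# the steps described and treating them as a complete population.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

deviations = x - x.mean()              # difference of each value from the mean
variance = (deviations ** 2).mean()    # average of the squared differences
sd = np.sqrt(variance)                 # back in the original units

print(variance, sd)
print(np.var(x), np.std(x))            # numpy's defaults give the same result
```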
An important assumption for many statistical tests is that the variances are similar between two or more samples
being compared. So we can easily check this by comparing the standard deviation or variance values between the
samples.
Population or sample? The variance formula above works well for complete population data, but usually we are dealing with a sample from the population. A population will usually contain some extreme values (outliers), and a smaller sample is more likely to miss these extremes. As a result, sample variances tend to underestimate the population variance. We therefore adjust the formula to correct for this bias by using (n-1), one less than the sample size, in the denominator.
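In software this adjustment is usually controlled by a ‘degrees of freedom’ option; in Python’s numpy it is the ddof argument:

```python
# Sketch: the n-1 correction. ddof=1 divides by (n - 1) instead of n,
# giving the sample estimate of the population variance and standard deviation.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

print(np.var(x, ddof=0), np.std(x, ddof=0))   # divide by n   (population formula)
print(np.var(x, ddof=1), np.std(x, ddof=1))   # divide by n-1 (sample formula)
```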
Caution note: Like the mean, the variance and standard deviation are sensitive to skew. Therefore we can only
report the variance and standard deviation if we are reasonably sure the data is approximately normally
distributed. If skewed, another measure such as quartiles (p13) or box plots (p14) is a better summary.
Standard deviation and the empirical rule
The theoretical normal distribution has important properties that allow us to make predictions and probability
estimates. Read the separate guide ‘Continuous data distributions’ for more detail. The mean and standard
deviation can entirely describe a normally distributed set of data. This means that the distribution shape can be
reconstructed from these values using an equation. The resulting curve is called a probability density function
(again, refer to the guide mentioned). Arising from this curve is another rule concerning the standard deviation, called the empirical rule, which states that approximately 68% of values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations.
The example below shows the proportions of values contained between the standard deviation segments.
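The rule can be checked against the theoretical normal curve, for example:

```python
# Sketch: checking the empirical rule against the theoretical normal curve.
from scipy.stats import norm

for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations
    print(f"within {k} SD of the mean: {proportion:.1%}")
# prints roughly 68.3%, 95.4% and 99.7%
```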
Standard error: Closely related to the standard deviation is the standard error. The standard deviation applies to distributions of observed data, whether a whole population or a sample drawn from it. It is a descriptive statistic, concerned with how spread out our data distribution is.
In inferential statistics, we use samples to estimate values for the population. For any sample, dividing the standard deviation by the square root of the sample size gives the standard error. This value indicates the likely amount of sampling error, or uncertainty, in our sample if we use it to generalise to the population; it tells us about the quality of our estimate. Therefore, for inferential statistics the standard error is of central importance, as it tells us the precision of a sample-based estimate. More on this can be found in the guide to ‘Sampling Theory and Estimation’.
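For example, using the nine treatment counts from earlier:

```python
# Sketch: standard error of the mean = standard deviation / sqrt(sample size),
# using the nine treatment counts from the earlier example.
import numpy as np

treatments = np.array([2, 1, 2, 2, 5, 1, 2, 1, 20])

sd = treatments.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(len(treatments))     # standard error of the mean
print(sd, se)
```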
Other measures of dispersion
Range: The range is simply the difference between the largest and smallest value. In our example from earlier:
3 3 5 6 8 11 12 12 12 16 18 20 21 24 25
the range is 25 – 3 = 22.
The range, however, is susceptible to outliers. Taking our first example of the number of treatments experienced by the 9 children:
1 1 1 2 2 2 2 5 20
the range is 20 – 1 = 19. This is quite misleading, as the majority of values lie between 1 and 5, so the single outlier of 20 distorts the range considerably. A better approach, which takes account of extreme values, is the inter-quartile range.
Inter-quartile range
The inter-quartile range is the difference between the upper and lower quartile (12 – 5) = 7
The inter-quartile range is a good indicator of dispersion where there are extreme outliers as these will be
excluded from this summary statistic. A graphical representation based on quartiles is the box and whisker plot
(page 14).
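For example, in Python (note that quartile values depend on the interpolation convention used, so software may give slightly different figures from a hand calculation):

```python
# Sketch: range and inter-quartile range for the 15-value example. Quartile
# values depend on the interpolation convention, so software output may
# differ slightly from a hand calculation.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```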
Deciles are another way of subdividing the array of data at each 10 percentage point, although used far less
frequently.
A box and whiskers plot (or ‘box-plot’) is constructed from quartile information, with the box representing the
inter-quartile range between the first and third quartiles, and the bars or ‘whiskers’ extending from the lowest
data point to the first quartile, and at the upper end, from the third quartile to the highest value in the dataset.
The median is shown as a line in the box.
Box and whisker plots are valuable graphical representations for displaying crude distributions of many samples in
parallel as any skew can easily be seen from the shape of the plots. For normally distributed data, the plot will be
symmetrical with the median sitting exactly mid-way along the box, and with whisker lines of equal length.
Positively skewed data will show a shorter whisker to the first quartile and the median closer to the first quartile.
Conversely, for negatively skewed data the median will be closer to the upper quartile.
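For illustration, a minimal box plot of the 15-value example in Python; the whis setting is used here so that the whiskers run to the minimum and maximum, as described above (matplotlib’s default instead caps the whiskers at 1.5 × the inter-quartile range).

```python
# Sketch: a box and whisker plot of the 15-value example. whis=(0, 100)
# draws the whiskers at the 0th and 100th percentiles (minimum and maximum),
# matching the description in the text.
import matplotlib.pyplot as plt

x = [3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25]

fig, ax = plt.subplots()
ax.boxplot(x, whis=(0, 100))
ax.set_ylabel("Value")
plt.show()
```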
Normal plots (Q-Q plots)
A Q-Q plot, or quantile-quantile plot, is a special type of plot that is valuable for visually assessing whether collected data fit a theoretical distribution shape. The most common form is the Normal plot, used to check whether data fit a normal distribution. A Normal plot is a type of scatter plot in which the quantiles of the observed data are plotted against the corresponding quantiles of a theoretical normal distribution.
To understand how this is constructed, here’s a simplified example. Let’s say we have 10 data points that we have
arranged from low to high:
-0.7 1.1 1.9 2.2 3.1 3.9 5.0 6.0 6.7 7.0
A quantile represents the proportion of the data lying at or below a given value. So for the median the quantile is 50% (i.e. 50% of the data lie below the median). We want to calculate the quantile for each of our data points.
In our example we have 10 data items, so imagine a normal distribution curve divided into 10 parts, each worth 10% of the distribution area. The lowest value, -0.7, represents the smallest 10% of the data, so we imagine it sitting in the middle of the 0%-10% band (i.e. at the 5% point, or 0.05 as a proportion).* The next smallest sits mid-way through the 10%-20% band (i.e. at 15%, or 0.15), and so on. These quantiles are entered in column 3.
Finally, we need to identify the equivalent value for the theoretical Normal distribution. These are standardised values, called z-scores, where 0 is the mean, -1 is one standard deviation below the mean, and so on (see the guide to Continuous data distributions). Using a computer, or reading from a standard Normal distribution table, we find that the lowest 5% of the distribution is cut off by a z-score of -1.645, the lowest 15% by a z-score of -1.036, and so on; these are entered in column 4.
The Normal quantile plot is then produced by plotting column 2 (the ordered data values) against column 4 (the z-scores).
* In our simple example we divided the normal curve into 10% segments because we had 10 raw data observations. More generally we would use a formula such as (i-0.5)/n to assign a quantile to the i-th ordered observation.
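For illustration, the construction described above can be reproduced in Python using the (i-0.5)/n rule:

```python
# Sketch: constructing the Normal (Q-Q) plot described above. Each ordered
# value is assigned the quantile (i - 0.5)/n and matched with the z-score
# that cuts off that proportion of the theoretical normal distribution.
import numpy as np
from scipy.stats import norm

data = np.sort([-0.7, 1.1, 1.9, 2.2, 3.1, 3.9, 5.0, 6.0, 6.7, 7.0])
n = len(data)

quantiles = (np.arange(1, n + 1) - 0.5) / n   # 0.05, 0.15, ..., 0.95
z_scores = norm.ppf(quantiles)                # e.g. -1.645 for the 5% point

# Plotting data against z_scores gives the Normal plot; a roughly straight
# line suggests the data are compatible with a normal distribution.
for value, z in zip(data, z_scores):
    print(f"{value:5.1f}  {z:6.3f}")
```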
Histograms are simple to produce and understand, and it is easy to make a crude visual assessment of the shape, mode, mid-point and spread. However, if the interval ‘bins’ are too big they can be difficult to interpret. It is not easy to compare multiple distributions side by side because of the size of the graph, and it can also be difficult to assess normality from a histogram for a small dataset.
Box-plots are very good for assessing the mid-point, skew and spread of multiple samples side by side. However, there is no way to distinguish multiple modes, so some important information is lost.
Normal plots are very good for assessing whether data come from a normal distribution (or, with Q-Q plots, other theoretical distributions), even for smaller datasets. However, it is more difficult to identify skew and other features of the distribution than with a histogram.