Study Guide - Describing data
_______________________________________________________________________
This guide covers the core ideas, but do use it in conjunction with wider reading to ensure you
gain a fuller understanding of these critical concepts:
Kirkwood, B.R. & Sterne, J.A.C. (2003) Essential medical statistics (2nd edn), Oxford: Blackwell Science.
Chapters 1-3: Basics, defining & describing data
Chapters 4-5: Means, standard deviations; Normal distribution
Bland, M. (2015) An introduction to medical statistics (4th edn), Oxford: Oxford University Press
Chapters 4-5: Summarising & presenting data
Peacock, J.L. & Peacock, P.J. (2011) Oxford handbook of medical statistics, Oxford: Oxford University Press
Chapter 6: Summarising data
Field, A. (2009) Discovering statistics using SPSS (3rd edn), London: Sage
Chapter 1: Why is my evil lecturer forcing me to learn statistics?
1. Approaches to statistical analysis
If you are new to statistics, you may feel bewildered by its scope and not know where to begin. A useful starting point is to understand the broad structure of the discipline, so this guide begins with an overview of the main branches of statistics.
There are two major branches with differing goals: descriptive and inferential statistics.
Descriptive statistics
Descriptive statistics is concerned with summarising data. This is usually the starting point of all statistical investigations: with a descriptive approach the analyst focuses on summarising and looking for patterns within the raw data collected in the study. This includes characterising the data, measuring the central location or ‘average’ of the data, and measuring the spread or variation in the data, focusing on how individuals (people or other data units) in the dataset differ. The data will usually be described and summarised both in numerical form, using statistics such as the mean or standard deviation, and graphically, using various charts, histograms or plots. In descriptive statistical analysis, the researcher remains within the confines of the data collected and cannot reach conclusions beyond these data.
Inferential statistics
In contrast, in the other main branch, inferential statistics, the goal is to make inferences about a general population based on a smaller sample, or about the relationships between populations by comparing samples. Inferential statistics is concerned with, and applied to, populations. The focus is on collecting samples that are representative of the total population and examining the properties of these samples (the sample ‘statistics’). Interest then turns to the population and its properties (the population ‘parameters’).
There are two broad methods in inferential statistics: estimation and hypothesis testing.
Estimation: The objective of estimation is to estimate a specific quantity of interest. This may be the average height of a population (a mean), or the prevalence of a disease (a proportion). It may be the way two groups or treatments differ (a difference), or the linear relationship between two variables of interest (a regression line).
The goal is to determine an estimate of this measure for the population, based on the properties of a random sample. The outcome is a point estimate together with a margin of error either side of it that is likely to contain the true population value. This range either side of the point estimate is referred to as a confidence interval.
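For illustration, here is a minimal Python sketch of estimation: a point estimate (the sample mean) with its 95% confidence interval. The eight readings are invented purely for demonstration.

```python
# Sketch: a point estimate (the sample mean) and its 95% confidence interval.
# The eight readings are invented purely for illustration.
import numpy as np
from scipy import stats

sample = np.array([72, 68, 75, 80, 71, 69, 77, 74])

mean = sample.mean()                              # point estimate
se = sample.std(ddof=1) / np.sqrt(len(sample))    # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)   # two-sided 95% critical value

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"point estimate = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```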
Hypothesis testing: The other category of inferential statistics involves testing hypotheses, usually to identify the scale of any differences between groups of data and whether any differences are significant in a statistical sense. Traditionally the focus has been on expressing this significance as a probability, or ‘p’ value, and reporting significance as a ‘yes / no’ decision depending on whether the p value crosses a threshold, conventionally p<0.05. Increasingly, emphasis on p values is downplayed in favour of effect sizes and confidence intervals.
Improved computing power has seen a trend toward more advanced regression modeling approaches.
The terms ‘data’ and ‘variable’ are used interchangeably to describe information we collect. Data is the
information obtained, but because there are usually different groups and forms of data we refer to the specific
measures as variables.
Categorical variables will comprise qualitative values. So for example, for the categorical variable ‘eye colour’, the
values are attributes labelled ‘brown’, ‘blue’, ‘green’ and so on. Information provided for categorical variables is
usually in the form of frequencies or counts: the number of instances recorded for each value.
Binary (or dichotomous) variable: A categorical variable with only two options. These are frequently used, as decision-making often involves determining whether a condition is present or not.
Ordinal (ranked) variable: A categorical variable that does have a logical, natural order based on a relative ranking, although there is no numerical relationship, as the categories may differ in scale.
For treatment in statistical testing using software (e.g. IBM SPSS), categorical data are often reclassified to a number. For example, for the variable ‘Species’, the values could be reclassified as: Anopheles darlingi = 1, Anopheles gambiae = 2, Anopheles stephensi = 3. However, these numbers are simply labels – there is no order.
Statistical tests that use categorical data are usually based around the count data recorded for specific value units
of the variable.
‘Dummy variables’: In some analyses, such as regression, categorical variables with more than two categories are
often reclassified as numerical ‘dummy variables’, assigned a combination of values of 0 or 1. For example a
variable with 4 categories A-D could be recoded as [A = 1,0,0], [B = 0,1,0], [C = 0,0,1] and [D = 0,0,0]. This enables
the categorical information to be included in the regression model.
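As an illustration, a short Python sketch of this recoding using pandas; the variable name and values here are hypothetical.

```python
# Sketch: recoding a four-level categorical variable into three 0/1 dummy
# variables, with category D as the reference (all zeros), as described above.
# The variable name and values are hypothetical.
import pandas as pd

group = pd.Series(["A", "B", "D", "C", "A", "D"], name="group")

# Create one 0/1 column per category, then drop the column for D so that
# D becomes the reference category represented by zeros in all three columns.
dummies = pd.get_dummies(group, dtype=int).drop(columns="D")
print(dummies)
```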
Discrete data: If the numbers are restricted to particular values, such as whole numbers, and cannot be
meaningfully expressed as a fraction, we call this data ‘discrete’. In most cases this will be count or
frequency data, such as the number of children in a particular family, or the number of snails caught in a
net when sampling. These can only be whole numbers; 3.5 children in a family, or 8.67 snails in one catch
is meaningless. Discrete frequency data are usually summarised as a table or bar chart.
Continuous data: This is where the numbers are meaningfully divisible into fractions or decimal places.
Typically this is measurement data such as a dosage (100mg), weight (76.654 kg), height (176.2 cm) or
temperature (22.6°C). The range of possible values cannot be counted, so for summarisation purposes the data are usually ‘discretised’ by grouping values into range ‘bins’. So heights could be represented in bins of 0-10 cm, 10-20 cm and so on. In this way, frequency distributions of continuous-scale data are represented graphically as histograms.
Why the discrete – continuous form is important: When analysing numerical data, we treat discrete data and
continuous ‘measurement’ data differently in statistical modelling. Continuous data is commonly modelled
against a theoretical normal distribution and is analysed using particular statistical methods. Discrete count data
on the other hand are often described according to different theoretical distribution forms (e.g. Poisson
distribution). More on this can be found in the Study Guides on Discrete and Continuous distributions.
Is the distinction obvious? There is overlap. Whilst discrete data such as the number of snails caught in a net is a
numerical variable, discrete data can also be associated with categorical variables. The number of people with
blue eyes is discrete data associated with one of the values of the categorical variable ‘eye colour’.
Also, some variables may technically be continuous but in practical terms are treated as discrete. Your age is technically continuous data, as it could be subdivided from years down to days, minutes or seconds if need be. However, this is impractical; in reality there is a finite set of values we would use to express age, so it is usually treated as a form of discrete data, often expressed as an integer such as 28 years old.
How about ratio & interval data? Some textbooks describe numerical data as interval or ratio. For interval data,
a zero on a scale is just another number and has no real meaning, as the measurement can just as easily go below
or above this. So temperature measured in Celsius is interval data, as the temperature can be above or below
zero. In contrast, for ratio data an absolute zero exists and means the absence of that unit. With ratio data the ratios of values are meaningful, so a family with six children has twice as many children as a family with three; zero children would mean the absence of children. In this case the data are discrete, but ratio data can be continuous too: weight is an example of a continuous ratio variable, as you cannot weigh less than zero kg. Other examples include volume, height, per capita income and rates (e.g. number per 10,000).
Choosing a level of measurement
When designing a research project, you will need to consider what level of unit measure you will record.
Information can often be measured in many different ways, resulting in data in different forms.
Let’s imagine you are designing a survey. How could you collect and report age data?
Age is technically a continuous variable, but we usually treat it as discrete. It can be ‘discretised’ in many ways.
Why would you do this? Well, in survey design you may decide that you only need the age data at a crude level, and for speed of coding in the field you may want to simplify how age is recorded, perhaps as a simple ordinal category (child, adult, pension-age). If precision is needed, age in years will be more appropriate. The choice is important as it influences the choices of statistical testing available at a later stage in the analysis. We want data that provide the most information at the least expense. As a general rule it is better to aim for higher-resolution data when collecting, as this can always be reclassified to a cruder level at a later stage.
Now, however, let’s introduce a variable such as a disease. We are particularly interested to see if the other
variables might be associated with this ‘outcome’ disease variable, as it may help us identify an underlying cause.
Now we have a relationship between potential ‘exposures’ (age, gender, time spent outside, weight etc) and an
outcome (presence of disease).
With this potential dependency relationship, we often label our groups depending on which side of this
connection they lie. The disease will be described as the dependent variable, or outcome variable. The other
variables of interest are termed independent variables, or exposure variables. In fact there are many terms
researchers use, which depend on subject traditions, study designs or statistical methods used. These are
grouped below, but most terms are variations on the same principle.
The variable being manipulated to see if it affects the outcome | The outcome or focus of the study (what we are trying to understand) | Usage
Independent variable | Dependent variable | General science
Exposure, risk factor, experimental variable, explanatory variable | Outcome variable | Medical sciences
Predictor variable, factor (if nominal), covariate (if continuous) | Response or dependent variable | Regression modelling
As well as the exposure-outcome relationship examined within a study, we also need to consider variables that are not themselves examined but that may significantly influence the exposure-outcome relationship. These are termed extraneous or confounding variables.
Any study of data should start with a screening process, the simplest way usually being to represent the data
graphically as a scatter-plot, bar-chart or histogram to identify the general distribution shape and locate any
unusual or extreme values. Raw data will usually be disorganised, so the first step is often to create an array, in which the data are ordered numerically from the lowest value to the highest (the range).
Categorical data
The simplest way to summarise categorical data is to present the counts for each value in a frequency table. In the example below, the categorical variable is ‘Household material’ and the values are the different types of material observed. Often it is easier to compare frequencies between the category groups using a bar chart, which shows the count for each category group as the length of its bar. The lower table and bar chart represent the same data as proportions, or relative frequency.
A relative frequency represents the data as a proportion of 1.00, or it could be expressed as a percentage. Area diagrams such as pie charts can sometimes be an effective way to represent proportions for categorical data, particularly if the information needs to be conveyed quickly, such as in a presentation or on a poster.
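As an illustration of how such a frequency table might be produced in software, here is a small Python sketch; the material names and counts are invented, not the survey data from the example.

```python
# Sketch: frequency and relative frequency for a categorical variable.
# The material names and counts are invented, not the survey data above.
import pandas as pd

material = pd.Series(["brick", "mud", "brick", "wood", "mud", "brick",
                      "mud", "mud", "wood", "brick"])

counts = material.value_counts()                      # frequency
proportions = material.value_counts(normalize=True)   # relative frequency (sums to 1.00)

summary = pd.DataFrame({"frequency": counts,
                        "relative frequency": proportions.round(2)})
print(summary)
```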
For ordered categorical data as well as the count and proportion (i.e. frequency and relative frequency), it is
valuable to include the cumulative frequency, where the count from each new category in the ordered grouping
is added to that of the previous level. In the example below, categories of infestation are arranged from high to
low – the primary interest being higher levels of infestation. Measuring the cumulative frequencies and
proportions allows us to say, for instance, that of a total of 276 households, 76 had moderate to severe
infestation. We can do the same for the proportions, to give a cumulative relative frequency. This allows us to see easily that 27.4% of cases had moderate to severe levels of infestation.
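The cumulative columns can be produced in the same way in software. In the sketch below the individual category counts are assumed; they are chosen only so that the totals match those quoted above (276 households in all, 76 of them moderate to severe).

```python
# Sketch: cumulative frequency and cumulative relative frequency for an
# ordered categorical variable. The category counts are assumed, chosen only
# so the totals match those quoted in the text.
import pandas as pd

infestation = pd.Series({"severe": 20, "moderate": 56, "light": 110, "none": 90})

table = infestation.to_frame("frequency")
table["relative frequency"] = table["frequency"] / table["frequency"].sum()
table["cumulative frequency"] = table["frequency"].cumsum()
table["cumulative relative frequency"] = table["relative frequency"].cumsum()
print(table)
```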
Continuous data
Raw measurement data usually represents various points on a scale. To represent the frequency distribution we
need to divide these measurement readings into interval groups for the purpose of visualising the spread of data.
In the table below, the raw data comprised 91 pulse-rate readings. These continuous data have been divided into equally spaced intervals, ordered into an array from low to high values (column 1).
The frequency of observed data falling into each of these intervals is recorded in column 2. Columns 1 and 2 can be represented graphically as a histogram. Column 3 shows the cumulative frequency, with the values for each category added as the table is completed. The relative cumulative frequency, the percentage of observations falling in or below each interval, is shown in column 4. These values are derived by dividing each cumulative frequency value by the total number of observations (91) to determine the proportion, then multiplying by 100 to present it as a percentage.
Histograms: For continuous data, the simplest way to represent graphically the frequency distribution is as a
histogram, in which the horizontal x-axis represents interval groupings or ‘bins’ that contain values within a pre-
determined range.
The main difference with a histogram is that the area of the bar, not its height, represents the frequency, and the bins can be of different widths. Therefore the vertical y-axis should be labelled frequency density, rather than frequency. The frequency density is calculated as frequency ÷ class width. However, where the bins are all the same width, the frequency density is effectively also the frequency of data in that interval bin.
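As an illustration, here is a short Python sketch that groups a small set of invented pulse-rate readings into interval bins and computes the frequency, cumulative frequency, relative cumulative frequency and frequency density (these are not the 91 readings from the table above).

```python
# Sketch: grouping continuous readings into interval bins. The 20 pulse-rate
# values are invented; they are not the 91 readings from the table above.
import pandas as pd

pulse = pd.Series([62, 71, 74, 68, 80, 77, 73, 85, 90, 76,
                   72, 66, 79, 83, 74, 70, 88, 75, 69, 81])

bins = [60, 70, 80, 90, 100]                       # interval 'bins' of width 10
groups = pd.cut(pulse, bins=bins, right=False)     # [60,70), [70,80), ...

table = groups.value_counts().sort_index().to_frame("frequency")
table["cumulative frequency"] = table["frequency"].cumsum()
table["relative cumulative %"] = 100 * table["cumulative frequency"] / len(pulse)
table["frequency density"] = table["frequency"] / 10   # frequency / class width
print(table)
```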
The histogram below is for the pulse-rate data shown in the frequency distribution table above.
Note that, compared to a bar chart, which has gaps between the bars, with a histogram the bins touch one another. This is because we are dealing with continuous data. Any ‘gaps’ between bins are simply interval bins containing no data.
Histograms allow us to assess the general shape of the distribution of our continuous data. In this case we can see
that it tends to centre round a mid-point value and is broadly symmetrical, but with a little more ‘weight’ to the
upper values on the right hand side.
Assessing normality: The Normal curve is a theoretical data distribution shape used to model continuous data.
This is explained in detail in the separate guide to Continuous Data Distributions (pages 2-11). It is usually valuable
to determine if our data frequency distribution is a close fit (or not) to the theoretical Normal distribution. If it is,
it opens up options to generalise from our sample and utilise a wide range of statistical methods. (If not there are
many alternatives too).
In the first instance we ‘eyeball’ the distribution shape and make a visual assessment of whether it follows the
classic bell-shaped symmetrical shape of the Normal distribution. To be more precise there are a range of graphs
such as the Normal plot (p17) and statistical tests for assessing the ‘goodness of fit’ to a Normal distribution,
mostly using statistical software (see the separate notes ‘Determining normality of distributions’).
Positive and negative skew: The data may be asymmetrical. It was often thought that symmetry was the natural
form for most data, but we have increasingly come to realise that much data is naturally skewed. Skewness is often present in small samples because sampling error is high; it tends to reduce, and the sample distribution approaches normality, as the sample size increases. If not, the data could be
transformed, say using the log, square root or reciprocal of each value, towards a normal distribution shape to
allow statistical testing prior to back-transforming the results. For more on transformations and alternative
distributions see the guides on ‘Continuous data distributions’.
If the distribution has values weighted at the low values (left side of the histogram) with a long tail to the right,
we say it is positively skewed. Negative skew, which is usually less common, is when the greater weighting of
values is at the upper value end (right-hand side of histogram).
4. Summarising data – locating the centre and spread of the distribution
The most accurate way to represent continuous data is to show all the raw data. However this is almost never
convenient, so we have to summarise the data to a form that can be communicated. The most important qualities
to summarise are the central value of the data, and a measure of the spread or dispersion of the data.
Example:
A clinician is studying the number of times a sample of nine children (0-5 years) have been treated for malaria. The recorded numbers of treatments for the nine children are: 2, 1, 2, 2, 5, 1, 2, 1, 20.
Mode: The mode is the value that occurs most frequently. In the scenario above, the most frequent
number of treatments is 2. The mode is usually the main measure of central tendency for categorical
data.
Median: To determine the median we rearrange the values from lowest to highest and take the physical centre of the data. For our example, the central value (the fifth of the nine) is 2. If there is an even number of values, we take the middle two, add them and divide by two.
1 1 1 2 2 2 2 5 20
Arithmetic mean: When we talk of the mean, we usually refer to the arithmetic mean. Here we sum up
the values of the data and divide by the number of observations (in this case 9 children).
(2 + 1 + 2 + 2 + 5 + 1 + 2 + 1 + 20) / 9 = 36 / 9 = 4
Note that here our mean is 4, which is quite different to the median (physical centre) of 2. This is because
we have an outlier in the data – a value that is substantially different from the rest of the data (Child 9,
who had received treatment 20 times). The arithmetic mean is sensitive to the influence of outliers so
data need to be carefully screened first to identify outliers and to establish if they are genuine and
meaningful (to retain for analysis), or a potential error in measurement or recording (in which case
discard).
If we excluded Child 9 from our data analysis and recalculated the arithmetic mean, it would align with the median value.
(2 + 1 + 2 + 2 + 5 + 1 + 2 + 1) / 8 = 16 / 8 = 2
Note that the arithmetic mean is a theoretical number derived from a calculation. None of the actual
raw values may in fact correspond with the calculated mean.
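These calculations are easy to check in software. For example, in Python:

```python
# Sketch: the three measures of central tendency for the nine children's
# malaria-treatment counts used in the example above.
import statistics

treatments = [2, 1, 2, 2, 5, 1, 2, 1, 20]

print(statistics.mode(treatments))    # 2: the most frequent value
print(statistics.median(treatments))  # 2: the middle value once sorted
print(statistics.mean(treatments))    # 4: 36 / 9, pulled upwards by the outlier 20
```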
Geometric mean: The arithmetic mean is calculated directly from the actual values in the data, but with some data there is a multiplicative dimension, for example when the data represent a growth rate or percentage growth; here the geometric mean is more appropriate. It can also be used as a way to reduce the effect of outliers in the data, as it prevents undue weight being assigned to such values.
To calculate the geometric mean we multiply the values together, then take the ‘nth’ root (so for 9 data points we take the 9th root). In our scenario:
(2 × 1 × 2 × 2 × 5 × 1 × 2 × 1 × 20)^(1/9) = 1600^(1/9) = 2.27
Compare this to the arithmetic mean of 4, which was inflated by our outlier. The value of 2.27 is much closer to the other measures of the centre, so it is a better summary statistic in this scenario.
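Again, this can be checked in software, either with a built-in function or from the product of the values:

```python
# Sketch: geometric mean of the same nine values, using scipy and the
# equivalent "product to the power 1/n" calculation.
from scipy.stats import gmean

treatments = [2, 1, 2, 2, 5, 1, 2, 1, 20]

print(gmean(treatments))    # about 2.27
print(1600 ** (1 / 9))      # the product of the nine values is 1600
```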
Weighted mean: In some cases we decide that certain values should carry more ‘weight’ than others. For example, in a course we may have three tests, of which the final test is weighted three times as much as each of the first two.
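For illustration, a weighted mean in Python; the marks used are invented.

```python
# Sketch: a weighted mean where the final test counts three times as much as
# each of the first two. The marks (60, 70, 80) are invented for illustration.
import numpy as np

marks = [60, 70, 80]
weights = [1, 1, 3]

print(np.average(marks, weights=weights))   # (60 + 70 + 3*80) / 5 = 74
```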
However, if the data distribution is skewed, the three measures of centrality will not coincide. The mean in
particular is susceptible to distortion with strong skew, or extreme value outliers, as we have already seen.
Therefore the mean should not be used for skewed data as it is likely to be a poor representation of the middle.
The median is much more reliable when we have outliers or skewed data as it represents the physical mid-point
of the distribution of values.
For a negatively skewed distribution, the mean will typically be lower than the median, so it will place the centre of the distribution at a lower value than is realistic. Conversely, for a positively skewed distribution, the mean will indicate a higher value than is realistic for the centre.
Measures of dispersion (variation)
There are many ways that the spread of the data can be described, but the two most important are the variance and the standard deviation. The simplest way to think of these is as the average amount by which an observation differs from the mean.
3 3 5 6 8 11 12 12 12 16 18 20 21 24 25
The arithmetic mean is 13.07, the median is 12 and the mode is 12. These measures of the centre are close together, suggesting a roughly symmetrical distribution.
To calculate the variance: subtract the mean from each value, then square the result. Then work out the arithmetic mean of these squared differences (i.e. sum them and divide by the number of observations).
The variance is therefore the average of the squared differences from the mean.
Why do we square the values? Try calculating without squaring: values below the mean give negative differences and values above give positive differences, so they would largely cancel each other out. Squaring makes all the differences positive, allowing them to be summed meaningfully.
However, because we squared the values, the variance is expressed in squared units and is difficult to interpret in the context of the raw data. To remove the squaring effect, we simply take the square root of the variance, returning the figure to the same units as the original data.
This is the Standard Deviation. In simple terms it represents the average difference of a value from the mean.
Thus, the larger the standard deviation value, the wider the spread of data. A small standard deviation means that
the distribution is compact around the central value.
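For example, the variance and standard deviation of the 15 values above can be computed step by step (here treating them as a complete population):

```python
# Sketch: variance and standard deviation of the 15 values above, following
# the steps described and treating them as a complete population.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

deviations = x - x.mean()              # difference of each value from the mean
variance = (deviations ** 2).mean()    # average of the squared differences
sd = np.sqrt(variance)                 # back in the original units

print(variance, sd)
print(np.var(x), np.std(x))            # numpy's defaults give the same result
```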
An important assumption for many statistical tests is that the variances are similar between two or more samples
being compared. So we can easily check this by comparing the standard deviation or variance values between the
samples.
Population or sample? The variance formula above works well for complete population data, but usually we are dealing with a sample from the population. A population will usually contain some extreme values (outliers), and a smaller sample is more likely to miss these extremes. As a result, sample variances tend to underestimate the population variance. We therefore adjust the formula to correct for this bias by using (n-1), one less than the sample size, in the denominator.
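In software this adjustment is usually controlled by a ‘degrees of freedom’ option; in Python’s numpy it is the ddof argument:

```python
# Sketch: the n-1 correction. ddof=1 divides by (n - 1) instead of n,
# giving the sample estimate of the population variance and standard deviation.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

print(np.var(x, ddof=0), np.std(x, ddof=0))   # divide by n   (population formula)
print(np.var(x, ddof=1), np.std(x, ddof=1))   # divide by n-1 (sample formula)
```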
Caution note: Like the mean, the variance and standard deviation are sensitive to skew. Therefore we can only
report the variance and standard deviation if we are reasonably sure the data is approximately normally
distributed. If skewed, another measure such as quartiles (p13) or box plots (p14) is a better summary.
Standard deviation and the empirical rule
The theoretical normal distribution has important properties that allow us to make predictions and probability
estimates. Read the separate guide ‘Continuous data distributions’ for more detail. The mean and standard
deviation can entirely describe a normally distributed set of data. This means that the distribution shape can be
reconstructed from these values using an equation. The resulting curve is called a probability density function
(again, refer to the guide mentioned). Arising from this curve is another rule concerning the standard deviation, called the empirical rule, which states that approximately 68% of values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations.
The example below shows the proportions of values contained between the standard deviation segments.
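The rule can be checked against the theoretical normal curve, for example:

```python
# Sketch: checking the empirical rule against the theoretical normal curve.
from scipy.stats import norm

for k in (1, 2, 3):
    proportion = norm.cdf(k) - norm.cdf(-k)   # area within k standard deviations
    print(f"within {k} SD of the mean: {proportion:.1%}")
# prints roughly 68.3%, 95.4% and 99.7%
```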
Standard error: Closely related to the standard deviation is the standard error. The standard deviation applies to distributions of observed data, whether a whole population or a sample drawn from it. It is a descriptive statistic, concerned with how spread out our data distribution is.
In inferential statistics, we use samples to estimate values for the population. For any sample, dividing the standard deviation by the square root of the sample size gives the standard error. This value indicates the likely amount of sampling error, or uncertainty, in our sample if we use it to generalise to the population; it tells us about the quality of our estimate. Therefore, for inferential statistics the standard error is of central importance, as it tells us the precision of a sample-based estimate. More on this can be found in the guide to ‘Sampling Theory and Estimation’.
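For example, using the nine treatment counts from earlier:

```python
# Sketch: standard error of the mean = standard deviation / sqrt(sample size),
# using the nine treatment counts from the earlier example.
import numpy as np

treatments = np.array([2, 1, 2, 2, 5, 1, 2, 1, 20])

sd = treatments.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(len(treatments))     # standard error of the mean
print(sd, se)
```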
Other measures of dispersion
Range: The range is simply the difference between the largest and smallest value. In our example from earlier:
3 3 5 6 8 11 12 12 12 16 18 20 21 24 25
the range is 25 – 3 = 22.
The range, however, is susceptible to outliers. Taking our first example of the number of treatments experienced by the 9 children:
1 1 1 2 2 2 2 5 20
the range is 20 – 1 = 19. This is quite misleading, as the majority of values lie between 1 and 5, so the single outlier of 20 distorts the range considerably. A better approach, which takes account of extreme values, is the inter-quartile range.
Inter-quartile range
The inter-quartile range is the difference between the upper and lower quartile (12 – 5) = 7
The inter-quartile range is a good indicator of dispersion where there are extreme outliers as these will be
excluded from this summary statistic. A graphical representation based on quartiles is the box and whisker plot
(page 14).
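For example, in Python (note that quartile values depend on the interpolation convention used, so software may give slightly different figures from a hand calculation):

```python
# Sketch: range and inter-quartile range for the 15-value example. Quartile
# values depend on the interpolation convention, so software output may
# differ slightly from a hand calculation.
import numpy as np

x = np.array([3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25])

data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(data_range, q1, q3, iqr)
```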
Deciles are another way of subdividing the array of data at each 10 percentage point, although used far less
frequently.
A box and whiskers plot (or ‘box-plot’) is constructed from quartile information, with the box representing the
inter-quartile range between the first and third quartiles, and the bars or ‘whiskers’ extending from the lowest
data point to the first quartile, and at the upper end, from the third quartile to the highest value in the dataset.
The median is shown as a line in the box.
Box and whisker plots are valuable graphical representations for displaying crude distributions of many samples in
parallel as any skew can easily be seen from the shape of the plots. For normally distributed data, the plot will be
symmetrical with the median sitting exactly mid-way along the box, and with whisker lines of equal length.
Positively skewed data will show a shorter whisker to the first quartile and the median closer to the first quartile.
Conversely, for negatively skewed data the median will be closer to the upper quartile.
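For illustration, a minimal box plot of the 15-value example in Python; the whis setting is used here so that the whiskers run to the minimum and maximum, as described above (matplotlib’s default instead caps the whiskers at 1.5 × the inter-quartile range).

```python
# Sketch: a box and whisker plot of the 15-value example. whis=(0, 100)
# draws the whiskers at the 0th and 100th percentiles (minimum and maximum),
# matching the description in the text.
import matplotlib.pyplot as plt

x = [3, 3, 5, 6, 8, 11, 12, 12, 12, 16, 18, 20, 21, 24, 25]

fig, ax = plt.subplots()
ax.boxplot(x, whis=(0, 100))
ax.set_ylabel("Value")
plt.show()
```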
Normal plots (Q-Q plots)
A Q-Q plot, or quantile-quantile plot, is a special type of plot that is valuable for visually assessing whether collected data fit a theoretical distribution shape. The most common form is the Normal plot, used to check whether data fit a normal distribution. A Normal plot is a type of scatter plot in which the quantiles of the observed data are plotted against the corresponding quantiles of a theoretical normal distribution.
To understand how this is constructed, here’s a simplified example. Let’s say we have 10 data points that we have
arranged from low to high:
-0.7 1.1 1.9 2.2 3.1 3.9 5.0 6.0 6.7 7.0
A quantile represents the proportion of the data lying at or below a given value. So for the median the quantile is 50% (i.e. 50% of the data lie below the median). We want to calculate the quantile for each of our data points.
In our example we have 10 data items, so imagine a normal distribution curve divided into 10 parts, each worth 10% of the distribution area. The lowest value, -0.7, represents the smallest 10% of the data, so we imagine it sitting in the middle of the 0%-10% band (i.e. at the 5% point, or 0.05 as a proportion).* The next smallest sits mid-way through the 10%-20% band (i.e. at 15%, or 0.15), and so on. These quantiles are entered in column 3.
Finally, we need to identify the equivalent value for the theoretical Normal distribution. These are standardised values, called z-scores, where 0 is the mean, -1 is one standard deviation below the mean, and so on (see the guide to Continuous data distributions). Using a computer, or reading from a standard Normal distribution table, we find that the lowest 5% of the distribution is cut off by a z-score of -1.645, the lowest 15% by a z-score of -1.036, and so on; these are entered in column 4.
The Normal quantile plot is then produced by plotting column 2 (the ordered data values) against column 4 (the z-scores).
* In our simple example we divided the normal curve into 10% segments because we had 10 raw data observations. More generally we would use a formula such as (i-0.5)/n to assign a quantile to the i-th ordered observation.
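For illustration, the construction described above can be reproduced in Python using the (i-0.5)/n rule:

```python
# Sketch: constructing the Normal (Q-Q) plot described above. Each ordered
# value is assigned the quantile (i - 0.5)/n and matched with the z-score
# that cuts off that proportion of the theoretical normal distribution.
import numpy as np
from scipy.stats import norm

data = np.sort([-0.7, 1.1, 1.9, 2.2, 3.1, 3.9, 5.0, 6.0, 6.7, 7.0])
n = len(data)

quantiles = (np.arange(1, n + 1) - 0.5) / n   # 0.05, 0.15, ..., 0.95
z_scores = norm.ppf(quantiles)                # e.g. -1.645 for the 5% point

# Plotting data against z_scores gives the Normal plot; a roughly straight
# line suggests the data are compatible with a normal distribution.
for value, z in zip(data, z_scores):
    print(f"{value:5.1f}  {z:6.3f}")
```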
Histograms are simple to produce and understand, and it is easy to make a crude visual assessment of the shape, mode, mid-point and spread. However, if the interval ‘bins’ are too big they can be difficult to interpret. It is not easy to compare multiple distributions side by side because of the size of the graph, and it can also be difficult to assess normality from a histogram for a small dataset.
Box-plots are very good for assessing the mid-point, skew and spread of multiple samples side by side. However, there is no way to distinguish multiple modes, so some important information is lost.
Normal plots are very good for assessing whether data come from a normal distribution (or, with Q-Q plots, other theoretical distributions), even for smaller datasets. However, it is more difficult to identify skew and other features of the distribution than with a histogram.