
CCS346 - EXPLORATORY DATA ANALYSIS (C311)

UNIT IV BIVARIATE ANALYSIS

Relationships between Two Variables - Percentage Tables - Analysing Contingency Tables - Handling Several
Batches - Scatterplots and Resistant Lines

COURSE OBJECTIVE:

To apply bivariate data exploration and analysis.

COURSE OUTCOME:

CO4: Apply bivariate data exploration and analysis.


Relationships between Two Variables
Bivariate Analysis:
● Represent relationship between two variables
● Can suggest hypotheses about the way in which the world works
● One variable - considered a cause and the other - an effect
● Can call these variables by different names.
❖ Cause Variable - Explanatory Variable - Independent Variable- X
❖ Effect Variable - Response Variable - Dependent Variable - Y
Causal reasoning - assisted by the construction of a schematic model of the hypothesized causes and
effects - a causal path model.
Eg: Social class of a child - likely to have an effect on its school performance

Such models are drawn up according to a set of conventions:


1. The variables - inside boxes or circles and labelled- Social class background and School
performance.
2. Arrows - from the cause variables to effect variables - from social class background to
school performance.
3. Positive effects - Solid lines and negative effects - dashed lines
4. Strength of explanatory variable - Number on the arrow
5. Not all the causes have been specified - an extra arrow on effect variable - unlabelled
Hypothetical causal path models
● Can be constructed before examining data
● Can represent expected relationships
● In non-experimental research - variables may be operating simultaneously - provides
invaluable insights

Percentage Tables
Introduction
● Deals with the relationship between two variables - both are nominal variables
● Techniques of analysis best suited to data
○ Percentages - old model
○ Inferential techniques based on loglinear models - Recent model
● The relationship between variables is of importance
Eg: Relationship between an individual's social class background and their propensity to participate in
higher education.
Social class - can be considered as nominal or ordinal
Proportions, percentages and probabilities
● Presentation of information - Nominal scale variable
○ Bar chart can be used
○ The height or length of the bar - represent the number of cases in a category -
relative size of categories could be understood
The same effect can be achieved numerically by means of proportions or percentages.
● Proportion - number in each category is divided by the total number of cases N -
1036/6180 = 0.168
● Percentages - proportions multiplied by 100 - 0.168*100 = 16.8

● Proportions and percentages can be converted back to raw frequencies if the total number of cases
is known
● Proportion - gives an impression of greater precision - several decimal places can be quoted even when the total
sample size is very small. Avoid proportions when the sample size is less than about 20.
● Proportions and percentages are bounded numbers
○ have a floor of zero, and a ceiling of 1.0 and 100 respectively.
○ When numbers are bounded at the top and the bottom - can cause problems in an analysis
based on very small or large percentages
● Proportions
can be used descriptively
can also be thought of as probabilities
Eg: Probability of an individual aged 19 in 2005 having a parent in a 'Higher professional' occupation is
0.168.
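This arithmetic is easy to reproduce; the following is a minimal Python sketch using the figures quoted above (1,036 cases out of N = 6,180):

```python
# Proportion and percentage from raw counts (figures quoted in the text).
n_category = 1036   # cases with a 'Higher professional' parent
N = 6180            # total number of cases

proportion = n_category / N
percentage = proportion * 100
print(round(proportion, 3))   # 0.168
print(round(percentage, 1))   # 16.8

# Converting back: multiplying the (unrounded) proportion by N
# recovers the raw frequency.
print(round(proportion * N))  # 1036
```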
Contingency tables
The distribution of a single variable can be represented graphically as a bar chart. The univariate
distributions of the social class background of individuals are shown in the figure.
These two distributions describe two separate features.
Eg: Class distribution - provides socio-economic status of parents with 19-year-old children in 2005. It
does not strictly represent the whole occupational structure. Individuals who did not have children would
be excluded from the analysis.
From the bar charts alone, nothing can be inferred about the relationship between the two variables.
● Joint distribution of the two variables- represented using three-dimensional bar chart
● Contingency table - represent the same numerically

● Contingent - 'true only under existing or specified conditions'


● Contingency table
○ shows the distribution of each variable conditional upon each category of the other.
○ Rows - categories of one of the variables
○ Column - categories of the other variable
○ Cells ('pigeonholes') - each case is tallied in a cell according to its value on both variables.
Cell frequency - number of cases in each cell.
● Marginals - used to obtain univariate distributions
○ Row total - presented at the right end
○ Column total - presented at the bottom end

● Depicts the bivariate relationship between the two variables


● Hard to grasp - difficult to draw conclusions on the basis of the raw numbers alone
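A contingency table with marginals can be tallied directly with pandas; the sketch below uses a handful of invented cases and illustrative labels, not the survey data itself:

```python
import pandas as pd

# Hypothetical individual-level records (labels are illustrative).
df = pd.DataFrame({
    "social_class": ["Higher professional", "Routine", "Routine",
                     "Higher professional", "Lower supervisory"],
    "activity_at_19": ["Full-time education", "Out of work",
                       "Full-time education", "Full-time education",
                       "Out of work"],
})

# Rows: categories of one variable; columns: categories of the other.
# margins=True appends the row and column totals (the marginals).
table = pd.crosstab(df["social_class"], df["activity_at_19"], margins=True)
print(table)
```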
Percentage tables
● Cast data in percentage form
● Common way to make contingency tables readable
● There are three different ways
○ Total Percentage
○ Row Percentage
○ Column Percentage
● Total Percentage:
○ divide each cell frequency by the grand total
○ Table as a whole is little more readable than the raw frequencies
○ Not often constructed
Eg: Shows that the 663 respondents with higher professional parents who were in full-time education at
age 19 represented 10.8 per cent of the total population aged 19 in 2005.
Row Percentage:
● Divide each cell frequency by its appropriate row total
● Usually read down the columns
● Reading along the rows would convey already-known information,
i.e. the marginal distribution and the fact that the percentages sum to 100

The row percentages show the different outcomes for individuals with a particular social class
background. Nearly two-thirds of those with a parent in a higher professional occupation are still in
full-time education at age 19, whereas less than a quarter of those with parents in Lower supervisory or
Routine occupations are still in full-time education by this age.
Column Percentage:
● Also called an 'outflow' table
● Better suited for exploring causal ideas

The fifth column of data shows that of all those who are 'out of work', only 5.3 per cent,
i.e. just over five in every hundred or one in twenty, are from backgrounds where one of the parents was a
higher professional, whereas approximately a quarter (25.1 per cent) of those who were 'out of work' had
parents with routine occupations.
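All three percentaging schemes map directly onto pandas' normalize logic. In this sketch the cell frequencies are invented (only the 663 figure comes from the text), so the resulting percentages are purely illustrative:

```python
import pandas as pd

# A small contingency table; 663 is quoted in the text, the rest is invented.
table = pd.DataFrame(
    {"Full-time education": [663, 320], "Out of work": [55, 260]},
    index=["Higher professional", "Routine"],
)

total_pct = table / table.values.sum() * 100             # divide by grand total
row_pct = table.div(table.sum(axis=1), axis=0) * 100     # each row sums to 100
column_pct = table.div(table.sum(axis=0), axis=1) * 100  # each column sums to 100
print(row_pct.round(1))
```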
Good table manners
A well-designed table is
● Easy to read
● Takes effort, time and perhaps many drafts to perfect
● helps the data analyst
● can reveal patterns in the data
● Guidelines on how to construct a lucid table of numerical data
1. Reproducibility versus clarity
● Presenting data does two jobs
○ tell a story
○ check the conclusions by inspecting the data
● For clarity - prefer visual display
● Story line - leave out extraneous detail
● Leave original data as much as possible in numerical form - to allow others to inspect and
possibly reinterpret the results.
2. Labelling
● A clear title - should summarize the contents
● Should be as short as possible
● Should specify when the data were collected, the geographical unit covered, and the unit of
analysis.
● Label - variables included in the rows and columns
● Do not use mnemonics
3. Sources
● Specify the actual source of the data - it is not sufficient to say 'from Social Trends'
● Include the volume and year, table or page, column in a complex table
4. Sample data
● Provide special referencing if data are based on a sample drawn from a wider population,
● Provide enough information to assess the adequacy of the sample.
● Include the following details
○ method of sampling (for example 'stratified random sample' or 'sample based on
interlocking age and sex quotas'),
○ achieved sample size,
○ response rate or refusal rate,
○ geographical area which the sample covers
○ frame from which it was drawn.
5. Missing data
● One of the most common ways in which data can mislead is an unstated principle of selection
● Do not exclude cases from the analysis, miss out particular categories of a variable or ignore particular
attitudinal items without saying so
● Providing an overall response rate does not provide details about missing information.
6. Definitions
● No hard and fast rule about how much definitional information to include in the tables. Tables can
become unreadable if too much is included.
● Complex terms are explained elsewhere in the text - include a precise section or page reference
7. Opinion data
● When presenting opinion data, always give the exact wording of the question put to respondents,
including the response categories
● There can be big differences in replies to open questions and forced choice questions
○ 'Who do you think is the most powerful person in Britain today?' -open question
○ 'Which of the people on this card do you think is the most powerful person in Britain
today?' - forced choice question
8. Ensuring frequencies can be reconstructed
● It should be possible to convert a percentage table back into the raw cell frequencies.
● To retain the clarity of a percentage table - provide the minimum number of base Ns needed for
the entire frequency table to be reconstructed.
9. Showing which way the percentages run
● Proportions add up to 1 and percentages add up to 100
● rounding may mean that they are slightly out - total comes to something other than the expected
figure because of rounding error.
● It is usually helpful to include an explicit total of 100
10. Layout
● The effective use of space and grid lines - decides whether the table is easy to read or not.
● In general, white space is preferable, but grid lines can help indicate how far a heading or
subheading extends in a complex table.
○ Eg: Tables of monthly data can be broken up by spaces between every December and
January
○ Labels must not be allowed to get in the way of the data.
○ Set variable headings off from the table, and further set off the category headings. Avoid
underlining words or numbers.
Clarity is often increased by reordering either the rows or the columns. It can be helpful to arrange
them in increasing order of size, or size of effect on another variable. Make a decision about which
variable to put in the rows and which in the columns by combining the following considerations:
1. closer figures are easier to compare;
2. comparisons are more easily made down a column;
3. a variable with more than three categories is best put in the rows so that there is plenty of room for
category labels.
Analysing Contingency Tables

Percentage tables - way of making contingency data more readable


● Ways of analysing contingency data - to generate a summary measure of the effect of one variable
upon another - one variable is the explanatory variable and the other is the response or outcome
variable.
Which way should proportions run?
● Hypothesis about the possible causal relationship between variables - conveyed by the choice of
which proportions are used for analysis.
○ age - associated with whether individuals feel safe walking alone after dark
○ older people, and particularly older women, are more likely to feel unsafe than younger
individuals
Eg:
● Explanatory variable - old age
● Response or outcome variable - feeling unsafe
○ Does not suggest that feeling unsafe causes people to be old
● Rule when dealing with contingency data
○ Construct the proportions so that they sum to one within the categories of the
explanatory variable.
○ would only work if the explanatory variable was always put in the rows
○ response variable provides the proportions, and the explanatory variable the
categories.

The base for comparison


One variable in a table has a likely causal effect on the other variable and this can help to interpret
the figures in a meaningful way.

● Explanatory variable - age group, a three-category variable


● Response or outcome variable - feelings of safety walking alone after dark
● Rule - the proportions expressing fear of walking alone after dark should be calculated within
each age group. The data shown are the full raw frequencies
○ Strictly, the data should be 'weighted' to take account of the design effects and to provide the best
possible estimates of summary statistics
○ For simplicity, the analyses are carried out on unweighted data
● Numbers - have meaning in comparison with other numbers.
To decide whether 0.07 of individuals aged 16-39 reporting feeling 'very unsafe' when walking alone
after dark is high or low, it can be compared with the 0.07 of individuals aged 40-59 and 0.16 of
individuals aged 60 or over.
One category is picked to act as the base for comparison with all other categories.
By making comparisons with the base
● quantitative estimates of the causal effect of one variable on another can be made
● positive and negative relationships between nominal level variables can be distinguished
● Which category should be selected as the base for comparison?
● to some extent arbitrary
● there should be a substantial number of cases in it.
● should be a category of substantive interest.
● the category that is markedly different from others
● keep negative relationships between variables to a minimum - useful when several variables'
interrelationships are to be examined
Base category - which should be used for comparison among age groups feeling unsafe walking alone after dark?
A category with a relatively large number of individuals within it - since the age groups are all of similar
size, any one of them could be used as the base category.
If we select the youngest age group as the base and then pick feeling very unsafe as the base for
comparison in the fear of walking alone after dark variable, we will almost certainly avoid too many
negative relationships. In summary, each age group can be compared with those aged 16-39 in their
feeling very unsafe when walking alone after dark.
A three-category variable such as age group can be represented in a causal path model as two dichotomous variables.

Choosing one category as a base effectively turns any polytomous variable into a series of dichotomous
variables known as dummy variables.

Age group is represented by two dummy variables. The effect of the first is denoted b1 and the effect of
the second b2. A line is drawn under which the base category of the explanatory variable is noted; the fact
that some young people are afraid of walking alone after dark (path a) reminds us that there are some
factors influencing feeling very unsafe that this particular model does not set out to explain.
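Turning a polytomous variable into dummy variables is mechanical; a pandas sketch, assuming (as in the text) that the 16-39 group is chosen as the base category. Column names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age_group": ["16-39", "40-59", "60+", "16-39"]})

# One 0/1 column per category...
dummies = pd.get_dummies(df["age_group"], prefix="age")

# ...then drop the base category, leaving two dummy variables whose
# effects correspond to b1 and b2 in the causal path model.
dummies = dummies.drop(columns="age_16-39")
print(dummies)
```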
Summarizing effects by subtracting proportions
d - the difference in proportions
● This measure of effect has two virtues: it is simple and intuitively appealing.
● In this example, it is the proportion of people who feel very unsafe that is at issue, and the shadow
proportion, i.e. those who feel 'very safe' 'fairly safe' or 'a bit unsafe', is ignored.
● The effect, d, is calculated by subtracting the proportion in the base category of the explanatory
variable from this proportion in the nonbase category of the explanatory variable.
In this particular example, path b1 represents the effect of being in the oldest age group as opposed to
being in the youngest age group on the chances of feeling very unsafe walking alone after dark. It is
found by subtracting the proportion of the youngest age group feeling very unsafe from the proportion of
the oldest age group giving the same response.
Eg: d = 0.16 - 0.07 = +0.09; the sign is positive - older people are more likely to be afraid of walking alone
after dark than are the youngest age group.
If different base case selected - can end up with negative values of d
Eg: Trying to explain feeling safe when walking alone after dark, the d for the oldest age group would
have been 0.84 - 0.93 = -0.09. The magnitude of the effect would not have altered but the sign would have
been reversed.
Path b2 represents the effect of being in the middle age group on feeling very unsafe walking alone after
dark. We might expect this to be lower than the effect of being in the oldest age group. It is.
d = 0.07 - 0.07 = 0; the younger two age groups are extremely similar in their fear of walking alone
after dark.
The value of path a is given by the proportion of cases in the base category of the explanatory variable
who fall in the non-base category of the response variable.
7 per cent of those in the youngest age group reported feeling very unsafe walking alone after dark. The
value of this path is therefore 0.07.

The quantified model allows us to decompose the proportion of older people who are fearful of walking
alone after dark (0.16) into a fitted component (0.07) and an effect (+0.09).
A simple relationship between an explanatory variable X and a response variable Y can be expressed as

Y = a + bX

With two dummy explanatory variables, proportions can also be expressed in the same way. The overall
proportion Y who feel very unsafe when walking alone after dark is 4,614/44,786, or 0.103. This can be
decomposed as:

Y = a + b1X1 + b2X2 = 0.07 + (0.09 x 0.338) + (0 x 0.334)

where X1 and X2 are the proportions of the sample in the oldest and middle age groups (0.338 and 0.334)
respectively (the reconstructed total differs slightly from 0.103 because the components are rounded).
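The effects and the decomposition can be checked numerically. This sketch uses only the rounded proportions quoted in the text, which is why the reconstructed overall proportion comes out slightly below 0.103:

```python
# Proportions feeling very unsafe, as quoted in the text.
p_base = 0.07    # ages 16-39 (base category; this is path a)
p_middle = 0.07  # ages 40-59
p_oldest = 0.16  # ages 60+

b1 = p_oldest - p_base  # +0.09, effect of the oldest group
b2 = p_middle - p_base  #  0.00, effect of the middle group

# Decompose the overall proportion: a + b1*X1 + b2*X2, where X1 and X2
# are the sample shares of the oldest and middle groups.
X1, X2 = 0.338, 0.334
overall = p_base + b1 * X1 + b2 * X2
print(round(overall, 3))  # ~0.100; the text's 0.103 differs through rounding
```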
Properties of d as a measure of effect
The difference in proportions, d, has been used to summarize the effect of being in a category of
one variable upon the chances of being in a category of another.
Advantages.
● People understand it intuitively
● It retains the same numerical value, if the other category in a dichotomy is chosen as the base for
comparison.
● it can be used to decompose overall proportions - and it can be decomposed itself
● If the proportions were run in the opposite direction, the value of d would change. Some
statisticians dislike this property: they prefer symmetric measures of association which take the
same value whichever way round the causal effect is presumed to run.
● Quantitative summaries of causal effects such as d, however, are nearly all asymmetric, taking
different values depending on which variable is presumed to be the cause of the other. They force us to
be explicit about our hypotheses as to which variable is the explanatory variable and which is the
response.
● Proportions are bounded numbers - d fails to distinguish differences between proportions in the middle of
the range from those between extreme proportions.
In the above example, the proportions stating that they felt very unsafe walking alone after dark were all
relatively small. This means that although the size of the effect d for the oldest age group is relatively
small (0.09), older people are more than twice as likely as the younger age groups to say that they feel
very unsafe. For proportions in the middle of the range, e.g. 0.51 compared with 0.60, the difference d
would be identical to that calculated above.
How large a difference in proportions is a 'significant' difference?
The results found using the large sample - will be extremely similar to the results obtained from
another large sample, or even the whole population

● Women are more likely than men to feel very unsafe walking alone after dark (0.16 vs 0.03; with
men as the base category, d = 0.13).
● Provided a sample is of sufficient size (above around 30 cases) then we know that the distribution
of sample means (or any other sample parameter) for all samples of that size drawn from the
population will have a normal distribution (bell-shaped curve). This is the case regardless of the
shape of the distribution of the original variable of interest in the population.
Eg: weekly income - positively skewed - if repeated random samples are drawn and the mean of each sample
is plotted as a histogram, the distribution of these means would have a normal or Gaussian shape. This
result is known as the Central Limit Theorem.
However, not all surveys have such large sample sizes and if, for example, we read a report stating that
women were more concerned about walking alone after dark than men, based on twenty-five interviews
with men and twenty-five interviews with women, in which four women and one man stated that they felt
very unsafe walking alone after dark, we might be rather less convinced.
'inferential statistics' - how to use analysis of data from samples of individuals to infer information about
the population as a whole.
Statistic known as 'chi-square' can be calculated which provides useful information about how far the
results from the sample can be generalized to the population as a whole.
Calculating the chi-square statistic - innocent until proven guilty?
Eg: 100 men and 100 women are asked about their fear of walking alone after dark.
Until the survey is conducted - we only have information on the number of men and women in our sample

After the survey - found that in total 20 individuals, i.e. 0.1 of the sample, state that they feel very unsafe
when walking alone after dark.

If, in the population as a whole, the proportion of men who feel very unsafe walking alone after dark is
the same as the proportion of women who feel very unsafe walking alone after dark, the proportions and
frequencies would be as given below
After the survey is carried out, the expected values are cross-tabulated with the observed values

To judge whether there is a relationship between gender and fear of walking alone after dark - compare
actual values observed, with expected values
The chi-square statistic provides a formalized way of making comparison.
● Chi-square statistic also has a distribution.
● Given the low value of probability - reject the 'null hypothesis' that there is no relationship
between gender and feeling unsafe walking alone after dark
● Conclude that the table shows a 'statistically significant' relationship between the variables.
● No mathematical way of deciding the cut-off-point - below which the null hypothesis is rejected.
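In practice the statistic and its probability are computed by software rather than by hand. A sketch with scipy follows; the text fixes only the marginals (100 men, 100 women, 20 people feeling very unsafe in total), so the cell split used here is invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies (rows: men, women; columns: very unsafe, not).
# The 4/16 split is hypothetical; only the marginals come from the text.
observed = np.array([[4, 96],
                     [16, 84]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)  # under the null, 10 'very unsafe' in each gender
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, df = {dof}")
```

A small probability here would lead us to reject the null hypothesis of no relationship between gender and feeling unsafe.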
Type 1 and Type 2 errors
The level of probability associated with a particular chi-square gives a measure of how likely we are to be
mistaken. This probability is sometimes thought of as 'Type 1' error: the probability of concluding that a result
applies to the population, when in fact we have simply been unlucky in selecting a random sample with an
unusual profile.
Eg: probability of making a Type 1 error is 0.002 or 2 in a thousand - very low value that makes us
confident that the null hypothesis is to be rejected.
● With a higher probability associated with a chi-square, such as 0.2, there is a two in ten chance
of making a Type 1 error - decide not to reject the null hypothesis.
● Reject the null hypothesis if the probability is less than 0.05 (or 1 in 20) and accept the null
hypothesis if the probability is greater than or equal to 0.05.
● When the sample size is small - a difference may exist between two groups, but the probability
associated with the chi-square is above the conventional cut-off of 0.05 - risk of making a 'Type 2'
error.
Eg: Women are slightly more likely than men to report that they feel very unsafe
With a sample size of just 165 individuals chi-square is calculated to be 0.241 with an associated
probability of 0.623 and we do not reject the null hypothesis
Type 2 error - fail to reject the null hypothesis even though there is a difference between the groups in the
population.
Type 2 errors are particularly common when we have a small sample size or when we are looking at a
small subgroup within a larger survey.
Degrees of freedom
● The probability associated with a particular value of chi-square does not only depend on the size
of the chi-square, but also the size of the table from which chi-square was calculated.
● A table with two rows and two columns- one degree of freedom because once one cell is known,
the values in the other cells can be calculated based on the row and column marginals.
● A table with two columns and three rows- two degrees of freedom.
● In formal terms the number of degrees of freedom for a table with r rows and c columns is given by:

degrees of freedom = (r - 1) x (c - 1)

For a table with 4 rows and 5 columns the degrees of freedom would be (4 - 1) x (5 - 1) = 12
Sample size and limitations on using the chi-square statistic
● Size of sample required partly depends on the distribution of the variables of interest.
● Chi-square - enables to focus on the relationship between two categorical variables.
● There are a number of different stages to understand how to interpret a statistic such as chi-square
with its associated probability.
● Practical guide of interpreting contingency tables:
1. It is important to ensure that the proportions sum to 1 for each category of the explanatory variable.
2. When comparing proportions or percentages, ensure that the numbers being compared do not add up to 1
(or 100 in the case of percentages).
3. If there are more than two categories of the explanatory variable decide which should be the base
category and compare the other categories with this one.
4. Note the size of the table - if there are too many categories, collapse some of the categories and simplify the
table - the smallest possible table is a two-by-two table with one degree of freedom.
5. Inspect the value of chi-square associated with the table. The larger it is, the more likely the table is
to show a statistically significant association i.e. to have an associated probability less than 0.05
(written p < 0.05). For a two by two table (1 degree of freedom) a value of chi-square larger than 3.84
will be significant. For a table with two degrees of freedom a value of chi-square of 5.99 is needed and
for a table with three degrees of freedom chi-square needs to be 7.81 to be significant.
6. Check the probability associated with chi-square - gives the probability that the values observed in
the table have occurred by chance under the assumption that in the population as a whole there is no
association between the variables in the table. The smaller the value - reject the null hypothesis and that
there is a statistically significant relationship between the variables.
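The critical values quoted in point 5, and the degrees-of-freedom rule, can be checked directly against the chi-square distribution; a minimal sketch:

```python
from scipy.stats import chi2

# 95th percentile of chi-square: the significance threshold at p < 0.05.
for df in (1, 2, 3):
    print(df, round(chi2.ppf(0.95, df), 2))  # 3.84, 5.99, 7.81

# Degrees of freedom for an r-by-c table: (r - 1) * (c - 1).
r, c = 4, 5
print((r - 1) * (c - 1))  # 12
```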
Handling Several Batches

● Boxplot - facilitates comparisons between distributions


● Explore the association between a variable measured at the interval level and a variable measured
at a nominal level
Unemployment - an issue of concern

The United Kingdom had relatively low levels of unemployment compared with Germany, France and Spain,
but rates were even lower in Denmark and the Netherlands. The unemployment rates in Poland can be
seen to have been particularly high. Regional variation in unemployment rates is shown above (figure 8.3).
The histogram and the five number summary of median, quartiles and extremes are shown
The summary focuses on four different aspects of the distribution: the level, spread, shape and outliers. Typical
local unemployment rates in 2005 were around 4.2 per cent, and the middle 50 per cent of areas varied
between 3.1 and 5.6 per cent, giving a midspread of 2.5 (i.e. 5.6 - 3.1). The main body of the data is
roughly symmetrical, and there are no points that stand out as being a long way from the main bulk of the
data.

The regional median is not the same as the national average. It is slightly lower (3.9 per cent
compared with 4.2 per cent). There is also somewhat less variation in the unemployment rates
experienced by towns in the same region than there was in the sample of areas drawn from all over the
country. Two areas that appeared relatively normal in the context of the national spread of unemployment
rates, Mansfield and Nottingham
Boxplots
● It is important to display data well when communicating it to others. Pictures are better at
conveying the story line than numbers.
● Pictorial representation of data - the histogram - preserves a great deal of the numerical
information.
● The boxplot is a device for conveying the information in the five number summaries economically
and effectively.

The middle 50 per cent of the distribution is represented by a box. The median is shown as a line
dividing that box. Whiskers are drawn connecting the box to the end of the main body of the data. They
are not drawn right up to the inner fences because there may not be any data points that far out. They
extend to the adjacent values, the data points which come nearest to the inner fence while still being
inside or on them. The outliers are drawn in separately. They can be coded with symbols to denote
whether they are ordinary or far outliers, and are often identified by name.
Outliers are points that are unusually distant from the rest of the data. To identify the outliers in a
particular dataset, a value 1.5 times the dQ, or a step, is calculated; as usual, fractions other than one-half
are ignored. Then the points beyond which the outliers fall (the inner fences) and the points beyond which
the far outliers fall (the outer fences) are identified; inner fences lie one step beyond the quartiles and
outer fences lie two steps beyond the quartiles.
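The five-number summary, the step and the fences can be computed directly; a sketch on an invented batch of local unemployment rates (matplotlib's default whiskers already implement the 1.5 x dQ rule):

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented batch of local unemployment rates (per cent).
rates = np.array([2.1, 3.0, 3.1, 3.8, 4.2, 4.4, 5.0, 5.6, 6.1, 9.9])

q1, median, q3 = np.percentile(rates, [25, 50, 75])
dq = q3 - q1          # midspread
step = 1.5 * dq
inner_fences = (q1 - step, q3 + step)          # beyond these: outliers
outer_fences = (q1 - 2 * step, q3 + 2 * step)  # beyond these: far outliers

print(rates.min(), q1, median, q3, rates.max())  # five-number summary
plt.boxplot(rates, whis=1.5)  # whiskers stop at the adjacent values
plt.show()
```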
Outliers
● points which are a lot higher or lower than the main body of the data.
● require the data analyst's special attention.
● They are important as diagnostics, and they can arise for one of four reasons:
1. They may just result from a fluke of the particular sample that was drawn. - assessed by traditional
statistical tests
2. They may arise through measurement or transcription errors,
3. They may occur because the whole distribution is strongly skewed. - need to transform the data.
4. They may suggest that these particular data points do not really belong substantively to the same data
batch.

Rule of thumb:
We define the main body of the data as spreading from one and a half times the dQ higher than the
upper quartile to one and a half times the dQ lower than the lower quartile. Values further out than this
are outliers. Moreover, it is also useful to define a cut-off for even more unusual points: we treat as a far
outlier any point which is more than three times the dQ higher than the upper quartile or lower than the lower quartile.
Multiple boxplots
● Boxplots, laid out side by side permit comparisons to be made with ease.
The level:
● Median unemployment rate varies from approximately 3 per cent in the South West to around 7.5
per cent in London.
● The original hypothesis was therefore correct: one source of the variation in local area
unemployment rates is a regional effect. If there was no such effect, the medians would all be
roughly the same. An unemployed worker who travelled from region to region in search of work
would certainly find different average levels of unemployment prevailing.
The spread:
● There is some regional variation in spread.
● The South West and East Midlands have a smaller midspread than other areas, and Yorkshire and
Humberside has a larger midspread.
● There is no systematic evidence that the spread increases as the median increases
The shape:
● The datasets seem about as symmetrical as batches of data ever are. Some of the batches have
longer upper tails than lower tails; this is because unemployment overall was relatively low in
2005.
● The floor of zero unemployment prevents the possibility of a long lower tail at a time when
median unemployment is relatively low.
Outliers:
● The regional outliers are generally not the same as the outliers in the whole batch. For example,
Greenwich and Enfield appear to be outliers when viewed in the context of the country as a whole,
but are seen to be almost normal in the context of London's unemployment rate.
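Side-by-side boxplots are a one-liner once the data are in a data frame; a sketch with invented regional values:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented local unemployment rates for two regions.
df = pd.DataFrame({
    "region": ["London"] * 5 + ["South West"] * 5,
    "rate": [6.8, 7.2, 7.5, 7.9, 9.0, 2.5, 2.8, 3.0, 3.2, 3.6],
})

# One boxplot per region, sharing a common scale for easy comparison.
df.boxplot(column="rate", by="region")
plt.show()
```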

Decomposing the variation in unemployment rates


Implicit model - that variation in unemployment rates can be accounted for partly by region. This
can be formalized as follows:
Unemployment rate = Regional fit + Residual
This model says that part of the reason for variation in all the local areas in the country stems from
the fact that they are in different regions, and in part stems from other factors unconnected with region.
The unemployment rate in Ealing, for example (7.9 per cent), can be decomposed into a typical rate for
London (7.5 per cent) and a residual of +0.4 per cent.
The regional medians could themselves be decomposed into two parts: a grand median and a
regional effect, which indicates how far the median rate in a particular region deviates from the grand
median:
Unemployment rate = Grand median + Regional effect + Residual
The median unemployment rate in the whole of Great Britain was 4.2 per cent, so the London
regional median of 7.5 per cent was 3.3 per cent higher. The unemployment rate in Ealing can therefore
also be decomposed as 4.2 + 3.3 + 0.4 or 7.9 per cent as before.
An effect is here calculated as the difference between a conditional fit and the grand median. It is
the value that is entered on the arrows in causal path models.
To quantify how much of the variation in unemployment rates is accounted for by region, the
regional median is fitted to each value, residuals from the fit are calculated, and the variation in the
residuals is compared with the original variation. Figure 8.8 shows a worksheet where this is done for the
sample of local areas. The residuals from the regional fit show the variation in unemployment rates not
accounted for by region; Ribble Valley, for example, has a very low unemployment rate for the North
West.
The residuals can be displayed using a histogram and summary descriptive statistics
● The midspread in the original sample batch was 2.5, the residual midspread is now reduced
to only 2.0.
● The variation in all areas taken together is larger than that in the regions taken separately.
● This reduction in midspread is an indicator of the strength of the effect of the explanatory
variable on the response variable.
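The Ealing decomposition can be reproduced arithmetically; a sketch using only the figures quoted above (grand median 4.2, London median 7.5, Ealing 7.9). With a full data frame, the regional fits would come from a groupby-median:

```python
# Unemployment rate = Grand median + Regional effect + Residual
grand_median = 4.2    # Great Britain, per cent
london_median = 7.5   # regional fit for London
ealing_rate = 7.9

regional_effect = london_median - grand_median    # +3.3
residual = ealing_rate - london_median            # +0.4
print(grand_median + regional_effect + residual)  # 7.9, as before

# With a full data frame df of areas, the fits and residuals would be, e.g.:
# df["fit"] = df.groupby("region")["rate"].transform("median")
# df["residual"] = df["rate"] - df["fit"]
```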
Moving to the individual as a unit of analysis and using a statistical test

The T-test
● Provides a measure of the difference between the means of two groups.
● Also takes account of the amount of variation within each group

● The t-statistic is

t = (X̄1 - X̄2) / √(s1²/n1 + s2²/n2)

● where X̄1 and X̄2 are the means of the two samples
● s1 is the standard deviation of one sample and s2 is the standard deviation of the other sample
● n1 is the sample size of the first sample and n2 is the sample size of the second sample
● The size of the t-statistic depends on the size of the difference between the two means adjusted for
the amount of spread and the sample sizes of the two samples.
To test whether boys' and girls' mathematics scores are significantly different (that is whether the
differences in the sample represent differences in the population)
Null hypothesis- boys' and girls' mathematics scores are the same, or in other words that there is
no association between the variables gender and mathematics score.
The T-statistic is then calculated in order to help us decide whether to reject the null hypothesis. If
we find that we can reject the null hypothesis this would lead us to believe that there is an association
between gender and mathematics score in the population of interest.
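A sketch of the t-test with scipy on invented scores; the 'manual' line mirrors the formula above, and equal_var=False corresponds to not assuming equal variances (taken up in the next subsection):

```python
import numpy as np
from scipy import stats

# Invented mathematics scores for two groups.
rng = np.random.default_rng(0)
boys = rng.normal(loc=20, scale=5, size=50)
girls = rng.normal(loc=22, scale=5, size=50)

t, p = stats.ttest_ind(boys, girls, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")

# The same t, computed from the formula in the text.
t_manual = (boys.mean() - girls.mean()) / np.sqrt(
    boys.var(ddof=1) / len(boys) + girls.var(ddof=1) / len(girls))
print(round(t_manual, 2))
```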
Assuming equal variances?
● There is a slightly different formula for calculating the t-statistic depending on whether the two
groups have similar levels of variability within them.
● Results are almost completely identical regardless of whether equal variances are assumed - when
the sample size in the two groups is almost the same.
Comparing more than two groups: parental interest and children's mathematics scores
● Need a technique and a statistical test that is appropriate for comparing more than two groups.
For example, if we wish to examine the impact of mothers' interest in education on their children's
performance on the maths test we can once again use the technique of multiple boxplots to explore the
data.

Children's mathematics score and their mothers' interest in their education appear to be closely associated.
One-way analysis of variance
When we have a continuous dependent variable (such as maths test score) and a categorical
independent variable with more than two categories (such as mother's interest in education), the
appropriate statistical procedure to use is 'One-way analysis of variance'.
Once we shift from exploratory data analysis, in the form of comparing boxplots, to inferential
statistics we also need to shift our description of the level and spread of the data from the median and the
interquartile range to the mean and the standard deviation.
One-way analysis of variance is used to test the null hypothesis that a number of independent
population means are equal.
Eg: test whether the mean mathematics scores for different groups of children, defined by their mothers'
interest in education, are equal
Examine the data - produce descriptive statistics for each of the groups.
The average (mean) mathematics score for children with 'Very interested' mothers is 22.27, compared
with 18.07 for the 'Over concerned' mothers and just 9.83 for the mothers who have 'little interest'.
The one-way analysis of variance procedure is based on the fact that there is a known statistical
relationship between the variability of means and the variability of individual observations from the same
population. If the independent sample means vary more than would be expected, if the null hypothesis is
true, the null hypothesis should be rejected in favour of the alternative hypothesis that the means of the
separate groups within the population are not all equal.

The between-groups mean square indicates how much individual observations should vary if the
null hypothesis is true. This is computed based on the variability of the sample means. In this case it is
based on the variability between the five mean mathematics test scores for the groups defined by mothers'
interest in education. The within groups mean square indicates how much the observations within the
groups really do vary.
The F statistic is then calculated based on the ratio of these two estimates of the variance. The
between-groups mean square is divided by the within groups mean square. The larger the value of the F
statistic the more likely you are to reject the null hypothesis. The distribution of the F statistic depends
both on the number of groups that are being compared (5 in this example) and the number of cases in the
groups (13,724 in total). The between-groups degree of freedom is one fewer than the number of groups
(i.e. 5 - 1 = 4 in this example), and the within-groups degrees of freedom is the number of cases minus the
number of groups (i.e. 13,724 - 5 = 13,719 in this example).
● To interpret the results of this one-way analysis of variance, we need
● the F statistic and
● its associated probability or 'significance'.
If the observed probability is small then the null hypothesis that all group means are equal within
the population can be rejected. In this example the probability is so small (less than 0.001) that the
computer prints the value as .000. Therefore we can reject the null hypothesis that mothers' interest in
education is not associated with children's score on the maths test.
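One-way analysis of variance is available in scipy; this sketch invents scores centred on the three group means quoted above rather than using the survey data:

```python
import numpy as np
from scipy import stats

# Invented scores, centred on the group means quoted in the text.
rng = np.random.default_rng(1)
very_interested = rng.normal(22.27, 6.0, 200)
over_concerned = rng.normal(18.07, 6.0, 200)
little_interest = rng.normal(9.83, 6.0, 200)

# F = between-groups mean square / within-groups mean square.
f, p = stats.f_oneway(very_interested, over_concerned, little_interest)
print(f"F = {f:.1f}, p = {p:.3g}")  # small p: reject equal population means
```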
Scatterplots and Resistant Lines
Introduction
● Techniques for dealing with the relationship between two interval level variables.
● The data are often called paired, X-Y data, since for each case there is a pair of values to be displayed
together.
● The techniques can also be used to analyse data at the level of the individual, rather than using
local authorities as the unit of analysis.
Scatterplots
To depict the information about the value of two interval level variables at once, each case is
plotted on a graph known as a scatterplot. Visual inspection of well-drawn scatterplots of paired data can
be one of the most effective ways of spotting important features of a relationship.

A scatterplot has two axes — a vertical axis, conventionally labelled Y and a horizontal axis,
labelled X. The variable that is thought of as a cause (the explanatory variable) is placed on the X-axis
and the variable that is thought of as an effect (the response variable) is placed on the Y-axis. Each case is
entered on the plot at the point representing its X and Y values.
Scatterplots depict bivariate relationships. To show a third variable would require a
three-dimensional space, and to show four would be impossible. However, the value of a third nominal
variable can often usefully be shown by using a different symbol for each value of a third variable.
Scatterplots are inspected to see if there is any sort of pattern visible, to see if the value of Y could
be predicted from the value of X, or if the relationship is patternless.
1. Is the relationship monotonic?
2. Are the variables positively or negatively related?
3. Can the relationship be summarized as a straight line or will it need a curve?
4. How much effect does X have on Y? In other words, how much does Y increase (or decrease) for
every unit increase of X?
5. How highly do the variables correlate? In other words, how tightly do the points cluster around a
fitted line or curve?
6. Are there any gaps in the plot? Do we have examples smoothly ranged across the whole scale of X
and Y, or are there gaps and discontinuities? Caution may need to be exercised when one is making
statements about the relationship in the gap.
7. Are there any obvious outliers? One of the major goals of plotting is to draw attention to any unusual
data points.
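A sketch of a basic scatterplot with the conventional axis assignment (explanatory variable on X, response on Y); the data points are invented:

```python
import matplotlib.pyplot as plt

# Invented paired X-Y data for a handful of areas.
lone_parents = [3.1, 4.5, 5.2, 6.8, 8.0, 9.3]  # explanatory variable (X)
no_car = [15.0, 20.0, 24.0, 30.0, 35.0, 41.0]  # response variable (Y)

plt.scatter(lone_parents, no_car)
plt.xlabel("Lone parent households (%)")
plt.ylabel("Households with no car or van (%)")
plt.show()
```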
Linear relationships
The scatterplot of the percentage of lone parent households by the percentage of households with
no car or van is shown

The pattern - a monotonic, positive relationship. To summarize the relationship between two interval-level
variables, try to fit a straight line of the form

Y = a + bX

where a is the intercept and b is the slope.

The degree of slope or gradient of the line is given by the coefficient b; the steeper the slope, the
bigger the value of b. The coefficient b gives a measure of how much Y increases for a unit increase in
the value of X.
The slope of a line can be derived from any two points on it. If we choose two points on the line,
one on the left-hand side with a low X value (called XL, YL), and one on the right with a high X value
(called XR, YR), then the slope is

b = (YR - YL) / (XR - XL)
Where to draw the line?
● Fit a line which will come as near as possible to the data points.
● Rules to follow:
1. Make half the points lie above the line and half below along the full length of the line.
2. Make each point as near to the line as possible (minimizing distances perpendicular to the
line).
3. Make each point as near to the line in the Y direction as possible (minimizing vertical
distances).
4. Make the squared distance between each point and the line in the Y direction as small as
possible (minimizing squared vertical distances).
● Popular criterion for line fitting - minimizing squared deviations from the line in the Y direction
(rule 4). This technique is known as linear regression
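Rule 4 is what numpy's polyfit implements for a straight line; a minimal sketch on invented data:

```python
import numpy as np

# Least-squares straight line: minimizes squared vertical distances (rule 4).
x = np.array([3.1, 4.5, 5.2, 6.8, 8.0, 9.3])
y = np.array([15.0, 20.0, 24.0, 30.0, 35.0, 41.0])

b, a = np.polyfit(x, y, deg=1)  # returns slope first, then intercept
print(f"Y = {a:.2f} + {b:.2f}X")
```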
Fitting a resistant line
Line fitting involves joining two typical points: the X-axis is roughly divided into three parts,
conditional summary points for X and Y are found in each of the end thirds, and then a line is drawn
connecting the right-hand and left-hand summary points.
Ordering and grouping
In dividing the X-axis, the aim is to get one-third of the cases into each of the three parts. Dividing
the X-axis into three is in principle straightforward, but in practice there are snags, especially where there
are not many data points.
Guidelines:
1. The X-axis should be divided into three approximately equal lengths.
2. There should be an equal number of data points in each third.
3. The left and the right batch should be balanced, with an equal number of data points in each.
4. Any points which have the same X value must go into the same third.
5. No subdivision of the X-axis should account for more than half the range of the X-values found in
the data.

Obtaining the summary value


A summary X and Y value must be found within each third. The summary X value is the median
X in each third; in the first third of the data, the summary X value is 5.29, the value for the Eastern
region.
The middle summary point is not used to draw a straight line, but it should not lie too far from the
line if the underlying relationship really is linear.
The method involves calculating two half-slopes: the left-hand half-slope is calculated between
the first and the middle summary point, and the right-hand half-slope between the middle and the third
point. If the half-slopes are nearly equal, the relationship is fairly linear. If one is more than double the
other, we should not fit a straight line.
Deriving the coefficients of the line
The slope and the intercept could be read off a graph. It is, however, quicker and more accurate to
calculate them arithmetically from the summary points. The slope is given by

b = (YR - YL) / (XR - XL)

where (XL, YL) and (XR, YR) are the left-hand and right-hand summary points.
A more accurate estimate of the intercept is obtained if the mean of all three summary values is used:

a = [(YL - bXL) + (YM - bXM) + (YR - bXR)] / 3
Each data point then has a residual - its vertical distance from the fitted line. Fitting the 'best' straight line through the scatterplot means making these residual values as small as possible overall.
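The whole procedure can be collected into a short function. This is an illustrative sketch of the method as described (equal-count thirds, median summary points in each third, slope from the outer summary points, intercept averaged over all three); it ignores the tie-handling and balance adjustments listed in the guidelines above:

```python
import numpy as np

def resistant_line(x, y):
    """Fit a resistant line y = a + b*x by the method of thirds."""
    order = np.argsort(x)
    x, y = np.asarray(x, float)[order], np.asarray(y, float)[order]

    # Split the cases into three (approximately) equal-count groups on X.
    thirds = np.array_split(np.arange(len(x)), 3)

    # Median summary point (x, y) within each third.
    sx = [np.median(x[idx]) for idx in thirds]
    sy = [np.median(y[idx]) for idx in thirds]

    b = (sy[2] - sy[0]) / (sx[2] - sx[0])               # slope, outer thirds
    a = np.mean([sy[i] - b * sx[i] for i in range(3)])  # intercept, all three
    return a, b

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.1, 3.9, 6.2, 8.0, 10.1, 11.8, 14.2, 15.9, 18.1]
a, b = resistant_line(x, y)
print(round(a, 2), round(b, 2))  # roughly a = 0, b = 2
```

Because medians rather than means are used at every step, a single wild data point in any third barely moves the fitted line; this is what makes the line 'resistant'.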
