Quantitive Research - Assignments

This document outlines assignments and questions for several weeks of a course. It includes details on Assignment 2 which involves hypothesis testing and ANOVA models. Assignment 3 includes questions about survey data, such as the number of female respondents, percentage married/living together, and average weekly alcohol consumption. It discusses assumptions for ANOVA models and performing normality checks on distributions when analyzing survey data with two-factor ANOVA.

Uploaded by

akseld

Table of Contents

Assignment 2
    2.1
    2.2
Assumptions
Assignment 3
    Question 3.1
    Question 3.2
    Question 3.3
    Question 3.4
    Question 3.5
    Evaluation
Ex. 4
    4.1
Assignment for week 12
    7.1
    7.2
Week 13
    8.1
Week 14
    Assignment 9
        a)
Week 17
    11.1
        1
    11.2
        1
        2
        3

Assignment 2
2.1

Hypothesis testing

1.

In the first one, we are testing for multiple differences in means, so
H_0 states that all the means are equal to each other.

H_1 states that at least two of the means differ from each other.

ANOVA:

One-way ANOVA: The one-way ANOVA compares the means of a dependent variable across the groups (at least three) of a single categorical independent variable.

Two-way ANOVA: The two-way ANOVA does the same as the one-way ANOVA, but it additionally uses blocks to account for a second source of variation.

Two-factor ANOVA: The two-factor ANOVA takes two factors, for example age and income, and draws statistical conclusions about the effect of each. It also looks at the interaction between the two factors.

2.2

Template:
Make the estimated model:

Insert the dependent variable, mean, independent variables and error term, according to which ANOVA we are asked to create.

In this case it would be: Wage, Mean, Gender, Education, Interaction and error term.

Then make the hypothesis test, also according to the ANOVA test.
From our normality test, we can see that two categories stand out because they clearly do not follow a normal distribution; furthermore, both of these categories have rather small sample sizes. The first is women with 9 years of schooling, with a sample size of 20, and the second is men with 17 years of schooling, with a sample size of 60.

The other categories do not exactly follow a normal distribution either; however, they all have rather large sample sizes, so we can use the UCLT and conclude that they will approximate normal distributions.
Now I want to assess the confidence level and critical value in order to interpret our hypothesis.

First, I will check for equality of variances by dividing the largest variance of the 8 groups by the smallest variance of the 8 groups.

Rows 1 and 8 give the smallest and the largest variance. Dividing the largest by the smallest gives the variance ratio:

$$\frac{2253.9466102}{323.35526316} = 6.97$$
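The variance-ratio check can be sketched in a few lines of Python. The function name `variance_ratio` is an illustrative assumption, and only the two extreme variances quoted in the text are used, since the other six group variances are not listed.

```python
def variance_ratio(variances):
    """Return the largest variance divided by the smallest one."""
    if min(variances) <= 0:
        raise ValueError("variances must be positive")
    return max(variances) / min(variances)


# Smallest and largest group variances quoted in the text.
ratio = variance_ratio([323.35526316, 2253.9466102])
print(round(ratio, 2))  # 6.97
```

The ratio is then compared with the critical F value for the corresponding degrees of freedom.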

Now I want to find our critical value. To do this we need the specific test statistic for our two-factor ANOVA.

Our critical value for F(60-1, 20-1): $V_1 = 59$, $V_2 = 19$, $\alpha = 0.002$ and $F_A = 6.97$.

This means that our critical value at a significance level of alpha = 0.002 is 3.90.

We can hereby conclude, with 95% certainty, that there is not enough evidence to support the null hypothesis stating that wages are equal across the different categories of schooling and gender.

We thereby say, with 95% certainty, that gender and educational level make a difference to your wage.
From this we see that men with 15 and 17 years of education are significantly different from the other categories, as they alone are represented by the letters B and A respectively.

Assumptions
H0 is assumed to be true
Data issues (SRS, independence, trustworthiness)
Assignment 3

Question 3.1

How many women have answered the questionnaire?

We can see that for men = 0 we have a count of 262, which means that 262 women took part in the survey.

What percentage are married or cohabiting?

For married / living with a partner, we see that the percentage is 24% of the entire sample.
What is the average weekly alcohol consumption?

We can thereby see that the mean weekly alcohol consumption in the sample is 6.02.

What is the standard deviation for Friends_Tr50?

The standard deviation for Friends_Tr50 is 47.147.

What is the most frequently occurring answer for Helping_poor?

The most frequent answer to the question on the importance of “Helping the poor” is the fourth answer option, which corresponds to moderately agreeing. Option four has 36.55% of the votes.
Question 3.2

The population for this questionnaire could be defined in several ways. The assignment states that the data is drawn from a questionnaire given to second-semester HA, SOC and Bscb students from Aarhus and Herning. The population could therefore be all students attending HA, SOC and Bscb at Aarhus Bss and Herning Bss.

Let's say that is a total of 15,000 students. Our questionnaire has a total of 591 participants, which accounts for about 4% of the “population”. We could therefore say that n/N < 5% and not assume a process, even though a process may be what they are actually looking to use the data for.

I would not say that the data is random, since it sounds as though every single person in the second semester has been asked to do the survey.

Assuming that people have done the survey by themselves without the influence of others, the data should be reliable.

I do, however, see a problem in the fact that it is stated that three people will win, which sounds as though it will be based on the correctness of their answers. If that is true, it is a problem, because people may feel that they need to answer in a certain way that does not correspond to what they actually think.
Question 3.3

Model:
Alcohol consumption = 𝜇 + 𝑔𝑒𝑛𝑑𝑒𝑟 + 𝑐𝑖𝑣𝑖𝑙 + 𝐺𝑒𝑛𝑑𝑒𝑟 ∗ 𝑐𝑖𝑣𝑖𝑙 + 𝜀

Assumptions:
SRS, trustworthiness and independence have been discussed.

When performing the two-factor ANOVA, I will start by looking for normality.

Here we see the two distributions that look furthest from a normal distribution. They are, respectively, alcohol consumption for gender: women, status: single, and alcohol consumption for gender: women, status: married.

We do, however, see rather large sample sizes of 125 and 74 respectively; we can therefore apply the UCLT to fulfil the normality assumption.

The other categories do not exactly follow a normal distribution either; however, they all have rather large sample sizes, so we can use the UCLT and conclude that they will approximate normal distributions.
Now I want to assess the confidence level and critical value in order to interpret our hypothesis.
First, I will check for equality of variances by dividing the largest variance of the 6 groups by the smallest variance of the 6 groups.

From the variance data, we see that the value of 10.822 in row 3 is the smallest, while the variance of 170.767 in row 5 is the largest.

$$\text{Equality of variance} = \frac{170.767}{10.822} = 15.78$$

Now I want to find our critical value. To do this we need the specific test statistic for our two-factor ANOVA.

Our critical value for F(74-1, 66-1): $V_1 = 73$, $V_2 = 65$, $\alpha = 0.00333$ and $F_A = 15.78$.

Bonferroni adjustment:

$$\frac{k(k-1)}{2} = \frac{6 \cdot 5}{2} = \frac{30}{2} = 15, \qquad \frac{\alpha/2}{\frac{k(k-1)}{2}} = \frac{\alpha}{30} = \frac{0.05}{30} = 0.001667$$

We can hereby see that our critical value is 1.9503, which 15.78 is far above.
Our p-value is also extremely low.

This also means that we can say, with 95% confidence, that there is not enough evidence to support H_0. We therefore reject H_0 and say, still with 95% certainty, that there is a difference between the groups formed by the factors.

I will now be doing the actual two-factor test.

We can see from the effect test that, while gender and civil status are shown to be very significant and fairly significant respectively, the interaction between the two is not significant.

This can be further visualised using the Student's t table, which shows that none of the groups is significantly different from the others, since no group has a letter to itself.

While the diagram is a bit hard to read, it does show to a large degree that the variables follow each other.
I will therefore remove the interaction between the two variables to see what results that brings.

Here we can see that both gender and civil status are significant, with gender being the most significant.

Question 3.4
The categories we need for this experiment are Gender (Male), Civil (Status) and the question “Helping poor people”, where the latter should be changed to a continuous variable.
To answer this, I will be doing a two-factor ANOVA, because I want to look at whether gender has an effect on sociability, while also seeing whether social status plays a role.

Our hypotheses for the two-factor ANOVA look as follows:

I will now be looking for normality in the data.

These two distributions are the two that look the least like normal distributions. They are the distributions of women in a relationship but living alone, and men who are single.
Both distributions are left-skewed, with quite similar means of 3.54 and 3.523 respectively, showing that the mean of both distributions is slightly closer to a four than a three in the questionnaire.

Even though neither of these distributions is normal, we see that their respective sample sizes of 63 and 195 are large enough to apply the UCLT, which means that they will approach normal distributions. This is also true for the rest of the categories, where we apply the UCLT to every one of them.
Now I want to assess the confidence level and critical value in order to interpret our hypothesis.

First, I will check for equality of variances by dividing the largest variance of the 6 groups by the smallest variance of the 6 groups.

We see that all of the variances are quite low and rather close to each other. The smallest can be found in row 3, with a variance of 0.8286. The largest is found in the fourth row and has a value of 1.127.

$$\text{Equality of variances} = \frac{1.127}{0.8286} = 1.36$$

Now I want to find our critical value. To do this we need the specific test statistic for our two-factor ANOVA.

Our critical value for F(195-1, 74-1): $V_1 = 194$, $V_2 = 73$, $\alpha = 0.00333$ and $F_A = 1.36$.

Bonferroni adjustment:

$$\frac{k(k-1)}{2} = \frac{6 \cdot 5}{2} = \frac{30}{2} = 15, \qquad \frac{\alpha}{\frac{k(k-1)}{2}} = \frac{\alpha}{15} = \frac{0.05}{15} = 0.00333$$

We can hereby see that our critical value is 1.743, with a p-value of 0.0641, which is relatively close to the 5% threshold but does not fall below it.
This also means that, with 95% confidence, we fail to reject H_0, because the variances appear sufficiently equal to each other that we do not see any significant difference.

I will now be doing the actual two-factor ANOVA.

We can see from this that the interaction between gender and civil status does not seem to have any significant influence on the test.

This can be further visualised using the least squares means Student's t table, which clearly shows that (with 95% confidence) there are no categories that are significantly different from the others, since they all share the letter A. They are basically all equal to each other to some degree.

This states very clearly that the gender and the social status of a person in this test have no significant influence on the sociability of that person.

This is true across every category in this series.
We now see that gender is still a significant variable, while civil status is still of no significance.

From this least squares means Student's t table we can further see that civil status has no significant effect on the sociability of the people in the questionnaire.

I will therefore also remove the civil status variable, so the test ends up being a one-way ANOVA with a single factor (gender).
We now see that the F-ratio is 3.178 and the probability has dropped, so the gender of a person is now of no significance with respect to the level of sociability.

This is further illustrated by this Student's t table, telling us that none of the groups in the gender category has a significant effect on the sociability of a person.
We can thereby see that in the two-way ANOVA, the gender of a person had a significant effect on the level of sociability. That was when civil status still played a role. Civil status, however, turned out to have no significant influence on a person's sociability.

Therefore, I removed it, consequently turning my two-factor ANOVA into a two-way ANOVA and now, lastly, into a one-way ANOVA.

Now the answer was quite different: once civil status had been removed as a variable, the gender of a person was suddenly no longer significant for that person's degree of sociability with regard to helping poor people.

Question 3.5

Evaluation

When doing an ANOVA and removing a term from it, a new set of hypotheses, assumptions and model has to be set up.

That is the case every time we remove something from the test.

The Bonferroni adjustment for F-tests looks as follows:

$$\frac{\alpha/2}{\frac{K(K-1)}{2}}$$

Alpha being divided by 2 and K(K-1) being divided by 2 cancel out. Therefore, the Bonferroni adjustment for F-tests ends up looking like this:

$$\frac{\alpha}{K(K-1)}$$
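As a minimal sketch of the Bonferroni adjustment for F-tests, the following Python computes the number of pairwise comparisons and the adjusted alpha for K = 6 groups. The function names are illustrative assumptions, not part of the course material.

```python
def pairwise_comparisons(k):
    """Number of distinct pairs among k groups: k(k-1)/2."""
    return k * (k - 1) // 2


def bonferroni_alpha_f(alpha, k):
    """(alpha/2) / (k(k-1)/2), which simplifies to alpha / (k(k-1))."""
    return (alpha / 2) / pairwise_comparisons(k)


k = 6
print(pairwise_comparisons(k))                # 15
print(round(bonferroni_alpha_f(0.05, k), 6))  # 0.001667
```

The two factors of 2 cancel, so the result equals 0.05 / 30.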

Ex. 4

4.1

Step 1: Hypothesis

𝐻0 : 𝐺𝑒𝑛𝑑𝑒𝑟 𝑎𝑛𝑑 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑎𝑟𝑒 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡


𝐻1 : 𝐺𝑒𝑛𝑑𝑒𝑟 𝑎𝑛𝑑 𝑒𝑑𝑢𝑐𝑎𝑡𝑖𝑜𝑛 𝑎𝑟𝑒 𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡 (𝑆𝑜𝑚𝑒ℎ𝑜𝑤)

Step 2: Significance level

𝛼 = 0.05

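Step 3: Test statistic

$$\sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} \approx \chi^2_{(r-1)(c-1)}$$

$$e_{ij} = \frac{\text{row total} \cdot \text{column total}}{\text{sample size}}$$

$$e_{ij} = \frac{72 \cdot 256}{1000} = \frac{2304}{125} = 18.432$$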
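The expected counts and the chi-square statistic can be sketched in Python as follows. The function names are illustrative assumptions; the only numbers taken from the text are the row total 72, column total 256, sample size 1000, and the observed/expected pair 156 and 177.8 used in Step 4. The full contingency table is not reproduced in the text, so `chi_square_statistic` is shown only as a general helper.

```python
def expected_count(row_total, column_total, sample_size):
    """Expected cell count under independence."""
    return row_total * column_total / sample_size


def chi_square_contribution(observed, expected):
    """Contribution of a single cell to the chi-square statistic."""
    return (observed - expected) ** 2 / expected


def chi_square_statistic(table):
    """Chi-square statistic for a table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    return sum(
        chi_square_contribution(f, expected_count(row_totals[i], column_totals[j], n))
        for i, row in enumerate(table)
        for j, f in enumerate(row)
    )


e = expected_count(72, 256, 1000)
print(e)                                             # 18.432
print(round(chi_square_contribution(156, 177.8), 6))  # 2.672891
```

The rule of five can be checked by confirming that every expected count returned by `expected_count` is at least 5.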

Rule of five is satisfied

• SRS
• Independence
• Trustworthiness

Constant class probability

Mutually exclusive and collectively exhaustive groups

Last two are simply meant to be stated, no need to comment on them.

Step 4: Test value

$\chi^2_{obs}$:

$$\frac{(156 - 177.8)^2}{177.8} \approx 2.672891$$

Step 5: Critical value

2 2
𝜒(𝑟−1)∗(𝑐−1)∗𝛼 = 𝜒(3−1)∗(

Step 6: P-value

Step 7: Conclusion

        A^C                               A                    Total
B^C     P(A^C ∩ B^C) = 0.7 − 0.4 = 0.3    P(A ∩ B^C) = 0.4     P(B^C) = 1 − 0.3 = 0.7
B       P(A^C ∩ B) = 0.3 − 0.15 = 0.15    P(A ∩ B) = 0.15      P(B) = 0.3
Total   0.45                              0.55                 1

Step 1: Hypothesis

𝐻0 : 𝐼𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑐𝑦 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑐𝑖𝑣𝑖𝑙 𝑠𝑡𝑎𝑡𝑢𝑠


𝐻1 : 𝐷𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑐𝑒 𝑏𝑒𝑡𝑤𝑒𝑒𝑛 𝑐𝑖𝑣𝑖𝑙 𝑠𝑡𝑎𝑡𝑢𝑠

Step 2: Significance level


𝛼 = 0.05

Step 3:

$$e_{ij} = \frac{\text{row total} \cdot \text{column total}}{\text{sample size}} = \frac{129 \cdot 262}{591} = 57.1878$$

• SRS
• Independence
• Trustworthiness
• H_0 is true

Test value:
Assignment for week 12

7.1
1. Discuss linearity.

The line does not look particularly linear. For one, we have

2. Model

Level-level model

The general model: $y = \beta_0 + \beta_1 \cdot x_1 + \epsilon$

Specific model: $Price = \beta_0 + \beta_1 \cdot Quantity + \epsilon$

Estimated model: $\widehat{Price} = b_0 + b_1 \cdot Quantity$

Assumptions:
SRS: 29 successive days does not seem random. We have no way of knowing about, or adjusting for, any particular seasonality, sales, etc.

Trustworthiness: What is the data collected for? Maybe a sales manager makes the numbers look better. If the data has been extracted from a database, it is hard to manipulate the price and quantity numbers. Furthermore, both price and quantity are easy to measure.
Variation of X: The variation is not equal at many points. In the low quantity ranges there are skewed observations in the high price ranges, above our best-fit line. In the middle quantities, between 10 and 40, there is a skew below our best-fit line.

The number of observations also decreases as quantity increases.

Level-level:

Evaluation of assumptions for level-level model:

Most relevant to discuss:

• Zero conditional mean ($E(\epsilon \mid x) = 0$): Given any x, we do not seem to have a zero conditional mean of the errors. Our errors lie above 0 at the start, below 0 from 3-8, and above 0 again from 8 onwards. At no point do we really have a zero conditional mean.

• Independence between errors: We do have major issues with dependence between errors, which also relates to the discussion of the simple random sample: we are dealing with 29 successive days.

• (Normally distributed errors):

• Homoscedasticity:
Estimation of Level-level model

Test model hypothesis:

1. Hypothesis
$H_0: \beta_1 = 0$: There is no relationship between quantity and price.
$H_1: \beta_1 \neq 0$: There is a relationship between the two.

Significance level:

$\alpha = 0.05$

Choice of test statistic:

Calculate test statistic:

$$t_{obs} = \frac{-0.178687 - 0}{0.022307} = -8.01$$

Critical value:

$$\pm t_{n-k-1;\,\alpha/2}$$

Degrees of freedom: $n - k - 1 = 29 - 1 - 1 = 27$

$$\frac{\alpha}{2} = \frac{0.05}{2} = 0.025$$

$$t_{27;\,0.025} = 2.052$$

We reject H_0, and the model is therefore significant.
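The slope test above can be reproduced with a short Python sketch. The slope estimate, its standard error and the critical value 2.052 are taken from the text; the function name `t_statistic` is an illustrative assumption.

```python
def t_statistic(estimate, hypothesised, standard_error):
    """t = (b_j - beta_j0) / s_bj."""
    return (estimate - hypothesised) / standard_error


b1 = -0.178687    # slope estimate from the text
se_b1 = 0.022307  # standard error of the slope from the text
t_crit = 2.052    # t with 27 degrees of freedom at alpha/2 = 0.025, from the text

t_obs = t_statistic(b1, 0.0, se_b1)
reject_h0 = abs(t_obs) > t_crit  # two-sided test
print(round(t_obs, 2), reject_h0)  # -8.01 True
```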


P-value: very low.

$$2 \cdot P(T_{27} < -8.01) \approx 0$$

Conclusion:
There is a relationship between price and quantity. Our p-value is < 0.0001.

Coefficient of determination (R^2)

R^2 = 0.7038, which means that 70.38% of the variation in price can be explained by the variation in quantity.

$$R^2 = \frac{191.19759}{271.65252} = 0.7038$$

The model was estimated to be: $\widehat{Price} = 11.059 - 0.179 \cdot Quantity$

The level-level intercept:

Our intercept is 11.059. This, however, does not make much sense, as there is no price for the product when no product exists.

The level-level b_1:

When quantity increases, price decreases (a negative relationship). In particular, when quantity increases by 1, price decreases by 0.179.
Question 3:

Prediction interval / individual confidence interval = used for one particular instance.

Confidence interval / mean confidence interval = used for the average.

Formula

$$Y \mid X = (b_0 + b_1 \cdot x_g) \pm t_{n-2;\,\alpha/2} \cdot s_\epsilon \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{SS_x}}$$
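A minimal sketch of the prediction interval formula in Python follows. The coefficients, n = 29 and the critical value 2.052 come from the text; the values of `s_eps`, `x_bar`, `ss_x` and `x_g` are hypothetical placeholders, since the text does not report them.

```python
import math


def prediction_interval(b0, b1, x_g, t_crit, s_eps, n, x_bar, ss_x):
    """Prediction interval for a single new observation at x = x_g."""
    centre = b0 + b1 * x_g
    half_width = t_crit * s_eps * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / ss_x)
    return centre - half_width, centre + half_width


# s_eps, x_bar, ss_x and x_g below are made-up values for illustration only.
low, high = prediction_interval(
    b0=11.059, b1=-0.179, x_g=20.0, t_crit=2.052,
    s_eps=1.7, n=29, x_bar=25.0, ss_x=6000.0,
)
print(low, high)
```

Dropping the leading 1 under the square root turns this into the interval for the mean, which is narrower.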

4.

Using JMP.

Mean price interval: [7.8035 ; 8.9545]

5.

Four model options:

Level-level
Log-level
Level-log
Log-log
Model:

General model: $\log(y) = \beta_0 + \beta_1 \cdot \log(x_1) + \epsilon$

Estimated model: $\widehat{\log(price)} = b_0 + b_1 \cdot \log(Quantity)$

Assumptions:
Nothing new regarding SRS, trustworthiness and variation in x.

Zero conditional mean:

Normally distributed errors:

Homoscedasticity:

Estimated model is:

$\log(price) = 3.45 - 0.56 \cdot \log(Quantity)$

Test of the log-log model, short version.

Test stat: $T_{obs} = \frac{-0.5598 - 0}{0.04198} = -13.33$
Conclusion: We firmly reject H_0.

Coefficient of determination (R^2)

R^2 tells us that 86.8% of the variation in log(price) can be explained by the variation in log(quantity).

Intercept

$\log(price) = 3.45 - 0.56 \cdot \log(quantity)$

B1
The b1 of -0.56 means that a 1% increase in quantity leads to a 0.56% decrease in price, which means there is a negative relationship between the two.
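The elasticity reading of the log-log slope is an approximation for small changes. As a hedged sketch, the exact percentage change implied by the slope can be computed as below; the function name is an illustrative assumption, and the slope -0.56 is the estimate from the text.

```python
def pct_change_in_y(b1, pct_change_in_x):
    """Exact percentage change in y implied by a log-log slope b1."""
    return ((1 + pct_change_in_x / 100) ** b1 - 1) * 100


# A 1% increase in quantity with b1 = -0.56.
print(round(pct_change_in_y(-0.56, 1), 3))  # -0.556
```

For a 1% change the exact figure is very close to the slope itself, which is why the coefficient is read directly as a percentage.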

Look at note 1, page 14: Brightspace. On how to interpret Log-models.


7.2
Week 13

8.1
Template:

True model:

𝐵 = 𝛽0 + 𝛽1 ∗ 𝐶 + 𝛽3 ∗ 𝐸 + 𝜖

1) Model formulation:

Estimated model: $\widehat{Income} = b_0 + b_1 \cdot \text{Year education} + b_2 \cdot \text{age} + b_3 \cdot \text{gender}$

2) Evaluation of assumptions: (MR)

• SRS (Simple random sampling)
• Trustworthiness
• Normal distribution of (𝜖)
• Zero conditional mean
• Independence between (𝜖)
- Generally only a problem when we are dealing with time series.
• Homoscedasticity
• Multicollinearity
- (JMP -> Analyse -> Multivariate methods -> Multivariate -> plug our x-variables into Y.) Check for any variables with a correlation of 0.8-0.9 or above.
Exercise:

1) Model formulation:

Estimated model: $\widehat{Income} = b_0 + b_1 \cdot \text{Year education} + b_2 \cdot \text{age} + b_3 \cdot \text{gender}$

2) Evaluation of assumptions: (MR)

• SRS (Simple random sampling)
It is stated in the exercise that the data is randomly collected.

• Trustworthiness
Since the data has been collected randomly, we assume that it has been drawn from a database and is therefore trustworthy.

However, there can be biases due to differences in how different nationalities might count (Years of education) and
• Normal distribution of (𝜖)
• Zero conditional mean
• Independence between (𝜖)
- Generally only a problem when we are dealing with time series.
• Homoscedasticity
• Multicollinearity
JMP -> Analyse -> Multivariate methods -> Multivariate -> plug our x-variables into Y.

Check for any variables with a correlation of 0.8-0.9 or above.

We have no issues with multicollinearity here; our largest correlation is between age and female, at -0.0756, which is far from a high correlation.
3) Evaluation of the model.

Estimated model is:

$\widehat{Income} = b_0 + b_1 \cdot \text{Year education} + b_2 \cdot \text{age} + b_3 \cdot \text{gender}$

$\widehat{Income} = 64.63 + 18.32 \cdot YearEDU + 1.21 \cdot Age - 75.38 \cdot Female$

4) Test of the whole model.

Hypothesis:

$H_0: \beta_1 = \beta_2 = \beta_3 = 0$

$H_1$: At least one of the above parameters differs from 0.

We assume $H_0$ to be true.

Significance level: $\alpha = 0.05$, or 5%

Test statistic: $\frac{MSR}{MSE} \sim F_{k;\,n-k-1}$
What is k? The number of X's, which in this instance is 3.

Calculate test statistic:

$$\frac{MSR}{MSE} = \frac{2839756}{28643} \approx 99.14$$

Critical value: $f_{k;\,n-k-1;\,\alpha} = f_{3;\,1711-3-1;\,0.05}$

Conclusion:

We are very sure of our conclusion: the risk of rejecting H_0 when it is in fact true is negligible, since our p-value is essentially 0%. We firmly reject H_0, since our F_obs is way above our critical value. At least one of B_1, B_2 and B_3 differs from 0.
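The overall F statistic can be cross-checked in Python, both from the mean squares and from R^2. The function names are illustrative assumptions; MSR, MSE, n = 1711 and k = 3 come from the text, and the R^2 of 0.1484 is the value reported for this model later in the document.

```python
def f_from_mean_squares(msr, mse):
    """Overall F statistic as MSR / MSE."""
    return msr / mse


def f_from_r_squared(r2, n, k):
    """The same statistic expressed through R^2."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


f_ms = f_from_mean_squares(2839756, 28643)
f_r2 = f_from_r_squared(0.1484, n=1711, k=3)
print(round(f_ms, 2))  # 99.14
print(round(f_r2, 1))  # 99.2, slightly different because R^2 is rounded
```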

5) Test of parameters:

Hypothesis:

$H_0: \beta_j = 0$ (we could also write $\beta_{age} = 0$ or $\beta_{education} = 0$)

$H_1: \beta_j \neq 0$

Significance level: $\alpha = 0.05$, or 5%

Test statistic (shown for age), to be compared with $t_{n-k-1;\,\alpha/2}$:

$$t_{obs} = \frac{b_j - \beta_{j0}}{S_{b_j}} = \frac{1.21 - 0}{0.278} \approx 4.3525$$

Intercept: t_obs = 2.99
Years of education: t_obs = 13.63
Age: t_obs = 4.33
Female: t_obs = -9.19
Critical value: $\pm t_{n-k-1;\,\alpha/2} = \pm 1.96$

We reject the null for each of our three variables.

Assumptions: We had some heteroscedasticity and some issues with normality, which could affect the validity of the test somewhat.

Our data on age, years of education and gender explains 14.8% of the variation in income.

B)

Discuss how the model could be reformulated so as to avoid any problems with the assumptions.
Plotted with logged income:

c) Interpret each of the estimated coefficients:

We use the level-level model to interpret the estimated coefficients:

$\widehat{Income} = 64.63 + 18.32 \cdot YearEDU + 1.21 \cdot Age - 75.38 \cdot Female$

Intercept:

B1: Holding age and gender constant, increasing years of education by 1 would increase your annual income by 18,320 DKK.

B2: Holding the other variables constant, every time we increase age by 1, income increases by 1,210 DKK.

B3: The coefficient on the female variable says that, all else equal, being female rather than male decreases your income by 75,380 DKK.
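The "all else equal" reading of the coefficients can be illustrated with a small Python sketch using the estimated model. The profile values (15 years of education, age 40) are hypothetical, and treating the income unit as thousands of DKK is an assumption inferred from the interpretation above.

```python
def predicted_income(year_edu, age, female):
    """Fitted level-level model from the text (unit assumed: 1,000 DKK)."""
    return 64.63 + 18.32 * year_edu + 1.21 * age - 75.38 * female


# Hypothetical profile: 15 years of education, age 40.
man = predicted_income(15, 40, 0)
woman = predicted_income(15, 40, 1)

print(round(woman - man, 2))                        # -75.38
print(round(predicted_income(16, 40, 0) - man, 2))  # 18.32
print(round(predicted_income(15, 41, 0) - man, 2))  # 1.21
```

Each difference changes one variable while holding the others fixed, so it returns exactly the corresponding coefficient.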
Answer the following questions, regardless of problem with assumptions or not.

Interpret each of the estimated coefficients.


Add quadratic expressions (see next page) for years of education and age to
the model. Test (See Note 2, page 6) the new model compared to the original
model a).

Our unrestricted model:

$B = \beta_0 + \beta_1 \cdot C + \beta_2 \cdot D + \beta_3 \cdot E + \beta_4 \cdot C^2 + \beta_5 \cdot D^2 + \epsilon$

(see Note 2, page 6)

Test:

$H_0: \beta_4 = \beta_5 = 0$
$H_1$: At least one of them is not equal to 0

Test stat:

$$\frac{(R^2_{UR} - R^2_R)/q}{(1 - R^2_{UR})/(n - k - 1)} \sim F_{q;\,n-k-1;\,\alpha}$$

k = 5 (unrestricted)

n = 1711

q = number of new variables included in the unrestricted model = 2

$$\frac{(R^2_{UR} - R^2_R)/q}{(1 - R^2_{UR})/(n - k - 1)} = \frac{(0.2658 - 0.1484)/2}{(1 - 0.2658)/(1711 - 5 - 1)} \approx 136.3164$$

$F_{q;\,n-k-1;\,\alpha} = f_{2;\,1711-5-1;\,0.05} = 3.00$
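The test of the added quadratic terms can be reproduced with a short Python sketch. The R^2 values, n, k, q and the critical value 3.00 are taken from the text; the function name `partial_f` is an illustrative assumption.

```python
def partial_f(r2_unrestricted, r2_restricted, n, k, q):
    """F statistic for testing q added variables; k counts the regressors
    in the unrestricted model."""
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator


f_obs = partial_f(0.2658, 0.1484, n=1711, k=5, q=2)
f_crit = 3.00  # critical value quoted in the text

print(round(f_obs, 2), f_obs > f_crit)  # 136.32 True
```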

Conclusion:

Our observed value (136.31) is way above our critical value of 3.00, so we reject the null
– meaning

When making centred or uncentred models in JMP:

The centred model:

$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_1^2 + b_5 x_2^2$ (centred)

The only change from the uncentred model is that the variables that are not squared, but have squared “brothers”, give the slope of X. The squared variables tell us whether the relationship is convex or concave.

The uncentred model:

Slope in y. how the squared polynomials move?..

$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_1^2 + b_5 x_2^2$ (uncentred)

When we log our variables, we can in most instances smooth out the variance in the data, in order to better satisfy normality and homoscedasticity.

Week 14

Assignment 9
a)

Specific regression model:

Estimated model:

$$\widehat{\log(FAMINC)} = b_0 + b_1 \cdot WHRS + b_2 \cdot HHRS + b_3 \cdot WEDU + b_4 \cdot HUSEDUC + b_5 \cdot WAGE + b_6 \cdot HA + b_7 \cdot CIT$$

Assumptions:

Random sample and reliable data: We are given no information on whether the sample is randomly collected. We assume the data is drawn randomly from some database. Trustworthiness should be good.
We see that there is a large correlation between the wife's and the husband's age, at 0.8881. The correlation makes sense but might confuse our model with regard to each variable's effect on log(FAMINC).

Something similar applies to the wife's education and the husband's education.

𝐿𝑜𝑔(𝐹𝑎𝑚𝐼𝑛𝑐) = 8.02 + 0.00012 ∗ 𝑊𝐻𝑅𝑆 + 0

No reduction, as stated in the description.

4 – Test of the model

F-ratio = 44.96 with a corresponding p-value of < 0.0001, so the model is significant.

5 – Interpretation of the model (cross-section analysis)

Education (wife and husband): all else equal, the effect of the husband's education on family income is 5.1% per additional year of education. For wives, the effect is 3.34% for every additional year of educational attainment.

Here we have the confidence intervals for our variables.

The confidence intervals are WEDU ∈ [0.016 ; 0.051] and HusEDU ∈ [0.037 ; 0.065].

Here we can see that the confidence intervals for the effects of the wife's and the husband's educational attainment on log(FAMINC) clearly overlap, meaning that we have no statistical evidence to infer that these effects are different for wives and husbands.

CIT: By living in a large city, your family income will increase by just short of 21%.
Uncentred polynomials:

We won't reduce this model either, to allow for testing of joint significance.

Test of model:

$H_0: \beta_8 = \beta_9 = 0$: There is no relationship between the quadratic terms and family income.
$H_1$:

q = The number of additional variables in the unrestricted model = 2

k = Number of variables in the unrestricted model = 9

Critical value: $f_{q;\,n-k-1;\,\alpha} = f_{2;\,753-9-1;\,0.05} = f_{2;\,743;\,0.05} = 3.0078$

Our observed value (7.68) is above our critical value of 3.01, so we reject the null, meaning that the unrestricted model is better at explaining log(FAMINC) than our restricted model.
We are certain of our conclusion, with a p-value of 0.0005.

What does it mean when we have a negative b_2? The curve is concave: it goes up until the turning point, then down.
If b_2 is positive, the curve is convex: it goes down until the turning point, then up.

Interpretation of the model:

Close to the effects we got from our linear model.


The two age variables have the following maxima:

$$\text{Wife's age} = -\frac{b_5}{2 \cdot b_8} = \frac{-0.0128094}{2 \cdot (-0.000341)} = 56.68$$

$$\text{Husband's age} = -\frac{b_6}{2 \cdot b_9} = \frac{-0.08847}{2 \cdot (-0.000881)} = 48.07$$
Week 17

11.1

General model:
𝑦 = 𝛽0 + 𝛽1 ∗ 𝐶 + 𝛽3 ∗ 𝐸 + 𝜖

Specific regression model:


𝑦 = 𝑏0 + 𝑏1

Estimated model:
𝐿𝑜𝑦𝑎𝑙𝑡𝑦 =
11.2

3
Week 18 logistic regression and time series

1 Explanatory variables
Variables could be: Quality, Brand, Price and Male.

1: Model formulation:

Logit(MAC) = ln(p / (1 − p))
= β0 + β1 ∗ Male + β2 ∗ Brand + β3 ∗ Studytime + β4 ∗ Career + β5 ∗ Tattoo + β6 ∗ Quality + ε
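To make the link between the logit scale and the probability p concrete, here is a small sketch of the logit function and its inverse. The coefficient values and the respondent profile are purely hypothetical and are not estimates from the assignment data; only two of the explanatory variables are used to keep it short.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p."""
    return math.log(p / (1 - p))

def inv_logit(z: float) -> float:
    """Probability implied by a value z on the logit scale."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, chosen only for illustration.
beta = {"intercept": 0.5, "Male": -0.9, "Brand": 0.3}
# Hypothetical respondent: male, brand score of 4.
x = {"Male": 1, "Brand": 4}

z = beta["intercept"] + sum(beta[name] * value for name, value in x.items())
p = inv_logit(z)

print(f"logit = {z:.2f}, predicted probability of owning a Mac = {p:.3f}")
```

The point of the sketch is only that the model is linear in the log-odds, so a predicted probability is obtained by applying the inverse logit to the linear predictor.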

2: Checking assumptions:

Trustworthiness: The observations could be dependent on each other, since respondents
could take the survey together. Do they take the survey seriously? The more engaged
students are more likely to respond to the survey than the less engaged. The perception of
the scales for the scaled variables may also differ between respondents.

SRS: No. The data is based on a survey, and people can choose not to participate. The sample
is not that random, and it is hard to generalize from 2nd semester students only.

Binary Y variable:
The distribution of Y is within the required 80/20 range.
The x-variables don't seem particularly evenly distributed. However, our larger sample size
of 591 gives us a reasonable number of observations across the outcomes.

Multicollinearity:
To assess the multicollinearity assumption, we use the multivariate correlation matrix and
look for any pairwise correlations at or above 0.8. No variables correlate with each other
at this level. The two highest correlations are Male and alcohol at 0.32, and quality and
brand at 0.138. Since the squared correlation gives the share of variance two variables
have in common, this corresponds to roughly 10% (0.32²) and 2% (0.138²), which is far from
problematic.

Multicollinearity assumption is fulfilled.

3: Estimation of model

Reducing the model:

Further reducing:
As the assignment says that our model should only include 3 significant variables, I reduce
the model once more by removing tattoo.

4: Assessing the model:

Hit ratio (Mac = 1) = correctly predicted successes / actual number of successes
= 381 / (381 + 20) = 0.950125 = 95%

Hit ratio (Mac = 0) = correctly predicted failures / actual number of failures
= 39 / (39 + 151) = 0.20526 = 20.5%

Hit ratio total = (381 + 39) / (381 + 20 + 39 + 151) = 0.71 = 71%
Our empty model predicts correctly 67.85% of the time, while our model based on studytime,
male and brand raises the hit rate to 71.06%, an increase of only
71.06 − 67.85 = 3.21 percentage points.

In relative terms: 71.06 / 67.85 = 1.047, an increase of 4.7%.

We would like an increase in predictive accuracy of around 25%.
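The hit ratios can be recomputed directly from the four classification counts. This is a minimal sketch; the variable names are made up, and the baseline is computed under the assumption that the empty model simply predicts the most common outcome for everyone (which reproduces the 67.85% quoted above).

```python
# Classification counts taken from the hit ratio calculations in the text.
correct_mac1 = 381   # actual Mac = 1, predicted 1
wrong_mac1 = 20      # actual Mac = 1, predicted 0
correct_mac0 = 39    # actual Mac = 0, predicted 0
wrong_mac0 = 151     # actual Mac = 0, predicted 1

actual_mac1 = correct_mac1 + wrong_mac1
actual_mac0 = correct_mac0 + wrong_mac0
n = actual_mac1 + actual_mac0

hit_mac1 = correct_mac1 / actual_mac1
hit_mac0 = correct_mac0 / actual_mac0
hit_total = (correct_mac1 + correct_mac0) / n

# Assumption: the empty model predicts the majority class for everyone.
baseline = max(actual_mac1, actual_mac0) / n

gain_pp = (hit_total - baseline) * 100            # percentage points
gain_relative = (hit_total / baseline - 1) * 100  # percent

print(f"hit ratio Mac=1: {hit_mac1:.1%}, Mac=0: {hit_mac0:.1%}, total: {hit_total:.1%}")
print(f"baseline: {baseline:.2%}, gain: {gain_pp:.2f} pp, relative: {gain_relative:.1f}%")
```

The sketch also makes the weakness of the model visible: almost all of the accuracy comes from the Mac = 1 group, while only about one in five non-owners is classified correctly.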

5: Interpretation:

The negative parameter estimate for male signifies a negative effect on whether or not you
have a Mac.

Odds ratio:

We focus on the male variable, since this is of the highest significance.

If we use the odds ratio for interpretation, we can calculate the effect on the odds value for
a marginal change in the variables.

The effect of a "1 unit" increase in Male, i.e. being male as opposed to being female: being
male changes the odds of owning a Mac by 0.418 − 1 = −0.582 = −58.2%. In other words, it
multiplies the odds of owning a Mac by a factor of 0.418.
Using the reciprocal for the male variable:

The value of the reciprocal is 2.3883. Going from male to female increases the odds of
owning a Mac by a factor of 2.388, or a percentage change of (2.38 − 1) * 100 = 138%.
Decreasing male by one unit (being female as opposed to being male) thus increases the odds
of owning a Mac by 138%.
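The odds ratio arithmetic can be written out as a short sketch. It starts from the rounded odds ratio of 0.418 quoted above, so the reciprocal comes out at about 2.39 rather than the reported 2.3883, which presumably is based on the unrounded estimate. The variable names are made up for illustration.

```python
import math

odds_ratio_male = 0.418  # rounded odds ratio for Male, as quoted in the text

# Percentage change in the odds of owning a Mac when Male goes from 0 to 1.
pct_change_male = (odds_ratio_male - 1) * 100

# Reciprocal: the effect of going from male to female instead.
reciprocal = 1 / odds_ratio_male
pct_change_female = (reciprocal - 1) * 100

# The logit coefficient implied by the odds ratio (negative, as noted above).
coef_male = math.log(odds_ratio_male)

print(f"male vs female: {pct_change_male:.1f}% change in odds")
print(f"female vs male: factor {reciprocal:.2f}, {pct_change_female:.0f}% change in odds")
print(f"implied coefficient: {coef_male:.3f}")
```

Note that both percentages describe changes in the odds, not in the probability of owning a Mac.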

Time series:
Very similar to the cross-sectional regressions we have already worked with, where we
sample across individuals or units at a given point in time.

However, time series data is NOT randomly sampled in the same way as cross-sectional data.

Exercise 12.2 – Time series

1: Model formulation
Estimated model:
Estimated Enrollment = b0 + b1 ∗ Time variable
Note: we can use Year instead and obtain the same slope coefficient. However, the intercept
changes quite a bit, and it may be more intuitive to use "Time variable" rather than "Year".
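The note about Year versus the time variable can be illustrated with a small sketch: shifting x by a constant leaves the slope unchanged and only moves the intercept. The enrolment numbers and the years below are made up for illustration (they are not the exercise data), and `ols` is a hypothetical helper implementing simple least squares.

```python
def ols(x, y):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: 17 years, coded both as calendar year and as 1..17.
years = list(range(2000, 2017))
time = [year - 1999 for year in years]
enrolment = [1000 + 50 * t + (-1) ** t * 10 for t in time]

b0_year, b1_year = ols(years, enrolment)
b0_time, b1_time = ols(time, enrolment)

print(f"Year: intercept {b0_year:.1f}, slope {b1_year:.3f}")
print(f"Time: intercept {b0_time:.1f}, slope {b1_time:.3f}")
```

With Year as x, the intercept is the predicted enrolment in year 0, which is far outside the data. With the time variable, it is the predicted enrolment just before the first observed year, which is easier to interpret.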
2: Assumptions
Trustworthiness:

Student enrolment is most likely taken from a university database, so in that regard the
data should be trustworthy. The cost of fraud is probably higher than the gain from
manipulating the numbers.

Variation in X:
17 years of observations, which should be sufficient for estimating the trend in enrolment.

Error terms in a time series setting:

Now we move on to examine the assumptions regarding the error terms in a time series
setting.

Fit model -> Insert variables -> Run model

Zero conditional mean: E(ε | x) = 0

In a time series setting, the zero conditional mean assumption demands that the model fits
the data perfectly. In other words, no variable other than the time variable should affect
enrolment in order for the zero conditional mean to be fulfilled.

In this case we have both positive and negative errors, so the zero conditional mean
assumption is not fulfilled.

Normally distributed error terms: ε | X ~ N

Referring to the discussion of the zero conditional mean: normality is quite difficult to
discuss with only one observation per x. Normality will therefore also be an issue.

Homoscedasticity: Var(ε | X) = Var(ε) = σ²

No serial (auto) correlation:
This assumption is new in time series. Serial correlation arises when errors from one time
period are carried over into future time periods.

To test for independence between the error terms, we test for autocorrelation with the
Durbin-Watson test.

The hypotheses we test are as follows. The alternative can also be formulated as two-sided,
though positive auto (serial) correlation is what is most typical in time series.

H0: There is no autocorrelation
H1: There is positive autocorrelation

We are testing one-sided, as positive autocorrelation is the most typical case.

1: Test statistic:

d = Σ(i=2..n) (e_i − e_(i−1))² / Σ(i=1..n) e_i²

Small values of d indicate positive autocorrelation and large values of d indicate negative
autocorrelation. d can take values between 0 and 4, where values close to 2 indicate no
autocorrelation.
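The test statistic is easy to compute by hand from a list of residuals. The sketch below implements the formula directly; the two residual series are made up to show the two extremes and are not the residuals from the enrolment model.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: squared successive differences over squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that drift slowly: neighbours are alike (positive autocorrelation).
drifting = [3, 2, 1, 0, -1, -2, -3]
# Hypothetical residuals that flip sign every period (negative autocorrelation).
alternating = [1, -1, 1, -1, 1, -1]

print(f"drifting residuals:    d = {durbin_watson(drifting):.2f}")
print(f"alternating residuals: d = {durbin_watson(alternating):.2f}")
```

When neighbouring residuals are alike, the successive differences in the numerator are small and d is close to 0. When they alternate in sign, the differences are large and d moves towards 4.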
2: Calculation of the test:

We choose α = 0.05 (5%).

Here k = 1 and n = 17

Critical limits:

d_L = 1.13 and d_U = 1.38

We reject the null and conclude that there is positive autocorrelation, as our Durbin-Watson
test statistic of 0.66 is below the lower critical limit of 1.13 and thus falls in the
rejection region.

Time variable squared

[ Add new variable -> formula -> insert variable -> double click -> Square variable (^2) ]

Here k = 2 and n = 17
Which gives us d_L = 1.02 and d_U = 1.54

Our test statistic falls into the inconclusive region, meaning that at least now we don't
have clear positive autocorrelation, and the quadratic term helped.
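The decision rule for the one-sided test against positive autocorrelation can be summarised in a short sketch. The function name is made up; the critical limits are the ones quoted above. The d value used for the quadratic model is hypothetical, since the actual statistic is not stated in these notes.

```python
def dw_positive_test(d: float, d_lower: float, d_upper: float) -> str:
    """Decision for H0: no autocorrelation vs H1: positive autocorrelation."""
    if d < d_lower:
        return "reject H0: positive autocorrelation"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"

# Linear trend model: k = 1, n = 17, d = 0.66 (values from the text).
print(dw_positive_test(0.66, 1.13, 1.38))

# Quadratic trend model: k = 2, n = 17. The value 1.30 is a placeholder
# inside the inconclusive region, NOT the statistic from the exercise.
print(dw_positive_test(1.30, 1.02, 1.54))
```

Adding the squared time variable raises k, which widens the gap between the two limits, so a larger range of d values ends up in the inconclusive region.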

4: Assessment of the model

Enrollment = y

Time variable and Time variable^2 = X’s

Exercise 12.3
Level-level, log-level, level-log, log-log
