Quiz 2
Katrodiya tapankumar Ashokbhai
2024-08-18
Question 1
Load the data
dataset = read.csv("dataset.csv")
I look at the top few entries using the head function to confirm data are loaded.
head(dataset, n= 10)
## number
## 1 139
## 2 116
## 3 122
## 4 115
## 5 122
## 6 126
## 7 107
## 8 112
## 9 112
## 10 121
To understand the structure of the data
dim(dataset)
## [1] 52 1
nrow(dataset)
## [1] 52
ncol(dataset)
## [1] 1
names(dataset)
## [1] "number"
To visualize data I create a histogram
hist(dataset$number, xlab = "Number", main = "Histogram of number",
     col = "lightgreen")
To draw a box plot
boxplot(dataset$number, horizontal = TRUE, pch = 16,
        main = "Dataset of Number", col = "lightblue", xlab = "Number")
To compute the mean, median, standard deviation and First Quartile (Q1)
I compute the mean
mean(dataset$number)
## [1] 114.3269
So the mean is 114.33
I compute the median
median(dataset$number)
## [1] 121.5
So the median is 121.5
I compute the standard deviation
sd(dataset$number)
## [1] 35.40894
So the standard deviation is 35.41
For the first quartile
quantile(dataset$number, 1/4)
## 25%
## 107
So the first quartile is 107.0
Comment on the shape of the distribution
For the mean:
The mean (114.33) is less than the median (121.5), which suggests that the distribution
may be left-skewed (negatively skewed).
For the standard deviation:
The standard deviation (35.41) is about 31% of the mean (114.33), indicating that
there is substantial variability in the data.
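As a rough check of how large the spread is relative to the mean, the coefficient of variation can be computed from the two values reported above. This is only a sketch; the names m, s and cv are introduced here for illustration.

```r
# Summary values copied from the output above
m <- 114.3269   # mean(dataset$number)
s <- 35.40894   # sd(dataset$number)

# Coefficient of variation: standard deviation as a fraction of the mean
cv <- s / m
round(cv, 2)
```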
For the first quartile (Q1):
The first quartile (Q1) is 107.0, so about a quarter of the observations are at or below 107.
The mean (114.33) lies between Q1 and the median (121.5), so at least half of the
observations are above the mean.
Conclusion
The mean (114.33) is below the median (121.5), which suggests that the distribution is
left-skewed (negatively skewed): most of the data lie above the mean, while a few low values
pull the mean downward.
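One way to put a number on this comparison is Pearson's second skewness coefficient, 3 * (mean - median) / sd. This measure is not used in the original analysis; it is a sketch, and the names m, med, s and pearson_skew are introduced here for illustration. A negative value points in the same direction as the comment above.

```r
# Summary values copied from the output above
m   <- 114.3269   # mean
med <- 121.5      # median
s   <- 35.40894   # standard deviation

# Pearson's second skewness coefficient: negative suggests a left skew
pearson_skew <- 3 * (m - med) / s
pearson_skew
```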
Question 2
Step 1: Load the data
diabetes = read.csv("diabetes.csv")
I look at the top few entries to confirm data are loaded.
head(diabetes$HDL, n = 10)
## [1] 27.4 51.4 42.1 53.8 57.6 32.5 47.6 25.9 47.3 84.6
To understand the structure of the data
dim(diabetes)
## [1] 87 6
nrow(diabetes)
## [1] 87
ncol(diabetes)
## [1] 6
names(diabetes)
## [1] "sex" "BG" "HbA1c" "LDL" "HDL" "Tri"
Step 2: Define hypothesis
H0: There is no difference in mean HDL between males and females.
H1: The mean HDL is greater in females than in males.
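The same hypotheses can be written in symbols. The notation is an assumption introduced here: \mu_F and \mu_M stand for the population mean HDL of females and males, and \bar{x}_F, \bar{x}_M for the corresponding sample means.

```latex
H_0:\ \mu_F = \mu_M
\qquad \text{versus} \qquad
H_1:\ \mu_F > \mu_M,
\qquad \text{with sample statistic } \bar{x}_F - \bar{x}_M .
```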
Step 3: pre-checking data
I am interested in the HDL data.
I quickly look at the summary of HDL
summary(diabetes$HDL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.90 39.80 47.10 45.58 51.15 84.60
I look at the frequency of males and females in the data
table(diabetes$sex)
##
## Female Male
## 44 43
barplot(table(diabetes$sex), main = "male and female count", col =
c("lightpink", "lightblue"))
To visualize data I create a histogram
hist(diabetes$HDL, xlab = "HDL", main = "Data of HDL", col =
"lightpink", breaks = 15)
I look at the data split by the sex variable
aggregate(HDL~sex, data = diabetes, mean)
## sex HDL
## 1 Female 46.76591
## 2 Male 44.35814
To visualize the two groups I create histograms and a box plot
library(lattice)
histogram(~HDL|sex, data = diabetes)
boxplot(HDL~sex, data = diabetes, horizontal = TRUE, pch = 16)
Step 4: compute sample statistic
The observed difference in means (Female minus Male) can be computed from the aggregated data
HDL = -diff(aggregate(HDL~sex,diabetes, mean)$HDL)
HDL
## [1] 2.40777
To simulate the sex variable under the null hypothesis, I shuffle the sex labels with the sample function.
sex.sim = sample(diabetes$sex)
Step 5: Generate the randomization distribution under the null hypothesis (H0)
To build this distribution I use the replicate function to repeat the shuffle 1000 times,
recomputing the difference in means each time
HDL0 = replicate(1000,{
sex.sim = sample(diabetes$sex)
-diff(aggregate(HDL~sex.sim, data = diabetes, mean)$HDL)
})
hist(HDL0, main = "Histogram of replicated samples",
     xlab = "Difference in mean HDL (1000 replicates)", col = "lightgoldenrod")
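The shuffle-and-recompute idea in this step can be sketched on its own with made-up data. Everything here is hypothetical and for illustration only: the toy data frame, the helper mean_diff, and the fixed seed are not part of the original analysis.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Made-up data for illustration only (not the diabetes data)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 20),
  HDL = c(rnorm(20, mean = 48, sd = 10), rnorm(20, mean = 44, sd = 10))
)

# Difference in group means, Female minus Male
mean_diff <- function(values, groups) {
  mean(values[groups == "Female"]) - mean(values[groups == "Male"])
}

observed <- mean_diff(toy$HDL, toy$sex)

# Each shuffle of the labels breaks any link between sex and HDL,
# so the recomputed differences show what H0 alone would produce
null_diffs <- replicate(1000, mean_diff(toy$HDL, sample(toy$sex)))

# One-sided p-value: share of shuffled differences at least as large as observed
p_value <- mean(null_diffs >= observed)
p_value
```

Setting a seed before the simulation makes the p-value repeatable from one run to the next; without it, the value changes slightly each time.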
Step 6: compute p-value
pVal = mean(HDL0 > HDL)
pVal
## [1] 0.125
So the p-value is 0.125
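A common variant of this calculation counts ties (>= instead of >) and adds one to both the numerator and the denominator, so that the estimated p-value can never be exactly zero. The function below is a sketch of that variant, not the method used above; perm_pvalue and its arguments are hypothetical names.

```r
# One-sided permutation p-value with the "+1" correction
# null_stats: statistics from the shuffled data; observed: the sample statistic
perm_pvalue <- function(null_stats, observed) {
  (sum(null_stats >= observed) + 1) / (length(null_stats) + 1)
}

# Tiny worked example: two of the four values are >= 3, so (2 + 1) / (4 + 1)
perm_pvalue(c(1, 2, 3, 4), observed = 3)  # 0.6
```

With 1000 replicates the two versions usually differ only in the third decimal place.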
I run a t-test
t.test(HDL~sex, data = diabetes, alternative = "greater")
##
## Welch Two Sample t-test
##
## data: HDL by sex
## t = 1.1631, df = 68.33, p-value = 0.1244
## alternative hypothesis: true difference in means between group Female and group Male is greater than 0
## 95 percent confidence interval:
## -1.04406 Inf
## sample estimates:
## mean in group Female mean in group Male
## 46.76591 44.35814
I extract the t-statistic
t.test(HDL~sex, data= diabetes, alternative="greater")$statistic
## t
## 1.163112
So the t-statistic is 1.16
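To see where the t-statistic comes from, the Welch statistic and its degrees of freedom can be computed by hand and compared with t.test. The sketch below uses made-up data; the vectors x and y and every other name in it are hypothetical and are not the diabetes data.

```r
set.seed(2)  # fixed seed so the sketch is reproducible

# Made-up samples for illustration only
x <- rnorm(25, mean = 47, sd = 9)    # hypothetical group 1
y <- rnorm(30, mean = 44, sd = 12)   # hypothetical group 2

# Squared standard error contributed by each group
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for the alternative "greater"
p_manual <- pt(t_manual, df_manual, lower.tail = FALSE)

# t.test uses the Welch version by default (var.equal = FALSE)
fit <- t.test(x, y, alternative = "greater")
c(manual = t_manual, t.test = unname(fit$statistic))
```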
Step 7: Conclusion
The randomization p-value (0.125) and the t-test p-value (0.1244) are both greater than the
significance level of 0.05. Therefore, we do not reject the null hypothesis.
There is insufficient evidence at the 0.05 significance level to conclude that the mean HDL
levels are greater in females than in males.