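Data Visualization and Statistical Testing in R
Preface
🌍✨📚 I recently attended a lecture on R by Professor Wang Bei of Beijing Institute of Technology and learned a great deal. Here I record and share what I learned.
1. Reading and inspecting the data
First, read the Excel data. A note on the data set: it contains 5 columns, where Speaker has 9 levels, sentence has 4 levels, Focus has 2 levels (NF and XF), and Boundary has 3 levels (Word, Clause and Sentence).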
## 1. Read an EXCEL file------------------------------------------------------------------------
# Note that you need to use "Import Dataset" to find where your data file is
# MaxF0 comes from a within-subject design of 2 (X-focus vs. neutral focus) * 3 (word, clause and sentence boundary)
# The dependent variable is the maximum F0 of the post-focus word (word X+1), "maxf0st"
# Altogether, we have 9 speakers * 4 base sentences * 2 Focus * 3 Boundary = 216 observations
library(readxl) # provides read_excel()
MaxF0<-read_excel("D:/BaiduNetdiskDownload/01/FocusMaxF0syl8.xlsx")
View(MaxF0) # Show the data
colnames(MaxF0) #column names
length(MaxF0$Speaker) # check the number of observations
str(MaxF0) # Show the variables
MaxF0$Focus<-as.factor(MaxF0$Focus) # set "Focus" as a factor
MaxF0$Boundary<-as.factor(MaxF0$Boundary) # set "Boundary" as a factor
# reorder the levels as you wish; otherwise they are in alphabetical order
MaxF0$Boundary<-factor(MaxF0$Boundary,levels=c("Word","Clause","Sentence")) # a neat trick that is very useful in ggplot: it controls the order of the labels on the x-axis
MaxF0_XF<-subset(MaxF0,Focus=="XF") # choose a subset with subset()
summary(MaxF0_XF)
MaxF0_XF1<-MaxF0[MaxF0$Focus=="XF",] # the same subset via logical indexing
summary(MaxF0_XF1)
MaxF0_B2<-subset(MaxF0,Boundary!="Word") # keep the rows whose Boundary is not "Word"
summary(MaxF0_B2)
# Save your data
getwd()#get working directory
write.table(MaxF0_B2,file="C:/Users/wangb/Desktop/Bei_desk/R2020BIT/MaxF0_B2.txt") # save the data frame to your own directory.
Output:
> colnames(MaxF0) #column names
[1] "Speaker" "sentence" "Focus" "Boundary" "maxf0st"
> length(MaxF0$Speaker) # check the number of observations
[1] 216
> str(MaxF0) # Show the variables
tibble [216 × 5] (S3: tbl_df/tbl/data.frame)
$ Speaker : num [1:216] 1 1 1 1 1 1 1 1 1 1 ...
$ sentence: num [1:216] 1 1 1 1 1 1 2 2 2 2 ...
$ Focus : chr [1:216] "NF" "NF" "NF" "XF" ...
$ Boundary: chr [1:216] "Word" "Clause" "Sentence" "Word" ...
$ maxf0st : num [1:216] 19 18.6 19.6 14.6 15.3 ...
> summary(MaxF0_XF)
    Speaker     sentence     Focus      Boundary       maxf0st
 Min.   :1   Min.   :1.00   NF:  0   Word    :36   Min.   :11.34
 1st Qu.:3   1st Qu.:1.75   XF:108   Clause  :36   1st Qu.:14.63
 Median :5   Median :2.50            Sentence:36   Median :16.50
 Mean   :5   Mean   :2.50                          Mean   :16.60
 3rd Qu.:7   3rd Qu.:3.25                          3rd Qu.:18.12
 Max.   :9   Max.   :4.00                          Max.   :24.12
> summary(MaxF0_XF1)
    Speaker     sentence     Focus      Boundary       maxf0st
 Min.   :1   Min.   :1.00   NF:  0   Word    :36   Min.   :11.34
 1st Qu.:3   1st Qu.:1.75   XF:108   Clause  :36   1st Qu.:14.63
 Median :5   Median :2.50            Sentence:36   Median :16.50
 Mean   :5   Mean   :2.50                          Mean   :16.60
 3rd Qu.:7   3rd Qu.:3.25                          3rd Qu.:18.12
 Max.   :9   Max.   :4.00                          Max.   :24.12
> summary(MaxF0_B2)
    Speaker     sentence    Focus      Boundary       maxf0st
 Min.   :1   Min.   :1.00   NF:72   Word    : 0   Min.   :12.40
 1st Qu.:3   1st Qu.:1.75   XF:72   Clause  :72   1st Qu.:16.02
 Median :5   Median :2.50           Sentence:72   Median :17.96
 Mean   :5   Mean   :2.50                         Mean   :17.93
 3rd Qu.:7   3rd Qu.:3.25                         3rd Qu.:19.70
 Max.   :9   Max.   :4.00                         Max.   :24.73
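The factor releveling and subsetting steps above can be tried without the Excel file. Below is a minimal sketch that assumes a simulated stand-in for the data: the data frame `toy` and its random values are hypothetical and only mimic the 9 x 4 x 2 x 3 design. It also shows why the summary of MaxF0_B2 still lists "Word : 0": subset() keeps unused factor levels unless droplevels() is applied.

```r
# Simulated stand-in for MaxF0 (hypothetical values, NOT the lecture data)
set.seed(1)
toy <- expand.grid(
  Speaker  = 1:9,
  sentence = 1:4,
  Focus    = c("NF", "XF"),
  Boundary = c("Word", "Clause", "Sentence"),
  stringsAsFactors = FALSE
)
toy$maxf0st <- rnorm(nrow(toy), mean = 17.5, sd = 2.5)

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(toy$Boundary))

# Explicit levels fix the order used in tables and on plot axes
toy$Boundary <- factor(toy$Boundary, levels = c("Word", "Clause", "Sentence"))
toy$Focus <- factor(toy$Focus)

# Two equivalent ways to select the XF rows
toy_XF  <- subset(toy, Focus == "XF")
toy_XF1 <- toy[toy$Focus == "XF", ]

# subset() keeps the now-empty "Word" level; droplevels() removes it
toy_B2      <- subset(toy, Boundary != "Word")
toy_B2_drop <- droplevels(toy_B2)

print(default_levels)
print(table(toy_B2$Boundary))
print(table(toy_B2_drop$Boundary))
```

Whether to drop the empty level depends on what comes next: keeping it preserves the original coding, while dropping it avoids empty groups in later tables and plots.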
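2. Group means and standard deviations
Sometimes we need summary statistics (such as the mean, standard deviation, maximum and minimum) of one column within each level of a given categorical column.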
# 2.Get group mean and sd--------------------------------------------------------------------
# Get the minimum, maximum, mean, etc.
summary(MaxF0$maxf0st)
# calculate the mean grouped by one variable
tapply(MaxF0$maxf0st,MaxF0$Focus,mean)
# calculate the mean/sd grouped by two variables, using either tapply or aggregate
maxf0Mean<-with(MaxF0,tapply(maxf0st, list(Focus, Boundary),mean))
maxf0Mean
maxf0Sd<-with(MaxF0,tapply(maxf0st, list(Focus, Boundary),sd))
maxf0Sd
maxf0Mean2 <- aggregate(MaxF0$maxf0st,list(MaxF0$Focus,MaxF0$Boundary),mean)
maxf0Mean2
Output:
> summary(MaxF0$maxf0st)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  11.34   15.90   17.50   17.58   19.42   24.73
> # calculate mean divided by one variable
> tapply(MaxF0$maxf0st,MaxF0$Focus,mean)
      NF       XF
18.56700 16.60231
> maxf0Mean
       Word   Clause Sentence
NF 18.18518 18.56963 18.94618
XF 15.59388 16.40753 17.80552
> maxf0Sd
       Word   Clause Sentence
NF 2.265554 2.595972 2.212157
XF 2.376251 2.392706 2.498015
> maxf0Mean2
  Group.1  Group.2        x
1      NF     Word 18.18518
2      XF     Word 15.59388
3      NF   Clause 18.56963
4      XF   Clause 16.40753
5      NF Sentence 18.94618
6      XF Sentence 17.80552
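As shown, maxf0Mean2 lists the statistic (column x) for every combination of Group.1 and Group.2. I prefer aggregate() for this kind of summary because its result is easier to read.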
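The generic column names Group.1, Group.2 and x can be avoided with the formula interface of aggregate(). The sketch below is not from the lecture; it assumes a simulated data frame `toy` (hypothetical values with the same design) and checks that the result agrees with the tapply() table.

```r
# Simulated stand-in for MaxF0 (hypothetical values, NOT the lecture data)
set.seed(1)
toy <- expand.grid(
  Speaker  = 1:9,
  sentence = 1:4,
  Focus    = c("NF", "XF"),
  Boundary = c("Word", "Clause", "Sentence")
)
toy$maxf0st <- rnorm(nrow(toy), mean = 17.5, sd = 2.5)

# Formula interface: the result keeps the names Focus, Boundary, maxf0st
cell_mean <- aggregate(maxf0st ~ Focus + Boundary, data = toy, FUN = mean)
cell_sd   <- aggregate(maxf0st ~ Focus + Boundary, data = toy, FUN = sd)

# The same cell means as a Focus-by-Boundary table
mean_table <- with(toy, tapply(maxf0st, list(Focus, Boundary), mean))

print(cell_mean)
print(mean_table)
```

The long format returned by aggregate() is also the shape that ggplot2 expects, which is one practical reason to prefer it over the wide tapply() table.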
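3. Exploring the data graphically
Here we explore the raw data step by step to get a first impression of its distribution before moving on to further analysis.
3.1 Plotting (checking for extreme values and the shape of the distribution)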
# Plot all the data to see whether there are any extreme values
plot(MaxF0$maxf0st)
plot(sort(MaxF0$maxf0st),ylab="MaxF0(st)") # display the data from the smallest to the largest value
# Plot the histogram
hist(MaxF0$maxf0st)
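3.2 Data standardization (z-scores)
A z-score, also called a standard score, is obtained by subtracting the mean from a value and dividing the difference by the standard deviation. In statistics, the standard score is the signed number of standard deviations by which an observation lies above (positive) or below (negative) the mean of the observed values.
# transform the data to z-scores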
maxf0_Z<-scale(MaxF0$maxf0st,center=TRUE,scale=TRUE)
plot(maxf0_Z)
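To make the definition concrete, here is a minimal sketch that computes the z-scores by hand and compares them with scale(). The vector `x` is simulated (hypothetical values, not the lecture data), and the cutoff of 3 standard deviations for flagging extreme values is a common convention that I am assuming here, not something from the lecture.

```r
# Simulated stand-in for MaxF0$maxf0st (hypothetical values)
set.seed(2)
x <- rnorm(216, mean = 17.5, sd = 2.5)

# z = (value - mean) / standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so flatten it before comparing
z_scale <- as.vector(scale(x, center = TRUE, scale = TRUE))
same <- isTRUE(all.equal(z_manual, z_scale))

# Indices of observations more than 3 SDs from the mean (assumed cutoff)
extreme_idx <- which(abs(z_manual) > 3)

print(same)
print(extreme_idx)
```

After standardization the values have mean 0 and standard deviation 1, so the same cutoff can be applied regardless of the original unit.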
3.3 Plotting correlations
# plot correlations
MaxF0_XF<-subset(MaxF0,MaxF0$Focus=="XF")
MaxF0_NF<-subset(MaxF0,MaxF0$Focus=="NF")
plot(MaxF0_XF$maxf0st,MaxF0_NF$maxf0st)
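Note that plot(MaxF0_XF$maxf0st,MaxF0_NF$maxf0st) pairs the values by row position, which is only meaningful if both subsets list the same speaker, sentence and boundary in the same order. The sketch below shows one way to make the pairing explicit with merge(). It assumes a simulated data frame `toy` (hypothetical values) and the helper name `keys`, neither of which is from the lecture.

```r
# Simulated stand-in for MaxF0 (hypothetical values, NOT the lecture data)
set.seed(3)
toy <- expand.grid(
  Speaker  = 1:9,
  sentence = 1:4,
  Focus    = c("NF", "XF"),
  Boundary = c("Word", "Clause", "Sentence")
)
toy$maxf0st <- rnorm(nrow(toy), mean = 17.5, sd = 2.5)
toy <- toy[sample(nrow(toy)), ]  # shuffle rows: the order no longer matters

# Columns that identify one XF/NF pair
keys <- c("Speaker", "sentence", "Boundary")
xf <- toy[toy$Focus == "XF", c(keys, "maxf0st")]
nf <- toy[toy$Focus == "NF", c(keys, "maxf0st")]

# One row per pair, with columns maxf0st_XF and maxf0st_NF
paired <- merge(xf, nf, by = keys, suffixes = c("_XF", "_NF"))

plot(paired$maxf0st_XF, paired$maxf0st_NF,
     xlab = "MaxF0 (st), XF", ylab = "MaxF0 (st), NF")
r <- cor(paired$maxf0st_XF, paired$maxf0st_NF)
print(r)
```

With the real data the correlation coefficient summarizes what the scatter plot shows; with these simulated values the two columns are independent, so r is only a placeholder.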
A Q-Q plot shows whether the data follow a normal distribution:
# Check whether the data follow a normal distribution
hist(MaxF0$maxf0st)
qqnorm(MaxF0$maxf0st)
qqline(MaxF0$maxf0st,col="red")
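Reading a Q-Q plot takes some practice, so it can help to compare a sample that is normal by construction with one that is clearly skewed. The sketch below uses simulated vectors (hypothetical values) and adds shapiro.test() from base R as a numerical complement. The Shapiro-Wilk test is my own addition here, not part of the lecture material.

```r
# Simulated samples (hypothetical values, NOT the lecture data)
set.seed(4)
x_norm <- rnorm(216, mean = 17.5, sd = 2.5)  # normal by construction
x_skew <- rexp(216, rate = 1)                # strongly right-skewed

# Points close to the red line suggest normality; a curved pattern does not
old_par <- par(mfrow = c(1, 2))
qqnorm(x_norm, main = "Simulated normal")
qqline(x_norm, col = "red")
qqnorm(x_skew, main = "Simulated skewed")
qqline(x_skew, col = "red")
par(old_par)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_norm <- shapiro.test(x_norm)
sw_skew <- shapiro.test(x_skew)
print(sw_norm$p.value)
print(sw_skew$p.value)
```

With a few hundred observations the test can flag small deviations that hardly matter in practice, so I would read it together with the plot rather than on its own.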
Draw three histograms side by side (one row, three columns):
library(dplyr)
par(mfrow=c(1,3))
hist(MaxF0[MaxF0$Boundary=="Word",]$maxf0st,col="light blue",xlab="MaxF0 (st)")
hist(MaxF0[MaxF0$Boundary=="Clause",]$maxf0st,col="green",xlab="MaxF0 (st)")
hist