R Software Training
Endalew T.
Debre Markos University
College of Natural Science
Department of Statistics
2. Console pane
Run R commands interactively.
Commands typed at the prompt are submitted to R and executed immediately.
Code can be executed line by line.
Endalew T. (Debre Markos University, College of Natural Science, Department of Statistics), R Software Training, April 14, 2025
3. Environment/History pane
Environment: Shows all objects (data, variables, functions)
currently loaded in memory.
History: Search and view the command history.
4. Files/Plots/Packages/Help/Viewer pane
Files: Browse files and folders.
Plots: Displays generated graphs and plots.
Packages: View, install, and load R packages.
Help: Access documentation for R functions and packages.
Viewer: Displays local web content, such as HTML output produced from R.
Appearance
The overall appearance of RStudio can be customized as well.
Go to Tools > Global Options > Appearance on the menu bar to
change themes, fonts, and more.
Working Directory in R
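A minimal sketch of how the working directory is usually inspected and changed. The temporary folder is an assumption used only for illustration; it is not a path from the training material.

```r
# Show the current working directory
old_dir <- getwd()
print(old_dir)

# Change the working directory (a temporary folder, purely for illustration)
setwd(tempdir())

# List the files in the new working directory
list.files()

# Restore the original working directory
setwd(old_dir)
```

In RStudio the same change can also be made from the menu bar via Session > Set Working Directory.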
install.packages("package_name")  # install a package from CRAN (name in quotes)
Installing packages from RStudio
Loading packages
library(package_name)  # load an installed package for the current session
Data Structure in R
seq(1, 10)
seq(from=1, to=10)
seq(to=10, from=1)
The arguments by=value and length=value (short for length.out) specify the step size
and the length of the sequence, respectively.
seq(1,5, by=2)
## [1] 1 3 5
seq(1,10, length=5)
## [1] 1.00 3.25 5.50 7.75 10.00
seq(from=1, by=2.25, length=5)
## [1] 1.00 3.25 5.50 7.75 10.00
Factors
Factors are data objects that categorize data and store the categories
as levels.
They are useful for storing categorical data.
fac <- factor(c("Male", "Female", "Male", "Male",
"Female", "Male","Female"))
fac
levels(fac)
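The following sketch shows a few more ways to inspect a factor. The explicit level order and the object name sex are assumptions made for illustration.

```r
# Same values as above, but with an explicit level order (assumed for illustration)
sex <- factor(c("Male", "Female", "Male", "Male", "Female", "Male", "Female"),
              levels = c("Male", "Female"))

levels(sex)        # level labels, in the order given
nlevels(sex)       # number of levels
table(sex)         # frequency of each level
as.integer(sex)    # underlying integer codes (1 = first level)
```

Without the levels argument, R orders the levels alphabetically, so "Female" would come first.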
Arrays
Arrays are n-dimensional R data objects; they can store data in more than
two dimensions.
Every element of an array must be of the same data type.
array(data, dim = c(dim1, dim2, dim3, ...))
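A small sketch of the array() call above, with illustrative values and dimensions that are not taken from the training material:

```r
# A 2 x 3 x 2 array filled column-wise with the integers 1 to 12
arr <- array(1:12, dim = c(2, 3, 2))

dim(arr)        # dimensions: 2 rows, 3 columns, 2 layers
arr[1, 2, 2]    # element in row 1, column 2, layer 2
arr[, , 1]      # the whole first layer (a 2 x 3 matrix)
```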
Function: Description
class(x): returns the class/type of vector x
length(x): returns the total number of elements
x[length(x)]: returns the last value of vector x
rev(x): returns the reversed vector
sort(x): returns the sorted vector
unique(x): returns the vector without duplicate elements
range(x): returns the minimum and maximum of x
quantile(x): returns the quantiles of x for the given probabilities
which.max(x): returns the index of the maximum
which.min(x): returns the index of the minimum
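The functions listed above can be tried on a short vector. The values below are made up for illustration.

```r
x <- c(7, 3, 9, 3, 5)   # illustrative values, not from the training data

class(x)          # "numeric"
length(x)         # 5
x[length(x)]      # last value: 5
rev(x)            # 5 3 9 3 7
sort(x)           # 3 3 5 7 9
unique(x)         # 7 3 9 5
range(x)          # 3 9
quantile(x)       # quartiles by default
which.max(x)      # 3 (position of the value 9)
which.min(x)      # 2 (first position of the value 3)
```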
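rm(object_name1, object_name2)   # remove specific objects from the workspace
rm(list = ls(all.names = TRUE))  # remove all objects, including hidden ones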
Tidyverse Packages
From SPSS
library(haven)
dataset <- read_sav("filepath/filename.sav")
From STATA
library(haven)
dataset <- read_stata("filepath/filename.dta")
# or
dataset <- read_dta("filepath/filename.dta")
From URL
library(readr)
crop <- read.delim("http://www.bio.ic.ac.uk/research/mjcraw...")  # URL truncated in the source
head(crop, n = 3)
library(haven)
library(tidyverse)
# Drop the list columns first; flat files and .sav files cannot store them
starwars_clean <- select(starwars, -films,
-vehicles, -starships)
write_csv(starwars_clean, "starwars_data.csv")
write_sav(starwars_clean, "starwars_clean_data.sav")
What are dplyr and tidyr?
The dplyr package provides simple tools for the most common data
manipulation tasks: manipulating, cleaning, and summarizing data.
The tidyr package complements it with tools for reshaping data into tidy form.
Together they make data exploration and data analysis easy and fast in R.
The most common dplyr functions are:
Function: Description
select(): Subset columns (variables)
filter(): Subset rows on conditions
mutate(): Create new columns
group_by(): Group the data
summarize(): Create summaries by category variable
arrange(): Sort the data (results)
left_join(), inner_join(), full_join(): Join data frames (tables)
count(): Count discrete values
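A short sketch of the counting, sorting, and joining verbs from the list, using the built-in mtcars data. The lookup table labels and its values are made up for illustration.

```r
library(dplyr)

# count(): number of cars for each number of cylinders
cyl_counts <- mtcars %>%
  count(cyl)

# arrange(): sort by count, largest first
cyl_counts %>%
  arrange(desc(n))

# left_join(): add a column from a second, made-up lookup table
labels <- data.frame(cyl  = c(4, 6, 8),
                     size = c("small", "medium", "large"))
cyl_joined <- left_join(cyl_counts, labels, by = "cyl")
cyl_joined
```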
select() function
Helper: Description
starts_with(): starts with an exact prefix
ends_with(): ends with an exact suffix
contains(): contains a literal string
matches(): matches a regular expression
num_range(): numerical ranges like x01, x02, x03, ...
one_of(): variables in a character vector
everything(): all variables
starwars %>%
select(-films,-vehicles,-starships)
# OR
select(starwars,-films,-vehicles,-starships)
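The selection helpers can be combined with ordinary column names. A minimal sketch on the starwars data; the object names colour_cols and species_first are illustrative.

```r
library(dplyr)

# Keep name plus every column whose name ends in "color"
colour_cols <- starwars %>%
  select(name, ends_with("color"))
names(colour_cols)

# Move species to the front and keep all other columns after it
species_first <- starwars %>%
  select(species, everything())
names(species_first)
```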
# single criteria
filter(starwars, species == "Human")
filter(starwars, mass > 1000)
# Multiple criteria
filter(starwars, hair_color == "none" &
eye_color == "black")
starwars %>%
filter(hair_color=="none",eye_color=="black")
starwars %>%
mutate(BMI=mass/(height/100)^2) %>%
mutate(Height=height/100)
starwars %>%
select(sex,height,mass,species) %>%
filter(species=="Human") %>%
na.omit() %>%
mutate(height=height/100) %>%   # convert cm to m
mutate(BMI=mass/height^2) %>%
group_by(sex) %>%
summarise(average_BMI=mean(BMI))
This reshapes the count result into a table where rows are
gender, columns are frame, and cells are counts.
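One way to produce such a table is count() followed by tidyr's pivot_wider(). This is an assumption about the approach, and the small patients table below is a made-up stand-in for the diabetes data.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the diabetes data (values invented for illustration)
patients <- tibble(
  gender = c("male", "female", "female", "male", "female", "male"),
  frame  = c("small", "medium", "small", "large", "medium", "small")
)

wide_counts <- patients %>%
  count(gender, frame) %>%           # one row per gender/frame pair
  pivot_wider(names_from = frame,    # frame levels become columns
              values_from = n,       # cells hold the counts
              values_fill = 0)       # combinations never observed get 0
wide_counts
```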
Aesthetic: Description
x, y: variables mapped to the axes
color: colors the lines (outlines) of geometries
fill: fill color of geometries
group: groups observations based on the data
shape: shape of points, an integer value 0 to 25, or NA
linetype: type of line, an integer value 0 to 6 or a string
size: size of elements, a non-negative numeric value
alpha: transparency, a numeric value 0 to 1
Function: Description
geom_histogram(): Histogram
geom_point(): Scatter plot
geom_line(): Line plot
geom_bar(): Bar chart
geom_boxplot(): Boxplot
geom_smooth(): Trend line (e.g., linear regression)
geom_density(): Density curve
Key aspects you can customize include axes, titles, and legends.
Title and axis components: change size, colour, and face.
Customizing axis labels with labs():
labs() modifies plot labels, including the x-axis, y-axis, and plot title.
Histogram of weight, split by frame
diabetes %>%
drop_na() %>%
ggplot(aes(weight))+   # fill is a constant below, so it is not mapped to gender
geom_histogram(binwidth = 10, fill = "steelblue",
color = "white")+
facet_wrap(~frame)+
theme_minimal()
Bar chart
Simple bar chart
diabetes %>%
ggplot(aes(x = frame)) + geom_bar()
diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar()
geom_bar(position = "<POSITION>")
The position argument controls how bars are arranged when a grouping aesthetic such as fill is mapped.
Options for bars: "stack" (default), "dodge", "fill"
diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "dodge")
diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "stack")
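The third option, position = "fill", rescales each bar to a height of 1 so that the groups are shown as proportions. A minimal sketch, using the built-in mtcars data in place of the diabetes data:

```r
library(ggplot2)

# Proportion of transmission types (am) within each cylinder group
p_fill <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "fill") +
  labs(x = "Cylinders", y = "Proportion", fill = "am")
p_fill
```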
diabetes %>%
drop_na() %>%
ggplot(aes(frame, weight, fill=gender))+
geom_boxplot()+
labs(x="Frame",               # x is mapped to frame
y= "Weight of patients")      # y is mapped to weight
library(ggpmisc)
diabetes %>%
drop_na() %>%
ggplot(aes(x=height,y=weight, linetype =gender ))+
geom_point()+
geom_smooth(method = "lm", se = TRUE) +
stat_poly_eq(
aes(label = paste(after_stat(eq.label), after_stat(rr.label),
sep = "~~~")),
formula = y ~ x,
parse = TRUE
)
diabetes %>%
drop_na() %>% #remove missing value
ggplot(aes(height,weight))+
geom_point()+
facet_wrap(~frame)
diabetes %>%
drop_na() %>% #remove missing value
ggplot(aes(height,weight))+
geom_point()+
facet_grid(rows = vars(frame))
Add labels
xlab(), ylab(), labs(x = "x-axis name", y = "y-axis name")
Add titles
Add legends
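A sketch of how labels, a title, and the legend might be set together. It uses the built-in mtcars data, and the label text is made up for illustration.

```r
library(ggplot2)

p_lab <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(title  = "Fuel economy versus horsepower",   # plot title
       x      = "Horsepower",                       # x-axis label
       y      = "Miles per gallon",                 # y-axis label
       colour = "Cylinders") +                      # legend title
  theme(legend.position = "bottom",                 # move the legend
        plot.title = element_text(size = 14, face = "bold"))
p_lab
```

The legend itself is created automatically for every mapped aesthetic; labs() only renames it and theme() repositions it.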
Theme Description
theme_gray() Default ggplot2 theme
theme_bw() Black and white theme, good for print
theme_minimal() Very clean and minimal background
theme_classic() Classic look (no grid lines)
theme_light() Light background with subtle grid lines
theme_dark() Dark version of theme_light()
theme_void() Removes everything (useful for pie charts or maps)
mtcars %>%
ggplot(aes(hp, mpg, col = factor(cyl))) +
geom_point(size = 3)+
theme_dark()
Boxplot
par(mfrow=c(2,2))
boxplot(yield~block,data=crop,col=c(2:5),ylab = "Crop yield")  # remaining arguments truncated in the source
boxplot(yield~irrigation,data=crop,col=c(2:3),ylab = "Crop yield")  # remaining arguments truncated in the source
boxplot(yield~density,data=crop,col=c(2:5),ylab = "Crop yield",
xlab ="Sowing density", main="Boxplot of crop production")
Summarize by group
library(dplyr)
crop %>% group_by(density) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(fertilizer) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(irrigation) %>%
summarise(average_yield=mean(yield))
model1<-aov(yield~fertilizer,data=crop)
summary(model1)
TukeyHSD(model1)
two<-aov(yield~density+irrigation,data=crop) # main effects only
summary(two)
two1<-aov(yield~density*fertilizer,data=crop) # with interaction
summary(two1)
two2<-aov(yield~density+irrigation,data=crop) # main effects only
summary(two2)
two3<-aov(yield~density*irrigation,data=crop) # with interaction
summary(two3)
TukeyHSD(two1)
TukeyHSD(two3)
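The same workflow can be tried without downloading the crop data. The sketch below uses the built-in warpbreaks data as a stand-in; the object names are illustrative.

```r
# warpbreaks ships with base R; it stands in for the crop data here
fit_main <- aov(breaks ~ wool + tension, data = warpbreaks)   # main effects only
fit_int  <- aov(breaks ~ wool * tension, data = warpbreaks)   # with interaction

summary(fit_main)
summary(fit_int)

# Pairwise comparisons for the tension factor only
tukey_tension <- TukeyHSD(fit_int, which = "tension")
tukey_tension
```

Passing data = ... to aov() avoids relying on attach(), so the model always uses the intended data frame.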
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$
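A model of this form can be fitted with lm(). The sketch below is only an assumption about how an object such as mymodel might be created: the mtcars data and the predictors wt and hp are chosen for illustration.

```r
# mtcars and the chosen predictors are assumptions made only for illustration
mymodel <- lm(mpg ~ wt + hp, data = mtcars)

summary(mymodel)   # coefficient table, R-squared, F-test
coef(mymodel)      # estimated beta_0, beta_1, beta_2
confint(mymodel)   # 95% confidence intervals for the coefficients
```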
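par(mfrow=c(2,2))
plot(mymodel)            # standard diagnostic plots
plot(mymodel,which=4)    # Cook's distance plot
library(car)
vif(mymodel)             # test for multicollinearity
outlierTest(mymodel)     # Bonferroni outlier test
influence(mymodel)       # influence measures
dffits(mymodel)
dfbetas(mymodel)
hatvalues(mymodel)       # leverage values
cooks.distance(mymodel)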
Example
library(tidyverse)
library(gtsummary)   # provides the trial data set
library(survival)
library(sjPlot)      # provides tab_model()
# Multivariable logistic regression
model <- glm(response~trt+age+grade,
data=trial,family=binomial)
tab_model(model, file = "result.doc") # save output