
R Software Training

Data Manipulation using tidyverse


Statistical Analysis Using R

Endalew T.
Debre Markos University
College of Natural Science
Department of Statistics

April 14, 2025

Outline

1 The basics of the R environment and RStudio.


2 Data preparation using R packages
3 Programming techniques and managing datasets
4 Regression analysis and data visualization
5 Creating publication-ready Word tables in R

What is R?

R is a popular programming language used for statistical
computing and graphical presentation.
A programming language designed for statistical data analysis
A statistical software program
A community of data scientists and practitioners
It is one of the most widely used languages by statisticians, data
analysts and researchers to manage, manipulate, analyze and
visualize data.
R is case sensitive

Why use R?
It is a great resource for data analysis, data visualization, data
science and machine learning
It provides many statistical techniques (statistical tests,
classification, clustering and data reduction)
It is easy to draw graphs in R, like pie charts, histograms, box
plots, scatter plots, etc.
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has strong community support
It has many packages (libraries of functions) that can be used to
solve different problems
Packages for almost everything:
Data processing and cleaning
Data visualization
Interactive web-apps
Typesetting, writing articles and slides
Installing R and RStudio
Install R:
1 Go to https://cran.rstudio.com/ to access the R installation
page. Then click the download link for Windows.

2 Choose the “base” sub-directory.

3 Then click on the download link at the top of the page to
download the latest version of R.

Cont. . . ..
Install RStudio:
To download RStudio, go to
https://www.rstudio.com/products/rstudio/download/ and
download the Windows version.

Then click on the downloaded file and follow the installation
instructions.

The R User Interface

RStudio

Best Integrated Development Environment (IDE) for R.
Powerful and makes using R easier
RStudio can:
Organize your code, output, and plots.
Auto-complete code and highlight syntax.
Help view data and objects.
Enable easy integration of R code into documents.
User-friendly interfaces

The RStudio User Interface

Using RStudio is completely optional, but it has many helpful
features to make writing and managing R code easier.
Cont. . . ..
1. Source pane
You will write your R code/script here and it will be run in the
console.
Write R code in reusable files.
To create a new R script you can either go to
File -> New -> R Script, or
click on the icon with the + sign and select R Script, or
simply press Ctrl+Shift+N.
Make sure to save the script.

2. Console pane
Interactively run R commands.
Commands are submitted to R to execute.
Execute R code line by line.
Cont. . . .

3. Environment/history pane
Environment: Shows all the objects (data, variables, functions)
currently loaded in memory
History: search and view command history

4. Files/Plots/Packages/Help/Viewer Pane
Files: Browse files and folders
Plots: Displays generated graphs and plots
Packages: View, install, and load R packages
Help: Access documentation for R functions
Viewer: Displays local web content, such as HTML output and interactive widgets

Customization
Panes
The size and position of the panes can be customized.
On the top right of each pane, there are buttons to adjust the
pane size.
Also, place your mouse pointer/cursor on the borderline between
panes and when the pointer changes its shape, click and drag to
adjust the pane size.
For more options, go to View > Panes on the menu bar.
Alternatively, Tools > Global Options > Pane Layout.

Appearances
The overall appearance can be customized as well.
Go to Tools > Global Options > Appearance on the menu bar to
change themes, fonts, and more.
Working Directory in R

The working directory is just a file path on your computer that
sets the default location of any files you read into R, or save out
of R.
The working directory is where you are currently saving data in R.
What is the current working directory?
Type in getwd()
getwd()

## [1] "C:/Users/tesfa/Desktop/R folder/Training"


How to set the working directory?
Type in setwd("path")
setwd("C:/Users/tesfa/Desktop/R folder/Training")
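
A minimal sketch of changing the working directory and restoring it afterwards. The temporary folder returned by tempdir() and the file name example.csv are illustrative assumptions, not part of the training material.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Switch to R's per-session temporary folder (used here only for illustration).
setwd(tempdir())

# Relative file names are now resolved against the new working directory.
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")

# Restore the original working directory.
setwd(old_wd)
```

Restoring the old directory at the end keeps the rest of a script unaffected by the temporary change.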

Installing and Loading Packages
Packages are collections of R functions, data, and compiled code
in a well-defined format.
Installing Packages

install.packages("package_name")
Installing a package from RStudio

Loading Packages

library(package_name)
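
A package needs to be installed only once but loaded in every session. The sketch below shows one possible way to install a package only when it is missing; the helper name ensure_package is hypothetical and not part of R or of the slides.

```r
# Hypothetical helper: install a package only if it is not yet available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call never triggers a download.
has_stats <- ensure_package("stats")

# Load the package for the current session.
library(stats)
```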
Data Structure in R

A data structure is a particular way of organizing data in a
computer so that it can be used effectively.
Data structures in R programming are tools for holding multiple
values.
The most essential data structures used in R include:
Vectors
Lists
Data frames
Matrices
Arrays
Factors

Vector

Vectors are the simplest type of object in R.
Every element of a vector must be the same data type.
Use the c() function to create a vector.
In R, c() stands for combine or concatenate.
It is a function used to create vectors by combining individual
values into a single vector.
There are 3 main types of vectors:
Numeric vectors
Character vectors
Logical vectors
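
Because every element must share one type, R silently converts mixed input to a common type. A small sketch of this behaviour (the object names are illustrative):

```r
# Mixing numbers, text and logicals: everything becomes character.
mixed_chr <- c(1, "a", TRUE)
class(mixed_chr)   # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0.
mixed_num <- c(1, TRUE, FALSE)
class(mixed_num)   # "numeric"
```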

Cont. . . .
# numeric Vector
Y <- c(1,2,3,4,5,6,7,8,9,10)
class(Y)
# character vector
name <- c("Abebe","Kebede", "Almaze","Aster")
class(name)
# logical vector
logic <- c(TRUE,FALSE,FALSE, TRUE)
class(logic)
Create a vector using rep()
The function rep() can be used to replicate an object in various
ways.
bloodg<-rep(c("A","B","AB","O"),c(2,3,4,3))
bloodg
Cont. . . .
Create a vector using seq() function

seq(1, 10)
seq(from=1, to=10)
seq(to=10, from=1)
The arguments by=value and length.out=value (which can be abbreviated
to length=value) specify the step size and the length of the
sequence, respectively.
seq(1,5, by=2)
## [1] 1 3 5
seq(1,10, length=5)
## [1] 1.00 3.25 5.50 7.75 10.00
seq(from=1, by=2.25, length=5)
## [1] 1.00 3.25 5.50 7.75 10.00
Factors

Factors are the data objects which are used to categorize the data
and store it as levels.
They are useful for storing categorical data.
fac <- factor(c("Male", "Female", "Male", "Male",
"Female", "Male","Female"))
fac
levels(fac)
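
By default the levels are sorted alphabetically. When the categories have a natural order, the levels can be given explicitly. A small sketch with made-up education data:

```r
# Made-up example data; the level order is stated explicitly.
educ <- factor(c("Primary", "Secondary", "Primary", "Tertiary"),
               levels = c("Primary", "Secondary", "Tertiary"),
               ordered = TRUE)

levels(educ)        # levels in the order given above
table(educ)         # frequency of each level
as.integer(educ)    # underlying integer codes: 1 2 1 3
educ[1] < educ[2]   # comparisons work for ordered factors: TRUE
```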

Lists

A list could consist of a numeric vector, a logical value, a matrix,
a complex vector, a character array, a function, and so on.
Lists are created with the list() command:
L<-list(object-1,object-2,...,object-m)
L <- list( c(1,5,3), matrix(1:6, nrow=3),
c("Hello", "world") )
L
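
List elements can be named and then extracted with $ or [[ ]]. A short sketch; the list person and its contents are made up for illustration:

```r
# A named list holding elements of different types.
person <- list(name = "Abebe", scores = c(1, 5, 3), passed = TRUE)

person$name          # extract one element by name
person[["scores"]]   # same idea with double brackets: a numeric vector
person["scores"]     # single brackets return a (smaller) list
person$scores[2]     # second value of the scores vector: 5
length(person)       # number of elements in the list: 3
```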

Matrices

A matrix is a rectangular arrangement of numbers in rows and
columns.
Every element of a matrix must be the same data type.
The basic syntax:
matrix(data, nrow, ncol, byrow = FALSE)

data: the elements to be placed in the matrix
nrow: indicates the number of rows
ncol: indicates the number of columns
byrow: if TRUE, fill the matrix by row (the default, FALSE, fills column-wise)
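
The effect of byrow is easiest to see by filling the same values both ways. A small sketch (object names are illustrative):

```r
# Default: values fill the matrix column by column.
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values fill the matrix row by row.
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

dim(by_col)    # 2 rows, 3 columns
by_row[2, 1]   # element in row 2, column 1: 4
```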

Cont. . . ..

Naming rows and columns

# m1 is assumed here to be a 3 x 3 matrix, for example:
m1 <- matrix(1:9, nrow = 3, ncol = 3)
rownames(m1) <- c("Row1", "Row2", "Row3")
colnames(m1) <- c("Col1", "Col2", "Col3")
print(m1)

Arrays

Arrays are R data objects that can store data in more than
two dimensions.
Arrays are n-dimensional data structures.
Every element of an array must be the same data type.
array(data, dim = c(dim1, dim2, dim3, ...))

dim1: indicates the number of rows
dim2: indicates the number of columns
dim3: indicates the number of matrix layers
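
A short sketch of a three-dimensional array; the dimensions 2 x 3 x 4 are chosen only for illustration:

```r
# 24 values arranged as 2 rows, 3 columns and 4 layers.
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)       # 2 3 4
arr[, , 1]     # the first layer: a 2 x 3 matrix
arr[1, 2, 3]   # row 1, column 2, layer 3: 15
```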

Data frames

A data frame in R is a tabular data structure in which each
column can store values of a different data type.
df <- data.frame(Age=c(20,18,19,12,15),
                 Gender=c("Male","Female","Female",
                          "Male","Male"),
                 Educ_level=c("Secondary","Primary",
                              "Primary","Secondary",
                              "Illiterate"))
df
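
Once a data frame exists, its structure, columns and rows can be inspected with base R. A minimal sketch on a small made-up data frame (the name students is illustrative):

```r
# A small made-up data frame.
students <- data.frame(
  Age    = c(20, 18, 19),
  Gender = c("Male", "Female", "Female")
)

str(students)    # column names, types and first values
nrow(students)   # number of rows: 3
ncol(students)   # number of columns: 2
students$Age     # extract one column as a vector

# Keep only the rows where Age is greater than 18.
older <- students[students$Age > 18, ]
older
```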

Useful functions in vector

Function Description
class(x): returns class/type of vector x
length(x): returns the total number of elements
x[length(x)]: returns last value of vector x
rev(x): returns reversed vector
sort(x): returns sorted vector
unique(x): returns vector without duplicate elements
range(x): Range of x
quantile(x): Quantiles of x for the given probabilities
which.max(x): index of maximum
which.min(x): index of minimum
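
The functions in the table can be tried on a small vector. The values below are made up for illustration:

```r
x <- c(4, 9, 2, 9, 7)

length(x)      # 5
x[length(x)]   # last value: 7
rev(x)         # 7 9 2 9 4
sort(x)        # 2 4 7 9 9
unique(x)      # 4 9 2 7
range(x)       # 2 9
quantile(x)    # minimum, quartiles and maximum
which.max(x)   # 2 (position of the first maximum)
which.min(x)   # 3
```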

Remove Objects from R Environment

To remove objects from the R environment, use the following
functions:
1. To remove specific objects in R:

rm(object_name1, object_name2)

2. To remove all the objects in R:

rm(list = ls(all.names = TRUE))
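
A small sketch showing the effect of rm() on two made-up objects; exists() is used only to check whether a name is still defined:

```r
# Two throwaway objects (names chosen for illustration).
tmp_value_one <- 1
tmp_value_two <- 2

# Remove only the first object.
rm(tmp_value_one)

exists("tmp_value_one", inherits = FALSE)   # FALSE: it has been removed
exists("tmp_value_two", inherits = FALSE)   # TRUE: it is still available
```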

Data Manipulation and Cleaning using the
dplyr package
What is tidyverse?
Tidyverse is a collection of friendly and consistent tools for data
analysis and visualization.
Tidyverse helps you to import, clean, transform, visualize, and
model data in a consistent and efficient way.
All packages included in tidyverse are automatically installed when
installing the tidyverse package:
install.packages("tidyverse")
Then, to use the functions in the tidyverse packages, we must
load the package in every new R session.
library(tidyverse)

cont. . . .

Tidyverse Packages

Data Import and Export

Data are a collection of recorded values, numerical or otherwise.
Data can come in different forms.
To analyze data using the R programming language, first import the
data into R.
The data can be in different formats, such as CSV or any other
delimiter-separated file.
After importing, the data can be manipulated, analyzed and
reported.

haven Package
The haven package in R is used to read and write data from other
statistical software.
It is used to import and export SPSS, Stata and SAS files.
install.packages("haven")
library(haven)

Function       Reads   Example
read_sas()     SAS     data <- read_sas("file.sas7bdat")
read_spss()    SPSS
read_sav()     SPSS    data <- read_sav("file.sav")
read_por()     SPSS
read_stata()   Stata   data <- read_stata("filename.dta")
read_dta()     Stata   data <- read_dta("file.dta")

Import data into R
From Excel
library(readxl)
dataset <- read_excel("filePath/filename.xlsx")

From SPSS
library(haven)
dataset <- read_sav("filepath/filename.sav")

From STATA
library(haven)
dataset <- read_stata("filepath/filename.dta")
# or
dataset <- read_dta("filepath/filename.dta")

Cont. . . . . . .

From URL
library(readr)
crop<-read.delim('http://www.bio.ic.ac.uk/research/mjcraw
head(crop,n=3)

##   yield block irrigation density fertilizer
## 1    90     A    control     low          N
## 2    95     A    control     low          P
## 3   107     A    control     low         NP
From SAS
library(haven)
dataset <- read_sas("filepath/filename.sas7bdat")

Writing (Saving) data
Function            Description
write_csv()         Comma separated values
write_excel_csv()   CSV that you plan to open in Excel
write_sas()         SAS .sas7bdat files
write_sav()         SPSS .sav files
write_dta()         Stata .dta files
write_delim()       General delimited files

library(haven)
library(tidyverse)
write_csv(starwars, "starwars_data.csv")
starwars_clean <- select(starwars, -films,
-vehicles, -starships)
write_sav(starwars_clean, "starwars_clean_data.sav")
What are dplyr and tidyr?
The dplyr package provides easy tools for the most common data
manipulation tasks.
It is a powerful R package for manipulating, cleaning and
summarizing tabular data.
It makes data exploration and data analysis easy and fast in R.
The most common dplyr functions are:

Function                          Description
select()                          Subset columns (variables)
filter()                          Subset rows on conditions
mutate()                          Create new columns
group_by()                        Group the data
summarize()                       Create summaries by a categorical variable
arrange()                         Sort the data (results)
left_join(), inner_join(), ...    Join data frames (tables)
count()                           Count discrete values
select() function

It allows you to select variables (columns) from your data.
Columns can be chosen by name or with the helper functions below.

Helper          Description
starts_with()   Starts with an exact prefix
ends_with()     Ends with an exact suffix
contains()      Contains a literal string
matches()       Matches a regular expression
num_range()     Numerical ranges like x01, x02, x03, ...
one_of()        Variables in a character vector (superseded by any_of() and all_of())
everything()    All variables

Cont. . . .
# Select variables
starwars %>%
select(height,mass,sex,birth_year,eye_color)
# Select variables whose names start with "b"
starwars %>%
  select(starts_with("b"))

# Select variables using the ends_with() function
starwars %>%
  select(ends_with("color"))

# Select variables using the contains() function
starwars %>%
  select(contains("i"))

Cont. . . ..

starwars %>%
select(-films,-vehicles,-starships)

# OR
select(starwars,-films,-vehicles,-starships)

filter()

filter() is used to subset rows by value.
It chooses rows based on certain criteria.
filter(<DATA>, <PREDICATES>)

Predicates: TRUE/FALSE statements
Comparisons: >, >=, <, <=, != (not equal), and == (equal).
Operators: & is “and”, | is “or”, and ! is “not”
Using the pipe operator:
DATA %>%
  filter(<PREDICATES>)
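
A minimal sketch of how predicates combine, using a small made-up table (the object people and its values are assumptions for illustration). Note that filter() drops rows where the predicate evaluates to NA.

```r
library(dplyr)

# Made-up data with one missing age.
people <- tibble(
  name = c("A", "B", "C", "D"),
  age  = c(25, 17, 40, NA),
  sex  = c("female", "male", "male", "female")
)

# Single predicate: the row with a missing age is dropped.
adults <- people %>% filter(age >= 18)

# "and": both conditions must be TRUE.
adult_males <- people %>% filter(age >= 18 & sex == "male")

# "or": at least one condition must be TRUE.
young_or_female <- people %>% filter(age < 18 | sex == "female")

# "not": negate a condition.
not_male <- people %>% filter(!(sex == "male"))
```

Separating conditions with a comma inside filter() is equivalent to combining them with &.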

Cont. . . .

# single criteria
filter(starwars, species == "Human")
filter(starwars, mass > 1000)

# Multiple criteria
filter(starwars, hair_color == "none" &
eye_color == "black")
starwars %>%
filter(hair_color=="none",eye_color=="black")

filter(starwars, hair_color == "none" |
         eye_color == "black")

Cont. . . ..

Select variables by removing unwanted variables from the dataset, then filter the rows.


starwars %>%
select(-name, -films,-vehicles,-starships) %>%
filter(mass>50 & sex=="male")

mutate()

mutate() creates new columns based on the values in existing columns.
mutate(<DATA_name>, <NAME> = <FUNCTION>)

starwars %>%
  mutate(BMI=mass/(height/100)^2) %>%
  mutate(Height=height/100)

group_by() and summarize()

Take a column of data from a data frame and reduce it down to a
single summary statistic, by some grouping variable.
dplyr makes this very easy through the use of the group_by()
and summarize() functions.
group_by() is often used together with summarize(), which
collapses each group into a single row summary of that group.
group_by() takes as arguments the column names that contain
the categorical variables for which you want to calculate the
summary statistics.

Cont. . . ..

starwars %>%
  select(sex,height,mass,species) %>%
  filter(species=="Human") %>%
  na.omit() %>%
  mutate(height=height/100) %>%
  mutate(BMI=mass/height^2) %>%
  group_by(sex) %>%
  summarise(average_BMI=mean(BMI))

Cont. . . ..
Group by multiple columns by adding more categorical variables in
group_by().
After grouping the data, summarize multiple variables at the same
time:
library(pander)
diabetes %>%
  filter(!is.na(weight)) %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age),
            min_age=min(age),
            median_age=median(age),
            max_age=max(age),
            n=n()) %>%
  pander()   # print the summary as a formatted table
arrange()

The arrange() function is used to order/sort the data frame rows
in either ascending or descending order based on column values.
By default, it sorts the rows in ascending order (from smallest to
largest).
diabetes %>%
select(gender,height,weight,frame) %>%
arrange(weight)
# arrange the data in descending order by weight.
diabetes %>%
select(gender,height,weight,frame) %>%
arrange(desc(weight))

count()

The count() function is used to count the number of occurrences of
each unique value in a column or a combination of columns.
diabetes %>%
count(frame)
starwars %>%
count(eye_color,name="frequency")

To Count combinations of two categorical variables


diabetes %>%
count(gender,frame,name= "frequency")

Cont. . . ..

Create two-way table


diabetes %>%
drop_na() %>%
count(gender,frame) %>%
group_by(gender) %>%
pivot_wider(names_from = frame,values_from = n,
values_fill = 0)

This reshapes the count result into a table where rows are
gender, columns are frame, and cells are counts.

Data visualization with ggplot2

Data visualization is part art and part science.
A data visualization first and foremost has to accurately convey
the data.
It must not mislead or distort.
A data visualization should be aesthetically pleasing.
Good visual presentations tend to enhance the message of the
visualization.
What are the key principles, methods, and concepts required to
visualize data for publications, reports, or presentations?

Why ggplot2?

A grammar of graphics is a grammar used to describe and create
a wide range of statistical graphics.
ggplot2 delivers on the promise of a grammar for graphics:
Easy to manage, save, etc.
Graphs are composed of layers.
Easy to add stuff to existing graphs.
It takes less work to make beautiful and
eye-catching graphics.
Enables creation of reproducible visualization patterns.
Publication quality & beyond

Aesthetics (aes()) function

Aesthetic   Description
x, y:       variables mapped to the axes
color:      colors the lines and points of geometries
fill:       fill color of geometries
group:      groups based on the data
shape:      shape of point, an integer value 0 to 25, or NA
linetype:   type of line, an integer value 0 to 6 or a string
size:       sizes of elements, a non-negative numeric value
alpha:      changes the transparency, a numeric value 0 to 1

Geometric layers
Geometries (geom_*()) functions
The general syntax is:
ggplot(data = data, mapping = aes(mappings)) +
  geom_function()
Geom Components

Function           Description
geom_histogram()   Histogram
geom_point()       Scatter plot
geom_line()        Line plot
geom_bar()         Bar chart
geom_boxplot()     Box plot
geom_smooth()      Add trend line (e.g., linear regression)
geom_density()     Density curve

Cont. . . .
The shape of a point can be specified with an integer 0 to 25, or NA.
The linetype aesthetic can be specified with either an integer 0 to
6 or the corresponding name:
0 = blank, 1 = solid, 2 = dashed, 3 = dotted,
4 = dotdash, 5 = longdash, 6 = twodash

Facets

Facets are added as an additional layer to the plot.
Facet wraps are a useful way to view individual categories in their
own graph.
The following table describes how facet formulas work in
facet_grid() and facet_wrap():

Type function Description


Grid facet_grid(. ~ x) Facet horizontally across x values
Grid facet_grid(y ~ .) Facet vertically across y values
Grid facet_grid(y ~ x) Facet 2-dimensionally
Wrap facet_wrap(~ x) Facet across x values
Wrap facet_wrap(~ x + y) Facet across x and y values

Customize Our Plot

Here are some key aspects you can customize: axes, titles and
legends.
Title and axes components: changing size, colour and face.
Customizing axis labels with labs():
used to modify plot labels, including x-axis, y-axis, and plot title.

Histogram
diabetes %>%
drop_na() %>%
ggplot(aes(weight))+
geom_histogram(binwidth = 10, fill = "steelblue",
color = "white")+
theme_minimal()

To split by frame and colour by gender:
diabetes %>%
  drop_na() %>%
  ggplot(aes(weight, fill = gender)) +
  # no fixed fill here, so the fill = gender mapping takes effect
  geom_histogram(binwidth = 10, color = "white") +
  facet_wrap(~frame) +
  theme_minimal()
Bar chart
simple barchart
diabetes %>%
ggplot(aes(x = frame)) + geom_bar()

Component Bar chart


diabetes %>%
drop_na() %>%
ggplot(aes(x=frame,colour = gender))+
geom_bar()

diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar()

Cont. . . ..
Positions

geom_bar(position = "<POSITION>")
When several groups share the same x value, how are the bars positioned?
bar: dodge, fill, stack (default)
diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "dodge")

diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "stack")

Box Plot

diabetes %>%
drop_na() %>%
ggplot(aes(frame, weight, fill=gender))+
geom_boxplot()+
labs(x = "Frame",
     y = "Weight of patients")

scatter plot

library(ggpmisc)
diabetes %>%
drop_na() %>%
ggplot(aes(x=height,y=weight, linetype =gender ))+
geom_point()+
geom_smooth(method = "lm", se = TRUE) +
stat_poly_eq(
aes(label = paste(after_stat(eq.label), after_stat(rr.label),
                  sep = "~~~")),
formula = y ~ x,
parse = TRUE
)

Cont. . . ..

diabetes %>%
drop_na() %>% #remove missing value
ggplot(aes(height,weight))+
geom_point()+
facet_wrap("frame")

diabetes %>%
drop_na() %>% #remove missing value
ggplot(aes(height,weight))+
geom_point()+
facet_grid("frame")

Labels, titles, and legends

Add Labels
xlab() , ylab() , labs(x = "X-axis name", y = "y-axis name")

Add titles

Add legends
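
A minimal sketch of adding labels, a title and a legend title in one labs() call. It uses the built-in mtcars data; the label texts and the legend position are illustrative choices, not taken from the slides.

```r
library(ggplot2)

# mtcars ships with R; the label texts below are illustrative.
p <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(
    title  = "Fuel economy versus horsepower",
    x      = "Gross horsepower",
    y      = "Miles per gallon",
    colour = "Cylinders"               # legend title
  ) +
  theme(legend.position = "bottom")    # move the legend below the plot

p
```

The legend title is set through the same aesthetic name that was mapped in aes(), here colour.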

Themes
Themes control the non-data parts of the plots,
like fonts, backgrounds, grid lines, legends,
margins, and titles.
Pre-specified themes are the following:

Theme Description
theme_gray() Default ggplot2 theme
theme_bw() Black and white theme, good for print
theme_minimal() Very clean and minimal background
theme_classic() Classic look (no grid lines)
theme_light() Light background with subtle grid lines
theme_dark() Dark version of theme_light()
theme_void() Removes everything (useful for pie charts or maps)

Cont. . .

mtcars %>%
ggplot(aes(hp, mpg, col = factor(cyl))) +
geom_point(size = 3)+
theme_dark()

Activity

1 Using the storms data from the nasaweather package, create a scatter
plot of wind against pressure, with color being used to
distinguish the type of storm.
2 Using the penguins data set from the palmerpenguins package:
a. Create a scatterplot of bill_length_mm against bill_depth_mm
where individual species are colored and a regression line is added
to each species. Add regression lines to all of your facets. What
do you observe about the association of bill depth and bill length?
b. Repeat the same scatterplot but now separate your plot into
facets by species. How would you summarize the association
between bill depth and bill length?

Statistical Analysis Using R

What is statistical modeling?
Statistical modeling uses mathematical models and statistical
assumptions to describe how data are generated, so that the data
can be used to understand real-life situations.
Statistical models are used to describe sample data and make
predictions about the real world.
A statistical model is a collection of probability distributions on a
set of all possible outcomes of an experiment.

Types of Statistical Models

Statistical models can be placed into groups based on their parameters.
The following are explanations of the statistical models.
Parametric models assume probability distributions described by a
fixed, finite set of parameters.
Nonparametric models do not fix the number or nature of the
parameters in advance; these are determined from the data.
Semiparametric models are a blend of the parametric and
nonparametric models, with both fixed and flexible components.

T test
There are three types of t-test:
A one-sample t-test compares the mean of a sample to a
known value, usually the population mean.
An independent t-test is used to compare the mean of one sample
with the mean of another sample to see if there is a statistically
significant difference between the two.
A paired t-test is used to determine whether there is a difference
between the average values of paired samples subjected to two
different conditions.
The default t-test command is
t.test(x, y = NULL,
alternative = c("two.sided","less", "greater"),
mu =0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95)
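
The three variants above can be sketched with R's built-in sleep data set (extra hours of sleep for 10 patients under two drugs). This is only an illustration: the choice of data set and the object names (g1, g2, p_values) are assumptions, not part of the training material.

```r
# Built-in 'sleep' data: 10 patients, each measured under two drugs
data(sleep)
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# One-sample t-test: is the mean of g1 different from 0?
one_sample <- t.test(g1, mu = 0)

# Independent (Welch) t-test: treats the two groups as unrelated samples
independent <- t.test(g1, g2)

# Paired t-test: uses the fact that the same patients appear in both groups
paired <- t.test(g1, g2, paired = TRUE)

# Collect the p-values side by side for comparison
p_values <- c(one_sample  = one_sample$p.value,
              independent = independent$p.value,
              paired      = paired$p.value)
print(round(p_values, 4))
```

Because the same patients are measured twice, the paired test removes between-patient variation, which is why it can detect a difference that the independent test misses on the same numbers.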
Independent t-test

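Using the diabetes data

library(dplyr)
# Keep rows with a recorded weight, then split by gender
female <- diabetes %>% filter(!is.na(weight), gender == "female")
FW <- female$weight
male <- diabetes %>% filter(!is.na(weight), gender == "male")
MW <- male$weight
t.test(MW, FW) # Welch two-sample t-test (var.equal = FALSE by default)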

ANOVA

ANOVA provides a statistical test of whether or not the means of
several groups are equal, and it generalizes the t-test to more
than two groups.
One-way ANOVA is a technique used to compare the means of two or
more groups defined by a single factor.
Example: Agricultural researchers designed an experiment to look
at crop yield in a number of plots in a research farm. Crop yield
was recorded as a function of irrigation (2 levels: irrigated or
not), sowing density (3 levels: low, medium, high), and fertilizer
application (3 levels: N, P, NP).
The data can be retrieved from:
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/splityield.txt

Cont. . . .
Import the data from the web
crop <- read.delim("http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/splityield.txt")
str(crop) # To see the structure of the data
head(crop) # To view the first 6 observations
tail(crop) # To view the last 6 observations
# Level names must match the spelling in the file exactly;
# values that do not match a listed level become NA
crop$density <- factor(crop$density,
levels = c('low', 'medium', 'high'))
# Two levels; the default alphabetical order puts the control first
crop$irrigation <- factor(crop$irrigation)
crop$fertilizer <- factor(crop$fertilizer, levels = c('N', 'P', 'NP'))
crop$block <- factor(crop$block, levels = c("A", "B", "C", "D"))
attach(crop) # attach after converting, so the factors are visible by name
levels(crop$density)
levels(crop$irrigation)
Cont. . . .

Boxplots

par(mfrow = c(2, 2)) # arrange the four plots in a 2 x 2 grid
boxplot(yield ~ block, data = crop, col = c(2:5), ylab = "Crop yield")
boxplot(yield ~ irrigation, data = crop, col = c(2:3), ylab = "Crop yield")
boxplot(yield ~ density, data = crop, col = c(2:5), ylab = "Crop yield",
xlab = "Sowing density", main = "Boxplot of crop production")
boxplot(yield ~ fertilizer, data = crop, col = c(2:5), ylab = "Crop yield",
xlab = "Fertilizer application",
main = "Boxplot of crop production yield by fertilizer")

Cont. . . ..

Summarize by group

library(dplyr)
crop %>% group_by(density) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(fertilizer) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(irrigation) %>%
summarise(average_yield=mean(yield))

One-way ANOVA

model1<-aov(yield~fertilizer,data=crop)
summary(model1)

Post hoc test for one-way ANOVA

TukeyHSD(model1)
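
The same two steps (overall F test, then Tukey pairwise comparisons) can be tried without downloading anything, using R's built-in PlantGrowth data. This is a sketch under that assumption; the object names are illustrative only.

```r
# Built-in 'PlantGrowth' data: plant weight under a control and two treatments
data(PlantGrowth)

# One-way ANOVA: does mean weight differ between the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
anova_table <- summary(fit)[[1]]

# p-value of the overall F test (first row is the 'group' term)
f_p_value <- anova_table[["Pr(>F)"]][1]

# Tukey HSD: which pairs of groups differ, with adjusted p-values
tukey <- TukeyHSD(fit)

print(anova_table)
print(tukey)
```

The F test only says that at least one group mean differs; the Tukey table is what identifies the specific pairs while controlling the family-wise error rate.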

Two-way ANOVA in R

two <- aov(yield ~ density + fertilizer, data = crop) # main effects only
summary(two)
two1 <- aov(yield ~ density * fertilizer, data = crop) # with interaction
summary(two1)
two2 <- aov(yield ~ density + irrigation, data = crop) # main effects only
summary(two2)
two3 <- aov(yield ~ density * irrigation, data = crop) # with interaction
summary(two3)

Post hoc test for two-way ANOVA

TukeyHSD(two1)
TukeyHSD(two3)
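
To decide between the additive model and the model with interaction, the two fits can be compared directly. The sketch below uses the built-in ToothGrowth data as a stand-in for the crop data; the data set and object names are assumptions made for illustration.

```r
# Built-in 'ToothGrowth' data: tooth length by supplement type and dose
data(ToothGrowth)
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # treat dose as a categorical factor

# Main effects only
additive <- aov(len ~ supp + dose, data = tg)

# Main effects plus the supp:dose interaction
with_interaction <- aov(len ~ supp * dose, data = tg)

# F test comparing the two nested models: a small p-value suggests
# the interaction term improves the fit
comparison <- anova(additive, with_interaction)

print(summary(with_interaction))
print(comparison)
```

If the interaction is not significant, the simpler additive model is usually easier to interpret and report.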

Linear Regression Analysis
Regression is a type of analysis used to predict an outcome from
other available information.
Linear regression is used to predict the value of an outcome
variable y based on one or more input predictor variables.
Simple Linear Regression: regression with only one predictor
variable (IV).
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
Multiple Linear Regression: regression with more than one
predictor variable (IVs).

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$

Multiple linear regression is useful for modelling the relationship
between a numeric outcome (dependent variable) and multiple
explanatory (independent) variables.
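
For the simple linear regression equation, the least-squares estimates have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against lm() on the built-in cars data; the data set and variable names are assumptions chosen for illustration.

```r
# Built-in 'cars' data: stopping distance (dist) versus speed
data(cars)
x <- cars$speed
y <- cars$dist

# Closed-form least-squares estimates for y = beta0 + beta1 * x + error
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

# The same model fitted with lm()
fit <- lm(dist ~ speed, data = cars)

print(c(intercept = beta0_hat, slope = beta1_hat))
print(coef(fit))
```

Both print statements should show the same two numbers, which confirms that lm() is computing the ordinary least-squares solution.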
Regression Analysis R commands

mymodel <- lm(y ~ x): # To fit a simple regression model
summary(mymodel): # To show the fitted regression model results
anova(mymodel): # To display the regression model ANOVA table
coef(mymodel): # To display the regression coefficients
fits <- fitted(mymodel): # To store the fitted values
resids <- residuals(mymodel): # To store the residual values
beta1hat <- coef(mymodel)[2]: # To assign the slope coefficient to
the name "beta1hat"
confint(mymodel): # CIs for all parameters

Cont. . . .
Diagnostic Checking

par(mfrow = c(2, 2))
plot(mymodel) # the four standard diagnostic plots
plot(mymodel, which = 4) # Cook's distance plot
library(car)
vif(mymodel) # Test of multicollinearity (needs at least two predictors)

Diagnostic checking of influential observations and outliers

outlierTest(mymodel) # from the car package
influence(mymodel)
dffits(mymodel)
dfbetas(mymodel)
hatvalues(mymodel)
cooks.distance(mymodel)
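
The influence measures above return one value per observation, so they are most useful when compared with a cut-off. The sketch below flags observations using two common rules of thumb (Cook's distance above 4/n, leverage above 2p/n) on the built-in mtcars data. The data set, the model, and the cut-offs are assumptions for illustration; other thresholds are also in use.

```r
# Illustrative model on the built-in 'mtcars' data
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)        # number of observations
p <- length(coef(fit))   # number of estimated coefficients

cooks <- cooks.distance(fit)  # overall influence of each observation
lev   <- hatvalues(fit)       # leverage of each observation

# Flag observations exceeding the rule-of-thumb cut-offs
flagged <- data.frame(cooks_d       = cooks,
                      leverage      = lev,
                      high_cooks    = cooks > 4 / n,
                      high_leverage = lev > 2 * p / n)

# Show only the observations worth a closer look
print(flagged[flagged$high_cooks | flagged$high_leverage, ])
```

A flagged observation is not automatically an error; it is a prompt to check the data and to see how much the fit changes without it.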
Example

Using the Boston housing data set (~1978) from the MASS package:
Variables (the predictors are a mix of continuous variables and factors)
medv: median property value in $1000s (Y/outcome variable)
rm: average number of rooms per dwelling
crim: per capita crime rate by town
rad: index of accessibility to radial highways
install.packages("MASS") # only needed if MASS is not already installed
library(MASS)
data("Boston")
head(Boston)
attach(Boston) # optional; the models below use data = Boston
str(Boston)

Cont. . . .

Descriptive statistics of the Boston data

The stargazer package for R provides a way to create
publication-quality tables, and a way for researchers to avoid
rebuilding tables by hand each time they tweak their dataset.
It can produce both summary statistics tables and regression tables.
install.packages("stargazer")
library(stargazer)
stargazer(Boston, type = "text",
title = "Descriptive statistics",
digits = 2, out = "table1.txt")

Cont. . .

For simple linear regression

mymodel1 <- lm(medv ~ rm, data = Boston)
summary(mymodel1) # vif() does not apply to a model with a single predictor

For multiple linear regression

mymodel2 <- lm(medv ~ rm + crim + factor(rad) + age, data = Boston)

Cont. . .
To check multicollinearity
vif(mymodel2) # reports generalized VIFs because rad enters as a factor

To display the regression model ANOVA table

anova(mymodel2)

To show the fitted regression model results

summary(mymodel2)

To construct 95% CIs for the parameters

confint(mymodel2, level = 0.95)
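
Once a model is fitted, the same object can be used for prediction. The sketch below fits a reduced model on the Boston data and predicts the median value for a hypothetical town; the reduced formula, the object names (fit, ci, new_town, pred) and the input values are assumptions chosen for illustration.

```r
library(MASS)  # provides the Boston data set

# Reduced illustrative model: median value explained by rooms and crime rate
fit <- lm(medv ~ rm + crim, data = Boston)

# 95% confidence intervals for the intercept and the two slopes
ci <- confint(fit, level = 0.95)

# Hypothetical town: 6 rooms per dwelling on average, crime rate of 1
new_town <- data.frame(rm = 6, crim = 1)

# Predicted mean medv with a 95% confidence interval
pred <- predict(fit, newdata = new_town, interval = "confidence")

print(ci)
print(pred)
```

Using interval = "prediction" instead would give the wider interval for a single new town rather than for the mean response.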

Logistic Regression

Logistic regression is a GLM used to model a binary categorical
variable using numerical and categorical predictors.
model <- glm(response ~ predictor1 + predictor2,
data = data_name, family = binomial)

response: Your binary outcome variable (coded as 0 and 1).
predictor1, predictor2, etc.: Independent variables.
data_name: Your dataset.
family = binomial: Specifies logistic regression (logit link).
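
Logistic regression coefficients are on the log-odds scale, so they are usually exponentiated and reported as odds ratios. The sketch below uses the built-in mtcars data (am is coded 0/1) as a stand-in for a real study; the data set, the model, and the use of Wald intervals via confint.default() are assumptions made for illustration.

```r
# Binary outcome: am (0 = automatic, 1 = manual) modelled by car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Odds ratios with Wald 95% confidence intervals
# (exponentiating moves from the log-odds scale to the odds scale)
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))

# Fitted probabilities of am = 1 for each car
probs <- predict(fit, type = "response")

print(round(or_table, 3))
print(round(head(probs), 3))
```

An odds ratio below 1 means the odds of the outcome fall as the predictor increases; above 1 means they rise.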

Cont. . . ..

library(tidyverse)
library(gtsummary) # provides the example 'trial' data set
library(survival)
library(sjPlot) # provides tab_model()
# Multivariable regression
model <- glm(response ~ trt + age + grade,
data = trial, family = binomial)
tab_model(model, file = "result.doc") # save output

Customize output
How can the regression output be customized to a publication format?
The following packages and commands are used to customize the
regression output:
install.packages("sjPlot")
install.packages("sjmisc")
install.packages("sjlabelled")
library(sjPlot)
library(sjmisc)
library(sjlabelled)
tab_model(mymodel2)
tab_model(mymodel2,string.ci = "95% CI",
string.p = "p-value",
string.pred = "Characteristics",
show.aic = TRUE,show.aicc = TRUE)
Save the output

tab_model(mymodel2, file = "YOURTABLENAME.doc") # any fitted model object can be used here
