R Software Training

Data Manipulation Using tidyverse

Statistical Analysis Using R

Endalew T.
Debre Markos University
College of Natural Science
Department of Statistics

April 14, 2025

Endalew T. (Debre Markos University, College of Natural Science, Department of Statistics), R Software Training, April 14, 2025
Outline

1 The basics of the R environment and RStudio.

2 Data preparation using R packages
3 Programming techniques and managing datasets
4 Regression analysis and data visualization
5 Creating publication-ready Word tables in R

What is R?

R is a popular programming language used for statistical
computing and graphical presentation.
A programming language designed for statistical data analysis
A statistical software program
A community of data scientists and practitioners
It is one of the most widely used languages by statisticians, data
analysts and researchers to manage, manipulate, analyze and
visualize data.
R is case sensitive

Why use R?
It is a great resource for data analysis, data visualization, data
science and machine learning
It provides many statistical techniques (statistical tests,
classification, clustering and data reduction)
It is easy to draw graphs in R, such as pie charts, histograms, box
plots, and scatter plots.
It works on different platforms (Windows, Mac, Linux)
It is open-source and free
It has large community support
It has many packages (libraries of functions) that can be used to
solve different problems
Packages for almost everything:
Data processing and cleaning
Data visualization
Interactive web-apps
Typesetting, writing articles and slides
Installing R and RStudio
Install R:
1 Go to https://cran.rstudio.com/ to access the R installation
page. Then click the download link for Windows.
2 Choose the “base” sub-directory.
3 Then click on the download link at the top of the page to
download the latest version of R.

Cont. . . ..
Install RStudio:
To download RStudio, go to
https://www.rstudio.com/products/rstudio/download/ and
download the Windows version.
Then click on the downloaded file and follow the installation
instructions.

The R User Interface

RStudio

A widely used Integrated Development Environment (IDE) for R.
It is powerful and makes using R easier.
RStudio can:
Organize your code, output, and plots.
Auto-complete code and highlight syntax.
Help view data and objects.
Enable easy integration of R code into documents.
User-friendly interfaces

The RStudio User Interface

Using RStudio is completely optional, but it has many helpful
features to make writing and managing R code easier.
Cont. . . ..
1. Source pane
Write your R code/script here; it is run in the console.
Write R code in reusable files.
To create a new R script you can either go to
File -> New File -> R Script, or
click on the icon with the + sign and select R Script, or
simply press Ctrl+Shift+N.
Make sure to save the script.

2. Console pane
Interactively run R commands.
Commands are submitted to R to execute.
Execute R code line by line.
Cont. . . .

3. Environment/History pane
Environment: Shows all the objects (data, variables, functions)
currently loaded in memory
History: Search and view command history

4. Files/Plots/Packages/Help/Viewer pane
Files: Browse files and folders.
Plots: Displays generated graphs and plots
Packages: View, install, and load R packages
Help: Access documentation for R packages and functions
Viewer: Displays local web content, such as HTML output and interactive widgets.

Customization
Panes
The size and position of the panes can be customized.
On the top right of each pane, there are buttons to adjust the
pane size.
Also, place your mouse pointer/cursor on the borderline between
panes and when the pointer changes its shape, click and drag to
adjust the pane size.
For more options, go to View > Panes on the menu bar.
Alternatively, Tools > Global Options > Pane Layout.

Appearances
The overall appearance can be customized as well.
Go to Tools > Global Options > Appearance on the menu bar to
change themes, fonts, and more.
Working Directory in R

The working directory is just a file path on your computer that
sets the default location of any files you read into R, or save out
of R.
The working directory is where you are currently saving data in R.
What is the current working directory?
Type in getwd()
getwd()

## [1] "C:/Users/tesfa/Desktop/R folder/Training"

How to set the working directory?
Type in setwd("path")
setwd("C:/Users/tesfa/Desktop/R folder/Training")
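The calls above can be tried without depending on one machine's folder layout. The following is a minimal sketch, assuming the temporary folder returned by tempdir() as a stand-in for a real project folder; the file name example.csv is purely illustrative.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() is only a placeholder for a real project folder
setwd(tempdir())
new_wd <- getwd()
print(new_wd)

# file.path() builds a path with the correct separator on any platform
out_file <- file.path(new_wd, "example.csv")   # illustrative file name
write.csv(data.frame(x = 1:3), out_file, row.names = FALSE)
file_was_written <- file.exists(out_file)

# Restore the original working directory
setwd(old_wd)
```

Restoring the previous directory at the end keeps a script from silently changing where later files are read from or saved to.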

Installing and Loading Packages
Packages are collections of R functions, data, and compiled code
in a well-defined format.
Installing Packages

install.packages("package_name")
Packages can also be installed from RStudio (Packages tab > Install).

Loading Packages

library(package_name)
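A common pattern is to install a package only when it is not already available. The sketch below is one possible way to do this; install_if_missing() is a hypothetical helper written for this example, not a function from base R, and the "stats" package is used only because it ships with every R installation.

```r
# Hypothetical helper: install a package only when it is missing
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call does not trigger a download
ok <- install_if_missing("stats")

# Installing is done once; loading is needed in every new session
library(stats)
```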
Data Structure in R

A data structure is a particular way of organizing data in a
computer so that it can be used effectively.
Data structures in R programming are tools for holding multiple
values.
The most essential data structures used in R include:
Vectors
Lists
Data frames
Matrices
Arrays
Factors

Vector

Vectors are the simplest type of object in R.
Every element of a vector must be the same data type.
Use the c() function to create a vector.
In R, c() stands for combine or concatenate.
It is a function used to create vectors by combining individual
values into a single vector.
There are 3 main types of vectors:
Numeric vectors
Character vectors
Logical vectors

Cont. . . .
# numeric Vector
Y <- c(1,2,3,4,5,6,7,8,9,10)
class(Y)
# character vector
name <- c("Abebe","Kebede", "Almaze","Aster")
class(name)
# logical vector
logic <- c(TRUE,FALSE,FALSE, TRUE)
class(logic)
Create a vector using rep()
The function rep() can be used to replicate an object in various
ways.
bloodg<-rep(c("A","B","AB","O"),c(2,3,4,3))
bloodg
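Because every element of a vector must share one type, R silently converts mixed inputs to a common type. The short sketch below illustrates this, together with the times and each arguments of rep(); the object names are chosen only for this example.

```r
# Mixing types: everything is coerced to character
mixed <- c(1, "a", TRUE)
class(mixed)      # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
num_logical <- c(10, TRUE, FALSE)
class(num_logical)  # "numeric"

# times repeats the whole vector; each repeats every element in turn
by_times <- rep(c("A", "B"), times = 2)   # "A" "B" "A" "B"
by_each  <- rep(c("A", "B"), each = 2)    # "A" "A" "B" "B"
```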
Cont. . . .
Create a vector using the seq() function

seq(1, 10)
seq(from=1, to=10)
seq(to=10, from=1)
The arguments by=value and length.out=value specify the step size
and the length of the sequence, respectively.
seq(1,5, by=2)
## [1] 1 3 5
seq(1,10, length.out=5)
## [1] 1.00 3.25 5.50 7.75 10.00
seq(from=1, by=2.25, length.out=5)
## [1] 1.00 3.25 5.50 7.75 10.00
Factors

Factors are the data objects which are used to categorize the data
and store it as levels.
They are useful for storing categorical data.
fac <- factor(c("Male", "Female", "Male", "Male",
"Female", "Male","Female"))
fac
levels(fac)
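By default the levels of a factor are sorted alphabetically. When the categories have a natural order, the levels can be given explicitly. The following sketch uses made-up education categories to show this.

```r
# Ordered factor with explicitly specified level order
educ <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

levels(educ)        # "low" "medium" "high"
codes <- as.integer(educ)   # position of each value among the levels
counts <- table(educ)       # frequency of each level

# Comparisons are meaningful for ordered factors
below_high <- educ < "high"
```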

Lists

A list could consist of a numeric vector, a logical value, a matrix,
a complex vector, a character array, a function, and so on.
Lists are created with the list() command:
L<-list(object-1,object-2,...,object-m)
L <- list( c(1,5,3), matrix(1:6, nrow=3),
c("Hello", "world") )
L
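Elements of a list are usually easier to work with when they are named. The sketch below, using an invented person record, shows the difference between [[ ]] or $, which extract the element itself, and [ ], which returns a smaller list.

```r
# A named list holding elements of different types
person <- list(name = "Abebe", scores = c(80, 95, 70), passed = TRUE)

person$name                 # extract by name
second_score <- person[["scores"]][2]   # extract, then index the vector

# Single brackets keep the list wrapper; double brackets drop it
class_single <- class(person["scores"])
class_double <- class(person[["scores"]])

str(person)                 # compact overview of the structure
```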

Matrices

A matrix is a rectangular arrangement of numbers in rows and
columns.
Every element of a matrix must be the same data type.
The basic syntax is:
matrix(data, nrow, ncol, byrow = FALSE)

data: the elements to be placed in the matrix

nrow: indicates the number of rows
ncol: indicates the number of columns
byrow: fill matrix by row (default is column-wise)

Cont. . . ..

Naming rows and columns

# m1 is assumed to be a 3 x 3 matrix, for example:
m1 <- matrix(1:9, nrow = 3, ncol = 3)
rownames(m1) <- c("Row1", "Row2", "Row3")
colnames(m1) <- c("Col1", "Col2", "Col3")
print(m1)

Arrays

Arrays are the R data objects which store the data in more than
two dimensions.
Arrays are n-dimensional data structures.
Every element of an array must be the same data type.
array(data, dim = c(dim1, dim2, dim3, ...))

dim1: indicates the number of rows

dim2: indicates the number of columns
dim3: indicates the matrix layers
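As a small illustration of the syntax above, the sketch below builds an array with 2 rows, 3 columns, and 4 layers from the numbers 1 to 24. Values are filled column by column within each layer.

```r
# 2 rows x 3 columns x 4 layers
a <- array(1:24, dim = c(2, 3, 4))

dim(a)                      # 2 3 4

# Index with [row, column, layer]
one_value <- a[1, 2, 3]     # row 1, column 2, layer 3

# Leaving an index empty selects everything along that dimension
first_layer <- a[, , 1]     # a 2 x 3 matrix
```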

Data frames

A data frame in R is a tabular data structure whose columns can
each hold a different data type.
df <- data.frame(Age=c(20,18,19,12,15),
                 Gender=c("Male","Female","Female",
                          "Male","Male"),
                 Educ_level=c("Secondary","Primary",
                              "Primary","Secondary",
                              "Illiterate"))
df
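Once a data frame exists, a few base R functions are enough to inspect it and pull out parts of it. The sketch below uses a small made-up data frame named students, chosen for this example only.

```r
# Small illustrative data frame
students <- data.frame(
  Age    = c(20, 18, 19),
  Gender = c("Male", "Female", "Female"),
  stringsAsFactors = FALSE
)

str(students)                         # structure: column names and types
n_rows    <- nrow(students)           # number of observations
col_types <- sapply(students, class)  # type of each column

# Columns are accessed with $, rows with a logical condition
mean_age <- mean(students$Age)
age_19_plus <- students[students$Age >= 19, ]
```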

Useful functions for vectors

Function        Description
class(x)        returns the class/type of vector x
length(x)       returns the total number of elements
x[length(x)]    returns the last value of vector x
rev(x)          returns the reversed vector
sort(x)         returns the sorted vector
unique(x)       returns the vector with duplicate elements removed
range(x)        returns the minimum and maximum of x
quantile(x)     returns quantiles of x for the given probabilities
which.max(x)    index of the maximum
which.min(x)    index of the minimum
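The functions in the table can be tried on a short numeric vector. The values below are arbitrary and serve only as an illustration.

```r
x <- c(5, 3, 8, 3, 1)

n        <- length(x)        # 5
last     <- x[length(x)]     # 1
reversed <- rev(x)           # 1 3 8 3 5
sorted   <- sort(x)          # 1 3 3 5 8
distinct <- unique(x)        # 5 3 8 1
min_max  <- range(x)         # 1 8
i_max    <- which.max(x)     # 3 (position of the value 8)
i_min    <- which.min(x)     # 5 (position of the value 1)

quantile(x)                  # minimum, quartiles, and maximum
```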

Remove Objects from R Environment

To remove objects from the R environment, use the following
functions:
1. To remove specific objects in R:

rm(object_name1, object_name2)

2. To remove all the objects in R:

rm(list = ls(all.names = TRUE))

Data Manipulation and Cleaning Using
the dplyr Package
What is tidyverse?
Tidyverse is a collection of friendly and consistent tools for data
analysis and visualization.
Tidyverse helps you to import, clean, transform, visualize, and
model data in a consistent and efficient way.
All packages included in tidyverse are automatically installed when
installing the tidyverse package:
install.packages("tidyverse")
To use the functions in the tidyverse packages, load the package
in every new R session:
library(tidyverse)

cont. . . .

Tidyverse Packages

Data Import and Export

Data are collections of recorded values, numerical or otherwise.
Data can come in different forms.
To analyze data using the R programming language, first import the data
into R.
The data can be in different formats, such as CSV or any other
delimiter-separated file.
After importing, the data can be manipulated, analyzed, and reported.

haven Package
The haven package in R is used to read and write data from other
statistical software.
It is used to import and export SPSS, Stata, and SAS files.
install.packages("haven")
library(haven)

Function        Reads                  Example
read_sas()      SAS                    data <- read_sas("file.sas7bdat")
read_spss()     SPSS (.sav or .por)    data <- read_spss("file.sav")
read_sav()      SPSS (.sav)            data <- read_sav("file.sav")
read_por()      SPSS (.por)            data <- read_por("file.por")
read_stata()    Stata                  data <- read_stata("filename.dta")
read_dta()      Stata                  data <- read_dta("file.dta")

Import data into R
From Excel
library(readxl)
dataset <- read_excel("filePath/filename.xlsx")

From SPSS
library(haven)
dataset <- read_sav("filepath/filename.sav")

From STATA
library(haven)
dataset <- read_stata("filepath/filename.dta")
# or
dataset <- read_dta("filepath/filename.dta")

Cont. . . . . . .

From a URL
library(readr)
crop<-read.delim('http://www.bio.ic.ac.uk/research/mjcraw
head(crop,n=3)

##   yield block irrigation density fertilizer
## 1    90     A    control     low          N
## 2    95     A    control     low          P
## 3   107     A    control     low         NP
From SAS
library(haven)
dataset <- read_sas("filepath/filename.sas7bdat")

Writing (Saving) data
Function              Description
write_csv()           Comma-separated values
write_excel_csv()     CSV that you plan to open in Excel
write_sas()           SAS .sas7bdat files
write_sav()           SPSS .sav files
write_dta()           Stata .dta files
write_delim()         General delimited files

library(haven)
library(tidyverse)
# Drop the list columns first; they cannot be stored in flat files
starwars_clean <- select(starwars, -films,
                         -vehicles, -starships)
write_csv(starwars_clean, "starwars_data.csv")
write_sav(starwars_clean, "starwars_clean_data.sav")
What are dplyr and tidyr?
The dplyr package provides easy tools for the most common data
manipulation tasks.
dplyr is a powerful R package to manipulate, clean, and
summarize tabular data.
It makes data exploration and data analysis easy and fast in R.
The most common dplyr functions are:

Function                        Description
select()                        Subset columns (variables)
filter()                        Subset rows on conditions
mutate()                        Create new columns
group_by()                      Group the data
summarize()                     Create summaries by category variable
arrange()                       Sort the data (results)
left_join(), inner_join(), ...  Join data frames (tables)
count()                         Count discrete values
select() function

It allows you to select variables (columns) from your data.
select() keeps only the specified columns of the data set.
The following helper functions can be used inside select():

Function         Description
starts_with()    starts with an exact prefix
ends_with()      ends with an exact suffix
contains()       contains a literal string
matches()        matches a regular expression
num_range()      numerical ranges like x01, x02, x03, ...
one_of()         variables in a character vector (superseded by any_of() and all_of())
everything()     all variables

Cont. . . .
# Select variables
starwars %>%
  select(height,mass,sex,birth_year,eye_color)
# Select variables using starts_with()
starwars %>%
  select(starts_with("b"))

# Select variables using ends_with()
starwars %>%
  select(ends_with("color"))

# Select variables using contains()
starwars %>%
  select(contains("i"))

Cont. . . ..

starwars %>%
select(-films,-vehicles,-starships)

# OR
select(starwars,-films,-vehicles,-starships)

filter()

filter() is used to subset rows by value.
It chooses rows based on certain criteria.
filter(<DATA>, <PREDICATES>)

Predicates: TRUE/FALSE statements
Comparisons: >, >=, <, <=, != (not equal), and == (equal).
Operators: & is “and”, | is “or”, and ! is “not”
Using the pipe operator:
DATA %>%
  filter(<PREDICATES>)

Cont. . . .

# Single criterion
filter(starwars, species == "Human")
filter(starwars, mass > 1000)

# Multiple criteria: both conditions must hold (AND)
filter(starwars, hair_color == "none" &
         eye_color == "black")
# Comma-separated conditions are also combined with AND
starwars %>%
  filter(hair_color=="none",eye_color=="black")

# At least one condition must hold (OR)
filter(starwars, hair_color == "none" |
         eye_color == "black")
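Two predicates that often come up in practice are membership tests and missing-value checks. The sketch below applies them to the starwars data from dplyr; the object names are chosen for this example only. Note that filter() drops rows for which the condition evaluates to NA.

```r
library(dplyr)

# Keep rows whose eye colour is one of several values
blue_or_brown <- filter(starwars, eye_color %in% c("blue", "brown"))

# Keep rows where mass is missing
no_mass <- filter(starwars, is.na(mass))

# Keep rows where mass is recorded and height is above 200 cm
tall_with_mass <- starwars %>%
  filter(!is.na(mass), height > 200)
```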

Cont. . . ..

Select variables by removing unwanted ones from the dataset, then filter the rows.

starwars %>%
select(-name, -films,-vehicles,-starships) %>%
filter(mass>50 & sex=="male")

mutate()

mutate() creates new columns based on the values in existing columns.

mutate(<DATA_name>, <NAME> = <FUNCTION>)

# height is recorded in centimetres, so divide by 100 to get metres
starwars %>%
  mutate(BMI=mass/(height/100)^2) %>%
  mutate(Height=height/100)
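Several columns can be created in a single mutate() call, including columns based on a condition. The following is a minimal sketch using the built-in mtcars data as a stand-in; the column names kpl, efficiency, and power_to_weight, and the cut-off of 20 miles per gallon, are assumptions made for illustration.

```r
library(dplyr)

cars_extended <- mtcars %>%
  mutate(
    kpl = mpg * 0.425144,                          # miles/gallon to km/litre (approx.)
    efficiency = if_else(mpg > 20, "high", "low"), # conditional column
    power_to_weight = hp / wt                      # combine two existing columns
  )

head(select(cars_extended, mpg, kpl, efficiency, power_to_weight), 3)
```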

group_by() and summarize()

summarize() takes a column of data from a data frame and reduces it
down to a single summary statistic.
Combined with a grouping variable, it produces one summary per group.
dplyr makes this very easy through the group_by() and summarize()
functions.
group_by() is often used together with summarize(), which
collapses each group into a single-row summary of that group.
group_by() takes as arguments the column names that contain
the categorical variables for which you want to calculate the
summary statistics.

Cont. . . ..

starwars %>%
  select(sex,height,mass,species) %>%
  filter(species=="Human") %>%
  na.omit() %>%
  mutate(height=height/100) %>%  # centimetres to metres
  mutate(BMI=mass/height^2) %>%
  group_by(sex) %>%
  summarise(average_BMI=mean(BMI))
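The same group-then-summarize pattern works with any data frame. As a small sketch, the built-in mtcars data is used here as a stand-in, grouping cars by number of cylinders and returning several statistics per group at once.

```r
library(dplyr)

mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%                 # one group per number of cylinders
  summarise(mean_mpg = mean(mpg),   # average fuel efficiency
            sd_mpg   = sd(mpg),     # spread within the group
            n        = n())         # number of cars in the group

mpg_by_cyl
```

The result has one row per group, which is why group_by() and summarise() are usually used together.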

Cont. . . ..
Group by multiple columns by listing several categorical variables in
group_by().
After grouping the data, summarize multiple variables at the same
time:
library(pander)
diabetes %>%
  filter(!is.na(weight)) %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age),
            min_age=min(age),
            median_age=median(age),
            max_age=max(age),
            n=n()) %>%
  pander()   # print the summary as a formatted table
arrange()

The arrange() function is used to order/sort the rows of a data frame
by the values of one or more columns.
By default it sorts the rows in ascending order (from smallest to
largest); wrap a column in desc() for descending order.
diabetes %>%
select(gender,height,weight,frame) %>%
arrange(weight)
# arrange the data in descending order by weight.
diabetes %>%
select(gender,height,weight,frame) %>%
arrange(desc(weight))

count()

The count() function counts the number of occurrences of
each unique value in a column or a combination of columns.
diabetes %>%
count(frame)
starwars %>%
count(eye_color,name="frequency")

To count combinations of two categorical variables:

diabetes %>%
count(gender,frame,name= "frequency")

Cont. . . ..

Create a two-way table

diabetes %>%
drop_na() %>%
count(gender,frame) %>%
group_by(gender) %>%
pivot_wider(names_from = frame,values_from = n,
values_fill = 0)

This reshapes the count result into a table where rows are
gender, columns are frame, and cells are counts.
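The reshaping step can be reproduced with a built-in dataset. The sketch below uses mtcars as a stand-in for the diabetes data, cross-tabulating the number of cylinders against the number of gears; the names_prefix argument is added only to give the new columns readable names.

```r
library(dplyr)
library(tidyr)

two_way <- mtcars %>%
  count(cyl, gear) %>%                      # long format: one row per combination
  pivot_wider(names_from = gear,            # gear values become column names
              values_from = n,              # counts fill the cells
              values_fill = 0,              # combinations that never occur get 0
              names_prefix = "gear_")

two_way
```

Combinations that do not appear in the data, such as 8-cylinder cars with 4 gears, are missing from the count() output, which is why values_fill is needed.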

Data visualization with ggplot2

Data visualization is part art and part science.

A data visualization first and foremost has to accurately convey
the data.
It must not mislead or distort.
A data visualization should be aesthetically pleasing.
Good visual presentations tend to enhance the message of the
visualization.
What are the key principles, methods, and concepts required to
visualize data for publications, reports, or presentations?

Why ggplot2?

A grammar of graphics is a grammar used to describe and create
a wide range of statistical graphics.
ggplot2 implements the grammar of graphics.
Easy to manage, save, etc.
Graphs are composed of layers.
Easy to add elements to existing graphs.
ggplot2 takes less work to make beautiful and
eye-catching graphics.
Enables creation of reproducible visualization patterns.
Publication quality and beyond

Aesthetics (aes()) function

Aesthetic    Description
x, y         variables mapped to the axes
color        colors the lines and outlines of geometries
fill         fill color of geometries
group        groups based on the data
shape        shape of point, an integer value 0 to 25, or NA
linetype     type of line, an integer value 0 to 6 or a string
size         sizes of elements, a non-negative numeric value
alpha        changes the transparency, a numeric value 0 to 1

Geometric layers
Geometries (geom_*()) function
The general syntax is:
ggplot(data = data, mapping = aes(mappings)) +
  geom_function()
Geom components

Function            Description
geom_histogram()    Histogram
geom_point()        Scatter plot
geom_line()         Line plot
geom_bar()          Bar chart
geom_boxplot()      Box plot
geom_smooth()       Add trend line (e.g., linear regression)
geom_density()      Density curve

Cont. . . .
Shape of point: an integer value 0 to 25, or NA
The linetype aesthetic can be specified with either an integer 0 to
6 or the corresponding name.
0 = blank, 1 = solid, 2 = dashed, 3 = dotted,
4 = dotdash, 5 = longdash, 6 = twodash

Facets

Facets are added as an additional layer to the plot.
Facet wraps are a useful way to view individual categories in their
own graph.
The following table describes how facet formulas work in
facet_grid() and facet_wrap():

Type    Function               Description
Grid    facet_grid(. ~ x)      Facet horizontally across x values
Grid    facet_grid(y ~ .)      Facet vertically across y values
Grid    facet_grid(y ~ x)      Facet 2-dimensionally
Wrap    facet_wrap(~ x)        Facet across x values
Wrap    facet_wrap(~ x + y)    Facet across x and y values
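The facet formulas in the table can be tried on any data frame with categorical variables. The sketch below uses the built-in mtcars data as a stand-in, with cyl (number of cylinders) and am (transmission type) as the faceting variables.

```r
library(ggplot2)

# Shared base plot: weight against fuel efficiency
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One panel per number of cylinders, wrapped into rows as needed
p_wrap <- base_plot + facet_wrap(~ cyl)

# Rows by transmission type, columns by number of cylinders
p_grid <- base_plot + facet_grid(am ~ cyl)

print(p_wrap)
print(p_grid)
```

facet_wrap() suits a single variable with many levels, while facet_grid() lays out the combinations of two variables as a full matrix of panels.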

Customize Our Plot

Here are some key aspects you can customize: axes, titles, and
legends.
Title and axes components: changing size, colour, and face.
Customizing axis labels with labs(): used to modify plot labels,
including the x-axis, y-axis, and plot title.

Histogram
diabetes %>%
drop_na() %>%
ggplot(aes(weight))+
geom_histogram(binwidth = 10, fill = "steelblue",
color = "white")+
theme_minimal()

To split by frame:
diabetes %>%
  drop_na() %>%
  ggplot(aes(weight))+
  geom_histogram(binwidth = 10, fill = "steelblue",
                 color = "white")+
  facet_wrap(~frame)+
  theme_minimal()
Bar chart
Simple bar chart
diabetes %>%
  ggplot(aes(x = frame)) + geom_bar()

Component bar chart

diabetes %>%
drop_na() %>%
ggplot(aes(x=frame,colour = gender))+
geom_bar()

diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar()

Cont. . . ..
Positions

geom_bar(position = "<POSITION>")
When we have aesthetics mapped, how are they positioned?
bar: dodge, fill, stack (default)
diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "dodge")

diabetes %>%
drop_na() %>% # remove missing values
ggplot(aes(x=frame,fill = gender))+
geom_bar(position = "stack")

Box Plot

diabetes %>%
  drop_na() %>%
  ggplot(aes(frame, weight, fill=gender))+
  geom_boxplot()+
  # frame is on the x-axis and weight is on the y-axis
  labs(x="Foot Risk Awareness and Management Education",
       y= "weight of patients")

Scatter plot

library(ggpmisc)
diabetes %>%
drop_na() %>%
ggplot(aes(x=height,y=weight, linetype =gender ))+
geom_point()+
geom_smooth(method = "lm", se = TRUE) +
stat_poly_eq(
aes(label = paste(after_stat(eq.label), after_stat(rr.label),
sep = "~~~")),
formula = y ~ x,
parse = TRUE
)

Cont. . . ..

diabetes %>%
  drop_na() %>% # remove missing values
  ggplot(aes(height,weight))+
  geom_point()+
  facet_wrap(~ frame)

diabetes %>%
  drop_na() %>% # remove missing values
  ggplot(aes(height,weight))+
  geom_point()+
  facet_grid(frame ~ .)

Labels, titles, and legends

Add Labels
xlab() , ylab() , labs(x = "X-axis name", y = "y-axis name")

Add titles

Add legends
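Labels, titles, and legend text can all be set in one labs() call, and the legend position through theme(). The following is a minimal sketch using the built-in mtcars data as a stand-in; the label texts are made up for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel efficiency by weight",   # plot title
       x = "Weight (1000 lbs)",               # x-axis label
       y = "Miles per gallon",                # y-axis label
       color = "Cylinders") +                 # legend title for the color aesthetic
  theme(legend.position = "bottom")           # move the legend below the plot

print(p)
```

The legend title is set through the name of the aesthetic that created it, here color; a legend produced by fill would be titled with fill = "..." instead.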

Themes
Themes control the non-data parts of the plot,
such as fonts, backgrounds, grid lines, legends,
margins, and titles.
The predefined themes include the following:

Theme              Description
theme_gray()       Default ggplot2 theme
theme_bw()         Black and white theme, good for print
theme_minimal()    Very clean and minimal background
theme_classic()    Classic look (no grid lines)
theme_light()      Light background with subtle grid lines
theme_dark()       Dark version of theme_light()
theme_void()       Removes everything (useful for pie charts or maps)

Cont. . .

mtcars %>%
ggplot(aes(hp, mpg, col = factor(cyl))) +
geom_point(size = 3)+
theme_dark()

Activity

1 Using the storms data from the nasaweather package, create a scatter
plot between wind and pressure, with color being used to
distinguish the type of storm.
2 Using the penguins data set from the palmerpenguins package:
a. Create a scatterplot of bill_length_mm against bill_depth_mm
where individual species are colored and a regression line is added
to each species. What do you observe about the association of
bill depth and bill length?
b. Repeat the same scatterplot but now separate your plot into
facets by species. Add regression lines to all of your facets.
How would you summarize the association
between bill depth and bill length?

Statistical Analysis Using R

What is statistical modeling?

Statistical modeling uses mathematical models and statistical
assumptions to describe how data are generated, so that the data
can be used to understand real-life situations.
Statistical models are used to represent sample data and make
predictions about the real world.
A statistical model is a collection of probability distributions on a
set of all possible outcomes of an experiment.

Types of Statistical Models

Statistical models can be placed into groups based on their parameters.
The following are explanations of the statistical models.
Parametric models assume a probability distribution described by a
fixed, finite set of parameters.
Nonparametric models do not fix the number or nature of the
parameters in advance; these are flexible and determined from the data.
Semiparametric models are a blend of the parametric and
nonparametric models, with both fixed and flexible components.

T test
There are three types of t-test:
The one-sample t-test compares the mean of a sample to a
known value, usually the population mean.
The independent t-test is used to compare the mean of one sample
with the mean of another sample to see if there is a statistically
significant difference between the two.
The paired t-test is used to determine whether there is a difference
between the average values of paired samples subjected to two
different conditions.
The default t-test command is:
t.test(x, y = NULL,
       alternative = c("two.sided","less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95)
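The three variants differ only in the arguments passed to t.test(). The sketch below uses R's built-in sleep data (extra hours of sleep for 10 patients under two drugs) as a stand-in, since it is available in every R installation; treating the two groups as independent in the last call is done purely for comparison.

```r
# Extra sleep under drug 1 and drug 2, measured on the same 10 patients
g1 <- sleep$extra[sleep$group == 1]
g2 <- sleep$extra[sleep$group == 2]

# One-sample t-test: is the mean extra sleep different from 0?
one_sample <- t.test(sleep$extra, mu = 0)

# Paired t-test: same patients measured under both drugs
paired <- t.test(g1, g2, paired = TRUE)

# Independent (Welch) t-test: treats the two groups as unrelated samples
independent <- t.test(g1, g2)

paired$statistic   # t statistic
paired$p.value     # p-value
paired$conf.int    # confidence interval for the mean difference
```

The result of t.test() is a list, so individual pieces such as the p-value can be extracted with $ for reporting.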
Independent t-test

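Using the diabetes data

library(dplyr)
# Weights of female and male participants, dropping missing values
FW <- diabetes %>% filter(!is.na(weight), gender == "female") %>% pull(weight)
MW <- diabetes %>% filter(!is.na(weight), gender == "male") %>% pull(weight)
# Welch two-sample t-test (var.equal = FALSE is the default)
t.test(MW, FW)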

ANOVA

ANOVA provides a statistical test of whether the means of
several groups are equal; it generalizes the t-test to more
than two groups.
One-way ANOVA compares the means of two or more groups defined
by a single factor.
Example: Agricultural researchers designed an experiment to look
at crop yield in a number of plots on a research farm. Crop yield
was recorded as a function of irrigation (2 levels: irrigated or
not), sowing density (3 levels: low, medium, high), and fertilizer
application (3 levels: N, P, NP).
The data can be retrieved from:
http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/splityield.txt

Cont. . . .
Import the data from the web
crop <- read.delim("http://www.bio.ic.ac.uk/research/mjcraw/statcomp/data/splityield.txt")
str(crop)   # inspect the structure of the data
head(crop)  # show the first 6 observations
tail(crop)  # show the last 6 observations
# Recode the design variables as factors. The level names must match the
# spelling and case used in the file (check with unique()); otherwise the
# values become NA.
crop$density <- factor(crop$density,
levels = c('low', 'medium', 'high'))
crop$irrigation <- factor(crop$irrigation,
levels = c('Control', 'Irrigated'))
crop$fertilizer <- factor(crop$fertilizer, levels = c('N', 'P', 'NP'))
crop$block <- factor(crop$block, levels = c("A", "B", "C", "D"))
levels(crop$density)
levels(crop$irrigation)
attach(crop)  # attach after recoding so the attached columns are the factors
Cont. . . .

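Boxplots

par(mfrow = c(2, 2))  # 2 x 2 grid of plots
# The xlab/main text of the first two calls is cut off in the source slide;
# the axis labels used for them here are assumed.
boxplot(yield ~ block, data = crop, col = 2:5, ylab = "Crop yield",
xlab = "Block")
boxplot(yield ~ irrigation, data = crop, col = 2:3, ylab = "Crop yield",
xlab = "Irrigation")
boxplot(yield ~ density, data = crop, col = 2:5, ylab = "Crop yield",
xlab = "Sowing density", main = "Boxplot of crop production yield by density")
boxplot(yield ~ fertilizer, data = crop, col = 2:5, ylab = "Crop yield",
xlab = "Fertilizer application",
main = "Boxplot of crop production yield by fertilizer")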

Cont. . . ..

Summarize by group

library(dplyr)
crop %>% group_by(density) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(fertilizer) %>%
summarise(average_yield=mean(yield))
crop %>% group_by(irrigation) %>%
summarise(average_yield=mean(yield))

One-way ANOVA

model1<-aov(yield~fertilizer,data=crop)
summary(model1)

Post hoc test for one-way ANOVA

TukeyHSD(model1)
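If the crop data URL is not reachable, the same one-way ANOVA steps can be tried on a data set that ships with R. The sketch below uses PlantGrowth (plant weight under a control and two treatments); the data set and the names pg_model and pg_tukey are assumptions made for this example, not part of the crop analysis.

```r
# Built-in data: dried plant weight under a control and two treatments
data(PlantGrowth)

# Group means, then the one-way ANOVA of weight on group
aggregate(weight ~ group, data = PlantGrowth, FUN = mean)
pg_model <- aov(weight ~ group, data = PlantGrowth)
summary(pg_model)

# Tukey's honest significant differences for all pairwise comparisons
pg_tukey <- TukeyHSD(pg_model)
pg_tukey
```

The ANOVA table tests whether any group mean differs; the Tukey output then shows which pairs of groups differ, with adjusted p-values.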

Two-way ANOVA in R

two <- aov(yield ~ density + fertilizer, data = crop)   # main effects only
summary(two)
two1 <- aov(yield ~ density * fertilizer, data = crop)  # with interaction
summary(two1)
two2 <- aov(yield ~ density + irrigation, data = crop)  # main effects only
summary(two2)
two3 <- aov(yield ~ density * irrigation, data = crop)  # with interaction
summary(two3)

Post hoc test for two-way ANOVA

TukeyHSD(two1)
TukeyHSD(two3)
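A self-contained sketch of the same two-way workflow, using the built-in ToothGrowth data (tooth length by supplement type and dose). The data set and the names tg, tg_add and tg_int are assumptions for illustration only.

```r
data(ToothGrowth)
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # treat the three dose levels as categories

tg_add <- aov(len ~ supp + dose, data = tg)  # main effects only
tg_int <- aov(len ~ supp * dose, data = tg)  # main effects plus interaction
summary(tg_int)

# F-test for whether the interaction term improves the fit
anova(tg_add, tg_int)

# Pairwise comparisons for the dose main effect
TukeyHSD(tg_int, which = "dose")
```

Comparing the additive and interaction models with anova() is one way to decide whether the interaction term is worth keeping.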

Linear Regression Analysis
Regression is a type of analysis used to predict an outcome
from other information.
Linear regression is used to predict the value of an outcome
variable y based on one or more input predictor variables.
Simple linear regression: regression with only one predictor
variable (IV).
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
Multiple linear regression: regression with more than one
predictor variable (IVs).

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$

Multiple linear regression is useful for modelling the relationship
between a numeric outcome (dependent variable) and multiple
explanatory (independent) variables.
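To connect the simple regression equation to R, the sketch below computes the least-squares intercept and slope by hand and checks them against lm(). The built-in cars data and the names x, y, b0, b1 and fit are assumptions chosen for this example.

```r
data(cars)
x <- cars$speed
y <- cars$dist

# Least-squares estimates for the simple model y_i = b0 + b1 * x_i + e_i
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(intercept = b0, slope = b1)

# lm() gives the same estimates
fit <- lm(dist ~ speed, data = cars)
coef(fit)
```

The slope is the sample covariance of x and y divided by the sample variance of x, and the intercept makes the fitted line pass through the point of means.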
Regression analysis R commands

mymodel <- lm(y ~ x)  # fit a simple regression model

summary(mymodel)  # show the fitted regression model results
anova(mymodel)  # display the regression model ANOVA table
coef(mymodel)  # display the estimated regression coefficients
fits <- fitted(mymodel)  # store the fitted values
resids <- residuals(mymodel)  # store the residuals
beta1hat <- coef(mymodel)[2]  # assign the slope coefficient to
# the name "beta1hat"
confint(mymodel)  # CIs for all parameters

Cont. . . .
Diagnostic Checking

par(mfrow = c(2, 2))
plot(mymodel)             # the four standard diagnostic plots
plot(mymodel, which = 4)  # Cook's distance plot
library(car)
vif(mymodel)  # test for multicollinearity (needs at least two predictors)

Diagnostic checking of influential observations and outliers

outlierTest(mymodel)
influence(mymodel)
dffits(mymodel)
dfbetas(mymodel)
hatvalues(mymodel)
cooks.distance(mymodel)
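The influence measures above return one value per observation, so they are usually compared against a cut-off. The sketch below flags observations using two common rules of thumb; the thresholds (4/n for Cook's distance, 2p/n for leverage), the cars data and the object names are assumptions for illustration, not fixed rules.

```r
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
p <- length(coef(fit))  # number of estimated coefficients

cooks <- cooks.distance(fit)
lev   <- hatvalues(fit)

# Common rules of thumb (conventions, not strict tests):
#   Cook's distance > 4 / n, leverage > 2 * p / n
flagged <- which(cooks > 4 / n | lev > 2 * p / n)
data.frame(obs = flagged, cooks = cooks[flagged], leverage = lev[flagged])
```

Flagged observations are candidates for a closer look, not automatic deletions.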
Example

Using the Boston housing data set (~1978) from the MASS package:

Variables (the predictors are a mix of continuous variables and factors):
medv: median property value in $1000s (Y/outcome variable)
rm: average number of rooms per dwelling
crim: per capita crime rate by town
rad: index of accessibility to radial highways
install.packages("MASS")  # only needed if MASS is not already installed
library(MASS)
data("Boston")
head(Boston)
attach(Boston)
str(Boston)

Cont. . . .

Descriptive statistics of the Boston data

The stargazer package for R creates publication-quality
tables, so researchers do not have to rebuild their
tables by hand each time they tweak their dataset.
It can produce both summary statistics tables and regression tables.
install.packages("stargazer")
library(stargazer)
stargazer(Boston, type = "text",
title="Descriptive statistics",
digits=2, out="table1.txt")

Cont. . .

For simple linear regression

mymodel1 <- lm(medv ~ rm, data = Boston)
summary(mymodel1)  # vif() needs at least two predictors, so it is not used here

For multiple linear regression

mymodel2 <- lm(medv ~ rm + crim + factor(rad) + age, data = Boston)

Cont. . .
Used to check multicollinearity
vif(mymodel2)

Used to display the regression model ANOVA table

anova(mymodel2)

Used to display the fitted regression model results

summary(mymodel2)

Used to construct 95% CIs for the parameters

confint(mymodel2, level = 0.95)

Logistic Regression

Logistic regression is a GLM used to model a binary categorical
variable using numerical and categorical predictors.
model <- glm(response ~ predictor1 + predictor2,
data = data_name, family = binomial)

response: your binary outcome variable (coded as 0 and 1).
predictor1, predictor2, etc.: independent variables.
data_name: your dataset.
family = binomial: specifies logistic regression (logit link).
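A minimal runnable sketch of this template, assuming the built-in mtcars data with am (0 = automatic, 1 = manual) as the binary response and wt (weight in 1000 lb) as the predictor. The names logit_fit, odds_ratios and p_hat are chosen for this example only.

```r
data(mtcars)

# am is coded 0 = automatic, 1 = manual; wt is weight in 1000 lb
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit_fit)

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(logit_fit))
odds_ratios

# Predicted probability of a manual transmission for a 3000 lb car
p_hat <- predict(logit_fit, newdata = data.frame(wt = 3), type = "response")
p_hat
```

An odds ratio below 1 for wt means that heavier cars have lower odds of having a manual transmission in this data set.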

Cont. . . ..

library(tidyverse)
library(gtsummary)  # provides the trial data set
library(survival)
library(sjPlot)     # provides tab_model()
# Multivariable regression
model <- glm(response ~ trt + age + grade,
data = trial, family = binomial)
tab_model(model, file = "result.doc")  # save output

Customize output
How can the regression output be customized to a publication format?
The following packages and commands are used to customize the
regression output:
install.packages("sjPlot")
install.packages("sjmisc")
install.packages("sjlabelled")
library(sjPlot)
library(sjmisc)
library(sjlabelled)
tab_model(mymodel2)
tab_model(mymodel2,string.ci = "95% CI",
string.p = "p-value",
string.pred = "Characteristics",
show.aic = TRUE,show.aicc = TRUE)
Save the output

tab_model(mymodel2, file = "YOURTABLENAME.doc")
