Module - 4

Data Analysis

Prepared by : Dr. Srinivasa Rao Pokuri, Faculty SCOPE, VIT AP


Importing Data in R Script
• We can read external datasets and operate with them in our R
environment by importing data into an R script.
• R offers a number of functions for importing data from various file
formats.
• First, consider a dataset to use for the demonstration. We will use two
copies of the same dataset, one in .csv format and the other in .txt
format.

[Figures: the sample dataset shown as data1 and as data2]

• Reading a Comma-Separated Values (CSV) File:
Method 1: Using the read.csv() Function
• Syntax: read.csv(file.choose(), header)
• In this call, read.csv() takes two arguments:
• file.choose(): opens a file-selection dialog so that the CSV file can be
picked interactively; a file path given as a string works as well.
• header: indicates whether the first row of the file holds the variable
names. Use TRUE if it does and FALSE otherwise.

# import and store the dataset in data1
data1 <- read.csv(file.choose(), header = TRUE)
# display the data
data1
• Reading a Comma-Separated Values (CSV) File:
Method 2: Using the read.table() Function
• Syntax: read.table(file.choose(), header, sep = ",")
• The sep argument specifies the character that separates the fields; for a
CSV file we pass sep = ",".

# import and store the dataset in data2
data2 <- read.table(file.choose(), header = TRUE, sep = ",")
# display the data
data2
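
The two CSV methods can be compared without the interactive dialog. The sketch below is a minimal illustration, assuming a small made-up dataset written to a temporary file; the column names and values are hypothetical and are not the dataset used in the slides.

```r
# Hypothetical sample data, written to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Amiya,22", "Raj,25"), csv_path)

# Method 1: read.csv() uses a comma separator by default
data1 <- read.csv(csv_path, header = TRUE)

# Method 2: read.table() needs the separator spelled out
data2 <- read.table(csv_path, header = TRUE, sep = ",")

# Display one result and check that both methods agree
print(data1)
print(all.equal(data1, data2))
```

Passing a path instead of file.choose() makes the script repeatable, which is usually preferable once the file location is known.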

• Reading a Tab-Delimited (.txt) File:
Method 1: Using the read.delim() Function
• Syntax: read.delim(file.choose(), header)
• In this call, read.delim() takes two arguments:
• file.choose(): opens a file-selection dialog so that the text file can be
picked interactively.
• header: indicates whether the first row of the file holds the variable
names. Use TRUE if it does and FALSE otherwise.

# import and store the dataset in data3
data3 <- read.delim(file.choose(), header = TRUE)
# display the data
data3
• Reading a Tab-Delimited (.txt) File:
Method 2: Using the read.table() Function
• Syntax: read.table(file.choose(), header, sep = "\t")
• The sep argument specifies how the fields are separated; for a tab-delimited
file we pass sep = "\t".

# import and store the dataset in data4
data4 <- read.table(file.choose(), header = TRUE, sep = "\t")
# display the data
data4
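
The tab-delimited case can be sketched the same way. This is a minimal example under the assumption of a hypothetical two-column file; one value deliberately contains a space to show why the tab separator matters.

```r
# Hypothetical tab-delimited data, written to a temporary .txt file
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("Name\tAge", "Amiya\t22", "Raj Kumar\t25"), tsv_path)

# Method 1: read.delim() uses a tab separator by default
data3 <- read.delim(tsv_path, header = TRUE)

# Method 2: read.table() with an explicit tab separator.
# Its default separator is any whitespace, which would split "Raj Kumar".
data4 <- read.table(tsv_path, header = TRUE, sep = "\t")

print(data3)
print(data4)
```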

• Reading an Excel File in R:
Method 1: Using read_excel() from readxl
• read_excel() imports an Excel file; it is available once the readxl
package has been loaded with library(readxl).
• Syntax: read_excel(path)
• read_excel() extracts the data from the Excel file and returns it as a
tibble, which is a kind of R data frame.

library(readxl)
Data_gfg <- read_excel("Data_gfg.xlsx")
Data_gfg

• Reading an Excel File in R:
Method 2: Using read.xlsx() from xlsx
• read.xlsx() comes from the xlsx package and is also used to read an Excel
file into R.
• Syntax: read.xlsx(file, sheetIndex)
• read.xlsx() extracts the data from the chosen worksheet and returns it as
an R data frame. The worksheet must be identified, either by position
(sheetIndex) or by name (sheetName).

library(xlsx)
# read the first worksheet of the workbook
Data_gfg <- read.xlsx("Data_gfg.xlsx", sheetIndex = 1)
Data_gfg

Importing Data Using RStudio
• Data can also be imported through the RStudio interface with the following
steps.
• Steps:
1. In the Environment tab, click the Import Dataset menu.

2. Select the file type from the options.
3. A dialog box appears; either enter the file name or browse for the file.
4. The selected file is shown in a new window along with its dimensions.
5. To see the data in the console, type the name of the imported dataset.
Exporting Data from R Scripts
• When a program terminates, all data held in memory is lost.
• Storing the data in a file preserves it even after the program terminates.
• Entering a large amount of data by hand takes a lot of time. If the data is
already in a file, its contents can be accessed with a few commands in R.
• Exporting data to a text file:
• A text file is one of the most common formats for storing data. R provides
several functions for exporting data to a text file.
• write.table():
• The base R function write.table() can be used to export a data frame or a matrix to
a text file.
• In RStudio, the exported file appears in the Files pane under the name given in the
code, and selecting the file displays the exported data.
• Syntax: write.table(x, file, sep = " ", dec = ".", row.names = TRUE, col.names = TRUE)

Parameters:
• x: a matrix or a data frame to be written.
• file: a character string naming the result file.
• sep: the field separator string, e.g., sep = "\t" (for tab-separated values).
• dec: the string to be used as the decimal separator. The default is ".".
• row.names: either a logical value indicating whether the row names of x are to be
written along with x, or a character vector of row names to be written.
• col.names: either a logical value indicating whether the column names of x are to
be written along with x, or a character vector of column names to be written.

• Example:
# R program to illustrate exporting data from R
# Creating a data frame
df = data.frame("Name" = c("Amiya", "Raj", "Asish"),
                "Language" = c("R", "Python", "Java"),
                "Age" = c(22, 25, 45))

# Export the data frame to a tab-separated text file using write.table().
# With row.names = TRUE, col.names = NA writes an empty header cell
# above the row-name column so that the header lines up with the data.
write.table(df, file = "myDataFrame.txt",
            sep = "\t",
            row.names = TRUE,
            col.names = NA)
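
A quick way to check an export is to read the file back. The following is a minimal round-trip sketch, assuming a temporary file and row.names = FALSE (a choice made here for simplicity, not the setting used in the slide).

```r
# Small data frame to export
df <- data.frame(Name = c("Amiya", "Raj", "Asish"),
                 Age = c(22, 25, 45),
                 stringsAsFactors = FALSE)

# Write it as tab-separated text to a temporary file
out_path <- tempfile(fileext = ".txt")
write.table(df, file = out_path, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Look at the raw lines that were written
print(readLines(out_path))

# Read the file back with the matching separator
df_back <- read.table(out_path, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
print(df_back)
```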

• write_tsv():
• write_tsv(), from the readr package, also exports data to a tab-separated ("\t") file.
• Syntax: write_tsv(x, file)
Parameters:
• x: a data frame to be written
• file: the path to the result file (older readr releases named this argument path)
Example:
# R program to illustrate exporting data from R
# Importing readr library
library(readr)
# Creating a data frame
df = data.frame("Name" = c("Amiya", "Raj", "Asish"),
                "Language" = c("R", "Python", "Java"),
                "Age" = c(22, 25, 45))
# Export a data frame using write_tsv()
write_tsv(df, file = "MyDataFrame.txt")
• write.csv():
• write.csv() is the recommended function for exporting data to a CSV file. It
uses "." for the decimal point and a comma (",") for the separator.
• Syntax: write.csv(x, file)
Parameters:
• x: a data frame to be written
• file: the path to the result file
Example:
# R program to illustrate exporting data from R
# write.csv() is part of base R, so no extra package is needed
# Creating a data frame
df = data.frame("Name" = c("Amiya", "Raj", "Asish"),
                "Language" = c("R", "Python", "Java"),
                "Age" = c(22, 25, 45))
# Export a data frame using write.csv()
write.csv(df, file = "My_Data.csv")
• write.csv2():
• This function is very similar to write.csv(), but it uses a comma (",") for the
decimal point and a semicolon (";") for the separator.
• Syntax: write.csv2(x, file)
Parameters:
• x: a data frame to be written
• file: the path to the result file
Example:
# R program to illustrate exporting data from R
# write.csv2() is part of base R, so no extra package is needed
# Creating a data frame
df = data.frame("Name" = c("Amiya", "Raj", "Asish"),
                "Language" = c("R", "Python", "Java"),
                "Age" = c(22, 25, 45))
# Export a data frame using write.csv2()
write.csv2(df, file = "My_Data.csv")
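
The difference between the two CSV writers is easiest to see in the raw file contents. The sketch below assumes a hypothetical Score column with a fractional value so that the decimal mark is visible; the file names are temporary.

```r
# Hypothetical data with one fractional value
df <- data.frame(Name = c("Amiya", "Raj"),
                 Score = c(22.5, 25),
                 stringsAsFactors = FALSE)

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

# Same data, two conventions
write.csv(df, file = f1, row.names = FALSE)   # "," separator, "." decimal
write.csv2(df, file = f2, row.names = FALSE)  # ";" separator, "," decimal

lines1 <- readLines(f1)
lines2 <- readLines(f2)
print(lines1)
print(lines2)

# read.csv2() is the matching reader for the second convention
df2 <- read.csv2(f2, stringsAsFactors = FALSE)
print(df2)
```

A file written with write.csv2() should be read with read.csv2(); reading it with read.csv() would treat each line as a single field.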
Data cleaning and summarizing with dplyr
package
• dplyr is a powerful R package for transforming and summarizing tabular data
with rows and columns.
• The package contains a set of functions (or "verbs") that perform common
data manipulation operations such as filtering rows, selecting specific
columns, re-ordering rows, adding new columns and summarizing data.
• In addition, dplyr provides group_by(), which supports another common
task, the "split-apply-combine" approach.
• Install and load dplyr:
# To install dplyr
install.packages("dplyr")
# To load dplyr
library(dplyr)
• The following are some of the most common dplyr functions:
• rename() : rename columns
• recode() : recode values in a column
• select() : subset columns
• filter() : subset rows on conditions
• mutate() : create new columns by using information from other columns
• summarise() : create summary statistics, usually on grouped data
• arrange() : sort results
• count() : count discrete values
• group_by() : allows for group operations in the "split-apply-combine" concept
• %>% : the "pipe" operator, used to connect multiple verb actions together into a pipeline
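
As a rough illustration of how the pipe chains several verbs, the sketch below uses a hypothetical training log; the Beats column (pulse multiplied by duration) is an invented quantity used only to have something to compute.

```r
library(dplyr)

# Hypothetical training log used only for illustration
Data <- data.frame(
  Training = c("Strength", "Stamina", "Other", "Strength"),
  Pulse = c(100, 150, 120, 110),
  Duration = c(60, 30, 45, 50)
)

result <- Data %>%
  filter(Duration >= 45) %>%            # keep sessions of 45 minutes or more
  mutate(Beats = Pulse * Duration) %>%  # new column from existing columns
  select(Training, Beats) %>%           # keep only two columns
  arrange(desc(Beats))                  # largest value first

print(result)
```

Each verb receives the data frame produced by the previous step as its first argument, so the pipeline reads top to bottom in the order the operations are applied.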

• rename(): It is often necessary to rename variables to make them more
meaningful. The form is rename(data, new_name = old_name).
• Example :
#library(dplyr)
Data <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)
Data
# rename Pulse to PULSE1 and Duration to DURATION1
dplyr::rename(Data, PULSE1 = Pulse, DURATION1 = Duration)
• select(): The select() function is used to pick specific variables (columns) of a
data frame or tibble.
• Columns can be chosen by name or with helper functions such as contains(),
matches(), starts_with() and ends_with().
• Syntax: select(data, col1, col2, …)
• This function returns an object of the same type as data.
• Example:
dplyr::select(Data, Training)
dplyr::select(Data, Pulse, Training)

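• select(): Select a list of columns either by name or by index number
Data <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45))
# rename() returns a new data frame; Data itself keeps its original column names
dplyr::rename(Data, PULSE1 = Pulse, DURATION1 = Duration)
# select by name
dplyr::select(Data, Training)
dplyr::select(Data, Pulse, Training)
# select by position: columns 1 and 3, then columns 1 to 3
dplyr::select(Data, 1, 3)
dplyr::select(Data, 1:3)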
• select(): Some additional options to select columns based on specific
criteria include
1. starts_with() = Select columns that start with a character string
2. ends_with() = Select columns that end with a character string
3. contains() = Select columns that contain a character string
4. matches() = Select columns that match a regular expression
5. one_of() = Select columns whose names are from a group of names
• Example (selects the Duration column of Data):
select(Data, starts_with("Du"))
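
The helpers can be tried on the small Data frame from the earlier slides. This is a sketch only; any_of() is used here in place of one_of() on the assumption that a recent dplyr (1.0.0 or later) is installed, and the "Weight" name is a hypothetical column that does not exist in the data.

```r
library(dplyr)

Data <- data.frame(
  Training = c("Strength", "Stamina", "Other"),
  Pulse = c(100, 150, 120),
  Duration = c(60, 30, 45)
)

# Column names that start with "Du"
starts <- names(select(Data, starts_with("Du")))

# Column names that end with "e"
ends <- names(select(Data, ends_with("e")))

# Column names that contain "ra" anywhere
has_ra <- names(select(Data, contains("ra")))

# Column names matching a regular expression: first letter P or D
regex <- names(select(Data, matches("^[PD]")))

# Names taken from a list; names that are absent are silently skipped
listed <- names(select(Data, any_of(c("Pulse", "Weight"))))

print(list(starts, ends, has_ra, regex, listed))
```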

filter():
• The filter() function is used to produce a subset of the data frame, retaining all
rows that satisfy the specified conditions.
• filter() can be applied to both grouped and ungrouped data. The conditions can
use comparison operators (==, !=, >, >=, <, <=), logical operators (&, |, !, xor()),
helper functions (between(), near()) and NA checks with is.na().
• Syntax: filter(df, condition)
• Parameters:
• df: the data frame object
• condition: the logical condition that a row must satisfy to be kept
filter() Example:
#library(dplyr)
df = data.frame(x = c(12, 31, 4, 66, 78),
                y = c(22.1, 44.5, 6.1, 43.1, 99),
                z = c(TRUE, TRUE, FALSE, TRUE, TRUE))
df
# keep rows where x is below 50 and z is TRUE (the rows with x = 12 and x = 31)
dplyr::filter(df, x < 50 & z == TRUE)
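
The other kinds of condition mentioned above can be sketched on a variant of the same data frame. One y value is replaced with NA here (an assumption made for the example) so that the missing-value behaviour is visible.

```r
library(dplyr)

df <- data.frame(x = c(12, 31, 4, 66, 78),
                 y = c(22.1, NA, 6.1, 43.1, 99),
                 z = c(TRUE, TRUE, FALSE, TRUE, TRUE))

# between() keeps values inside a closed range
in_range <- filter(df, between(x, 10, 70))

# | keeps rows that meet either condition; rows where the
# combined condition evaluates to NA are dropped
either <- filter(df, x < 10 | y > 90)

# is.na() with ! keeps only rows where y is present
has_y <- filter(df, !is.na(y))

print(in_range)
print(either)
print(has_y)
```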

• mutate(): The mutate() function adds new variables to a data frame, computed
by performing operations on existing variables.
• Syntax: mutate(x, expr)
• The dplyr package offers five main functions in the mutate family, described below.
• mutate() - adds new variables and keeps the existing ones.
• transmute() - adds new variables and drops every variable not named in the call.
• mutate_all() - applies a transformation to all variables in a data frame simultaneously.
• mutate_at() - applies a transformation to a chosen set of variables.
• mutate_if() - applies a transformation to the variables in a data frame that satisfy a
specific condition.
• In dplyr 1.0.0 and later, mutate_all(), mutate_at() and mutate_if() are superseded by
across() used inside mutate(); the older functions still work.
mutate() Example:
library(dplyr)
# Create a data frame
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16),
                ht = c(46, NA, NA, 69),
                school = c("yes", "yes", "no", "no"))
d
# Add a variable x3, the sum of ht and age; x3 is NA wherever ht is NA
dplyr::mutate(d, x3 = ht + age)

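transmute() Example:
# Uses the data frame d created in the mutate() example.
# transmute() creates the new variable age_in_months and keeps only the
# variables named in the call, so the original age variable is dropped.
result <- dplyr::transmute(d,
                           name = name,
                           age_in_months = age * 12,
                           ht,
                           school)
print(result)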
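
For the conditional members of the mutate family, the older and newer spellings can be compared side by side. The sketch below assumes dplyr 1.0.0 or later (for across()) and a cut-down version of the data frame; doubling the numeric columns is an arbitrary transformation chosen only for illustration.

```r
library(dplyr)

d <- data.frame(name = c("Abhi", "Bhavesh"),
                age = c(7, 5),
                ht = c(46, NA),
                stringsAsFactors = FALSE)

# Older style: transform every column that satisfies a predicate
doubled_if <- mutate_if(d, is.numeric, ~ .x * 2)

# Newer style: the same transformation written with across()
doubled_across <- mutate(d, across(where(is.numeric), ~ .x * 2))

print(doubled_if)
print(doubled_across)
```

Both calls leave the character column untouched and keep NA values as NA.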

summarise_all():
• The summarise_all() function applies a summary function to every column of the data
frame. The output data frame has one column for each input column, holding the result
of the specified function for that column.
• Syntax: summarise_all(data, function)
• Arguments :
• data – The data frame whose columns are to be summarised
• function – The function to apply to all the data frame columns.

summarise_all() Example:
# creating a data frame
df <- data.frame(col1 = c(1:10), col2 = c(11:20))
print("original dataframe")
print(df)
print("summarised dataframe")
# mean of each column: 5.5 for col1 and 15.5 for col2
dplyr::summarise_all(df, mean)

arrange():
• The arrange() function reorders the rows of a table according to the column names
or expressions passed to the function.
• Syntax: arrange(x, expr)
• Parameters:
• x: data set to be reordered
• expr: one or more column names (or expressions involving columns) to sort by
• Example:
#library(dplyr)
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 9, 16))
# Arranging the rows in ascending order of age
d2 <- dplyr::arrange(d, age)
print(d2)
• arrange(): By default, rows are sorted in ascending order of the given columns.
• To arrange in descending order:
• arrange(d, desc(age))
• To arrange by one column and then by a second (here assuming d also has a rollno column):
• arrange(d, age, rollno)
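
Both variants can be tried on a small data frame. In this sketch the rollno values are hypothetical, and one age is repeated on purpose so that the second sort column has an effect.

```r
library(dplyr)

# Hypothetical data: two children share the same age
d <- data.frame(name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
                age = c(7, 5, 7, 16),
                rollno = c(12, 3, 4, 9),
                stringsAsFactors = FALSE)

# Oldest first
by_age_desc <- arrange(d, desc(age))

# Youngest first; ties in age are broken by rollno
by_age_roll <- arrange(d, age, rollno)

print(by_age_desc)
print(by_age_roll)
```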

group_by():
• The group_by() function belongs to the dplyr package; it groups the rows of a
data frame by the values of one or more columns.
• group_by() on its own does not change the data; it only records the grouping. It is
normally followed by summarise() with an appropriate action to perform. It works like
GROUP BY in SQL and a pivot table in Excel.
• Example:
library(dplyr)
df = read.csv("Sample_Superstore.csv")
df_grp_region = df %>% group_by(Region) %>%
  summarise(total_sales = sum(Sales),
            total_profits = sum(Profit), .groups = 'drop')
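
Since the example above depends on an external CSV file, the same split-apply-combine pattern is sketched below on a few inline rows. The figures are made up and only the column names mirror the example.

```r
library(dplyr)

# Hypothetical sales records standing in for the CSV file
sales <- data.frame(
  Region = c("East", "West", "East", "South", "West"),
  Sales = c(100, 250, 50, 80, 150),
  Profit = c(20, 40, -5, 10, 30),
  stringsAsFactors = FALSE
)

by_region <- sales %>%
  group_by(Region) %>%                      # split the rows by region
  summarise(total_sales = sum(Sales),       # apply a summary to each group
            total_profits = sum(Profit),
            .groups = "drop")               # combine into an ungrouped result

print(by_region)
```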

Exploratory Data Analysis
• Exploratory Data Analysis (EDA) is a statistical approach for analysing data
sets in order to summarize their main characteristics, generally with the help of
visual aids.
• The EDA approach can be used to gather knowledge about the following aspects of data:
• Main characteristics or features of the data.
• The variables and their relationships.
• The important variables that can be used in the problem at hand.
• Exploratory Data Analysis in R:
• In R, EDA is performed under two broad classifications:
• Descriptive Statistics, which includes mean, median, mode, inter-quartile range, and so on.
• Graphical Methods, which includes Box plot, Histogram, Pie graph, Line chart, Barplot, Scatter
Plot and so on.
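
The descriptive statistics listed above can be computed with base R functions. The sketch below uses a small numeric vector (the same values as the histogram example later in the module); since base R has no function for the statistical mode, a frequency table is used for it as one possible approach.

```r
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

v_mean <- mean(v)      # arithmetic mean
v_median <- median(v)  # middle value
v_iqr <- IQR(v)        # third quartile minus first quartile

# Mode: the most frequent value, found by tabulating the data.
# (The base function mode() reports the storage type, not this statistic.)
counts <- table(v)
v_mode <- as.numeric(names(counts)[which.max(counts)])

print(c(mean = v_mean, median = v_median, IQR = v_iqr, mode = v_mode))

# summary() reports minimum, quartiles, median, mean and maximum at once
print(summary(v))
```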

• Diagrammatic representation of data:
• A diagrammatic representation is one of the best and most attractive ways of
presenting data.
• It is accessible to audiences with and without formal training.

Histograms
• A histogram shows the distribution of a numeric variable as adjacent rectangles: each
rectangle spans one numerical interval, and its area is proportional to the frequency
of the values falling in that interval.
• We can create histograms in R using the hist() function.
• Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
• v: the numeric values used in the histogram.
• main: the title of the chart.
• col: the fill color of the bars.
• xlab: the label for the horizontal axis.
• border: the border color of each bar.
• xlim: the range of values shown on the x-axis.
• ylim: the range of values shown on the y-axis.
• breaks: the break points between bars, or a single number suggesting how many bars to draw.

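• Example:
# Open a PNG file to receive the plot.
png(filename = "out.png")
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)
# Create the histogram.
hist(v, xlab = "No. of Articles", col = "green", border = "black")
# Close the file so that the image is written to disk.
dev.off()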

• Example:
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)
# Create the histogram with explicit axis ranges and a suggested number of bars.
hist(v, xlab = "No. of Articles", col = "green", border = "black",
     xlim = c(0, 50), ylim = c(0, 5), breaks = 5)
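
How hist() interprets the breaks argument can be inspected without drawing anything. This sketch relies on plot = FALSE, which returns the computed bins instead of plotting; the explicit break points 0, 10, 20, 30, 40 are a choice made here for illustration.

```r
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# A single number is only a suggestion: R picks "pretty" break points near it
h_suggested <- hist(v, breaks = 5, plot = FALSE)

# A vector gives the exact break points
h_exact <- hist(v, breaks = seq(0, 40, by = 10), plot = FALSE)

print(h_suggested$breaks)  # break points R actually chose
print(h_exact$breaks)      # break points exactly as supplied
print(h_exact$counts)      # number of values in each interval
```

With the exact break points, the counts for the intervals 0-10, 10-20, 20-30 and 30-40 are 1, 5, 3 and 2.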

Pie graph
• A pie chart is a circular statistical graphic that is divided into slices to illustrate
numerical proportions.
• Each slice (sector) shows the relative size of one data value.
• The circle is cut along radii into segments representing relative frequencies or
magnitudes; the chart is also known as a circle graph.
• Syntax: pie(x, labels, radius, main, col, clockwise)
• x: a vector containing the numeric values used in the pie chart.
• labels: the descriptions given to the slices of the pie chart.
• radius: the radius of the circle of the pie chart (value
between -1 and +1).
• main: the title of the pie chart.
• clockwise: a logical value indicating whether the slices are
drawn clockwise or anti-clockwise.
• col: the colors given to the slices of the pie.
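• Example:
# Open a PNG file to receive the plot.
png(filename = "out.png")
Temp <- c(23, 36, 50, 43)
Cities <- c("Bangalore", "Pune", "Chennai", "Amaravati")
# Plot the chart.
pie(Temp, Cities)
# Close the file so that the image is written to disk.
dev.off()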

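• Example:
# Open a PNG file to receive the plot.
png(filename = "out.png")
Temp <- c(23, 36, 50, 43)
Cities <- c("Bangalore", "Pune", "Chennai", "Amaravati")
# Plot the chart with a title and one color per slice.
pie(Temp, Cities, main = "City pie chart",
    col = rainbow(length(Temp)))
# Close the file so that the image is written to disk.
dev.off()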
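
Because a pie chart is about proportions, it often helps to print the percentage on each label. The following is a sketch of one way to do that; the label format is an assumption, and the plot is sent to a temporary PDF only so that the code runs without a display.

```r
Temp <- c(23, 36, 50, 43)
Cities <- c("Bangalore", "Pune", "Chennai", "Amaravati")

# Share of each value in the total, rounded to one decimal place
pct <- round(100 * Temp / sum(Temp), 1)

# Labels of the form "City (xx.x%)"
slice_labels <- paste0(Cities, " (", pct, "%)")
print(slice_labels)

# Draw to a temporary PDF; remove the pdf()/dev.off() lines to draw on screen
pdf(file = tempfile(fileext = ".pdf"))
pie(Temp, labels = slice_labels, main = "City pie chart",
    col = rainbow(length(Temp)), clockwise = TRUE)
invisible(dev.off())
```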

Barplot
• Bar charts are a popular and effective way to visually represent categorical data in a
structured manner.
• R uses the barplot() function to create bar charts. Both vertical and horizontal
bars can be drawn.
• Syntax: barplot(H, xlab, ylab, main, names.arg, col)
• H: This parameter is a vector or matrix containing numeric values which are used in bar chart.
• xlab: This parameter is the label for x axis in bar chart.
• ylab: This parameter is the label for y axis in bar chart.
• main: This parameter is the title of the bar chart.
• names.arg: This parameter is a vector of names appearing under each bar in bar chart.
• col: This parameter is used to give colors to the bars in the graph.

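• Example:
# Open a PNG file to receive the plot.
png(filename = "out.png")
# Create the data for the chart
A <- c(17, 32, 8, 53, 1)
# Plot the bar chart
barplot(A, xlab = "X-axis", ylab = "Y-axis", main = "Bar-Chart")
# Close the file so that the image is written to disk.
dev.off()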

• Example:
# Create the data for the chart
A <- c(17, 32, 8, 53, 1)
# Plot the bar chart
barplot(A, horiz = TRUE, xlab = "X-axis", ylab = "Y-axis", main ="Horizontal Bar Chart" )
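
The names.arg and col parameters from the syntax above are not used in the two examples, so a short sketch follows. The weekday labels and the color are hypothetical choices, and the plot goes to a temporary PDF only so that the code runs without a display.

```r
A <- c(17, 32, 8, 53, 1)
# Hypothetical category labels, one per bar
days <- c("Mon", "Tue", "Wed", "Thu", "Fri")

# Draw to a temporary PDF; remove the pdf()/dev.off() lines to draw on screen
pdf(file = tempfile(fileext = ".pdf"))
mids <- barplot(A, names.arg = days, col = "steelblue",
                xlab = "Day", ylab = "Count", main = "Labelled Bar Chart")
invisible(dev.off())

# barplot() returns the position of the middle of each bar,
# which is useful for adding text or points on top of the bars
print(mids)
```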

Boxplots
• A boxplot is a chart that displays the distribution of a variable, with one box
drawn for each group of data.
• The box summarizes the distribution with five values (minimum, first quartile, median, third
quartile, and maximum).
• Syntax: boxplot(x, data, notch, varwidth, names, main)
• x: a vector or a formula.
• data: the data frame in which the variables of a formula are found.
• notch: a logical value. Set to TRUE to draw a notch on each side of the box around the median.
• varwidth: a logical value. Set to TRUE to draw the width of each box in proportion
to the square root of the sample size of its group.
• main: the title of the chart.
• names: the group labels shown under each boxplot.
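
No boxplot example appears in this part of the module, so a minimal sketch is given here. It reuses the vector from the histogram example; the title and axis label are hypothetical, and the plot goes to a temporary PDF only so that the code runs without a display.

```r
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# The five values behind the box: minimum, lower hinge, median,
# upper hinge, maximum
five <- fivenum(v)
print(five)

# Draw to a temporary PDF; remove the pdf()/dev.off() lines to draw on screen
pdf(file = tempfile(fileext = ".pdf"))
b <- boxplot(v, notch = FALSE, main = "Articles", ylab = "No. of Articles")
invisible(dev.off())

# boxplot() also returns the statistics it drew:
# b$stats holds the whisker ends, box edges and median; b$out lists outliers
print(b$stats)
print(b$out)
```

For this vector no point lies more than 1.5 times the box height beyond the box, so the whiskers reach the minimum and maximum and no outliers are reported.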
