Big-Data Unit-4

DATA ANALYTICS WITH R, Introduction to R, Data Manipulation, Data Visualization, Data Analysis

Uploaded by

Tulshiram Kamble
DATA ANALYTICS WITH R / WEKA MACHINE LEARNING
UNIT-4
What is the R programming language?
• R is a programming language & analytics tool that was developed in
1993 by Robert Gentleman & Ross Ihaka.
• It is one of the most popular analytics tools used in Data Analytics &
Business Analytics.
• CRAN (Comprehensive R Archive Network) is a repository of more than
10,000 R packages that provide the functionality needed for working
with data.
Features of R

• Open Source: Freely available and supported by a large community.


• Comprehensive Statistical Analysis: Built-in support for a wide range of
statistical techniques.
• Extensive Package Ecosystem: Thousands of packages available for various
tasks.
• Data Manipulation and Wrangling: Powerful tools for data cleaning and
transformation.
• Data Visualization: Advanced plotting capabilities with packages like ggplot2.
• Vectorized Operations: Efficient handling of vector and matrix operations.
• Extensible: Users can create custom functions and packages.
• Integration with Other Languages: Compatible with Python, C++, SQL, and
more.
Applications of R programming
• Healthcare Data Analysis: R is used to analyse patient data, identify trends, predict outcomes,
and improve treatment plans in hospitals and research institutions.
• Financial Risk Management: Banks and financial institutions use R for risk analysis, credit scoring,
and to build predictive models for stock market trends and investment strategies.
• Retail and E-commerce: Companies like Amazon and Walmart use R to analyze customer
purchasing patterns, optimize pricing strategies, and manage inventory through predictive
analytics.
• Social Media Analytics: R is used to analyze social media platforms like Twitter and Facebook for
sentiment analysis, trend analysis, and to measure the impact of marketing campaigns.
• Geographical Data Analysis: R can handle spatial data and perform geographical data analysis
with packages like sp, rgeos, and rgdal (rgeos and rgdal have since been retired from CRAN; sf and
terra are the usual replacements).
Data Types in R
Data Type | Description | Example
Numeric | Real numbers stored in double precision; a whole number typed without the L suffix is also numeric. | 2, 3.14, -5.67
Integer | Whole numbers (without decimals), written with an L suffix. | 1L, 100L, -10L
Character | Text or strings of characters. | "Hello", "R Language"
Logical | Boolean values, which are either TRUE or FALSE. | TRUE, FALSE
Complex | Complex numbers with real and imaginary parts. | 1+2i, 3-4i
Factor | Categorical data, often used for fields that have a fixed number of unique values. | factor(c("Male", "Female"))
Date | Calendar dates. | as.Date("2024-08-28")
Raw | Raw bytes, mainly used for low-level operations and binary data manipulation. | charToRaw("A")
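The data types in the table can be checked from the console. The following is a minimal sketch, not part of the original slides: the list name values is an illustrative choice, and class() and typeof() are used to show how R labels each value and how it stores it internally.

```r
# One example value per data type from the table above.
values <- list(
  numeric   = 3.14,
  integer   = 100L,
  character = "Hello",
  logical   = TRUE,
  complex   = 1+2i,
  factor    = factor(c("Male", "Female")),
  date      = as.Date("2024-08-28"),
  raw       = charToRaw("A")
)

# class() is the high-level type; typeof() is the internal storage type.
for (name in names(values)) {
  cat(name, ": class =", class(values[[name]]),
      ", typeof =", typeof(values[[name]]), "\n")
}
```

Note that a factor is stored as integer codes and a Date as a double, which is why class() and typeof() differ for those two rows.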
Variables in R
• In R, variables are used to store data that can be referenced and
manipulated throughout your script or program.
• A valid variable name consists of letters, numbers and the dot or
underscore characters. It must start with a letter, or with a dot that is
not followed by a number.
Variable Name | Naming Convention | Example Usage
age | Simple | age <- 25
total_sales | Snake case | total_sales <- 1000
AverageHeight | Pascal case (upper camel case) | AverageHeight <- 175.5
num_of_items | Snake case | num_of_items <- 50
maxTemp | Camel case | maxTemp <- 37.2
.internal_value | Starts with period | .internal_value <- 42
customerName | Camel case | customerName <- "Alice"
Variable Assignment
• The variables can be assigned values using leftward( <- ), rightward( -> )
and equal to ( = ) operator.
• The values of the variable can be printed using print() or cat() function.
• The cat() function combines multiple items into a continuous print
output.
• To know all the variables currently available in the workspace we use
the ls() function.
• Variables can be deleted by using the rm() function.
Example
# Using leftward assignment
a <- 5
b <- "Hello"
c <- TRUE

# Using rightward assignment
7 -> d
"World" -> e
FALSE -> f

# Using equal sign for assignment
g = 3.14
h = "R Programming"
i = NULL

# Printing all variables
print(a) # Output: [1] 5
print(b) # Output: [1] "Hello"
print(c) # Output: [1] TRUE
print(d) # Output: [1] 7
print(e) # Output: [1] "World"
print(f) # Output: [1] FALSE
print(g) # Output: [1] 3.14
print(h) # Output: [1] "R Programming"
print(i) # Output: NULL
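The cat(), ls() and rm() functions mentioned above are not shown in the example, so here is a short sketch of them. The variable names demo_count and demo_label are made up for illustration.

```r
demo_count <- 10
demo_label <- "items"

# cat() joins its arguments into one line of output, separated by spaces.
cat("There are", demo_count, demo_label, "\n")

# ls() lists the variables currently defined in the workspace.
print(ls())

# rm() deletes a variable; afterwards it no longer exists.
rm(demo_count)
print(exists("demo_count", inherits = FALSE))  # FALSE
```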
Types of Operators in R

• Arithmetic Operators:
Operator | Description | Example | Result
+ | Addition | 5 + 3 | 8
- | Subtraction | 5 - 3 | 2
* | Multiplication | 5 * 3 | 15
/ | Division | 5 / 2 | 2.5
^ or ** | Exponentiation | 5 ^ 2 | 25
%% | Modulus (remainder) | 5 %% 2 | 1
%/% | Integer Division | 5 %/% 2 | 2
• Relational Operators:
Operator | Description | Example | Result
== | Equal to | 5 == 3 | FALSE
!= | Not equal to | 5 != 3 | TRUE
> | Greater than | 5 > 3 | TRUE
< | Less than | 5 < 3 | FALSE
>= | Greater than or equal to | 5 >= 3 | TRUE
<= | Less than or equal to | 5 <= 3 | FALSE
• Logical Operators:
In the examples below, non-zero numbers count as TRUE and zero counts as FALSE.
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)

1. & (element-wise logical AND)
Compares each element of the first vector with the corresponding element of the second vector
and returns TRUE where both elements are TRUE.
print(a & b)
Output:
[1]  TRUE FALSE  TRUE  TRUE

2. | (element-wise logical OR)
Compares each element of the first vector with the corresponding element of the second vector
and returns TRUE where at least one of them is TRUE.
print(a | b)
Output:
[1] TRUE TRUE TRUE TRUE

3. ! (logical NOT)
Takes each element of the vector and gives the opposite logical value.
print(!a)
Output:
[1] FALSE  TRUE FALSE FALSE

4. && (scalar logical AND)
Works on single values and gives TRUE only if both are TRUE. From R 4.3.0 onward, passing a
vector longer than one element is an error, so the first element is selected explicitly here.
print(a[1] && b[1])
Output:
[1] TRUE

5. || (scalar logical OR)
Works on single values and gives TRUE if at least one of them is TRUE. The same length
restriction applies as for &&.
print(a[1] || b[1])
Output:
[1] TRUE
• Assignment Operators:
1. <- or = or <<- (left assignment operators)
<- and = assign in the current environment; <<- assigns to a variable in an enclosing
(usually the global) environment.
a <- c(3, 0, TRUE, 2+2i)
b <<- c(2, 4, TRUE, 2+3i)
d = c(1, 2, TRUE, 2+3i)
print(a)
print(b)
print(d)
Output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
[1] 1+0i 2+0i 1+0i 2+3i

2. -> or ->> (right assignment operators)
c(3, 0, TRUE, 2+2i) -> a
c(2, 4, TRUE, 2+3i) ->> b
print(a)
print(b)
Output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
• Miscellaneous Operators:
1. : (colon operator)
Creates a series of numbers in sequence for a vector.
v <- 1:8
print(v)
Output:
[1] 1 2 3 4 5 6 7 8

2. %in%
Tests whether an element belongs to a vector.
a1 <- 8
a2 <- 12
d <- 1:10
print(a1 %in% d)
print(a2 %in% d)
Output:
[1] TRUE
[1] FALSE

3. %*%
Performs matrix multiplication; here a matrix is multiplied with its transpose.
M <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
result <- M %*% t(M)
print(result)
Output:
     [,1] [,2]
[1,]   14   32
[2,]   32   77
Decision Making in R
• Decision making means choosing which statements to execute based
on certain conditions. The programmer supplies a condition that the
program evaluates, together with statements to execute if the
condition is true and, optionally, other statements to execute if the
condition is false.
• The decision-making statements in R are as follows:
• if statement
• if-else statement
• nested if-else statement
• switch statement
R - If Statement
• An if statement consists of a Boolean expression followed by one or
more statements.
• Syntax
if(boolean_expression) {
# statement(s) will execute if the boolean expression is true.
}
• If the Boolean expression evaluates to true, then the block of code
inside the if statement is executed. If it evaluates to false, execution
continues with the first statement after the end of the if statement
(after the closing curly brace).
Example
x <- 30L
if(is.integer(x)) {
print("X is an Integer")
}
R - If...Else Statement
• An if statement can be followed by an optional else statement which
executes when the boolean expression is false.
• Syntax
if(boolean_expression) {
# statement(s) will execute if the boolean expression is true.
} else {
# statement(s) will execute if the boolean expression is false.
}
• If the Boolean expression evaluates to be true, then the if block of
code will be executed, otherwise else block of code will be executed.
Example

x <- c("what","is","truth")

if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
Nested if-else statement
• An if statement can be followed by an optional else if...else chain, which is useful for
testing several conditions in a single statement. The first condition that is true wins.

• Syntax
if(boolean_expression_1) {
# Executes when boolean expression 1 is true.
} else if(boolean_expression_2) {
# Executes when boolean expression 2 is true.
} else if(boolean_expression_3) {
# Executes when boolean expression 3 is true.
} else {
# Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")

if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
R - Switch Statement
• A switch statement tests an expression against a list of alternatives.
Each alternative is called a case. If the expression is a number n, the
n-th case is returned; if it is a character string, the case with the
matching name is returned.

• Syntax
switch(expression, case1, case2, case3, ...)
Example

x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)
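The example above selects a case by position. As a complementary sketch (the function day_type and its labels are hypothetical, chosen only for illustration), switch() can also match a character string against named cases, with an unnamed final argument acting as the default:

```r
# Classify a day abbreviation; "Sat" falls through to the value given for "Sun".
day_type <- function(day) {
  switch(day,
    "Sat" = ,
    "Sun" = "weekend",
    "Mon" = "start of the week",
    "weekday"   # unnamed last argument: used when no name matches
  )
}

print(day_type("Sat"))  # "weekend"
print(day_type("Wed"))  # "weekday"
```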
R - Loops
• You may need to execute a block of code several times. In general,
statements are executed sequentially: the first statement in a
function is executed first, followed by the second, and so on. Loops
let you repeat a block instead.
• Loop constructs in R programming:
• Repeat loop
• While loop
• For loop
• Next statement (loop control)
R - Repeat Loop
• The Repeat loop executes the same code again and again until a stop
condition is met.
• Syntax

repeat {
commands
if(condition) {
break
}
}
Example
v <- c("Hello","loop")
cnt <- 2

repeat {
print(v)
cnt <- cnt+1

if(cnt > 5) {
break
}
}
R - While Loop
• The While loop executes the same code again and again as long as
the test condition remains true.
• Syntax

while (test_expression) {
statement
}
Example
v <- c("Hello","while loop")
cnt <- 2

while (cnt < 7) {


print(v)
cnt = cnt + 1
}
R - For Loop
• A For loop is a repetition control structure that allows you to
efficiently write a loop that needs to execute a specific number of
times.

• Syntax

for (value in vector) {


statements
}
Example

v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
R - Next Statement
• The next statement in R programming language is useful when we
want to skip the current iteration of a loop without terminating it. On
encountering next, the R interpreter skips the rest of the loop body
and starts the next iteration of the loop.
• Syntax
The basic syntax for creating a next statement in R is −
next
Example
v <- LETTERS[1:6]
for ( i in v) {

if (i == "D") {
next
}
print(i)
}
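To contrast next with break, the following sketch (the vector name odd_values is an illustrative choice) skips even numbers with next and leaves the loop entirely with break once the counter passes 7:

```r
odd_values <- c()
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers, continue with the next i
  if (i > 7) break        # stop the loop completely
  odd_values <- c(odd_values, i)
}
print(odd_values)  # 1 3 5 7
```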
R - Strings
• Any value written within a pair of single quote or double quotes in R is treated as a
string. Internally R stores every string within double quotes, even when you create
them with single quote.
Rules Applied in String Construction
• The quotes at the beginning and end of a string must both be double quotes or
both be single quotes. They cannot be mixed.
• Double quotes can be inserted into a string starting and ending with single quotes.
• A single quote can be inserted into a string starting and ending with double quotes.
• Double quotes cannot be inserted into a string starting and ending with double
quotes unless each one is escaped with a backslash (\").
• A single quote cannot be inserted into a string starting and ending with single quotes
unless it is escaped with a backslash (\').
Example
a <- 'Start and end with single quote'
print(a)
b <- "Start and end with double quotes"
print(b)
c <- "single quote ' in between double quotes"
print(c)
d <- 'Double quotes " in between single quote'
print(d)
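The quoting rules can be tried out directly. This is a small sketch with made-up strings; it shows the backslash escape and two common base R string functions, nchar() and paste().

```r
# Escaped double quotes inside a double-quoted string.
quoted <- "She said \"hello\""
cat(quoted, "\n")          # cat() shows the string without the escapes
print(nchar(quoted))       # 16 characters; the backslashes are not counted

# Escaped single quote inside a single-quoted string.
single_q <- 'It\'s fine'
print(single_q)            # R displays it with double quotes

# paste() joins strings, separated by a space by default.
print(paste("R", "strings"))
```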
R - Functions
• A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
• In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
• The function in turn performs its task and returns control to the interpreter as well as any result
which may be stored in other objects.
• Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −
function_name <- function(arg_1, arg_2, ...) {
Function body
}
Function Components
• The different parts of a function are −
• Function Name − This is the actual name of the function. It is stored
in R environment as an object with this name.
• Arguments − An argument is a placeholder. When a function is
invoked, you pass a value to the argument. Arguments are optional;
that is, a function may contain no arguments. Also arguments can
have default values.
• Function Body − The function body contains a collection of
statements that defines what the function does.
• Return Value − The return value of a function is the last expression in
the function body to be evaluated.
Built-in Function

• Simple examples of in-built functions are seq(), mean(), max(), sum(x) and
paste(...) etc. They are called directly by user-written programs.

# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
User-defined Function
• We can create user-defined functions in R. They are specific to what a user
wants and once created they can be used like the built-in functions. Below is
an example of how a function is created and used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}

# Call the function new.function supplying 6 as an argument.


new.function(6)
Calling a Function without an
Argument
# Create a function without an argument.
new.function <- function() {
for(i in 1:5) {
print(i^2)
}
}

# Call the function without supplying an argument.


new.function()
Calling a Function with Argument
Values (by position and by name)
# Create a function with arguments.
new.function <- function(a,b,c) {
result <- a * b + c
print(result)
}

# Call the function by position of arguments.


new.function(5,3,11)

# Call the function by names of the arguments.


new.function(a = 11, b = 5, c = 3)
Calling a Function with Default
Argument
# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}

# Call the function without giving any argument.


new.function()

# Call the function with giving new values of the argument.


new.function(9,5)
Data Structure in R
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
Vectors
• A vector is a basic data structure which plays an important role in R
programming.
• In R, a sequence of elements which share the same data type is known
as vector.
• Vector is classified into two parts, i.e., Atomic vectors and Lists.
• A vector supports logical, integer, double, character, complex, or raw
data type.
• There is only one difference between atomic vectors and lists. In an
atomic vector, all the elements are of the same type, but in the list, the
elements are of different data types.
• In R, we use c() function to create vector. This function returns a one-
dimensional array or simple vector.
Create Vector in R
• Using the colon(:) operator
• Use seq() function
Using the colon(:) operator
• We can create a vector with the help of the colon operator. There is the following
syntax to use colon operator:
z<-x:y
• Example:
a<-4:-10
a

O/P
[1] 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Using the seq() function
• In R, we can create a vector with the help of the seq() function. A
sequence function creates a sequence of elements as a vector. The
seq() function is used in two ways, i.e., by setting the step size with
the 'by' parameter or by specifying the length of the vector with the
'length.out' parameter.
Example:
seq_vec<-seq(1,4,by=0.5)
seq_vec
class(seq_vec)

Output
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1] "numeric"
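The example above uses by; the other way of calling seq() is with length.out. A minimal sketch (the variable name len_vec is illustrative):

```r
# Ask for exactly 5 evenly spaced values between 0 and 1.
len_vec <- seq(0, 1, length.out = 5)
print(len_vec)          # 0.00 0.25 0.50 0.75 1.00
print(length(len_vec))  # 5
```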
Atomic vectors in R
• R has six atomic vector types (logical, integer, double, character,
complex and raw); the four used most often are described here.
Atomic vectors play an important role in Data Science and are created
with the help of the c() function. These atomic vectors are as follows:
• Numeric vector
• Logical vector
• Integer vector
• Character vector
Numeric vector
• The decimal values are known as numeric data types in R. If we assign
a decimal value to any variable d, then this d variable will become a
numeric type. A vector which contains numeric elements is known as
a numeric vector.
Example:
d<-45.5
num_vec<-c(10.1, 10.2, 33.2)
d
num_vec
class(d)
class(num_vec)

Output
[1] 45.5
[1] 10.1 10.2 33.2
[1] "numeric"
[1] "numeric"
Integer vector
• A non-fractional numeric value is known as integer data. R stores
integers as 32-bit (4-byte) values. There are two ways to assign an
integer value to a variable, i.e., by using the as.integer() function or
by appending L to the value.
• A vector which contains integer elements is known as an integer
vector.
Example:
d<-as.integer(5)
e<-5L
int_vec<-c(1,2,3,4,5)
int_vec<-as.integer(int_vec)
int_vec1<-c(1L,2L,3L,4L,5L)
class(d)
class(e)
class(int_vec)
class(int_vec1)

Output:
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
Character vector
• A character value holds a text string. In R, there are two different
ways to create a character data type value, i.e., using the
as.character() function or by typing a string between double
quotes ("") or single quotes ('').
Example:
d<-'shubham'
e<-"Arpita"
f<-65
f<-as.character(f)
d
e
f
char_vec<-c(1,2,3,4,5)
char_vec<-as.character(char_vec)
char_vec1<-c("shubham","arpita","nishka","vaishali")
char_vec
char_vec1
class(d)
class(e)
class(f)
class(char_vec)
class(char_vec1)

Output
[1] "shubham"
[1] "Arpita"
[1] "65"
[1] "1" "2" "3" "4" "5"
[1] "shubham"  "arpita"   "nishka"   "vaishali"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
Logical vector
• The logical data type has only two values, i.e., TRUE or FALSE. These
values usually result from testing whether a condition is satisfied. A
vector which contains Boolean values is known as a logical vector.
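The slides give no example for logical vectors, so here is a short sketch. The marks and the pass mark of 50 are invented for illustration.

```r
marks <- c(45, 72, 38, 90)

# A comparison on a vector produces a logical vector of the same length.
passed <- marks >= 50
print(passed)         # FALSE TRUE FALSE TRUE
print(class(passed))  # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values.
print(sum(passed))    # 2

# A logical vector can be used to pick out elements.
print(marks[passed])  # 72 90
```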
R Lists
• Lists are R objects which can contain elements of different types,
such as numbers, strings, logical values, vectors and even another
list. A list can also contain a function or a matrix as an element.
Creating a List

Example:
list_1<-list("Shubham","Arpita","Vaishali")
list_1

Output:
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] "Vaishali"
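The list above holds only strings. The sketch below, using a hypothetical student record, shows a list with mixed element types and the usual ways of accessing and extending it:

```r
# A named list mixing a string, a numeric vector and a logical value.
student <- list(name = "Arpita", marks = c(81, 92, 77), passed = TRUE)

print(student$name)           # access by name with $
print(student[["marks"]][2])  # [[ ]] returns the element itself: 92

# Adding a new element by assigning to a new name.
student$city <- "Pune"
print(names(student))
```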
R Arrays
• In R, arrays are data objects which allow us to store data in more than two dimensions. An
array is created with the help of the array() function, which takes a vector as input and arranges its
values according to the dimensions given in the dim parameter.
• For example, if we create an array of dimension (2, 3, 4), it will contain 4 rectangular matrices
of 2 rows and 3 columns each.
R Array Syntax:
array_name <- array(data, dim = c(row_size, column_size, matrices), dimnames = dim_names)
• data
The data is the first argument in the array() function. It is an input vector which is given to the array.
• row_size
This parameter defines the number of rows in each matrix of the array.
• column_size
This parameter defines the number of columns in each matrix of the array.
• matrices
This parameter defines how many matrices the array consists of.
• dim_names
This parameter (a list passed as dimnames) is used to change the default names of rows, columns and matrices.
Example
#Creating two vectors of different lengths
vec1 <-c(1,3,5)
vec2 <-c(10,11,12,13,14,15)

#Initializing names for rows, columns and matrices


col_names <- c("Col1","Col2","Col3")
row_names <- c("Row1","Row2","Row3")
matrix_names <- c("Matrix1","Matrix2")

#Taking the vectors as input to the array


res <- array(c(vec1,vec2),dim=c(3,3,2),dimnames=list(row_names,col_names,matrix_names))
print(res)
Output

, , Matrix1
Col1 Col2 Col3
Row1 1 10 13
Row2 3 11 14
Row3 5 12 15

, , Matrix2
Col1 Col2 Col3
Row1 1 10 13
Row2 3 11 14
Row3 5 12 15
Accessing array elements
# Creating a 2x3 array
my_array <- array(1:6, dim = c(2, 3))
print(my_array)
Output
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Access the element in the 1st row, 2nd column
my_array[1, 2] # This gives you 3
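The same indexing extends to three dimensions as [row, column, matrix]. A minimal sketch, with an array named arr chosen for illustration:

```r
# 4 matrices of 2 rows and 3 columns, filled column by column with 1..24.
arr <- array(1:24, dim = c(2, 3, 4))

print(arr[1, 2, 3])  # row 1, column 2 of the 3rd matrix: 15
print(arr[, , 1])    # leaving an index empty selects everything: the whole 1st matrix
print(dim(arr))      # 2 3 4
```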
Factors
• Factors are the data objects which are used to categorize the data and
store it as levels.
• They can store both strings and integers.
• E.g. "Male", "Female" & TRUE, FALSE etc.
• Factors are created using the factor() function by taking a vector as
input.
Example
# Creating a vector as input.
data <-
c("Shubham","Nishka","Arpita","Nishka","Nishka","Shubham","Sumit","Arpita","Sumit")
print(data)
print(is.factor(data))

# Applying the factor function.


factor_data<- factor(data)
print(factor_data)
print(is.factor(factor_data))
• Output

[1] "Shubham" "Nishka"  "Arpita"  "Nishka"  "Nishka"  "Shubham" "Sumit"
[8] "Arpita"  "Sumit"
[1] FALSE
[1] Shubham Nishka  Arpita  Nishka  Nishka  Shubham Sumit   Arpita  Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
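Once a factor exists, its levels and the frequency of each level can be inspected. This sketch reuses the vector from the example; levels(), table() and as.integer() are base R functions.

```r
data <- c("Shubham","Nishka","Arpita","Nishka","Nishka","Shubham","Sumit","Arpita","Sumit")
factor_data <- factor(data)

print(levels(factor_data))      # the distinct categories, sorted alphabetically
print(table(factor_data))       # how many times each level occurs
print(as.integer(factor_data))  # the integer codes stored behind the labels
```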
R Data Frame
• A data frame is a two-dimensional array-like structure or a table in
which each column contains the values of one variable, and each row
contains one set of values, one from each column.
• A data frame has the following characteristics:
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of factor, numeric, or
character type.
• Each column contains the same number of data items.
Create Data Frame
# Creating the data frame.
emp.data<- data.frame(
employee_id = c (1:5),
employee_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
sal = c(623.3,915.2,611.0,729.0,843.25),
starting_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)

# Printing the data frame.


print(emp.data)
Output
  employee_id employee_name    sal starting_date
1           1       Shubham 623.30    2012-01-01
2           2        Arpita 915.20    2013-09-23
3           3        Nishka 611.00    2014-11-15
4           4        Gunjan 729.00    2014-05-11
5           5         Sumit 843.25    2015-03-27
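After creating a data frame, typical next steps are inspecting its structure, reading a column and selecting rows. The sketch below rebuilds a shortened version of emp.data; the salary threshold of 700 and the bonus column are assumptions made only for illustration.

```r
emp.data <- data.frame(
  employee_id = 1:5,
  employee_name = c("Shubham", "Arpita", "Nishka", "Gunjan", "Sumit"),
  sal = c(623.3, 915.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

str(emp.data)                  # column names, types and first values
print(emp.data$employee_name)  # one column as a vector

# Rows where the condition is TRUE, all columns.
high_paid <- emp.data[emp.data$sal > 700, ]
print(high_paid)

# Add a derived column.
emp.data$bonus <- emp.data$sal * 0.10
```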
R Matrix
• In R, a two-dimensional rectangular data set is known as a matrix.
• In an R matrix, elements are arranged in a fixed number of rows and columns. All elements of a matrix
have the same type, most often numeric.
• A matrix is created using matrix() function.
• Syntax:
matrix(data, nrow, ncol, byrow, dimnames)
• data
The first argument in matrix function is data. It is the input vector which supplies the data elements of the
matrix.
• nrow
The second argument is the number of rows which we want to create in the matrix.
• ncol
The third argument is the number of columns which we want to create in the matrix.
• byrow
The byrow parameter is a logical value. If it is TRUE, the input vector elements are arranged by
row; otherwise they are arranged by column.
• dimnames
The dimnames parameter is a list holding the names assigned to the rows and columns.
Example
#Arranging elements sequentially by row.
P <- matrix(c(5:16), nrow = 4, byrow = TRUE)
print(P)

# Arranging elements sequentially by column.


Q <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(Q)

# Defining the column and row names.


row_names = c("row1", "row2", "row3", "row4")
col_names = c("col1", "col2", "col3")

R <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))


print(R)
Output

     [,1] [,2] [,3]
[1,]    5    6    7
[2,]    8    9   10
[3,]   11   12   13
[4,]   14   15   16

     [,1] [,2] [,3]
[1,]    3    7   11
[2,]    4    8   12
[3,]    5    9   13
[4,]    6   10   14

     col1 col2 col3
row1    3    4    5
row2    6    7    8
row3    9   10   11
row4   12   13   14
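A few common operations on a matrix, sketched with the matrix P from the example above: dim(), element access, t() for the transpose and colSums() are all base R.

```r
P <- matrix(5:16, nrow = 4, byrow = TRUE)

print(dim(P))      # 4 rows, 3 columns
print(P[2, 3])     # element in row 2, column 3: 10
print(t(P))        # transpose: 3 rows, 4 columns
print(P * 2)       # arithmetic is applied element by element
print(colSums(P))  # sum of each column: 38 42 46
```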
Data Manipulation
• Data manipulation involves modifying data to make it easier to read
and better organized.
• We manipulate data for analysis and visualization.
• Once data collection is done, the data may contain errors and
inaccurate readings; data manipulation helps to remove these
inaccuracies and make the data more accurate and precise.
Ways to Manipulate / treat data
• Manipulating data using inbuilt base R functions.
• Use of Packages
• Use ML algorithms
Data Manipulation techniques
• Filtering & ordering rows
• Renaming and adding columns
• Computing summary statistics
Packages
• Many of R’s most useful functions do not come preloaded when you
start R, but reside in packages that can be installed on top of R.
• R packages are similar to libraries in C, C++, and Javascript, packages
in Python, and gems in Ruby.
• An R package bundles together useful functions, help files, and data
sets.
Installing Packages
• To use an R package, you must first install it on your computer and
then load it in your current R session. The easiest way to install an R
package is with the install.packages R function. Open R and type the
following into the command line:
install.packages("<package name>")
• You can install multiple packages at once by linking their names with
R’s concatenate function, c.
• For example, to install the ggplot2, reshape2, and dplyr packages, run:
install.packages(c("ggplot2", "reshape2", "dplyr"))
Loading Packages
• Installing a package doesn’t immediately place its functions at your
fingertips. It just places them on your computer. To use an R package,
you next have to load it in your R session with the command:
library(<package name>)
• To see which packages you currently have in your R library, run:
library()
Sample in R
• sample() takes a sample of the specified size from the elements of x,
either with or without replacement.
• Syntax of sample():
sample(x, size, replace = FALSE, prob = NULL)
• x - vector or a data set.
• size - sample size.
• replace - if FALSE (the default), no element can be drawn more than
once; if TRUE, the same element may be drawn repeatedly.
• prob - probability weights
Example
sample(1:100,10,replace=TRUE)

Output (the values are random and differ from run to run):
[1] 55 39 71 98 41 26 17 38 82 49
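Because sample() is random, results are usually made repeatable with set.seed(). A short sketch, where the seed value 42 is arbitrary:

```r
# The same seed gives the same draw.
set.seed(42)
draw_a <- sample(1:100, 10, replace = TRUE)
set.seed(42)
draw_b <- sample(1:100, 10, replace = TRUE)
print(identical(draw_a, draw_b))  # TRUE

# Without replacement and with size equal to length(x), the result is a shuffle.
no_repeat <- sample(1:10, 10)
print(anyDuplicated(no_repeat))   # 0 means no value occurs twice
```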
Pipe Operator
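• The pipe operator chains multiple functions together: the result on
its left is passed as the first argument of the function on its right.
• It is denoted as %>% and is available after loading dplyr.
• It can be used with functions like filter(), select(), arrange(), group_by()
etc.
• Example:
stud_data %>% filter(gender1 == "male", math_score > 50)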
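The stud_data data frame from the slide is not available here, so the sketch below uses the built-in mtcars data set as a stand-in. It assumes the dplyr package is installed and chains filter(), select() and arrange() in one pipeline.

```r
library(dplyr)

# Four-cylinder cars with at least 30 mpg, three columns only, best mileage first.
result <- mtcars %>%
  filter(cyl == 4, mpg >= 30) %>%
  select(mpg, cyl, hp) %>%
  arrange(desc(mpg))

print(result)
```

Reading the pipeline top to bottom mirrors the order in which the steps are applied, which is the main reason for preferring it over nested function calls.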
Data manipulation Functions
• select()
• filter()
• summarise()
• arrange()
• mutate()
• transmute()
Select() function
• To select columns (variables) by their names.
• The first argument to this function is the data frame and subsequent
arguments are columns to keep.
• To implement select function we need to load the dplyr package.
library(dplyr)
mydata<-mtcars

• To display specific columns in the dataset


select(mydata,1:3)
• The following functions help you to select variables based on their
names
Helper | Description
everything() | All variables
starts_with() | Names that start with a prefix
ends_with() | Names that end with a suffix
contains() | Names that contain a literal string
matches() | Names that match a regular expression
num_range() | Numerical range like x01, x02, x03
one_of() | Variables named in a character vector
starts_with()
• The starts_with() function is used to select variables whose names start with a given prefix.
• E.g.
mydata1=select(mydata,starts_with("cyl"))
• Adding a negative sign before starts_with() drops the variables
whose names start with that prefix.
• E.g.
mydata2=select(mydata,-starts_with("cyl"))
contains()
• Selects variables that contain 's' in their names.
• E.g.
mydata4=select(mydata,contains("s"))
match() function
• match() is a base R function that returns, for each value of the first vector, the position of its
first match in the second vector.
• Syntax:
match(v1, v2, nomatch = NA_integer_, incomparables = NULL)

• v1 = vector of values to be matched
• v2 = vector of values to be matched against
• nomatch = value to be returned when there is no match
• incomparables = values to be excluded from the match function

• E.g.
v1<-c(2,5,6,3,7)
v2<-c(15,16,7,3,2,7,5)
match(v1,v2)

O/P: [1] 5 7 NA 4 3
matches() function
• The matches() function selects columns whose names match a regular
expression.
select(mydata,matches("^(d)"))
The above statement will display the columns whose names start with "d".
num_range() function
• Used to select columns whose names consist of a prefix followed by a number, like
col1, col2, col3

colnames(mydata)<-sprintf("col%d",1:ncol(mydata))
select(mydata,num_range("col",1:3))
Rename() function
• It is used to change a variable name.
• Syntax:
rename(data, new_name = old_name)

data = data frame
new_name = new variable name you want to keep
old_name = existing variable name
Filter() function
• It is used to subset data with matching logical conditions, i.e., to pick rows
based on their values.
• It works like the select() function: we pass a data frame followed by one or
more conditions separated by commas. All conditions must hold for a row to be kept.
• Syntax:
filter(data,condition)
e.g.
filter(mydata,mpg>=30.00,cyl==4)
summarise() function
• It is used to create summary statistics for a dataset, such as
calculating the mean, sum, or count of variables, often after grouping
the data with group_by().
• Syntax:
summarize(dataframeName, aggregate_function(columnName))
Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C", "A"),
values = c(10, 20, 30, 40, 50, 15)
)
summarise(data, mean_value = mean(values), sum_value = sum(values))

o/p:
mean_value sum_value
1 27.5 165
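The description mentions grouping with group_by(), which the example does not show. The following sketch reuses the same data and assumes dplyr is installed; the result name by_category is illustrative.

```r
library(dplyr)

data <- data.frame(
  category = c("A", "B", "A", "B", "C", "A"),
  values = c(10, 20, 30, 40, 50, 15)
)

# One summary row per category instead of one row for the whole data set.
by_category <- data %>%
  group_by(category) %>%
  summarise(mean_value = mean(values), n = n())

print(by_category)
```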
Arrange() function
• The arrange() function in R, part of the dplyr package, is used to
reorder rows of a data frame or tibble based on the values of one or
more columns. By default, arrange() sorts the data in ascending order,
but you can sort in descending order using the desc() function.
• Syntax:
arrange(data, column_name)
• Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C"),
values = c(30, 10, 20, 40, 50)
)
arrange(data, values)

O/p:
category values
1 B 10
2 A 20
3 A 30
4 B 40
5 C 50
mutate() function
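• It is used to create new variables from existing ones.
• Syntax:
mutate(data_frame,expression(s))

E.g. (multiply every column by 1000 and store the results in new columns ending in "_new"):

mydata=mutate(mydata,across(everything(), ~ .x * 1000, .names = "{.col}_new"))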
Data Analysis & Visualization
• Data visualization is a technique used for the graphical representation of data.
• Elements like scatter charts, graphs, histograms, maps etc.
make our data more understandable.
• Data visualization makes it easy to recognize patterns, trends, and
exceptions in our data.
• ggplot2 is a powerful and flexible R package for producing elegant
graphics.
• ggplot2 divides a plot into three fundamental parts:
• Plot = data + aesthetics + geometry
• Data: a data frame.
• Aesthetics: used to indicate the X & Y variables. They can also be used to control the
color, the size or the shape of points, the height of bars etc.
• Geometry: defines the type of graphic, like histogram, box plot, line plot, density
plot etc.
• There are two major functions in the ggplot2 package: qplot() & ggplot().
• qplot() stands for quick plot and can be used to produce simple plots easily (it is deprecated since ggplot2 3.4.0).
• ggplot() is more flexible and robust than qplot() for building a plot piece by piece.
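The data + aesthetics + geometry breakdown maps directly onto code. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data set; the choice of the wt and mpg columns and the labels are illustrative only.

```r
library(ggplot2)

# data: mtcars; aesthetics: wt on x, mpg on y; geometry: points.
p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue") +
  labs(title = "Weight vs. mileage", x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)  # draws the plot on the active graphics device
```

Swapping geom_point() for another geometry, such as geom_line(), changes the plot type while the data and aesthetics stay the same.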
R visualization Packages
• plotly
• ggplot2
• tidyquant
• taucharts
• geofacet
• googleVis
• dygraphs
Introduction to ggplot2
• It is a plotting system for R.
• It is used to build professional-looking graphs.
• It produces plots quickly with minimal code.
• To install the ggplot2 package:
• install.packages("ggplot2")
Types of Data Visualization
• Scatter plot
• Bar & stacked bar chart
• Histogram
• Pie chart
• Box plot
• Area chart
• Heat map
• Correlogram
Scatter Plots
• Scatter plots are used to compare variables.
• A comparison between variables is required when we need to determine
how much one variable is affected by another variable.
• Data is represented as a collection of points.
• Each point on the scatter plot defines the values of two variables. One
variable is selected for the vertical axis and the other for the horizontal axis.
• In R, there are two ways of creating a scatter plot: using the plot() function
or using the ggplot2 package's functions.
• The plot() function is used to plot R objects. The basic syntax for creating a scatter plot
in R:
plot(x, y, type, main, sub, xlab, ylab, asp, col, ...)

• x: x coordinates of the points in the plot; alternatively a single plotting structure, a function, or an R object.
• y: y coordinates of the points in the plot.
• type: 'p' for points, 'l' for lines, 'b' for both, 'h' for histogram-like (high-density) vertical lines, etc.
• main: title of the plot
• sub: subtitle of the plot
• xlab: title for the x-axis
• ylab: title for the y-axis
• asp: aspect ratio (y/x)
• col: color of the points or lines
Example
• Step 1: Create the Data
# Create some sample data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.3, 3.5, 4.1, 5.6, 6.8, 7.4, 8.0, 9.1, 10.3, 11.2)

• Step 2: Create the Scatter Plot


# Plot the scatter plot
plot(x, y,
main = "Scatter Plot Example", # Title of the plot
xlab = "X-axis Label", # Label for the x-axis
ylab = "Y-axis Label", # Label for the y-axis
pch = 19, # Type of point (19 is filled circle)
col = "blue") # Color of the points
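The slides mention a second way of creating a scatter plot, using ggplot2. A minimal sketch of the same plot with ggplot(), assuming ggplot2 is installed (the data frame name df is an illustrative choice):

```r
library(ggplot2)

# Same sample data as in the plot() example
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.3, 3.5, 4.1, 5.6, 6.8, 7.4, 8.0, 9.1, 10.3, 11.2)

# ggplot() expects the data in a data frame
df <- data.frame(x = x, y = y)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(color = "blue", shape = 19) +   # filled blue circles
  labs(title = "Scatter Plot Example",
       x = "X-axis Label",
       y = "Y-axis Label")

print(p)
```

Here shape = 19 plays the role of pch = 19 in the base plot() call.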
Bar Charts
• A bar chart is a pictorial representation in which numerical values of
variables are represented by length or height of lines or rectangles of
equal width.
• It is used for summarizing a set of categorical data.
• R provides the barplot() function, which has the following syntax:
• barplot(H, xlab, ylab, main, names.arg, col)
• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label for the x-axis
• ylab is the label for the y-axis
• main is the title of the bar chart
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
Example
# Sample data
categories <- c("A", "B", "C", "D", "E")
values <- c(10, 20, 15, 25, 30)

# Create the bar chart
barplot(values,
names.arg = categories, # Category names
main = "Bar Chart Example", # Title of the plot
xlab = "Categories", # X-axis label
ylab = "Values", # Y-axis label
col = "skyblue", # Color of the bars
border = "blue") # Color of the border
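As noted above, barplot() also accepts a matrix, in which case each column becomes one stacked bar. A small sketch of a stacked bar chart in base R; the sales matrix and its row and column names are made-up values for illustration:

```r
# Hypothetical data: two products (rows) across three categories (columns)
sales <- matrix(c(10, 20, 15,
                   5, 10, 20),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Product 1", "Product 2"),
                                c("A", "B", "C")))

# Each column is drawn as one bar, with the rows stacked on top of each other
bar_positions <- barplot(sales,
                         main = "Stacked Bar Chart Example",
                         xlab = "Categories",
                         ylab = "Values",
                         col = c("skyblue", "orange"),
                         legend.text = rownames(sales))  # legend from row names
```

Passing beside = TRUE to barplot() would draw the rows side by side instead of stacked.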
Histogram
• It is a type of bar chart that shows how frequently values fall into
each of a set of value ranges (bins).
• In a histogram, the height of each bar represents the number of values
present in the given range.
• R provides the hist() function, which takes a vector as input and accepts
additional parameters to customize the plot.
• Syntax:

hist(v, main, xlab, xlim, ylim, breaks, col, border)

v: vector containing the numeric values used in the histogram
main: indicates the title of the chart.
col: used to set the color of the bars.
border: used to set the border color of each bar.
xlab: used to give a description of the x-axis.
xlim: used to specify the range of values on the x-axis.
ylim: used to specify the range of values on the y-axis.
breaks: used to set the number of bins or the break points between them, which determines the width of each bar.
Example

# Sample data
data <- c(5, 7, 8, 12, 15, 18, 20, 22, 24, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)

# Create the histogram


hist(data,
main = "Histogram Example", # Title of the plot
xlab = "Value", # X-axis label
ylab = "Frequency", # Y-axis label
col = "lightblue", # Color of the bars
border = "black") # Color of the bar borders
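The example above relies on the default bins. A short sketch showing how the breaks and xlim parameters from the syntax list might be used with the same sample data; the bin width of 20 is an arbitrary choice for illustration:

```r
# Same sample data as above
data <- c(5, 7, 8, 12, 15, 18, 20, 22, 24, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)

# Explicit break points give four bins of width 20
h <- hist(data,
          breaks = seq(0, 80, by = 20),   # bin edges: 0, 20, 40, 60, 80
          xlim = c(0, 80),                # range shown on the x-axis
          main = "Histogram with Custom Breaks",
          xlab = "Value",
          col = "lightblue",
          border = "black")

# hist() also returns the computed bin counts
h$counts
```

The returned object can be inspected to check how many values fell into each bin without reading them off the chart.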
