Big-Data Unit-4
R / WEKA MACHINE
LEARNING
UNIT-4
What is the R Programming Language?
• R is a programming language and analytics tool that was developed in
1993 by Robert Gentleman and Ross Ihaka.
• It is one of the most popular tools used in Data Analytics and
Business Analytics.
• CRAN (Comprehensive R Archive Network) is its package repository,
hosting more than 10,000 R packages that cover the functionality
needed for working with data.
Features of R
1. Arithmetic Operators:
Operator   Description           Example    Result
+          Addition              5 + 3      8
-          Subtraction           5 - 3      2
*          Multiplication        5 * 3      15
/          Division              5 / 2      2.5
^ or **    Exponentiation        5 ^ 2      25
%%         Modulus (remainder)   5 %% 2     1
%/%        Integer Division      5 %/% 2    2
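The operators in the table can be tried directly at the R console. This is a minimal sketch; the variable names x and y are illustrative and do not come from the slides.

```r
# Illustrative operands (not from the original slides)
x <- 5
y <- 2

x + y    # addition: 7
x - y    # subtraction: 3
x * y    # multiplication: 10
x / y    # division: 2.5
x ^ y    # exponentiation: 25
x %% y   # remainder after division: 1
x %/% y  # integer division: 2
```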
2. Relational Operators:
! (logical NOT): takes each element of the vector and gives the opposite
logical value as a result.
It will give us the following output:
[1] FALSE TRUE FALSE FALSE
4  &&  Logical AND on single values. It gives TRUE as a result only if both
operands are TRUE. Since R 4.3.0 each operand must have length one (earlier
versions silently used only the first element of a longer vector), so the
first elements are selected explicitly.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] && b[1])
It will give us the following output:
[1] TRUE
5  ||  Logical OR on single values. It gives TRUE as a result if at least one
of the operands is TRUE. The same length-one rule applies.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] || b[1])
It will give us the following output:
[1] TRUE
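To contrast the single-value operators with their element-wise counterparts & and |, here is a small sketch. The vectors and result names are illustrative, not taken from the slides.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise operators return one result per position
elementwise_and <- a & b   # TRUE FALSE FALSE
elementwise_or  <- a | b   # TRUE TRUE TRUE

# && and || work on single values, so one element is picked explicitly
first_and <- a[1] && b[1]  # TRUE
other_or  <- a[2] || b[3]  # FALSE
```

The element-wise forms are the usual choice for filtering vectors, while && and || are typically used inside if conditions.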
Assignment Operators
S. No  Operator          Description
1      <- or = or <<-    These operators are known as left assignment operators.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <<- c(2, 4, TRUE, 2+3i)
d = c(1, 2, TRUE, 2+3i)
print(a)
print(b)
print(d)
It will give us the following output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
[1] 1+0i 2+0i 1+0i 2+3i
2      -> or ->>         These operators are known as right assignment operators.
Example:
c(3, 0, TRUE, 2+2i) -> a
c(2, 4, TRUE, 2+3i) ->> b
print(a)
print(b)
It will give us the following output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
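At the top level <- and <<- behave alike; the difference shows up inside a function. The sketch below illustrates this under the assumption of a variable named counter and two hypothetical helper functions.

```r
counter <- 0

increment_local <- function() {
  counter <- counter + 1   # creates a local copy; the outer value is unchanged
  counter
}

increment_outer <- function() {
  counter <<- counter + 1  # modifies the variable found in an enclosing environment
  counter
}

increment_local()   # returns 1
counter             # still 0
increment_outer()   # returns 1
counter             # now 1
```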
Miscellaneous Operators
S. No  Operator  Description
1      :         The colon operator creates a series of numbers in sequence
                 for a vector.
Example:
v <- 1:8
print(v)
It will give us the following output:
[1] 1 2 3 4 5 6 7 8
2      %in%      This is used to identify whether an element belongs to a
                 vector.
Example:
a1 <- 8
a2 <- 12
d <- 1:10
print(a1 %in% d)
print(a2 %in% d)
It will give us the following output:
[1] TRUE
[1] FALSE
3      %*%       Matrix multiplication operator; here it multiplies a matrix
                 by its transpose.
Example:
M <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
P <- M %*% t(M)
print(P)
It will give us the following output:
     [,1] [,2]
[1,]   14   32
[2,]   32   77
Decision Making in R
• Decision making is about deciding the order of execution of
statements based on certain conditions. The programmer provides a
condition that the program evaluates, together with statements that
are executed if the condition is true and, optionally, other
statements that are executed if the condition is false.
• The decision-making statements in R are as follows:
• if statement
• if-else statement
• nested if-else statement
• switch statement
R - If Statement
• An if statement consists of a Boolean expression followed by one or
more statements.
• Syntax
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
• If the Boolean expression evaluates to TRUE, the block of code
inside the if statement is executed. If it evaluates to FALSE,
execution continues with the first statement after the closing curly
brace of the if statement.
Example
x <- 30L
if(is.integer(x)) {
print("X is an Integer")
}
R - If...Else Statement
• An if statement can be followed by an optional else statement, which
executes when the Boolean expression is false.
• Syntax
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
• If the Boolean expression evaluates to TRUE, the if block of
code is executed; otherwise the else block of code is executed.
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
Nested if-else statement
• An if statement can be followed by an optional else if...else statement, which is
useful for testing several conditions in a single if...else if chain. Only the block of
the first condition that is true is executed.
• Syntax
if (boolean_expression_1) {
  # Executes when boolean expression 1 is true.
} else if (boolean_expression_2) {
  # Executes when boolean expression 2 is true.
} else if (boolean_expression_3) {
  # Executes when boolean expression 3 is true.
} else {
  # Executes when none of the above conditions is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
R - Switch Statement
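• A switch statement tests an expression against a list of
alternatives. Each alternative is called a case, and the value of the
matching case is returned.
• If the expression evaluates to a number, the case at that position
is selected; if it evaluates to a character string, the case with that
name is selected.
• Syntax
switch(expression, case1, case2, case3....)
Example
x <- switch(
  3,
  "first",
  "second",
  "third",
  "fourth"
)
print(x)
Output:
[1] "third"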
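switch() can also select a case by name when the expression is a character string. The following sketch assumes a hypothetical helper called calculate; the unnamed final argument acts as the default case.

```r
# Hypothetical helper that maps an operation name to a result
calculate <- function(op, x, y) {
  switch(op,
    add      = x + y,
    subtract = x - y,
    multiply = x * y,
    stop("Unknown operation: ", op)  # unnamed last argument is the default
  )
}

calculate("add", 4, 2)       # 6
calculate("multiply", 4, 2)  # 8
```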
R - Loops
• There may be situations where a block of code must be executed
several times. In general, statements are executed sequentially: the
first statement in a function is executed first, followed by the
second, and so on. A loop lets a statement or group of statements run
repeatedly.
• Types of loops in R programming:
• Repeat loop
• While loop
• For loop
• Loop control: the next statement
R - Repeat Loop
• The Repeat loop executes the same code again and again until a stop
condition is met.
• Syntax
repeat {
commands
if(condition) {
break
}
}
Example
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt+1
if(cnt > 5) {
break
}
}
R - While Loop
• The while loop executes the same code again and again as long as its
test condition is TRUE; it stops as soon as the condition becomes FALSE.
• Syntax
while (test_expression) {
  statement
}
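Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
  print(v)
  cnt <- cnt + 1
}
R - For Loop
• A for loop repeats a block of code once for each element of a vector
or list.
• Example
v <- LETTERS[1:4]
for (i in v) {
  print(i)
}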
R - Next Statement
• The next statement in the R programming language is useful when we
want to skip the current iteration of a loop without terminating it. On
encountering next, the R interpreter skips the rest of the loop body and
starts the next iteration of the loop.
• Syntax
The basic syntax of the next statement in R is:
next
Example
v <- LETTERS[1:6]
for ( i in v) {
if (i == "D") {
next
}
print(i)
}
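The difference between next and break can be seen side by side. This is a small sketch with an illustrative vector named visited: next skips one iteration, while break ends the loop.

```r
visited <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next    # skip even numbers and continue with the next iteration
  }
  if (i > 7) {
    break   # leave the loop entirely
  }
  visited <- c(visited, i)
}
print(visited)  # 1 3 5 7
```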
R - Strings
• Any value written within a pair of single quotes or double quotes in R is treated as a
string. Internally R stores every string within double quotes, even when it is created
with single quotes.
Rules Applied in String Construction
• The quotes at the beginning and end of a string must both be double quotes or
both be single quotes. They cannot be mixed.
• Double quotes can be inserted into a string starting and ending with single quotes.
• A single quote can be inserted into a string starting and ending with double quotes.
• Double quotes cannot be inserted into a string starting and ending with double
quotes unless they are escaped with a backslash (\").
• A single quote cannot be inserted into a string starting and ending with single quotes
unless it is escaped with a backslash (\').
Example
a <- 'Start and end with single quote'
print(a)
b <- "Start and end with double quotes"
print(b)
c <- "single quote ' in between double quotes"
print(c)
d <- 'Double quotes " in between single quote'
print(d)
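A quote of the same kind as the delimiter can still appear inside a string if it is escaped with a backslash. The sketch below uses illustrative variable names e and f.

```r
# Escaping lets a quote appear inside a string delimited by the same quote
e <- "She said \"hello\" to everyone"
f <- 'It\'s a single-quoted string'

print(e)      # print() shows the escape characters
cat(e, "\n")  # cat() shows the text as it would be displayed
cat(f, "\n")
nchar(e)      # each escaped quote counts as one character
```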
R - Functions
• A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
• In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
• The function in turn performs its task and returns control to the interpreter as well as any result
which may be stored in other objects.
• Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −
function_name <- function(arg_1, arg_2, ...) {
Function body
}
Function Components
• The different parts of a function are −
• Function Name − This is the actual name of the function. It is stored
in R environment as an object with this name.
• Arguments − An argument is a placeholder. When a function is
invoked, you pass a value to the argument. Arguments are optional;
that is, a function may contain no arguments. Also arguments can
have default values.
• Function Body − The function body contains a collection of
statements that defines what the function does.
• Return Value − The return value of a function is the last expression in
the function body to be evaluated.
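The components listed above can be seen in a short user-defined function. This is a sketch with a hypothetical function named power; it shows a default argument value and the implicit return of the last expression.

```r
# Hypothetical function: raises x to a power, squaring by default
power <- function(x, exponent = 2) {
  x ^ exponent   # the last evaluated expression is the return value
}

power(3)                    # 9, uses the default exponent
power(3, 3)                 # 27, arguments matched by position
power(exponent = 4, x = 2)  # 16, named arguments in any order
```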
Built-in Function
O/P
[1] 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Using the seq() function
• In R, we can create a vector with the help of the seq() function. A
sequence function creates a sequence of elements as a vector. The
seq() function is used in two ways: by setting the step size with the
'by' parameter, or by specifying the length of the vector with the
'length.out' parameter.
Example:
seq_vec<-seq(1,4,by=0.5)
seq_vec
class(seq_vec)
Output
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1] "numeric"
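For comparison, a short sketch of the length.out form next to the by form; the variable names are illustrative.

```r
# Ask for a fixed number of evenly spaced values
seq_len_vec <- seq(1, 4, length.out = 4)
seq_len_vec          # 1 2 3 4
length(seq_len_vec)  # 4

# Ask for a fixed step size instead
seq_by_vec <- seq(0, 1, by = 0.25)
seq_by_vec           # 0.00 0.25 0.50 0.75 1.00
```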
Atomic vectors in R
• R has six atomic vector types (logical, integer, double, character,
complex and raw). The four used most often in Data Science are covered
here. Atomic vectors are created with the help of the c() function:
• Numeric vector
• Logical vector
• Integer vector
• Character vector
Numeric vector
• The decimal values are known as numeric data types in R. If we assign
a decimal value to any variable d, then this d variable will become a
numeric type. A vector which contains numeric elements is known as
a numeric vector.
Example:
d<-45.5
num_vec<-c(10.1, 10.2, 33.2)
d
num_vec
class(d)
class(num_vec)
Output
[1] 45.5
[1] 10.1 10.2 33.2
[1] "numeric"
[1] "numeric"
Integer vector
• A numeric value with no fractional part can be stored as integer
data. R stores integers as 32-bit (4-byte) values. There are two ways to
assign an integer value to a variable: by using the as.integer()
function or by appending L to the value.
• A vector which contains integer elements is known as an integer
vector.
Example:
d<-as.integer(5)
e<-5L
int_vec<-c(1,2,3,4,5)
int_vec<-as.integer(int_vec)
int_vec1<-c(1L,2L,3L,4L,5L)
class(d)
class(e)
class(int_vec)
class(int_vec1)
Output:
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
Character vector
• Character data holds text (strings). In R, there are two different
ways to create a character value: using the as.character() function,
or typing the string between double quotes ("") or single quotes ('').
• Example:
d<-'shubham'
e<-"Arpita"
f<-65
f<-as.character(f)
d
e
f
char_vec<-c(1,2,3,4,5)
char_vec<-as.character(char_vec)
char_vec1<-c("shubham","arpita","nishka","vaishali")
char_vec
char_vec1
class(d)
class(e)
class(f)
class(char_vec)
class(char_vec1)
• Output
[1] "shubham"
[1] "Arpita"
[1] "65"
[1] "1" "2" "3" "4" "5"
[1] "shubham"  "arpita"   "nishka"   "vaishali"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
Logical vector
The logical data type has only two values, TRUE and FALSE. These
values usually result from testing whether a condition is satisfied. A
vector which contains Boolean values is known as a logical vector.
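A logical vector most often arises from a comparison. The sketch below uses an illustrative vector of scores and a pass mark of 50, neither of which comes from the slides.

```r
scores <- c(45, 72, 88, 30)   # illustrative data
passed <- scores >= 50        # the comparison yields a logical vector

passed          # FALSE TRUE TRUE FALSE
class(passed)   # "logical"
sum(passed)     # TRUE counts as 1, so this is the number of passes: 2
scores[passed]  # logical indexing keeps 72 and 88
```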
R Lists
• Lists are R objects which can contain elements of different types,
such as numbers, vectors, strings and even another list. A list can
also contain a function or a matrix as an element.
• A single list can therefore hold strings, numbers, vectors and logical values together.
Creating a List
Example:
list_1<-list("Shubham","Arpita","Vaishali")
list_1
Output:
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] "Vaishali"
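Since the example above holds only strings, here is a sketch of a list that mixes types and shows the common ways to access its elements. The list student and its fields are hypothetical.

```r
# Hypothetical record mixing types in one list
student <- list(
  name   = "Shubham",
  marks  = c(78, 85, 91),
  passed = TRUE
)

student$name            # access an element by name with $
student[["marks"]]      # [[ ]] returns the element itself (a numeric vector)
student["marks"]        # [ ] returns a sub-list containing the element
student$city <- "Pune"  # assigning to a new name adds an element
length(student)         # 4
```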
R Arrays
• In R, arrays are data objects which allow us to store data in more than two dimensions. An
array is created with the help of the array() function, which takes a vector as input
and arranges its values according to the dim parameter.
• For example, an array of dimension (2, 3, 4) consists of 4 rectangular matrices,
each with 2 rows and 3 columns.
R Array Syntax:
array_name <- array(data, dim = c(row_size, column_size, matrices), dimnames = dim_names)
• data
The data is the first argument in the array() function. It is an input vector which is given to the array.
• matrices
The number of matrices in the array; an array consists of multi-dimensional matrices.
• row_size
This parameter defines the number of rows in each matrix.
• column_size
This parameter defines the number of columns in each matrix.
• dim_names
This parameter, passed as dimnames, is a list used to change the default names of rows, columns and matrices.
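Example
#Creating two vectors of different lengths
vec1 <- c(1, 3, 5)
vec2 <- c(10, 11, 12, 13, 14, 15)
#Naming the rows, columns and matrices
row_names <- c("Row1", "Row2", "Row3")
col_names <- c("Col1", "Col2", "Col3")
matrix_names <- c("Matrix1", "Matrix2")
#Building a 3 x 3 x 2 array; the 9 input values are recycled for the second matrix
res <- array(c(vec1, vec2), dim = c(3, 3, 2),
  dimnames = list(row_names, col_names, matrix_names))
print(res)
Output:
, , Matrix1

     Col1 Col2 Col3
Row1    1   10   13
Row2    3   11   14
Row3    5   12   15

, , Matrix2

     Col1 Col2 Col3
Row1    1   10   13
Row2    3   11   14
Row3    5   12   15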
Accessing array elements
• Example
# Creating a 2x3 array
my_array <- array(1:6, dim = c(2, 3))
print(my_array)
• Output
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Access the element in the 1st row, 2nd column
my_array[1, 2] # This gives you 3
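The same indexing extends to three dimensions by adding a third index for the matrix. The sketch below assumes an illustrative array named arr with the (2, 3, 4) shape mentioned earlier.

```r
arr <- array(1:24, dim = c(2, 3, 4))  # 4 matrices, each 2 rows by 3 columns

dim(arr)       # 2 3 4
arr[1, 2, 1]   # row 1, column 2 of the first matrix: 3
arr[, , 2]     # the whole second matrix
arr[2, 3, 4]   # last element of the last matrix: 24
```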
Factors
• Factors are data objects which are used to categorize data and
store it as levels.
• They can store both integers and strings.
• E.g. "Male", "Female" and TRUE, FALSE etc.
• Factors are created using the factor() function, which takes a vector as
input.
Example
# Creating a vector as input.
data <-
c("Shubham","Nishka","Arpita","Nishka","Nishka","Shubham","Sumit","Arpita","Sumit")
print(data)
print(is.factor(data))
# Applying the factor function.
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
Output:
[1] "Shubham" "Nishka"  "Arpita"  "Nishka"  "Nishka"  "Shubham" "Sumit"
[8] "Arpita"  "Sumit"
[1] FALSE
[1] Shubham Nishka  Arpita  Nishka  Nishka  Shubham Sumit   Arpita  Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
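A few base R functions help to inspect a factor. The sketch below uses an illustrative gender vector to show the levels, the underlying integer codes and the counts per level.

```r
gender <- factor(c("Male", "Female", "Female", "Male", "Female"))

levels(gender)       # "Female" "Male" (sorted alphabetically by default)
as.integer(gender)   # underlying integer codes: 2 1 1 2 1
table(gender)        # counts per level: Female 3, Male 2
```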
R Data Frame
• A data frame is a two-dimensional, array-like structure or table in
which each column contains the values of one variable and each row
contains one set of values, one from each column.
• A data frame has the following characteristics:
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of factor, numeric, or
character type.
• Each column contains the same number of data items.
Creating a Data Frame
# Creating the data frame.
emp.data <- data.frame(
  employee_id = c(1:5),
  employee_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
  sal = c(623.3,915.2,611.0,729.0,843.25),
  starting_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
    "2015-03-27")),
  stringsAsFactors = FALSE
)
Output:
[1] 55 39 71 98 41 26 17 38 82 49
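Once a data frame exists, a few base R functions are enough to inspect it and pull out parts of it. The sketch below builds a smaller, hypothetical data frame named emp so that it runs on its own.

```r
# Smaller illustrative data frame (a cut-down version of the example above)
emp <- data.frame(
  employee_id   = 1:3,
  employee_name = c("Shubham", "Arpita", "Nishka"),
  sal           = c(623.3, 915.2, 611.0),
  stringsAsFactors = FALSE
)

str(emp)              # structure: column names, types and first values
nrow(emp)             # number of rows: 3
ncol(emp)             # number of columns: 3
emp$employee_name     # one column as a vector
emp[emp$sal > 620, ]  # rows where sal exceeds 620
```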
Pipe Operator
• The pipe operator chains multiple function calls together: the result
on its left is passed as the first argument to the function on its right.
• It is written as %>%.
• It can be used with functions like filter(), select(), arrange(), group_by()
etc.
• Example:
stud_data %>% filter(gender1 == "male", math_score > 50)
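To show what the pipe saves, the sketch below writes the same three steps twice, once nested and once piped. It assumes the dplyr package is installed and uses the built-in mtcars data set rather than the stud_data frame, which is not defined in the slides.

```r
library(dplyr)

# Nested call: must be read from the inside out
head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# Same steps with the pipe: read from top to bottom
top_four_cyl <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)
top_four_cyl
```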
Data manipulation Functions
• select()
• filter()
• summarise()
• arrange()
• mutate()
• transmute()
Select() function
• To select columns (variables) by their names.
• The first argument to this function is the data frame and subsequent
arguments are columns to keep.
• To implement select function we need to load the dplyr package.
library(dplyr)
mydata<-mtcars
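A minimal sketch of select() on the mtcars data, assuming dplyr is installed. The result names selected, dropped and ranged are illustrative.

```r
library(dplyr)
mydata <- mtcars

selected <- select(mydata, mpg, cyl, hp)   # keep three columns by name
dropped  <- select(mydata, -gear, -carb)   # a minus sign drops columns
ranged   <- select(mydata, mpg:disp)       # a range of adjacent columns

names(selected)  # "mpg" "cyl" "hp"
names(ranged)    # "mpg" "cyl" "disp"
```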
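• The base R match() function returns, for each element of its first
argument, the position of its first match in the second argument (NA if
there is no match).
• E.g.
v1 <- c(2, 5, 6, 3, 7)
v2 <- c(15, 16, 7, 3, 2, 7, 5)
match(v1, v2)
O/P: [1] 5 7 NA 4 3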
matches() function
• The matches() function selects the columns whose names match a regular
expression.
select(mydata, matches("^(d)"))
The above statement will display the columns whose names start with "d".
num_range() function
• This function selects columns whose names consist of a prefix followed
by a number, like col1, col2, col3.
colnames(mydata) <- sprintf("col%d", seq_along(mydata))
select(mydata, num_range("col", 1:3))
rename() function
• It is used to change a variable (column) name.
• Syntax:
rename(data, new_name = old_name)
data = the data frame
new_name = the new variable name you want to keep
old_name = the existing variable name
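A short sketch of rename() on mtcars, assuming dplyr is installed. The new column names are illustrative.

```r
library(dplyr)
mydata <- mtcars

# new_name = old_name; several columns can be renamed in one call
renamed <- rename(mydata, miles_per_gallon = mpg, cylinders = cyl)
names(renamed)[1:2]  # "miles_per_gallon" "cylinders"
```

All other columns keep their names and their order.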
Filter() function
• It is used to subset data with logical conditions, picking rows
based on their values.
• It is used to find rows with matching criteria. It works like the select()
function: we pass a data frame followed by one or more conditions
separated by commas. A row is kept only if it meets every condition.
• Syntax:
filter(data, condition)
e.g.
filter(mydata, mpg >= 30.00, cyl == 4)
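The sketch below runs the filter above on mtcars and contrasts it with an OR condition. It assumes dplyr is installed; the result names are illustrative.

```r
library(dplyr)
mydata <- mtcars

# Comma-separated conditions are combined with AND
efficient_small <- filter(mydata, mpg >= 30, cyl == 4)

# Use | inside a single condition for OR
either <- filter(mydata, mpg >= 30 | hp > 250)

nrow(efficient_small)  # number of cars meeting both conditions
nrow(either)           # number of cars meeting at least one condition
```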
summarise() function
• It is used to create summary statistics for a dataset, such as
calculating the mean, sum, or count of variables, often after grouping
the data with group_by().
• Syntax:
summarize(dataframeName, aggregate_function(columnName))
Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C", "A"),
values = c(10, 20, 30, 40, 50, 15)
)
summarise(data, mean_value = mean(values), sum_value = sum(values))
o/p:
mean_value sum_value
1 27.5 165
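When combined with group_by(), summarise() returns one row per group instead of one row overall. This sketch reuses the sample data above and assumes dplyr is installed; the result name by_category is illustrative.

```r
library(dplyr)

data <- data.frame(
  category = c("A", "B", "A", "B", "C", "A"),
  values = c(10, 20, 30, 40, 50, 15)
)

by_category <- data %>%
  group_by(category) %>%
  summarise(mean_value = mean(values), count = n())
by_category  # one row each for A, B and C
```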
Arrange() function
• The arrange() function in R, part of the dplyr package, is used to
reorder rows of a data frame or tibble based on the values of one or
more columns. By default, arrange() sorts the data in ascending order,
but you can sort in descending order using the desc() function.
• Syntax:
arrange(data, column_name)
• Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C"),
values = c(30, 10, 20, 40, 50)
)
arrange(data, values)
O/p:
category values
1 B 10
2 A 20
3 A 30
4 B 40
5 C 50
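Descending order and sorting by more than one column can be sketched as follows, reusing the sample data above and assuming dplyr is installed. The result names are illustrative.

```r
library(dplyr)

data <- data.frame(
  category = c("A", "B", "A", "B", "C"),
  values = c(30, 10, 20, 40, 50)
)

# Largest value first
descending <- arrange(data, desc(values))

# Sort by category, then by values (largest first) within each category
nested <- arrange(data, category, desc(values))
```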
mutate() function
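• It is used to create new variables (columns) or to modify existing ones.
• Syntax:
mutate(data_frame, expression(s))
E.g.:
mydata <- mutate_all(mydata, list(new = ~ . * 1000))
• mutate_all() applies the same expression to every column; here each
column gets a copy multiplied by 1000, named with the suffix _new.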
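For the more common case of adding a few named columns, a minimal sketch on mtcars follows. It assumes dplyr is installed; the new column names hp_per_wt and high_mpg are illustrative.

```r
library(dplyr)
mydata <- mtcars

with_new <- mutate(
  mydata,
  hp_per_wt = hp / wt,   # new column computed from two existing columns
  high_mpg  = mpg > 25   # new logical column
)

names(with_new)  # the original columns followed by the two new ones
```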
Data Analysis & Visualization
• Data visualization is a technique used for the graphical representation of data.
• Elements like scatter charts, graphs, histograms and maps make our
data more understandable.
• Data visualization makes it easy to recognize patterns, trends, and
exceptions in our data.
• ggplot2 is a powerful and flexible R package for producing elegant
graphics.
• ggplot2 divides a plot into three fundamental parts:
• Plot = data + aesthetics + geometry
• Data: a data frame.
• Aesthetics: used to indicate the X and Y variables. They can also be used to control the
color, the size or the shape of points, the height of bars etc.
• Geometry: defines the type of graphic, like histogram, box plot, line plot, density
plot etc.
• There are two major functions in the ggplot2 package: qplot() and ggplot().
• qplot() stands for quick plot and can be used to produce simple plots easily. It has been
deprecated since ggplot2 3.4.0.
• The ggplot() function is more flexible and robust than qplot() for building a plot piece by piece.
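The data + aesthetics + geometry structure maps directly onto ggplot2 code. The sketch below assumes ggplot2 is installed and uses the built-in mtcars data set; the titles and labels are illustrative.

```r
library(ggplot2)

p <- ggplot(data = mtcars,                                 # data: a data frame
            aes(x = wt, y = mpg, colour = factor(cyl))) +  # aesthetics: x, y and colour
  geom_point(size = 3) +                                   # geometry: points (scatter plot)
  labs(title = "Fuel efficiency by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon",
       colour = "Cylinders")

print(p)  # draws the plot
```

Swapping geom_point() for another geometry such as geom_boxplot() changes the type of plot while the data and aesthetics stay the same.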
R visualization Packages
• plotly
• ggplot2
• tidyquant
• taucharts
• geofacet
• googleVis
• dygraphs
Introduction to ggplot2
• It is a plotting system.
• It is used to build professional-looking graphs.
• It creates plots quickly with minimal code.
• To install the ggplot2 package:
• install.packages("ggplot2")
Types of Data Visualization
• Scatter plot
• Bar & stack bar chart
• Histogram
• Pie chart
• Box chart
• Area chart
• Heat map
• Correlogram
Scatter Plots
• Scatter plots are used to compare variables.
• A comparison between variables is required when we need to determine
how much one variable is affected by another variable.
• Data is represented as a collection of points.
• Each point on the scatter plot defines the values of two variables. One
variable is selected for the vertical axis and the other for the horizontal axis.
• In R, there are two ways of creating a scatter plot: using the plot() function
and using functions from the ggplot2 package.
• The plot() function is used to plot R objects. The basic syntax for creating a scatter plot
in R:
plot(x, y, type, main, sub, xlab, ylab, asp, col,…..)
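A minimal base R scatter plot following this syntax, using the built-in mtcars data set. The title, labels and colour are illustrative choices.

```r
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs. fuel efficiency",  # title of the plot
     xlab = "Weight (1000 lbs)",           # X-axis label
     ylab = "Miles per gallon",            # Y-axis label
     col  = "blue",                        # colour of the points
     pch  = 19)                            # solid circles as point symbols
```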
Bar Chart
# Sample data (illustrative values)
categories <- c("A", "B", "C", "D")
values <- c(10, 24, 17, 31)
barplot(values,
  names.arg = categories, # Category names
  main = "Bar Chart Example", # Title of the plot
  xlab = "Categories", # X-axis label
  ylab = "Values", # Y-axis label
  col = "skyblue", # Color of the bars
  border = "blue") # Color of the border
Histogram
• A histogram is a type of bar chart which shows how many values fall
into each of a set of value ranges (bins).
• In a histogram, the height of each bar represents the number of values
present in the given range.
• R provides the hist() function, which takes a vector as input and accepts
further parameters to customise the plot.
• Syntax:
# Sample data
data <- c(5, 7, 8, 12, 15, 18, 20, 22, 24, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)