Big-Data Unit-4

DATA ANALYTICS WITH R, Introduction to R, Data Manipulation, Data Visualization, Data Analysis

Uploaded by

Tulshiram Kamble


DATA ANALYTICS WITH R / WEKA MACHINE LEARNING
UNIT-4
What is the R Programming Language?
• R is a programming language and analytics tool that was developed in
1993 by Robert Gentleman & Ross Ihaka.
• It is one of the most popular tools used in Data Analytics &
Business Analytics.
• CRAN (the Comprehensive R Archive Network) is a repository of
more than 10,000 R packages that provide the functionality
required for working with data.
Features of R

• Open Source: Freely available and supported by a large community.

• Comprehensive Statistical Analysis: Built-in support for a wide range of
statistical techniques.
• Extensive Package Ecosystem: Thousands of packages available for various
tasks.
• Data Manipulation and Wrangling: Powerful tools for data cleaning and
transformation.
• Data Visualization: Advanced plotting capabilities with packages like ggplot2.
• Vectorized Operations: Efficient handling of vector and matrix operations.
• Extensible: Users can create custom functions and packages.
• Integration with Other Languages: Compatible with Python, C++, SQL, and
more.
Applications of R programming
• Healthcare Data Analysis: R is used to analyse patient data, identify trends, predict outcomes,
and improve treatment plans in hospitals and research institutions.
• Financial Risk Management: Banks and financial institutions use R for risk analysis, credit scoring,
and to build predictive models for stock market trends and investment strategies.
• Retail and E-commerce: Companies like Amazon and Walmart use R to analyze customer
purchasing patterns, optimize pricing strategies, and manage inventory through predictive
analytics.
• Social Media Analytics: R is used to analyze social media platforms like Twitter and Facebook for
sentiment analysis, trend analysis, and to measure the impact of marketing campaigns.
• Geographical Data Analysis: R can handle spatial data and perform geographical data analysis
with packages like sp and sf (the older rgeos and rgdal packages were retired from CRAN in 2023).
Data Types in R
Data Type | Description | Example
Numeric | Represents real numbers. Whole numbers typed without an L suffix are also stored as numeric (double). | 2, 3.14, -5.67
Integer | Represents whole numbers (without decimals), defined with an L suffix. | 1L, 100L, -10L
Character | Represents text or strings of characters. | "Hello", "R Language"
Logical | Represents boolean values, which are either TRUE or FALSE. | TRUE, FALSE
Complex | Represents complex numbers with real and imaginary parts. | 1+2i, 3-4i
Factor | Represents categorical data, often used for fields that have a fixed number of unique values. | factor(c("Male", "Female"))
Date | Represents dates. | as.Date("2024-08-28")
Raw | Represents raw bytes, mainly used for low-level operations and binary data manipulation. | charToRaw("A")
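The types in the table above can be checked from the console. The following is a minimal sketch using class() and typeof(); the variable names are made up for illustration.

```r
# Inspect the type of one example value per data type.
# All variable names here are illustrative, not from the slides.
num_val  <- 3.14                         # numeric (stored as double)
int_val  <- 100L                         # integer, because of the L suffix
chr_val  <- "R Language"                 # character
lgl_val  <- TRUE                         # logical
cpx_val  <- 1 + 2i                       # complex
fct_val  <- factor(c("Male", "Female"))  # factor built from a character vector
date_val <- as.Date("2024-08-28")        # Date
raw_val  <- charToRaw("A")               # raw

print(class(num_val))    # "numeric"
print(class(int_val))    # "integer"
print(class(fct_val))    # "factor"
print(class(date_val))   # "Date"
print(class(raw_val))    # "raw"

# typeof() reports the underlying storage type instead of the class.
print(typeof(num_val))   # "double"
print(typeof(fct_val))   # "integer" (factors store integer codes)
```

class() answers "how does R treat this object", while typeof() answers "how is it stored", which is why a factor reports "factor" for one and "integer" for the other.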
Variables in R
• In R, variables are used to store data that can be referenced and
manipulated throughout your script or program.
• A valid variable name consists of letters, numbers and the dot or
underscore characters. It must start with a letter, or with a dot that is
not followed by a number.
Variable Name | Naming Convention | Example Usage
age | Simple | age <- 25
total_sales | Snake case | total_sales <- 1000
AverageHeight | Pascal (upper camel) case | AverageHeight <- 175.5
num_of_items | Snake case | num_of_items <- 50
maxTemp | Camel case | maxTemp <- 37.2
.internal_value | Starts with period | .internal_value <- 42
customerName | Camel case | customerName <- "Alice"
Variable Assignment
• The variables can be assigned values using leftward( <- ), rightward( -> )
and equal to ( = ) operator.
• The values of the variable can be printed using print() or cat() function.
• The cat() function combines multiple items into a continuous print
output.
• To know all the variables currently available in the workspace we use
the ls() function.
• Variables can be deleted by using the rm() function.
Example
# Using leftward assignment
a <- 5
b <- "Hello"
c <- TRUE

# Using rightward assignment
7 -> d
"World" -> e
FALSE -> f

# Using equal sign for assignment
g = 3.14
h = "R Programming"
i = NULL

# Printing all variables
print(a) # Output: [1] 5
print(b) # Output: [1] "Hello"
print(c) # Output: [1] TRUE
print(d) # Output: [1] 7
print(e) # Output: [1] "World"
print(f) # Output: [1] FALSE
print(g) # Output: [1] 3.14
print(h) # Output: [1] "R Programming"
print(i) # Output: NULL
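The slides mention cat(), ls() and rm() without showing them in use. Here is a small sketch of all three; the variable names sales_total and sales_label are hypothetical.

```r
# Illustrative variables (names are hypothetical).
sales_total <- 1000
sales_label <- "Total sales"

# cat() joins its arguments into one continuous line of output.
cat(sales_label, "=", sales_total, "\n")

# ls() returns the names of the variables in the current workspace.
print(ls())

# rm() deletes a variable; afterwards it no longer exists.
rm(sales_total)
print(exists("sales_total"))  # FALSE
print(exists("sales_label"))  # TRUE
```

Unlike print(), cat() does not add the [1] index prefix or quotes, which makes it convenient for building readable messages.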
Types of Operators in R

1. Arithmetic Operators:
Operator | Description | Example | Result
+ | Addition | 5 + 3 | 8
- | Subtraction | 5 - 3 | 2
* | Multiplication | 5 * 3 | 15
/ | Division | 5 / 2 | 2.5
^ or ** | Exponentiation | 5 ^ 2 | 25
%% | Modulus (remainder) | 5 %% 2 | 1
%/% | Integer Division | 5 %/% 2 | 2
2. Relational Operators:

Operator | Description | Example | Result
== | Equal to | 5 == 3 | FALSE
!= | Not equal to | 5 != 3 | TRUE
> | Greater than | 5 > 3 | TRUE
< | Less than | 5 < 3 | FALSE
>= | Greater than or equal to | 5 >= 3 | TRUE
<= | Less than or equal to | 5 <= 3 | FALSE
Logical Operators:
For numeric and complex operands, zero is treated as FALSE and any non-zero value as TRUE.
1. & (element-wise logical AND)
Compares each pair of corresponding elements of the two vectors and returns TRUE
where both elements are TRUE.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a & b)
Output:
[1]  TRUE FALSE  TRUE  TRUE
2. | (element-wise logical OR)
Compares each pair of corresponding elements of the two vectors and returns TRUE
where at least one of them is TRUE.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a | b)
Output:
[1] TRUE TRUE TRUE TRUE
3. ! (logical NOT)
Takes each element of the vector and gives the opposite logical value.
Example:
a <- c(3, 0, TRUE, 2+2i)
print(!a)
Output:
[1] FALSE  TRUE FALSE FALSE
4. && (scalar logical AND)
Compares a single element from each side and gives TRUE only if both are TRUE.
Since R 4.3.0 both operands must have length one, so the first elements are
selected explicitly here.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] && b[1])
Output:
[1] TRUE
5. || (scalar logical OR)
Compares a single element from each side and gives TRUE if at least one of them
is TRUE. The same length-one rule applies as for &&.
Example:
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a[1] || b[1])
Output:
[1] TRUE
Assignment Operators
1. <- or = or <<- (left assignment operators)
Example:
a <- c(3, 0, TRUE, 2+2i)
b <<- c(2, 4, TRUE, 2+3i)
d = c(1, 2, TRUE, 2+3i)
print(a)
print(b)
print(d)
Output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
[1] 1+0i 2+0i 1+0i 2+3i
2. -> or ->> (right assignment operators)
Example:
c(3, 0, TRUE, 2+2i) -> a
c(2, 4, TRUE, 2+3i) ->> b
print(a)
print(b)
Output:
[1] 3+0i 0+0i 1+0i 2+2i
[1] 2+0i 4+0i 1+0i 2+3i
Miscellaneous Operators
1. : (colon operator)
Creates a series of consecutive numbers as a vector.
Example:
v <- 1:8
print(v)
Output:
[1] 1 2 3 4 5 6 7 8
2. %in%
Tests whether an element belongs to a vector.
Example:
a1 <- 8
a2 <- 12
d <- 1:10
print(a1 %in% d)
print(a2 %in% d)
Output:
[1] TRUE
[1] FALSE
3. %*%
Performs matrix multiplication. The example multiplies a matrix by its transpose.
Example:
M <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
P <- M %*% t(M)
print(P)
Output:
     [,1] [,2]
[1,]   14   32
[2,]   32   77
Decision Making in R
• Decision making means choosing which statements to execute based on
a condition. The programmer supplies a condition for the program to
evaluate, statements to run if the condition is true, and optionally
other statements to run if the condition is false.
• The decision-making statements in R are as follows:
• if statement
• if-else statement
• nested if-else statement
• switch statement
R - If Statement
• An if statement consists of a Boolean expression followed by one or
more statements.
• Syntax
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
• If the Boolean expression evaluates to TRUE, the block of code
inside the if statement is executed. If it evaluates to FALSE,
execution continues with the first statement after the closing curly
brace of the if statement.
Example
x <- 30L
if (is.integer(x)) {
  print("X is an Integer")
}
R - If...Else Statement
• An if statement can be followed by an optional else statement which
executes when the boolean expression is false.
• Syntax
if (boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
• If the Boolean expression evaluates to TRUE, the if block is
executed; otherwise the else block is executed.
Example

x <- c("what", "is", "truth")

# %in% is case-sensitive, so "Truth" is not found in x.
if ("Truth" %in% x) {
  print("Truth is found")
} else {
  print("Truth is not found")
}
Nested if-else statement (if...else if...else)
• An if statement can be followed by an optional else if...else chain, which is
useful for testing several conditions in a single statement. Only the block of the
first condition that is true is executed.

• Syntax
if (boolean_expression_1) {
  # Executes when boolean expression 1 is true.
} else if (boolean_expression_2) {
  # Executes when boolean expression 2 is true.
} else if (boolean_expression_3) {
  # Executes when boolean expression 3 is true.
} else {
  # Executes when none of the above conditions is true.
}
Example
x <- c("what", "is", "truth")

if ("Truth" %in% x) {
  print("Truth is found the first time")
} else if ("truth" %in% x) {
  print("truth is found the second time")
} else {
  print("No truth found")
}
R - Switch Statement
• A switch statement tests an expression against a list of alternatives
(cases). If the expression evaluates to a number, the alternative at that
position is returned; if it evaluates to a string, the alternative with the
matching name is returned.

• Syntax
switch(expression, case1, case2, case3, ...)
Example

x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)
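The example above switches on a number. As a complementary sketch, switch() can also match a string against named alternatives; the function day_type() and its labels are hypothetical and exist only for illustration.

```r
# Hypothetical helper: classify a day abbreviation with switch().
day_type <- function(day) {
  switch(day,
    "sat" = ,                  # an empty alternative falls through to the next one
    "sun" = "weekend",
    "mon" = "start of week",
    "weekday"                  # unnamed last value acts as the default
  )
}

print(day_type("sat"))  # "weekend"
print(day_type("mon"))  # "start of week"
print(day_type("wed"))  # "weekday"
```

With a string expression, an unnamed final argument plays the role of a default case; without it, a non-matching value makes switch() return NULL invisibly.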
R - Loops
• There may be situations where you need to execute a block of code
several times. Normally, statements are executed sequentially: the
first statement is executed first, followed by the second, and so on.
A loop lets you repeat a block of statements instead.
• Loop constructs in R programming:
• Repeat loop
• While loop
• For loop
• Next statement (skips the rest of the current iteration)
R - Repeat Loop
• The Repeat loop executes the same code again and again until a stop
condition is met.
• Syntax

repeat {
  commands
  if (condition) {
    break
  }
}
Example
v <- c("Hello", "loop")
cnt <- 2

repeat {
  print(v)
  cnt <- cnt + 1

  if (cnt > 5) {
    break
  }
}
R - While Loop
• The while loop executes the same code again and again as long as its
test condition remains true. The condition is checked before each iteration.
• Syntax

while (test_expression) {
  statement
}
Example
v <- c("Hello", "while loop")
cnt <- 2

while (cnt < 7) {
  print(v)
  cnt <- cnt + 1
}
R - For Loop
• A For loop is a repetition control structure that allows you to
efficiently write a loop that needs to execute a specific number of
times.

• Syntax

for (value in vector) {
  statements
}
Example

v <- LETTERS[1:4]
for (i in v) {
  print(i)
}
R - Next Statement
• The next statement in R is useful when we want to skip the current
iteration of a loop without terminating the loop. On encountering next,
R skips the rest of the current iteration and starts the next iteration
of the loop.
• Syntax
The basic syntax of the next statement in R is:
next
Example
v <- LETTERS[1:6]
for (i in v) {
  if (i == "D") {
    next
  }
  print(i)
}
R - Strings
• Any value written within a pair of single quotes or double quotes in R is treated as a
string. R prints every string within double quotes, even when you create it with
single quotes.
Rules Applied in String Construction
• The quotes at the beginning and end of a string must both be double quotes or
both be single quotes. They cannot be mixed.
• Double quotes can be inserted into a string starting and ending with single quotes.
• A single quote can be inserted into a string starting and ending with double quotes.
• Double quotes cannot be inserted into a string starting and ending with double
quotes unless they are escaped with a backslash (\").
• A single quote cannot be inserted into a string starting and ending with single quotes
unless it is escaped with a backslash (\').
Example
a <- 'Start and end with single quote'
print(a)
b <- "Start and end with double quotes"
print(b)
c <- "single quote ' in between double quotes"
print(c)
d <- 'Double quotes " in between single quote'
print(d)
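The last two rules can be worked around by escaping the quote with a backslash. The sketch below illustrates this; the variable names e and f simply continue the lettering of the example and are otherwise arbitrary.

```r
# Escaping lets a string contain the same kind of quote that delimits it.
e <- "She said \"hello\""   # double quotes inside a double-quoted string
f <- 'It\'s fine'           # single quote inside a single-quoted string

# cat() shows the string as it really is, without the escape characters.
cat(e, "\n")
cat(f, "\n")

# The backslashes are not part of the string, so they are not counted.
print(nchar(e))  # 16
print(nchar(f))  # 9
```

print() would display e with the backslashes because it shows the string in its quoted form; cat() writes the raw characters.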
R - Functions
• A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
• In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
• The function in turn performs its task and returns control to the interpreter as well as any result
which may be stored in other objects.
• Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition
is as follows −
function_name <- function(arg_1, arg_2, ...) {
Function body
}
Function Components
• The different parts of a function are −
• Function Name − This is the actual name of the function. It is stored
in R environment as an object with this name.
• Arguments − An argument is a placeholder. When a function is
invoked, you pass a value to the argument. Arguments are optional;
that is, a function may contain no arguments. Also arguments can
have default values.
• Function Body − The function body contains a collection of
statements that defines what the function does.
• Return Value − The return value of a function is the last expression in
the function body to be evaluated.
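The components listed above can be seen on a small function. This is a sketch only; area_of_rectangle() is a hypothetical function written for illustration.

```r
# Function name: area_of_rectangle (hypothetical example).
# Arguments: width (required) and height (has a default value of 1).
area_of_rectangle <- function(width, height = 1) {
  # Function body: the last evaluated expression is the return value.
  width * height
}

# formals() lists the arguments, body() shows the function body.
print(names(formals(area_of_rectangle)))  # "width" "height"
print(body(area_of_rectangle))

# Calling with and without the optional argument.
print(area_of_rectangle(4))       # 4
print(area_of_rectangle(4, 2.5))  # 10
```

An explicit return(width * height) would behave the same way here; relying on the last expression is simply the more common R idiom.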
Built-in Function

• Simple examples of built-in functions are seq(), mean(), max(), sum()
and paste(). They are called directly by user-written programs.

# Create a sequence of numbers from 32 to 44.
print(seq(32, 44))
# Find the mean of the numbers from 25 to 82.
print(mean(25:82))
# Find the sum of the numbers from 41 to 68.
print(sum(41:68))
User-defined Function
• We can create user-defined functions in R. They are specific to what a user
wants and once created they can be used like the built-in functions. Below is
an example of how a function is created and used.

# Create a function to print squares of numbers in sequence.

new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}

# Call the function new.function supplying 6 as an argument.

new.function(6)
Calling a Function without an Argument
# Create a function without an argument.
new.function <- function() {
  for (i in 1:5) {
    print(i^2)
  }
}

# Call the function without supplying an argument.

new.function()
Calling a Function with Argument Values (by position and by name)
# Create a function with arguments.
new.function <- function(a, b, c) {
  result <- a * b + c
  print(result)
}

# Call the function by position of arguments.

new.function(5, 3, 11)

# Call the function by names of the arguments.

new.function(a = 11, b = 5, c = 3)
Calling a Function with Default Argument
# Create a function with default argument values.
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}

# Call the function without giving any argument.

new.function()

# Call the function giving new values for the arguments.

new.function(9, 5)
Data Structure in R
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
Vectors
• A vector is the basic data structure in R and plays an important role
in R programming.
• In R, a sequence of elements which share the same data type is known
as an atomic vector.
• Vectors are classified into two kinds: atomic vectors and lists.
• An atomic vector can be of logical, integer, double, character, complex,
or raw type.
• The key difference between atomic vectors and lists is that in an
atomic vector all the elements are of the same type, whereas in a list
the elements can be of different data types.
• In R, we use the c() function to create a vector. This function returns
a one-dimensional vector.
Create Vector in R
• Using the colon(:) operator
• Use seq() function
Using the colon(:) operator
• We can create a vector with the help of the colon operator. There is the following
syntax to use colon operator:
z<-x:y
• Example:
a<-4:-10
a

O/P
[1] 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
Using the seq() function
• In R, we can create a vector with the help of the seq() function, which
generates a sequence of elements as a vector. The seq() function is
used in two ways: by setting the step size with the 'by' parameter, or
by specifying the length of the vector with the 'length.out'
parameter.
Example:
seq_vec <- seq(1, 4, by = 0.5)
seq_vec
class(seq_vec)

Output
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1] "numeric"
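The example above uses the step-size form only. As a short sketch of the second form, length.out asks for a given number of evenly spaced values; the variable name len_vec is arbitrary.

```r
# Ask for 7 evenly spaced values between 1 and 4 instead of giving a step size.
len_vec <- seq(1, 4, length.out = 7)
print(len_vec)          # 1.0 1.5 2.0 2.5 3.0 3.5 4.0
print(length(len_vec))  # 7
```

Here both forms produce the same vector, because 7 values spread over the range 1 to 4 are exactly 0.5 apart.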
Atomic vectors in R
• In R, an atomic vector can be logical, integer, double (numeric),
character, complex, or raw. Atomic vectors play an important role in
Data Science and are created with the help of the c() function. The
four most commonly used types are:
• Numeric vector
• Logical vector
• Integer vector
• Character vector
Numeric vector
• The decimal values are known as numeric data types in R. If we assign
a decimal value to any variable d, then this d variable will become a
numeric type. A vector which contains numeric elements is known as
a numeric vector.
Example:
d<-45.5
num_vec<-c(10.1, 10.2, 33.2)
d
num_vec
class(d)
class(num_vec)

Output
[1] 45.5
[1] 10.1 10.2 33.2
[1] "numeric"
[1] "numeric"
Integer vector
• A numeric value without a fractional part can be stored as integer
data. R stores each integer as a 32-bit (4-byte) value. There are two
ways to assign an integer value to a variable: by using the
as.integer() function or by appending L to the value.
• A vector which contains integer elements is known as an integer
vector.
Example:
d<-as.integer(5)
e<-5L
int_vec<-c(1,2,3,4,5)
int_vec<-as.integer(int_vec)
int_vec1<-c(1L,2L,3L,4L,5L)
class(d)
class(e)
class(int_vec)
class(int_vec1)

Output:
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
Character vector
• A character value holds text (a string). In R, there are two ways to
create a character value: by converting another value with the
as.character() function, or by typing a string between double
quotes ("") or single quotes ('').
• Example:
d <- 'shubham'
e <- "Arpita"
f <- 65
f <- as.character(f)
d
e
f
char_vec <- c(1, 2, 3, 4, 5)
char_vec <- as.character(char_vec)
char_vec1 <- c("shubham", "arpita", "nishka", "vaishali")
char_vec
char_vec1
class(d)
class(e)
class(f)
class(char_vec)
class(char_vec1)
• Output
[1] "shubham"
[1] "Arpita"
[1] "65"
[1] "1" "2" "3" "4" "5"
[1] "shubham"  "arpita"   "nishka"   "vaishali"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
Logical vector
The logical data type has only two values, TRUE and FALSE. These
values typically result from testing whether a condition is satisfied.
A vector which contains logical values is known as a logical vector.
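The slides give no example for logical vectors, so here is a minimal sketch. The scores and the pass mark of 50 are invented for illustration.

```r
# Illustrative data: four test scores and a pass mark of 50.
scores <- c(45, 72, 88, 39)

# A comparison on a vector yields a logical vector of the same length.
passed <- scores >= 50
print(passed)          # FALSE  TRUE  TRUE FALSE
print(class(passed))   # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values.
print(sum(passed))     # 2

# A logical vector can be used to pick the matching elements.
print(scores[passed])  # 72 88
```

Indexing with a logical vector like this is the base R mechanism behind most row filtering on data frames.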
R Lists
• Lists are R objects that can contain elements of different types,
such as numbers, strings, vectors, and even another list. A list can
also contain a function or a matrix as an element.
• A list is created with the list() function and can mix strings, numbers, vectors and logical values.
Creating a List

Example:
list_1<-list("Shubham","Arpita","Vaishali")
list_1

Output:
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] "Vaishali"
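The list above holds only strings. The following sketch shows a list with elements of different types and the usual ways to access them; the element names and values are invented for illustration.

```r
# A list mixing a string, a numeric vector and a logical value
# (names and values are illustrative).
student <- list(name = "Arpita", marks = c(78, 85, 91), passed = TRUE)

# $ and [[ ]] return the element itself.
print(class(student$name))       # "character"
print(class(student$marks))      # "numeric"
print(student[["marks"]][2])     # 85

# Single brackets return a smaller list, not the element.
print(class(student["marks"]))   # "list"

print(length(student))           # 3
```

The distinction between [ ] and [[ ]] is a common source of confusion: the first keeps the list wrapper, the second removes it.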
R Arrays
• In R, arrays are data objects that can store data in more than two dimensions. An array is created
with the array() function, which takes a vector as input and arranges its values according to the
dim parameter.
• For example, an array of dimension (2, 3, 4) consists of 4 rectangular matrices, each with
2 rows and 3 columns.
R Array Syntax:
array_name <- array(data, dim = c(row_size, column_size, matrices), dimnames = dim_names)
• data
The data is the first argument in the array() function. It is an input vector which is given to the array.
• matrices
The number of matrices in the array (the size of the third dimension).
• row_size
This parameter defines the number of rows in each matrix.
• column_size
This parameter defines the number of columns in each matrix.
• dim_names
This parameter is a list of names that replace the default names of rows, columns and matrices.
Example
#Creating two vectors of different lengths
vec1 <-c(1,3,5)
vec2 <-c(10,11,12,13,14,15)

#Initializing names for rows, columns and matrices

col_names <- c("Col1","Col2","Col3")
row_names <- c("Row1","Row2","Row3")
matrix_names <- c("Matrix1","Matrix2")

#Taking the vectors as input to the array

res <- array(c(vec1,vec2),dim=c(3,3,2),dimnames=list(row_names,col_names,matrix_names))
print(res)
Output

, , Matrix1
Col1 Col2 Col3
Row1 1 10 13
Row2 3 11 14
Row3 5 12 15

, , Matrix2
Col1 Col2 Col3
Row1 1 10 13
Row2 3 11 14
Row3 5 12 15
Accessing array elements
# Creating a 2x3 array
my_array <- array(1:6, dim = c(2, 3))
print(my_array)
Output
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
# Access the element in the 1st row, 2nd column
my_array[1, 2] # This gives you 3
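The same indexing extends to a third dimension. The sketch below uses a hypothetical 2 x 3 x 4 array named arr filled with the numbers 1 to 24.

```r
# A 2 x 3 x 4 array: four matrices of 2 rows and 3 columns, filled column by column.
arr <- array(1:24, dim = c(2, 3, 4))

# [row, column, matrix] picks a single element.
print(arr[2, 3, 1])     # 6

# Leaving an index empty selects everything along that dimension.
print(arr[, , 2])       # the whole second matrix (values 7 to 12)
print(dim(arr[, , 2]))  # 2 3

# First row of the fourth matrix.
print(arr[1, , 4])      # 19 21 23
```

Because R fills arrays in column-major order, the first matrix holds 1 to 6, the second 7 to 12, and so on.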
Factors
• Factors are data objects used to categorize data and store it as
levels.
• They can be built from integers or strings.
• E.g. "Male", "Female" and TRUE, FALSE etc.
• Factors are created using the factor() function, which takes a vector
as input.
Example
# Creating a vector as input.
data <- c("Shubham", "Nishka", "Arpita", "Nishka", "Nishka", "Shubham", "Sumit", "Arpita", "Sumit")
print(data)
print(is.factor(data))

# Applying the factor function.

factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
• Output

[1] "Shubham" "Nishka"  "Arpita"  "Nishka"  "Nishka"  "Shubham" "Sumit"
[8] "Arpita"  "Sumit"
[1] FALSE
[1] Shubham Nishka  Arpita  Nishka  Nishka  Shubham Sumit   Arpita  Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
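To make the idea of levels more concrete, here is a sketch that sets the level order explicitly and counts the values per level. The size categories are invented for illustration.

```r
# Illustrative categorical data with an explicit level order.
sizes <- factor(
  c("small", "large", "medium", "small", "large", "small"),
  levels = c("small", "medium", "large")
)

print(levels(sizes))      # "small" "medium" "large"
print(nlevels(sizes))     # 3

# Internally each value is stored as the integer position of its level.
print(as.integer(sizes))  # 1 3 2 1 3 1

# table() counts how many values fall into each level.
print(table(sizes))
```

Without the levels argument, factor() orders the levels alphabetically, which is why the output above lists Arpita before Nishka.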
R Data Frame
• A data frame is a two-dimensional, table-like structure in which each
column contains the values of one variable and each row contains
one set of values, one from each column.
• A data frame has the following characteristics:
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of factor, numeric, or
character type.
• Each column contains the same number of data items.
Create Data Frame
# Creating the data frame.
emp.data <- data.frame(
  employee_id = c(1:5),
  employee_name = c("Shubham", "Arpita", "Nishka", "Gunjan", "Sumit"),
  sal = c(623.3, 915.2, 611.0, 729.0, 843.25),
  starting_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                            "2015-03-27")),
  stringsAsFactors = FALSE
)

# Printing the data frame.

print(emp.data)
Output
  employee_id employee_name    sal starting_date
1           1       Shubham 623.30    2012-01-01
2           2        Arpita 915.20    2013-09-23
3           3        Nishka 611.00    2014-11-15
4           4        Gunjan 729.00    2014-05-11
5           5         Sumit 843.25    2015-03-27
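Once a data frame exists, its structure and contents can be inspected with base R. The sketch below uses a smaller hypothetical data frame named emp with the same kind of columns.

```r
# A small illustrative data frame (a shortened variant of the one above).
emp <- data.frame(
  id   = 1:3,
  name = c("Shubham", "Arpita", "Nishka"),
  sal  = c(623.3, 915.2, 611.0),
  stringsAsFactors = FALSE
)

# str() summarises the column names, types and first values.
str(emp)

# $ extracts one column as a vector.
print(emp$name)

# [row, column] extracts a single cell.
print(emp[2, "sal"])          # 915.2

# A logical condition in the row position keeps only matching rows.
print(emp[emp$sal > 620, ])   # rows 1 and 2

print(nrow(emp))              # 3
print(ncol(emp))              # 3
```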
R Matrix
• In R, a two-dimensional rectangular data set is known as a matrix.
• In a matrix, elements are arranged in a fixed number of rows and columns. All elements must be of the
same type; they are most often numeric.
• A matrix is created using the matrix() function.
• Syntax:
matrix(data, nrow, ncol, byrow, dimnames)
• data
The first argument of the matrix function is data. It is the input vector that supplies the elements of the
matrix.
• nrow
The second argument is the number of rows which we want to create in the matrix.
• ncol
The third argument is the number of columns which we want to create in the matrix.
• byrow
The byrow parameter is a logical flag. If its value is TRUE, the input vector elements are arranged by
row; otherwise (the default) they are arranged by column.
• dimnames
The dimnames parameter is a list holding the names assigned to the rows and columns.
Example
#Arranging elements sequentially by row.
P <- matrix(c(5:16), nrow = 4, byrow = TRUE)
print(P)

# Arranging elements sequentially by column.

Q <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(Q)

# Defining the column and row names.

row_names = c("row1", "row2", "row3", "row4")
col_names = c("col1", "col2", "col3")

R <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))

print(R)
Output

[,1] [,2] [,3]

[1,] 5 6 7
[2,] 8 9 10
[3,] 11 12 13
[4,] 14 15 16

[,1] [,2] [,3]

[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14

col1 col2 col3

row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
Data Manipulation
• Data manipulation involves modifying data to make it more organized
and easier to read.
• We manipulate data to prepare it for analysis and visualization.
• Collected data often contains errors and inaccurate readings. Data
manipulation helps to remove these inaccuracies and make the data
more accurate and precise.
Ways to Manipulate / treat data
• Manipulating data using inbuilt base R functions.
• Use of Packages
• Use ML algorithms
Data Manipulation techniques
• Filtering & ordering rows
• Renaming and adding columns
• Computing summary statistics
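The three techniques above can be sketched with base R functions alone, using the built-in mtcars data set. The variable names cars and efficient, the new column name kpl, and the approximate conversion factor 0.425 are assumptions made for illustration.

```r
# Work on a copy of the built-in mtcars data set.
cars <- mtcars

# Filtering rows: keep cars with at least 30 miles per gallon.
efficient <- cars[cars$mpg >= 30, ]

# Ordering rows: sort by mpg in ascending order.
efficient <- efficient[order(efficient$mpg), ]

# Renaming a column: cyl becomes cylinders.
names(efficient)[names(efficient) == "cyl"] <- "cylinders"

# Adding a column: approximate kilometres per litre (1 mpg is roughly 0.425 km/l).
efficient$kpl <- efficient$mpg * 0.425

print(efficient[, c("mpg", "cylinders", "kpl")])

# Computing summary statistics on the full data set.
print(mean(cars$mpg))
print(summary(cars$mpg))
```

Packages such as dplyr express the same steps with dedicated verbs, but it helps to know that nothing beyond base R is strictly required.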
Packages
• Many of R’s most useful functions do not come preloaded when you
start R, but reside in packages that can be installed on top of R.
• R packages are similar to libraries in C, C++, and Javascript, packages
in Python, and gems in Ruby.
• An R package bundles together useful functions, help files, and data
sets.
Installing Packages
• To use an R package, you must first install it on your computer and
then load it in your current R session. The easiest way to install an R
package is with the install.packages R function. Open R and type the
following into the command line:
install.packages("<package name>")
• You can install multiple packages at once by linking their names with
R’s concatenate function, c.
• For example, to install the ggplot2, reshape2, and dplyr packages, run:
install.packages(c("ggplot2", "reshape2", "dplyr"))
Loading Packages
• Installing a package doesn’t immediately place its functions at your
fingertips. It just places them on your computer. To use an R package,
you next have to load it in your R session with the command:
library(<package name>)
• To see which packages you currently have in your R library, run:
library()
Sample in R
• sample() takes a sample of the specified size from the elements of x,
either with or without replacement.
• Syntax of sample():
sample(x, size, replace = FALSE, prob = NULL)
• x - a vector of elements to choose from.
• size - sample size.
• replace - whether to sample with replacement. With FALSE (the default)
no element can be drawn more than once; with TRUE an element can be
drawn repeatedly.
• prob - probability weights
Example
sample(1:100, 10, replace = TRUE)

Output (the values are random and differ from run to run):
[1] 55 39 71 98 41 26 17 38 82 49
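Because sample() is random, results change between runs unless the random number generator is seeded. The sketch below shows one common way to make a sample reproducible; the seed value 42 is arbitrary.

```r
# Fixing the seed makes the random draw repeatable (42 is an arbitrary choice).
set.seed(42)
first_draw <- sample(1:100, 10, replace = TRUE)

set.seed(42)
second_draw <- sample(1:100, 10, replace = TRUE)

print(identical(first_draw, second_draw))  # TRUE

# Without replacement, sampling all 10 elements gives a random permutation.
no_repeat <- sample(1:10, 10)
print(anyDuplicated(no_repeat) == 0)       # TRUE

# prob gives unequal weights: "H" is drawn far more often than "T" here.
coin <- sample(c("H", "T"), 20, replace = TRUE, prob = c(0.9, 0.1))
print(table(coin))
```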
Pipe Operator
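• The pipe operator chains multiple function calls together, passing the
result of one call as the first argument of the next.
• It is denoted as %>% and becomes available after loading dplyr.
• It can be used with functions like filter(), select(), arrange(), group_by()
etc.
• Example:
stud_data %>% filter(gender1 == "male", math_score > 50)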
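The stud_data data frame is not defined in the slides, so the following sketch illustrates the same left-to-right idea on the built-in mtcars data set. It uses base R's native pipe |>, which assumes R 4.1 or later and is a substitute for %>% chosen here so that no package is needed.

```r
# Nested form: read from the inside out.
nested <- nrow(subset(mtcars, cyl == 4 & mpg > 30))

# Piped form: read from left to right, one step per line.
piped <- mtcars |>
  subset(cyl == 4 & mpg > 30) |>
  nrow()

print(nested)  # 4
print(piped)   # 4
```

With dplyr loaded, the piped form would read mtcars %>% filter(cyl == 4, mpg > 30) %>% nrow() and give the same count.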
Data manipulation Functions
• select()
• filter()
• summarise()
• arrange()
• mutate()
• transmute()
select() function
• It selects columns (variables) by their names.
• The first argument to this function is the data frame and subsequent
arguments are the columns to keep.
• To use the select function we need to load the dplyr package.
library(dplyr)
mydata <- mtcars

• To display specific columns in the dataset

select(mydata, 1:3)
• The following functions help you to select variables based on their
names
Helpers | Description
everything() | All variables
starts_with() | Starts with a prefix
ends_with() | Ends with a suffix
contains() | Contains a literal string
matches() | Matches a regular expression
num_range() | Numerical range like x01, x02, x03
one_of() | Variables in a character vector
starts_with()
• The starts_with() function selects variables whose names start with a given prefix.
• E.g.
mydata1 = select(mydata, starts_with("cyl"))
• Adding a negative sign before starts_with() drops the
variables whose names start with "cyl".
• E.g.
mydata2 = select(mydata, -starts_with("cyl"))
contains()
• Selects variables that contain "s" in their names.
• E.g.
mydata4 = select(mydata, contains("s"))
match() function
• Syntax:
match(v1, v2, nomatch = NA_integer_, incomparables = NULL)

• v1 = vector of values to be matched

• v2 = vector of values to be matched against
• nomatch = value to be returned when there is no match
• incomparables = values to be excluded from the match function
• match() returns, for each element of v1, the position of its first match in v2.

• E.g.
v1 <- c(2, 5, 6, 3, 7)
v2 <- c(15, 16, 7, 3, 2, 7, 5)
match(v1, v2)

O/P: [1]  5  7 NA  4  3
matches() function
• The matches() function selects columns whose names match a regular
expression.
select(mydata, matches("^(d)"))
The above statement displays the columns whose names start with "d".
num_range() function
• This function selects columns whose names combine a prefix with a number, like
col1, col2, col3.

colnames(mydata) <- sprintf("col%d", 1:ncol(mydata))
select(mydata, num_range("col", 1:3))
rename() function
• It is used to change a variable name.
• Syntax:
rename(data, new_name = old_name)

data = data frame
new_name = new variable name you want to keep
old_name = existing variable name
filter() function
• It is used to subset data with matching logical conditions, picking rows
based on their values.
• It works like the select() function in that we pass a data frame followed
by conditions separated by commas; a row is kept only if all conditions
hold.
• Syntax:
filter(data, condition)
e.g.
filter(mydata, mpg >= 30.00, cyl == 4)
summarise() function
• It is used to create summary statistics for a dataset, such as
calculating the mean, sum, or count of variables, often after grouping
the data with group_by().
• Syntax:
summarize(dataframeName, aggregate_function(columnName))
Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C", "A"),
values = c(10, 20, 30, 40, 50, 15)
)
summarise(data, mean_value = mean(values), sum_value = sum(values))

o/p:
mean_value sum_value
1 27.5 165
Arrange() function
• The arrange() function in R, part of the dplyr package, is used to
reorder rows of a data frame or tibble based on the values of one or
more columns. By default, arrange() sorts the data in ascending order,
but you can sort in descending order using the desc() function.
• Syntax:
arrange(data, column_name)
• Example:
data <- data.frame(
category = c("A", "B", "A", "B", "C"),
values = c(30, 10, 20, 40, 50)
)
arrange(data, values)

O/p:
category values
1 B 10
2 A 20
3 A 30
4 B 40
5 C 50
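mutate() function
• It is used to create new variables.
• Syntax:
mutate(data_frame, expression(s))

E.g.:

mydata = mutate(mydata, across(everything(), ~ .x * 1000, .names = "{.col}_new"))
This adds, for every existing column, a new column with the suffix _new holding the value multiplied by 1000.
Older dplyr code wrote this with mutate_all() and funs(); funs() is now deprecated.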
Data Analysis & Visualization
• Data visualization is a technique used for the graphical representation of data.
• Elements like scatter charts, graphs, histograms, maps etc.
make our data more understandable.
• Data visualization makes it easy to recognize patterns, trends, and
exceptions in our data.
• ggplot2 is a powerful and flexible R package for producing elegant
graphics.
• ggplot2 divides a plot into three fundamental parts:
• Plot = data + aesthetics + geometry
• Data: a data frame.
• Aesthetics: used to indicate the X & Y variables. They can also be used to control the
color, the size or the shape of points, the height of bars etc.
• Geometry: defines the type of graphic, like histogram, box plot, line plot, density
plot etc.
• There are two major functions in the ggplot2 package: qplot() & ggplot().
• qplot() stands for quick plot and can be used to produce simple plots easily (recent ggplot2 releases mark it as deprecated).
• The ggplot() function is more flexible and robust than qplot() for building a plot piece by piece.
R visualization Packages
• plotly
• ggplot2
• tidyquant
• taucharts
• geofacet
• googleVis
• dygraphs
Introduction to ggplot2
• It is a plotting system.
• It is used to build professional-looking graphs.
• It lets you create plots quickly with minimal code.
• To install the ggplot2 package:
• install.packages("ggplot2")
Types of Data Visualization
• Scatter plot
• Bar & stacked bar chart
• Histogram
• Pie chart
• Box plot
• Area chart
• Heat map
• Correlogram
Scatter Plots
• Scatter plots are used to compare two variables.
• Such a comparison is needed when we want to see how strongly one
variable is related to another.
• Data is represented as a collection of points.
• Each point on the scatter plot gives the values of two variables. One
variable is plotted on the vertical axis and the other on the horizontal axis.
• In R, there are two ways of creating a scatter plot: the base plot() function
and the functions of the ggplot2 package.
• plot() is used to plot R objects. The basic syntax for creating a scatter plot
in R:
plot(x, y, type, main, sub, xlab, ylab, asp, col, ...)

• x: x coordinates of the points; alternatively a single plotting structure, a function, or an R object with a plot method.
• y: y coordinates of the points in the plot.
• type: 'p' for points, 'l' for lines, 'b' for both, 'h' for histogram-like (high-density) vertical lines etc.
• main: title of the plot
• sub: subtitle of the plot
• xlab: title for the x-axis
• ylab: title for the y-axis
• asp: the y/x aspect ratio
• col: color of the points or lines
Example
• Step 1: Create the Data
# Create some sample data
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.3, 3.5, 4.1, 5.6, 6.8, 7.4, 8.0, 9.1, 10.3, 11.2)

• Step 2: Create the Scatter Plot

# Plot the scatter plot
plot(x, y,
main = "Scatter Plot Example", # Title of the plot
xlab = "X-axis Label", # Label for the x-axis
ylab = "Y-axis Label", # Label for the y-axis
pch = 19, # Type of point (19 is filled circle)
col = "blue") # Color of the points
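The second route mentioned above, using ggplot2, might look like the sketch below for the same data. It assumes ggplot2 is installed; ggplot() expects a data frame, so the two vectors are first combined into one (the name `df` is illustrative).

```r
library(ggplot2)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.3, 3.5, 4.1, 5.6, 6.8, 7.4, 8.0, 9.1, 10.3, 11.2)

# ggplot() works on data frames, so wrap the vectors
df <- data.frame(x = x, y = y)

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(shape = 19, colour = "blue") +  # same filled blue circles as pch = 19
  labs(title = "Scatter Plot Example",
       x = "X-axis Label",
       y = "Y-axis Label")

print(p)
```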
Bar Charts
• A bar chart is a pictorial representation in which numerical values of
variables are represented by length or height of lines or rectangles of
equal width.
• It is used for summarizing a set of categorical data.
• R provides the barplot() function, which has the following syntax:
• barplot(H, xlab, ylab, main, names.arg, col)
• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label for the x axis
• ylab is the label for the y axis
• main is the title of the bar chart
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.
Example
# Sample data
categories <- c("A", "B", "C", "D", "E")
values <- c(10, 20, 15, 25, 30)

# Create the bar chart

barplot(values,
names.arg = categories, # Category names
main = "Bar Chart Example", # Title of the plot
xlab = "Categories", # X-axis label
ylab = "Values", # Y-axis label
col = "skyblue", # Color of the bars
border = "blue") # Color of the border
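Because H may also be a matrix, the same function can draw a stacked bar chart: each column becomes one bar and each row one segment of it. The sketch below uses made-up numbers; the matrix `sales` and its row and column names are purely illustrative.

```r
# Hypothetical data: rows are stacked segments, columns are bars
sales <- matrix(c(10, 20, 15,
                   5, 10, 25),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Online", "Store"), c("A", "B", "C")))

# With a matrix and the default beside = FALSE, the bars are stacked
mids <- barplot(sales,
                main = "Stacked Bar Chart Example",
                xlab = "Categories",
                ylab = "Values",
                col = c("skyblue", "orange"),
                legend.text = rownames(sales))
```

barplot() returns the x positions of the bar midpoints (stored here in `mids`), which is handy for adding labels afterwards. Passing beside = TRUE would draw grouped bars instead of stacked ones.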
Histogram
• It is a type of bar chart that shows how frequently values fall into
each of a set of value ranges (bins).
• In a histogram, the height of each bar represents the number of values
present in the given range.
• R provides the hist() function, which takes a vector as input and accepts
further parameters to customize the plot.
• Syntax:

hist(v, main, xlab, xlim, ylim, breaks, col, border)

v: vector containing the numeric values used in the histogram

main: indicates the title of the chart.
col: used to set the color of the bars.
border: used to set the border color of each bar
xlab: used to give a description of the x-axis
xlim: used to specify the range of values on the x-axis.
ylim: used to specify the range of values on the y-axis.
breaks: controls the bins (a suggested number of bins or a vector of breakpoints), and hence the width of each bar.
Example

# Sample data
data <- c(5, 7, 8, 12, 15, 18, 20, 22, 24, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)

# Create the histogram

hist(data,
main = "Histogram Example", # Title of the plot
xlab = "Value", # X-axis label
ylab = "Frequency", # Y-axis label
col = "lightblue", # Color of the bars
border = "black") # Color of the bar borders
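To show the effect of the breaks parameter, the sketch below redraws the same data with explicit breakpoints every 20 units and then inspects the bin counts that hist() returns. The variable name `h` and the title text are illustrative.

```r
data <- c(5, 7, 8, 12, 15, 18, 20, 22, 24, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80)

# Explicit breakpoints: bins (0,20], (20,40], (40,60], (60,80]
h <- hist(data,
          breaks = seq(0, 80, by = 20),
          main = "Histogram with Explicit Breaks",
          xlab = "Value",
          col = "lightblue",
          border = "black")

# hist() returns an object describing the bins
print(h$breaks)
print(h$counts)
```

With these breakpoints the counts are 7, 5, 4 and 4: by default each bin includes its right endpoint, so the value 20 is counted in the first bin.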
