Data Science Using R
List of Experiments (Week-wise)

Week 1: R AS CALCULATOR APPLICATION
    a. Using with and without R objects on the console.
    b. Using mathematical functions on the console.
    c. Write an R script to create R objects for a calculator application and save it in a specified location on disk.
Week 2: DESCRIPTIVE STATISTICS IN R
    a. Write an R script to find basic descriptive statistics using the summary, str and quantile functions on the mtcars & cars datasets.
    b. Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the iris dataset.
Week 3: READING AND WRITING DIFFERENT TYPES OF DATASETS
    a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing to a file in a specific disk location.
    b. Reading an Excel data sheet in R.
    c. Reading an XML dataset in R.
Week 4: VISUALIZATIONS
    a. Find the data distributions using box and scatter plots.
    b. Find the outliers using plots.
    c. Plot the histogram, bar chart and pie chart on sample data.
Week 5: CORRELATION AND COVARIANCE
    a. Find the correlation matrix.
    b. Plot the correlation plot on the dataset and visualize it, giving an overview of the relationships among the iris data.
    c. Analysis of covariance / variance (ANOVA), if the data have categorical variables, on the iris data.
Week 6: REGRESSION MODEL
    Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables affecting the admission of a student to an institute, based on his or her GRE score, GPA and rank. Also check whether the model fits well. require(foreign), require(MASS).
Week 7: MULTIPLE REGRESSION MODEL
    Apply multiple regression if the data have a continuous independent variable. Apply it on the above dataset.
Week 8: REGRESSION MODEL FOR PREDICTION
    Apply regression model techniques to predict the data on the above dataset.
Week 9: CLASSIFICATION MODEL
    a. Install relevant packages for classification.
    b. Choose a classifier for the classification problem.
    c. Evaluate the performance of the classifier.
Week 10: CLUSTERING MODEL
    a. Clustering algorithms for unsupervised classification.
    b. Plot the cluster data using R visualizations.
WEEK-1
R AS CALCULATOR APPLICATION
Aim: To perform various operations on the console, with and without R objects.
A. Using the console without R objects
   Using the console with R objects
B. Using mathematical functions on the console
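An illustrative console session for parts A and B; the particular expressions and values are only examples:

# Without R objects: expressions are evaluated directly on the console
2 + 3 * 4            # 14
(7 - 2)^2            # 25

# With R objects: values are stored in the environment and reused
a <- 12
b <- 5
a + b                # 17
a %% b               # remainder: 2
a %/% b              # integer division: 2

# Mathematical functions on the console
sqrt(144)            # 12
abs(-7.5)            # 7.5
log(100, base = 10)  # 2
round(3.14159, 2)    # 3.14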
C. Write an R script to create R objects for a calculator application and save it in a specified location on disk.
add <- function(x, y) {
  return(x + y)
}
subtract <- function(x, y) {
  return(x - y)
}
multiply <- function(x, y) {
  return(x * y)
}
divide <- function(x, y) {
  return(x / y)
}

# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice <- as.integer(readline(prompt = "Enter choice[1/2/3/4]: "))
num1 <- as.integer(readline(prompt = "Enter first number: "))
num2 <- as.integer(readline(prompt = "Enter second number: "))
operator <- switch(choice, "+", "-", "*", "/")
result <- switch(choice, add(num1, num2), subtract(num1, num2),
                 multiply(num1, num2), divide(num1, num2))
print(paste(num1, operator, num2, "=", result))
OUTPUT:
[1] "Select operation."
[1] "1.Add"
[1] "2.Subtract"
[1] "3.Multiply"
[1] "4.Divide"
Enter choice[1/2/3/4]: 4
Enter first number: 300
Enter second number: 4
[1] "300 / 4 = 75"
Theory: When objects are created on the console, they are stored in the R environment (the workspace). A value remains there until the same variable name is assigned a new value, and the workspace is cleared when the session ends unless it is explicitly saved.
WEEK-2
DESCRIPTIVE STATISTICS IN R
Aim: Write an R program to find descriptive statistics using summary().
a. Write an R script to find basic descriptive statistics using the summary, str and quantile functions on the mtcars & cars datasets.
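An illustrative script for part (a); the particular columns inspected are only examples:

# Basic descriptive statistics on the built-in mtcars and cars datasets
str(mtcars)                 # structure: variable names, types and first values
summary(mtcars)             # min, quartiles, median, mean and max for every column
quantile(mtcars$mpg)        # quartiles of a single variable

str(cars)
summary(cars)
quantile(cars$speed, probs = c(0.25, 0.5, 0.75))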
b. Write an R script to find a subset of a dataset using the subset() and aggregate() functions on the iris dataset.
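A short sketch for part (b); the filter condition and selected columns are only examples:

# subset(): rows of iris that satisfy a condition, keeping selected columns
setosa_long <- subset(iris, Species == "setosa" & Sepal.Length > 5,
                      select = c(Sepal.Length, Sepal.Width, Species))
head(setosa_long)

# aggregate(): mean of every numeric column, grouped by Species
aggregate(. ~ Species, data = iris, FUN = mean)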
Theory:
Descriptive statistics, as the name implies, refers to statistics that describe your dataset. For a large dataset, it gives you a bite-sized summary that helps you understand your data. Think of it as the résumé of the data you are going to work with: it tells you what your data holds. Statisticians often create a descriptive statistics report as a first step before diving into rigorous analytics and inferential statistics on a dataset.
WEEK-3
READING AND WRITING DIFFERENT TYPES OF DATASETS
Aim: Write a script to read and write different types of datasets.
a. Reading different types of data sets (.txt, .csv) from the web and disk, and writing to a file in a specific disk location.
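A sketch for part (a); the URL and file paths below are placeholders and must be replaced with real locations:

# Reading a .csv file from the web (placeholder URL)
web_data <- read.csv("https://example.com/sample.csv")

# Reading .csv and .txt files from disk (adjust the paths to your machine)
csv_data <- read.csv("C:/data/sample.csv")
txt_data <- read.table("C:/data/sample.txt", header = TRUE, sep = "\t")

# Writing a data frame to a specific location on disk
write.csv(csv_data, "C:/data/output/sample_copy.csv", row.names = FALSE)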
Theory:
Usually we will be working with data that is already stored in a file and must be read into R before we can work on it. R can read data from a variety of file formats.
To read an entire data frame directly, the external file will normally have a special form:
The first line of the file should have a name for each variable in the data frame.
Each additional line of the file has, as its first item, a row label, followed by the values for each variable.

WEEK-4
VISUALIZATIONS
Aim: Write an R script to visualize the various plot types.
a. Find the data distributions using box and scatter plots.
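An illustrative script for part (a), using the built-in mtcars data; the chosen variables are only examples:

# Boxplot: distribution of mpg for each cylinder count in mtcars
boxplot(mpg ~ cyl, data = mtcars,
        main = "MPG by number of cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot: relationship between car weight and mpg
plot(mtcars$wt, mtcars$mpg,
     main = "MPG vs. weight",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")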
b. Find the outliers using a plot.
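A sketch for part (b), again on mtcars; boxplot() returns the points it flags as outliers in its $out component:

# Points beyond the whiskers of the boxplot are treated as outliers
hp_box <- boxplot(mtcars$hp, main = "Horsepower", ylab = "hp")
hp_box$out   # the value(s) plotted as outliers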
c. Plot the histogram, bar chart and pie chart on sample data.
Histogram:
Bar chart:
Pie chart:
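An illustrative script for part (c) that draws the three charts above from the mtcars sample data:

# Histogram of mpg
hist(mtcars$mpg, main = "Histogram of MPG",
     xlab = "Miles per gallon", col = "lightblue")

# Bar chart and pie chart of the number of cars per cylinder count
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, main = "Cars by cylinder count",
        xlab = "Cylinders", ylab = "Number of cars")
pie(cyl_counts, main = "Share of cars by cylinder count")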
Theory: boxplot() in R helps to visualize the distribution of the data by quartile and to detect the presence of outliers. You can also use the geometric object geom_boxplot() from the ggplot2 library to draw a boxplot.
WEEK-5
CORRELATION AND COVARIANCE
Aim: Write an R script to plot the correlation on the iris dataset.
a. Find a correlation matrix and plot the correlation on the iris dataset.
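A sketch for part (a), on the numeric columns of iris:

# Correlation and covariance matrices of the numeric columns of iris
iris_num <- iris[, 1:4]
round(cor(iris_num), 2)   # correlation matrix
round(cov(iris_num), 2)   # covariance matrix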
b. Plot the correlation plot on the dataset and visualize it, giving an overview of the relationships among the variables in the iris data.
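A sketch for part (b). The corrplot package is an assumption here (it is not part of base R and must be installed separately); pairs() is shown as a base-R alternative:

# install.packages("corrplot")   # run once if corrplot is not installed
library(corrplot)
corrplot(cor(iris[, 1:4]), method = "circle")

# Base-R alternative: scatter-plot matrix coloured by species
pairs(iris[, 1:4], col = iris$Species)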
c. Analysis of covariance / variance (ANOVA), if the data have categorical variables, on the iris data.
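A sketch for part (c): a one-way ANOVA using the categorical Species variable of iris:

# One-way ANOVA: does mean Sepal.Length differ across the Species factor?
iris_aov <- aov(Sepal.Length ~ Species, data = iris)
summary(iris_aov)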
Theory:
Covariance in R programming
In statistics, covariance is a measure of how two variables of a dataset vary together; that is, it depicts the way the two variables are related to each other.
For instance, when two variables are highly positively correlated, they move in the same direction.
Correlation in R programming
Correlation on a statistical basis is the method of finding the relationship between the
variables in terms of the movement of the data. That is, it helps us analyze the effect of
changes made in one variable over the other variable of the dataset.
When two variables are highly (positively) correlated, we say that they convey the same information and have the same effect on the other variables of the dataset.
WEEK-6
REGRESSION MODEL
Aim: Write a script to read data from the web and find the relation between variables.
Import data from web storage. Name the dataset and perform logistic regression to find the relation between the variables affecting the admission of a student to an institute, based on his or her GRE score, GPA and rank. Also check whether the model fits well. require(foreign), require(MASS).
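An illustrative script for this experiment. The URL below points to a commonly used copy of the admissions data (columns admit, gre, gpa, rank); replace it with your own source if it is unavailable:

require(foreign)
require(MASS)

# Read the admissions data from web storage (substitute your own copy if needed)
admissions <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
admissions$rank <- factor(admissions$rank)

# Logistic regression: probability of admission from GRE, GPA and rank
admit_model <- glm(admit ~ gre + gpa + rank,
                   data = admissions, family = "binomial")
summary(admit_model)

# Crude fit check: likelihood-ratio test of the model against the null model
with(admit_model,
     pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))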
Theory:
Regression Analysis in R
Regression analysis is a group of statistical processes used in R programming and statistics to
determine the relationship between dataset variables. Generally, regression analysis is used
to determine the relationship between the dependent and independent variables of the
dataset. Regression analysis helps to understand how the dependent variable changes when one of the independent variables changes while the other independent variables are kept constant. This helps in building a regression model and, further, in forecasting values with respect to a change in one of the independent variables. On the basis of the type of dependent variable, the number of independent variables, and the shape of the regression line, there are four main types of regression analysis techniques: Linear Regression, Logistic Regression, Multinomial Logistic Regression and Ordinal Logistic Regression.
WEEK-7
MULTIPLE REGRESSION MODEL
Aim: Write a script to apply multiple regression.
a. Apply multiple regression, if the data have a continuous independent variable.
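A sketch for this experiment, assuming the admissions data frame loaded in Week 6 is still available; the continuous GRE score is modelled here from GPA (continuous) and rank (categorical):

# Multiple linear regression on the admissions data from the previous week
multi_model <- lm(gre ~ gpa + rank, data = admissions)
summary(multi_model)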
Theory:
Multiple linear regression is a statistical analysis technique used to predict a variable’s
outcome based on two or more variables. It is an extension of linear regression and is also known as multiple regression. The variable to be predicted is the dependent variable, and
the variables used to predict the value of the dependent variable are known as independent
or explanatory variables.
Multiple linear regression enables analysts to determine the variation explained by the model and each independent variable's relative contribution. Multiple regression is of two types: linear and non-linear regression.
WEEK-8
REGRESSION MODEL FOR PREDICTION
Aim: Write a script to apply regression model techniques to predict data.
a. Apply regression model techniques to predict the data on the above dataset.
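A sketch for this experiment, assuming the admissions data frame and the multi_model fit from the previous week; the two new student records are hypothetical:

# Predict GRE scores for two hypothetical students with the fitted model
new_students <- data.frame(gpa  = c(3.2, 3.8),
                           rank = factor(c(2, 1), levels = levels(admissions$rank)))
predict(multi_model, newdata = new_students, interval = "prediction")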
Theory:
In R programming, predictive models are extremely useful for forecasting future outcomes
and estimating metrics that are impractical to measure. For example, data scientists could
use predictive models to forecast crop yields based on rainfall and temperature, or to
determine whether patients with certain traits are more likely to react badly to a new
medication.
WEEK-9
CLASSIFICATION MODEL
Aim: Write an R script to install the required packages and classify a given problem.
a. Install relevant packages for classification.
install.packages("rpart.plot")
install.packages("tree")
install.packages("ISLR")
install.packages("rattle")
library(tree)
library(ISLR)
library(rpart.plot)
library(rattle)
b. Choose a classifier for the classification problem.
SOURCE CODE:
> library(caret)                 # provides createDataPartition(); install.packages("caret") if needed
> Hitters <- na.omit(Hitters)    # drop players with a missing Salary
> tree.fit <- tree(Salary ~ Hits + Years, data = Hitters)
> summary(tree.fit)

Regression tree:
tree(formula = Salary ~ Hits + Years, data = Hitters)
Number of terminal nodes:  8
Residual mean deviance:  101200 = 25820000 / 255
Distribution of residuals:
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -1238.00  -157.50   -38.84     0.00    76.83  1511.00

> plot(tree.fit)
> text(tree.fit, cex = 0.8)

# Split the data into training and test sets
> split <- createDataPartition(y = Hitters$Salary, p = 0.5, list = FALSE)
> train <- Hitters[split, ]
> test <- Hitters[-split, ]

# Create tree model
> trees <- tree(Salary ~ ., train)
> plot(trees)
> text(trees, pretty = 0)
OUTPUT:
SOURCE CODE:
#Cross validate to see whether pruning the tree will improve performance
> cv.trees <- cv.tree(trees)
> plot(cv.trees)
> prune.trees <- prune.tree(trees, best=4)
> plot(prune.trees)
> text(prune.trees, pretty=0)
OUTPUT:
SOURCE CODE:
> yhat <- predict(prune.trees, test)
> plot(yhat, test$Salary)
> abline(0, 1)
> mean((yhat - test$Salary)^2)
OUTPUT:
[1] 150179.7
Theory:
In classification in R, we try to predict a target class. The possible classes are already known, and so are the properties that identify each class. The algorithm needs to identify which class a given data object belongs to.
WEEK-10
CLUSTERING MODEL
Aim: Write a script to apply clustering algorithms for unsupervised classification.
a. Clustering algorithms for unsupervised classification.
SOURCE CODE:
> library(cluster)
> set.seed(20)
> irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
# nstart = 20 means that R tries 20 different random starting assignments and then
# selects the one with the lowest within-cluster variation.
> irisCluster
OUTPUT:
Cluster means:
  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[42] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2
[83] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3
[124] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3
Within cluster sum of squares by cluster:
[1] 2.02200 13.05769 16.29167
(between_SS / total_SS = 94.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
SOURCE CODE:
> library(ggplot2)   # needed for ggplot() and geom_point()
> irisCluster$cluster <- as.factor(irisCluster$cluster)
> ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
OUTPUT:
SOURCE CODE:
> d <- dist(as.matrix(mtcars)) # find distance matrix
> hc <- hclust(d) # apply hierarchical clustering
> plot(hc) # plot the dendrogram
OUTPUT:
b. Plot the cluster data using R visualizations.
SOURCE CODE:
## generate 25 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 0.5), rnorm(15, 5, 0.5)))
clusplot(pam(x, 2))
OUTPUT:
SOURCE CODE:
## add noise, and try again:
x4 <- cbind(x, rnorm(25), rnorm(25))
clusplot(pam(x4, 2))
OUTPUT:
Theory:
In clustering in R, we try to group similar objects together. The principle behind clustering is that objects within a group are similar to one another, while objects in different groups are dissimilar.