
PROGRAM-10
OBJECTIVE: To perform market basket analysis using Association Rules (Apriori).

Theory
Market Basket Analysis (MBA) is a data mining technique used to discover associations between items. It is widely used in:

• Retail (cross-selling, promotions)
• Online shopping (e.g., Amazon recommendations)
• Healthcare (co-prescribed medications)

Association rules take the form A => B, meaning: "If A happens, B is likely to happen too."

Each rule has three key measures:

• Support: how often the items occur together in the transactions
• Confidence: how often B occurs when A has occurred
• Lift: strength of the rule (a lift above 1 indicates a positive association; the higher the lift, the stronger the relationship)

About the Apriori Algorithm

The Apriori algorithm works in two main steps:

1. Find frequent itemsets that satisfy a minimum support threshold.
2. Generate strong association rules from those itemsets based on minimum confidence and lift.

It is called "Apriori" because it uses prior knowledge of frequent itemsets: every subset of a frequent itemset must itself be frequent.

Example Dataset: Transactions

Let's imagine customers bought these items:

Transaction | Items
1           | Milk, Bread, Butter
2           | Beer, Bread
3           | Milk, Bread, Butter, Beer
4           | Milk, Bread
5           | Butter, Bread
CODE
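A minimal sketch of the Apriori workflow in R using the arules package, with the five example transactions above entered by hand (the support and confidence thresholds are illustrative):

library(arules)

# Encode the five example transactions as a list of item vectors
txn_list <- list(
  c("Milk", "Bread", "Butter"),
  c("Beer", "Bread"),
  c("Milk", "Bread", "Butter", "Beer"),
  c("Milk", "Bread"),
  c("Butter", "Bread")
)
transactions <- as(txn_list, "transactions")

# Mine association rules with minimum support 0.4 and minimum confidence 0.6
rules <- apriori(transactions, parameter = list(supp = 0.4, conf = 0.6))

# Inspect the rules, strongest (highest lift) first
inspect(sort(rules, by = "lift"))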

EXPECTED OUTPUT:
Program-5
Objective
To perform data preprocessing operations using R programming:
- Handle missing data appropriately.
- Apply Min-Max Normalization to scale numerical attributes between 0 and 1.

Theory
Data Preprocessing is a vital step in any data analysis or machine learning project. Real-world datasets
often have inconsistencies, missing values, or features with varying scales. Preprocessing ensures that the
data is clean, standardized, and ready for modeling, improving both model performance and accuracy.

1. Handling Missing Data
- Remove rows or columns with missing values.
- Impute missing values using statistical methods such as the mean, median, or mode.

2. Min-Max Normalization
- Rescales data to fit within a specified range, usually [0, 1].
- Formula: (x - min(x)) / (max(x) - min(x))
- Useful for distance-based and gradient-based algorithms.

Key Points
- Check and handle missing values before normalization.
- Mean imputation is common for numerical data.
- Min-Max normalization preserves relative scaling.
- Improves model stability and fairness across features.

Flowchart: Data Preprocessing Steps


Start
-> Load Dataset
-> Check for Missing Values
-> Handle Missing Values
(Remove rows OR Replace with Mean/Median/Mode)
-> Apply Min-Max Normalization
-> Preprocessed Dataset Ready
-> End
Handling Missing Data in R
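A short sketch of both options, assuming a small hypothetical data frame with NA values (the column names are illustrative):

# Hypothetical data frame with missing values
df <- data.frame(age    = c(25, NA, 31, 40, NA),
                 income = c(50000, 62000, NA, 71000, 58000))

# Check how many values are missing per column
colSums(is.na(df))

# Option 1: remove rows that contain any missing value
df_complete <- na.omit(df)

# Option 2: impute each numeric column with its mean
df_imputed <- df
for (col in names(df_imputed)) {
  df_imputed[[col]][is.na(df_imputed[[col]])] <- mean(df_imputed[[col]], na.rm = TRUE)
}
df_imputed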
Applying Min-Max Normalization in R
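A minimal sketch that applies the Min-Max formula above to the imputed data frame from the previous step:

# Min-Max normalization: rescale every column to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

df_norm <- as.data.frame(lapply(df_imputed, normalize))
summary(df_norm)   # each column now lies between 0 and 1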
Complete Combined Workflow
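A combined sketch of the whole flowchart, run on the built-in airquality data set because it contains genuine missing values (the data set choice is illustrative):

# 1. Load dataset and check for missing values
data(airquality)
colSums(is.na(airquality))

# 2. Handle missing values: mean imputation for every column
aq <- airquality
for (col in names(aq)) {
  aq[[col]][is.na(aq[[col]])] <- mean(aq[[col]], na.rm = TRUE)
}

# 3. Apply Min-Max normalization
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
aq_norm <- as.data.frame(lapply(aq, normalize))

# 4. Preprocessed dataset ready
head(aq_norm)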
PROGRAM - 6

OBJECTIVE: To perform dimensionality reduction using PCA on the Houses data set.

Theory

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variance (information) as possible. It transforms the original dataset into a new set of orthogonal (uncorrelated) variables called principal components.

Key Concepts in PCA:

1. Dimensionality Reduction: PCA reduces the number of features (dimensions) in the dataset by creating new dimensions (principal components) that capture the maximum variance.

2. Principal Components (PCs): Principal components are the new axes (directions) in the data space that maximize the variance of the data.

3. Eigenvalues and Eigenvectors: Eigenvectors represent the directions of the new axes (principal components), and their entries are the "loadings" of the original features. The corresponding eigenvalues give the amount of variance captured along each of those directions.

4. Covariance Matrix: PCA starts by calculating the covariance matrix of the dataset to measure how variables change together. The covariance matrix describes the relationships (covariances) between pairs of variables.

Steps in PCA:

• Standardization (optional but recommended): If the features have different units or scales, they need to be standardized so that each feature has a mean of 0 and a standard deviation of 1. This is done using the scale() function in R.

Proportion of Variance Explained:

• The proportion of variance explained by each principal component is given by the eigenvalue of that component divided by the sum of all eigenvalues.

Scree Plot:

• A scree plot is a graphical representation showing the eigenvalues (or variance explained) of each principal component. It helps in selecting the number of principal components to retain based on the "elbow" point, where the variance explained starts to level off.

Biplot:

• A biplot is a 2D or 3D plot that shows the data projected onto the first few principal components. It also shows how the original variables are related to these components.

Applications of PCA:

Dimensionality Reduction:

• PCA is widely used for reducing the dimensionality of large datasets, especially when dealing with high-dimensional data such as images, gene expression data, etc.
Input
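A minimal sketch of the PCA workflow described above; since the Houses data file is not attached here, a hypothetical houses.csv with numeric columns is assumed (the same calls work for any numeric data frame):

# Load the data (hypothetical file name) and keep only numeric columns
houses <- read.csv("houses.csv")
houses_num <- houses[sapply(houses, is.numeric)]

# PCA with standardization (mean 0, standard deviation 1)
pca <- prcomp(houses_num, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(pca)

# Scree plot: variance explained per component, look for the "elbow"
screeplot(pca, type = "lines", main = "Scree Plot")

# Biplot: observations and variable loadings on PC1 vs PC2
biplot(pca)

# Reduced data: scores on the first two principal components
head(pca$x[, 1:2])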

OUTPUT

Program: 07

Objective: To perform Simple Linear Regression with R.


THEORY :
Introduction
Simple Linear Regression is a supervised learning technique that
models the relationship between two continuous variables:
• Independent Variable (X): Predictor
• Dependent Variable (Y): Response
The goal is to find the best-fitting straight line (called the regression line) that can predict the dependent variable from the independent variable.

Equation of a Simple Linear Regression Model:

Y = β0 + β1X + ε

Where:
• Y = Dependent variable (outcome)
• X = Independent variable (input)
• β0 = Intercept (value of Y when X = 0)
• β1 = Slope (change in Y for one unit change in X)
• ε = Error term (difference between observed and predicted Y)

Steps in Simple Linear Regression

1. Load the data containing two continuous variables.
2. Fit a linear model using the lm() function in R.
3. Summarize the model using summary() to check coefficients, R-squared, and p-values.
4. Predict new values using the model.
5. Visualize using a scatter plot and the fitted regression line.

CODE:
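A minimal sketch of the five steps using the built-in cars data set (speed as X, stopping distance as Y); the data set is a stand-in since the practical does not name one:

# 1. Load data containing two continuous variables
data(cars)

# 2. Fit the linear model: dist = b0 + b1 * speed
model <- lm(dist ~ speed, data = cars)

# 3. Summarize: coefficients, R-squared, p-values
summary(model)

# 4. Predict new values
predict(model, newdata = data.frame(speed = c(10, 20)))

# 5. Visualize: scatter plot with the fitted regression line
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping Distance",
     main = "Simple Linear Regression")
abline(model, col = "red", lwd = 2)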
OUTPUT:
Program: 09

Objective: Write an R script to diagnose a disease using KNN classification and plot the results.

THEORY :
Introduction
K-Nearest Neighbors (KNN) is a simple and effective supervised
learning algorithm used for both classification and regression
problems. In classification, KNN predicts the label of a data point
based on the majority label of its k nearest neighbors in the
feature space.

How KNN Works


1. Choose the number of neighbors (k).
2. Calculate the distance (usually Euclidean) between the new
input and all training data points.
3. Select the k closest points.
4. Assign the most common class among those k points to the
input.
Steps in R Script
1. Load Dataset
We use the publicly available Pima Indians Diabetes dataset.
2. Data Preprocessing
• Normalize the feature values between 0 and 1 (important
for distance calculations in KNN).
• Split the dataset into training (70%) and testing (30%) sets.
3. Build and Train KNN Model
• Use the knn() function from the class package.
• Choose k=5 neighbors.
4. Prediction and Evaluation
• Predict outcomes for the test set.
• Calculate confusion matrix and accuracy.
5. Visualization
• Plot two features (Glucose vs BMI) to show how predictions
are classified.
• Use color for predicted labels and shape for actual labels to
visualize correct and incorrect predictions.
CODE:
library(class)      # knn()
library(caret)      # createDataPartition()
library(ggplot2)    # plotting
library(dplyr)

# Load the Pima Indians Diabetes dataset
data <- read.csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")
head(data)

# Min-Max normalization of the eight feature columns
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
data_norm <- as.data.frame(lapply(data[, 1:8], normalize))
data_norm$Outcome <- as.factor(data$Outcome)   # keep class labels

# 70/30 train-test split
set.seed(123)
trainIndex <- createDataPartition(data_norm$Outcome, p = 0.7, list = FALSE)
trainData <- data_norm[trainIndex, ]
testData  <- data_norm[-trainIndex, ]

# Train and predict with k = 5 nearest neighbours
k <- 5
knn_pred <- knn(train = trainData[, -9],
                test  = testData[, -9],
                cl    = trainData$Outcome,
                k     = k)

# Confusion matrix and accuracy
conf_matrix <- table(Predicted = knn_pred, Actual = testData$Outcome)
print(conf_matrix)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy: ", round(accuracy * 100, 2), "%\n")

# Visualization: colour = predicted label, shape = actual label
testData$Predicted <- knn_pred
testData$Actual    <- testData$Outcome   # column used for the shape aesthetic

ggplot(testData, aes(x = Glucose, y = BMI, color = Predicted, shape = Actual)) +
  geom_point(size = 4) +
  labs(title = paste("KNN Classification (k =", k, ")"),
       x = "Glucose",
       y = "BMI") +
  theme_minimal()
OUTPUT:
Program: 08

Objective: To perform K-Means clustering and visualize the results for the Iris data set.

THEORY : Clustering is an unsupervised machine learning
technique that groups data points into clusters based on
similarity. K-Means clustering is one of the most popular
clustering algorithms, where K refers to the number of clusters
to form.
The Iris dataset is a classic dataset in machine learning,
containing 150 observations of iris flowers from three species
(setosa, versicolor, and virginica). Each observation includes four
features: sepal length, sepal width, petal length, and petal
width.
In this experiment, we perform K-Means clustering on the Iris
dataset and visualize the results to understand how the data is
grouped.

Steps to Perform K-Means Clustering


1. Load the Dataset
The Iris dataset can be loaded using built-in R libraries like
datasets. Only the numerical columns (features) are used for
clustering, not the species labels.
2. Preprocessing
We remove the Species column to avoid supervised learning
behavior, as clustering should work without knowing the true
labels.
3. Apply K-Means Clustering
We apply the kmeans() function in R:
• centers = 3 because there are three species.
• nstart = 20 to choose the best clustering result out of 20
random initializations.
• The algorithm partitions the data into 3 clusters by
minimizing the within-cluster sum of squares.
4. Analyze Results
The clustering results provide:
• Cluster centers (mean values for each cluster).
• Cluster assignments (which observation belongs to which
cluster).
• Cluster sizes.
• Total within-cluster and between-cluster sum of squares.
5. Visualization
We visualize the clusters using a scatter plot:
• Choose two significant features (e.g., Petal.Length and
Petal.Width).
• Color points based on their cluster assignments.
CODE:
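A minimal sketch following the steps above with the built-in iris data set; plotting Petal.Length against Petal.Width matches the visualization step:

# 1. Load the dataset and drop the Species label
data(iris)
iris_features <- iris[, 1:4]

# 2-3. Apply K-Means with 3 centers and 20 random starts
set.seed(123)
km <- kmeans(iris_features, centers = 3, nstart = 20)

# 4. Analyze results: centers, sizes, assignments, sums of squares
km$centers
km$size
km$tot.withinss
km$betweenss
table(Cluster = km$cluster, Species = iris$Species)

# 5. Visualize clusters on Petal.Length vs Petal.Width
plot(iris$Petal.Length, iris$Petal.Width,
     col = km$cluster, pch = 19,
     xlab = "Petal Length", ylab = "Petal Width",
     main = "K-Means Clustering of Iris (k = 3)")
points(km$centers[, "Petal.Length"], km$centers[, "Petal.Width"],
       col = 1:3, pch = 8, cex = 2)   # cluster centers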
OUTPUT:
