
PROGRAM-10
OBJECTIVE: To perform market basket analysis using Association Rules (Apriori).

Theory
Market Basket Analysis (MBA) is a data mining technique used to discover associations between items. It is widely used in:

• Retail (cross-selling, promotions)
• Online shopping (e.g., Amazon recommendations)
• Healthcare (co-prescribed medications)

Association rules take the form A => B, meaning: "If A happens, B is likely to happen too."

Each rule has three key measures:

• Support: how often the items occur together in the transactions
• Confidence: how often B occurs when A has occurred
• Lift: strength of the rule (a lift above 1 indicates a positive association; the higher the lift, the stronger the relationship)

About the Apriori Algorithm

The Apriori algorithm works in two main steps:

1. Find frequent itemsets that satisfy a minimum support threshold.
2. Generate strong association rules from those itemsets based on minimum confidence and lift.

It is called "Apriori" because it uses prior knowledge of frequent itemsets: every subset of a frequent itemset must itself be frequent.

Example Dataset: Transactions

Let's imagine customers bought these items:

Transaction | Items
1           | Milk, Bread, Butter
2           | Beer, Bread
3           | Milk, Bread, Butter, Beer
4           | Milk, Bread
5           | Butter, Bread
CODE
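A minimal sketch of the Apriori workflow in R using the arules package, with the five example transactions above entered by hand (the support and confidence thresholds are illustrative):

library(arules)

# Encode the five example transactions as a list of item vectors
txn_list <- list(
  c("Milk", "Bread", "Butter"),
  c("Beer", "Bread"),
  c("Milk", "Bread", "Butter", "Beer"),
  c("Milk", "Bread"),
  c("Butter", "Bread")
)
transactions <- as(txn_list, "transactions")

# Mine association rules with minimum support 0.4 and minimum confidence 0.6
rules <- apriori(transactions, parameter = list(supp = 0.4, conf = 0.6))

# Inspect the rules, strongest (highest lift) first
inspect(sort(rules, by = "lift"))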

EXPECTED OUTPUT:
Program-5
Objective
To perform data preprocessing operations using R programming:
- Handle missing data appropriately.
- Apply Min-Max Normalization to scale numerical attributes between 0 and 1.

Theory
Data Preprocessing is a vital step in any data analysis or machine learning project. Real-world datasets
often have inconsistencies, missing values, or features with varying scales. Preprocessing ensures that the
data is clean, standardized, and ready for modeling, improving both model performance and accuracy.

1. Handling Missing Data
- Remove rows or columns with missing values.
- Impute missing values using statistical methods such as the mean, median, or mode.

2. Min-Max Normalization
- Rescales data to fit within a specified range, usually [0, 1].
- Formula: (x - min(x)) / (max(x) - min(x))
- Useful for distance-based and gradient-based algorithms.

Key Points
- Check and handle missing values before normalization.
- Mean imputation is common for numerical data.
- Min-Max normalization preserves relative scaling.
- Improves model stability and fairness across features.

Flowchart: Data Preprocessing Steps


Start
-> Load Dataset
-> Check for Missing Values
-> Handle Missing Values
(Remove rows OR Replace with Mean/Median/Mode)
-> Apply Min-Max Normalization
-> Preprocessed Dataset Ready
-> End
Handling Missing Data in R
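A short sketch of both options, assuming a small hypothetical data frame with NA values (the column names are illustrative):

# Hypothetical data frame with missing values
df <- data.frame(age    = c(25, NA, 31, 40, NA),
                 income = c(50000, 62000, NA, 71000, 58000))

# Check how many values are missing per column
colSums(is.na(df))

# Option 1: remove rows that contain any missing value
df_complete <- na.omit(df)

# Option 2: impute each numeric column with its mean
df_imputed <- df
for (col in names(df_imputed)) {
  df_imputed[[col]][is.na(df_imputed[[col]])] <- mean(df_imputed[[col]], na.rm = TRUE)
}
df_imputed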
Applying Min-Max Normalization in R
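A minimal sketch that applies the Min-Max formula above to the imputed data frame from the previous step:

# Min-Max normalization: rescale every column to the range [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

df_norm <- as.data.frame(lapply(df_imputed, normalize))
summary(df_norm)   # each column now lies between 0 and 1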
Complete Combined Workflow
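A combined sketch of the whole flowchart, run on the built-in airquality data set because it contains genuine missing values (the data set choice is illustrative):

# 1. Load dataset and check for missing values
data(airquality)
colSums(is.na(airquality))

# 2. Handle missing values: mean imputation for every column
aq <- airquality
for (col in names(aq)) {
  aq[[col]][is.na(aq[[col]])] <- mean(aq[[col]], na.rm = TRUE)
}

# 3. Apply Min-Max normalization
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
aq_norm <- as.data.frame(lapply(aq, normalize))

# 4. Preprocessed dataset ready
head(aq_norm)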
PROGRAM - 6

OBJECTIVE: To perform dimensionality reduction using PCA on the Houses data set.

Theory

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variance (information) as possible. It transforms the original dataset into a new set of orthogonal (uncorrelated) variables called principal components.

Key Concepts in PCA:

1. Dimensionality Reduction: PCA reduces the number of features (dimensions) in the dataset by creating new dimensions (principal components) that capture the maximum variance.

2. Principal Components (PCs): Principal components are the new axes (directions) in the data space that maximize the variance of the data.

3. Eigenvalues and Eigenvectors: Eigenvectors represent the directions of the new axes (principal components), and their entries are the "loadings" of the original features. The corresponding eigenvalues give the amount of variance captured along each of those directions.

4. Covariance Matrix: PCA starts by calculating the covariance matrix of the dataset to measure how variables change together. The covariance matrix describes the relationships (covariances) between pairs of variables.

Steps in PCA:

• Standardization (optional but recommended): If the features have different units or scales, they need to be standardized so that each feature has a mean of 0 and a standard deviation of 1. This is done using the scale() function in R.

Proportion of Variance Explained:

• The proportion of variance explained by each principal component is given by the eigenvalue of that component divided by the sum of all eigenvalues.

Scree Plot:

• A scree plot is a graphical representation showing the eigenvalues (or variance explained) of each principal component. It helps in selecting the number of principal components to retain based on the "elbow" point, where the variance explained starts to level off.

Biplot:

• A biplot is a 2D or 3D plot that shows the data projected onto the first few principal components. It also shows how the original variables are related to these components.

Applications of PCA:

Dimensionality Reduction:

• PCA is widely used for reducing the dimensionality of large datasets, especially when dealing with high-dimensional data such as images, gene expression data, etc.
Input
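A minimal sketch of the PCA workflow described above; since the Houses data file is not attached here, a hypothetical houses.csv with numeric columns is assumed (the same calls work for any numeric data frame):

# Load the data (hypothetical file name) and keep only numeric columns
houses <- read.csv("houses.csv")
houses_num <- houses[sapply(houses, is.numeric)]

# PCA with standardization (mean 0, standard deviation 1)
pca <- prcomp(houses_num, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
summary(pca)

# Scree plot: variance explained per component, look for the "elbow"
screeplot(pca, type = "lines", main = "Scree Plot")

# Biplot: observations and variable loadings on PC1 vs PC2
biplot(pca)

# Reduced data: scores on the first two principal components
head(pca$x[, 1:2])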

OUTPUT

Program: 07

Objective: To perform Simple Linear Regression with R.


THEORY :
Introduction
Simple Linear Regression is a supervised learning technique that
models the relationship between two continuous variables:
• Independent Variable (X): Predictor
• Dependent Variable (Y): Response
The goal is to find the best-fitting straight line (called the regression line) that can predict the dependent variable from the independent variable.

Equation of a Simple Linear Regression Model:

Y = β0 + β1X + ε

Where:
• Y = Dependent variable (outcome)
• X = Independent variable (input)
• β0 = Intercept (value of Y when X = 0)
• β1 = Slope (change in Y for one unit change in X)
• ε = Error term (difference between observed and predicted Y)

Steps in Simple Linear Regression

1. Load the data containing two continuous variables.
2. Fit a linear model using the lm() function in R.
3. Summarize the model using summary() to check coefficients, R-squared, and p-values.
4. Predict new values using the model.
5. Visualize using a scatter plot and the fitted regression line.

CODE:
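A minimal sketch of the five steps using the built-in cars data set (speed as X, stopping distance as Y); the data set is a stand-in since the practical does not name one:

# 1. Load data containing two continuous variables
data(cars)

# 2. Fit the linear model: dist = b0 + b1 * speed
model <- lm(dist ~ speed, data = cars)

# 3. Summarize: coefficients, R-squared, p-values
summary(model)

# 4. Predict new values
predict(model, newdata = data.frame(speed = c(10, 20)))

# 5. Visualize: scatter plot with the fitted regression line
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping Distance",
     main = "Simple Linear Regression")
abline(model, col = "red", lwd = 2)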
OUTPUT:
Program: 09

Objective: Write an R script to diagnose a disease using KNN classification and plot the results.

THEORY :
Introduction
K-Nearest Neighbors (KNN) is a simple and effective supervised
learning algorithm used for both classification and regression
problems. In classification, KNN predicts the label of a data point
based on the majority label of its k nearest neighbors in the
feature space.

How KNN Works


1. Choose the number of neighbors (k).
2. Calculate the distance (usually Euclidean) between the new
input and all training data points.
3. Select the k closest points.
4. Assign the most common class among those k points to the
input.
Steps in R Script
1. Load Dataset
We use the publicly available Pima Indians Diabetes dataset.
2. Data Preprocessing
• Normalize the feature values between 0 and 1 (important
for distance calculations in KNN).
• Split the dataset into training (70%) and testing (30%) sets.
3. Build and Train KNN Model
• Use the knn() function from the class package.
• Choose k=5 neighbors.
4. Prediction and Evaluation
• Predict outcomes for the test set.
• Calculate confusion matrix and accuracy.
5. Visualization
• Plot two features (Glucose vs BMI) to show how predictions
are classified.
• Use color for predicted labels and shape for actual labels to
visualize correct and incorrect predictions.
CODE:
library(class)      # knn()
library(caret)      # createDataPartition()
library(ggplot2)    # plotting
library(dplyr)

# Load the Pima Indians Diabetes dataset
data <- read.csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")
head(data)

# Min-Max normalization of the eight feature columns
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
data_norm <- as.data.frame(lapply(data[, 1:8], normalize))
data_norm$Outcome <- as.factor(data$Outcome)   # keep class labels

# 70/30 train-test split
set.seed(123)
trainIndex <- createDataPartition(data_norm$Outcome, p = 0.7, list = FALSE)
trainData <- data_norm[trainIndex, ]
testData  <- data_norm[-trainIndex, ]

# Train and predict with k = 5 nearest neighbours
k <- 5
knn_pred <- knn(train = trainData[, -9],
                test  = testData[, -9],
                cl    = trainData$Outcome,
                k     = k)

# Confusion matrix and accuracy
conf_matrix <- table(Predicted = knn_pred, Actual = testData$Outcome)
print(conf_matrix)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy: ", round(accuracy * 100, 2), "%\n")

# Visualization: colour = predicted label, shape = actual label
testData$Predicted <- knn_pred
testData$Actual    <- testData$Outcome   # column used for the shape aesthetic

ggplot(testData, aes(x = Glucose, y = BMI, color = Predicted, shape = Actual)) +
  geom_point(size = 4) +
  labs(title = paste("KNN Classification (k =", k, ")"),
       x = "Glucose",
       y = "BMI") +
  theme_minimal()
OUTPUT:
Program: 08

Objective: To perform K-Means clustering and visualize the results for the Iris data set.

THEORY : Clustering is an unsupervised machine learning
technique that groups data points into clusters based on
similarity. K-Means clustering is one of the most popular
clustering algorithms, where K refers to the number of clusters
to form.
The Iris dataset is a classic dataset in machine learning,
containing 150 observations of iris flowers from three species
(setosa, versicolor, and virginica). Each observation includes four
features: sepal length, sepal width, petal length, and petal
width.
In this experiment, we perform K-Means clustering on the Iris
dataset and visualize the results to understand how the data is
grouped.

Steps to Perform K-Means Clustering


1. Load the Dataset
The Iris dataset can be loaded using built-in R libraries like
datasets. Only the numerical columns (features) are used for
clustering, not the species labels.
2. Preprocessing
We remove the Species column to avoid supervised learning
behavior, as clustering should work without knowing the true
labels.
3. Apply K-Means Clustering
We apply the kmeans() function in R:
• centers = 3 because there are three species.
• nstart = 20 to choose the best clustering result out of 20
random initializations.
• The algorithm partitions the data into 3 clusters by
minimizing the within-cluster sum of squares.
4. Analyze Results
The clustering results provide:
• Cluster centers (mean values for each cluster).
• Cluster assignments (which observation belongs to which
cluster).
• Cluster sizes.
• Total within-cluster and between-cluster sum of squares.
5. Visualization
We visualize the clusters using a scatter plot:
• Choose two significant features (e.g., Petal.Length and
Petal.Width).
• Color points based on their cluster assignments.
CODE:
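A minimal sketch following the steps above with the built-in iris data set; plotting Petal.Length against Petal.Width matches the visualization step:

# 1. Load the dataset and drop the Species label
data(iris)
iris_features <- iris[, 1:4]

# 2-3. Apply K-Means with 3 centers and 20 random starts
set.seed(123)
km <- kmeans(iris_features, centers = 3, nstart = 20)

# 4. Analyze results: centers, sizes, assignments, sums of squares
km$centers
km$size
km$tot.withinss
km$betweenss
table(Cluster = km$cluster, Species = iris$Species)

# 5. Visualize clusters on Petal.Length vs Petal.Width
plot(iris$Petal.Length, iris$Petal.Width,
     col = km$cluster, pch = 19,
     xlab = "Petal Length", ylab = "Petal Width",
     main = "K-Means Clustering of Iris (k = 3)")
points(km$centers[, "Petal.Length"], km$centers[, "Petal.Width"],
       col = 1:3, pch = 8, cex = 2)   # cluster centers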
OUTPUT:
