Module 2 – Data Analytics Methods
Introduction to R
• R is a programming language and software framework
for statistical analysis and graphics.
DATA TYPES
R has five basic or atomic object classes:
1. Character – “hello”
2. Numeric – 10.5
3. Integer – 50L
4. Complex – 3+3i
5. Logical – TRUE/FALSE
Each object has specific attributes like:
• names
• dim
• class
• length
• Other attributes defined by the user
E.g.:
x <- list(age = c(10, 20, 30), weight = c(60, 50, 70))
• A list is a collection of objects that can be of various
types, including other lists.
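For instance, the attributes of the list x above can be inspected with base R functions (a small illustrative sketch):
x <- list(age = c(10, 20, 30), weight = c(60, 50, 70))
class(x)        # "list"
length(x)       # 2 (two components: age and weight)
names(x)        # "age" "weight"
attributes(x)   # all attributes of x; here only $names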
VECTORS AND LISTS
• A common way of creating a vector is to use the c() function.
• Indexing in R starts from 1, not 0.
E.g.:
x <- c("this", "is", "a", "vector")        # character vector
x <- c(1, "hello", 3, 4, 5)                # mixed values are coerced to character
y <- c(1, 2, 3, 45)                        # numeric vector
x <- vector(mode = "logical", length = 4)  # logical vector of length 4
as.logical(x)   # the as.* functions convert (coerce) an object to another data type
• A vector stores elements of the same data type.
• A list can store elements of different data types.
• X <- list("hello", 1, TRUE)
• i <- 5.5                # a numeric (double) value
• is.integer(i)           # returns FALSE
• j <- as.integer(i)      # coerces the contents of i into an integer
• is.integer(j)           # returns TRUE
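A short sketch contrasting vector and list indexing (the values are illustrative):
v <- c(1, 2, 3)               # atomic vector: all elements share one type
l <- list("hello", 1, TRUE)   # list: elements can have different types
v[2]            # 2 (vectors are indexed with [ ])
l[[1]]          # "hello" ([[ ]] extracts a single list element)
class(v[2])     # "numeric"
class(l[[1]])   # "character"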
Connecting Databases
• install.packages("RODBC")
• library(RODBC)
• conn <- odbcConnect("DSN", uid = "user", pwd = "password")
• housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
                                  from housing where hinc > 1000000")
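Once the results have been retrieved, the connection can be closed (a minimal sketch; the DSN, credentials, and table above are placeholders):
head(housing_data)   # inspect the first rows returned by the query
odbcClose(conn)      # release the ODBC connection when finished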
To save plots in desired format
jpeg(file="c:/data/sales_hist.jpeg") # create a new jpeg file
hist(sales$num_of_orders) # export histogram to jpeg
dev.off() # shut off the graphic device
Other graphics devices: png(), bmp(), pdf(), etc.
• v <- 1:5           # create a vector 1 2 3 4 5
• v                  # returns 1 2 3 4 5
• sum(v)             # returns 15
• w <- v * 2         # create a vector 2 4 6 8 10
• w                  # returns 2 4 6 8 10
• w[3]               # returns 6 (the 3rd element of w)
• z <- v + w         # sums two vectors element by element
• z                  # returns 3 6 9 12 15
• z > 8              # returns FALSE FALSE TRUE TRUE TRUE
• z[z > 8]           # returns 9 12 15
• z[z > 8 | z < 5]   # returns 3 9 12 15 ("|" denotes "or")
Matrix
• A two-dimensional array is known as a matrix.
sales_matrix <- matrix(0, nrow = 3, ncol = 4)
• sales_matrix
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
• M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)
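A brief sketch of inspecting and operating on the matrix M defined above:
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)
dim(M)      # 3 3 (the dim attribute)
M[2, 3]     # element in row 2, column 3
t(M)        # transpose of M
M %*% M     # matrix multiplication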
Contingency Tables
• In R, table refers to a class of objects used to store the
observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency
table.
• It is the basis for performing a statistical test on the
independence of the factors used to build the table.
sales_table <- table(sales$gender,sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586
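Such a test of independence can be carried out with chisq.test() (a minimal sketch, assuming the sales data frame above is already loaded):
sales_table <- table(sales$gender, sales$spender)
chisq.test(sales_table)   # chi-squared test of independence of gender and spender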
Exploratory data analysis
Step 1: Install and Load Necessary Libraries
install.packages(c("dplyr", "ggplot2", "tidyr",
"summarytools", "corrplot"))
Load the packages
• library(dplyr) # For data manipulation
• library(ggplot2) # For visualization
• library(tidyr) # For data tidying (reshape data)
• library(summarytools) # For descriptive statistics
• library(corrplot) # For correlation plots
• Step 2: Load the Dataset
my_data <- read.csv("path_to_your_dataset.csv")
View the first few rows of the dataset - head(my_data)
• Install and load readxl if working with Excel files
install.packages("readxl")
library(readxl)
Load Excel file
my_data <- read_excel("path_to_your_file.xlsx")
• Step 3: Check Basic Information About the Dataset
View the structure of the data
str(my_data)
Get summary statistics (min, max, mean, etc.)
summary(my_data)
Check for missing values
sum(is.na(my_data))
Get column names
colnames(my_data)
• Step 4: Clean and Preprocess the Data
1. Handle Missing Values:
my_data$column_name[is.na(my_data$column_name)] <- mean(my_data$column_name, na.rm = TRUE)
2. Remove Duplicates:
my_data_clean <- my_data[!duplicated(my_data), ]
4. Renaming Columns:
colnames(my_data)[colnames(my_data) == "old_name"] <- "new_name"
• Step 5: Descriptive Statistics and Summary
Summary statistics for numeric variables
summary(my_data)
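The packages loaded in Step 1 can then be used for richer summaries and plots (a sketch; numeric_column is a placeholder column name):
library(summarytools)
library(ggplot2)
library(corrplot)

descr(my_data)                               # detailed descriptive statistics (summarytools)

ggplot(my_data, aes(x = numeric_column)) +   # histogram of one numeric column
  geom_histogram(bins = 30)

corr_matrix <- cor(my_data[sapply(my_data, is.numeric)], use = "complete.obs")
corrplot(corr_matrix)                        # correlation plot of the numeric columns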
Hypothesis testing
• The difference in means can be tested using Student’s t-test
or the Welch’s t-test.
Student’s t-test assumes that the two populations are normally distributed with equal but unknown variances; the test statistic T follows a t-distribution with n1 + n2 - 2 degrees of freedom.
Worked example:
Sample I : 24 27 26 21 25        (n1 = 5, x̄1 = 24.6)
Sample II: 27 30 28 31 22 36     (n2 = 6, x̄2 = 29)

  x1    x2    (x1 - x̄1)²   (x2 - x̄2)²
  24    27       0.36           4
  27    30       5.76           1
  26    28       1.96           1
  21    31      12.96           4
  25    22       0.16          49
   -    36        -            49
              Σ = 21.2       Σ = 108

S1² = 21.2 / 5 = 4.24
S2² = 108 / 6 = 18
Pooled variance S² = ((n1-1) × S1² + (n2-1) × S2²) / (n1 + n2 - 2)
                   = ((4 × 4.24) + (5 × 18)) / (5 + 6 - 2)
                   = 11.88
• Procedure:
1. Null hypothesis H0 : µ1 = µ2
2. Alternative hypothesis HA : µ1 ≠ µ2
3. Level of significance: α = 0.05
   Degrees of freedom, v = n1 + n2 - 2 = 9
   Table value of t at the 5% level (two-tailed, 9 d.f.) is 2.262
4. Calculated value:
   T = (x̄1 - x̄2) / sqrt( S² (1/n1 + 1/n2) )
     = (24.6 - 29) / sqrt( 11.88 × (1/5 + 1/6) )
     = -2.108
5. Conclusion:
   Compare |calculated value| with the table value: |-2.108| = 2.108 < 2.262.
   The calculated value lies within the acceptance region, hence H0 is accepted.
   Thus the two samples can be regarded as coming from populations with the same mean.
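A minimal R sketch of the same comparison using the built-in t.test() function and the two samples above. Note that t.test() pools the sample variances with the standard (n-1)-denominator formula, so its t value may differ slightly from the hand calculation:
sample1 <- c(24, 27, 26, 21, 25)
sample2 <- c(27, 30, 28, 31, 22, 36)

t.test(sample1, sample2, var.equal = TRUE)   # Student's t-test (pooled variance)
t.test(sample1, sample2)                     # Welch's t-test (default, unequal variances)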
Clustering
• It is a method often used for exploratory analysis of the
data.
• Clustering methods find similarities between objects according to the object attributes and group similar objects into clusters.
• Clustering techniques are utilized in marketing,
economics, and various branches of science.
• A popular clustering method is k-means.
K-Means
• Given a collection of objects, each with n measurable attributes, k-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects’ proximity to the centers of the k groups.
• The center is determined as the mean of each cluster’s
n-dimensional vector of attributes.
• The within-cluster sum of squares (WSS) metric is examined to determine a reasonably optimal value of k.
Steps involved in K-means
1. Choose the value of k and the k initial guesses for the
centroids.
2. Compute the distance from each data point (xi,yi) to each
centroid. Assign each point to the closest centroid. This
association defines the first k clusters.
The distance from a point (xi, yi) to a centroid (cx, cy) is calculated using the Euclidean distance formula: d = sqrt((xi - cx)² + (yi - cy)²).
3. Compute the centroid of each cluster as the mean of the points currently assigned to it.
4. Repeat steps 2 and 3 until the cluster assignments no longer change (convergence), as sketched below.
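A minimal R sketch of this procedure using the built-in kmeans() function; the two-column data frame and the choice k = 3 are illustrative assumptions, not part of the original example:
set.seed(42)                                     # reproducible random initial centroids
pts <- data.frame(x = rnorm(60), y = rnorm(60))  # illustrative 2-D data

km <- kmeans(pts, centers = 3)   # k = 3 clusters
km$centers                       # final centroids (cluster means)
km$cluster                       # cluster assignment of each point
km$tot.withinss                  # within sum of squares (WSS)

# Examine WSS over several values of k to pick a reasonable k (elbow method)
wss <- sapply(1:6, function(k) kmeans(pts, centers = k)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "WSS")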
Apriori – Frequent Itemset Generation (Example)
• Each candidate itemset Ck is counted by scanning the transaction database D; the frequent itemset Lk keeps only the candidates whose support count meets the minimum support count (here, 3).

Scan D for the count of each candidate → C1:
Item_Set   Support Count
{M}        3
{O}        4
{N}        2
{K}        5
{E}        4
{Y}        3
{D}        1
{A}        1
{U}        1
{C}        2
{I}        1

Compare candidate support counts with the minimum support count → L1:
Item_Set   Support Count
{M}        3
{O}        4
{K}        5
{E}        4
{Y}        3

Generate C2 candidates from L1 and scan D for the count of each candidate → C2:
Item_Set   Support Count
{M,O}      1
{M,K}      3
{M,E}      2
{M,Y}      2
{O,K}      3
{O,E}      3
{O,Y}      2
{K,E}      4
{K,Y}      3
{E,Y}      2

Compare candidate support counts with the minimum support count → L2:
Item_Set   Support Count
{M,K}      3
{O,K}      3
{O,E}      3
{K,E}      4
{K,Y}      3

Generate C3 candidates from L2 and scan D for the count of each candidate → C3:
Item_Set   Support Count
{M,K,O}    1
{M,K,E}    2
{M,K,Y}    2
{O,K,E}    3
{O,K,Y}    2
{K,E,Y}    2

Compare candidate support counts with the minimum support count → L3:
Item_Set   Support Count
{O,K,E}    3
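A minimal sketch of mining frequent itemsets in R with the arules package. The transaction database D is not listed in these notes, so the five transactions below are illustrative; the support threshold 0.6 mirrors a minimum support count of 3 out of 5 transactions, and the resulting itemsets may therefore differ slightly from the tables above:
install.packages("arules")           # if not already installed
library(arules)

trans_list <- list(c("M","O","N","K","E","Y"),
                   c("D","O","N","K","E","Y"),
                   c("M","A","K","E"),
                   c("M","U","C","K","Y"),
                   c("C","O","K","I","E"))
trans <- as(trans_list, "transactions")   # convert the list to a transactions object

freq_itemsets <- apriori(trans,
                         parameter = list(supp = 0.6,
                                          target = "frequent itemsets"))
inspect(freq_itemsets)                    # view the frequent itemsets and their support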