
ITDX45 - Big Data Analytics
Module 2 - Data Analytic Methods
Introduction to R
• R is a programming language and software framework
for statistical analysis and graphics.
DATA TYPES
R has five basic or atomic object classes:
1. Character – “hello”
2. Numeric – 10.5
3. Integer – 50L
4. Complex – 3+3i
5. Logical – TRUE/FALSE
Each object can have attributes such as:
• names
• dim
• class
• length
• Other attributes defined by the user
Eg:
x<-list(age=c(10,20,30),weight=c(60,50,70))
• A list is a collection of objects that can be of various
types, including other lists.
VECTORS AND LISTS
• One way of creating a vector is by using the c() function.
• Indexing in R starts at 1, not 0.
Eg:
X <- c("this", "is", "a", "vector")
X <- c(1, "hello", 3, 4, 5) # mixed types are coerced to character
Y <- c(1, 2, 3, 45)
X <- vector(mode = "logical", length = 4)
as.logical(x) # used for conversion of datatype
• Vector is used to store the same data type
• List is used to store different data types
• X <- list("hello", 1, TRUE)
• i <- 2 # numeric by default, not integer
• is.integer(i) # returns FALSE
• j <- as.integer(i) # coerces contents of i into an integer
• is.integer(j) # returns TRUE
Connecting Databases
• install.packages("RODBC")
• library(RODBC)
• conn <- odbcConnect("DSN", uid="user", pwd="password")
• housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms from housing where hinc > 1000000")
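When the analysis is finished, the connection can be released; odbcClose() is part of the same RODBC package:
• odbcClose(conn) # close the ODBC connection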
To save plots in a desired format:
jpeg(file="c:/data/sales_hist.jpeg") # create a new jpeg file
hist(sales$num_of_orders) # export histogram to jpeg
dev.off() # shut off the graphic device
Other functions: png(), bmp(), pdf(), etc.
• v <- 1:5 # create a vector 1 2 3 4 5
• v # returns 1 2 3 4 5
• sum(v) # returns 15
• w <- v * 2 # create a vector 2 4 6 8 10
• w # returns 2 4 6 8 10
• w[3] # returns 6 (the 3rd element of w)
• z <- v + w # sums two vectors element by element
• z # returns 3 6 9 12 15
• z > 8 # returns FALSE FALSE TRUE TRUE TRUE
• z[z > 8] # returns 9 12 15
• z[z > 8 | z < 5] # returns 3 9 12 15 ("|" denotes "or")
Matrix
• A two-dimensional array is known as a matrix.
sales_matrix <- matrix(0, nrow = 3, ncol = 4)
• sales_matrix
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
• M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow = 3, ncol = 3) # values fill column by column by default
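Individual elements and rows are accessed with [row, column] indexing; for the matrix M above:
• M[1, 2] # returns 5 (row 1, column 2)
• M[2, ] # returns row 2: 3 0 3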
Contingency Tables
• In R, table refers to a class of objects used to store the
observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency
table.
• It is the basis for performing a statistical test on the
independence of the factors used to build the table.
sales_table <- table(sales$gender,sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586
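The independence test mentioned above can be run directly on the table object; a minimal sketch using R's built-in chi-squared test:
chisq.test(sales_table) # tests whether gender and spender category are independent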
Exploratory data analysis
Step 1: Install and Load Necessary Libraries
install.packages(c("dplyr", "ggplot2", "tidyr",
"summarytools", "corrplot"))
Load the packages
• library(dplyr) # For data manipulation
• library(ggplot2) # For visualization
• library(tidyr) # For data tidying (reshape data)
• library(summarytools) # For descriptive statistics
• library(corrplot) # For correlation plots
• Step 2: Load the Dataset
my_data <- read.csv("path_to_your_dataset.csv")
View the first few rows of the dataset - head(my_data)
• Install and load readxl if working with Excel files
install.packages("readxl")
library(readxl)
Load Excel file
my_data <- read_excel("path_to_your_file.xlsx")
• Step 3: Check Basic Information About the
Dataset
View the structure of the data
str(my_data)
Get summary statistics (min, max, mean, etc.)
summary(my_data)
Check for missing values
sum(is.na(my_data))
Get column names
colnames(my_data)
• Step 4: Clean and Preprocess the Data
1. Handle Missing Values:
- You can handle missing data by either removing rows with missing values or imputing them.
- Removing missing data:
my_data_clean <- na.omit(my_data)
- Imputing missing values (for example, replacing NA with the column mean):
my_data$column_name[is.na(my_data$column_name)] <- mean(my_data$column_name, na.rm = TRUE)
2. Remove Duplicates:
my_data_clean <- my_data[!duplicated(my_data), ]
3. Convert Data Types (if necessary):
my_data$column_name <- as.factor(my_data$column_name) # if you need to convert to a factor
4. Renaming Columns:
colnames(my_data)[colnames(my_data) == "old_name"] <- "new_name"
• Step 5: Descriptive Statistics and Summary
Summary statistics for numeric variables:
summary(my_data)
For categorical variables, get frequency counts:
table(my_data$column_name)
Use the summarytools package for more detailed summaries:
dfSummary(my_data)
• Step 6: Visualize the Data
Histograms: for understanding the distribution of numerical variables.
ggplot(my_data, aes(x = column_name)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of column_name", x = "column_name", y = "Frequency")
Boxplots: useful for spotting outliers and understanding distributions.
ggplot(my_data, aes(y = column_name)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Boxplot of column_name", y = "column_name")
Barplots: for categorical variables.
ggplot(my_data, aes(x = column_name)) +
  geom_bar(fill = "green", color = "black") +
  theme_minimal() +
  labs(title = "Barplot of column_name", x = "Category", y = "Count")
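The corrplot package loaded in Step 1 can round out the visual EDA with a correlation plot; a minimal sketch, assuming my_data contains at least two numeric columns:
num_vars <- my_data[sapply(my_data, is.numeric)] # keep only the numeric columns
corr_mat <- cor(num_vars, use = "complete.obs") # pairwise correlations, ignoring NAs
corrplot(corr_mat, method = "circle") # visualize the correlation matrix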
Statistical methods for Evaluation
• Statistics exist throughout the entire Data Analytics
Lifecycle.
• Statistical techniques are used
1. during the initial data exploration and data
preparation,
2. model building,
3. evaluation of the final models, and
4. assessment of how the new models improve the
situation when deployed in the field.
Statistical tools
Hypothesis testing
– used when comparing populations, such as testing or
evaluating the difference of the means from two samples of data
• Null hypothesis (H0) – no difference
• Alternative hypothesis (HA) – difference exists between samples
For example, if the task is to identify the effect of drug A compared to drug B on patients, the null and alternative hypotheses would be:
H0 : Drug A and drug B have the same effect on patients.
HA : Drug A has a greater effect than drug B on patients.
A hypothesis test leads to either rejecting the null
hypothesis in favor of the alternative or not rejecting the
null hypothesis.
Difference of Means

Hypothesis testing
• The difference in means can be tested using Student's t-test or Welch's t-test.

Student's t-test: assumes that the distributions of the two populations have equal but unknown variances. With x̄1, x̄2 the sample means, S1², S2² the sample variances, and n1, n2 the sample sizes, the test statistic is

T = (x̄1 − x̄2) / (Sp · sqrt(1/n1 + 1/n2)), where Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)

and T follows a t-distribution with n1 + n2 − 2 degrees of freedom.

In R: t.test(x, y, var.equal=TRUE) # run the Student's t-test

Welch's t-test: makes no assumption of equal variances. The test statistic is

T = (x̄1 − x̄2) / sqrt(S1²/n1 + S2²/n2)

• The degrees of freedom for Welch's t-test are defined as

df = (S1²/n1 + S2²/n2)² / [ (S1²/n1)² / (n1 − 1) + (S2²/n2)² / (n2 − 1) ]

In R: t.test(x, y, var.equal=FALSE) # run the Welch's t-test
Wilcoxon Rank-Sum Test
• The t-test is a parametric test.
• If the populations cannot be assumed or transformed to
follow a normal distribution, a nonparametric test can
be used.
• Wilcoxon is a nonparametric hypothesis test that checks
whether two populations are identically distributed.
• In R, wilcox.test(x, y, conf.int = TRUE) # runs the Wilcoxon test
Wilcoxon Rank-Sum Test - steps
1. The first step is to rank the set of observations from the
two groups as if they came from one large group.
2. The smallest observation receives a rank of 1, the second
smallest observation receives a rank of 2, and so on.
3. Ties among the observations receive a rank equal to the
average of the ranks they span.
4. The test uses ranks instead of numerical outcomes to
avoid specific assumptions about the shape of the
distribution.
5. After ranking all the observations, the assigned ranks are
summed for at least one population’s sample.
6. If the distribution of pop1 is shifted to the right of the other distribution, the rank sum corresponding to pop1's sample should be larger than the rank sum of pop2's.
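A minimal sketch of the ranking idea on made-up samples x and y:
x <- c(0.2, 0.5, 0.9, 1.1) # hypothetical sample from pop1
y <- c(0.3, 0.7, 1.4, 2.0) # hypothetical sample from pop2
r <- rank(c(x, y)) # rank all observations as one large group; ties get averaged ranks
sum(r[seq_along(x)]) # rank sum for pop1's sample; returns 15 here
wilcox.test(x, y, conf.int = TRUE) # runs the Wilcoxon rank-sum test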
Type I and Type II Errors
• A hypothesis test may result in two types of errors,
depending on whether the test accepts or rejects the
null hypothesis.
• A type I error is the rejection of the null hypothesis
when the null hypothesis is TRUE. The probability of the
type I error is denoted by the Greek letter α.
• A type II error is the acceptance of a null hypothesis
when the null hypothesis is FALSE. The probability of the
type II error is denoted by the Greek letter β.
Analysis of Variance (ANOVA) test
• The hypothesis tests presented in the previous sections
are good for analyzing means between two populations.
But what if there are more than two populations?
With k groups, ni observations in group i, group means x̄i, overall mean x̄, group sample variances Si², and total sample size n:
• Between-group variance: SB² = ∑ ni(x̄i − x̄)² / (k − 1)
• Within-group variance: SW² = ∑ (ni − 1)Si² / (n − k)
• The F-test statistic is defined as the ratio between the two: F = SB² / SW², with (k − 1, n − k) degrees of freedom.
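In R, this comparison of several group means can be sketched with the built-in aov() function; the data frame and group labels below are made up for illustration:
set.seed(1)
dat <- data.frame(y = c(rnorm(20, mean = 5), rnorm(20, mean = 6), rnorm(20, mean = 5.5)),
                  group = factor(rep(c("A", "B", "C"), each = 20)))
fit <- aov(y ~ group, data = dat) # one-way ANOVA across the three groups
summary(fit) # reports the F statistic and its p-value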


The nicotine content in milligrams of two samples of tobacco was found to be as follows:
Sample 1: 24 27 26 21 25
Sample 2: 27 30 28 31 22 36
Can it be said that these samples came from normal populations with the same mean? (Assume the table value of t at the 5% level is 1.833.)
Solution:
Given: n1 = 5, n2 = 6
S1² = ∑(x1 − x̄1)² / n1
S2² = ∑(x2 − x̄2)² / n2

Finding the means:
x̄1 = ∑x1 / n1 = (24+27+26+21+25) / 5 = 24.6
x̄2 = ∑x2 / n2 = (27+30+28+31+22+36) / 6 = 29

x1    x2    (x1 − x̄1)²    (x2 − x̄2)²
24    27    0.36           4
27    30    5.76           1
26    28    1.96           1
21    31    12.96          4
25    22    0.16           49
–     36    –              49

∑(x1 − x̄1)² = 21.2, ∑(x2 − x̄2)² = 108

S1² = 21.2 / 5 = 4.24
S2² = 108 / 6 = 18
Pooled variance: S² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2) = ((4 × 4.24) + (5 × 18)) / (5 + 6 − 2) = 11.88
• Procedure:
1. Null hypothesis H0 : µ1 = µ2
2. Alternative hypothesis HA : µ1 ≠ µ2
3. Level of significance:
Degrees of freedom, v = n1 + n2 − 2 = 9
Table value of t at the 5% level is 1.833
4. Calculated value:
T = (x̄1 − x̄2) / sqrt(S²(1/n1 + 1/n2)) = (24.6 − 29) / sqrt(11.88 × (1/5 + 1/6)) = −2.108
Hence T = −2.108 and |T| = 2.108.
5. Conclusion:
For the two-sided alternative µ1 ≠ µ2, the 5% critical value of t with 9 degrees of freedom is 2.262 (the quoted 1.833 is the one-tailed value). Since |T| = 2.108 < 2.262, H0 is accepted.
Thus the samples can be regarded as coming from normal populations with the same mean.
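The same comparison can be reproduced in R. Note that t.test() pools the variances with n − 1 denominators, so its statistic (about −1.92) differs slightly from the hand computation above:
x1 <- c(24, 27, 26, 21, 25)
x2 <- c(27, 30, 28, 31, 22, 36)
t.test(x1, x2, var.equal = TRUE) # pooled-variance Student's t-test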
Clustering
• It is a method often used for exploratory analysis of the
data.
• Clustering methods find the similarities between objects according to the object attributes and group the similar objects into clusters.
• Clustering techniques are utilized in marketing,
economics, and various branches of science.
• A popular clustering method is k-means.
K-Means
• Given a collection of objects, each with n measurable attributes, k-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects' proximity to the center of the k groups.
• The center is determined as the mean of each cluster's n-dimensional vector of attributes.
• The within sum of squares (WSS) metric is examined to determine a reasonably optimal value of k.
Steps involved in K-means
1. Choose the value of k and the k initial guesses for the
centroids.
2. Compute the distance from each data point (xi,yi) to each
centroid. Assign each point to the closest centroid. This
association defines the first k clusters.
The distance is calculated using the Euclidean distance formula:
d((xi, yi), (cx, cy)) = sqrt((xi − cx)² + (yi − cy)²)
3. Compute the centroid, the center of mass, of each newly


defined cluster from Step 2.
4. Repeat Steps 2 and 3 until the algorithm converges to an
answer.
(Convergence is reached when the computed centroids do not
change)
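A minimal sketch using R's built-in kmeans() function on simulated two-dimensional data (the points and the choice k = 2 are made up for illustration):
set.seed(42)
pts <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5)),
                  y = c(rnorm(50, 0), rnorm(50, 5)))
km <- kmeans(pts, centers = 2) # run k-means with k = 2
km$centers # final centroids of the two clusters
km$tot.withinss # within sum of squares (WSS) for this k
# examine WSS over a range of k values to pick a reasonably optimal k
wss <- sapply(1:10, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "WSS")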
Association rules
• Association rule mining is an unsupervised learning method.
• It is a descriptive, not predictive, method often used to discover interesting relationships hidden in a large dataset.
• The disclosed relationships can be represented as rules
or frequent itemsets.
• Here are some possible questions that association rules
can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what
products do they tend to buy?
• Of those customers who have purchased this product, what
other similar products do they tend to view or purchase?
• Given a large collection of transactions, in which each transaction consists of one or more items, association rules go through the items being purchased to see what items are frequently bought together and to discover a list of rules that describe the purchasing behaviour.
• The goal with association rules is to discover interesting
relationships among the items.
• The relationships that are interesting depend both on
the business context and the nature of the algorithm
being used for the discovery.
Apriori algorithm
• Check video lecture shared in blog
Apriori Algorithm – Worked Example
(minimum support count = 3)

Step 1: Scan the transaction database D for the count of each item to form the candidate set C1. Compare each candidate's support count with the minimum support count to obtain the frequent 1-itemsets L1.

C1:
Itemset   Support count
{M}       3
{O}       4
{N}       2
{K}       5
{E}       4
{Y}       3
{D}       1
{A}       1
{U}       1
{C}       2
{I}       1

L1:
Itemset   Support count
{M}       3
{O}       4
{K}       5
{E}       4
{Y}       3

Step 2: Generate the candidate 2-itemsets C2 from L1, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L2.

C2:
Itemset   Support count
{M,O}     1
{M,K}     3
{M,E}     2
{M,Y}     2
{O,K}     3
{O,E}     3
{O,Y}     2
{K,E}     4
{K,Y}     3
{E,Y}     2

L2:
Itemset   Support count
{M,K}     3
{O,K}     3
{O,E}     3
{K,E}     4
{K,Y}     3

Step 3: Generate the candidate 3-itemsets C3 from L2, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L3.

C3:
Itemset   Support count
{M,K,O}   1
{M,K,E}   2
{M,K,Y}   2
{O,K,E}   3
{O,K,Y}   2
{K,E,Y}   2

L3:
Itemset   Support count
{O,K,E}   3
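The same frequent itemsets can be mined in R with the arules package; a minimal sketch, assuming five transactions consistent with the counts above (note that arules counts each transaction once and stores unique items per transaction, so {O} gets support 3 of 5 here rather than the occurrence count 4 above):
install.packages("arules")
library(arules)
trans_list <- list(c("M","O","N","K","E","Y"),
                   c("D","O","N","K","E","Y"),
                   c("M","A","K","E"),
                   c("M","U","C","K","Y"),
                   c("C","O","K","I","E"))
trans <- as(trans_list, "transactions") # coerce the list to the transactions class
# minimum support count of 3 out of 5 transactions = support 0.6
freq <- apriori(trans, parameter = list(support = 0.6, target = "frequent itemsets"))
inspect(freq) # lists the frequent itemsets, including {K,E,O}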
