
ITDX45 - Big Data Analytics
Module 2 - Data Analytic Methods
Introduction to R
• R is a programming language and software framework
for statistical analysis and graphics.
DATA TYPES
R has five basic or atomic object classes:
1. Character – “hello”
2. Numeric – 10.5
3. Integer – 50L
4. Complex – 3+3i
5. Logical – TRUE/FALSE
Each object can have attributes such as:
• names
• dim
• class
• length
• Other attributes defined by the user
Eg:
x<-list(age=c(10,20,30),weight=c(60,50,70))
• A list is a collection of objects that can be of various
types, including other lists.
VECTORS AND LISTS
• One way of creating a vector is by using the c() function.
• Indexing in R starts at 1, not 0.
Eg:
X <- c("this", "is", "a", "vector")
X <- c(1, "hello", 3, 4, 5) # mixed types are coerced to character
Y <- c(1, 2, 3, 45)
X <- vector(mode = "logical", length = 4)
as.logical(x) # used for conversion of datatype
• Vector is used to store the same data type
• List is used to store different data types
• X <- list("hello", 1, TRUE)
• i <- 2 # numeric by default, not integer
• is.integer(i) # returns FALSE
• j <- as.integer(i) # coerces contents of i into an integer
• is.integer(j) # returns TRUE
Connecting Databases
• install.packages("RODBC")
• library(RODBC)
• conn <- odbcConnect("DSN", uid="user", pwd="password")
• housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms from housing where hinc > 1000000")
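When the analysis is finished, the connection can be released; odbcClose() is part of the same RODBC package:
• odbcClose(conn) # close the ODBC connection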
To save plots in a desired format:
jpeg(file="c:/data/sales_hist.jpeg") # create a new jpeg file
hist(sales$num_of_orders) # export histogram to jpeg
dev.off() # shut off the graphic device
Other functions: png(), bmp(), pdf(), etc.
• v <- 1:5 # create a vector 1 2 3 4 5
• v # returns 1 2 3 4 5
• sum(v) # returns 15
• w <- v * 2 # create a vector 2 4 6 8 10
• w # returns 2 4 6 8 10
• w[3] # returns 6 (the 3rd element of w)
• z <- v + w # sums two vectors element by element
• z # returns 3 6 9 12 15
• z > 8 # returns FALSE FALSE TRUE TRUE TRUE
• z[z > 8] # returns 9 12 15
• z[z > 8 | z < 5] # returns 3 9 12 15 ("|" denotes "or")
Matrix
• A two-dimensional array is known as a matrix.
sales_matrix <- matrix(0, nrow = 3, ncol = 4)
• sales_matrix
[,1] [,2] [,3] [,4]
[1,] 0 0 0 0
[2,] 0 0 0 0
[3,] 0 0 0 0
• M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow = 3, ncol = 3) # values fill column by column by default
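Individual elements and rows are accessed with [row, column] indexing; for the matrix M above:
• M[1, 2] # returns 5 (row 1, column 2)
• M[2, ] # returns row 2: 3 0 3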
Contingency Tables
• In R, table refers to a class of objects used to store the
observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency
table.
• It is the basis for performing a statistical test on the
independence of the factors used to build the table.
sales_table <- table(sales$gender,sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586
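The independence test mentioned above can be run directly on the table object; a minimal sketch using R's built-in chi-squared test:
chisq.test(sales_table) # tests whether gender and spender category are independent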
Exploratory data analysis
Step 1: Install and Load Necessary Libraries
install.packages(c("dplyr", "ggplot2", "tidyr",
"summarytools", "corrplot"))
Load the packages
• library(dplyr) # For data manipulation
• library(ggplot2) # For visualization
• library(tidyr) # For data tidying (reshape data)
• library(summarytools) # For descriptive statistics
• library(corrplot) # For correlation plots
• Step 2: Load the Dataset
my_data <- read.csv("path_to_your_dataset.csv")
View the first few rows of the dataset - head(my_data)
• Install and load readxl if working with Excel files
install.packages("readxl")
library(readxl)
Load Excel file
my_data <- read_excel("path_to_your_file.xlsx")
• Step 3: Check Basic Information About the
Dataset
View the structure of the data
str(my_data)
Get summary statistics (min, max, mean, etc.)
summary(my_data)
Check for missing values
sum(is.na(my_data))
Get column names
colnames(my_data)
• Step 4: Clean and Preprocess the Data
1. Handle Missing Values:
- You can handle missing data by either removing rows with missing values or imputing them.
- Removing missing data:
my_data_clean <- na.omit(my_data)
- Imputing missing values (for example, replacing NA with the column mean):
my_data$column_name[is.na(my_data$column_name)] <- mean(my_data$column_name, na.rm = TRUE)
2. Remove Duplicates:
my_data_clean <- my_data[!duplicated(my_data), ]
3. Convert Data Types (if necessary):
my_data$column_name <- as.factor(my_data$column_name) # if you need to convert to a factor
4. Renaming Columns:
colnames(my_data)[colnames(my_data) == "old_name"] <- "new_name"
• Step 5: Descriptive Statistics and Summary
Summary statistics for numeric variables:
summary(my_data)
For categorical variables, get frequency counts:
table(my_data$column_name)
Use the summarytools package for more detailed summaries:
dfSummary(my_data)
• Step 6: Visualize the Data
Histograms: for understanding the distribution of numerical variables.
ggplot(my_data, aes(x = column_name)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Histogram of column_name", x = "column_name", y = "Frequency")
Boxplots: useful for spotting outliers and understanding distributions.
ggplot(my_data, aes(y = column_name)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  theme_minimal() +
  labs(title = "Boxplot of column_name", y = "column_name")
Barplots: for categorical variables.
ggplot(my_data, aes(x = column_name)) +
  geom_bar(fill = "green", color = "black") +
  theme_minimal() +
  labs(title = "Barplot of column_name", x = "Category", y = "Count")
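The corrplot package loaded in Step 1 can round out the visual EDA with a correlation plot; a minimal sketch, assuming my_data contains at least two numeric columns:
num_vars <- my_data[sapply(my_data, is.numeric)] # keep only the numeric columns
corr_mat <- cor(num_vars, use = "complete.obs") # pairwise correlations, ignoring NAs
corrplot(corr_mat, method = "circle") # visualize the correlation matrix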
Statistical methods for Evaluation
• Statistics exist throughout the entire Data Analytics
Lifecycle.
• Statistical techniques are used
1. during the initial data exploration and data
preparation,
2. model building,
3. evaluation of the final models, and
4. assessment of how the new models improve the
situation when deployed in the field.
Statistical tools
Hypothesis testing
– used when comparing populations, such as testing or
evaluating the difference of the means from two samples of data
• Null hypothesis (H0) – no difference
• Alternative hypothesis (HA) – difference exists between samples
For example, if the task is to identify the effect of drug A compared to drug B on patients, the null and alternative hypotheses would be:
H0 : Drug A and drug B have the same effect on patients.
HA : Drug A has a greater effect than drug B on patients.
A hypothesis test leads to either rejecting the null
hypothesis in favor of the alternative or not rejecting the
null hypothesis.
Difference of Means

Hypothesis testing
• The difference in means can be tested using Student's t-test or Welch's t-test.

Student's t-test: assumes that the distributions of the two populations have equal but unknown variances. With x̄1, x̄2 the sample means, S1², S2² the sample variances, and n1, n2 the sample sizes, the test statistic is

T = (x̄1 − x̄2) / (Sp · sqrt(1/n1 + 1/n2)), where Sp² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2)

and T follows a t-distribution with n1 + n2 − 2 degrees of freedom.

In R: t.test(x, y, var.equal=TRUE) # run the Student's t-test

Welch's t-test: makes no assumption of equal variances. The test statistic is

T = (x̄1 − x̄2) / sqrt(S1²/n1 + S2²/n2)

• The degrees of freedom for Welch's t-test are defined as

df = (S1²/n1 + S2²/n2)² / [ (S1²/n1)² / (n1 − 1) + (S2²/n2)² / (n2 − 1) ]

In R: t.test(x, y, var.equal=FALSE) # run the Welch's t-test
Wilcoxon Rank-Sum Test
• The t-test is a parametric test.
• If the populations cannot be assumed or transformed to
follow a normal distribution, a nonparametric test can
be used.
• Wilcoxon is a nonparametric hypothesis test that checks
whether two populations are identically distributed.
• In R, wilcox.test(x, y, conf.int = TRUE) # runs the Wilcoxon test
Wilcoxon Rank-Sum Test - steps
1. The first step is to rank the set of observations from the
two groups as if they came from one large group.
2. The smallest observation receives a rank of 1, the second
smallest observation receives a rank of 2, and so on.
3. Ties among the observations receive a rank equal to the
average of the ranks they span.
4. The test uses ranks instead of numerical outcomes to
avoid specific assumptions about the shape of the
distribution.
5. After ranking all the observations, the assigned ranks are
summed for at least one population’s sample.
6. If the distribution of pop1 is shifted to the right of the other distribution, the rank sum corresponding to pop1's sample should be larger than the rank sum of pop2's.
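A minimal sketch of the ranking idea on made-up samples x and y:
x <- c(0.2, 0.5, 0.9, 1.1) # hypothetical sample from pop1
y <- c(0.3, 0.7, 1.4, 2.0) # hypothetical sample from pop2
r <- rank(c(x, y)) # rank all observations as one large group; ties get averaged ranks
sum(r[seq_along(x)]) # rank sum for pop1's sample; returns 15 here
wilcox.test(x, y, conf.int = TRUE) # runs the Wilcoxon rank-sum test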
Type I and Type II Errors
• A hypothesis test may result in two types of errors,
depending on whether the test accepts or rejects the
null hypothesis.
• A type I error is the rejection of the null hypothesis
when the null hypothesis is TRUE. The probability of the
type I error is denoted by the Greek letter α.
• A type II error is the acceptance of a null hypothesis
when the null hypothesis is FALSE. The probability of the
type II error is denoted by the Greek letter β.
Analysis of Variance (ANOVA) test
• The hypothesis tests presented in the previous sections
are good for analyzing means between two populations.
But what if there are more than two populations?
With k groups, ni observations in group i, group means x̄i, overall mean x̄, group sample variances Si², and total sample size n:
• Between-group variance: SB² = ∑ ni(x̄i − x̄)² / (k − 1)
• Within-group variance: SW² = ∑ (ni − 1)Si² / (n − k)
• The F-test statistic is defined as the ratio between the two: F = SB² / SW², with (k − 1, n − k) degrees of freedom.
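In R, this comparison of several group means can be sketched with the built-in aov() function; the data frame and group labels below are made up for illustration:
set.seed(1)
dat <- data.frame(y = c(rnorm(20, mean = 5), rnorm(20, mean = 6), rnorm(20, mean = 5.5)),
                  group = factor(rep(c("A", "B", "C"), each = 20)))
fit <- aov(y ~ group, data = dat) # one-way ANOVA across the three groups
summary(fit) # reports the F statistic and its p-value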


The nicotine content in milligrams of two samples of tobacco was found to be as follows:
Sample 1: 24 27 26 21 25
Sample 2: 27 30 28 31 22 36
Can it be said that these samples came from normal populations with the same mean? (Assume the table value of t at the 5% level is 1.833.)
Solution:
Given: n1 = 5, n2 = 6
S1² = ∑(x1 − x̄1)² / n1
S2² = ∑(x2 − x̄2)² / n2

Finding the means:
x̄1 = ∑x1 / n1 = (24+27+26+21+25) / 5 = 24.6
x̄2 = ∑x2 / n2 = (27+30+28+31+22+36) / 6 = 29

x1    x2    (x1 − x̄1)²    (x2 − x̄2)²
24    27    0.36           4
27    30    5.76           1
26    28    1.96           1
21    31    12.96          4
25    22    0.16           49
–     36    –              49

∑(x1 − x̄1)² = 21.2, ∑(x2 − x̄2)² = 108

S1² = 21.2 / 5 = 4.24
S2² = 108 / 6 = 18
Pooled variance: S² = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2) = ((4 × 4.24) + (5 × 18)) / (5 + 6 − 2) = 11.88
• Procedure:
1. Null hypothesis H0 : µ1 = µ2
2. Alternative hypothesis HA : µ1 ≠ µ2
3. Level of significance:
Degrees of freedom, v = n1 + n2 − 2 = 9
Table value of t at the 5% level is 1.833
4. Calculated value:
T = (x̄1 − x̄2) / sqrt(S²(1/n1 + 1/n2)) = (24.6 − 29) / sqrt(11.88 × (1/5 + 1/6)) = −2.108
Hence T = −2.108 and |T| = 2.108.
5. Conclusion:
For the two-sided alternative µ1 ≠ µ2, the 5% critical value of t with 9 degrees of freedom is 2.262 (the quoted 1.833 is the one-tailed value). Since |T| = 2.108 < 2.262, H0 is accepted.
Thus the samples can be regarded as coming from normal populations with the same mean.
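The same comparison can be reproduced in R. Note that t.test() pools the variances with n − 1 denominators, so its statistic (about −1.92) differs slightly from the hand computation above:
x1 <- c(24, 27, 26, 21, 25)
x2 <- c(27, 30, 28, 31, 22, 36)
t.test(x1, x2, var.equal = TRUE) # pooled-variance Student's t-test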
Clustering
• It is a method often used for exploratory analysis of the
data.
• Clustering methods find the similarities between objects according to the object attributes and group the similar objects into clusters.
• Clustering techniques are utilized in marketing,
economics, and various branches of science.
• A popular clustering method is k-means.
K-Means
• Given a collection of objects, each with n measurable attributes, k-means is an analytical technique that, for a chosen value of k, identifies k clusters of objects based on the objects' proximity to the center of the k groups.
• The center is determined as the mean of each cluster's n-dimensional vector of attributes.
• The within sum of squares (WSS) metric is examined to determine a reasonably optimal value of k.
Steps involved in K-means
1. Choose the value of k and the k initial guesses for the
centroids.
2. Compute the distance from each data point (xi,yi) to each
centroid. Assign each point to the closest centroid. This
association defines the first k clusters.
The distance is calculated using the Euclidean distance formula:
d((xi, yi), (cx, cy)) = sqrt((xi − cx)² + (yi − cy)²)
3. Compute the centroid, the center of mass, of each newly


defined cluster from Step 2.
4. Repeat Steps 2 and 3 until the algorithm converges to an
answer.
(Convergence is reached when the computed centroids do not
change)
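A minimal sketch using R's built-in kmeans() function on simulated two-dimensional data (the points and the choice k = 2 are made up for illustration):
set.seed(42)
pts <- data.frame(x = c(rnorm(50, 0), rnorm(50, 5)),
                  y = c(rnorm(50, 0), rnorm(50, 5)))
km <- kmeans(pts, centers = 2) # run k-means with k = 2
km$centers # final centroids of the two clusters
km$tot.withinss # within sum of squares (WSS) for this k
# examine WSS over a range of k values to pick a reasonably optimal k
wss <- sapply(1:10, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "WSS")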
Association rules
• Association rule mining is an unsupervised learning method.
• It is a descriptive, not predictive, method often used to discover interesting relationships hidden in a large dataset.
• The disclosed relationships can be represented as rules
or frequent itemsets.
• Here are some possible questions that association rules
can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what
products do they tend to buy?
• Of those customers who have purchased this product, what
other similar products do they tend to view or purchase?
• Given a large collection of transactions, in which each transaction consists of one or more items, association rules go through the items being purchased to see what items are frequently bought together and to discover a list of rules that describe the purchasing behaviour.
• The goal with association rules is to discover interesting
relationships among the items.
• The relationships that are interesting depend both on
the business context and the nature of the algorithm
being used for the discovery.
Apriori algorithm
• Check video lecture shared in blog
Apriori Algorithm – Worked Example
(minimum support count = 3)

Step 1: Scan the transaction database D for the count of each item to form the candidate set C1. Compare each candidate's support count with the minimum support count to obtain the frequent 1-itemsets L1.

C1:
Itemset   Support count
{M}       3
{O}       4
{N}       2
{K}       5
{E}       4
{Y}       3
{D}       1
{A}       1
{U}       1
{C}       2
{I}       1

L1:
Itemset   Support count
{M}       3
{O}       4
{K}       5
{E}       4
{Y}       3

Step 2: Generate the candidate 2-itemsets C2 from L1, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L2.

C2:
Itemset   Support count
{M,O}     1
{M,K}     3
{M,E}     2
{M,Y}     2
{O,K}     3
{O,E}     3
{O,Y}     2
{K,E}     4
{K,Y}     3
{E,Y}     2

L2:
Itemset   Support count
{M,K}     3
{O,K}     3
{O,E}     3
{K,E}     4
{K,Y}     3

Step 3: Generate the candidate 3-itemsets C3 from L2, scan D for the count of each candidate, and compare the support counts with the minimum support count to obtain L3.

C3:
Itemset   Support count
{M,K,O}   1
{M,K,E}   2
{M,K,Y}   2
{O,K,E}   3
{O,K,Y}   2
{K,E,Y}   2

L3:
Itemset   Support count
{O,K,E}   3
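The same frequent itemsets can be mined in R with the arules package; a minimal sketch, assuming five transactions consistent with the counts above (note that arules counts each transaction once and stores unique items per transaction, so {O} gets support 3 of 5 here rather than the occurrence count 4 above):
install.packages("arules")
library(arules)
trans_list <- list(c("M","O","N","K","E","Y"),
                   c("D","O","N","K","E","Y"),
                   c("M","A","K","E"),
                   c("M","U","C","K","Y"),
                   c("C","O","K","I","E"))
trans <- as(trans_list, "transactions") # coerce the list to the transactions class
# minimum support count of 3 out of 5 transactions = support 0.6
freq <- apriori(trans, parameter = list(support = 0.6, target = "frequent itemsets"))
inspect(freq) # lists the frequent itemsets, including {K,E,O}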
