DV Unit 1&2 Notes
1.1 Statistics
Key Concepts:
● Data: Raw facts and figures collected for analysis. Data can be quantitative
(numerical) or qualitative (categorical).
● Population: The entire set of individuals, items, or data points of interest in a
study. For example, if you are studying the heights of all students in a school, the
population is all the students in that school.
● Sample: A subset of the population that is used to make inferences about the
entire population. For example, if you measure the heights of 50 students out of
500, those 50 students are your sample.
● Variable: A characteristic or attribute that can be measured or observed.
Variables can be independent (predictor) or dependent (outcome).
1.2 Types of Statistics
Statistics is broadly divided into descriptive statistics, which summarize and present the data at hand, and inferential statistics, which use a sample to draw conclusions about a population.
[Figure: a bell curve illustrating standard deviation. It shows how data spreads around the average (mean); the percentages indicate how much data falls within each section, and about 68% of the data falls within one standard deviation of the mean.]
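The 68% figure can be checked empirically. A minimal sketch with synthetic normally distributed data (the mean, standard deviation, and seed below are arbitrary choices for illustration):

```python
import numpy as np

# Draw synthetic bell-curve data (illustrative parameters: mean 100, sd 15)
rng = np.random.default_rng(0)
data = rng.normal(loc=100, scale=15, size=100_000)

# Fraction of values within one standard deviation of the mean
mean, sd = data.mean(), data.std()
within_1sd = np.mean((data > mean - sd) & (data < mean + sd))
print(f"{within_1sd:.1%} of values lie within one standard deviation")
```

With a sample this large, the printed fraction comes out close to 68.3%, matching the rule described above.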
1.4 Inferential Statistics
1. Point Estimation
Point estimation uses a single value computed from the sample (e.g., the sample mean) as the best estimate of a population parameter.
2. Interval Estimation
Interval estimation provides a range of values that likely contains the population parameter, rather than a single estimate. This range is called a Confidence Interval (CI).
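As a sketch, a 95% confidence interval for a population mean when the population standard deviation is known can be computed as follows (all numbers are made up for illustration):

```python
import math
from scipy.stats import norm

# Hypothetical sample: mean height of n = 50 students
sample_mean = 172.5
sigma = 6.0          # assumed known population standard deviation
n = 50

# 95% CI: sample mean +/- z * standard error
z = norm.ppf(0.975)                    # critical value, about 1.96
margin = z * sigma / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```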
Comparison of Point vs. Interval Estimation
● Point estimation gives a single value; it is simple, but conveys no sense of uncertainty.
● Interval estimation gives a range with an attached confidence level; it is wider, but quantifies the uncertainty of the estimate.
One-Tailed Test
● Used when the direction of the effect is specified in advance (only one direction matters).
● Example: "New drug A has a higher recovery rate than drug B."
● Mathematical Representation:
○ H₀ : μ₁ ≤ μ₂
○ H₁ : μ₁ > μ₂
Two-Tailed Test
● Used when the direction of the effect is unknown or when differences in either
direction matter.
● Example: "New drug A has a different effect on recovery rates compared to drug B
(could be higher or lower)."
● Mathematical Representation:
○ H₀ : μ₁ = μ₂
○ H₁ : μ₁ ≠ μ₂
3. Decision Criteria in Hypothesis Testing
● p-value: Measures the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
● Interpretation:
○ p < 0.05 → Strong evidence against H₀; reject H₀.
○ p > 0.05 → Insufficient evidence; fail to reject H₀.
T-Test
Used to compare the means of two groups to determine if they are significantly different
from each other.
Example: Comparing the average exam scores of two classes to see whether one performs significantly better than the other.
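A two-sample t-test can be run with SciPy; the scores below are invented for illustration:

```python
from scipy import stats

# Hypothetical exam scores for two classes
class_a = [78, 85, 90, 72, 88, 95, 81, 84]
class_b = [70, 65, 80, 74, 68, 72, 77, 69]

# Independent two-sample t-test on the group means
t_stat, p_value = stats.ttest_ind(class_a, class_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Decision at the 0.05 significance level
if p_value < 0.05:
    print("Reject H0: the means differ significantly")
else:
    print("Fail to reject H0")
```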
ANOVA (Analysis of Variance)
Used to compare means across three or more groups to see if at least one group is
significantly different.
● One-Way ANOVA: Compares means of one independent variable with multiple
groups.
○ Example: Comparing students' math scores across three different schools.
● Two-Way ANOVA: Examines the effect of two independent variables on a
dependent variable.
○ Example: Testing how both teaching method and study time affect exam
scores.
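A one-way ANOVA can likewise be sketched with SciPy (the scores for the three schools are made up):

```python
from scipy import stats

# Hypothetical math scores from three schools
school_1 = [85, 90, 88, 92, 87]
school_2 = [78, 82, 80, 79, 81]
school_3 = [90, 94, 91, 89, 93]

# One-way ANOVA: is at least one school mean different?
f_stat, p_value = stats.f_oneway(school_1, school_2, school_3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```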
4. Z-Test
● Used for large sample sizes (n>30) and when the population variance is known.
● Example: Testing whether the average height of a population differs from a
known value.
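A z-test can be computed directly from its formula, z = (x̄ − μ₀)/(σ/√n); the numbers below are illustrative:

```python
import math
from scipy.stats import norm

# Hypothetical test: does average height differ from 170 cm?
sample_mean = 172.5   # observed sample mean
mu_0 = 170.0          # hypothesized population mean
sigma = 6.0           # known population standard deviation
n = 50                # large sample (n > 30)

# z statistic: distance of the sample mean from mu_0 in standard errors
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Two-tailed p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_value:.4f}")
```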
5. Regression Analysis
Models the relationship between a dependent variable and one or more independent variables, estimating how the outcome changes with each predictor while holding the others constant.
Example: Predicting exam scores from hours studied.
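Simple linear regression can be sketched with NumPy's least-squares fit (the hours and scores below are invented):

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score
hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([52, 58, 61, 67, 73, 78])

# Least-squares line: score = slope * hours + intercept
slope, intercept = np.polyfit(hours, scores, 1)
print(f"score = {slope:.2f} * hours + {intercept:.2f}")

# Predicted score after 7 hours of study
print(round(slope * 7 + intercept, 1))
```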
Poisson Distribution: Models the number of events occurring in a fixed interval of time or space, given a known average rate (e.g., the number of calls a call center receives per hour).
Bernoulli Distribution: Models a single trial with exactly two outcomes, success with probability p and failure with probability 1 − p (e.g., a single coin flip).
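Both distributions are available in scipy.stats; the rate and probability below are illustrative:

```python
from scipy.stats import poisson, bernoulli

# Poisson: a call center averaging 4 calls per hour (assumed rate)
rate = 4
print(f"P(exactly 2 calls in an hour) = {poisson.pmf(2, rate):.4f}")

# Bernoulli: a single fair coin flip (p = 0.5)
p = 0.5
print(f"P(heads) = {bernoulli.pmf(1, p):.2f}")
```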
1.6 Sampling and Sampling Distributions
Sampling is the process of selecting a subset (sample) from a larger group (population)
to make statistical inferences about the whole population. The accuracy and reliability
of results depend on the sampling method used.
1. Probability Sampling
In probability sampling, each individual in the population has a known and non-zero
probability of being selected. This ensures that the sample is representative and
reduces bias.
Simple Random Sampling
Every individual in the population has an equal and independent chance of being
selected. This is the most basic and widely used sampling method.
How It Works:
● Assign a number to every individual in the population and use a random mechanism (e.g., a random number generator or lottery draw) to pick the sample.
Example:
● Randomly drawing 50 student names from a complete list of all 500 students.
Advantages:
● Simple to implement and free of selection bias when a complete population list is available.
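Simple random sampling is easy to sketch with Python's standard library (the population size, sample size, and seed are illustrative):

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

# Hypothetical population: 500 student IDs
population = list(range(1, 501))

# Draw 50 IDs; every student has an equal chance of selection
sample = random.sample(population, 50)
print(len(sample))
```

Because random.sample() draws without replacement, no student appears twice in the sample.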
Stratified Sampling
The population is divided into homogeneous subgroups (strata), and individuals are
randomly selected from each stratum proportionally.
How It Works:
● Split the population into non-overlapping strata, then draw a random sample from each stratum in proportion to its size.
Example:
● A school wants to survey students' study habits and stratifies by grade level
(freshman, sophomore, junior, senior), ensuring proportional representation.
Advantages:
● Guarantees representation of every subgroup and usually yields more precise estimates than simple random sampling.
Disadvantages:
● Requires knowing each individual's stratum membership in advance, which is not always available.
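The grade-level example above can be sketched as proportional stratified sampling (the stratum sizes and the 10% sampling fraction are assumptions for illustration):

```python
import random

random.seed(0)

# Hypothetical student body keyed by grade level (the strata)
strata = {
    "freshman":  [f"F{i}" for i in range(200)],
    "sophomore": [f"S{i}" for i in range(150)],
    "junior":    [f"J{i}" for i in range(100)],
    "senior":    [f"N{i}" for i in range(50)],
}

# Sample 10% from each stratum so proportions match the population
sample = []
for grade, students in strata.items():
    k = int(len(students) * 0.10)
    sample.extend(random.sample(students, k))

print(len(sample))  # 20 + 15 + 10 + 5 = 50
```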
Cluster Sampling
The population is divided into clusters, and entire clusters are randomly selected
instead of individual members.
How It Works:
● Divide the population into naturally occurring clusters (e.g., schools or city blocks), randomly select some clusters, and survey every member of the chosen clusters.
Advantages:
● Cost-effective and practical when the population is large or geographically dispersed.
Disadvantages:
● Less precise than simple random sampling when members within a cluster resemble each other.
Systematic Sampling
Every k-th individual from a population list is selected (where k = N/n, N = population size, n = sample size).
How It Works:
● Choose a random starting point within the first k individuals on the list, then select every k-th individual after it.
Example:
● With N = 500 students and a desired sample of n = 50, k = 10, so every 10th student on the list is selected.
Advantages:
● Quick and easy to apply to an ordered list.
Disadvantages:
● Can introduce bias if the list has a periodic pattern that coincides with the interval k.
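Systematic sampling follows directly from the k = N/n rule (N, n, and the seed below are illustrative):

```python
import random

random.seed(1)

# Population list of size N; desired sample size n
N, n = 500, 50
population = list(range(1, N + 1))

k = N // n                    # sampling interval: every k-th individual
start = random.randrange(k)   # random start within the first interval
sample = population[start::k]

print(k, len(sample))
```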
2. Non-Probability Sampling
In non-probability sampling, individuals are selected based on convenience or judgment rather than random chance, so some members of the population may have no chance of being chosen.
Convenience Sampling
How It Works:
● Select participants who are easily accessible (e.g., students in a nearby class, people at a mall).
Example:
● Surveying shoppers at a single mall because they are easy to reach.
Advantages:
● Fast, inexpensive, and easy to carry out.
Disadvantages:
● Highly prone to selection bias; results may not generalize to the wider population.
How It Works:
Example:
Disadvantages:
Descriptive data analysis involves summarizing and describing the main features of a
dataset. R provides a variety of functions and packages to perform descriptive analysis.
● Steps in Descriptive Data Analysis:
○ Data Import:
■ Use functions like read.csv(), read.table(), or
read_excel() to import data into R.
○ Data Exploration:
■ Use functions like head(), tail(), and str() to explore the
structure of the dataset.
○ Summary Statistics:
■ Use summary() to get an overview of the dataset (e.g., mean,
median, quartiles).
■ Use mean(), median(), sd(), and var() for specific statistical
measures.
○ Data Cleaning:
■ Handle missing values using na.omit() or imputation helpers such as zoo::na.fill().
■ Remove duplicates using unique().
○ Data Visualization:
■ Use basic plots like histograms (hist()), boxplots (boxplot()),
and scatterplots (plot()) to visualize data distributions and
relationships.
● Relevance to Data Visualization:
○ Descriptive analysis is the foundation of data visualization.
○ Visualizations like histograms and boxplots help in understanding data
distributions and identifying outliers.
○ Summary statistics provide context for interpreting visualizations.
4. Description of Basic Functions Used to Describe Data in R
R provides a wide range of functions to describe and summarize data. Below are some
commonly used functions:
● Data Exploration:
○ head(): Displays the first few rows of the dataset.
○ tail(): Displays the last few rows of the dataset.
○ str(): Provides the structure of the dataset (e.g., data types,
dimensions).
● Summary Statistics:
○ summary(): Generates summary statistics for numerical and categorical
variables.
○ mean(), median(), sd(), var(): Calculate specific statistical measures.
● Data Visualization:
○ hist(): Creates histograms to visualize data distributions.
○ boxplot(): Creates boxplots to visualize data spread and outliers.
○ plot(): Creates scatterplots to visualize relationships between variables.
○ barplot(): Creates bar charts for categorical data.
● Data Manipulation:
○ subset(): Extracts subsets of data based on conditions.
○ aggregate(): Aggregates data based on specific criteria.
○ table(): Creates frequency tables for categorical variables.
● Relevance to Data Visualization:
○ These functions help in preparing data for visualization by summarizing
and cleaning it.
○ Basic plots provide quick insights into data patterns and relationships.
○ Advanced visualization libraries like ggplot2 build on these basic
functions to create more complex and customized visualizations.
UNIT 2: Data Manipulation with R
2.1 Introduction to R
2.2.1 dplyr Package for Data Manipulation
The dplyr package provides a set of efficient functions for handling tabular data,
enabling seamless data manipulation.
library(dplyr)

# Sample dataset
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 35, 30, 40),
  salary = c(50000, 60000, 55000, 70000),
  department = c("HR", "IT", "HR", "IT")
)

# 1. Filter rows where age > 30
filtered_data <- df %>% filter(age > 30)
print("Filtered Data:")
print(filtered_data)

# 2. Select the name and age columns
selected_data <- df %>% select(name, age)
print("Selected Data:")
print(selected_data)

# 3. Sort by salary in descending order
sorted_data <- df %>% arrange(desc(salary))
print("Sorted Data:")
print(sorted_data)

# 4. Add a derived column (illustrative: a 10% bonus)
mutated_data <- df %>% mutate(bonus = salary * 0.10)
print("Mutated Data:")
print(mutated_data)

# 5. Average salary by department
summary_data <- df %>%
  group_by(department) %>%
  summarise(avg_salary = mean(salary))
print("Summary Data:")
print(summary_data)
"Filtered Data:"
1 Bob 35 60000 IT
2 David 40 70000 IT
"Selected Data:"
name age
1 Alice 25
2 Bob 35
3 Charlie 30
4 David 40
"Sorted Data:"
1 David 40 70000 IT
2 Bob 35 60000 IT
3 Charlie 30 55000 HR
4 Alice 25 50000 HR
"Mutated Data:"
# A tibble: 2 × 2
department avg_salary
<chr> <dbl>
1 HR 52500
2 IT 65000
2.2.2 data.table Package for Fast Data Processing
The data.table package enhances data manipulation speed, especially for large
datasets.
Common chart types with the ggplot2 package:
● Scatter Plot:
ggplot(data, aes(x = var1, y = var2)) + geom_point()
● Bar Chart:
ggplot(data, aes(x = category, y = value)) + geom_bar(stat = "identity")
● Heatmaps:
ggplot(data, aes(x = var1, y = var2, fill = value)) + geom_tile()
● Pair Plots:
library(GGally)
ggpairs(data)
Pandas:
import pandas as pd
data = pd.read_csv("data.csv")
NumPy:
import numpy as np
Matplotlib:
import matplotlib.pyplot as plt
plt.show()
Folium (Geospatial Visualizations):
import folium
map = folium.Map(location=[45.5236, -122.6750])
map.save("map.html")
Below is a summary of packages, libraries, and functions commonly used for data manipulation in R, with their primary purposes:
● dplyr: core verbs for filtering, selecting, transforming, and summarizing data (filter(), select(), mutate(), arrange(), summarise()).
● data.table: fast aggregation, joins, and manipulation of large in-memory datasets.
● tidyr: reshaping data between wide and long formats (pivot_longer(), pivot_wider()).
● stringr: consistent functions for string manipulation.
● lubridate: parsing and manipulating dates and times.
IBM Watson Studio is a powerful cloud-based platform designed for data science,
machine learning, and AI model development. It provides tools for data preparation,
visualization, and model building, supporting both no-code and code-driven workflows
(Python, R, and SQL).
A project in Watson Studio serves as a workspace where you can store, manage, and
analyze datasets, build machine learning models, and collaborate with team members.
The Data Refinery tool in Watson Studio provides a powerful way to clean, transform,
and prepare datasets for analysis. It offers an interactive, drag-and-drop interface with
over 100 built-in operations, reducing the need for manual data preprocessing.
Data cleaning is essential before running any analysis. Some common built-in
operations include:
1. Removing Duplicates
4. Normalization/Standardization
● Extract relevant rows based on conditions (e.g., "filter all sales > $10,000").
● Sort data in ascending/descending order.
Feature engineering helps enhance data quality for better insights and model
performance.
2. Aggregating Data
3. Splitting Columns
4. String Manipulation
Data visualization helps uncover patterns, trends, and relationships. Watson Studio
provides built-in visualization tools that require no coding.
Scatter Plot
Used to:
● Identify correlations between two variables (e.g., Advertising Spend vs. Revenue).
● Compare employee experience vs. salary.
● Check whether higher temperature leads to increased ice cream sales.
The Iris dataset is one of the most famous datasets in machine learning and statistics.
It consists of 150 samples from three species of the Iris flower: Setosa, Versicolor, and
Virginica. Each sample has four features: sepal length, sepal width, petal length,
and petal width (all measured in centimeters).
The goal of this case study is to explore, visualize, and analyze the Iris dataset to
understand relationships between features, perform clustering, and derive insights.
The dataset can be loaded using Python’s seaborn, scikit-learn, or pandas libraries.
import pandas as pd
from sklearn import datasets

# Load the Iris dataset
iris = datasets.load_iris()

# Convert to DataFrame (add the species column so the dataset has 5 columns)
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in iris.target]
df.head()
df.shape
Output:
(150, 5) → 150 samples and 5 columns (4 features + 1 target)
df.columns
Output:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)', 'species']
Step 3: Summary Statistics
Statistical summaries help identify mean, median, min, max, and standard deviation
for each feature.
df.describe()
Observations:
● Petal width and length have higher variability than sepal measurements.
● Sepal width has the smallest range compared to other features.
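These observations can be reproduced with a per-species breakdown in pandas (column names follow scikit-learn's load_iris):

```python
import pandas as pd
from sklearn import datasets

# Rebuild the DataFrame with a species column
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = [iris.target_names[i] for i in iris.target]

# Mean of each feature per species; petal measurements separate species best
print(df.groupby("species").mean().round(2))
```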
Data visualization helps identify relationships and clusters within the dataset.
Step 1: Pairwise Scatter Plots
Pairwise scatter plots help understand how features relate to each other.
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of all four features, colored by species
sns.pairplot(df, hue='species')
plt.show()
Interpretation:
● Setosa forms a clearly separated cluster, especially in petal length and petal width, while Versicolor and Virginica partially overlap.
Takeaways:
Feature Importance:
● Petal Length and Petal Width are the most distinguishing features between
species.
● Sepal Width is less effective in differentiating between species.
Pattern Recognition:
● Setosa is cleanly separable from the other two species; Versicolor and Virginica overlap and are best distinguished by their petal measurements.
Business/Scientific Application:
Hands-On:
Case Study - Iris Dataset
https://colab.research.google.com/drive/1l7tMJEjonNjzNL5S0pTE2C_CObVPAmJa?usp
=sharing
Uni, Bi, Tri Analysis
https://colab.research.google.com/drive/1VZF-ucI2ooBuFNoOo1H0KZOrJ3yW1kb3?us
p=sharing