DV Unit 1&2 Notes

Unit 1 introduces statistics as a mathematical science focused on data collection, analysis, and interpretation, applicable across various fields. It differentiates between descriptive statistics, which summarize data, and inferential statistics, which make predictions about populations based on samples. Key concepts include data types, sampling methods, hypothesis testing, and common statistical tests like t-tests and ANOVA.

UNIT 1: Introduction to Statistics

1.1 Statistics

Statistics is a mathematical science that involves the collection, organization, analysis, interpretation, and presentation of data. It helps in making data-driven decisions by summarizing complex data into meaningful insights. Statistics is widely used in fields such as economics, medicine, engineering, social sciences, and business to analyze trends, make predictions, and test hypotheses.

Key Concepts:
●​ Data: Raw facts and figures collected for analysis. Data can be quantitative
(numerical) or qualitative (categorical).
●​ Population: The entire set of individuals, items, or data points of interest in a
study. For example, if you are studying the heights of all students in a school, the
population is all the students in that school.
●​ Sample: A subset of the population that is used to make inferences about the
entire population. For example, if you measure the heights of 50 students out of
500, those 50 students are your sample.
●​ Variable: A characteristic or attribute that can be measured or observed.
Variables can be independent (predictor) or dependent (outcome).
1.2 Types of Statistics

Statistics is broadly classified into two main categories:

1.2.1 Descriptive Statistics


Descriptive statistics summarize and describe the main features of a dataset. It
provides simple summaries about the sample and the measures. It is used to present
data in a meaningful way, often using measures of central tendency and measures of
dispersion.
●​ Measures of Central Tendency: Mean, Median, Mode.
●​ Measures of Dispersion: Range, Variance, Standard Deviation.
●​ Graphical Representations: Histograms, Bar Charts, Pie Charts, Boxplots.

1.2.2 Inferential Statistics


Inferential statistics uses data from a sample to make inferences or predictions about a
population. It involves:
●​ Hypothesis Testing: Testing assumptions about a population parameter.
●​ Confidence Intervals: Estimating the range within which a population parameter
lies.
●​ Regression Analysis: Modeling relationships between variables.
1.3 Descriptive Statistics

1.3.1 Measures of Central Tendency
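
These measures describe the centre of a dataset: the mean (arithmetic average), the median (middle value), and the mode (most frequent value). A minimal base R sketch, using a hypothetical vector of exam scores (note that R's built-in mode() reports the storage type, so a small helper computes the statistical mode):

# Hypothetical exam scores used only for illustration
scores <- c(56, 67, 67, 70, 72, 75, 75, 75, 81, 90)

mean(scores)     # arithmetic mean
median(scores)   # middle value of the sorted data

# Base R has no function for the statistical mode, so define a small helper
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(scores)  # most frequent value (75 in this example)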


1.3.2 Measures of Dispersion (Spread)

These measures describe how spread out the data is:


A box plot is a simple way to display a data distribution. It marks the minimum and maximum values at the ends of the whiskers, with the middle 50% of the data contained in the box. The box is divided by the median (middle value), and each section of the plot (whisker or half of the box) covers roughly 25% of all values.

Standard deviation describes how data spread around the average (mean) of a bell-shaped (normal) curve. In a normal distribution, the region within one standard deviation of the mean covers about 68% of the data.
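
A minimal base R sketch of these measures, using a hypothetical vector of heights (cm):

heights <- c(160, 162, 165, 168, 170, 171, 174, 178, 181, 195)

range(heights)         # minimum and maximum
diff(range(heights))   # range as a single number (max - min)
var(heights)           # sample variance
sd(heights)            # sample standard deviation
IQR(heights)           # interquartile range (spread of the middle 50%)

boxplot(heights)       # visualises median, quartiles, whiskers and outliers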
1.4 Inferential Statistics

Inferential statistics allows us to make predictions about a population using a sample.


It includes:

1.4.1 Parameter Estimation

Parameter estimation is a technique in inferential statistics used to estimate unknown population parameters based on a sample. Since it's often impractical to study an entire population, we use sample data to draw conclusions about the whole population.

Types of Parameter Estimation

1.​ Point Estimation


2.​ Interval Estimation

1. Point Estimation

Point estimation provides a single best estimate of an unknown population parameter.


It uses sample statistics (such as mean, variance, or proportion) as estimates for the
population parameters.
2. Interval Estimation

Interval estimation provides a range of values that likely contains the population
parameter, rather than a single estimate. This range is called a Confidence Interval
(CI).
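
A minimal R sketch of both ideas, using a simulated sample of heights (values are illustrative only). t.test() is one convenient way to obtain a confidence interval for a mean:

set.seed(42)
heights <- rnorm(50, mean = 170, sd = 6)    # hypothetical sample of 50 heights

# Point estimate: the sample mean as a single best guess of the population mean
mean(heights)

# Interval estimate: a 95% confidence interval for the population mean
t.test(heights, conf.level = 0.95)$conf.int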
Comparison of Point vs. Interval Estimation

| Feature | Point Estimation | Interval Estimation |
|---|---|---|
| Definition | Provides a single best estimate for a population parameter. | Provides a range of values likely to contain the population parameter. |
| Precision | High precision but no measure of uncertainty. | Lower precision but includes uncertainty. |
| Example | Sample mean = 170 cm as an estimate of the population mean. | 95% confidence interval = (168.43 cm, 171.57 cm). |
| Use Case | Quick estimates when confidence levels are not required. | When understanding uncertainty and reliability is important. |
1.4.3 Types of Inferential Tests

Inferential statistics help us make conclusions about a population based on sample data.

1.4.3.1 Hypothesis Testing

Hypothesis testing is a fundamental statistical method used to make decisions about a population based on sample data. It helps determine whether observed patterns in the data are due to random chance or whether there is enough evidence to support a particular claim.
Key Concepts in Hypothesis Testing

Null Hypothesis (H₀)

●​ Represents the default assumption that there is no effect, relationship, or


difference in the population.
●​ It serves as the baseline for comparison.
●​ Example: "There is no difference in test scores between students who study in the
morning and those who study at night."

Alternative Hypothesis (H₁)

●​ Represents the claim that there is an effect, relationship, or difference.


●​ It is what the researcher aims to provide evidence for.
●​ Example: "Students who study in the morning have higher test scores than those
who study at night."

Types of Hypothesis Tests

One-Tailed Test

●​ Used when the direction of the effect is specified.


●​ Tests for an increase or decrease in one specific direction.
●​ Example: "New drug A leads to higher recovery rates than drug B."
●​ Mathematical Representation:
○​ H₀ : μ₁ ≤ μ₂
○​ H₁ : μ₁ > μ₂

Two-Tailed Test

●​ Used when the direction of the effect is unknown or when differences in either
direction matter.
●​ Example: "New drug A has a different effect on recovery rates compared to drug B
(could be higher or lower)."
●​ Mathematical Representation:
○​ H₀ : μ₁ = μ₂
○​ H₁ : μ₁ ≠ μ₂
Decision Criteria in Hypothesis Testing

P-Value (Probability Value)

●​ Measures the probability of obtaining the observed results if the null hypothesis
is true.
●​ Interpretation:
○​ p < 0.05 → Strong evidence against H₀; reject H₀.
○​ p ≥ 0.05 → Insufficient evidence; fail to reject H₀.

Level of Significance (α)

●​ The pre-defined threshold for rejecting the null hypothesis.


●​ Common values:
○​ α=0.05 (5%) → 5% risk of wrongly rejecting H₀​.
○​ α=0.01 (1%) → More stringent; used in critical applications.

Errors in Hypothesis Testing

Type I Error (False Positive)

●​ Incorrectly rejecting H₀​when it is actually true.


●​ Example: Concluding that a new drug is effective when it actually is not.
●​ Probability of Type I Error = α (level of significance).

Type II Error (False Negative)

●​ Failing to reject H₀​when H₁​is actually true.


●​ Example: Concluding that a new drug has no effect when it actually does.
●​ Probability of Type II Error = β.

Steps in Hypothesis Testing

1.​ Define the null and alternative hypotheses.


2.​ Choose the significance level (α).
3.​ Select the appropriate test (Z-test, t-test, etc.).
4.​ Compute the test statistic.
5.​ Compare the p-value with α and make a decision.
6.​ Interpret the results in the context of the problem.
Hypothesis testing is a powerful tool in statistics, widely used in research, business
analytics, medicine, and many other fields. It provides a structured approach to making
data-driven decisions while controlling for errors and uncertainty.

1. T-Test (Student’s t-test)

Used to compare the means of two groups to determine if they are significantly different
from each other.

Example:

●​ One-sample t-test: Compares the sample mean to a known population mean.


○​ Example: Testing if the average weight of apples in an orchard differs from
200 grams.
●​ Independent (Unpaired) t-test: Compares means between two independent
groups.
○​ Example: Comparing test scores of students from two different schools.
●​ Paired t-test: Compares means from the same group before and after an
intervention.
○​ Example: Testing if a training program improves employee productivity.
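
A minimal R sketch of the three variants using t.test() on simulated data (all values are hypothetical):

set.seed(1)

# One-sample t-test: is the mean apple weight different from 200 g?
weights <- rnorm(30, mean = 205, sd = 10)
t.test(weights, mu = 200)

# Independent (unpaired) t-test: test scores from two different schools
school_a <- rnorm(25, mean = 70, sd = 8)
school_b <- rnorm(25, mean = 75, sd = 8)
t.test(school_a, school_b)

# Paired t-test: productivity before and after a training program
before <- rnorm(20, mean = 50, sd = 5)
after  <- before + rnorm(20, mean = 3, sd = 2)
t.test(after, before, paired = TRUE)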
2. Chi-Square Test (χ² Test)

Used to test relationships between categorical variables (nominal data).

Example:

●​ Chi-Square Goodness of Fit Test: Checks if a sample distribution matches an


expected distribution.
○​ Example: Testing if customer preferences for three product flavors are
equally distributed.
●​ Chi-Square Test for Independence: Determines if two categorical variables are
related.
○​ Example: Examining whether gender is associated with voting preference.
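
A minimal R sketch using chisq.test() with hypothetical counts:

# Goodness of fit: are three flavours equally preferred?
flavour_counts <- c(vanilla = 30, chocolate = 45, strawberry = 25)
chisq.test(flavour_counts, p = c(1/3, 1/3, 1/3))

# Test for independence: is gender associated with voting preference?
votes <- matrix(c(40, 60,
                  55, 45),
                nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("Female", "Male"),
                                preference = c("Party A", "Party B")))
chisq.test(votes)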

3. ANOVA (Analysis of Variance)

Used to compare means across three or more groups to see if at least one group is
significantly different.
●​ One-Way ANOVA: Compares means of one independent variable with multiple
groups.
○​ Example: Comparing students' math scores across three different schools.
●​ Two-Way ANOVA: Examines the effect of two independent variables on a
dependent variable.
○​ Example: Testing how both teaching method and study time affect exam
scores.
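
A minimal R sketch using aov() on simulated scores (the school and teaching-method effects are made up):

set.seed(2)
scores <- data.frame(
  school = rep(c("A", "B", "C"), each = 20),
  method = rep(c("Lecture", "Flipped"), times = 30),
  score  = c(rnorm(20, 70, 8), rnorm(20, 75, 8), rnorm(20, 72, 8))
)

# One-way ANOVA: does the mean score differ across schools?
summary(aov(score ~ school, data = scores))

# Two-way ANOVA: effects of school and teaching method (including interaction)
summary(aov(score ~ school * method, data = scores))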

4. Z-Test

●​ Used for large sample sizes (n>30) and when the population variance is known.
●​ Example: Testing whether the average height of a population differs from a
known value.
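
Base R has no dedicated z-test function, so a minimal sketch computes the statistic directly from z = (x̄ − μ₀) / (σ / √n), using hypothetical numbers:

x_bar <- 172    # sample mean
mu0   <- 170    # hypothesised population mean
sigma <- 6      # known population standard deviation
n     <- 50     # sample size

z <- (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
p_value <- 2 * pnorm(-abs(z))            # two-tailed p-value
z
p_value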

5. Regression Analysis

Used to determine relationships between independent (predictor) and dependent (outcome) variables.
Regression Line

●​ Simple Linear Regression: Examines the relationship between one


independent and one dependent variable.
○​ Example: Predicting house price based on square footage.
●​ Multiple Linear Regression: Involves two or more independent variables.
○​ Example: Predicting employee salary based on experience, education
level, and job role.
●​ Logistic Regression: Used for binary categorical outcomes (e.g., Yes/No, 0/1).
○​ Example: Predicting whether a customer will buy a product (Yes/No)
based on ad engagement.
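
A minimal R sketch of the three regression types with lm() and glm(), using simulated housing data (variables and coefficients are made up):

set.seed(3)
houses <- data.frame(
  sqft = runif(100, 500, 3000),
  beds = sample(1:5, 100, replace = TRUE)
)
houses$price <- 50000 + 120 * houses$sqft + 8000 * houses$beds + rnorm(100, sd = 20000)
houses$sold  <- rbinom(100, 1, plogis(-3 + 0.002 * houses$sqft))   # binary outcome

# Simple linear regression: price explained by square footage
summary(lm(price ~ sqft, data = houses))

# Multiple linear regression: price explained by square footage and bedrooms
summary(lm(price ~ sqft + beds, data = houses))

# Logistic regression: probability of a sale (binary Yes/No outcome)
summary(glm(sold ~ sqft, data = houses, family = binomial))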

6. Covariance Analysis (ANCOVA – Analysis of Covariance)

Covariance measures how two variables change together; ANCOVA compares group means on an outcome while controlling for one or more other variables (covariates).

●​ Covariance vs. Correlation:


○​ Covariance measures the direction of the relationship between two
variables but not the strength.
○​ Correlation standardizes this measure to range between -1 and 1,
indicating both strength and direction.
●​ ANCOVA (Analysis of Covariance): Extends ANOVA by controlling for one or
more covariates.
○​ Example: Comparing students’ final exam scores across schools while
controlling for previous academic performance.
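
A minimal R sketch, with simulated exam data, of covariance and correlation plus an ANCOVA fitted with aov() (prior score as the covariate):

set.seed(4)
d <- data.frame(
  school = rep(c("A", "B"), each = 30),
  prior  = rnorm(60, mean = 65, sd = 10)
)
d$final <- 10 + 0.8 * d$prior + ifelse(d$school == "B", 5, 0) + rnorm(60, sd = 5)

cov(d$prior, d$final)   # direction of the joint variation
cor(d$prior, d$final)   # standardised strength and direction (-1 to 1)

# ANCOVA: compare schools on the final score while adjusting for prior performance
summary(aov(final ~ prior + school, data = d))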

Inferential Statistics - Definition, Types, Examples, Formulas -


https://www.cuemath.com/data/inferential-statistics/
1.5 Probability Distributions

Probability distributions describe how values of a random variable are distributed.

1.5.1 Types of Random Variables


●​ Discrete Random Variables: Finite set of values (e.g., number of students in a
class).
●​ Continuous Random Variables: Infinite possible values within a range (e.g.,
height of students).

1.5.2 Common Distributions


●​ Normal Distribution (Bell Curve): Used in many real-world scenarios. It is
symmetric and characterized by its mean (μ) and standard deviation (σ).
●​ Binomial Distribution: For binary outcomes (success/failure).
●​ Poisson Distribution: Used for rare event occurrences.
●​ Bernoulli Distribution: A special case of the binomial distribution with a single
trial.
●​ Uniform Distribution: All outcomes are equally likely.

Example:

●​ Poisson Distribution: P(X = k) = (λ^k · e^(−λ)) / k!, for k = 0, 1, 2, …, where λ is the average number of events per interval.
●​ Bernoulli Distribution: P(X = x) = p^x (1 − p)^(1 − x), for x ∈ {0, 1}, where p is the probability of success.
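
A minimal R sketch of working with these distributions through base R's d*/p*/r* functions (all parameter values are illustrative):

pnorm(180, mean = 170, sd = 10)    # Normal: P(X <= 180) for mean 170, sd 10
dbinom(7, size = 10, prob = 0.5)   # Binomial: exactly 7 successes in 10 trials
dpois(2, lambda = 3)               # Poisson: exactly 2 events when the rate is 3
dbinom(1, size = 1, prob = 0.3)    # Bernoulli: a binomial with a single trial
runif(5, min = 0, max = 1)         # Uniform: 5 equally likely values in [0, 1]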
1.6 Sampling and Sampling Distributions

Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population.

1.6.1 Types of Sampling

Sampling is the process of selecting a subset (sample) from a larger group (population)
to make statistical inferences about the whole population. The accuracy and reliability
of results depend on the sampling method used.

Sampling is broadly classified into Probability Sampling and Non-Probability Sampling.

1. Probability Sampling

In probability sampling, each individual in the population has a known and non-zero
probability of being selected. This ensures that the sample is representative and
reduces bias.

1.1 Random Sampling (Simple Random Sampling - SRS)

Every individual in the population has an equal and independent chance of being
selected. This is the most basic and widely used sampling method.

How It Works:

●​ Assign a number to every individual in the population.


●​ Use a random number generator (e.g., lottery method, computer-generated
random numbers) to select individuals.

Example:

●​ A researcher selects 100 students at random from a university database to study


their academic performance.

Advantages:

Completely unbiased and ensures fair selection.​


Easy to implement for small populations.
Disadvantages:

Not efficient for large populations.​


May lead to under-representation of some groups.
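
A minimal base R sketch using sample() on a hypothetical population of 500 student IDs:

set.seed(5)
students <- 1:500                    # population of 500 students
srs <- sample(students, size = 50)   # 50 students drawn at random without replacement
head(srs)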

1.2 Stratified Sampling

The population is divided into homogeneous subgroups (strata), and individuals are
randomly selected from each stratum proportionally.

How It Works:

●​ Identify subgroups (strata) based on characteristics (e.g., gender, age, income).


●​ Randomly select participants from each subgroup in proportion to the
population.

Example:

●​ A school wants to survey students' study habits and stratifies by grade level
(freshman, sophomore, junior, senior), ensuring proportional representation.

Advantages:

Ensures all groups are represented.​


Provides higher precision than simple random sampling.

Disadvantages:

Requires knowledge of population characteristics.​


Can be time-consuming to divide the population into strata.
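
A minimal sketch of proportional stratified sampling using dplyr's group_by() and slice_sample() (hypothetical grade-level counts):

library(dplyr)
set.seed(6)
students <- data.frame(
  id    = 1:400,
  grade = rep(c("Freshman", "Sophomore", "Junior", "Senior"),
              times = c(160, 120, 80, 40))
)

# Draw 10% from each grade level, preserving the strata proportions
strat_sample <- students %>%
  group_by(grade) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()

table(strat_sample$grade)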

1.3 Cluster Sampling

The population is divided into clusters, and entire clusters are randomly selected
instead of individual members.

How It Works:

●​ Divide the population into clusters (e.g., neighborhoods, schools, departments).


●​ Randomly select entire clusters instead of individuals.
Example:

●​ To survey school performance in a city, entire schools (clusters) are selected


randomly instead of individual students.

Advantages:

Cost-effective and time-saving.​


Suitable for large populations spread across a wide area.

Disadvantages:

Can lead to higher sampling error if clusters are not diverse.​


May not be representative if clusters vary significantly.

1.4 Systematic Sampling

Every k-th individual from a population list is selected (where k=N/n, N = population
size, n = sample size).

How It Works:

●​ Choose a starting point randomly.


●​ Select every k-th person (e.g., every 5th, 10th, etc.).

Example:

●​ A company wants to survey employees. It selects every 10th employee from a


list of 1,000 employees.

Advantages:

Easier and faster than random sampling.​


Ensures even coverage of the population.

Disadvantages:

If there’s an underlying pattern in the population list, it may introduce bias.​


Not suitable for populations with irregular distributions.
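
A minimal base R sketch for a hypothetical list of 1,000 employees:

set.seed(7)
N <- 1000                          # population size
n <- 100                           # desired sample size
k <- N / n                         # sampling interval (every k-th employee)
start <- sample(1:k, 1)            # random starting point
selected <- seq(from = start, to = N, by = k)
length(selected)                   # 100 employees selected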
2. Non-Probability Sampling

In non-probability sampling, individuals are selected based on non-random criteria, meaning some individuals have a higher chance of being selected than others. This can introduce bias but is useful for exploratory research.

2.1 Convenience Sampling

Participants are selected based on ease of access and availability.

How It Works:

●​ Select participants who are easily accessible (e.g., students in a nearby class,
people at a mall).

Example:

●​ A researcher surveys passersby in a shopping mall about their shopping habits.

Advantages:

Quick and easy to collect data.​


Useful for preliminary research.

Disadvantages:

Highly biased and not representative.​


Results cannot be generalized to the population.

2.2 Judgment (Purposive) Sampling

The researcher selects participants based on specific characteristics or judgment about who will provide the best information.

How It Works:

●​ Choose individuals who best meet the research criteria.

Example:

●​ A medical researcher selects only experienced doctors to study a new treatment


method.
Advantages:

Useful when only experts can provide relevant data.​


Effective for qualitative research.

Disadvantages:

Researcher bias can affect selection.​


Not generalizable to the population.

Comparison of Probability vs. Non-Probability Sampling

| Feature | Probability Sampling | Non-Probability Sampling |
|---|---|---|
| Definition | Every individual has a known probability of being selected. | Selection is based on convenience, judgment, or other non-random methods. |
| Bias | Low bias (more representative). | High bias (less representative). |
| Use Case | Used for statistical inferences. | Used for exploratory or qualitative research. |
| Examples | Random, Stratified, Cluster, Systematic Sampling. | Convenience, Judgment Sampling. |

1.6.2 Sampling Distribution
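
A sampling distribution is the probability distribution of a statistic (such as the sample mean) obtained from repeated random samples of the same size drawn from the population. A minimal R sketch that simulates the sampling distribution of the sample mean from a hypothetical skewed population:

set.seed(8)
population <- rexp(10000, rate = 1/170)    # hypothetical right-skewed population
sample_means <- replicate(1000, mean(sample(population, size = 50)))

hist(sample_means)     # approximately bell-shaped even though the population is skewed
mean(sample_means)     # close to the population mean
sd(sample_means)       # the standard error of the mean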


1. Overview and About R

R is a powerful open-source programming language and environment specifically designed for statistical computing, data analysis, and data visualization. It is widely used in academia, research, and industry due to its flexibility, extensive libraries, and strong community support.
●​ Key Features of R:
○​ Open-source and free to use.
○​ Extensive packages for statistical analysis and visualization (e.g.,
ggplot2, dplyr, tidyr).
○​ Strong support for data manipulation and cleaning.
○​ Cross-platform compatibility (Windows, macOS, Linux).
○​ Integration with other tools like Python, SQL, and Excel.
●​ Relevance to Data Visualization:
○​ R provides a wide range of visualization libraries to create high-quality
graphs, charts, and plots.
○​ It allows customization of visualizations to meet specific analytical needs.
○​ R is particularly useful for exploratory data analysis (EDA), where
visualizations help uncover patterns, trends, and outliers in data.
2. R and R Studio Installation

R Studio is an Integrated Development Environment (IDE) for R that simplifies coding, debugging, and visualization. It provides a user-friendly interface for writing scripts, managing data, and viewing visualizations.
●​ Installation Steps:
○​ Install R:
■​ Visit the Comprehensive R Archive Network (CRAN) website:
https://cran.r-project.org/.
■​ Download the appropriate version for your operating system.
■​ Follow the installation instructions.
○​ Install R Studio:
■​ Visit the R Studio website: https://www.rstudio.com/.
■​ Download the free version of R Studio Desktop.
■​ Install it on your system.
●​ Relevance to Data Visualization:
○​ R Studio provides a seamless environment for creating and managing
visualizations.
○​ Features like the "Plots" pane allow real-time viewing of graphs and charts.
○​ R Studio supports interactive visualizations using packages like shiny.

3. Descriptive Data Analysis Using R

Descriptive data analysis involves summarizing and describing the main features of a
dataset. R provides a variety of functions and packages to perform descriptive analysis.
●​ Steps in Descriptive Data Analysis:
○​ Data Import:
■​ Use functions like read.csv(), read.table(), or
read_excel() to import data into R.
○​ Data Exploration:
■​ Use functions like head(), tail(), and str() to explore the
structure of the dataset.
○​ Summary Statistics:
■​ Use summary() to get an overview of the dataset (e.g., mean,
median, quartiles).
■​ Use mean(), median(), sd(), and var() for specific statistical
measures.
○​ Data Cleaning:
■​ Handle missing values using na.omit() or na.fill().
■​ Remove duplicates using unique().
○​ Data Visualization:
■​ Use basic plots like histograms (hist()), boxplots (boxplot()),
and scatterplots (plot()) to visualize data distributions and
relationships.
●​ Relevance to Data Visualization:
○​ Descriptive analysis is the foundation of data visualization.
○​ Visualizations like histograms and boxplots help in understanding data
distributions and identifying outliers.
○​ Summary statistics provide context for interpreting visualizations.
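
A minimal sketch tying these steps together, using R's built-in iris dataset in place of an imported file (swap in read.csv("your_file.csv") for your own data):

df <- iris

head(df)        # first rows
str(df)         # structure and data types
summary(df)     # mean, median, quartiles for each column

mean(df$Sepal.Length)
sd(df$Sepal.Length)

df <- na.omit(df)   # drop rows with missing values (iris has none; shown for completeness)
df <- unique(df)    # remove duplicate rows

hist(df$Sepal.Length)
boxplot(Sepal.Length ~ Species, data = df)
plot(df$Sepal.Length, df$Petal.Length)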
4. Description of Basic Functions Used to Describe Data in R

R provides a wide range of functions to describe and summarize data. Below are some
commonly used functions:
●​ Data Exploration:
○​ head(): Displays the first few rows of the dataset.
○​ tail(): Displays the last few rows of the dataset.
○​ str(): Provides the structure of the dataset (e.g., data types,
dimensions).
●​ Summary Statistics:
○​ summary(): Generates summary statistics for numerical and categorical
variables.
○​ mean(), median(), sd(), var(): Calculate specific statistical measures.
●​ Data Visualization:
○​ hist(): Creates histograms to visualize data distributions.
○​ boxplot(): Creates boxplots to visualize data spread and outliers.
○​ plot(): Creates scatterplots to visualize relationships between variables.
○​ barplot(): Creates bar charts for categorical data.
●​ Data Manipulation:
○​ subset(): Extracts subsets of data based on conditions.
○​ aggregate(): Aggregates data based on specific criteria.
○​ table(): Creates frequency tables for categorical variables.
●​ Relevance to Data Visualization:
○​ These functions help in preparing data for visualization by summarizing
and cleaning it.
○​ Basic plots provide quick insights into data patterns and relationships.
○​ Advanced visualization libraries like ggplot2 build on these basic
functions to create more complex and customized visualizations.
UNIT 2: Data Manipulation with R

2.1 Introduction to R

R is a programming language widely used for statistical computing and data visualization. It offers powerful libraries such as ggplot2, dplyr, and tidyr for data analysis.

2.1.1 Installing R and RStudio (A Installing R and RStudio | Hands-On Programming


with R - https://rstudio-education.github.io/hopr/starting.html)

1.​ Download and install R from CRAN. ​


The Comprehensive R Archive Network - https://cran.r-project.org/
2.​ Download and install RStudio from RStudio.​
RStudio Desktop - Posit - https://posit.co/download/rstudio-desktop/

2.2 Data Manipulation in R

2.2.1 dplyr Package for Data Manipulation

The dplyr package provides a set of efficient functions for handling tabular data,
enabling seamless data manipulation.

●​ filter(): Selects rows based on specified conditions.​


filter(data, condition)​
Example: Extracting rows where age is greater than 30.​
filter(df, age > 30)
●​ select(): Chooses specific columns from the dataset.​
select(data, column1, column2)​
Example: Selecting only name and age columns.​
select(df, name, age)
●​ arrange(): Sorts data in ascending or descending order.​
arrange(data, column)​
Example: Sorting by salary in descending order.​
arrange(df, desc(salary))
●​ mutate(): Creates new variables based on existing ones.​
mutate(data, new_column = expression)​
Example: Creating a tax column based on salary.​
mutate(df, tax = salary * 0.1)
●​ group_by() and summarise(): Aggregates data for analysis.​
summarise(group_by(data, column), mean_value = mean(column))​
Example: Finding average salary by department.​
summarise(group_by(df, department), avg_salary = mean(salary))

Complete Example with a Sample Dataset

# Load dplyr for the functions used below
library(dplyr)

# Sample dataset
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 35, 30, 40),
  salary = c(50000, 60000, 55000, 70000),
  department = c("HR", "IT", "HR", "IT")
)

# 1. Filter rows where age is greater than 30

filtered_data <- filter(df, age > 30)

print("Filtered Data:")

print(filtered_data)

# 2. Select only name and age columns

selected_data <- select(df, name, age)

print("Selected Data:")

print(selected_data)
# 3. Sort by salary in descending order

sorted_data <- arrange(df, desc(salary))

print("Sorted Data:")

print(sorted_data)

# 4. Create a new tax column (10% of salary)

mutated_data <- mutate(df, tax = salary * 0.1)

print("Mutated Data:")

print(mutated_data)

# 5. Find average salary by department

summary_data <- df %>%

group_by(department) %>%

summarise(avg_salary = mean(salary))

print("Summary Data:")

print(summary_data)

Output of the Complete Example

"Filtered Data:"

name age salary department

1 Bob 35 60000 IT

2 David 40 70000 IT
"Selected Data:"

name age

1 Alice 25

2 Bob 35

3 Charlie 30

4 David 40

"Sorted Data:"

name age salary department

1 David 40 70000 IT

2 Bob 35 60000 IT

3 Charlie 30 55000 HR

4 Alice 25 50000 HR

"Mutated Data:"

name age salary department tax

1 Alice 25 50000 HR 5000

2 Bob 35 60000 IT 6000

3 Charlie 30 55000 HR 5500

4 David 40 70000 IT 7000


"Summary Data:"

# A tibble: 2 × 2

department avg_salary

<chr> <dbl>

1 HR 52500

2 IT 65000
2.2.2 data.table Package for Fast Data Processing

The data.table package enhances data manipulation speed, especially for large
datasets.

●​ Syntax: DT[i, j, by]


○​ i: Filters rows
○​ j: Selects/manipulates columns
○​ by: Groups data
●​ Example: Selecting name and salary for employees older than 30.​
data_table[age > 30, .(name, salary)]
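
A minimal sketch, assuming the same hypothetical employee data as in the dplyr example, showing how a data frame is converted to a data.table and queried with the DT[i, j, by] syntax:

library(data.table)
data_table <- as.data.table(data.frame(
  name       = c("Alice", "Bob", "Charlie", "David"),
  age        = c(25, 35, 30, 40),
  salary     = c(50000, 60000, 55000, 70000),
  department = c("HR", "IT", "HR", "IT")
))

data_table[age > 30, .(name, salary)]                          # i filters, j selects
data_table[, .(avg_salary = mean(salary)), by = department]    # by groups and aggregates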

2.2.3 tidyr Package for Reshaping Data

The tidyr package provides tools for reshaping data formats.

●​ gather(): Converts wide-format data to long format.​


gather(data, key, value, columns_to_gather)​
Example: Transforming multiple year columns into a single year-value pair.
●​ spread(): Converts long-format data to wide format.​
spread(data, key, value)

●​ separate() and unite(): Splits and combines columns.​
separate(data, column, into = c("col1", "col2"))
unite(data, new_column, col1, col2)
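
A minimal sketch of reshaping with gather(), spread(), and separate(), using a hypothetical sales table:

library(tidyr)
sales_wide <- data.frame(
  region = c("North", "South"),
  yr2022 = c(100, 150),
  yr2023 = c(120, 170)
)

# Wide -> long: one row per region-year combination
sales_long <- gather(sales_wide, key = year, value = sales, yr2022, yr2023)
sales_long

# Long -> wide: back to one column per year
spread(sales_long, key = year, value = sales)

# Split one column into two
separate(data.frame(name = "Ada Lovelace"), name, into = c("first", "last"), sep = " ")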

2.2.4 Additional R Functions & Packages


●​ readr: Reads various file types like CSV and TSV.​
library(readr)
data <- read_csv("data.csv")

●​ lubridate: Simplifies date/time handling.​
library(lubridate)
date <- ymd("2023-10-01")


2.3 Data Visualization in R

2.3.1 ggplot2 Package

●​ Scatter Plot:​
ggplot(data, aes(x = var1, y = var2)) + geom_point()

●​ Bar Chart:​
ggplot(data, aes(x = category, y = value)) + geom_bar(stat = "identity")

2.3.2 Advanced Visualizations

●​ Heatmaps:​
ggplot(data, aes(x = var1, y = var2, fill = value)) + geom_tile()

●​ Pair Plots:​
library(GGally)
ggpairs(data)

2.4 Python for Data Analysis

2.4.1 Key Libraries


●​ Pandas:​
import pandas as pd
data = pd.read_csv("data.csv")

●​ NumPy:​
import numpy as np
array = np.array([1, 2, 3])

●​ Matplotlib & Seaborn:​
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x = data['var1'], y = data['var2'])
plt.show()

●​ Folium (Geospatial Visualizations):​
import folium
map = folium.Map(location = [45.5236, -122.6750])
map.save("map.html")

2.4.2 Advanced Operations

●​ Handling Missing Values:​


data.fillna(0, inplace=True)
●​ Grouping & Aggregations:​
data.groupby('category').agg({'value': 'mean'})

Below is a table summarizing the packages, libraries, and functions commonly used for
Data Manipulation in R. This table includes the most popular tools and their primary
purposes.

| Package/Library | Functions | Purpose |
|---|---|---|
| dplyr | filter() | Select rows based on conditions. |
| | select() | Choose specific columns. |
| | arrange() | Sort data in ascending or descending order. |
| | mutate() | Create new variables or modify existing ones. |
| | group_by() | Group data by one or more columns. |
| | summarise() | Aggregate data (e.g., mean, sum) for grouped data. |
| | rename() | Rename columns. |
| | distinct() | Remove duplicate rows. |
| | left_join(), right_join(), etc. | Join two datasets based on common columns. |
| tidyr | gather() | Convert wide data to long format (reshaping). |
| | spread() | Convert long data to wide format (reshaping). |
| | separate() | Split one column into multiple columns. |
| | unite() | Combine multiple columns into one. |
| | drop_na() | Remove rows with missing values. |
| | fill() | Fill missing values with previous or next values. |
| data.table | data.table() | Create a data.table object (enhanced data frame). |
| | setkey() | Set keys for fast indexing and joins. |
| | fread() | Fast reading of large datasets. |
| | fwrite() | Fast writing of large datasets. |
| | [i, j, by] syntax | Perform fast filtering, aggregation, and grouping. |
| reshape2 | melt() | Convert wide data to long format. |
| | dcast() | Convert long data to wide format. |
| stringr | str_replace() | Replace substrings in a string. |
| | str_split() | Split strings into substrings. |
| | str_detect() | Detect patterns in strings. |
| | str_extract() | Extract substrings matching a pattern. |
| | str_trim() | Remove whitespace from strings. |
| lubridate | ymd(), mdy(), dmy() | Parse dates from strings. |
| | year(), month(), day() | Extract components of a date. |
| | interval() | Define time intervals. |
| | difftime() | Calculate differences between dates. |
| purrr | map() | Apply a function to each element of a list or vector. |
| | map_dbl(), map_chr(), etc. | Apply a function and return a specific type (e.g., numeric, character). |
| | reduce() | Reduce a list to a single value by iteratively applying a function. |
| | flatten() | Flatten a nested list. |
| readr | read_csv() | Read CSV files into a data frame. |
| | read_tsv() | Read tab-separated files into a data frame. |
| | write_csv() | Write data frames to CSV files. |
| janitor | clean_names() | Clean column names (e.g., remove spaces, special characters). |
| | remove_empty() | Remove empty rows or columns. |
| | tabyl() | Create frequency tables. |
| sqldf | sqldf() | Run SQL queries on data frames. |
| tidyverse | Collection of packages (dplyr, tidyr, readr, etc.) | A meta-package for data manipulation and analysis. |

2.5 IBM Watson Studio - (IBM Watson Studio : https://www.ibm.com/products/watson-studio)​


Introduction to IBM Watson Studio - IBM Developer -
https://developer.ibm.com/learningpaths/get-started-watson-studio/introduction-watson-studio/​
IBM Watson Studio: A Comprehensive Guide

IBM Watson Studio is a powerful cloud-based platform designed for data science,
machine learning, and AI model development. It provides tools for data preparation,
visualization, and model building, supporting both no-code and code-driven workflows
(Python, R, and SQL).

2.5.1 Creating Projects

A project in Watson Studio serves as a workspace where you can store, manage, and
analyze datasets, build machine learning models, and collaborate with team members.

Step 1: Log in to IBM Watson Studio

1.​ Go to the IBM Cloud and log in.


2.​ If you don't have an IBM Cloud account, you'll need to create one (free and paid
tiers are available).
3.​ Once logged in, go to IBM Watson Studio from the IBM Cloud dashboard.

Step 2: Create a New Project

1.​ Click on "Create a Project".


2.​ Choose a project type based on your needs:
○​ Standard Project – For general data science, AI, and machine learning.
○​ Federated Learning Project – For training AI models across multiple data
sources without centralizing data.
3.​ Provide a project name and select an IBM Cloud Object Storage instance
(required for storing datasets and notebooks).
4.​ Click "Create" to complete the setup.

Step 3: Upload Your Dataset

1.​ Open the project and navigate to Assets → Add Data.


2.​ Select one of the following options:
○​ Upload from local machine (supports CSV, Excel, JSON, Parquet).
○​ Connect to a database (IBM Db2, MySQL, PostgreSQL, etc.).
○​ Fetch from cloud storage (AWS S3, IBM Cloud Object Storage).
3.​ Once uploaded, the dataset appears under Data Assets, ready for analysis.
2.5.2 Data Refinery (Data Cleaning & Preparation)

The Data Refinery tool in Watson Studio provides a powerful way to clean, transform,
and prepare datasets for analysis. It offers an interactive, drag-and-drop interface with
over 100 built-in operations, reducing the need for manual data preprocessing.

Step 1: Open Data Refinery

1.​ In the project dashboard, go to Assets → Data Refinery.


2.​ Select your uploaded dataset to open the Data Refinery interface.

Step 2: Data Cleaning & Preprocessing

Data cleaning is essential before running any analysis. Some common built-in operations include:

1. Removing Duplicates

●​ Duplicate records can skew results.


●​ Use the "Remove Duplicates" function to delete redundant rows.

2. Handling Missing Values

●​ Missing values (NULLs) can negatively impact machine learning models.


●​ Possible approaches:
○​ Remove missing values if they are minimal.
○​ Fill missing values with:
■​ Mean/Median (for numerical data).
■​ Mode (for categorical data).
■​ Custom values (like "Unknown" for text data).

3. Data Type Conversion

●​ Ensure consistency in data types:


○​ Convert categorical values into numerical (e.g., "Yes"/"No" → 1/0).
○​ Change date formats to a standard structure.

4. Normalization/Standardization

●​ Scale data for better model performance.


●​ Normalize numeric values if needed.
5. Filtering and Sorting

●​ Extract relevant rows based on conditions (e.g., "filter all sales > $10,000").
●​ Sort data in ascending/descending order.

Step 3: Data Transformation & Feature Engineering

Feature engineering helps enhance data quality for better insights and model
performance.

1. Creating New Features

●​ Generate new columns based on existing data.


○​ Example: Create a "Total Revenue" column as Price × Quantity
Sold.

2. Aggregating Data

●​ Summarize information using functions like:


○​ Mean, Sum, Count, Min, Max.
○​ Example: Calculate average salary per department.

3. Splitting Columns

●​ Extract information from text fields.


○​ Example: Split "Full Name" into "First Name" and "Last Name".

4. String Manipulation

●​ Replace values (e.g., changing "N/A" to "Unknown").


●​ Change case (uppercase/lowercase).
●​ Trim spaces for consistency.

Step 4: Save & Automate Data Cleaning

●​ Click "Save and Create Job" to automate the cleaning pipeline.


●​ Export the refined dataset for further use as:
○​ CSV, JSON, Parquet
○​ Database tables
○​ IBM Cloud Object Storage
2.5.3 Visualizing Data

Data visualization helps uncover patterns, trends, and relationships. Watson Studio
provides built-in visualization tools that require no coding.

1. Bar Charts (Categorical Comparisons)​


Use Case:

●​ Comparing sales across different regions.


●​ Analyzing customer preferences by category (e.g., product type).
●​ Measuring survey responses (e.g., Yes/No).

How to Create:

1.​ Select Bar Chart.


2.​ Set X-axis: Categorical variable (e.g., Product Category).
3.​ Set Y-axis: Numeric variable (e.g., Total Sales).
4.​ Customize colors, labels, and tooltips.

2. Line Charts (Trends Over Time)​


Use Case:

●​ Tracking stock prices or revenue over months/years.


●​ Analyzing temperature changes over seasons.
●​ Monitoring customer engagement over time.
●​ Analyze time-based trends (e.g., website traffic).

How to Create:

1.​ Select Line Chart.


2.​ Assign a time-based variable to the X-axis (e.g., Date).
3.​ Assign a numeric value to the Y-axis (e.g., Sales, Temperature).
4.​ Adjust smoothing and trend lines.

3. Scatter Plots (Relationship Between Variables)​


Use Case:

●​ Identify correlations between two variables (e.g., Advertising Spend vs. Revenue).
●​ Comparing employee experience vs. salary.
●​ Checking if higher temperature leads to increased ice cream sales.
How to Create:

1.​ Select Scatter Plot.


2.​ Assign X-axis: First numeric variable (e.g., Marketing Budget).
3.​ Assign Y-axis: Second numeric variable (e.g., Sales).
4.​ Add a trend line to check correlations.

4. Boxplots (Data Distribution & Outliers) (Summarizing Distributions)​


Use Case:

●​ Identifying outliers in salary distributions.


●​ Comparing test scores across different groups.
●​ Understanding data spread (minimum, median, maximum values).

How to Create:

1.​ Choose Boxplot.


2.​ Assign a categorical variable to the X-axis (e.g., Department).
3.​ Assign a numeric variable to the Y-axis (e.g., Employee Salaries).
4.​ Analyze medians, quartiles, and outliers.
2.6 Case Study: Iris Dataset Analysis

The Iris dataset is one of the most famous datasets in machine learning and statistics.
It consists of 150 samples from three species of the Iris flower: Setosa, Versicolor, and
Virginica. Each sample has four features:

●​ Sepal Length (cm)


●​ Sepal Width (cm)
●​ Petal Length (cm)
●​ Petal Width (cm)

The goal of this case study is to explore, visualize, and analyze the Iris dataset to
understand relationships between features, perform clustering, and derive insights.

2.6.1 Data Profiling

Step 1: Load the Iris Dataset

The dataset can be loaded using Python’s seaborn, scikit-learn, or pandas libraries.

# Import necessary libraries

import pandas as pd

from sklearn import datasets

# Load the Iris dataset from sklearn

iris = datasets.load_iris()
# Convert to DataFrame

df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = iris.target # Add species as a target variable

df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})  # Map target to species names

# Display first few rows

df.head()

Step 2: Explore Dataset Structure

Understanding the data structure is essential for further analysis.

# Check the shape of the dataset

df.shape

Output:​
(150, 5) → 150 samples and 5 columns (4 features + 1 target)

# Check column names

df.columns

Output:​
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)', 'species']
Step 3: Summary Statistics​
Statistical summaries help identify mean, median, min, max, and standard deviation
for each feature.

# Get summary statistics

df.describe()

| Feature | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Sepal Length | 5.84 | 0.83 | 4.3 | 7.9 |
| Sepal Width | 3.05 | 0.43 | 2.0 | 4.4 |
| Petal Length | 3.76 | 1.76 | 1.0 | 6.9 |
| Petal Width | 1.20 | 0.76 | 0.1 | 2.5 |
Observations:

●​ Petal width and length have higher variability than sepal measurements.
●​ Sepal width has the smallest range compared to other features.

2.6.2 Data Visualization

Data visualization helps identify relationships and clusters within the dataset.
Step 1: Pairwise Scatter Plots​
Pairwise scatter plots help understand how features relate to each other.

import seaborn as sns

import matplotlib.pyplot as plt

# Create pairplot to visualize relationships

sns.pairplot(df, hue="species", markers=["o", "s", "D"])

plt.show()

Interpretation:

●​ Petal length vs. petal width clearly separates species.


●​ Setosa is easily distinguishable, while Versicolor and Virginica overlap slightly.

Step 2: Cluster Analysis Using K-Means​


K-Means clustering groups similar data points together.

from sklearn.cluster import KMeans

# Apply K-Means clustering with 3 clusters on the four feature columns
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(df.iloc[:, :4])

# Compare clusters with actual species
sns.scatterplot(x=df["petal length (cm)"], y=df["petal width (cm)"], hue=df["cluster"], palette="deep")
plt.title("K-Means Clustering on Iris Dataset")
plt.show()

Insights from Clustering:

●​ The K-Means algorithm correctly separates Setosa.


●​ There is slight overlap between Versicolor and Virginica.
●​ Choosing different clustering techniques (e.g., hierarchical clustering) might
improve separation.

Takeaways:​
Feature Importance:

●​ Petal Length and Petal Width are the most distinguishing features between
species.
●​ Sepal Width is less effective in differentiating between species.

Pattern Recognition:

●​ Setosa is easily separable, whereas Versicolor and Virginica overlap slightly.


●​ K-Means clustering works well, but it is not 100% accurate due to overlap.

Business/Scientific Application:

●​ Can be used for automatic flower classification.


●​ Useful for species identification in botanical research.

Hands-On:
Case Study - Iris Dataset
https://colab.research.google.com/drive/1l7tMJEjonNjzNL5S0pTE2C_CObVPAmJa?usp
=sharing
Uni, Bi, Tri Analysis ​
https://colab.research.google.com/drive/1VZF-ucI2ooBuFNoOo1H0KZOrJ3yW1kb3?us
p=sharing
