DV Unit 1&2 Notes

Unit 1 introduces statistics as a mathematical science focused on data collection, analysis, and interpretation, applicable across various fields. It differentiates between descriptive statistics, which summarize data, and inferential statistics, which make predictions about populations based on samples. Key concepts include data types, sampling methods, hypothesis testing, and common statistical tests like t-tests and ANOVA.

UNIT 1: Introduction to Statistics

1.1 Statistics

Statistics is a mathematical science that involves the collection, organization, analysis, interpretation, and presentation of data. It helps in making data-driven decisions by summarizing complex data into meaningful insights. Statistics is widely used in fields such as economics, medicine, engineering, social sciences, and business to analyze trends, make predictions, and test hypotheses.

Key Concepts:
●​ Data: Raw facts and figures collected for analysis. Data can be quantitative
(numerical) or qualitative (categorical).
●​ Population: The entire set of individuals, items, or data points of interest in a
study. For example, if you are studying the heights of all students in a school, the
population is all the students in that school.
●​ Sample: A subset of the population that is used to make inferences about the
entire population. For example, if you measure the heights of 50 students out of
500, those 50 students are your sample.
●​ Variable: A characteristic or attribute that can be measured or observed.
Variables can be independent (predictor) or dependent (outcome).
1.2 Types of Statistics

Statistics is broadly classified into two main categories:

1.2.1 Descriptive Statistics


Descriptive statistics summarize and describe the main features of a dataset. It
provides simple summaries about the sample and the measures. It is used to present
data in a meaningful way, often using measures of central tendency and measures of
dispersion.
●​ Measures of Central Tendency: Mean, Median, Mode.
●​ Measures of Dispersion: Range, Variance, Standard Deviation.
●​ Graphical Representations: Histograms, Bar Charts, Pie Charts, Boxplots.

1.2.2 Inferential Statistics


Inferential statistics uses data from a sample to make inferences or predictions about a
population. It involves:
●​ Hypothesis Testing: Testing assumptions about a population parameter.
●​ Confidence Intervals: Estimating the range within which a population parameter
lies.
●​ Regression Analysis: Modeling relationships between variables.
1.3 Descriptive Statistics

1.3.1 Measures of Central Tendency
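
These measures describe the centre of a dataset: the mean (arithmetic average), the median (middle value), and the mode (most frequent value). A minimal base R sketch, using a hypothetical vector of exam scores (note that R's built-in mode() reports the storage type, so a small helper computes the statistical mode):

# Hypothetical exam scores used only for illustration
scores <- c(56, 67, 67, 70, 72, 75, 75, 75, 81, 90)

mean(scores)     # arithmetic mean
median(scores)   # middle value of the sorted data

# Base R has no function for the statistical mode, so define a small helper
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
stat_mode(scores)  # most frequent value (75 in this example)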


1.3.2 Measures of Dispersion (Spread)

These measures describe how spread out the data is:


A box plot is a simple way to display a data distribution. It marks the minimum and maximum values at the ends of the whiskers, with the middle 50% of the data contained in the box. The box is divided by the median (middle value), and each section of the plot (whisker or half of the box) covers roughly 25% of all values.

Standard deviation describes how data spread around the average (mean) of a bell-shaped (normal) curve. In a normal distribution, the region within one standard deviation of the mean covers about 68% of the data.
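
A minimal base R sketch of these measures, using a hypothetical vector of heights (cm):

heights <- c(160, 162, 165, 168, 170, 171, 174, 178, 181, 195)

range(heights)         # minimum and maximum
diff(range(heights))   # range as a single number (max - min)
var(heights)           # sample variance
sd(heights)            # sample standard deviation
IQR(heights)           # interquartile range (spread of the middle 50%)

boxplot(heights)       # visualises median, quartiles, whiskers and outliers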
1.4 Inferential Statistics

Inferential statistics allows us to make predictions about a population using a sample.


It includes:

1.4.1 Parameter Estimation

Parameter estimation is a technique in inferential statistics used to estimate unknown population parameters based on a sample. Since it's often impractical to study an entire population, we use sample data to draw conclusions about the whole population.

Types of Parameter Estimation

1.​ Point Estimation


2.​ Interval Estimation

1. Point Estimation

Point estimation provides a single best estimate of an unknown population parameter.


It uses sample statistics (such as mean, variance, or proportion) as estimates for the
population parameters.
2. Interval Estimation

Interval estimation provides a range of values that likely contains the population
parameter, rather than a single estimate. This range is called a Confidence Interval
(CI).
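
A minimal R sketch of both ideas, using a simulated sample of heights (values are illustrative only). t.test() is one convenient way to obtain a confidence interval for a mean:

set.seed(42)
heights <- rnorm(50, mean = 170, sd = 6)    # hypothetical sample of 50 heights

# Point estimate: the sample mean as a single best guess of the population mean
mean(heights)

# Interval estimate: a 95% confidence interval for the population mean
t.test(heights, conf.level = 0.95)$conf.int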
Comparison of Point vs. Interval Estimation

| Feature | Point Estimation | Interval Estimation |
|---|---|---|
| Definition | Provides a single best estimate for a population parameter. | Provides a range of values likely to contain the population parameter. |
| Precision | High precision but no measure of uncertainty. | Lower precision but includes uncertainty. |
| Example | Sample mean = 170 cm as an estimate of the population mean. | 95% confidence interval = (168.43 cm, 171.57 cm). |
| Use Case | Quick estimates when confidence levels are not required. | When understanding uncertainty and reliability is important. |
1.4.3 Types of Inferential Tests

Inferential statistics help us make conclusions about a population based on sample data.

1.4.3.1 Hypothesis Testing

Hypothesis testing is a fundamental statistical method used to make decisions about a population based on sample data. It helps determine whether observed patterns in the data are due to random chance or whether there is enough evidence to support a particular claim.
Key Concepts in Hypothesis Testing

Null Hypothesis (H₀)

●​ Represents the default assumption that there is no effect, relationship, or


difference in the population.
●​ It serves as the baseline for comparison.
●​ Example: "There is no difference in test scores between students who study in the
morning and those who study at night."

Alternative Hypothesis (H₁)

●​ Represents the claim that there is an effect, relationship, or difference.


●​ It is what the researcher aims to provide evidence for.
●​ Example: "Students who study in the morning have higher test scores than those
who study at night."

Types of Hypothesis Tests

One-Tailed Test

●​ Used when the direction of the effect is specified.


●​ Tests for an increase or decrease in one specific direction.
●​ Example: "New drug A leads to higher recovery rates than drug B."
●​ Mathematical Representation:
○​ H₀ : μ₁ ≤ μ₂
○​ H₁ : μ₁ > μ₂

Two-Tailed Test

●​ Used when the direction of the effect is unknown or when differences in either
direction matter.
●​ Example: "New drug A has a different effect on recovery rates compared to drug B
(could be higher or lower)."
●​ Mathematical Representation:
○​ H₀ : μ₁ = μ₂
○​ H₁ : μ₁ ≠ μ₂
Decision Criteria in Hypothesis Testing

P-Value (Probability Value)

●​ Measures the probability of obtaining the observed results if the null hypothesis
is true.
●​ Interpretation:
○​ p < 0.05 → Strong evidence against H₀; reject H₀.
○​ p ≥ 0.05 → Insufficient evidence; fail to reject H₀.

Level of Significance (α)

●​ The pre-defined threshold for rejecting the null hypothesis.


●​ Common values:
○​ α=0.05 (5%) → 5% risk of wrongly rejecting H₀​.
○​ α=0.01 (1%) → More stringent; used in critical applications.

Errors in Hypothesis Testing

Type I Error (False Positive)

●​ Incorrectly rejecting H₀​when it is actually true.


●​ Example: Concluding that a new drug is effective when it actually is not.
●​ Probability of Type I Error = α (level of significance).

Type II Error (False Negative)

●​ Failing to reject H₀​when H₁​is actually true.


●​ Example: Concluding that a new drug has no effect when it actually does.
●​ Probability of Type II Error = β.

Steps in Hypothesis Testing

1.​ Define the null and alternative hypotheses.


2.​ Choose the significance level (α).
3.​ Select the appropriate test (Z-test, t-test, etc.).
4.​ Compute the test statistic.
5.​ Compare the p-value with α and make a decision.
6.​ Interpret the results in the context of the problem.
Hypothesis testing is a powerful tool in statistics, widely used in research, business
analytics, medicine, and many other fields. It provides a structured approach to making
data-driven decisions while controlling for errors and uncertainty.

1. T-Test (Student’s t-test)

Used to compare the means of two groups to determine if they are significantly different
from each other.

Example:

●​ One-sample t-test: Compares the sample mean to a known population mean.


○​ Example: Testing if the average weight of apples in an orchard differs from
200 grams.
●​ Independent (Unpaired) t-test: Compares means between two independent
groups.
○​ Example: Comparing test scores of students from two different schools.
●​ Paired t-test: Compares means from the same group before and after an
intervention.
○​ Example: Testing if a training program improves employee productivity.
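
A minimal R sketch of the three variants using t.test() on simulated data (all values are hypothetical):

set.seed(1)

# One-sample t-test: is the mean apple weight different from 200 g?
weights <- rnorm(30, mean = 205, sd = 10)
t.test(weights, mu = 200)

# Independent (unpaired) t-test: test scores from two different schools
school_a <- rnorm(25, mean = 70, sd = 8)
school_b <- rnorm(25, mean = 75, sd = 8)
t.test(school_a, school_b)

# Paired t-test: productivity before and after a training program
before <- rnorm(20, mean = 50, sd = 5)
after  <- before + rnorm(20, mean = 3, sd = 2)
t.test(after, before, paired = TRUE)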
2. Chi-Square Test (χ² Test)

Used to test relationships between categorical variables (nominal data).

Example:

●​ Chi-Square Goodness of Fit Test: Checks if a sample distribution matches an


expected distribution.
○​ Example: Testing if customer preferences for three product flavors are
equally distributed.
●​ Chi-Square Test for Independence: Determines if two categorical variables are
related.
○​ Example: Examining whether gender is associated with voting preference.
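
A minimal R sketch using chisq.test() with hypothetical counts:

# Goodness of fit: are three flavours equally preferred?
flavour_counts <- c(vanilla = 30, chocolate = 45, strawberry = 25)
chisq.test(flavour_counts, p = c(1/3, 1/3, 1/3))

# Test for independence: is gender associated with voting preference?
votes <- matrix(c(40, 60,
                  55, 45),
                nrow = 2, byrow = TRUE,
                dimnames = list(gender = c("Female", "Male"),
                                preference = c("Party A", "Party B")))
chisq.test(votes)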

3. ANOVA (Analysis of Variance)

Used to compare means across three or more groups to see if at least one group is
significantly different.
●​ One-Way ANOVA: Compares means of one independent variable with multiple
groups.
○​ Example: Comparing students' math scores across three different schools.
●​ Two-Way ANOVA: Examines the effect of two independent variables on a
dependent variable.
○​ Example: Testing how both teaching method and study time affect exam
scores.
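
A minimal R sketch using aov() on simulated scores (the school and teaching-method effects are made up):

set.seed(2)
scores <- data.frame(
  school = rep(c("A", "B", "C"), each = 20),
  method = rep(c("Lecture", "Flipped"), times = 30),
  score  = c(rnorm(20, 70, 8), rnorm(20, 75, 8), rnorm(20, 72, 8))
)

# One-way ANOVA: does the mean score differ across schools?
summary(aov(score ~ school, data = scores))

# Two-way ANOVA: effects of school and teaching method (including interaction)
summary(aov(score ~ school * method, data = scores))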

4. Z-Test

●​ Used for large sample sizes (n>30) and when the population variance is known.
●​ Example: Testing whether the average height of a population differs from a
known value.
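
Base R has no dedicated z-test function, so a minimal sketch computes the statistic directly from z = (x̄ − μ₀) / (σ / √n), using hypothetical numbers:

x_bar <- 172    # sample mean
mu0   <- 170    # hypothesised population mean
sigma <- 6      # known population standard deviation
n     <- 50     # sample size

z <- (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
p_value <- 2 * pnorm(-abs(z))            # two-tailed p-value
z
p_value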

5. Regression Analysis

Used to determine relationships between independent (predictor) and dependent (outcome) variables.
Regression Line

●​ Simple Linear Regression: Examines the relationship between one


independent and one dependent variable.
○​ Example: Predicting house price based on square footage.
●​ Multiple Linear Regression: Involves two or more independent variables.
○​ Example: Predicting employee salary based on experience, education
level, and job role.
●​ Logistic Regression: Used for binary categorical outcomes (e.g., Yes/No, 0/1).
○​ Example: Predicting whether a customer will buy a product (Yes/No)
based on ad engagement.
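
A minimal R sketch of the three regression types with lm() and glm(), using simulated housing data (variables and coefficients are made up):

set.seed(3)
houses <- data.frame(
  sqft = runif(100, 500, 3000),
  beds = sample(1:5, 100, replace = TRUE)
)
houses$price <- 50000 + 120 * houses$sqft + 8000 * houses$beds + rnorm(100, sd = 20000)
houses$sold  <- rbinom(100, 1, plogis(-3 + 0.002 * houses$sqft))   # binary outcome

# Simple linear regression: price explained by square footage
summary(lm(price ~ sqft, data = houses))

# Multiple linear regression: price explained by square footage and bedrooms
summary(lm(price ~ sqft + beds, data = houses))

# Logistic regression: probability of a sale (binary Yes/No outcome)
summary(glm(sold ~ sqft, data = houses, family = binomial))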

6. Covariance Analysis (ANCOVA – Analysis of Covariance)

Covariance measures how two variables change together; ANCOVA compares group means on an outcome while controlling for one or more other variables (covariates).

●​ Covariance vs. Correlation:


○​ Covariance measures the direction of the relationship between two
variables but not the strength.
○​ Correlation standardizes this measure to range between -1 and 1,
indicating both strength and direction.
●​ ANCOVA (Analysis of Covariance): Extends ANOVA by controlling for one or
more covariates.
○​ Example: Comparing students’ final exam scores across schools while
controlling for previous academic performance.
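
A minimal R sketch, with simulated exam data, of covariance and correlation plus an ANCOVA fitted with aov() (prior score as the covariate):

set.seed(4)
d <- data.frame(
  school = rep(c("A", "B"), each = 30),
  prior  = rnorm(60, mean = 65, sd = 10)
)
d$final <- 10 + 0.8 * d$prior + ifelse(d$school == "B", 5, 0) + rnorm(60, sd = 5)

cov(d$prior, d$final)   # direction of the joint variation
cor(d$prior, d$final)   # standardised strength and direction (-1 to 1)

# ANCOVA: compare schools on the final score while adjusting for prior performance
summary(aov(final ~ prior + school, data = d))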

Inferential Statistics - Definition, Types, Examples, Formulas -


https://www.cuemath.com/data/inferential-statistics/
1.5 Probability Distributions

Probability distributions describe how values of a random variable are distributed.

1.5.1 Types of Random Variables


●​ Discrete Random Variables: Finite set of values (e.g., number of students in a
class).
●​ Continuous Random Variables: Infinite possible values within a range (e.g.,
height of students).

1.5.2 Common Distributions


●​ Normal Distribution (Bell Curve): Used in many real-world scenarios. It is
symmetric and characterized by its mean (μ) and standard deviation (σ).
●​ Binomial Distribution: For binary outcomes (success/failure).
●​ Poisson Distribution: Used for rare event occurrences.
●​ Bernoulli Distribution: A special case of the binomial distribution with a single
trial.
●​ Uniform Distribution: All outcomes are equally likely.

Example:

●​ Poisson Distribution: P(X = k) = (λ^k · e^(−λ)) / k!, for k = 0, 1, 2, …, where λ is the average number of events per interval.
●​ Bernoulli Distribution: P(X = x) = p^x (1 − p)^(1 − x), for x ∈ {0, 1}, where p is the probability of success.
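
A minimal R sketch of working with these distributions through base R's d*/p*/r* functions (all parameter values are illustrative):

pnorm(180, mean = 170, sd = 10)    # Normal: P(X <= 180) for mean 170, sd 10
dbinom(7, size = 10, prob = 0.5)   # Binomial: exactly 7 successes in 10 trials
dpois(2, lambda = 3)               # Poisson: exactly 2 events when the rate is 3
dbinom(1, size = 1, prob = 0.3)    # Bernoulli: a binomial with a single trial
runif(5, min = 0, max = 1)         # Uniform: 5 equally likely values in [0, 1]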
1.6 Sampling and Sampling Distributions

Sampling is the process of selecting a subset of individuals from a population to estimate characteristics of the whole population.

1.6.1 Types of Sampling

Sampling is the process of selecting a subset (sample) from a larger group (population)
to make statistical inferences about the whole population. The accuracy and reliability
of results depend on the sampling method used.

Sampling is broadly classified into Probability Sampling and Non-Probability Sampling.

1. Probability Sampling

In probability sampling, each individual in the population has a known and non-zero
probability of being selected. This ensures that the sample is representative and
reduces bias.

1.1 Random Sampling (Simple Random Sampling - SRS)

Every individual in the population has an equal and independent chance of being
selected. This is the most basic and widely used sampling method.

How It Works:

●​ Assign a number to every individual in the population.


●​ Use a random number generator (e.g., lottery method, computer-generated
random numbers) to select individuals.

Example:

●​ A researcher selects 100 students at random from a university database to study


their academic performance.

Advantages:

Completely unbiased and ensures fair selection.​


Easy to implement for small populations.
Disadvantages:

Not efficient for large populations.​


May lead to under-representation of some groups.
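
A minimal base R sketch using sample() on a hypothetical population of 500 student IDs:

set.seed(5)
students <- 1:500                    # population of 500 students
srs <- sample(students, size = 50)   # 50 students drawn at random without replacement
head(srs)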

1.2 Stratified Sampling

The population is divided into homogeneous subgroups (strata), and individuals are
randomly selected from each stratum proportionally.

How It Works:

●​ Identify subgroups (strata) based on characteristics (e.g., gender, age, income).


●​ Randomly select participants from each subgroup in proportion to the
population.

Example:

●​ A school wants to survey students' study habits and stratifies by grade level
(freshman, sophomore, junior, senior), ensuring proportional representation.

Advantages:

Ensures all groups are represented.​


Provides higher precision than simple random sampling.

Disadvantages:

Requires knowledge of population characteristics.​


Can be time-consuming to divide the population into strata.
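
A minimal sketch of proportional stratified sampling using dplyr's group_by() and slice_sample() (hypothetical grade-level counts):

library(dplyr)
set.seed(6)
students <- data.frame(
  id    = 1:400,
  grade = rep(c("Freshman", "Sophomore", "Junior", "Senior"),
              times = c(160, 120, 80, 40))
)

# Draw 10% from each grade level, preserving the strata proportions
strat_sample <- students %>%
  group_by(grade) %>%
  slice_sample(prop = 0.10) %>%
  ungroup()

table(strat_sample$grade)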

1.3 Cluster Sampling

The population is divided into clusters, and entire clusters are randomly selected
instead of individual members.

How It Works:

●​ Divide the population into clusters (e.g., neighborhoods, schools, departments).


●​ Randomly select entire clusters instead of individuals.
Example:

●​ To survey school performance in a city, entire schools (clusters) are selected


randomly instead of individual students.

Advantages:

Cost-effective and time-saving.​


Suitable for large populations spread across a wide area.

Disadvantages:

Can lead to higher sampling error if clusters are not diverse.​


May not be representative if clusters vary significantly.

1.4 Systematic Sampling

Every k-th individual from a population list is selected (where k=N/n, N = population
size, n = sample size).

How It Works:

●​ Choose a starting point randomly.


●​ Select every k-th person (e.g., every 5th, 10th, etc.).

Example:

●​ A company wants to survey employees. It selects every 10th employee from a


list of 1,000 employees.

Advantages:

Easier and faster than random sampling.​


Ensures even coverage of the population.

Disadvantages:

If there’s an underlying pattern in the population list, it may introduce bias.​


Not suitable for populations with irregular distributions.
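
A minimal base R sketch for a hypothetical list of 1,000 employees:

set.seed(7)
N <- 1000                          # population size
n <- 100                           # desired sample size
k <- N / n                         # sampling interval (every k-th employee)
start <- sample(1:k, 1)            # random starting point
selected <- seq(from = start, to = N, by = k)
length(selected)                   # 100 employees selected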
2. Non-Probability Sampling

In non-probability sampling, individuals are selected based on non-random criteria, meaning some individuals have a higher chance of being selected than others. This can introduce bias but is useful for exploratory research.

2.1 Convenience Sampling

Participants are selected based on ease of access and availability.

How It Works:

●​ Select participants who are easily accessible (e.g., students in a nearby class,
people at a mall).

Example:

●​ A researcher surveys passersby in a shopping mall about their shopping habits.

Advantages:

Quick and easy to collect data.​


Useful for preliminary research.

Disadvantages:

Highly biased and not representative.​


Results cannot be generalized to the population.

2.2 Judgment (Purposive) Sampling

The researcher selects participants based on specific characteristics or judgment about who will provide the best information.

How It Works:

●​ Choose individuals who best meet the research criteria.

Example:

●​ A medical researcher selects only experienced doctors to study a new treatment


method.
Advantages:

Useful when only experts can provide relevant data.​


Effective for qualitative research.

Disadvantages:

Researcher bias can affect selection.​


Not generalizable to the population.

Comparison of Probability vs. Non-Probability Sampling

| Feature | Probability Sampling | Non-Probability Sampling |
|---|---|---|
| Definition | Every individual has a known probability of being selected. | Selection is based on convenience, judgment, or other non-random methods. |
| Bias | Low bias (more representative). | High bias (less representative). |
| Use Case | Used for statistical inferences. | Used for exploratory or qualitative research. |
| Examples | Random, Stratified, Cluster, Systematic Sampling. | Convenience, Judgment Sampling. |

1.6.2 Sampling Distribution
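
A sampling distribution is the probability distribution of a statistic (such as the sample mean) obtained from repeated random samples of the same size drawn from the population. A minimal R sketch that simulates the sampling distribution of the sample mean from a hypothetical skewed population:

set.seed(8)
population <- rexp(10000, rate = 1/170)    # hypothetical right-skewed population
sample_means <- replicate(1000, mean(sample(population, size = 50)))

hist(sample_means)     # approximately bell-shaped even though the population is skewed
mean(sample_means)     # close to the population mean
sd(sample_means)       # the standard error of the mean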


1. Overview and About R

R is a powerful open-source programming language and environment specifically designed for statistical computing, data analysis, and data visualization. It is widely used in academia, research, and industry due to its flexibility, extensive libraries, and strong community support.
●​ Key Features of R:
○​ Open-source and free to use.
○​ Extensive packages for statistical analysis and visualization (e.g.,
ggplot2, dplyr, tidyr).
○​ Strong support for data manipulation and cleaning.
○​ Cross-platform compatibility (Windows, macOS, Linux).
○​ Integration with other tools like Python, SQL, and Excel.
●​ Relevance to Data Visualization:
○​ R provides a wide range of visualization libraries to create high-quality
graphs, charts, and plots.
○​ It allows customization of visualizations to meet specific analytical needs.
○​ R is particularly useful for exploratory data analysis (EDA), where
visualizations help uncover patterns, trends, and outliers in data.
2. R and R Studio Installation

R Studio is an Integrated Development Environment (IDE) for R that simplifies coding, debugging, and visualization. It provides a user-friendly interface for writing scripts, managing data, and viewing visualizations.
●​ Installation Steps:
○​ Install R:
■​ Visit the Comprehensive R Archive Network (CRAN) website:
https://cran.r-project.org/.
■​ Download the appropriate version for your operating system.
■​ Follow the installation instructions.
○​ Install R Studio:
■​ Visit the R Studio website: https://www.rstudio.com/.
■​ Download the free version of R Studio Desktop.
■​ Install it on your system.
●​ Relevance to Data Visualization:
○​ R Studio provides a seamless environment for creating and managing
visualizations.
○​ Features like the "Plots" pane allow real-time viewing of graphs and charts.
○​ R Studio supports interactive visualizations using packages like shiny.

3. Descriptive Data Analysis Using R

Descriptive data analysis involves summarizing and describing the main features of a
dataset. R provides a variety of functions and packages to perform descriptive analysis.
●​ Steps in Descriptive Data Analysis:
○​ Data Import:
■​ Use functions like read.csv(), read.table(), or
read_excel() to import data into R.
○​ Data Exploration:
■​ Use functions like head(), tail(), and str() to explore the
structure of the dataset.
○​ Summary Statistics:
■​ Use summary() to get an overview of the dataset (e.g., mean,
median, quartiles).
■​ Use mean(), median(), sd(), and var() for specific statistical
measures.
○​ Data Cleaning:
■​ Handle missing values using na.omit() or na.fill().
■​ Remove duplicates using unique().
○​ Data Visualization:
■​ Use basic plots like histograms (hist()), boxplots (boxplot()),
and scatterplots (plot()) to visualize data distributions and
relationships.
●​ Relevance to Data Visualization:
○​ Descriptive analysis is the foundation of data visualization.
○​ Visualizations like histograms and boxplots help in understanding data
distributions and identifying outliers.
○​ Summary statistics provide context for interpreting visualizations.
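
A minimal sketch tying these steps together, using R's built-in iris dataset in place of an imported file (swap in read.csv("your_file.csv") for your own data):

df <- iris

head(df)        # first rows
str(df)         # structure and data types
summary(df)     # mean, median, quartiles for each column

mean(df$Sepal.Length)
sd(df$Sepal.Length)

df <- na.omit(df)   # drop rows with missing values (iris has none; shown for completeness)
df <- unique(df)    # remove duplicate rows

hist(df$Sepal.Length)
boxplot(Sepal.Length ~ Species, data = df)
plot(df$Sepal.Length, df$Petal.Length)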
4. Description of Basic Functions Used to Describe Data in R

R provides a wide range of functions to describe and summarize data. Below are some
commonly used functions:
●​ Data Exploration:
○​ head(): Displays the first few rows of the dataset.
○​ tail(): Displays the last few rows of the dataset.
○​ str(): Provides the structure of the dataset (e.g., data types,
dimensions).
●​ Summary Statistics:
○​ summary(): Generates summary statistics for numerical and categorical
variables.
○​ mean(), median(), sd(), var(): Calculate specific statistical measures.
●​ Data Visualization:
○​ hist(): Creates histograms to visualize data distributions.
○​ boxplot(): Creates boxplots to visualize data spread and outliers.
○​ plot(): Creates scatterplots to visualize relationships between variables.
○​ barplot(): Creates bar charts for categorical data.
●​ Data Manipulation:
○​ subset(): Extracts subsets of data based on conditions.
○​ aggregate(): Aggregates data based on specific criteria.
○​ table(): Creates frequency tables for categorical variables.
●​ Relevance to Data Visualization:
○​ These functions help in preparing data for visualization by summarizing
and cleaning it.
○​ Basic plots provide quick insights into data patterns and relationships.
○​ Advanced visualization libraries like ggplot2 build on these basic
functions to create more complex and customized visualizations.
UNIT 2: Data Manipulation with R

2.1 Introduction to R

R is a programming language widely used for statistical computing and data visualization. It offers powerful libraries such as ggplot2, dplyr, and tidyr for data analysis.

2.1.1 Installing R and RStudio (A Installing R and RStudio | Hands-On Programming


with R - https://rstudio-education.github.io/hopr/starting.html)

1.​ Download and install R from CRAN. ​


The Comprehensive R Archive Network - https://cran.r-project.org/
2.​ Download and install RStudio from RStudio.​
RStudio Desktop - Posit - https://posit.co/download/rstudio-desktop/

2.2 Data Manipulation in R

2.2.1 dplyr Package for Data Manipulation

The dplyr package provides a set of efficient functions for handling tabular data,
enabling seamless data manipulation.

●​ filter(): Selects rows based on specified conditions.​


filter(data, condition)​
Example: Extracting rows where age is greater than 30.​
filter(df, age > 30)
●​ select(): Chooses specific columns from the dataset.​
select(data, column1, column2)​
Example: Selecting only name and age columns.​
select(df, name, age)
●​ arrange(): Sorts data in ascending or descending order.​
arrange(data, column)​
Example: Sorting by salary in descending order.​
arrange(df, desc(salary))
●​ mutate(): Creates new variables based on existing ones.​
mutate(data, new_column = expression)​
Example: Creating a tax column based on salary.​
mutate(df, tax = salary * 0.1)
●​ group_by() and summarise(): Aggregates data for analysis.​
summarise(group_by(data, column), mean_value = mean(column))​
Example: Finding average salary by department.​
summarise(group_by(df, department), avg_salary = mean(salary))

Complete Example with a Sample Dataset

# Load dplyr for the functions used below
library(dplyr)

# Sample dataset
df <- data.frame(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 35, 30, 40),
  salary = c(50000, 60000, 55000, 70000),
  department = c("HR", "IT", "HR", "IT")
)

# 1. Filter rows where age is greater than 30

filtered_data <- filter(df, age > 30)

print("Filtered Data:")

print(filtered_data)

# 2. Select only name and age columns

selected_data <- select(df, name, age)

print("Selected Data:")

print(selected_data)
# 3. Sort by salary in descending order

sorted_data <- arrange(df, desc(salary))

print("Sorted Data:")

print(sorted_data)

# 4. Create a new tax column (10% of salary)

mutated_data <- mutate(df, tax = salary * 0.1)

print("Mutated Data:")

print(mutated_data)

# 5. Find average salary by department

summary_data <- df %>%

group_by(department) %>%

summarise(avg_salary = mean(salary))

print("Summary Data:")

print(summary_data)

Output of the Complete Example

"Filtered Data:"

name age salary department

1 Bob 35 60000 IT

2 David 40 70000 IT
"Selected Data:"

name age

1 Alice 25

2 Bob 35

3 Charlie 30

4 David 40

"Sorted Data:"

name age salary department

1 David 40 70000 IT

2 Bob 35 60000 IT

3 Charlie 30 55000 HR

4 Alice 25 50000 HR

"Mutated Data:"

name age salary department tax

1 Alice 25 50000 HR 5000

2 Bob 35 60000 IT 6000

3 Charlie 30 55000 HR 5500

4 David 40 70000 IT 7000


"Summary Data:"

# A tibble: 2 × 2

department avg_salary

<chr> <dbl>

1 HR 52500

2 IT 65000
2.2.2 data.table Package for Fast Data Processing

The data.table package enhances data manipulation speed, especially for large
datasets.

●​ Syntax: DT[i, j, by]


○​ i: Filters rows
○​ j: Selects/manipulates columns
○​ by: Groups data
●​ Example: Selecting name and salary for employees older than 30.​
data_table[age > 30, .(name, salary)]
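
A minimal sketch, assuming the same hypothetical employee data as in the dplyr example, showing how a data frame is converted to a data.table and queried with the DT[i, j, by] syntax:

library(data.table)
data_table <- as.data.table(data.frame(
  name       = c("Alice", "Bob", "Charlie", "David"),
  age        = c(25, 35, 30, 40),
  salary     = c(50000, 60000, 55000, 70000),
  department = c("HR", "IT", "HR", "IT")
))

data_table[age > 30, .(name, salary)]                          # i filters, j selects
data_table[, .(avg_salary = mean(salary)), by = department]    # by groups and aggregates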

2.2.3 tidyr Package for Reshaping Data

The tidyr package provides tools for reshaping data formats.

●​ gather(): Converts wide-format data to long format.​


gather(data, key, value, columns_to_gather)​
Example: Transforming multiple year columns into a single year-value pair.
●​ spread(): Converts long-format data to wide format.​
spread(data, key, value)

●​ separate() and unite(): Splits and combines columns.​
separate(data, column, into = c("col1", "col2"))
unite(data, new_column, col1, col2)
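
A minimal sketch of reshaping with gather(), spread(), and separate(), using a hypothetical sales table:

library(tidyr)
sales_wide <- data.frame(
  region = c("North", "South"),
  yr2022 = c(100, 150),
  yr2023 = c(120, 170)
)

# Wide -> long: one row per region-year combination
sales_long <- gather(sales_wide, key = year, value = sales, yr2022, yr2023)
sales_long

# Long -> wide: back to one column per year
spread(sales_long, key = year, value = sales)

# Split one column into two
separate(data.frame(name = "Ada Lovelace"), name, into = c("first", "last"), sep = " ")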

2.2.4 Additional R Functions & Packages


●​ readr: Reads various file types like CSV and TSV.​
library(readr)
data <- read_csv("data.csv")

●​ lubridate: Simplifies date/time handling.​
library(lubridate)
date <- ymd("2023-10-01")


2.3 Data Visualization in R

2.3.1 ggplot2 Package

●​ Scatter Plot:​
ggplot(data, aes(x = var1, y = var2)) + geom_point()

●​ Bar Chart:​
ggplot(data, aes(x = category, y = value)) + geom_bar(stat = "identity")

2.3.2 Advanced Visualizations

●​ Heatmaps:​
ggplot(data, aes(x = var1, y = var2, fill = value)) + geom_tile()

●​ Pair Plots:​
library(GGally)
ggpairs(data)

2.4 Python for Data Analysis

2.4.1 Key Libraries


●​ Pandas:​
import pandas as pd
data = pd.read_csv("data.csv")

●​ NumPy:​
import numpy as np
array = np.array([1, 2, 3])

●​ Matplotlib & Seaborn:​
import matplotlib.pyplot as plt
import seaborn as sns
sns.scatterplot(x = data['var1'], y = data['var2'])
plt.show()

●​ Folium (Geospatial Visualizations):​
import folium
map = folium.Map(location = [45.5236, -122.6750])
map.save("map.html")

2.4.2 Advanced Operations

●​ Handling Missing Values:​


data.fillna(0, inplace=True)
●​ Grouping & Aggregations:​
data.groupby('category').agg({'value': 'mean'})

Below is a table summarizing the packages, libraries, and functions commonly used for
Data Manipulation in R. This table includes the most popular tools and their primary
purposes.

| Package/Library | Functions | Purpose |
|---|---|---|
| dplyr | filter() | Select rows based on conditions. |
| | select() | Choose specific columns. |
| | arrange() | Sort data in ascending or descending order. |
| | mutate() | Create new variables or modify existing ones. |
| | group_by() | Group data by one or more columns. |
| | summarise() | Aggregate data (e.g., mean, sum) for grouped data. |
| | rename() | Rename columns. |
| | distinct() | Remove duplicate rows. |
| | left_join(), right_join(), etc. | Join two datasets based on common columns. |
| tidyr | gather() | Convert wide data to long format (reshaping). |
| | spread() | Convert long data to wide format (reshaping). |
| | separate() | Split one column into multiple columns. |
| | unite() | Combine multiple columns into one. |
| | drop_na() | Remove rows with missing values. |
| | fill() | Fill missing values with previous or next values. |
| data.table | data.table() | Create a data.table object (enhanced data frame). |
| | setkey() | Set keys for fast indexing and joins. |
| | fread() | Fast reading of large datasets. |
| | fwrite() | Fast writing of large datasets. |
| | [i, j, by] syntax | Perform fast filtering, aggregation, and grouping. |
| reshape2 | melt() | Convert wide data to long format. |
| | dcast() | Convert long data to wide format. |
| stringr | str_replace() | Replace substrings in a string. |
| | str_split() | Split strings into substrings. |
| | str_detect() | Detect patterns in strings. |
| | str_extract() | Extract substrings matching a pattern. |
| | str_trim() | Remove whitespace from strings. |
| lubridate | ymd(), mdy(), dmy() | Parse dates from strings. |
| | year(), month(), day() | Extract components of a date. |
| | interval() | Define time intervals. |
| | difftime() | Calculate differences between dates. |
| purrr | map() | Apply a function to each element of a list or vector. |
| | map_dbl(), map_chr(), etc. | Apply a function and return a specific type (e.g., numeric, character). |
| | reduce() | Reduce a list to a single value by iteratively applying a function. |
| | flatten() | Flatten a nested list. |
| readr | read_csv() | Read CSV files into a data frame. |
| | read_tsv() | Read tab-separated files into a data frame. |
| | write_csv() | Write data frames to CSV files. |
| janitor | clean_names() | Clean column names (e.g., remove spaces, special characters). |
| | remove_empty() | Remove empty rows or columns. |
| | tabyl() | Create frequency tables. |
| sqldf | sqldf() | Run SQL queries on data frames. |
| tidyverse | Collection of packages (dplyr, tidyr, readr, etc.) | A meta-package for data manipulation and analysis. |

2.5 IBM Watson Studio - (IBM Watson Studio : https://www.ibm.com/products/watson-studio)​


Introduction to IBM Watson Studio - IBM Developer -
https://developer.ibm.com/learningpaths/get-started-watson-studio/introduction-watson-studio/​
IBM Watson Studio: A Comprehensive Guide

IBM Watson Studio is a powerful cloud-based platform designed for data science,
machine learning, and AI model development. It provides tools for data preparation,
visualization, and model building, supporting both no-code and code-driven workflows
(Python, R, and SQL).

2.5.1 Creating Projects

A project in Watson Studio serves as a workspace where you can store, manage, and
analyze datasets, build machine learning models, and collaborate with team members.

Step 1: Log in to IBM Watson Studio

1.​ Go to the IBM Cloud and log in.


2.​ If you don't have an IBM Cloud account, you'll need to create one (free and paid
tiers are available).
3.​ Once logged in, go to IBM Watson Studio from the IBM Cloud dashboard.

Step 2: Create a New Project

1.​ Click on "Create a Project".


2.​ Choose a project type based on your needs:
○​ Standard Project – For general data science, AI, and machine learning.
○​ Federated Learning Project – For training AI models across multiple data
sources without centralizing data.
3.​ Provide a project name and select an IBM Cloud Object Storage instance
(required for storing datasets and notebooks).
4.​ Click "Create" to complete the setup.

Step 3: Upload Your Dataset

1.​ Open the project and navigate to Assets → Add Data.


2.​ Select one of the following options:
○​ Upload from local machine (supports CSV, Excel, JSON, Parquet).
○​ Connect to a database (IBM Db2, MySQL, PostgreSQL, etc.).
○​ Fetch from cloud storage (AWS S3, IBM Cloud Object Storage).
3.​ Once uploaded, the dataset appears under Data Assets, ready for analysis.
2.5.2 Data Refinery (Data Cleaning & Preparation)

The Data Refinery tool in Watson Studio provides a powerful way to clean, transform,
and prepare datasets for analysis. It offers an interactive, drag-and-drop interface with
over 100 built-in operations, reducing the need for manual data preprocessing.

Step 1: Open Data Refinery

1.​ In the project dashboard, go to Assets → Data Refinery.


2.​ Select your uploaded dataset to open the Data Refinery interface.

Step 2: Data Cleaning & Preprocessing

Data cleaning is essential before running any analysis. Some common built-in operations include:

1. Removing Duplicates

●​ Duplicate records can skew results.


●​ Use the "Remove Duplicates" function to delete redundant rows.

2. Handling Missing Values

●​ Missing values (NULLs) can negatively impact machine learning models.


●​ Possible approaches:
○​ Remove missing values if they are minimal.
○​ Fill missing values with:
■​ Mean/Median (for numerical data).
■​ Mode (for categorical data).
■​ Custom values (like "Unknown" for text data).

3. Data Type Conversion

●​ Ensure consistency in data types:


○​ Convert categorical values into numerical (e.g., "Yes"/"No" → 1/0).
○​ Change date formats to a standard structure.

4. Normalization/Standardization

●​ Scale data for better model performance.


●​ Normalize numeric values if needed.
5. Filtering and Sorting

●​ Extract relevant rows based on conditions (e.g., "filter all sales > $10,000").
●​ Sort data in ascending/descending order.

Step 3: Data Transformation & Feature Engineering

Feature engineering helps enhance data quality for better insights and model
performance.

1. Creating New Features

●​ Generate new columns based on existing data.


○​ Example: Create a "Total Revenue" column as Price × Quantity
Sold.

2. Aggregating Data

●​ Summarize information using functions like:


○​ Mean, Sum, Count, Min, Max.
○​ Example: Calculate average salary per department.

3. Splitting Columns

●​ Extract information from text fields.


○​ Example: Split "Full Name" into "First Name" and "Last Name".

4. String Manipulation

●​ Replace values (e.g., changing "N/A" to "Unknown").


●​ Change case (uppercase/lowercase).
●​ Trim spaces for consistency.

Step 4: Save & Automate Data Cleaning

●​ Click "Save and Create Job" to automate the cleaning pipeline.


●​ Export the refined dataset for further use as:
○​ CSV, JSON, Parquet
○​ Database tables
○​ IBM Cloud Object Storage
2.5.3 Visualizing Data

Data visualization helps uncover patterns, trends, and relationships. Watson Studio
provides built-in visualization tools that require no coding.

1. Bar Charts (Categorical Comparisons)​


Use Case:

●​ Comparing sales across different regions.


●​ Analyzing customer preferences by category (e.g., product type).
●​ Measuring survey responses (e.g., Yes/No).

How to Create:

1.​ Select Bar Chart.


2.​ Set X-axis: Categorical variable (e.g., Product Category).
3.​ Set Y-axis: Numeric variable (e.g., Total Sales).
4.​ Customize colors, labels, and tooltips.

2. Line Charts (Trends Over Time)​


Use Case:

●​ Tracking stock prices or revenue over months/years.


●​ Analyzing temperature changes over seasons.
●​ Monitoring customer engagement over time.
●​ Analyze time-based trends (e.g., website traffic).

How to Create:

1.​ Select Line Chart.


2.​ Assign a time-based variable to the X-axis (e.g., Date).
3.​ Assign a numeric value to the Y-axis (e.g., Sales, Temperature).
4.​ Adjust smoothing and trend lines.

3. Scatter Plots (Relationship Between Variables)​


Use Case:

●​ Identify correlations between two variables (e.g., Advertising Spend vs. Revenue).
●​ Comparing employee experience vs. salary.
●​ Checking if higher temperature leads to increased ice cream sales.
How to Create:

1.​ Select Scatter Plot.


2.​ Assign X-axis: First numeric variable (e.g., Marketing Budget).
3.​ Assign Y-axis: Second numeric variable (e.g., Sales).
4.​ Add a trend line to check correlations.

4. Boxplots (Data Distribution & Outliers) (Summarizing Distributions)​


Use Case:

●​ Identifying outliers in salary distributions.


●​ Comparing test scores across different groups.
●​ Understanding data spread (minimum, median, maximum values).

How to Create:

1.​ Choose Boxplot.


2.​ Assign a categorical variable to the X-axis (e.g., Department).
3.​ Assign a numeric variable to the Y-axis (e.g., Employee Salaries).
4.​ Analyze medians, quartiles, and outliers.
2.6 Case Study: Iris Dataset Analysis

The Iris dataset is one of the most famous datasets in machine learning and statistics.
It consists of 150 samples from three species of the Iris flower: Setosa, Versicolor, and
Virginica. Each sample has four features:

●​ Sepal Length (cm)


●​ Sepal Width (cm)
●​ Petal Length (cm)
●​ Petal Width (cm)

The goal of this case study is to explore, visualize, and analyze the Iris dataset to
understand relationships between features, perform clustering, and derive insights.

2.6.1 Data Profiling

Step 1: Load the Iris Dataset

The dataset can be loaded using Python’s seaborn, scikit-learn, or pandas libraries.

# Import necessary libraries

import pandas as pd

from sklearn import datasets

# Load the Iris dataset from sklearn

iris = datasets.load_iris()
# Convert to DataFrame

df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = iris.target # Add species as a target variable

df['species'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})  # Map target to species names

# Display first few rows

df.head()

Step 2: Explore Dataset Structure

Understanding the data structure is essential for further analysis.

# Check the shape of the dataset

df.shape

Output:​
(150, 5) → 150 samples and 5 columns (4 features + 1 target)

# Check column names

df.columns

Output:​
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
'petal width (cm)', 'species']
Step 3: Summary Statistics​
Statistical summaries help identify mean, median, min, max, and standard deviation
for each feature.

# Get summary statistics

df.describe()

| Feature | Mean | Std Dev | Min | Max |
|---|---|---|---|---|
| Sepal Length | 5.84 | 0.83 | 4.3 | 7.9 |
| Sepal Width | 3.05 | 0.43 | 2.0 | 4.4 |
| Petal Length | 3.76 | 1.76 | 1.0 | 6.9 |
| Petal Width | 1.20 | 0.76 | 0.1 | 2.5 |
Observations:

●​ Petal width and length have higher variability than sepal measurements.
●​ Sepal width has the smallest range compared to other features.

2.6.2 Data Visualization

Data visualization helps identify relationships and clusters within the dataset.
Step 1: Pairwise Scatter Plots​
Pairwise scatter plots help understand how features relate to each other.

import seaborn as sns

import matplotlib.pyplot as plt

# Create pairplot to visualize relationships

sns.pairplot(df, hue="species", markers=["o", "s", "D"])

plt.show()

Interpretation:

●​ Petal length vs. petal width clearly separates species.


●​ Setosa is easily distinguishable, while Versicolor and Virginica overlap slightly.

Step 2: Cluster Analysis Using K-Means​


K-Means clustering groups similar data points together.

from sklearn.cluster import KMeans

# Apply K-Means clustering with 3 clusters on the four feature columns
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(df.iloc[:, :4])

# Compare clusters with actual species
sns.scatterplot(x=df["petal length (cm)"], y=df["petal width (cm)"], hue=df["cluster"], palette="deep")
plt.title("K-Means Clustering on Iris Dataset")
plt.show()

Insights from Clustering:

●​ The K-Means algorithm correctly separates Setosa.


●​ There is slight overlap between Versicolor and Virginica.
●​ Choosing different clustering techniques (e.g., hierarchical clustering) might
improve separation.

Takeaways:​
Feature Importance:

●​ Petal Length and Petal Width are the most distinguishing features between
species.
●​ Sepal Width is less effective in differentiating between species.

Pattern Recognition:

●​ Setosa is easily separable, whereas Versicolor and Virginica overlap slightly.


●​ K-Means clustering works well, but it is not 100% accurate due to overlap.

Business/Scientific Application:

●​ Can be used for automatic flower classification.


●​ Useful for species identification in botanical research.

Hands-On:
Case Study - Iris Dataset
https://colab.research.google.com/drive/1l7tMJEjonNjzNL5S0pTE2C_CObVPAmJa?usp
=sharing
Uni, Bi, Tri Analysis ​
https://colab.research.google.com/drive/1VZF-ucI2ooBuFNoOo1H0KZOrJ3yW1kb3?us
p=sharing
