EDA Unit 1

Exploratory Data Analysis (EDA) is a crucial initial step in understanding datasets, enabling analysts to identify patterns, anomalies, and ensure data quality before applying predictive models. Data Science, an interdisciplinary field, utilizes EDA alongside various phases of data analysis to extract actionable insights from structured and unstructured data. EDA employs statistical measures and visualizations to facilitate decision-making across multiple sectors, including finance, by revealing underlying truths about the data.

What is exploratory data analysis?

Exploratory data analysis (EDA) is the first step in understanding a
dataset before applying any predictive models or making business
decisions.

OR

EDA is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using
statistical measures.

EDA helps analysts uncover patterns, trends, inconsistencies, and
missing values to ensure data quality and reliability. In the context of
financial services, EDA plays a crucial role in risk assessment, allowing teams
to identify key factors contributing to credit card delinquency and build
stronger prediction models.

Understanding of Data Science:


Data Science is an interdisciplinary field that integrates mathematics,
statistics, computer science, and domain expertise to extract
actionable insights from both structured and unstructured data. It
encompasses a systematic process that includes data collection, cleaning,
analysis, and interpretation, aimed at addressing real-world challenges
across various sectors.

Definition:
Data Science is the scientific discipline of applying computational, statistical,
and algorithmic methods to extract knowledge and insights from data in
diverse formats—ranging from structured datasets (e.g., databases,
spreadsheets) to unstructured sources (e.g., text, audio, and images).

There are several phases of data analysis, including

1) Data requirements
2) Data collection
3) Data processing
4) Data cleaning
5) Exploratory data analysis
6) Modeling and algorithms, and
7) Data product and communication.

These phases are similar to the CRoss-Industry Standard Process for Data
Mining (CRISP-DM) framework used in data mining.
THE SIGNIFICANCE OF EDA

 Different fields of science, economics, engineering, and marketing
accumulate and store data primarily in electronic databases.
Appropriate and well-established decisions should be made using the
data collected.
 It is practically impossible to make sense of datasets containing more
than a handful of data points without the help of computer programs.
To be certain of the insights that the collected data provides and to
make further decisions, data mining is performed where we go through
distinctive analysis processes.
 Exploratory data analysis is key, and usually the first exercise in data
mining. It allows us to visualize data to understand it as well as to
create hypotheses for further analysis. The exploratory analysis
centers around creating a synopsis of data or insights for the next
steps in a data mining project.
 EDA reveals the ground truth about the data without making
any underlying assumptions. This is why data scientists use
this process to understand what type of modeling and
hypotheses can be created.
 Key components of exploratory data analysis include summarizing
data, statistical analysis, and visualization of data. Python provides
expert tools for exploratory analysis, with pandas for summarizing;
scipy, along with others, for statistical analysis; and matplotlib and
plotly for visualizations.
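
A minimal sketch of how these components work together, assuming a small hypothetical daily sales series:

# Hedged sketch: pandas for summarizing, scipy for statistics, matplotlib for visualization
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

sales = pd.Series([120, 135, 128, 150, 300, 142, 138])  # hypothetical daily sales figures

print(sales.describe())            # pandas: summary statistics (count, mean, std, quartiles)
print(stats.zscore(sales))         # scipy: z-scores help flag unusually large or small values

plt.hist(sales, bins=5, edgecolor='black')  # matplotlib: distribution of the values
plt.title("Distribution of Daily Sales")
plt.show()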

Steps in EDA:

1) Problem Defining:
 Before trying to extract useful insights from the data, it is
essential to define the business problem to be solved. The
problem definition works as the driving force for executing the
data analysis plan.
 The main tasks involved in problem definition are defining the
main objective of the analysis, defining the main deliverables,
outlining the main roles and responsibilities, obtaining the
current status of the data, defining the timetable, and
performing cost/benefit analysis. Based on such a problem
definition, an execution plan can be created.
2) Data preparation:
 This step involves methods for preparing the dataset before
actual analysis. In this step, we define the sources of data, define
data schemas and tables, understand the main characteristics of
the data, clean the dataset, delete non-relevant datasets,
transform the data, and divide the data into required chunks for
analysis.
3) Data analysis:
 This is one of the most crucial steps that deals with descriptive
statistics and analysis of the data. The main tasks involve
summarizing the data, finding the hidden correlation and
relationships among the data, developing predictive models,
evaluating the models, and calculating the accuracies.
 Some of the techniques used for data summarization are
summary tables, graphs, descriptive statistics, inferential
statistics, correlation statistics, searching, grouping, and
mathematical models.
4) Development and representation of the results:
 This step involves presenting the dataset to the target audience
in the form of graphs, summary tables, maps, and diagrams. This
is also an essential step as the result analyzed from the dataset
should be interpretable by the business stakeholders, which is
one of the major goals of EDA.
 Most of the graphical analysis techniques include scatter
plots, character plots, histograms, box plots, residual plots, mean
plots, and others.
Making Sense of Data:
It is crucial to identify the type of data under analysis. Different disciplines
store different kinds of data for different purposes. For example, medical
researchers store patients' data, universities store students' and teachers'
data, and the real estate industry stores house and building datasets.

Most datasets broadly fall into two groups:

1) Numerical data
2) Categorical data.

1. Numerical data

This data has a sense of measurement involved in it; for example, a
person's age, height, weight, blood pressure, heart rate, temperature,
number of teeth, number of bones, and the number of family members. This
data is often referred to as quantitative data in statistics. The numerical
dataset can be of either discrete or continuous type.

a) Discrete data

This is data that is countable and whose values can be listed out. For
example, the number of heads in 200 coin flips can take any value
from 0 to 200, a finite number of cases. A variable that represents a discrete
dataset is referred to as a discrete variable.

The discrete variable takes a fixed number of distinct values. For
example, the Country variable can have values such as Nepal, India, Norway,
and Japan. It is fixed. The Rank variable of a student in a classroom can take
values such as 1, 2, 3, 4, 5, and so on.

b) Continuous data

A variable that can have an infinite number of numerical values within
a specific range is classified as continuous data. A variable describing
continuous data is a continuous variable.
2. Categorical Data:

This type of data represents characteristics of objects, e.g. gender and
marital status. This data is often referred to as qualitative data in statistics.
To understand clearly, here are some of the most common types of
categorical data you can find:

 Gender (Male, Female, Other, or Unknown)
 Marital Status (Annulled, Divorced, Interlocutory, Legally Separated,
Married, Polygamous, Never Married, Domestic Partner, Unmarried,
Widowed, or Unknown)
 Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy,
Historical, Horror, Mystery, Philosophical, Political, Romance, Saga,
Satire, Science Fiction, Social, Thriller, Urban, or Western)

A variable describing categorical data is referred to as a categorical variable.
These types of variables can have one of a limited number of values. It is
easier for computer science students to understand categorical values as
enumerated types or enumerations of variables. There are different types of
categorical variables:

Note:

A binary categorical variable can take exactly two values and is also
referred to as a dichotomous variable. For example, when you create an
experiment, the result is either success or failure. Hence, results can be
understood as a binary categorical variable.

Polytomous variables are categorical variables that can take more
than two possible values. For example, marital status can have several
values, such as annulled, divorced, interlocutory, legally separated, married,
polygamous, never married, domestic partner, unmarried, widowed,
and unknown. Since marital status can take more than two
possible values, it is a polytomous variable.
Measurement scales:
There are four different types of measurement scales described in statistics:

I. Nominal
II. Ordinal
III. Interval, and
IV. Ratio.

These scales are used more in academic and research settings.

Nominal:

These are practiced for labeling variables without any quantitative value. The
scales are generally referred to as labels. And these scales are mutually
exclusive and do not carry any numerical importance. Let's see some
examples:

- What is your gender?

- Male

- Female

- Third gender/Non-binary

Nominal scales are considered qualitative scales and the measurements that
are taken using qualitative scales are considered qualitative data.

Ordinal

The main difference between the ordinal and nominal scales is the order. In ordinal
scales, the order of the values is a significant factor. An easy tip to
remember the ordinal scale is that it sounds like an order.
For example, responses to a survey question such as "WordPress is making
content managers' lives easier" can be scaled down to five
different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and
Strongly Disagree. Scales like these are referred to as Likert scales.

To make it easier, consider ordinal scales as an order of ranking (1st, 2nd,
3rd, 4th, and so on). The median item is allowed as the measure of central
tendency; however, the average is not permitted.

Interval

In interval scales, both the order and the exact differences between the values
are significant. Interval scales are widely used in statistics, for example, in
measures of central tendency (mean, median, and mode) and of dispersion
(standard deviation). Examples include location in Cartesian coordinates and direction
measured in degrees from magnetic north. The mean, median, and mode are
allowed on interval data.
Ratio

Ratio scales contain order, exact values, and absolute zero, which makes it
possible to be used in descriptive and inferential statistics. These scales
provide numerous possibilities for statistical analysis. Mathematical
operations, the measure of central tendencies, and the measure of
dispersion and coefficient of variation can also be computed from such
scales.

Categorical variable → Nominal scale and Ordinal scale

Quantitative variable → Interval scale and Ratio scale

COMPARING EDA WITH CLASSICAL AND BAYESIAN ANALYSIS

There are several approaches to data analysis. The most popular ones that
are relevant to this book are the following:

Classical data analysis:


For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by
analysis and result communication.
Example: Predicting exam scores using a regression model after collecting
student data.

Exploratory data analysis approach:


For the EDA approach, it follows the same approach as classical data
analysis except the model imposition and the data analysis steps are
swapped. The main focus is on the data, its structure, outliers, models, and
visualizations. Generally, in EDA, we do not impose any deterministic or
probabilistic models on the data.
Example: Using graphs and summaries to understand students' marks
distribution before deciding which model to apply.

Bayesian data analysis approach:


The Bayesian approach incorporates prior probability distribution knowledge
into the analysis steps. Simply put, the
prior probability distribution of any quantity expresses the belief about that
particular quantity before considering some evidence.
Example: If past data suggests 80% students pass, and we get new data,
Bayesian methods update the probability of passing based on both.

SOFTWARE TOOLS AVAILABLE FOR EDA


There are several software tools that are available to facilitate EDA. Here, we
are going to outline some of the open source tools:

Python: This is an open source programming language widely used in data
analysis, data mining, and data science (https://www.python.org/).
The following libraries are used for EDA in Python:
1) Pandas: Data manipulation and summary statistics.
2) Matplotlib/Seaborn: Visualization tools for plots like histograms,
scatter plots, and box plots.
3) NumPy: Numerical computations for descriptive statistics.

Example: Use Pandas to compute mean and median, Seaborn for a box plot
of sales data.
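
A minimal sketch of this example, assuming a hypothetical 'sales' column:

# Hedged sketch: Pandas for mean and median, Seaborn for a box plot of sales data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'sales': [200, 220, 250, 210, 500, 230, 240]})  # hypothetical sales figures

print("Mean:", df['sales'].mean())      # Pandas: mean
print("Median:", df['sales'].median())  # Pandas: median

sns.boxplot(y=df['sales'])               # Seaborn: box plot of the sales distribution
plt.title("Sales Distribution")
plt.show()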

R programming language: R is an open source programming language
that is widely utilized in statistical computation and graphical data analysis
(https://www.r-project.org).
The following libraries are used for EDA in the R programming language:

1) ggplot2: Advanced visualizations (e.g., scatter plots, density plots).
2) dplyr: Data manipulation and summary statistics.
3) tidyr: Data cleaning and reshaping.

Example: Create a scatter plot with ggplot2 to explore relationships between
variables.
Weka: This is an open source data mining package that involves several
EDA tools and algorithms (https://www.cs.waikato.ac.nz/ml/weka/).

KNIME: This is an open source tool for data analysis and is based on Eclipse
(https://www.knime.com/).

VISUAL AIDS FOR EDA:

Visual aids are essential for Exploratory Data Analysis (EDA) as they help
uncover patterns, relationships, and anomalies in data.
Visual aids are crucial for:

i. Understanding variable distributions
ii. Identifying relationships between variables
iii. Detecting outliers or anomalies
iv. Visualizing missing data
v. Supporting feature engineering and model selection

The following are the commonly used visual aids:

1) Line chart
2) Bar chart
3) Scatter plot
4) Box Plot
5) Area plot and stacked plot
6) Pie chart
7) Table chart
8) Polar chart
9) Histogram
10) Lollipop chart
Line Chart:
A line chart is a type of data visualization that displays information as a
series of data points connected by straight lines. It is commonly used to
show trends, changes, or patterns over time or across categories, making it
especially useful for visualizing continuous data

Example of a line chart using Python with the matplotlib library

The following example demonstrates how to analyze and visualize the
change in average temperature from January to June using a line chart.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., months vs temperature)


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
temperature = [22, 25, 28, 30, 32, 33]

# Step 3: Create the line chart


plt.plot(months, temperature, color='green', marker='o', linestyle='-')

# Step 4: Add titles and labels


plt.title("Average Monthly Temperature")
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")

# Step 5: Show the plot


plt.grid(True)
plt.show()

Python Output:
Conclusion:

The line chart illustrating the average monthly temperature from January to
June demonstrates a consistent upward trend, with temperatures rising
from 22°C in January to 33°C in June.

Bar Plot:
A bar chart is a graphical representation of categorical data using
rectangular bars. Each bar’s length or height corresponds to the value of the
category it represents, making it easy to compare quantities across different
groups. Bar charts are widely used to display discrete data, such as counts,
frequencies, or totals for various categories.

Example of a bar chart using Python with the matplotlib library

To visualize the marks obtained by a student in different school subjects, we
can use a bar chart.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., subjects vs marks)


subjects = ["Math", "Science", "English", "History", "Geography"]
marks = [85, 90, 78, 70, 88]

# Step 3: Create the bar chart


plt.bar(subjects, marks, color='orange', edgecolor='black')

# Step 4: Add titles and labels


plt.title("Marks Obtained in Subjects")
plt.xlabel("Subjects")
plt.ylabel("Marks")

# Step 5: Show the plot


plt.grid(axis='y')
plt.show()

Python Output:
Conclusion:

Based on the bar chart, the data reflects the performance across five
subjects: Math, Science, English, History, and Geography. The highest marks
are observed in Science (90) and Geography (88), followed closely by Math
at 85. English scores 78, while History shows the lowest performance at 70.
This visual representation highlights strong performance in Science and
Geography, with a noticeable dip in History, providing a clear comparison of
subject-wise achievement.

Scatter Plot:
A scatter plot is a type of data visualization that displays individual data
points as dots on a two-dimensional plane, with one variable on the x-axis
and another on the y-axis. Each point represents an observation in the
dataset, allowing viewers to see the relationship—or lack thereof—between
the two variables.

Example of a scatter plot using Python with the matplotlib library

The following are examples of constructing a scatter plot using Python.

Using the matplotlib library:

###### Scatter plot using Matplotlib


# Import matplotlib
import matplotlib.pyplot as plt

# Sample data (X and Y values)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create the scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add title and labels


plt.title("Simple Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Show grid and plot


plt.grid(True)
plt.show()

Python Output:

Scatter plot using pandas library:

#########Scatter Plot Using Pandas


# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Create sample data as a Pandas DataFrame


data = {
'Math_Score': [45, 78, 88, 56, 90, 67, 76],
'Science_Score': [55, 80, 85, 60, 95, 70, 75]
}
df = pd.DataFrame(data)

# Step 3: Use DataFrame.plot.scatter to create scatter plot


df.plot.scatter(x='Math_Score', y='Science_Score', color='red', marker='o', title='Math vs Science Scores')
# Step 4: Show the plot
plt.show()

Python Output:

Box Plot:
A box plot—also known as a box-and-whisker plot—is a graphical
representation of a dataset's distribution, summarizing key statistical
measures in a compact, standardized format. It visually displays
the minimum, lower quartile (Q1), median, upper quartile (Q3), and
maximum of the data, often along with potential outliers. Box plots are
widely used in research, business, education, and any field where
understanding and communicating data distribution is important. They
complement other visualization tools like histograms and density plots.

The following is an example of constructing a box plot using Python.

import matplotlib.pyplot as plt


# Sample data
data = [7, 2, 5, 8, 6, 3, 9, 4, 7, 5]

# Create box plot


plt.boxplot(data)

# Add labels and title


plt.title('Simple Box Plot')
plt.ylabel('Values')

# Display the plot


plt.show()

Python Output:

Example 2:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70,120]})


plt.boxplot(df['Score'])
plt.title("Boxplot of Scores")
plt.show()

Output:
Pie Chart:
A pie chart is a circular graph that visually represents data by dividing a
circle into slices, where each slice corresponds to a category and its size
shows the proportion of that category relative to the whole. The entire pie
represents 100% (or 360 degrees), and each slice’s angle or area is
proportional to its share of the total.

A pie chart is most effective when you want to show how individual
categories contribute to a whole—that is, to display the part-to-whole
relationship in your data.

The following is an example of constructing a pie chart using Python, using
the same sample market share data as the 3D pie chart example below.

import matplotlib.pyplot as plt

# Sample data: smartphone market share
labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'Others']
sizes = [30, 25, 15, 10, 20]

# Create the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)

# Add title
plt.title('Smartphone Market Share')

# Display the plot
plt.show()

Python Output:
3D Pie Chart:

import matplotlib.pyplot as plt

labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'Others']


sizes = [30, 25, 15, 10, 20]
explode = (0.1, 0, 0, 0, 0) # explode 1st slice

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140,


shadow=True, explode=explode, colors=['red', 'blue', 'green', 'orange',
'grey'])

plt.title('Simulated 3D Pie Chart with Shadow')


plt.axis('equal')
plt.show()
Histogram:

A histogram is a graphical representation used in statistics to show
the frequency distribution of numerical data across a continuous
range. It is made up of adjacent vertical bars (rectangles), where each
bar represents a range of values (called a bin or bucket), and the height of
the bar indicates the number of data points (frequency) that fall within that
range. The horizontal (x) axis shows the data ranges, and the vertical (y)
axis shows the frequency or count of data points in each range.

The following is an example of constructing a histogram with the help of
Python.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., student marks out of 100)


marks = [45, 56, 67, 45, 89, 90, 67, 76, 56, 80, 70, 68, 59, 48, 77, 85, 94, 67]

# Step 3: Create a histogram


plt.hist(marks, bins=7, color='skyblue', edgecolor='black')

# Step 4: Add titles and labels


plt.title("Histogram of Student Marks")
plt.xlabel("Marks")
plt.ylabel("Number of Students")
# Step 5: Show the plot
plt.grid(True)
plt.show()

Area Plot:

An area plot (also called an area chart or area graph) is a data visualization
that displays quantitative data by plotting points and connecting them with
line segments—similar to a line chart—but then filling the area between the
line and the horizontal (x) axis with color or shading. This shaded area
visually emphasizes the magnitude of change and the cumulative total over
time or across categories.

Area plots can show one or more data series, and when multiple series
are included, each is represented by a differently colored or patterned area,
either overlapping or stacked. The stacked area chart is a common
variation, showing how different categories contribute to the total over time
(a stacked example is sketched after the simple area plot below).
# Step 1: Import the library
import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., days vs active users)


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
users = [100, 150, 130, 180, 170, 200, 190]

# Step 3: Create the area plot


plt.fill_between(days, users, color='skyblue', alpha=0.5)

# Step 4: Add line on top for clarity


plt.plot(days, users, color='blue', marker='o')

# Step 5: Add titles and labels


plt.title("Active Users During the Week")
plt.xlabel("Days")
plt.ylabel("Number of Users")

# Step 6: Show the plot


plt.grid(True)
plt.show()

Conclusion:
The chart shows the number of active users during the week from Monday to
Sunday. The number starts at 100 on Monday, increases to 150 on Tuesday,
dips to 130 on Wednesday, rises to 180 on Thursday, eases to 170 on Friday,
peaks at 200 on Saturday, and decreases slightly to 190 on Sunday. The
highest activity is on Saturday, while the lowest is on Monday.
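
The stacked variation mentioned above can be sketched with plt.stackplot; the two user groups below are hypothetical illustrative series, not data from this unit:

import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
mobile_users = [60, 90, 80, 110, 100, 130, 120]   # hypothetical series 1
desktop_users = [40, 60, 50, 70, 70, 70, 70]      # hypothetical series 2

# Stack the two series so the top edge shows the combined total per day
plt.stackplot(days, mobile_users, desktop_users, labels=['Mobile', 'Desktop'], alpha=0.6)

plt.title("Active Users by Platform During the Week")
plt.xlabel("Days")
plt.ylabel("Number of Users")
plt.legend(loc='upper left')
plt.show()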

DATA TRANSFORMATION TECHNIQUES

Data Transformation:

One of the fundamental steps of Exploratory Data Analysis (EDA) is data
wrangling. We need to understand the work that must be completed
before transforming our data for further examination, including
removing duplicates, replacing values, renaming axis indexes, discretization
and binning, and detecting and filtering outliers (a small outlier-filtering
sketch follows this paragraph).
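
As a small illustration of the outlier detection and filtering step mentioned above, here is a minimal sketch using the common IQR rule; the 'Score' column and its values are hypothetical sample data:

import pandas as pd

df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70, 120]})  # 120 is an unusually large value

# Compute the interquartile range (IQR)
q1 = df['Score'].quantile(0.25)
q3 = df['Score'].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles (a common rule of thumb)
filtered = df[(df['Score'] >= q1 - 1.5 * iqr) & (df['Score'] <= q3 + 1.5 * iqr)]
print(filtered)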

Data transformation is a set of techniques used to convert data from one
format or structure to another format or structure. The following are some
examples of transformation activities:

Data deduplication involves the identification of duplicates and their
removal.

Key restructuring involves transforming any keys with built-in meanings to
generic keys.

Data cleansing involves deleting out-of-date, inaccurate, and incomplete
information from the source data without losing its meaning or information,
in order to enhance the accuracy of the source data.

Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.

Format revisioning involves converting from one format to another.

Data derivation consists of creating a set of rules to generate more
information from the data source.

Data aggregation involves searching, extracting, summarizing, and
preserving important information in different types of reporting systems.

Data integration involves converting different data types and merging
them into a common structure or schema.

Data filtering involves identifying information relevant to any particular
user.

Data joining involves establishing a relationship between two or more
tables.

The main reason for transforming the data is to get a better representation
such that the transformed data is compatible with other data. In addition to
this, interoperability in a system can be achieved by following a common
data structure and format.

Merging Datasets:
Merging is the process of combining two or more datasets based on a
common key or index. It is especially useful when information is spread
across multiple tables or files.

For example, if you have customer names and addresses in one file and their
purchase history in another, merging means putting both types of
information into a single file, making it easier to analyze or report on
combined data.

In pandas, we can merge datasets using four primary join types:

1. Inner Join
2. Left Join
3. Right Join
4. Outer (Full) Join

Inner Join:
An INNER JOIN is a way to combine rows from two or more tables based on
a common column between them. It only returns rows where there are
matching values in both tables—rows without a match are excluded from the
result.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='inner')


print(result)

Left Join:
A LEFT JOIN is a way to combine rows from two tables. It returns all rows
from the left table (the first table you mention) and only the matching
rows from the right table (the second table). If there is no match in the
right table, the result will have NaN values for the right table's columns.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='left')


print(result)
Right Join:
A RIGHT JOIN is a type of join that returns all rows from the right
table (the second table listed in your query) and only the matching rows
from the left table (the first table). If there is no match in the left table, the
result will have NULL values for the left table's columns.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='right')


print(result)

Outer (Full) Join:

An outer join is a method to combine rows from two or more tables based
on a related column, including not only the rows with matches in both
tables, but also the rows that do not have a match in one or both
tables.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='outer')


print(result)

Merging datasets using the pandas concat() method:

Syntax:
import pandas as pd

pd.concat(objs, axis=0, ignore_index=False)

where,

objs: List of DataFrames or Series to concatenate.

axis: 0 for rows (default), 1 for columns.

ignore_index: If True, resets index in the result.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})


print(df1)
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print(df2)

result = pd.concat([df1, df2], axis=0, ignore_index=False)


print(result)
Combining Datasets Using Append in Pandas
Appending is a straightforward way to combine two or more datasets by
adding the rows of one DataFrame to the end of another—creating a
single, taller table (not wider). This is different from merging or joining, which
combine data by matching keys or columns.

Note: As of pandas 2.0, the DataFrame.append() method has been
removed. Instead of DataFrame.append(), we can use the pd.concat()
method.
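
For example, a minimal sketch (with hypothetical sample data) of appending one DataFrame's rows below another using pd.concat():

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie'], 'Age': [22]})

# Append df2's rows below df1 (replacement for the removed DataFrame.append())
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)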

Reshaping Data
Reshaping means changing the structure (the rows and columns layout) of
your dataset to make it easier to analyze, visualize, or fit into a particular
model. Data can be reshaped in many ways, such as going from wide
format (many columns, few rows) to long format (many rows, fewer
columns) and vice versa.

Example:

import pandas as pd

df = pd.DataFrame({
'ID': [1, 2],
'Math': [90, 85],
'Science': [80, 95]
})

melted = pd.melt(df, id_vars='ID', var_name='Subject', value_name='Score')


print(melted)

Output:
Pivoting Data
Pivoting is a specific type of reshaping where you convert data from long
format (many rows, few columns) to wide format (few rows, many columns).
This is useful when you want to compare values across different categories
or time periods.

import pandas as pd

df_long = pd.DataFrame({
'ID': [1, 1, 2, 2],
'Subject': ['Math', 'Science', 'Math', 'Science'],
'Score': [90, 80, 85, 95]
})

pivoted = df_long.pivot(index='ID', columns='Subject', values='Score')


print(pivoted)

Output:
Transformation techniques:

1) Performing data deduplication:


Data deduplication is a process that removes duplicate records or
repeating data segments within a dataset. This is an essential step in data
transformation, especially when preparing data for analysis. Deduplication
improves data quality and storage efficiency.

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice']})


df = df.drop_duplicates()

print(df)

2) Replacing values :
Replacing values is a common data transformation technique used to
modify specific data values within a dataset. This is often done to correct
errors, standardize formats, or prepare data for analysis and modeling.

import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M']})


df = df.replace({'M': 'Male', 'F': 'Female'})

print(df)

3) Handling Missing Data :


Handling missing data is a crucial step in preparing datasets for analysis
or machine learning. Missing values can distort results, introduce bias, and
reduce the effectiveness of models if not managed properly.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 30]})

print(df.isnull())

4) Dropping Missing Values:

Dropping missing values means removing rows or columns from a dataset
where data is incomplete or missing. This is one of the most common and
simplest ways to handle missing data.

import pandas as pd

# Create a sample DataFrame


data = {
'Name': ['Alice', 'Bob', None, 'David'],
'Age': [25, 30, 22, None]
}
df = pd.DataFrame(data)

# Drop rows with any missing values


df_drop_rows = df.dropna()
print(df_drop_rows)

# Drop a specific column ('Age') that contains missing values

df_drop_columns = df.drop(columns=['Age'])
print(df_drop_columns)

5) Filling Missing Values :


Filling missing values is the process of replacing empty or undefined data
entries (usually shown as NaN, null, or blank spaces) with valid values so
the dataset becomes complete, consistent, and usable for analysis or
machine learning.

This is also called “imputation”.

import pandas as pd
data = {'Age': [25, None, 30]}  # missing value represented as None (or np.nan)
df = pd.DataFrame(data)
df = df.fillna(0)
print(df)

6) Backward and Forward Filling:

Backward filling (backfill or bfill) and forward filling (ffill) are simple data
imputation methods used to handle missing values, especially in time
series or sequential data.

import pandas as pd
data = {'Age': [25, None, 30]}  # missing value represented as None (or np.nan)
df = pd.DataFrame(data)
df_forward = df.ffill()    # forward fill (fillna(method='ffill') is deprecated in recent pandas)
print(df_forward)
df_backward = df.bfill()   # backward fill
print(df_backward)

7) Interpolating missing values :


Interpolation is a method of estimating missing values in a dataset by
using the existing surrounding data points. Instead of simply copying nearby
values (like forward fill or backward fill), interpolation calculates new
values based on mathematical formulas—usually assuming values
change smoothly over time or across a sequence.

import pandas as pd
df = pd.DataFrame({'Age': [20, None, 30]})
df = df.interpolate()
print(df)

8) Discretization and Binning:

Discretization is the process of converting continuous numerical data into
discrete categories, intervals, or bins. This transformation helps simplify
data, making it easier to analyze, visualize, and feed into machine learning
models that require categorical input.

Binning (also called bucketing) is a common technique for discretization. In
binning, the range of continuous values is split into a set of discrete intervals
called bins.

Example 1: pd.cut() – Binning with specified bin edges:

import pandas as pd
df = pd.DataFrame({'Age': [15, 30, 20, 15, 66, 87, 13]})
df['Group'] = pd.cut(df['Age'], bins=[0, 18, 60, 100], labels=['Teen', 'Adult', 'Senior'])
print(df)
Example 2: pd.qcut() – Quantile-based Binning:

#Imports the pandas library


import pandas as pd

#Insert Data
scores = [10, 20, 30, 40, 50, 60, 70, 80]

#Create the DataFrame


df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70, 80]})

#Apply qcut() for Binning


# Divide scores into 4 quantile bins
df['ScoreGroup'] = pd.qcut(df['Score'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

#Print the Final DataFrame


print(df)
Data Manipulation using Pandas
Data manipulation using pandas involves organizing, transforming, and
analyzing tabular data (data structures like DataFrames and Series) with the
pandas library in Python. Pandas provides intuitive tools to efficiently
handle common data processing tasks for analysis, visualization, or
modeling.

Pandas Objects:
Pandas is a Python library widely used for data manipulation and analysis. It
provides two fundamental data structures for handling structured data:
Series and DataFrame.

Series: A Series is a one-dimensional, labeled array capable of holding any
data type, such as integers, floats, or strings.

DataFrame: A DataFrame in Python (using the pandas library) is a two-
dimensional, labeled data structure that organizes data in rows and columns,
much like a table in a database or a spreadsheet in Excel.
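
A minimal sketch creating each structure (the values are illustrative):

import pandas as pd

# Series: a one-dimensional labeled array
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'])
print(ages)

# DataFrame: a two-dimensional table of rows and columns
df = pd.DataFrame({
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}, index=['Alice', 'Bob', 'Charlie'])
print(df)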

Data Indexing and Selection:


Introduction:
Data indexing and selection are fundamental operations in data manipulation,
allowing us to access and work with specific portions of a dataset. In Python, the
Pandas library provides powerful tools for indexing and selecting data using labels,
positions, or boolean conditions.

Label-based Indexing:
Label-based Indexing refers to selecting and accessing data in a Python
Pandas DataFrame (or Series) by using explicit labels (names of rows and
columns) rather than integer positions.
Example: Suppose we have a DataFrame df with columns 'Name', 'Age',
and 'City' as follows
Using label-based indexing in Pandas, we can access the data of an entire
column by specifying the column label and using either all rows or a subset
of rows.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df['Name']) # Selects the 'Name' column
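
Column selection with df['Name'] is the simplest form; label-based access to rows and cells uses .loc. A short self-contained sketch (the DataFrame is the same illustrative one as above):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

# Select the row labeled 0 and the 'Name' and 'City' columns by label
print(df.loc[0, ['Name', 'City']])

# Label-based slicing of rows is inclusive of both endpoints
print(df.loc[0:1, 'Age'])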

Position-based Indexing:
Position-based indexing in Python's Pandas library refers to accessing Data
Frame or Series elements using integer positions (index numbers) rather
than labels. This is done using the df.iloc[ ] accessor, which stands for
integer location based indexing.
Example:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df.iloc[0]) # Selects the first row
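
A short extension showing position-based slicing with iloc, where rows and columns are selected purely by integer position (same illustrative DataFrame):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

# First two rows and the second column (Age), by integer position
print(df.iloc[0:2, 1])

# Last row by position
print(df.iloc[-1])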

Boolean Indexing:
Boolean indexing in Python's Pandas library is a technique used to filter and
select data from a DataFrame based on conditions that evaluate
to True or False.
Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

print(df[df['Age'] > 25])


Renaming Axis Index:
Renaming the axis index in pandas means changing the "name" of the
row or column axis (the label shown above the index or columns) with
rename_axis(), or renaming specific row or column labels themselves with
rename(), as shown in the example below.

import pandas as pd
df = pd.DataFrame({
'Math': [85, 90],
'Science': [80, 95]
}, index=['Alice', 'Bob'])

print(df)

#Rename the row index axis


df1 = df.rename_axis("Student")
print(df1)

#Rename the column axis


df2 = df1.rename_axis("Subject", axis='columns')
print(df2)

#Rename specific row index values


df3 = df.rename(index={'Alice': 'Alicia'})
print(df3)

#Rename specific column names


df4 = df.rename(columns={'Math': 'Mathematics'})

print(df4)

Hierarchical Indexing

Hierarchical indexing, also known as multi-indexing, is a method used in data
analysis (such as with the Pandas library in Python) that allows you to use
more than one index (row label) to organize and access your data efficiently.
This is especially helpful for representing and working with complex,
multi-dimensional data within flat, tabular structures.

With hierarchical indexing, you can group data at more than one
category or level, allowing for:

I. Easier filtering and slicing at different levels (a selection sketch follows
Example 2 below).
II. Advanced grouping and aggregation.
III. Representation of higher-dimensional data in two-dimensional
DataFrames.

Hierarchical indexing is useful for:

a) Grouping statistics by multiple categories (e.g., sales by country and
product).
b) Handling time series data across multiple entities (e.g., stock prices for
several companies over years).

Example 1:

import pandas as pd

# Sample data
data = {
'region': ['East', 'East', 'West', 'West'],
'state': ['NY', 'NJ', 'CA', 'WA'],
'sales': [100, 150, 200, 250]
}

df = pd.DataFrame(data)
print(df)

# Set hierarchical (multi-level) index using 'region' and 'state'


df_multi = df.set_index(['region', 'state'])

print(df_multi)

Example 2:

import pandas as pd

arrays = [
['2023', '2023', '2024', '2024'],
['Q1', 'Q2', 'Q1', 'Q2']
]
print(arrays)

# Create MultiIndex
index = pd.MultiIndex.from_arrays(arrays, names=('Year', 'Quarter'))
print(index)

# Create Series with MultiIndex


data = pd.Series([100, 150, 200, 250], index=index)
print(data)
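
Continuing the idea from Example 1, a short sketch of selecting and aggregating data at one level of the hierarchy (rebuilt here so it runs on its own):

import pandas as pd

# Rebuild the multi-indexed frame from Example 1
df_multi = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'state': ['NY', 'NJ', 'CA', 'WA'],
    'sales': [100, 150, 200, 250]
}).set_index(['region', 'state'])

# Slice by the outer level: all rows for the East region
print(df_multi.loc['East'])

# Select a single (region, state) combination
print(df_multi.loc[('East', 'NY')])

# Group and aggregate at one level of the hierarchy
print(df_multi.groupby(level='region').sum())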

Data Aggregation:
Aggregation is the process of applying mathematical operations on a dataset
or a subset of it to produce summary statistics or consolidated values. It is a
fundamental technique in data analysis used to manipulate and summarize
data within a DataFrame.

Aggregation involves computing a summarizing function over data to reduce
detailed information into a form that is easier to analyze. Common
aggregation operations include sum, minimum, maximum, mean, count, etc.
Pandas provides a flexible aggregate method that allows applying one or
multiple aggregation functions to one or many columns.

Example 1:

import pandas as pd

df = pd.DataFrame({
'Sales': [100, 200, 300],
'Expenses': [80, 150, 250]
})
# Aggregate sum and min over all columns
result = df.aggregate(['sum', 'min'])
print(result)

#Applying Different Aggregations to Specific Columns


result = df.aggregate({'Sales': 'sum', 'Expenses': 'min'})
print(result)

Example 2:

import pandas as pd

df = pd.DataFrame({
'Math': [90, 80, 70],
'Science': [85, 85, 65],
'English': [88, 76, 95]
}, index=['Alice', 'Bob', 'Charlie'])

print(df)

#Aggregate sum of scores on each row (total score per student)


df['Total'] = df.aggregate('sum', axis=1)
print(df)

#Aggregate min and max on each row (min and max score per student)
row_min_max = df.aggregate(['min', 'max'], axis=1)
print(row_min_max)

#Apply multiple aggregation functions (mean, sum, std) on rows


row_stats = df.aggregate(['mean', 'sum', 'std'], axis=1)
print(row_stats)

Data Grouping :
Data grouping is the process of organizing data into categories or groups
based on one or more criteria. This technique is widely used in data analysis
and reporting to summarize large datasets and identify patterns or trends.

Data is divided into subsets called groups based on the values
of one or more columns (known as grouping keys). Common operations like
aggregation (sum, mean, count), filtering, and transformation are then
applied to each group separately.

Data grouping is useful because it allows you to:

 Summarize: Present condensed views of the data, making it easier to
interpret and analyze.
 Compare: Quickly contrast different categories or groups, such as
evaluating sales across regions or departments.
 Aggregate: Calculate summary statistics—like totals, averages, or
counts—for each group, providing valuable insights at a glance.

Example:

A company's sales data collected over multiple years, involving different
regions and products.

The goal is to summarize the total sales by region and product to understand
how each category is performing overall.
Python Code:

import pandas as pd

# Create a DataFrame with the sales data


data = {
'Region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Year': [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}

df = pd.DataFrame(data)

# Group by 'Region' and 'Product' columns and aggregate sales by summing


grouped_sales = df.groupby(['Region', 'Product'])['Sales'].sum()

print(grouped_sales)

Pivot table
A pivot table is a powerful data analysis tool that summarizes, reorganizes,
and groups selected columns and rows of data in a spreadsheet or database
to generate insightful reports. It allows you to view data from different
perspectives by "pivoting" the layout without changing the original dataset.
Pivot tables help in quickly aggregating data such as sums, averages,
counts, and percentages.

Pivot tables organize data into a grid where rows and columns represent
categorical data, values display aggregated numerical data, and filters allow
selective viewing based on criteria.

When to Use Pivot Tables

 To efficiently summarize and condense large datasets into clear,
understandable summaries.
 To analyze data by breaking it down into various categories and
dimensions for deeper insights.
 To easily compare different groups or segments, such as evaluating
sales performance by region or product.
 To detect important trends and identify any unusual data points or
outliers.
 To create dynamic, interactive reports that facilitate informed business
decisions.
 When you need fast and flexible data aggregation without writing
complex code or formulas.

Example:

import pandas as pd

# Sample data: Students, Subject, and Percentage


data = {
'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
'Subject': ['Math', 'Science', 'Math', 'Science'],
'Percentage': [85, 90, 75, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# Create pivot table: average percentage by Student and Subject


pivot_table = pd.pivot_table(df, index='Student', columns='Subject', values='Percentage')

print(pivot_table)
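
The aggregation function can also be set explicitly. A small extension of the example above, assuming we want totals with row/column margins rather than the default averages:

import pandas as pd

df = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'Subject': ['Math', 'Science', 'Math', 'Science'],
    'Percentage': [85, 90, 75, 80]
})

# Sum instead of the default mean, with overall totals added via margins
pivot_sum = pd.pivot_table(df, index='Student', columns='Subject',
                           values='Percentage', aggfunc='sum', margins=True)
print(pivot_sum)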
