EDA Unit 1

Exploratory Data Analysis (EDA) is a crucial initial step in understanding datasets, enabling analysts to identify patterns, anomalies, and ensure data quality before applying predictive models. Data Science, an interdisciplinary field, utilizes EDA alongside various phases of data analysis to extract actionable insights from structured and unstructured data. EDA employs statistical measures and visualizations to facilitate decision-making across multiple sectors, including finance, by revealing underlying truths about the data.

What is exploratory data analysis?

Exploratory data analysis (EDA) is the first step in understanding a
dataset before applying any predictive models or making business
decisions.

OR

EDA is a process of examining the available dataset to discover
patterns, spot anomalies, test hypotheses, and check assumptions using
statistical measures.

EDA helps analysts uncover patterns, trends, inconsistencies, and
missing values to ensure data quality and reliability. In the context of
financial services, EDA plays a crucial role in risk assessment, allowing teams
to identify key factors contributing to credit card delinquency and build
stronger prediction models.

Understanding of Data Science:


Data Science is an interdisciplinary field that integrates mathematics,
statistics, computer science, and domain expertise to extract
actionable insights from both structured and unstructured data. It
encompasses a systematic process that includes data collection, cleaning,
analysis, and interpretation, aimed at addressing real-world challenges
across various sectors.

Definition:
Data Science is the scientific discipline of applying computational, statistical,
and algorithmic methods to extract knowledge and insights from data in
diverse formats—ranging from structured datasets (e.g., databases,
spreadsheets) to unstructured sources (e.g., text, audio, and images).

There are several phases of data analysis, including

1) Data requirements
2) Data collection
3) Data processing
4) Data cleaning
5) Exploratory data analysis
6) Modeling and algorithms, and
7) Data product and communication.

These phases are similar to the CRoss-Industry Standard Process for Data
Mining (CRISP-DM) framework used in data mining.
THE SIGNIFICANCE OF EDA

 Different fields of science, economics, engineering, and marketing
accumulate and store data primarily in electronic databases.
Appropriate and well-established decisions should be made using the
data collected.
 It is practically impossible to make sense of datasets containing more
than a handful of data points without the help of computer programs.
To be certain of the insights that the collected data provides and to
make further decisions, data mining is performed where we go through
distinctive analysis processes.
 Exploratory data analysis is key, and usually the first exercise in data
mining. It allows us to visualize data to understand it as well as to
create hypotheses for further analysis. The exploratory analysis
centers around creating a synopsis of data or insights for the next
steps in a data mining project.
 EDA reveals the ground truth about the data without making
any underlying assumptions. This is why data scientists use
this process to understand what type of modeling and
hypotheses can be created.
 Key components of exploratory data analysis include summarizing
data, statistical analysis, and visualization of data. Python provides
expert tools for exploratory analysis, with pandas for summarizing;
scipy, along with others, for statistical analysis; and matplotlib and
plotly for visualizations.
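
A minimal sketch of how these components work together, assuming a small hypothetical daily sales series:

# Hedged sketch: pandas for summarizing, scipy for statistics, matplotlib for visualization
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

sales = pd.Series([120, 135, 128, 150, 300, 142, 138])  # hypothetical daily sales figures

print(sales.describe())            # pandas: summary statistics (count, mean, std, quartiles)
print(stats.zscore(sales))         # scipy: z-scores help flag unusually large or small values

plt.hist(sales, bins=5, edgecolor='black')  # matplotlib: distribution of the values
plt.title("Distribution of Daily Sales")
plt.show()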

Steps in EDA:

1) Problem Defining:
 Before trying to extract useful insights from the data, it is
essential to define the business problem to be solved. The
problem definition works as the driving force for executing the
data analysis plan.
 The main tasks involved in problem definition are defining the
main objective of the analysis, defining the main deliverables,
outlining the main roles and responsibilities, obtaining the
current status of the data, defining the timetable, and
performing cost/benefit analysis. Based on such a problem
definition, an execution plan can be created.
2) Data preparation:
 This step involves methods for preparing the dataset before
actual analysis. In this step, we define the sources of data, define
data schemas and tables, understand the main characteristics of
the data, clean the dataset, delete non-relevant datasets,
transform the data, and divide the data into required chunks for
analysis.
3) Data analysis:
 This is one of the most crucial steps that deals with descriptive
statistics and analysis of the data. The main tasks involve
summarizing the data, finding the hidden correlation and
relationships among the data, developing predictive models,
evaluating the models, and calculating the accuracies.
 Some of the techniques used for data summarization are
summary tables, graphs, descriptive statistics, inferential
statistics, correlation statistics, searching, grouping, and
mathematical models.
4) Development and representation of the results:
 This step involves presenting the dataset to the target audience
in the form of graphs, summary tables, maps, and diagrams. This
is also an essential step as the result analyzed from the dataset
should be interpretable by the business stakeholders, which is
one of the major goals of EDA.
 Most of the graphical analysis techniques include scatter
plots, character plots, histograms, box plots, residual plots, mean
plots, and others.
Making Sense of Data:
It is crucial to identify the type of data under analysis. Different disciplines
store different kinds of data for different purposes. For example, medical
researchers store patients' data, universities store students' and teachers'
data, and the real estate industry stores house and building datasets.

Most datasets broadly fall into two groups:

1) Numerical data
2) Categorical data.

1. Numerical data

This data has a sense of measurement involved in it; for example, a
person's age, height, weight, blood pressure, heart rate, temperature,
number of teeth, number of bones, and the number of family members. This
data is often referred to as quantitative data in statistics. The numerical
dataset can be of either discrete or continuous type.

a) Discrete data

This is data that is countable and whose values can be listed out. For
example, the number of heads in 200 coin flips can take any value
from 0 to 200, a finite number of cases. A variable that represents a discrete
dataset is referred to as a discrete variable.

The discrete variable takes a fixed number of distinct values. For
example, the Country variable can have values such as Nepal, India, Norway,
and Japan. It is fixed. The Rank variable of a student in a classroom can take
values such as 1, 2, 3, 4, 5, and so on.

b) Continuous data

A variable that can have an infinite number of numerical values within
a specific range is classified as continuous data. A variable describing
continuous data is a continuous variable.
2. Categorical Data:

This type of data represents characteristics of objects, e.g. gender and
marital status. This data is often referred to as qualitative data in statistics.
To understand clearly, here are some of the most common types of
categorical data you can find:

 Gender (Male, Female, Other, or Unknown)
 Marital Status (Annulled, Divorced, Interlocutory, Legally Separated,
Married, Polygamous, Never Married, Domestic Partner, Unmarried,
Widowed, or Unknown)
 Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy,
Historical, Horror, Mystery, Philosophical, Political, Romance, Saga,
Satire, Science Fiction, Social, Thriller, Urban, or Western)

A variable describing categorical data is referred to as a categorical variable.
These types of variables can have one of a limited number of values. It is
easier for computer science students to understand categorical values as
enumerated types or enumerations of variables. There are different types of
categorical variables:

Note:

A binary categorical variable can take exactly two values and is also
referred to as a dichotomous variable. For example, when you create an
experiment, the result is either success or failure. Hence, results can be
understood as a binary categorical variable.

Polytomous variables are categorical variables that can take more
than two possible values. For example, marital status can have several
values, such as annulled, divorced, interlocutory, legally separated, married,
polygamous, never married, domestic partner, unmarried, widowed,
and unknown. Since marital status can take more than two
possible values, it is a polytomous variable.
Measurement scales:
There are four different types of measurement scales described in statistics:

I. Nominal
II. Ordinal
III. Interval, and
IV. Ratio.

These scales are used more in academic and research settings.

Nominal:

These are practiced for labeling variables without any quantitative value. The
scales are generally referred to as labels. And these scales are mutually
exclusive and do not carry any numerical importance. Let's see some
examples:

- What is your gender?

- Male

- Female

- Third gender/Non-binary

Nominal scales are considered qualitative scales and the measurements that
are taken using qualitative scales are considered qualitative data.

Ordinal

The main difference between the ordinal and nominal scales is the order. In ordinal
scales, the order of the values is a significant factor. An easy tip to
remember the ordinal scale is that it sounds like an order.
For example, responses to a survey question such as "WordPress is making
content managers' lives easier" can be scaled down to five
different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and
Strongly Disagree. Scales like these are referred to as Likert scales.

To make it easier, consider ordinal scales as an order of ranking (1st, 2nd,
3rd, 4th, and so on). The median item is allowed as the measure of central
tendency; however, the average is not permitted.

Interval

In interval scales, both the order and the exact differences between the values
are significant. Interval scales are widely used in statistics, for example, in
measures of central tendency (mean, median, and mode) and of dispersion
(standard deviation). Examples include location in Cartesian coordinates and direction
measured in degrees from magnetic north. The mean, median, and mode are
allowed on interval data.
Ratio

Ratio scales contain order, exact values, and absolute zero, which makes it
possible to be used in descriptive and inferential statistics. These scales
provide numerous possibilities for statistical analysis. Mathematical
operations, the measure of central tendencies, and the measure of
dispersion and coefficient of variation can also be computed from such
scales.

Categorical variable → Nominal scale and Ordinal scale

Quantitative variable → Interval scale and Ratio scale

COMPARING EDA WITH CLASSICAL AND BAYESIAN ANALYSIS

There are several approaches to data analysis. The most popular ones that
are relevant to this book are the following:

Classical data analysis:


For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by
analysis and result communication.
Example: Predicting exam scores using a regression model after collecting
student data.

Exploratory data analysis approach:


For the EDA approach, it follows the same approach as classical data
analysis except the model imposition and the data analysis steps are
swapped. The main focus is on the data, its structure, outliers, models, and
visualizations. Generally, in EDA, we do not impose any deterministic or
probabilistic models on the data.
Example: Using graphs and summaries to understand students' marks
distribution before deciding which model to apply.

Bayesian data analysis approach:


The Bayesian approach incorporates prior probability distribution knowledge
into the analysis steps. Simply put, the
prior probability distribution of any quantity expresses the belief about that
particular quantity before considering some evidence.
Example: If past data suggests 80% students pass, and we get new data,
Bayesian methods update the probability of passing based on both.

SOFTWARE TOOLS AVAILABLE FOR EDA


There are several software tools that are available to facilitate EDA. Here, we
are going to outline some of the open source tools:

Python: This is an open source programming language widely used in data
analysis, data mining, and data science (https://www.python.org/).
The following libraries are used for EDA in Python:
1) Pandas: Data manipulation and summary statistics.
2) Matplotlib/Seaborn: Visualization tools for plots like histograms,
scatter plots, and box plots.
3) NumPy: Numerical computations for descriptive statistics.

Example: Use Pandas to compute mean and median, Seaborn for a box plot
of sales data.
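
A minimal sketch of this example, assuming a hypothetical 'sales' column:

# Hedged sketch: Pandas for mean and median, Seaborn for a box plot of sales data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'sales': [200, 220, 250, 210, 500, 230, 240]})  # hypothetical sales figures

print("Mean:", df['sales'].mean())      # Pandas: mean
print("Median:", df['sales'].median())  # Pandas: median

sns.boxplot(y=df['sales'])               # Seaborn: box plot of the sales distribution
plt.title("Sales Distribution")
plt.show()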

R programming language: R is an open source programming language
that is widely utilized in statistical computation and graphical data analysis
(https://www.r-project.org).
The following libraries are used for EDA in the R programming language:

1) ggplot2: Advanced visualizations (e.g., scatter plots, density plots).
2) dplyr: Data manipulation and summary statistics.
3) tidyr: Data cleaning and reshaping.

Example: Create a scatter plot with ggplot2 to explore relationships between
variables.
Weka: This is an open source data mining package that involves several
EDA tools and algorithms (https://www.cs.waikato.ac.nz/ml/weka/).

KNIME: This is an open source tool for data analysis and is based on Eclipse
(https://www.knime.com/).

VISUAL AIDS FOR EDA:

Visual aids are essential for Exploratory Data Analysis (EDA) as they help
uncover patterns, relationships, and anomalies in data.
Visual aids are crucial for:

i. Understanding variable distributions
ii. Identifying relationships between variables
iii. Detecting outliers or anomalies
iv. Visualizing missing data
v. Supporting feature engineering and model selection

The following are the commonly used visual aids:

1) Line chart
2) Bar chart
3) Scatter plot
4) Box Plot
5) Area plot and stacked plot
6) Pie chart
7) Table chart
8) Polar chart
9) Histogram
10) Lollipop chart
Line Chart:
A line chart is a type of data visualization that displays information as a
series of data points connected by straight lines. It is commonly used to
show trends, changes, or patterns over time or across categories, making it
especially useful for visualizing continuous data

Example of a line chart using Python with the matplotlib library

The following example demonstrates how to analyze and visualize the
change in average temperature from January to June using a line chart.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., months vs temperature)


months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
temperature = [22, 25, 28, 30, 32, 33]

# Step 3: Create the line chart


plt.plot(months, temperature, color='green', marker='o', linestyle='-')

# Step 4: Add titles and labels


plt.title("Average Monthly Temperature")
plt.xlabel("Month")
plt.ylabel("Temperature (°C)")

# Step 5: Show the plot


plt.grid(True)
plt.show()

Python Output:
Conclusion:

The line chart illustrating the average monthly temperature from January to
June demonstrates a consistent upward trend, with temperatures rising
from 22°C in January to 33°C in June.

Bar Plot:
A bar chart is a graphical representation of categorical data using
rectangular bars. Each bar’s length or height corresponds to the value of the
category it represents, making it easy to compare quantities across different
groups. Bar charts are widely used to display discrete data, such as counts,
frequencies, or totals for various categories.

Example of a bar chart using Python with the matplotlib library

To visualize the marks obtained by a student in different school subjects, we
can use a bar chart.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., subjects vs marks)


subjects = ["Math", "Science", "English", "History", "Geography"]
marks = [85, 90, 78, 70, 88]

# Step 3: Create the bar chart


plt.bar(subjects, marks, color='orange', edgecolor='black')

# Step 4: Add titles and labels


plt.title("Marks Obtained in Subjects")
plt.xlabel("Subjects")
plt.ylabel("Marks")

# Step 5: Show the plot


plt.grid(axis='y')
plt.show()

Python Output:
Conclusion:

Based on the bar chart, the data reflects the performance across five
subjects: Math, Science, English, History, and Geography. The highest marks
are observed in Science (90) and Geography (88), followed closely by Math
at 85. English scores 78, while History shows the lowest performance at 70.
This visual representation highlights strong performance in Science and
Geography, with a noticeable dip in History, providing a clear comparison of
subject-wise achievement.

Scatter Plot:
A scatter plot is a type of data visualization that displays individual data
points as dots on a two-dimensional plane, with one variable on the x-axis
and another on the y-axis. Each point represents an observation in the
dataset, allowing viewers to see the relationship—or lack thereof—between
the two variables.

Example of a scatter plot using Python with the matplotlib library

The following are examples of constructing a scatter plot using Python.

Using the matplotlib library:

###### Scatter plot using Matplotlib


# Import matplotlib
import matplotlib.pyplot as plt

# Sample data (X and Y values)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
# Create the scatter plot
plt.scatter(x, y, color='blue', marker='o')

# Add title and labels


plt.title("Simple Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# Show grid and plot


plt.grid(True)
plt.show()

Python Output:

Scatter plot using pandas library:

#########Scatter Plot Using Pandas


# Step 1: Import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Step 2: Create sample data as a Pandas DataFrame


data = {
'Math_Score': [45, 78, 88, 56, 90, 67, 76],
'Science_Score': [55, 80, 85, 60, 95, 70, 75]
}
df = pd.DataFrame(data)

# Step 3: Use DataFrame.plot.scatter to create scatter plot


df.plot.scatter(x='Math_Score', y='Science_Score', color='red', marker='o', title='Math vs Science Scores')
# Step 4: Show the plot
plt.show()

Python Output:

Box Plot:
A box plot—also known as a box-and-whisker plot—is a graphical
representation of a dataset's distribution, summarizing key statistical
measures in a compact, standardized format. It visually displays
the minimum, lower quartile (Q1), median, upper quartile (Q3), and
maximum of the data, often along with potential outliers. Box plots are
widely used in research, business, education, and any field where
understanding and communicating data distribution is important. They
complement other visualization tools like histograms and density plots.

The following is an example of constructing a box plot using Python.

import matplotlib.pyplot as plt


# Sample data
data = [7, 2, 5, 8, 6, 3, 9, 4, 7, 5]

# Create box plot


plt.boxplot(data)

# Add labels and title


plt.title('Simple Box Plot')
plt.ylabel('Values')

# Display the plot


plt.show()

Python Output:

Example 2:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70,120]})


plt.boxplot(df['Score'])
plt.title("Boxplot of Scores")
plt.show()

Output:
Pie Chart:
A pie chart is a circular graph that visually represents data by dividing a
circle into slices, where each slice corresponds to a category and its size
shows the proportion of that category relative to the whole. The entire pie
represents 100% (or 360 degrees), and each slice’s angle or area is
proportional to its share of the total.

A pie chart is most effective when you want to show how individual
categories contribute to a whole—that is, to display the part-to-whole
relationship in your data.

The following is an example of constructing a pie chart using Python, using
the same sample market share data as the 3D pie chart example below.

import matplotlib.pyplot as plt

# Sample data: smartphone market share
labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'Others']
sizes = [30, 25, 15, 10, 20]

# Create the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)

# Add title
plt.title('Smartphone Market Share')

# Display the plot
plt.show()

Python Output:
3D Pie Chart:

import matplotlib.pyplot as plt

labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'Others']


sizes = [30, 25, 15, 10, 20]
explode = (0.1, 0, 0, 0, 0) # explode 1st slice

plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140,


shadow=True, explode=explode, colors=['red', 'blue', 'green', 'orange',
'grey'])

plt.title('Simulated 3D Pie Chart with Shadow')


plt.axis('equal')
plt.show()
Histogram:

A histogram is a graphical representation used in statistics to show
the frequency distribution of numerical data across a continuous
range. It is made up of adjacent vertical bars (rectangles), where each
bar represents a range of values (called a bin or bucket), and the height of
the bar indicates the number of data points (frequency) that fall within that
range. The horizontal (x) axis shows the data ranges, and the vertical (y)
axis shows the frequency or count of data points in each range.

The following is an example of constructing a histogram with the help of
Python.

# Step 1: Import the library


import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., student marks out of 100)


marks = [45, 56, 67, 45, 89, 90, 67, 76, 56, 80, 70, 68, 59, 48, 77, 85, 94, 67]

# Step 3: Create a histogram


plt.hist(marks, bins=7, color='skyblue', edgecolor='black')

# Step 4: Add titles and labels


plt.title("Histogram of Student Marks")
plt.xlabel("Marks")
plt.ylabel("Number of Students")
# Step 5: Show the plot
plt.grid(True)
plt.show()

Area Plot:

An area plot (also called an area chart or area graph) is a data visualization
that displays quantitative data by plotting points and connecting them with
line segments—similar to a line chart—but then filling the area between the
line and the horizontal (x) axis with color or shading. This shaded area
visually emphasizes the magnitude of change and the cumulative total over
time or across categories.

Area plots can show one or more data series, and when multiple series
are included, each is represented by a differently colored or patterned area,
either overlapping or stacked. The stacked area chart is a common
variation, showing how different categories contribute to the total over time
(a stacked example is sketched after the simple area plot below).
# Step 1: Import the library
import matplotlib.pyplot as plt

# Step 2: Sample data (e.g., days vs active users)


days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
users = [100, 150, 130, 180, 170, 200, 190]

# Step 3: Create the area plot


plt.fill_between(days, users, color='skyblue', alpha=0.5)

# Step 4: Add line on top for clarity


plt.plot(days, users, color='blue', marker='o')

# Step 5: Add titles and labels


plt.title("Active Users During the Week")
plt.xlabel("Days")
plt.ylabel("Number of Users")

# Step 6: Show the plot


plt.grid(True)
plt.show()

Conclusion:
The chart shows the number of active users during the week from Monday to
Sunday. The number starts at 100 on Monday, increases to 150 on Tuesday,
dips to 130 on Wednesday, rises to 180 on Thursday, eases to 170 on Friday,
peaks at 200 on Saturday, and decreases slightly to 190 on Sunday. The
highest activity is on Saturday, while the lowest is on Monday.
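
The stacked variation mentioned above can be sketched with plt.stackplot; the two user groups below are hypothetical illustrative series, not data from this unit:

import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
mobile_users = [60, 90, 80, 110, 100, 130, 120]   # hypothetical series 1
desktop_users = [40, 60, 50, 70, 70, 70, 70]      # hypothetical series 2

# Stack the two series so the top edge shows the combined total per day
plt.stackplot(days, mobile_users, desktop_users, labels=['Mobile', 'Desktop'], alpha=0.6)

plt.title("Active Users by Platform During the Week")
plt.xlabel("Days")
plt.ylabel("Number of Users")
plt.legend(loc='upper left')
plt.show()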

DATA TRANSFORMATION TECHNIQUES

Data Transformation:

One of the fundamental steps of Exploratory Data Analysis (EDA) is data
wrangling. We need to understand the work that must be completed
before transforming our data for further examination, including
removing duplicates, replacing values, renaming axis indexes, discretization
and binning, and detecting and filtering outliers (a small outlier-filtering
sketch follows this paragraph).
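
As a small illustration of the outlier detection and filtering step mentioned above, here is a minimal sketch using the common IQR rule; the 'Score' column and its values are hypothetical sample data:

import pandas as pd

df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70, 120]})  # 120 is an unusually large value

# Compute the interquartile range (IQR)
q1 = df['Score'].quantile(0.25)
q3 = df['Score'].quantile(0.75)
iqr = q3 - q1

# Keep only rows within 1.5 * IQR of the quartiles (a common rule of thumb)
filtered = df[(df['Score'] >= q1 - 1.5 * iqr) & (df['Score'] <= q3 + 1.5 * iqr)]
print(filtered)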

Data transformation is a set of techniques used to convert data from one
format or structure to another format or structure. The following are some
examples of transformation activities:

Data deduplication involves the identification of duplicates and their
removal.

Key restructuring involves transforming any keys with built-in meanings to
generic keys.

Data cleansing involves deleting out-of-date, inaccurate, and incomplete
information from the source data without losing its meaning or information,
in order to enhance the accuracy of the source data.

Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.

Format revisioning involves converting from one format to another.

Data derivation consists of creating a set of rules to generate more
information from the data source.

Data aggregation involves searching, extracting, summarizing, and
preserving important information in different types of reporting systems.

Data integration involves converting different data types and merging
them into a common structure or schema.

Data filtering involves identifying information relevant to any particular
user.

Data joining involves establishing a relationship between two or more
tables.

The main reason for transforming the data is to get a better representation
such that the transformed data is compatible with other data. In addition to
this, interoperability in a system can be achieved by following a common
data structure and format.

Merging Datasets:
Merging is the process of combining two or more datasets based on a
common key or index. It is especially useful when information is spread
across multiple tables or files.

For example, if you have customer names and addresses in one file and their
purchase history in another, merging means putting both types of
information into a single file, making it easier to analyze or report on
combined data.

In pandas, we can merge datasets using four primary join types:

1. Inner Join
2. Left Join
3. Right Join
4. Outer (Full) Join

Inner Join:
An INNER JOIN is a way to combine rows from two or more tables based on
a common column between them. It only returns rows where there are
matching values in both tables—rows without a match are excluded from the
result.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='inner')


print(result)

Left Join:
A LEFT JOIN is a way to combine rows from two tables. It returns all rows
from the left table (the first table you mention) and only the matching
rows from the right table (the second table). If there is no match in the
right table, the result will have NaN values for the right table's columns.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='left')


print(result)
Right Join:
A RIGHT JOIN is a type of join that returns all rows from the right
table (the second table listed in your query) and only the matching rows
from the left table (the first table). If there is no match in the left table, the
result will have NULL values for the left table's columns.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='right')


print(result)

Outer (Full) Join:

An outer join is a method to combine rows from two or more tables based
on a related column, including not only the rows with matches in both
tables, but also the rows that do not have a match in one or both
tables.

import pandas as pd

#Data set 1
students = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

#Data Set 2
scores = pd.DataFrame({
'ID': [2, 3, 4],
'Score': [88, 91, 85]
})

result = pd.merge(students, scores, on='ID', how='outer')


print(result)

Merging datasets using the pandas concat() method:

Syntax:
import pandas as pd

pd.concat(objs, axis=0, ignore_index=False)

where,

objs: List of DataFrames or Series to concatenate.

axis: 0 for rows (default), 1 for columns.

ignore_index: If True, resets index in the result.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})


print(df1)
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
print(df2)

result = pd.concat([df1, df2], axis=0, ignore_index=False)


print(result)
Combining Datasets Using Append in Pandas
Appending is a straightforward way to combine two or more datasets by
adding the rows of one DataFrame to the end of another—creating a
single, taller table (not wider). This is different from merging or joining, which
combine data by matching keys or columns.

Note: As of pandas 2.0, the DataFrame.append() method has been
removed. Instead of DataFrame.append(), we can use the pd.concat()
method.
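
For example, a minimal sketch (with hypothetical sample data) of appending one DataFrame's rows below another using pd.concat():

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Charlie'], 'Age': [22]})

# Append df2's rows below df1 (replacement for the removed DataFrame.append())
appended = pd.concat([df1, df2], ignore_index=True)
print(appended)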

Reshaping Data
Reshaping means changing the structure (the rows and columns layout) of
your dataset to make it easier to analyze, visualize, or fit into a particular
model. Data can be reshaped in many ways, such as going from wide
format (many columns, few rows) to long format (many rows, fewer
columns) and vice versa.

Example:

import pandas as pd

df = pd.DataFrame({
'ID': [1, 2],
'Math': [90, 85],
'Science': [80, 95]
})

melted = pd.melt(df, id_vars='ID', var_name='Subject', value_name='Score')


print(melted)

Output:
Pivoting Data
Pivoting is a specific type of reshaping where you convert data from long
format (many rows, few columns) to wide format (few rows, many columns).
This is useful when you want to compare values across different categories
or time periods.

import pandas as pd

df_long = pd.DataFrame({
'ID': [1, 1, 2, 2],
'Subject': ['Math', 'Science', 'Math', 'Science'],
'Score': [90, 80, 85, 95]
})

pivoted = df_long.pivot(index='ID', columns='Subject', values='Score')


print(pivoted)

Output:
Transformation techniques:

1) Performing data deduplication:


Data deduplication is a process that removes duplicate records or
repeating data segments within a dataset. This is an essential step in data
transformation, especially when preparing data for analysis. Deduplication
improves data quality and storage efficiency.

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice']})


df = df.drop_duplicates()

print(df)

2) Replacing values :
Replacing values is a common data transformation technique used to
modify specific data values within a dataset. This is often done to correct
errors, standardize formats, or prepare data for analysis and modeling.

import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F', 'M']})


df = df.replace({'M': 'Male', 'F': 'Female'})

print(df)

3) Handling Missing Data :


Handling missing data is a crucial step in preparing datasets for analysis
or machine learning. Missing values can distort results, introduce bias, and
reduce the effectiveness of models if not managed properly.

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [25, np.nan, 30]})

print(df.isnull())

4) Dropping Missing Values:

Dropping missing values means removing rows or columns from a dataset
where data is incomplete or missing. This is one of the most common and
simplest ways to handle missing data.

import pandas as pd

# Create a sample DataFrame


data = {
'Name': ['Alice', 'Bob', None, 'David'],
'Age': [25, 30, 22, None]
}
df = pd.DataFrame(data)

# Drop rows with any missing values


df_drop_rows = df.dropna()
print(df_drop_rows)

# Drop a specific column ('Age') that contains missing values

df_drop_columns = df.drop(columns=['Age'])
print(df_drop_columns)

5) Filling Missing Values :


Filling missing values is the process of replacing empty or undefined data
entries (usually shown as NaN, null, or blank spaces) with valid values so
the dataset becomes complete, consistent, and usable for analysis or
machine learning.

This is also called “imputation”.

import pandas as pd
data = {'Age': [25, None, 30]}  # missing value represented as None (or np.nan)
df = pd.DataFrame(data)
df = df.fillna(0)
print(df)

6) Backward and Forward Filling:

Backward filling (backfill or bfill) and forward filling (ffill) are simple data
imputation methods used to handle missing values, especially in time
series or sequential data.

import pandas as pd
data = {'Age': [25, None, 30]}  # missing value represented as None (or np.nan)
df = pd.DataFrame(data)
df_forward = df.ffill()    # forward fill (fillna(method='ffill') is deprecated in recent pandas)
print(df_forward)
df_backward = df.bfill()   # backward fill
print(df_backward)

7) Interpolating missing values :


Interpolation is a method of estimating missing values in a dataset by
using the existing surrounding data points. Instead of simply copying nearby
values (like forward fill or backward fill), interpolation calculates new
values based on mathematical formulas—usually assuming values
change smoothly over time or across a sequence.

import pandas as pd
df = pd.DataFrame({'Age': [20, None, 30]})
df = df.interpolate()
print(df)

8) Discretization and Binning:

Discretization is the process of converting continuous numerical data into
discrete categories, intervals, or bins. This transformation helps simplify
data, making it easier to analyze, visualize, and feed into machine learning
models that require categorical input.

Binning (also called bucketing) is a common technique for discretization. In
binning, the range of continuous values is split into a set of discrete intervals
called bins.

Example 1: pd.cut() – Binning with specified bin edges:

import pandas as pd
df = pd.DataFrame({'Age': [15, 30, 20, 15, 66, 87, 13]})
df['Group'] = pd.cut(df['Age'], bins=[0, 18, 60, 100], labels=['Teen', 'Adult', 'Senior'])
print(df)
Example 2: pd.qcut() – Quantile-based Binning:

#Imports the pandas library


import pandas as pd

#Insert Data
scores = [10, 20, 30, 40, 50, 60, 70, 80]

#Create the DataFrame


df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70, 80]})

#Apply qcut() for Binning


# Divide scores into 4 quantile bins
df['ScoreGroup'] = pd.qcut(df['Score'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])

#Print the Final DataFrame


print(df)
Data Manipulation using Pandas
Data manipulation using pandas involves organizing, transforming, and
analyzing tabular data (data structures like DataFrames and Series) with the
pandas library in Python. Pandas provides intuitive tools to efficiently
handle common data processing tasks for analysis, visualization, or
modeling.

Pandas Objects:
Pandas is a Python library widely used for data manipulation and analysis. It
provides two fundamental data structures for handling structured data:
Series and DataFrame.

Series: A Series is a one-dimensional, labeled array capable of holding any
data type, such as integers, floats, or strings.

DataFrame: A DataFrame in Python (using the pandas library) is a two-
dimensional, labeled data structure that organizes data in rows and columns,
much like a table in a database or a spreadsheet in Excel.
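
A minimal sketch creating each structure (the values are illustrative):

import pandas as pd

# Series: a one-dimensional labeled array
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'])
print(ages)

# DataFrame: a two-dimensional table of rows and columns
df = pd.DataFrame({
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}, index=['Alice', 'Bob', 'Charlie'])
print(df)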

Data Indexing and Selection:


Introduction:
Data indexing and selection are fundamental operations in data manipulation,
allowing us to access and work with specific portions of a dataset. In Python, the
Pandas library provides powerful tools for indexing and selecting data using labels,
positions, or boolean conditions.

Label-based Indexing:
Label-based Indexing refers to selecting and accessing data in a Python
Pandas DataFrame (or Series) by using explicit labels (names of rows and
columns) rather than integer positions.
Example: Suppose we have a DataFrame df with columns 'Name', 'Age',
and 'City' as follows
Using label-based indexing in Pandas, we can access the data of an entire
column by specifying the column label and using either all rows or a subset
of rows.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df['Name']) # Selects the 'Name' column
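
Column selection with df['Name'] is the simplest form; label-based access to rows and cells uses .loc. A short self-contained sketch (the DataFrame is the same illustrative one as above):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

# Select the row labeled 0 and the 'Name' and 'City' columns by label
print(df.loc[0, ['Name', 'City']])

# Label-based slicing of rows is inclusive of both endpoints
print(df.loc[0:1, 'Age'])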

Position-based Indexing:
Position-based indexing in Python's Pandas library refers to accessing Data
Frame or Series elements using integer positions (index numbers) rather
than labels. This is done using the df.iloc[ ] accessor, which stands for
integer location based indexing.
Example:

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)
print(df.iloc[0]) # Selects the first row
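
A short extension showing position-based slicing with iloc, where rows and columns are selected purely by integer position (same illustrative DataFrame):

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

# First two rows and the second column (Age), by integer position
print(df.iloc[0:2, 1])

# Last row by position
print(df.iloc[-1])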

Boolean Indexing:
Boolean indexing in Python's Pandas library is a technique used to filter and
select data from a DataFrame based on conditions that evaluate
to True or False.
Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

print(df[df['Age'] > 25])


Renaming Axis Index:
Renaming the axis index in pandas means changing the "name" of the
row or column axis (the label shown above the index or columns) with
rename_axis(), or renaming specific row or column labels themselves with
rename(), as shown in the example below.

import pandas as pd
df = pd.DataFrame({
'Math': [85, 90],
'Science': [80, 95]
}, index=['Alice', 'Bob'])

print(df)

#Rename the row index axis


df1 = df.rename_axis("Student")
print(df1)

#Rename the column axis


df2 = df1.rename_axis("Subject", axis='columns')
print(df2)

#Rename specific row index values


df3 = df.rename(index={'Alice': 'Alicia'})
print(df3)

#Rename specific column names


df4 = df.rename(columns={'Math': 'Mathematics'})

print(df4)

Hierarchical Indexing

Hierarchical indexing, also known as multi-indexing, is a method used in data
analysis (such as with the Pandas library in Python) that allows you to use
more than one index (row label) to organize and access your data efficiently.
This is especially helpful for representing and working with complex,
multi-dimensional data within flat, tabular structures.

With hierarchical indexing, you can group data at more than one
category or level, allowing for:

I. Easier filtering and slicing at different levels (a selection sketch follows
Example 2 below).
II. Advanced grouping and aggregation.
III. Representation of higher-dimensional data in two-dimensional
DataFrames.

Hierarchical indexing is useful for:

a) Grouping statistics by multiple categories (e.g., sales by country and
product).
b) Handling time series data across multiple entities (e.g., stock prices for
several companies over years).

Example 1:

import pandas as pd

# Sample data
data = {
'region': ['East', 'East', 'West', 'West'],
'state': ['NY', 'NJ', 'CA', 'WA'],
'sales': [100, 150, 200, 250]
}

df = pd.DataFrame(data)
print(df)

# Set hierarchical (multi-level) index using 'region' and 'state'


df_multi = df.set_index(['region', 'state'])

print(df_multi)

Example 2:

import pandas as pd

arrays = [
['2023', '2023', '2024', '2024'],
['Q1', 'Q2', 'Q1', 'Q2']
]
print(arrays)

# Create MultiIndex
index = pd.MultiIndex.from_arrays(arrays, names=('Year', 'Quarter'))
print(index)

# Create Series with MultiIndex


data = pd.Series([100, 150, 200, 250], index=index)
print(data)
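
Continuing the idea from Example 1, a short sketch of selecting and aggregating data at one level of the hierarchy (rebuilt here so it runs on its own):

import pandas as pd

# Rebuild the multi-indexed frame from Example 1
df_multi = pd.DataFrame({
    'region': ['East', 'East', 'West', 'West'],
    'state': ['NY', 'NJ', 'CA', 'WA'],
    'sales': [100, 150, 200, 250]
}).set_index(['region', 'state'])

# Slice by the outer level: all rows for the East region
print(df_multi.loc['East'])

# Select a single (region, state) combination
print(df_multi.loc[('East', 'NY')])

# Group and aggregate at one level of the hierarchy
print(df_multi.groupby(level='region').sum())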

Data Aggregation:
Aggregation is the process of applying mathematical operations on a dataset
or a subset of it to produce summary statistics or consolidated values. It is a
fundamental technique in data analysis used to manipulate and summarize
data within a DataFrame.

Aggregation involves computing a summarizing function over data to reduce
detailed information into a form that is easier to analyze. Common
aggregation operations include sum, minimum, maximum, mean, count, etc.
Pandas provides a flexible aggregate method that allows applying one or
multiple aggregation functions to one or many columns.

Example 1:

import pandas as pd

df = pd.DataFrame({
'Sales': [100, 200, 300],
'Expenses': [80, 150, 250]
})
# Aggregate sum and min over all columns
result = df.aggregate(['sum', 'min'])
print(result)

#Applying Different Aggregations to Specific Columns


result = df.aggregate({'Sales': 'sum', 'Expenses': 'min'})
print(result)

Example 2:

import pandas as pd

df = pd.DataFrame({
'Math': [90, 80, 70],
'Science': [85, 85, 65],
'English': [88, 76, 95]
}, index=['Alice', 'Bob', 'Charlie'])

print(df)

#Aggregate sum of scores on each row (total score per student)


df['Total'] = df.aggregate('sum', axis=1)
print(df)

#Aggregate min and max on each row (min and max score per student)
row_min_max = df.aggregate(['min', 'max'], axis=1)
print(row_min_max)

#Apply multiple aggregation functions (mean, sum, std) on rows


row_stats = df.aggregate(['mean', 'sum', 'std'], axis=1)
print(row_stats)

Data Grouping :
Data grouping is the process of organizing data into categories or groups
based on one or more criteria. This technique is widely used in data analysis
and reporting to summarize large datasets and identify patterns or trends.

Data is divided into subsets called groups based on the values
of one or more columns (known as grouping keys). Common operations like
aggregation (sum, mean, count), filtering, and transformation are then
applied to each group separately.

Data grouping is useful because it allows you to:

 Summarize: Present condensed views of the data, making it easier to
interpret and analyze.
 Compare: Quickly contrast different categories or groups, such as
evaluating sales across regions or departments.
 Aggregate: Calculate summary statistics—like totals, averages, or
counts—for each group, providing valuable insights at a glance.

Example:

A company's sales data collected over multiple years, involving different
regions and products.

The goal is to summarize the total sales by region and product to understand
how each category is performing overall.
Python Code:

import pandas as pd

# Create a DataFrame with the sales data


data = {
'Region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Year': [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
'Sales': [100, 150, 200, 250, 300, 350, 400, 450]
}

df = pd.DataFrame(data)

# Group by 'Region' and 'Product' columns and aggregate sales by summing


grouped_sales = df.groupby(['Region', 'Product'])['Sales'].sum()

print(grouped_sales)

Pivot table
A pivot table is a powerful data analysis tool that summarizes, reorganizes,
and groups selected columns and rows of data in a spreadsheet or database
to generate insightful reports. It allows you to view data from different
perspectives by "pivoting" the layout without changing the original dataset.
Pivot tables help in quickly aggregating data such as sums, averages,
counts, and percentages.

Pivot tables organize data into a grid where rows and columns represent
categorical data, values display aggregated numerical data, and filters allow
selective viewing based on criteria.

When to Use Pivot Tables

 To efficiently summarize and condense large datasets into clear,
understandable summaries.
 To analyze data by breaking it down into various categories and
dimensions for deeper insights.
 To easily compare different groups or segments, such as evaluating
sales performance by region or product.
 To detect important trends and identify any unusual data points or
outliers.
 To create dynamic, interactive reports that facilitate informed business
decisions.
 When you need fast and flexible data aggregation without writing
complex code or formulas.

Example:

import pandas as pd

# Sample data: Students, Subject, and Percentage


data = {
'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
'Subject': ['Math', 'Science', 'Math', 'Science'],
'Percentage': [85, 90, 75, 80]
}

# Create DataFrame
df = pd.DataFrame(data)

# Create pivot table: average percentage by Student and Subject


pivot_table = pd.pivot_table(df, index='Student', columns='Subject', values='Percentage')

print(pivot_table)
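
The aggregation function can also be set explicitly. A small extension of the example above, assuming we want totals with row/column margins rather than the default averages:

import pandas as pd

df = pd.DataFrame({
    'Student': ['Alice', 'Alice', 'Bob', 'Bob'],
    'Subject': ['Math', 'Science', 'Math', 'Science'],
    'Percentage': [85, 90, 75, 80]
})

# Sum instead of the default mean, with overall totals added via margins
pivot_sum = pd.pivot_table(df, index='Student', columns='Subject',
                           values='Percentage', aggfunc='sum', margins=True)
print(pivot_sum)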
