EDA (2)

The document provides examples of various data manipulation techniques using pandas in Python, including merging DataFrames, reshaping data with hierarchical indexing, detecting and removing duplicates, and handling missing values. It also covers data transformation techniques such as renaming indexes, discretization and binning, and random sampling. Each section includes code snippets demonstrating the respective techniques.

Uploaded by hemanthboni18
Copyright: © All Rights Reserved

10)

a) Merging DataFrames

import pandas as pd

# Creating two DataFrames

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})

df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 78]})

# Merging on 'ID' column

merged_df = pd.merge(df1, df2, on='ID', how='inner') # Inner Join

print(merged_df)
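The how argument decides which IDs survive the merge. As a minimal sketch, reusing the two frames above, the other join types could be compared like this. The indicator=True option adds a _merge column showing which frame each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 90, 78]})

# Left join: keep every row of df1; Score is NaN where df2 has no match
left_df = pd.merge(df1, df2, on='ID', how='left')

# Outer join: keep IDs from both frames; _merge records each row's origin
outer_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left_df)
print(outer_df)
```

With the inner join only IDs 1 and 2 remain; the left join also keeps ID 3, and the outer join keeps IDs 3 and 4 with missing values on the unmatched side.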

b) Reshaping with Hierarchical Indexing


# Creating a long-format DataFrame (one row per Category/Year pair)

import pandas as pd

data = {

'Category': ['A', 'A', 'B', 'B'],

'Year': [2022, 2023, 2022, 2023],

'Value': [10, 15, 20, 25] }

df = pd.DataFrame(data)

# Pivot: Category becomes the row index, each Year becomes a column

reshaped_df = df.pivot_table(index='Category', columns='Year', values='Value')

print(reshaped_df)
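The pivot above produces a wide table but never builds a hierarchical index itself. A minimal sketch of the same reshaping done through an explicit MultiIndex, assuming the same data, might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Year': [2022, 2023, 2022, 2023],
    'Value': [10, 15, 20, 25],
})

# Two-level (hierarchical) row index: Category first, then Year
indexed = df.set_index(['Category', 'Year'])

# Partial indexing: select every row of the outer level 'A'
category_a = indexed.loc['A']

# Full indexing: one (Category, Year) pair
b_2023 = indexed.loc[('B', 2023), 'Value']

# unstack() moves the inner level (Year) from the rows into the columns
wide = indexed['Value'].unstack('Year')

print(indexed)
print(category_a)
print(wide)
```

For this data, wide holds the same numbers as the pivot_table result, because each Category/Year pair occurs only once.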

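c) Data Duplication

import pandas as pd

data = {'ID': [1, 2, 3, 1, 4], 'Value': ['A', 'B', 'C', 'A', 'D']}
df = pd.DataFrame(data)

# Detecting duplicates (True marks a row that repeats an earlier one)

print(df.duplicated())

# Showing every row whose ID occurs more than once

duplicates = df[df.duplicated(subset=['ID'], keep=False)]

print(duplicates)

# Removing duplicates (the first occurrence is kept)

print(df.drop_duplicates())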
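The keep argument passed to duplicated() above is also accepted by drop_duplicates(). The following sketch, on the same data, shows how the three settings might differ when duplicates are judged by ID alone:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 4], 'Value': ['A', 'B', 'C', 'A', 'D']})

# Keep the first occurrence of each ID (the default)
first_kept = df.drop_duplicates(subset=['ID'], keep='first')

# Keep the last occurrence instead
last_kept = df.drop_duplicates(subset=['ID'], keep='last')

# Drop every row whose ID appears more than once
none_kept = df.drop_duplicates(subset=['ID'], keep=False)

print(first_kept)
print(last_kept)
print(none_kept)
```

The original row labels are preserved, so chaining reset_index(drop=True) is a common follow-up when a clean 0..n-1 index is wanted.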

d) Replacing Values

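import pandas as pd

data = {
    'Category': ['A', 'A', 'B', 'B'],
    'Year': [2022, 2023, 2022, 2023],
    'Value': [10, 15, 20, 25]
}

df = pd.DataFrame(data)

# Replace a single value in one column

df['Value'] = df['Value'].replace(10, 100)

# Replace several values at once with a per-column mapping

replaced_df = df.replace({'Category': {'A': 'Alpha', 'B': 'Beta'}})

print(replaced_df)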

11) Apply different missing data handling techniques

a) NaN values in mathematical operations

import pandas as pd

import numpy as np

data = {'Values': [10, np.nan, 20, 30, np.nan]}

s = pd.Series(data['Values'])

print("Original Series:\n", s)

# Sum (NaN ignored)

total = s.sum()

print("\nSum:", total)

# Mean (NaN ignored)

average = s.mean()

print("Mean:", average)

# Count (non-NaN values)

count = s.count()

print("Count:", count)

# Max (NaN ignored)

maximum = s.max()

print("Max:", maximum)
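The reductions above skip NaN by default. A short sketch of the cases where NaN is not ignored may be useful for contrast; the series all_missing is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, 20, 30, np.nan])

# skipna=False lets a single NaN turn the whole result into NaN
strict_sum = s.sum(skipna=False)

# Element-wise arithmetic always propagates NaN
shifted = s + 1

# min_count sets how many valid values are needed for a non-NaN result
all_missing = pd.Series([np.nan, np.nan])
default_sum = all_missing.sum()             # empty sum, evaluates to 0.0
guarded_sum = all_missing.sum(min_count=1)  # NaN, no valid values present

print(strict_sum, default_sum, guarded_sum)
print(shifted)
```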

b) Filling in missing values

import pandas as pd

import numpy as np

data = {'A': [1, np.nan, 3], 'B': [4, 5, np.nan]}

df = pd.DataFrame(data)

print("Original:\n", df)

# Fill NaN in column 'A' with 0

df['A'] = df['A'].fillna(0)

# Fill NaN in column 'B' with the mean of 'B'

df['B'] = df['B'].fillna(df['B'].mean())

print("\nFilled:\n", df)
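The two column-by-column assignments can also be written as a single call. As a sketch on the same data, fillna() accepts a dict (or a Series) that maps each column to its own fill value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})

# One fill value per column, supplied as a dict
filled = df.fillna({'A': 0, 'B': df['B'].mean()})

# Alternatively, fill every column with its own median
filled_median = df.fillna(df.median())

print(filled)
print(filled_median)
```

Both calls return new DataFrames and leave df unchanged.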

c) Forward and backward filling of missing values

import pandas as pd

import numpy as np

data = {'A': [1, np.nan, 3, np.nan, 5]}

df = pd.DataFrame(data)

print("Original:\n", df)

# Forward fill (fillna(method='ffill') is deprecated in recent pandas releases)

ffilled = df.ffill()

print("\nForward filled:\n", ffilled)

# Backward fill

bfilled = df.bfill()

print("\nBackward filled:\n", bfilled)
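A forward fill can carry a stale value across a long gap, and neither direction alone covers NaN at both ends of the data. The sketch below, using small illustrative series that are not part of the exercise data, shows the limit parameter and a combined fill:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, 4, np.nan])

# Carry each value forward across at most one missing position
ffilled_limited = s.ffill(limit=1)

# Forward fill followed by backward fill covers leading and trailing NaN
edges = pd.Series([np.nan, 2, np.nan])
both = edges.ffill().bfill()

print(ffilled_limited)
print(both)
```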

d) Filling with index values

import pandas as pd

import numpy as np

data = {'A': [1, np.nan, 3, np.nan]}

df = pd.DataFrame(data)

print("Original:\n", df)

# Fill NaN with the row's index label

for i in range(len(df)):
    if pd.isna(df.loc[i, 'A']):
        df.loc[i, 'A'] = i

print("\nFilled with index:\n", df)
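The row-by-row loop works for a small frame. A vectorized alternative, sketched here on the same data, relies on fillna() aligning a Series on the index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, np.nan]})

# df.index.to_series() holds each row's label as its value, so every
# missing cell is filled with the label of its own row
df['A'] = df['A'].fillna(df.index.to_series())

print(df)
```

Unlike the loop, this does not assume the index is the default 0..n-1 range, although the labels do need to be numeric to fit in a numeric column.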

e) Interpolation of missing values

import pandas as pd

import numpy as np

data = {'A': [1, np.nan, 3, np.nan, 5]}

df = pd.DataFrame(data)

print("Original:\n", df)

# Linear interpolation

interpolated = df.interpolate()

print("\nInterpolated:\n", interpolated)
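In the example above every NaN lies between two known values. When gaps sit at the start or end of the data, the limit_direction and limit_area parameters decide what happens. A sketch on an illustrative series (not from the exercise):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, np.nan, np.nan, 8, np.nan])

# Default: fills forward only, so the leading NaN stays and the trailing
# NaN receives the last valid value
default = s.interpolate()

# limit_direction='both' also fills the leading NaN
both = s.interpolate(limit_direction='both')

# limit_area='inside' fills only NaN surrounded by valid values
inside = s.interpolate(limit_area='inside')

print(default)
print(both)
print(inside)
```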

12) Apply different data transformation techniques

a) Renaming axis indexes

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

df = pd.DataFrame(data, index=['x', 'y', 'z'])


print("Original:\n", df)

# Rename columns

df_renamed_cols = df.rename(columns={'A': 'New_A', 'B': 'New_B'})

print("\nRenamed columns:\n", df_renamed_cols)

# Rename index

df_renamed_index = df.rename(index={'x': 'one', 'z': 'three'})

print("\nRenamed index:\n", df_renamed_index)

# Rename both

df_renamed_both = df.rename(columns={'A': 'a'}, index={'y': '2'})

print("\nRenamed both:\n", df_renamed_both)
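Besides dict mappings, rename() accepts a function that is applied to every label. The sketch below also shows rename_axis(), which names the axis itself rather than its labels; the names 'row' and 'field' are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['x', 'y', 'z'])

# A function is applied to every label on the chosen axis
upper_index = df.rename(index=str.upper)
prefixed_cols = df.rename(columns=lambda name: f'col_{name}')

# rename_axis() sets the name of the index and of the columns axis
named = df.rename_axis(index='row', columns='field')

print(upper_index)
print(prefixed_cols)
print(named)
```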

b) Discretization and Binning

import pandas as pd

ages = pd.Series([22, 35, 48, 61, 28])

# Equal-width bins

bins = [20, 40, 60, 80]

age_bins = pd.cut(ages, bins)

print("Equal-width bins:\n", age_bins)

# Equal-frequency bins

age_qbins = pd.qcut(ages, 2)

print("\nEqual-frequency bins:\n", age_qbins)

NOTE: Key differences

1) pd.cut() (equal-width): bins are defined by value ranges, here (20, 40], (40, 60] and (60, 80]. Each interval includes its right edge by default. The number of elements in each bin may vary.

2) pd.qcut() (equal-frequency): bin edges are taken from the quantiles of the data, so each bin holds approximately the same number of elements. The width of the bins may vary.
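To make the difference concrete, the bins can be labelled and counted. This is a sketch on the same ages; the labels 'young', 'middle' and 'senior' are illustrative names, not part of the exercise.

```python
import pandas as pd

ages = pd.Series([22, 35, 48, 61, 28])

# Named bins with explicit edges: (20, 40], (40, 60], (60, 80]
age_groups = pd.cut(ages, bins=[20, 40, 60, 80],
                    labels=['young', 'middle', 'senior'])

# How many values fall into each fixed-width bin
cut_counts = age_groups.value_counts()

# Two quantile bins, split at the median age
qcut_counts = pd.qcut(ages, 2).value_counts()

print(age_groups)
print(cut_counts)
print(qcut_counts)
```

With five values the quantile bins cannot be exactly equal, which is why qcut is described as producing approximately equal counts.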

c) Permutation and random sampling


import pandas as pd

data = {'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']}

df = pd.DataFrame(data)

print("Original:\n", df)

# Permute rows

permuted_df = df.sample(frac=1).reset_index(drop=True)

print("\nPermuted:\n", permuted_df)

# Random sample (3 rows)

sampled_df = df.sample(n=3)

print("\nSampled:\n", sampled_df)

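NOTE: The output varies between runs because the rows are drawn at random. Passing random_state (for example df.sample(n=3, random_state=0)) makes the result repeatable.

# Reset the index of the sampled rows so they are numbered from 0

sampled_df = sampled_df.reset_index(drop=True)

print(sampled_df)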
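For reproducible experiments the randomness is usually seeded. The sketch below, on the same frame, shows a seeded shuffle, an explicit permutation with NumPy, and sampling with replacement; the seed values are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})

# A fixed random_state makes the shuffle repeatable across runs
shuffled_1 = df.sample(frac=1, random_state=42)
shuffled_2 = df.sample(frac=1, random_state=42)

# Same idea with NumPy: permute the row positions, then reorder with take()
rng = np.random.default_rng(0)
order = rng.permutation(len(df))
permuted = df.take(order)

# Sampling with replacement can draw the same row more than once
boot = df.sample(n=5, replace=True, random_state=0)

print(shuffled_1)
print(permuted)
print(boot)
```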
