Datacleaning Py

Uploaded by

rakshithasai22
Copyright
© All Rights Reserved
Data Cleaning and Preprocessing with Pandas

Step 1: Load the Dataset

import pandas as pd

# Load the dataset

url = '[Link]'

df = pd.read_csv(url)

# Display the first few rows of the dataset

df.head()

Step 2: Inspect the Data

# Display the summary statistics of the dataset

df.describe()

# Check for missing values

df.isnull().sum()
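
Raw counts of missing values are easier to judge as a share of all rows. The sketch below is a minimal illustration on a hypothetical toy frame (the `toy` data and the `missing_share` name are assumptions, not part of the original dataset); it ranks columns by the fraction of missing entries, which helps when deciding whether to fill or drop a column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# isnull() gives a boolean frame; the column mean is the missing fraction
missing_share = toy.isnull().mean().sort_values(ascending=False)

print(missing_share)
```

Here 'Cabin' is missing in three of four rows, so it sorts to the top.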

Step 3: Handle Missing Values

# Fill missing values in the 'Age' column with the median age

df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in the 'Embarked' column with the most frequent value (mode)

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop the 'Cabin' column as it has too many missing values

df.drop('Cabin', axis=1, inplace=True)
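
The effect of this step can be checked on a small example. The sketch below uses a hypothetical toy frame (the values are assumptions for illustration only) and assigns the filled column back to the frame, since calling `fillna(..., inplace=True)` on a selected column may not update the original frame in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

age_median = toy["Age"].median()           # median of 22, 26, 35
embarked_mode = toy["Embarked"].mode()[0]  # most frequent port

# Assign the filled columns back instead of filling in place
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)

# Drop the mostly empty column
toy = toy.drop(columns="Cabin")

print(toy)
print(toy.isnull().sum())
```

After these steps the toy frame has no missing values left.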

Step 4: Convert Categorical Variables


# Convert the 'Sex' column to numerical values: 0 for 'male' and 1 for 'female'

df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Create dummy variables for the 'Embarked' column

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
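
To see what the two conversions produce, here is a minimal sketch on a hypothetical three-row frame (the data is an assumption for illustration). With `drop_first=True`, the first category in sorted order is dropped, so a row with zeros in all remaining dummy columns stands for that dropped category.

```python
import pandas as pd

# Hypothetical toy frame with one row per port of embarkation
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary encoding: 0 for 'male', 1 for 'female'
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encoding; 'C' sorts first and is dropped as the baseline
toy = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(toy.columns.tolist())  # ['Sex', 'Embarked_Q', 'Embarked_S']
```

Note that `map` yields NaN for any value missing from the dictionary, so it is worth checking the column for missing values afterwards.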

Step 5: Feature Engineering

# Create a new column 'FamilySize' by adding 'SibSp' and 'Parch' columns

df['FamilySize'] = df['SibSp'] + df['Parch']
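
As a quick illustration, the sketch below applies the same sum to a hypothetical toy frame (the values are assumptions). 'SibSp' counts siblings and spouses aboard and 'Parch' counts parents and children aboard, so the sum is the number of relatives travelling with the passenger, not counting the passenger.

```python
import pandas as pd

# Hypothetical toy frame: three passengers
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Element-wise sum of the two relative counts
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy["FamilySize"].tolist())  # [1, 0, 5]
```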

Step 6: Drop Unnecessary Columns

# Drop the 'Name', 'Ticket', and 'SibSp' columns as they are not needed for analysis

df.drop(['Name', 'Ticket', 'SibSp'], axis=1, inplace=True)

Step 7: Save the Cleaned Data

# Save the cleaned dataset to a new CSV file

df.to_csv('cleaned_titanic.csv', index=False)
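
One way to confirm that the file was written as expected is to read it back and compare it with the frame in memory. The sketch below is a minimal round-trip check on a hypothetical toy frame; the file name `cleaned_toy.csv` and the temporary directory are assumptions chosen so the example does not touch real files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset
toy = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_toy.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns and no missing values
same_columns = list(reloaded.columns) == list(toy.columns)
no_missing = int(reloaded.isnull().sum().sum()) == 0

print(same_columns, no_missing)
```

Without `index=False`, the reloaded file would carry an extra unnamed index column.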

Complete Script in a Jupyter Notebook

import pandas as pd

# Load the dataset

url = '[Link]'

df = pd.read_csv(url)

# Display the first few rows of the dataset


print("First few rows of the dataset:")

print(df.head())

# Display the summary statistics of the dataset

print("\nSummary statistics of the dataset:")

print(df.describe())

# Check for missing values

print("\nMissing values in the dataset:")

print(df.isnull().sum())

# Fill missing values in the 'Age' column with the median age

df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in the 'Embarked' column with the most frequent value (mode)

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop the 'Cabin' column as it has too many missing values

df.drop('Cabin', axis=1, inplace=True)

# Convert the 'Sex' column to numerical values: 0 for 'male' and 1 for 'female'

df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# Create dummy variables for the 'Embarked' column

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

# Create a new column 'FamilySize' by adding 'SibSp' and 'Parch' columns

df['FamilySize'] = df['SibSp'] + df['Parch']

# Drop the 'Name', 'Ticket', and 'SibSp' columns as they are not needed for analysis

df.drop(['Name', 'Ticket', 'SibSp'], axis=1, inplace=True)


# Save the cleaned dataset to a new CSV file

df.to_csv('cleaned_titanic.csv', index=False)

print("\nData cleaning and preprocessing completed. Cleaned data saved to 'cleaned_titanic.csv'.")
