Data Cleaning and Preprocessing with Pandas
Step 1: Load the Dataset
import pandas as pd
# Load the dataset
url = '[Link]'
df = pd.read_csv(url)
# Display the first few rows of the dataset
df.head()
Step 2: Inspect the Data
# Display the summary statistics of the dataset
df.describe()
# Check for missing values
df.isnull().sum()
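The missing-value check above can be tried on a small frame first. The sketch below uses a hypothetical toy DataFrame (the values are made up for illustration and are not taken from the real dataset) and also computes the share of missing values per column, which may help when deciding whether a column is worth keeping.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Fraction of missing values in each column (mean of the boolean mask)
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

In this toy frame, 'Cabin' is missing in three of four rows, the kind of column that is usually dropped rather than filled.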
Step 3: Handle Missing Values
# Fill missing values in the 'Age' column with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing values in the 'Embarked' column with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop the 'Cabin' column as it has too many missing values
df.drop('Cabin', axis=1, inplace=True)
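To see what the three operations in this step do, here is a minimal sketch on a hypothetical toy frame (made-up values, same column names as above). It assigns the result of fillna back to the column instead of calling it with inplace=True on a selected column, since the latter triggers a chained-assignment warning in recent pandas versions and may leave the DataFrame unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names match the tutorial
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 38.0],
    "Embarked": ["S", "C", np.nan, "S"],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Median of the non-missing ages (22, 26, 38) is 26
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# mode() returns a Series of the most frequent values; [0] takes the first
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Drop the mostly empty column
toy = toy.drop(columns=["Cabin"])

print(toy)
```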
Step 4: Convert Categorical Variables
# Convert the 'Sex' column to numerical values: 0 for 'male' and 1 for 'female'
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Create dummy variables for the 'Embarked' column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
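The effect of map and get_dummies with drop_first=True is easier to see on a few rows. The sketch below uses a hypothetical toy frame with one passenger per port. With drop_first=True the first category in sorted order ('C') gets no column of its own and is represented by both remaining dummy columns being false.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the tutorial
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Values not present in the mapping would become NaN
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# One dummy column per category, minus the first one ('C')
toy = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(list(toy.columns))
print(toy)
```

Depending on the pandas version, the dummy columns hold booleans or 0/1 integers; calling astype(int) on them gives 0/1 in either case.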
Step 5: Feature Engineering
# Create a new column 'FamilySize' by adding 'SibSp' and 'Parch' columns
df['FamilySize'] = df['SibSp'] + df['Parch']
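As a quick check of the new feature, the sketch below builds 'FamilySize' on a hypothetical toy frame. Note that, as defined in this step, the value counts relatives aboard and does not include the passenger, so a passenger travelling alone gets 0.

```python
import pandas as pd

# Hypothetical toy data; only the column names match the tutorial
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],   # siblings/spouses aboard
    "Parch": [0, 0, 2],   # parents/children aboard
})

# Element-wise sum of the two columns
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

print(toy)
```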
Step 6: Drop Unnecessary Columns
# Drop the 'Name', 'Ticket', and 'SibSp' columns as they are not needed for analysis
df.drop(['Name', 'Ticket', 'SibSp'], axis=1, inplace=True)
Step 7: Save the Cleaned Data
# Save the cleaned dataset to a new CSV file
df.to_csv('cleaned_titanic.csv', index=False)
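One way to confirm that the save worked is to read the file back and compare it with the in-memory frame. The sketch below does this with a hypothetical toy frame and a temporary directory (the file name cleaned_toy.csv is made up for the example), so it leaves nothing behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({"Sex": [0, 1], "FamilySize": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")  # hypothetical file name
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded.equals(toy))
```

Without index=False, the reloaded frame would gain an extra unnamed column holding the old row index.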
Complete Script in a Jupyter Notebook
import pandas as pd
# Load the dataset
url = '[Link]'
df = pd.read_csv(url)
# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(df.head())
# Display the summary statistics of the dataset
print("\nSummary statistics of the dataset:")
print(df.describe())
# Check for missing values
print("\nMissing values in the dataset:")
print(df.isnull().sum())
# Fill missing values in the 'Age' column with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())
# Fill missing values in the 'Embarked' column with the most frequent value (mode)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop the 'Cabin' column as it has too many missing values
df.drop('Cabin', axis=1, inplace=True)
# Convert the 'Sex' column to numerical values: 0 for 'male' and 1 for 'female'
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Create dummy variables for the 'Embarked' column
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
# Create a new column 'FamilySize' by adding 'SibSp' and 'Parch' columns
df['FamilySize'] = df['SibSp'] + df['Parch']
# Drop the 'Name', 'Ticket', and 'SibSp' columns as they are not needed for analysis
df.drop(['Name', 'Ticket', 'SibSp'], axis=1, inplace=True)
# Save the cleaned dataset to a new CSV file
df.to_csv('cleaned_titanic.csv', index=False)
print("\nData cleaning and preprocessing completed. Cleaned data saved to 'cleaned_titanic.csv'.")