AML_LAB12.Ipynb - Colab
Saving titanic data.csv to titanic data.csv
import pandas as pd
titanic_df = pd.read_csv("titanic_data.csv")
#Data exploration: To view the first 5 rows of the DataFrame, you can use the head() function:
titanic_df.head()
   survived  pclass     sex   age  sibsp  parch    fare embarked  class    who  adult_male deck  embark_town alive  alone
2         1       3  female  26.0      0      0  7.9250        S  Third  woman       False  NaN  Southampton   yes   True
#To get an overview of the DataFrame, you can use the info() function:
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 889 non-null int64
1 pclass 889 non-null int64
2 sex 889 non-null object
3 age 713 non-null float64
4 sibsp 889 non-null int64
5 parch 889 non-null int64
6 fare 889 non-null float64
7 embarked 887 non-null object
8 class 889 non-null object
9 who 889 non-null object
10 adult_male 889 non-null bool
11 deck 203 non-null object
12 embark_town 887 non-null object
13 alive 889 non-null object
14 alone 889 non-null bool
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.2+ KB
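The info() output above shows gaps indirectly: "age" has 713 non-null values out of 889 rows (176 missing), "deck" has 203 (686 missing), and "embarked" has 887 (2 missing). A minimal sketch of how to count the gaps directly, using a small made-up frame (`toy_df` and its values are assumptions for illustration, not the lab's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_df (values are made up).
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# isna() marks every missing cell as True; sum() counts the True values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```

Running the same expression on titanic_df should give the per-column counts without subtracting by hand.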
#To get summary statistics of the numerical columns in the DataFrame, you can use the describe() function:
titanic_df.describe()
#Data selection: To select the "sex" and "alive" columns for the rows labelled 3 through 14, you can use the loc[] indexer (label-based, so both endpoints are included):
titanic_df.loc[3:14, ["sex", "alive"]]
https://colab.research.google.com/drive/1p_Pc43tysFabphWLUD-sZNUJ8UhRSgOJ?authuser=1#printMode=true 1/4
7/3/24, 9:36 AM AML_LAB12.ipynb - Colab
sex alive
3 female yes
4 male no
5 male no
6 male no
7 male no
8 female yes
9 female yes
10 female yes
11 female yes
12 male no
13 male no
14 female no
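The output above has 12 rows because loc[] slices by label and includes both endpoints (3 and 14). The sketch below contrasts that with position-based iloc[], which excludes the end position. The frame `toy_df` is a made-up stand-in, not the lab's data.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Label-based slice: rows labelled 1, 2 and 3 (end label included).
by_label = toy_df.loc[1:3, ["sex", "alive"]]

# Position-based slice: rows at positions 1 and 2 (end position excluded).
by_position = toy_df.iloc[1:3][["sex", "alive"]]

print(len(by_label), len(by_position))
```

With the default RangeIndex the labels and positions coincide, so the only visible difference is whether the last row is included.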
#To select rows where the "sex" is "female", you can use Boolean indexing:
titanic_females = titanic_df[titanic_df["sex"] == "female"]
print(titanic_females)
#Data filtering: To filter the DataFrame to only include people with an "age" value greater than 21, you can use Boolean indexing:
titanic_adults = titanic_df[titanic_df["age"] > 21]
print(titanic_adults)
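The two filters above can be combined into one. This is a minimal sketch on a made-up frame (`toy_df` is an assumption for illustration). It also shows that a row with a missing age never passes an age comparison, which matters here because the filter runs before the missing ages are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 19.0],
})

# Each condition needs its own parentheses; & means "and", | means "or".
adult_females = toy_df[(toy_df["sex"] == "female") & (toy_df["age"] > 21)]

# A comparison against NaN evaluates to False, so the female passenger
# with a missing age (row 2) is left out of the result.
print(adult_females)
```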
#Data aggregation: To get the average "age" value for each "sex" in the DataFrame, you can use the groupby() function:
age_by_sex = titanic_df.groupby("sex")["age"].mean()
print(age_by_sex)
sex
female 27.915709
male 30.728252
Name: age, dtype: float64
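groupby() can also return several statistics at once through agg(). The sketch below uses a made-up frame (`toy_df` is an assumption, not the lab's data). Note that mean(), median() and count() all skip missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, 26.0, np.nan, 35.0],
})

# One row per group, one column per statistic; NaN ages are ignored.
age_stats = toy_df.groupby("sex")["age"].agg(["mean", "median", "count"])
print(age_stats)
```

The "count" column shows how many non-missing ages each average is based on, which helps when judging how far to trust it.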
#Data cleaning: To drop the "embark_town" column from the DataFrame, you can use the drop() function:
titanic_df = titanic_df.drop("embark_town", axis=1)
#axis=1 (or axis='columns') refers to the columns, so drop() with axis=1 removes the named columns.
#axis=0 (or axis='index', the default) refers to the rows, so drop() with axis=0 removes rows by index label.
titanic_df.head()
survived pclass sex age sibsp parch fare embarked class who adult
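drop() also accepts the keywords columns= and index=, which avoid having to remember the axis numbers. A minimal sketch on a made-up frame (`toy_df` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy_df = pd.DataFrame({
    "survived": [0, 1, 1],
    "embarked": ["S", "C", "S"],
    "embark_town": ["Southampton", "Cherbourg", "Southampton"],
})

# Equivalent to toy_df.drop("embark_town", axis=1).
without_column = toy_df.drop(columns=["embark_town"])

# Equivalent to toy_df.drop(0, axis=0): removes the row labelled 0.
without_row = toy_df.drop(index=[0])

# drop() returns a new DataFrame; toy_df itself is unchanged.
print(without_column.columns.tolist(), without_row.index.tolist())
```

Because drop() returns a copy, the lab assigns the result back to titanic_df to keep the change.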
#To fill in missing values in the "age" column with the mean value, you can use the fillna() function:
mean_age = titanic_df["age"].mean()
titanic_df["age"] = titanic_df["age"].fillna(mean_age)
#Data transformation: To create a new column called "age_category" based on the value of the "age" column, you can use the apply() function:
def category(age):
    # Labels as used in this lab: under 18, 18 to 24, 25 and over.
    if age < 18:
        return "Child"
    elif age < 25:
        return "Teenager"
    else:
        return "Adult"
titanic_df["age_category"] = titanic_df["age"].apply(category)
print(titanic_df["age_category"])
titanic_df.info()
0 Teenager
1 Adult
2 Adult
3 Adult
4 Adult
...
884 Adult
885 Teenager
886 Adult
887 Adult
888 Adult
Name: age_category, Length: 889, dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 889 non-null int64
1 pclass 889 non-null int64
2 sex 889 non-null object
3 age 889 non-null float64
4 sibsp 889 non-null int64
5 parch 889 non-null int64
6 fare 889 non-null float64
7 embarked 887 non-null object
8 class 889 non-null object
9 who 889 non-null object
10 adult_male 889 non-null bool
11 deck 203 non-null object
12 alive 889 non-null object
13 alone 889 non-null bool
14 age_category 889 non-null object
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.2+ KB
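The same binning can be sketched with pd.cut(), which assigns labels from a list of bin edges instead of a Python function. This is an alternative to the apply() approach above, not what the lab used. The series `ages` is made up for illustration, and the lower edge of 0 is an assumption.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([5.0, 17.9, 18.0, 24.0, 25.0, 60.0], name="age")

# right=False makes each bin closed on the left: [0, 18), [18, 25), [25, inf),
# matching the "< 18" and "< 25" tests in category().
age_category = pd.cut(
    ages,
    bins=[0, 18, 25, float("inf")],
    labels=["Child", "Teenager", "Adult"],
    right=False,
)
print(age_category.tolist())
```

One difference to keep in mind: pd.cut() leaves a missing age as missing, whereas category() would label it "Adult", because both comparisons against NaN are False. In the lab the ages are filled before apply() is called, so the difference does not show up there.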
#Remove unwanted columns: We can remove the "sibsp" column (number of siblings/spouses aboard), which is not used in the rest of this lab.
titanic_df = titanic_df.drop("sibsp", axis=1)
titanic_df.head()
survived pclass sex age parch fare embarked class who adult_male
#Handle missing data: We can replace missing values with either the mean or the median value of the column.
# Replace missing age values with the median age (no effect here, because "age" was already filled with the mean above)
titanic_df["age"] = titanic_df["age"].fillna(titanic_df["age"].median())
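The choice between mean and median has to be made before the first fill. A minimal sketch on a made-up series (`ages` is an assumption for illustration), where one large value pulls the mean well above the median:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap and one large value (made up for illustration).
ages = pd.Series([2.0, 22.0, 24.0, np.nan, 80.0], name="age")

mean_filled = ages.fillna(ages.mean())      # mean of 2, 22, 24, 80 is 32.0
median_filled = ages.fillna(ages.median())  # median of 2, 22, 24, 80 is 23.0

# Once the gaps are filled, a second fillna() has nothing left to replace.
refilled = mean_filled.fillna(mean_filled.median())

print(mean_filled[3], median_filled[3], refilled.equals(mean_filled))
```

The median is less sensitive to extreme values, which is one common reason to prefer it for skewed columns.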
#Sorting: You can sort the DataFrame by one or more columns using the sort_values() function. For example, to sort the DataFrame by "fare" in
#descending order and then by "age" in ascending order, you would use:
sorted_titanic_df = titanic_df.sort_values(by=["fare", "age"], ascending=[False, True])
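To see the two-key ordering in action, here is a small sketch on a made-up frame (`toy_df` is an assumption for illustration). The second key only matters where the first key ties.

```python
import pandas as pd

# Hypothetical toy frame with a tie on "fare" (values are made up).
toy_df = pd.DataFrame({
    "fare": [7.25, 71.28, 71.28, 8.05],
    "age": [22.0, 38.0, 19.0, 35.0],
})

# Highest fare first; among equal fares, youngest first.
sorted_df = toy_df.sort_values(by=["fare", "age"], ascending=[False, True])

# The original index labels travel with their rows.
print(sorted_df.index.tolist())
```

sort_values() returns a new DataFrame and keeps the original index labels. If a fresh 0..n-1 index is wanted, reset_index(drop=True) can be chained onto the result.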
#Applying functions element-wise: You can apply functions to each element in a column using the apply() function.
#For example, to convert the "fare" column from US dollars to euros using an exchange rate of 0.84, you would use:
def usd_to_eur(fare):
    return fare * 0.84
titanic_df["fare_eur"] = titanic_df["fare"].apply(usd_to_eur)
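For simple arithmetic like this, multiplying the whole column at once gives the same values as apply() and is usually faster on large frames. A minimal sketch with made-up fares (`fares` and `EXCHANGE_RATE` are names introduced here for illustration):

```python
import pandas as pd

EXCHANGE_RATE = 0.84  # the rate used in the lab's example

# Hypothetical fares (values are made up for illustration).
fares = pd.Series([7.25, 71.28, 8.05], name="fare")

# Element-wise through a Python function, as in the lab.
via_apply = fares.apply(lambda fare: fare * EXCHANGE_RATE)

# Vectorized: pandas multiplies every element in one operation.
vectorized = fares * EXCHANGE_RATE

print(((via_apply - vectorized).abs() < 1e-12).all())
```

apply() remains the more flexible option when the per-element logic involves branching, as in the age_category example.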