
AML_LAB12.Ipynb - Colab

The document is a Jupyter notebook that demonstrates data manipulation techniques using the Titanic dataset in Python with Pandas. It includes data exploration, selection, filtering, aggregation, cleaning, and transformation methods. Key operations include reading the dataset, displaying statistics, filtering by age and sex, and creating new columns based on existing data.

Uploaded by

Aastha Mehta

7/3/24, 9:36 AM AML_LAB12.ipynb - Colab

from google.colab import files


uploaded = files.upload()

Saving titanic_data.csv to titanic_data.csv

import pandas as pd
titanic_df = pd.read_csv("titanic_data.csv")

#Data exploration: To view the first 5 rows of the DataFrame, you can use the head() function:
titanic_df.head()

survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone

0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False

1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False

2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True

3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False

4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True

#To get an overview of the DataFrame, you can use the info() function:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 889 non-null int64
1 pclass 889 non-null int64
2 sex 889 non-null object
3 age 713 non-null float64
4 sibsp 889 non-null int64
5 parch 889 non-null int64
6 fare 889 non-null float64
7 embarked 887 non-null object
8 class 889 non-null object
9 who 889 non-null object
10 adult_male 889 non-null bool
11 deck 203 non-null object
12 embark_town 887 non-null object
13 alive 889 non-null object
14 alone 889 non-null bool
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.2+ KB

#To get summary statistics of the numerical columns in the DataFrame, you can use the describe() function:
titanic_df.describe()

survived pclass age sibsp parch fare

count 889.000000 889.000000 713.000000 889.000000 889.000000 889.000000

mean 0.384702 2.307087 29.698696 0.523060 0.382452 32.259059

std 0.486799 0.836367 14.536691 1.103729 0.806761 49.735870

min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000

25% 0.000000 2.000000 20.000000 0.000000 0.000000 7.925000

50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200

75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000

max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

#Data selection: To select the "sex" and "alive" columns for the rows labelled 3 through 14 (both endpoints included), you can use the loc[] indexer:
titanic_df.loc[3:14, ["sex", "alive"]]


sex alive

3 female yes

4 male no

5 male no

6 male no

7 male no

8 female yes

9 female yes

10 female yes

11 female yes

12 male no

13 male no

14 female no
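The output above has 12 rows (labels 3 to 14) because loc[] slices by label and includes the stop label. A minimal sketch of how that differs from position-based iloc[], using a small made-up frame (`df` and its values are illustrative, not the Titanic data):

```python
import pandas as pd

# Hypothetical miniature frame standing in for titanic_df.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male"],
    "alive": ["no", "yes", "yes", "no", "yes"],
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
})

# .loc slices by label and includes BOTH endpoints: rows 1, 2 and 3.
by_label = df.loc[1:3, ["sex", "alive"]]

# .iloc slices by position and excludes the stop: rows 1 and 2 only.
by_position = df.iloc[1:3, [0, 1]]

print(len(by_label), len(by_position))
```

With the default RangeIndex the labels and positions coincide, so the only visible difference is whether the last row is included.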

#To select rows where the "sex" is "female", you can use Boolean indexing:
titanic_females = titanic_df[titanic_df["sex"] == "female"]
print(titanic_females)

survived pclass sex age sibsp parch fare embarked class \


1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
8 1 3 female 27.0 0 2 11.1333 S Third
9 1 2 female 14.0 1 0 30.0708 C Second
.. ... ... ... ... ... ... ... ... ...
878 1 2 female 25.0 0 1 26.0000 S Second
880 0 3 female 22.0 0 0 10.5167 S Third
883 0 3 female 39.0 0 5 29.1250 Q Third
885 1 1 female 19.0 0 0 30.0000 S First
886 0 3 female NaN 1 2 23.4500 S Third

who adult_male deck embark_town alive alone


1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
8 woman False NaN Southampton yes False
9 child False NaN Cherbourg yes False
.. ... ... ... ... ... ...
878 woman False NaN Southampton yes False
880 woman False NaN Southampton no True
883 woman False NaN Queenstown no False
885 woman False B Southampton yes True
886 woman False NaN Southampton no False

[314 rows x 15 columns]

#Data filtering: To keep only people whose "age" value is greater than 21, you can use Boolean indexing (rows with a missing age are excluded, because NaN > 21 is False):
titanic_adults = titanic_df[titanic_df["age"] > 21]
print(titanic_adults)

survived pclass sex age sibsp parch fare embarked class \


0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
.. ... ... ... ... ... ... ... ... ...
882 0 3 male 25.0 0 0 7.0500 S Third
883 0 3 female 39.0 0 5 29.1250 Q Third
884 0 2 male 27.0 0 0 13.0000 S Second
887 1 1 male 26.0 0 0 30.0000 C First
888 0 3 male 32.0 0 0 7.7500 Q Third

who adult_male deck embark_town alive alone


0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
.. ... ... ... ... ... ...
882 man True NaN Southampton no True
883 woman False NaN Queenstown no False
884 man True NaN Southampton no True
887 man True C Cherbourg yes True
888 man True NaN Queenstown no True

[509 rows x 15 columns]
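Boolean masks can also be combined. The sketch below, on a small made-up frame (`df` and its values are illustrative), shows two conditions joined with `&` and how a missing age falls out of the result:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; one age is missing on purpose.
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, np.nan, 19.0],
})

# Each condition needs its own parentheses; & is the element-wise "and".
adult_women = df[(df["sex"] == "female") & (df["age"] > 21)]

# NaN > 21 evaluates to False, so the row with the missing age is dropped.
print(adult_women)
```

Use `|` for "or" and `~` for "not"; the Python keywords `and`, `or` and `not` do not work element-wise on a Series.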


#Data aggregation: To get the average "age" value for each "sex" in the DataFrame, you can use the groupby() function:
age_by_sex = titanic_df.groupby("sex")["age"].mean()
print(age_by_sex)

sex
female 27.915709
male 30.728252
Name: age, dtype: float64

#Data cleaning: To drop the "embark_town" column from the DataFrame, you can use the drop() function:
titanic_df = titanic_df.drop("embark_town", axis=1)

#axis=1 (or axis='columns') refers to the columns, so drop() with axis=1 removes the named column(s).
#axis=0 (or axis='index'), the default, refers to the rows, so drop() with axis=0 removes the rows with the given index labels.

titanic_df.head()

survived pclass sex age sibsp parch fare embarked class who adult

0 0 3 male 22.0 1 0 7.2500 S Third man

1 1 1 female 38.0 1 0 71.2833 C First woman

2 1 3 female 26.0 0 0 7.9250 S Third woman

3 1 1 female 35.0 1 0 53.1000 S First woman

4 0 3 male 35.0 0 0 8.0500 S Third man
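The effect of the axis argument to drop() can be checked on a tiny frame. This is a sketch with made-up column names (`a`, `b`), not part of the lab:

```python
import pandas as pd

# Hypothetical two-column frame with the default index labels 0, 1, 2.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

no_b = df.drop("b", axis=1)      # removes the column labelled "b"
no_b_kw = df.drop(columns="b")   # same result, spelled with a keyword
no_row0 = df.drop(0, axis=0)     # removes the row whose index label is 0

print(no_b.columns.tolist(), no_row0.index.tolist())
```

drop() returns a new DataFrame and leaves `df` unchanged, which is why the notebook assigns the result back to `titanic_df`.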

#To fill in missing values in the "age" column with the mean value, you can use the fillna() function:
mean_age = titanic_df["age"].mean()
titanic_df["age"] = titanic_df["age"].fillna(mean_age)
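Mean and median imputation can give noticeably different fill values when a column is skewed. A minimal sketch on made-up ages (`ages` is illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one large outlier.
ages = pd.Series([20.0, np.nan, 30.0, 70.0])

# mean() and median() both skip NaN by default.
mean_filled = ages.fillna(ages.mean())      # fill value: (20 + 30 + 70) / 3 = 40
median_filled = ages.fillna(ages.median())  # fill value: middle of 20, 30, 70 = 30

print(mean_filled.tolist(), median_filled.tolist())
```

Once a column has been filled, a later fillna() on it finds nothing to replace, so the choice between mean and median has to be made at the first fill.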

#Data transformation: To create a new column called "age_category" based on the value of the "age" column, you can use the apply() function:
def category(age):
    if age < 18:
        return "Child"
    elif age < 25:
        return "Teenager"
    else:
        return "Adult"

titanic_df["age_category"] = titanic_df["age"].apply(category)
print(titanic_df["age_category"])
titanic_df.info()

0 Teenager
1 Adult
2 Adult
3 Adult
4 Adult
...
884 Adult
885 Teenager
886 Adult
887 Adult
888 Adult
Name: age_category, Length: 889, dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 889 non-null int64
1 pclass 889 non-null int64
2 sex 889 non-null object
3 age 889 non-null float64
4 sibsp 889 non-null int64
5 parch 889 non-null int64
6 fare 889 non-null float64
7 embarked 887 non-null object
8 class 889 non-null object
9 who 889 non-null object
10 adult_male 889 non-null bool
11 deck 203 non-null object
12 alive 889 non-null object
13 alone 889 non-null bool
14 age_category 889 non-null object
dtypes: bool(2), float64(2), int64(4), object(7)
memory usage: 92.2+ KB
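The same binning can be expressed without a Python-level function by using pd.cut. This is an alternative sketch, not the notebook's method; the bin edges mirror the thresholds in category(), and `ages` is made-up data:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the thresholds 18 and 25.
ages = pd.Series([5.0, 17.9, 18.0, 24.0, 25.0, 60.0])

# right=False makes each bin closed on the left: [0, 18), [18, 25), [25, inf)
age_category = pd.cut(
    ages,
    bins=[0, 18, 25, float("inf")],
    labels=["Child", "Teenager", "Adult"],
    right=False,
)

print(age_category.tolist())
```

Two differences from the apply() version are worth keeping in mind: the result has a categorical dtype rather than object, and a missing age stays missing, whereas category() would label it "Adult" because both comparisons with NaN are False.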


#Remove unwanted columns: We can remove the "sibsp" column if it is not needed for the analysis.
titanic_df = titanic_df.drop("sibsp", axis=1)

titanic_df.head()

survived pclass sex age parch fare embarked class who adult_male

0 0 3 male 22.0 0 7.2500 S Third man True

1 1 1 female 38.0 0 71.2833 C First woman False

2 1 3 female 26.0 0 7.9250 S Third woman False

3 1 1 female 35.0 0 53.1000 S First woman False

4 0 3 male 35.0 0 8.0500 S Third man True

#Handle missing data: We can replace missing values with either the mean or the median value of the column.
# Replace missing age values with the median age (a no-op here, because "age" was already filled with the mean above)
titanic_df["age"] = titanic_df["age"].fillna(titanic_df["age"].median())

# Replace missing sex values with "Unknown" (also a no-op here: info() shows "sex" has no missing values)
titanic_df["sex"] = titanic_df["sex"].fillna("Unknown")

#Handle duplicates: We can remove any duplicate rows in the dataset.
titanic_df = titanic_df.drop_duplicates()
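Before dropping duplicates it can help to count them first. A minimal sketch on a made-up frame (`df` and its values are illustrative):

```python
import pandas as pd

# Hypothetical frame in which rows 0 and 1 are identical in every column.
df = pd.DataFrame({
    "sex": ["male", "male", "female"],
    "age": [22.0, 22.0, 22.0],
    "fare": [7.25, 7.25, 7.25],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence and preserves the old index labels.
deduped = df.drop_duplicates()

# reset_index(drop=True) renumbers the rows 0..n-1 afterwards.
deduped_reset = deduped.reset_index(drop=True)

print(n_dupes, deduped.index.tolist(), deduped_reset.index.tolist())
```

This dataset has no passenger identifier column, so two rows that match in every column may still be two different people; whether dropping them is appropriate depends on the analysis.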

#Sorting: You can sort the DataFrame by one or more columns using the sort_values() function. For example, to sort the DataFrame by "fare" in
#descending order and then by "age" in ascending order, you would use:
sorted_titanic_df = titanic_df.sort_values(by=["fare", "age"], ascending=[False, True])
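How the two sort keys interact is easiest to see on a few rows. A sketch with made-up values (`df` is illustrative):

```python
import pandas as pd

# Hypothetical frame with tied fares so the second key matters.
df = pd.DataFrame({
    "fare": [10.0, 50.0, 50.0, 10.0],
    "age": [40.0, 30.0, 20.0, 25.0],
})

# Highest fare first; within equal fares, youngest first.
sorted_df = df.sort_values(by=["fare", "age"], ascending=[False, True])

# The original index labels travel with the rows.
print(sorted_df.index.tolist())
```

The `ascending` list is matched to the `by` list position by position, so the two lists need the same length.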

#Aggregating with multiple functions: You can apply multiple aggregation functions to the groups created by the groupby() function using the agg() function.
#For example, to get the count, mean, and max "fare" value for each "sex" in the DataFrame, you would use:
fare_agg = titanic_df.groupby("sex")["fare"].agg(["count", "mean", "max"])
print(fare_agg)
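The printout of fare_agg is not included in the source. A sketch on made-up fares shows the shape of the result, along with pandas' named-aggregation form, which lets you choose the output column names (`n`, `mean_fare` and `max_fare` are names picked for this example):

```python
import pandas as pd

# Hypothetical fares, two passengers per sex.
df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

# List form: one output column per function, named after the function.
fare_agg = df.groupby("sex")["fare"].agg(["count", "mean", "max"])

# Named aggregation: output_name=(input_column, function).
named = df.groupby("sex").agg(
    n=("fare", "count"),
    mean_fare=("fare", "mean"),
    max_fare=("fare", "max"),
)

print(fare_agg)
print(named)
```

Note that "count" counts non-missing values only, so it can be smaller than the group size when the column has gaps.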

#Applying functions element-wise: You can apply functions to each element in a column using the apply() function.
#For example, to convert the "fare" column from US dollars to euros using an exchange rate of 0.84, you would use:
def usd_to_eur(fare):
    return fare * 0.84

titanic_df["fare_eur"] = titanic_df["fare"].apply(usd_to_eur)
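For simple arithmetic like this, multiplying the whole column at once gives the same values as apply() and is usually faster on large frames. A minimal sketch on a few fares, reusing the 0.84 rate from the example above (`fares` is illustrative):

```python
import pandas as pd

# A few illustrative fares.
fares = pd.Series([7.25, 71.2833, 8.05])

# Element-wise via apply(): the function is called once per value.
via_apply = fares.apply(lambda fare: fare * 0.84)

# Vectorized: one operation over the whole Series.
vectorized = fares * 0.84

print((via_apply - vectorized).abs().max())
```

apply() remains the right tool when the per-element logic cannot be written as column arithmetic.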
