

Pandas DataFrame
Introduction
A DataFrame in Pandas is a 2-dimensional labeled data structure, similar to a table in a
database, an Excel spreadsheet, or a data frame in R. It is one of the primary data
structures in Pandas, and it allows for easy data analysis and manipulation.

Key Characteristics
Labeled Rows and Columns: Each row and column has labels for easier access.
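
A minimal sketch of what that label-based access looks like (not part of the original notebook):

# labels let you pull values by name instead of by position
import pandas as pd
df = pd.DataFrame({'marks': [80, 70]}, index=['Gourab', 'Saurabh'])
df.loc['Gourab', 'marks']   # 80 -- row label 'Gourab', column label 'marks'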

In [1]: import numpy as np


import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [2]: # Using lists

students_data = [
[100,80,10],
[90,70,7],
[120,100,14],
[80,50,2]
]

pd.DataFrame(students_data,columns = ['iq','marks','package'])

Out[2]: iq marks package

0 100 80 10

1 90 70 7

2 120 100 14

3 80 50 2

Using dicts

In [3]: # Using dicts -- Create a DataFrame using a dictionary

students_dict = {
'name':['Gourab','saurabh','suman','pranav','sanoj','hero'],
'iq':[100,90,120,80,0,0],
'marks':[80,70,100,50,0,0],
'package':[10,7,14,2,0,0]
}


students = pd.DataFrame(students_dict)
students.set_index('name',inplace = True)
students

Out[3]: iq marks package

name

Gourab 100 80 10

saurabh 90 70 7

suman 120 100 14

pranav 80 50 2

sanoj 0 0 0

hero 0 0 0

using read_csv

In [4]: # using read_csv -- Create DataFrame using CSV file

movies = pd.read_csv("movies.csv")
movies

Out[4]: [truncated table: preview of the movies DataFrame; columns shown: title_x, imdb_id, poster_path, ...]

1629 rows × 18 columns

In [5]: ipl = pd.read_csv("ipl-matches.csv")


ipl

Out[5]: [truncated table: preview of the ipl DataFrame; columns shown: ID, City, Date, Season, MatchNumber, Team1, Team2, ...]

950 rows × 20 columns


DataFrame Attributes and Methods

shape

In [6]: # shape -- Shape of the dataset (No. of Rows, No. of Columns)

movies.shape

Out[6]: (1629, 18)

In [7]: ipl.shape

Out[7]: (950, 20)

dtypes

In [8]: # dtypes -- Data type of every column

movies.dtypes

Out[8]: title_x object


imdb_id object
poster_path object
wiki_link object
title_y object
original_title object
is_adult int64
year_of_release int64
runtime object
genres object
imdb_rating float64
imdb_votes int64
story object
summary object
tagline object
actors object
wins_nominations object
release_date object
dtype: object

In [9]: ipl.dtypes


Out[9]: ID int64
City object
Date object
Season object
MatchNumber object
Team1 object
Team2 object
Venue object
TossWinner object
TossDecision object
SuperOver object
WinningTeam object
WonBy object
Margin float64
method object
Player_of_Match object
Team1Players object
Team2Players object
Umpire1 object
Umpire2 object
dtype: object

Index

In [10]: # index -- the row labels of the DataFrame

movies.index

Out[10]: RangeIndex(start=0, stop=1629, step=1)

In [11]: ipl.index

Out[11]: RangeIndex(start=0, stop=950, step=1)

columns

In [12]: # columns

movies.columns

Out[12]: Index(['title_x', 'imdb_id', 'poster_path', 'wiki_link', 'title_y',


'original_title', 'is_adult', 'year_of_release', 'runtime', 'genres',
'imdb_rating', 'imdb_votes', 'story', 'summary', 'tagline', 'actors',
'wins_nominations', 'release_date'],
dtype='object')

In [13]: ipl.columns

Out[13]: Index(['ID', 'City', 'Date', 'Season', 'MatchNumber', 'Team1', 'Team2',


'Venue', 'TossWinner', 'TossDecision', 'SuperOver', 'WinningTeam',
'WonBy', 'Margin', 'method', 'Player_of_Match', 'Team1Players',
'Team2Players', 'Umpire1', 'Umpire2'],
dtype='object')

values


In [14]: # values

movies.values

Out[14]: array([['Uri: The Surgical Strike', 'tt8291224',


'https://upload.wikimedia.org/wikipedia/en/thumb/3/3b/URI_-_New_poster.
jpg/220px-URI_-_New_poster.jpg',
...,
'Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Gautam|Kirti Kulhari|Rajit
Kapoor|Ivan Rodrigues|Manasi Parekh|Swaroop Sampat|Riva Arora|Yogesh Soman|Fare
ed Ahmed|Akashdeep Arora|Kallol Banerjee|',
'4 wins', '11 January 2019 (USA)'],
['Battalion 609', 'tt9472208', nan, ...,
'Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elena Kazan|Vishwas Kini|Maj
or Kishore|Jashn Kohli|Rammy C. Pandey|Manish Sharma|Sparsh Sharma|Farnaz Shett
y|Vikas Shrivastav|Chandraprakash Thakur|Brajesh Tiwari|',
nan, '11 January 2019 (India)'],
['The Accidental Prime Minister (film)', 'tt6986710',
'https://upload.wikimedia.org/wikipedia/en/thumb/a/a1/The_Accidental_Pr
ime_Minister_film.jpg/220px-The_Accidental_Prime_Minister_film.jpg',
...,
'Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul Sharma|Manoj Anand|Arjun
Mathur|Suzanne Bernert|Abdul Quadir Amin|Bharat Mistri|Divya Seth|Anil Rastogi|
Ramesh Bhatkar|Parrgash Kaur|Jess Kaur|',
nan, '11 January 2019 (USA)'],
...,
['Sabse Bada Sukh', 'tt0069204', nan, ...,
'Vijay Arora|Asrani|Rajni Bala|Kumud Damle|Utpal Dutt|Meeta Faiyyaz|Rab
i Ghosh|Tarun Ghosh|Sanjeev Kumar|Keshto Mukherjee|Meena Rai|',
nan, nan],
['Daaka', 'tt10833860',
'https://upload.wikimedia.org/wikipedia/en/thumb/4/45/Daaka.jpg/220px-D
aaka.jpg',
..., 'Gippy Grewal|Zareen Khan|', nan, '1 November 2019 (USA)'],
['Humsafar', 'tt2403201',
'https://upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.sv
g/23px-Flag_of_India.svg.png',
..., 'Fawad Khan|', nan, 'TV Series (2011–2012)']], dtype=object)

In [15]: ipl.values


Out[15]: array([[1312200, 'Ahmedabad', '2022-05-29', ...,


"['WP Saha', 'Shubman Gill', 'MS Wade', 'HH Pandya', 'DA Miller', 'R Te
watia', 'Rashid Khan', 'R Sai Kishore', 'LH Ferguson', 'Yash Dayal', 'Mohammed
Shami']",
'CB Gaffaney', 'Nitin Menon'],
[1312199, 'Ahmedabad', '2022-05-27', ...,
"['YBK Jaiswal', 'JC Buttler', 'SV Samson', 'D Padikkal', 'SO Hetmyer',
'R Parag', 'R Ashwin', 'TA Boult', 'YS Chahal', 'M Prasidh Krishna', 'OC McCo
y']",
'CB Gaffaney', 'Nitin Menon'],
[1312198, 'Kolkata', '2022-05-25', ...,
"['Q de Kock', 'KL Rahul', 'M Vohra', 'DJ Hooda', 'MP Stoinis', 'E Lewi
s', 'KH Pandya', 'PVD Chameera', 'Mohsin Khan', 'Avesh Khan', 'Ravi Bishnoi']",
'J Madanagopal', 'MA Gough'],
...,
[335984, 'Delhi', '2008-04-19', ...,
"['T Kohli', 'YK Pathan', 'SR Watson', 'M Kaif', 'DS Lehmann', 'RA Jade
ja', 'M Rawat', 'D Salunkhe', 'SK Warne', 'SK Trivedi', 'MM Patel']",
'Aleem Dar', 'GA Pratapkumar'],
[335983, 'Chandigarh', '2008-04-19', ...,
"['PA Patel', 'ML Hayden', 'MEK Hussey', 'MS Dhoni', 'SK Raina', 'JDP O
ram', 'S Badrinath', 'Joginder Sharma', 'P Amarnath', 'MS Gony', 'M Muralithara
n']",
'MR Benson', 'SL Shastri'],
[335982, 'Bangalore', '2008-04-18', ...,
"['SC Ganguly', 'BB McCullum', 'RT Ponting', 'DJ Hussey', 'Mohammad Haf
eez', 'LR Shukla', 'WP Saha', 'AB Agarkar', 'AB Dinda', 'M Kartik', 'I Sharm
a']",
'Asad Rauf', 'RE Koertzen']], dtype=object)

In [16]: # head & tail

movies.head()

Out[16]: [truncated table: first 5 rows of movies (movies.head()); columns shown: title_x, imdb_id, poster_path, ...]

In [17]: movies.tail()

Out[17]: [truncated table: last 5 rows of movies (movies.tail()); columns shown: title_x, imdb_id, poster_path, ...]

In [18]: ipl.head()

Out[18]: [truncated table: first 5 rows of ipl (ipl.head()); columns shown: ID, City, Date, Season, MatchNumber, Team1, Team2, Venue, ...]

In [19]: ipl.tail()

Out[19]: [truncated table: last 5 rows of ipl (ipl.tail()); columns shown: ID, City, Date, Season, MatchNumber, Team1, Team2, ...]

In [20]: # sample

movies.sample(5)

Out[20]: [truncated table: 5 random rows of movies (movies.sample(5)); columns shown: title_x, imdb_id, poster_path, ...]

In [21]: ipl.sample(5)

Out[21]: [truncated table: 5 random rows of ipl (ipl.sample(5)); columns shown: ID, City, Date, Season, MatchNumber, Team1, Team2, Venue, ...]

In [22]: # info

movies.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1629 entries, 0 to 1628
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title_x 1629 non-null object
1 imdb_id 1629 non-null object
2 poster_path 1526 non-null object
3 wiki_link 1629 non-null object
4 title_y 1629 non-null object
5 original_title 1629 non-null object
6 is_adult 1629 non-null int64
7 year_of_release 1629 non-null int64
8 runtime 1629 non-null object
9 genres 1629 non-null object
10 imdb_rating 1629 non-null float64
11 imdb_votes 1629 non-null int64
12 story 1609 non-null object
13 summary 1629 non-null object
14 tagline 557 non-null object
15 actors 1624 non-null object
16 wins_nominations 707 non-null object
17 release_date 1522 non-null object
dtypes: float64(1), int64(3), object(14)
memory usage: 229.2+ KB

In [23]: # describe

movies.describe()

Out[23]: is_adult year_of_release imdb_rating imdb_votes

count 1629.0 1629.000000 1629.000000 1629.000000

mean 0.0 2010.263966 5.557459 5384.263352

std 0.0 5.381542 1.567609 14552.103231

min 0.0 2001.000000 0.000000 0.000000

25% 0.0 2005.000000 4.400000 233.000000

50% 0.0 2011.000000 5.600000 1000.000000

75% 0.0 2015.000000 6.800000 4287.000000

max 0.0 2019.000000 9.400000 310481.000000

In [24]: # isnull

movies.isnull().sum()


Out[24]: title_x 0
imdb_id 0
poster_path 103
wiki_link 0
title_y 0
original_title 0
is_adult 0
year_of_release 0
runtime 0
genres 0
imdb_rating 0
imdb_votes 0
story 20
summary 0
tagline 1072
actors 5
wins_nominations 922
release_date 107
dtype: int64

In [25]: # duplicated

movies.duplicated().sum()

Out[25]: 0

In [26]: # rename

students

Out[26]: iq marks package

name

Gourab 100 80 10

saurabh 90 70 7

suman 120 100 14

pranav 80 50 2

sanoj 0 0 0

hero 0 0 0

In [27]: students.rename(columns={'marks':'percent','package':'lpa'},inplace=True) # inplace = True makes the change permanent

In [28]: students


Out[28]: iq percent lpa

name

Gourab 100 80 10

saurabh 90 70 7

suman 120 100 14

pranav 80 50 2

sanoj 0 0 0

hero 0 0 0

Maths Methods
In [29]: # sum -> axis argument

students.sum()

Out[29]: iq 390
percent 300
lpa 33
dtype: int64

In [30]: # sum -> axis argument

students.sum(axis = 1)

Out[30]: name
Gourab 190
saurabh 167
suman 234
pranav 132
sanoj 0
hero 0
dtype: int64
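
A quick recap of the axis argument (a sketch, not an original cell): axis=0, the default, aggregates down each column, while axis=1 aggregates across each row.

students.sum(axis=0)   # one value per column: iq, percent, lpa
students.sum(axis=1)   # one value per student (row)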

In [31]: # mean
students.mean()

Out[31]: iq 65.0
percent 50.0
lpa 5.5
dtype: float64

In [32]: students.mean(axis = 1)

Out[32]: name
Gourab 63.333333
saurabh 55.666667
suman 78.000000
pranav 44.000000
sanoj 0.000000
hero 0.000000
dtype: float64


In [33]: # median

students.median()

Out[33]: iq 85.0
percent 60.0
lpa 4.5
dtype: float64

In [34]: students.median(axis = 1)

Out[34]: name
Gourab 80.0
saurabh 70.0
suman 100.0
pranav 50.0
sanoj 0.0
hero 0.0
dtype: float64

In [35]: # min

students.min()

Out[35]: iq 0
percent 0
lpa 0
dtype: int64

In [36]: students.min(axis = 1)

Out[36]: name
Gourab 10
saurabh 7
suman 14
pranav 2
sanoj 0
hero 0
dtype: int64

In [37]: # max

students.max()

Out[37]: iq 120
percent 100
lpa 14
dtype: int64

In [38]: students.max(axis = 1)

Out[38]: name
Gourab 100
saurabh 90
suman 120
pranav 80
sanoj 0
hero 0
dtype: int64


In [39]: # standard deviation -->> std

students.std()

Out[39]: iq 52.057660
percent 41.952354
lpa 5.787918
dtype: float64

In [40]: students.std(axis = 1)

Out[40]: name
Gourab 47.258156
saurabh 43.316663
suman 56.320511
pranav 39.344631
sanoj 0.000000
hero 0.000000
dtype: float64

In [41]: # variance -->

students.var()

Out[41]: iq 2710.0
percent 1760.0
lpa 33.5
dtype: float64

In [42]: students.var(axis = 1)

Out[42]: name
Gourab 2233.333333
saurabh 1876.333333
suman 3172.000000
pranav 1548.000000
sanoj 0.000000
hero 0.000000
dtype: float64

Selecting columns from a DataFrame


In [43]: # single cols -->> it will return series

movies['title_x']

Out[43]: 0 Uri: The Surgical Strike


1 Battalion 609
2 The Accidental Prime Minister (film)
3 Why Cheat India
4 Evening Shadows
...
1624 Tera Mera Saath Rahen
1625 Yeh Zindagi Ka Safar
1626 Sabse Bada Sukh
1627 Daaka
1628 Humsafar
Name: title_x, Length: 1629, dtype: object


In [44]: # multiple columns -->> it will return DataFrame

movies[['title_x','year_of_release','actors']]

Out[44]: title_x year_of_release actors

0 Uri: The Surgical Strike 2019 Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...

1 Battalion 609 2019 Vicky Ahuja|Shoaib Ibrahim|Shrikant Kamat|Elen...

2 The Accidental Prime Minister (film) 2019 Anupam Kher|Akshaye Khanna|Aahana Kumra|Atul S...

3 Why Cheat India 2019 Emraan Hashmi|Shreya Dhanwanthary|Snighdadeep ...

4 Evening Shadows 2018 Mona Ambegaonkar|Ananth Narayan Mahadevan|Deva...

... ... ... ...

1624 Tera Mera Saath Rahen 2001 Ajay Devgn|Sonali Bendre|Namrata Shirodkar|Pre...

1625 Yeh Zindagi Ka Safar 2001 Ameesha Patel|Jimmy Sheirgill|Nafisa Ali|Gulsh...

1626 Sabse Bada Sukh 2018 Vijay Arora|Asrani|Rajni Bala|Kumud Damle|Utpa...

1627 Daaka 2019 Gippy Grewal|Zareen Khan|

1628 Humsafar 2011 Fawad Khan|

1629 rows × 3 columns

Selecting rows from a DataFrame


.iloc --> searches using index positions --> iloc works on integer index positions,
which are always available (0, 1, 2, ...)
.loc --> searches using index labels --> loc works on index labels, which are
not always available (e.g. a name column set as the index)
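
A minimal side-by-side sketch (not an original cell) before the movie examples below:

# the same row selected two ways on a small labelled frame
demo = pd.DataFrame({'marks': [80, 70, 100]}, index=['a', 'b', 'c'])
demo.iloc[0]    # by integer position
demo.loc['a']   # by index label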

In [45]: # single row

movies.iloc[0]


Out[45]: title_x Uri: The Surgical Strike


imdb_id tt8291224
poster_path https://upload.wikimedia.org/wikipedia/en/thum...
wiki_link https://en.wikipedia.org/wiki/Uri:_The_Surgica...
title_y Uri: The Surgical Strike
original_title Uri: The Surgical Strike
is_adult 0
year_of_release 2019
runtime 138
genres Action|Drama|War
imdb_rating 8.4
imdb_votes 35112
story Divided over five chapters the film chronicle...
summary Indian army special forces execute a covert op...
tagline NaN
actors Vicky Kaushal|Paresh Rawal|Mohit Raina|Yami Ga...
wins_nominations 4 wins
release_date 11 January 2019 (USA)
Name: 0, dtype: object

In [46]: # Multiple rows

movies.iloc[0:16:3]

Out[46]: [truncated table: movies.iloc[0:16:3], i.e. rows 0, 3, 6, 9, 12 and 15; columns shown: title_x, imdb_id, poster_path, ...]

In [47]: # fancy indexing

movies.iloc[[0,4,5,9]]

Out[47]: [truncated table: movies.iloc[[0,4,5,9]]; columns shown: title_x, imdb_id, poster_path, ...]

In [48]: # loc

students.loc['Gourab']

Out[48]: iq 100
percent 80
lpa 10
Name: Gourab, dtype: int64

In [49]: students.loc['Gourab':'pranav':2]

Out[49]: iq percent lpa

name

Gourab 100 80 10

suman 120 100 14

In [50]: students.loc[['suman','pranav','sanoj']]

Out[50]: iq percent lpa

name

suman 120 100 14

pranav 80 50 2

sanoj 0 0 0


In [51]: students.iloc[0:5:2]

Out[51]: iq percent lpa

name

Gourab 100 80 10

suman 120 100 14

sanoj 0 0 0

In [52]: students.iloc[[0,2,3,5]]

Out[52]: iq percent lpa

name

Gourab 100 80 10

suman 120 100 14

pranav 80 50 2

hero 0 0 0

Selecting both rows and columns


In [53]: movies.iloc[0:3,0:3]

Out[53]: title_x imdb_id poster_path

0 Uri: The Surgical Strike tt8291224 https://upload.wikimedia.org/wikipedia/en/thum...

1 Battalion 609 tt9472208 NaN

2 The Accidental Prime Minister (film) tt6986710 https://upload.wikimedia.org/wikipedia/en/thum...

In [54]: movies.loc[0:2,'title_x':'poster_path']

Out[54]: title_x imdb_id poster_path

0 Uri: The Surgical Strike tt8291224 https://upload.wikimedia.org/wikipedia/en/thum...

1 Battalion 609 tt9472208 NaN

2 The Accidental Prime Minister (film) tt6986710 https://upload.wikimedia.org/wikipedia/en/thum...

Filtering a DataFrame
In [55]: # Find all the final winners

mask = ipl['MatchNumber'] == 'Final'


new_df = ipl[mask]
new_df[['Season','WinningTeam']]

Out[55]: Season WinningTeam

0 2022 Gujarat Titans

74 2021 Chennai Super Kings

134 2020/21 Mumbai Indians

194 2019 Mumbai Indians

254 2018 Chennai Super Kings

314 2017 Mumbai Indians

373 2016 Sunrisers Hyderabad

433 2015 Mumbai Indians

492 2014 Kolkata Knight Riders

552 2013 Mumbai Indians

628 2012 Kolkata Knight Riders

702 2011 Chennai Super Kings

775 2009/10 Chennai Super Kings

835 2009 Deccan Chargers

892 2007/08 Rajasthan Royals

In [56]: # single line code for above

ipl[ipl['MatchNumber'] == 'Final'][['Season','WinningTeam']]


Out[56]: Season WinningTeam

0 2022 Gujarat Titans

74 2021 Chennai Super Kings

134 2020/21 Mumbai Indians

194 2019 Mumbai Indians

254 2018 Chennai Super Kings

314 2017 Mumbai Indians

373 2016 Sunrisers Hyderabad

433 2015 Mumbai Indians

492 2014 Kolkata Knight Riders

552 2013 Mumbai Indians

628 2012 Kolkata Knight Riders

702 2011 Chennai Super Kings

775 2009/10 Chennai Super Kings

835 2009 Deccan Chargers

892 2007/08 Rajasthan Royals

In [57]: # how many super over finishes have occured

ipl[ipl['SuperOver'] == 'Y'].shape[0]

Out[57]: 14

In [58]: # how many matches has csk won in kolkata

ipl[(ipl['City'] == 'Kolkata') & (ipl['WinningTeam'] == 'Chennai Super Kings')].shape[0]

Out[58]: 5

In [59]: # toss winner is match winner in percentage

(ipl[ipl['TossWinner'] == ipl['WinningTeam']].shape[0]/ipl.shape[0])*100

Out[59]: 51.473684210526315

In [60]: # movies with rating higher than 8 and votes > 10000

movies[(movies['imdb_rating'] > 8) & (movies['imdb_votes'] > 10000)].shape[0]

Out[60]: 43

In [61]: # Action movie with rating higher than 7.5


# .contains('action')
mask1 = movies['genres'].str.split('|').apply(lambda x: 'Action' in x)
mask2 = movies['imdb_rating'] > 7.5


movies[mask1 & mask2].shape[0]

Out[61]: 33
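
The .contains('action') hint in the comment above refers to a string-matching alternative; a minimal sketch (an assumption, not an original cell; na=False guards rows with a missing genres value):

# substring match on the raw genres string instead of splitting it into a list
mask1_alt = movies['genres'].str.contains('Action', na=False)
movies[mask1_alt & mask2].shape[0]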

Adding New Columns


In [62]: # Completely new

movies['Country'] = 'India'
movies.head(2)

Out[62]: [truncated table: movies.head(2) with the new Country column; columns shown: title_x, imdb_id, poster_path, ...]

In [63]: # from existing ones

# --> will only work when no missing values are present


# movies['actors'].str.split('|').apply(lambda x: x[0])
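
A NaN-safe variant of the commented-out line (a sketch, not an original cell; lead_actor is a hypothetical column name): the .str accessor propagates missing values instead of raising on them.

# .str[0] returns NaN for rows where actors is missing, so no error is raised
movies['lead_actor'] = movies['actors'].str.split('|').str[0]
movies['lead_actor'].head()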

Important DataFrame Functions


In [64]: # astype

ipl.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int64
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null object
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: float64(1), int64(1), object(18)
memory usage: 148.6+ KB

In [65]: ipl['ID'] = ipl['ID'].astype('int32')

In [66]: ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int32
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null object
4 MatchNumber 950 non-null object
5 Team1 950 non-null object
6 Team2 950 non-null object
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: float64(1), int32(1), object(18)
memory usage: 144.9+ KB


In [67]: ipl['Season'] = ipl['Season'].astype('category')


ipl['Team1'] = ipl['Team1'].astype('category')
ipl['Team2'] = ipl['Team2'].astype('category')

In [68]: ipl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 950 entries, 0 to 949
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 950 non-null int32
1 City 899 non-null object
2 Date 950 non-null object
3 Season 950 non-null category
4 MatchNumber 950 non-null object
5 Team1 950 non-null category
6 Team2 950 non-null category
7 Venue 950 non-null object
8 TossWinner 950 non-null object
9 TossDecision 950 non-null object
10 SuperOver 946 non-null object
11 WinningTeam 946 non-null object
12 WonBy 950 non-null object
13 Margin 932 non-null float64
14 method 19 non-null object
15 Player_of_Match 946 non-null object
16 Team1Players 950 non-null object
17 Team2Players 950 non-null object
18 Umpire1 950 non-null object
19 Umpire2 950 non-null object
dtypes: category(3), float64(1), int32(1), object(15)
memory usage: 127.4+ KB

Task
"https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

In [69]: import pandas as pd


import numpy as np

Basic DataFrame
Consider the following Python dictionary data and Python list labels:

data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills',


'spoonbills', 'Cranes', 'plovers', 'Cranes', 'spoonbills',
'spoonbills', 'Cranes'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4,
3.5], 'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2, 2],
'priority': ['yes', 'yes', 'no', np.nan, 'no', 'no',
'no', 'yes', 'no', 'no','yes']}


labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

Q-1:
i. Create a DataFrame birds from the above dictionary data which has the index labels.

ii. Display basic information about the dataFrame.

iii. Show Alternate rows of the dataframe.

In [70]: # code here


data = {'birds': ['Cranes', 'Cranes', 'plovers', 'spoonbills', 'spoonbills',
'Cranes', 'plovers', 'Cranes', 'spoonbills', 'spoonbills',
'Cranes'],
'age': [3.5, 4, 1.5, np.nan, 6, 3, 5.5, np.nan, 8, 4, 3.5],
'visits': [2, 4, 3, 4, 3, 4, 2, 2, 3, 2, 2],
'priority': ['yes', 'yes', 'no', np.nan, 'no', 'no', 'no', 'yes',
'no', 'no','yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
# i. Create a DataFrame birds from the above dictionary data which has the
# index labels.
df1 = pd.DataFrame(data = data, index = labels)

In [71]: #2 Display basic information about the dataFrame.

df1.info()
df1.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, a to k
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 birds 11 non-null object
1 age 9 non-null float64
2 visits 11 non-null int64
3 priority 10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 440.0+ bytes
Out[71]: age visits

count 9.000000 11.000000

mean 4.333333 2.818182

std 1.903943 0.873863

min 1.500000 2.000000

25% 3.500000 2.000000

50% 4.000000 3.000000

75% 5.500000 3.500000

max 8.000000 4.000000


In [72]: #3 Show Alternate rows of the dataframe.


df1.iloc[::2]

Out[72]: birds age visits priority

a Cranes 3.5 2 yes

c plovers 1.5 3 no

e spoonbills 6.0 3 no

g plovers 5.5 2 no

i spoonbills 8.0 3 no

k Cranes 3.5 2 yes

Q-2:
i. Show only rows [1st, 3rd, 7th] from columns ['bird', 'age']

ii. Select rows where the number of visits is less than 4.

In [73]: # code here


#1 Show only rows [1st, 3rd, 7th] from columns ['bird', 'age']

df1[['birds', 'age']].iloc[[0,2,6]]

Out[73]: birds age

a Cranes 3.5

c plovers 1.5

g plovers 5.5

In [74]: #2 Select rows where the number of visits is less than 4

df1[df1.visits<4]

Out[74]: birds age visits priority

a Cranes 3.5 2 yes

c plovers 1.5 3 no

e spoonbills 6.0 3 no

g plovers 5.5 2 no

h Cranes NaN 2 yes

i spoonbills 8.0 3 no

j spoonbills 4.0 2 no

k Cranes 3.5 2 yes


Q-3:
i. Select all rows with nan values in age and visits column.

ii. Fill nan with respective series mode value.

In [75]: # code here


#1 Select all rows with nan values in age and visits column.

df1[df1.age.isna() | df1.visits.isna()]

Out[75]: birds age visits priority

d spoonbills NaN 4 NaN

h Cranes NaN 2 yes

In [76]: #2 Fill nan with respective series mode value.

df1.age.fillna(df1.age.mode()[0], inplace=True)
df1.visits.fillna(df1.visits.mode()[0], inplace=True)

Q-4
i. Find the total number of visits of the bird Cranes

ii. Find the number of each type of birds in dataframe.

iii. Print no of duplicate rows

iv. Drop duplicate rows and make the changes permanent. Show the dataframe after the
changes.

In [77]: # code here


#1 Find the total number of visits of the bird Cranes

df1[df1.birds == "Cranes"].visits.sum()

Out[77]: 14

In [78]: #2 Find the number of each type of birds in dataframe.

df1.birds.value_counts()

Out[78]: birds
Cranes 5
spoonbills 4
plovers 2
Name: count, dtype: int64

In [79]: #3 Print no of duplicate rows

df1.duplicated().sum()


Out[79]: 2

In [80]: #4 Drop duplicate rows and make the changes permanent. Show the dataframe after the changes.

df1.drop_duplicates(inplace=True)

Q-5: In the IPL matches dataset some team names have
changed.
You will have to treat them as the same team.

'Delhi Capitals' was formerly 'Delhi Daredevils'
'Punjab Kings' was formerly 'Kings XI Punjab'
'Rising Pune Supergiant' was formerly 'Rising Pune Supergiants'

You need to make the changes accordingly. Use the current name for each team.

Be careful: Gujarat Titans and Gujarat Lions are different teams.

In [81]: # code here


data = pd.read_csv("IPL_Matches_2008_2022.csv")
changed_name = {'Delhi Daredevils':'Delhi Capitals',
'Kings XI Punjab':'Punjab Kings',
'Rising Pune Supergiants':'Rising Pune Supergiant'}
data.replace(changed_name.keys(), changed_name.values(),inplace=True )
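
Equivalently (a sketch, not an original cell), replace accepts the mapping directly:

# each old team name is mapped to its current name in one call
data.replace(changed_name, inplace=True)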

In [82]: data.columns

Out[82]: Index(['ID', 'City', 'Date', 'Season', 'MatchNumber', 'Team1', 'Team2',


'Venue', 'TossWinner', 'TossDecision', 'SuperOver', 'WinningTeam',
'WonBy', 'Margin', 'method', 'Player_of_Match', 'Team1Players',
'Team2Players', 'Umpire1', 'Umpire2'],
dtype='object')

Q-6: Write code which displays a bar chart of the top 5
teams that have played the maximum number of matches in
the IPL.
Hint: Be careful, the data is divided across 2 different columns (Team1 and Team2).

In [83]: # code here


# Considering both team slots
(data['Team1'].value_counts()+data["Team2"].value_counts()).sort_values(ascending=False).head(5).plot(kind='bar')

Out[83]: <Axes: >


Q-7: Player who got the most Player of the Match
awards playing against Mumbai Indians.
Just for this question, assume the Player of the Match award is given to a player
from the winning team, although this is true in most of the cases.

In [84]: # code here


m1 = (data.Team1 == "Mumbai Indians") | (data.Team2 == "Mumbai Indians")
m2 = data.WinningTeam != "Mumbai Indians"
data[m1 & m2].Player_of_Match.value_counts().head(1)

Out[84]: Player_of_Match
SPD Smith 4
Name: count, dtype: int64

Q-8: Team1 vs Team2 Dashboard

Create a function which takes two strings (the names of two teams) as input. Show the win/loss
record between them and the player who got the most Player of the Match awards in matches
between these two teams.

team1_vs_team2('Kolkata Knight Riders','Chennai Super Kings')

In [85]: # code here

def team1_vs_team2(t1, t2):
    m1 = (data.Team1 == t1) | (data.Team2 == t1)  # filter for matches where team t1 played
    m2 = (data.Team1 == t2) | (data.Team2 == t2)  # filter for matches where team t2 played
    df1 = data[m1 & m2]
    print(df1.WinningTeam.value_counts())
    print(df1.Player_of_Match.value_counts().head(1))

team1_vs_team2('Kolkata Knight Riders','Chennai Super Kings')

WinningTeam
Chennai Super Kings 17
Kolkata Knight Riders 9
Name: count, dtype: int64
Player_of_Match
RA Jadeja 3
Name: count, dtype: int64

Q-9: Find the top 7 cities where Kolkata Knight Riders
matches are played most frequently and plot the
result as a bar chart.
.plot(kind = "bar") can help you plot the bar chart. You can also learn more
about this method from here

In [86]: # code here


data[(data.Team1 == "Kolkata Knight Riders") | (data.Team2 == "Kolkata Knight Riders")]['City'].value_counts().head(7).plot(kind='bar')

Out[86]: <Axes: xlabel='City'>


Q-10: Find the average margin for the team Mumbai
Indians for only the 2011 season.
In [87]: # code here
data[((data.Team1 == "Mumbai Indians") | (data.Team2 == "Mumbai Indians")) & (data.Season == "2011")]['Margin'].mean()

Out[87]: 19.25

In [88]: # value_counts
# sort_values
# rank
# sort index
# set index
# rename index -> rename
# reset index
# unique & nunique
# isnull/notnull/hasnans
# dropna
# fillna
# drop_duplicates
# drop
# apply
# isin
# corr
# nlargest -> nsmallest
# insert
# copy


In [89]: import numpy as np


import pandas as pd

In [90]: # Value_counts(series & DataFrame)

marks = pd.DataFrame([
[100,80,10],
[90,70,7],
[120,100,14],
[80,70,14],
[80,70,14]
],columns = ['iq','marks','package'])

marks

Out[90]: iq marks package

0 100 80 10

1 90 70 7

2 120 100 14

3 80 70 14

4 80 70 14

In [91]: marks.value_counts()

Out[91]: iq marks package


80 70 14 2
90 70 7 1
100 80 10 1
120 100 14 1
Name: count, dtype: int64

In [92]: a = pd.Series([1,1,1,2,2,2,3,3,4,4,5,6,7,8,8,9])
a.value_counts()

Out[92]: 1 3
2 3
3 2
4 2
8 2
5 1
6 1
7 1
9 1
Name: count, dtype: int64

In [93]: ipl = pd.read_csv("ipl-matches.csv")


ipl.head(2)

Out[93]: [truncated table: ipl.head(2); columns shown: ID, City, Date, Season, MatchNumber, Team1, Team2, Venue, ...]

In [94]: # find which player has won most potm -> in finals and qualifiers

ipl[~ipl['MatchNumber'].str.isdigit()]['Player_of_Match'].value_counts()


Out[94]: Player_of_Match
KA Pollard 3
F du Plessis 3
SK Raina 3
A Kumble 2
MK Pandey 2
YK Pathan 2
M Vijay 2
JJ Bumrah 2
AB de Villiers 2
SR Watson 2
HH Pandya 1
Harbhajan Singh 1
A Nehra 1
V Sehwag 1
UT Yadav 1
MS Bisla 1
BJ Hodge 1
MEK Hussey 1
MS Dhoni 1
CH Gayle 1
MM Patel 1
DE Bollinger 1
AC Gilchrist 1
RG Sharma 1
DA Warner 1
MC Henriques 1
JC Buttler 1
RM Patidar 1
DA Miller 1
VR Iyer 1
SP Narine 1
RD Gaikwad 1
TA Boult 1
MP Stoinis 1
KS Williamson 1
RR Pant 1
SA Yadav 1
Rashid Khan 1
AD Russell 1
KH Pandya 1
KV Sharma 1
NM Coulter-Nile 1
Washington Sundar 1
BCJ Cutting 1
M Ntini 1
Name: count, dtype: int64

In [95]: # Toss Decision plot

ipl['TossDecision'].value_counts().plot(kind = 'pie')

Out[95]: <Axes: ylabel='count'>


In [96]: # How many matches each team has played

(ipl['Team1'].value_counts() + ipl['Team2'].value_counts()).sort_values(
ascending=False)

Out[96]: Mumbai Indians 231


Royal Challengers Bangalore 226
Kolkata Knight Riders 223
Chennai Super Kings 208
Rajasthan Royals 192
Kings XI Punjab 190
Delhi Daredevils 161
Sunrisers Hyderabad 152
Deccan Chargers 75
Delhi Capitals 63
Pune Warriors 46
Gujarat Lions 30
Punjab Kings 28
Gujarat Titans 16
Rising Pune Supergiant 16
Lucknow Super Giants 15
Kochi Tuskers Kerala 14
Rising Pune Supergiants 14
Name: count, dtype: int64

In [97]: # sort_values(series and dataframe) -> ascending -> na_position -> inplace -> multiple columns

x = pd.Series([12,14,1,56,89])
x

Out[97]: 0 12
1 14
2 1
3 56
4 89
dtype: int64


In [98]: x.sort_values(ascending = False)

Out[98]: 4 89
3 56
1 14
0 12
2 1
dtype: int64

In [99]: movies = pd.read_csv("movies.csv")


movies.head(2)

Out[99]: [truncated table: movies.head(2); columns shown: title_x, imdb_id, poster_path, ...]

In [100… movies.sort_values('title_x',ascending = False)

Out[100… [truncated table: movies sorted by title_x in descending order, from Zubeidaa down to 16 December (film)]

1629 rows × 18 columns

In [101… students = pd.DataFrame(


{
'name':['Gourab','Saurabh','Suman',np.nan,'Pranav',np.nan,'Sanoj',
np.nan,'aditya',np.nan],
'college':['bit','iit','vit',np.nan,np.nan,'vlsi','ssit',
np.nan,np.nan,'git'],
'branch':['eee','it','cse',np.nan,'me','ce','civ','cse','bio',np.nan],
'cgpa':[6.66,8.25,6.41,np.nan,5.6,9.0,7.4,10,7.4,np.nan],
'package':[4,5,6,np.nan,6,7,8,9,np.nan,np.nan]

}
)

students

Out[101… name college branch cgpa package

0 Gourab bit eee 6.66 4.0

1 Saurabh iit it 8.25 5.0

2 Suman vit cse 6.41 6.0

3 NaN NaN NaN NaN NaN

4 Pranav NaN me 5.60 6.0

5 NaN vlsi ce 9.00 7.0

6 Sanoj ssit civ 7.40 8.0

7 NaN NaN cse 10.00 9.0

8 aditya NaN bio 7.40 NaN

9 NaN git NaN NaN NaN

In [102… students.sort_values('name',na_position='first',ascending=False,inplace=True)


In [103… students

Out[103… name college branch cgpa package

3 NaN NaN NaN NaN NaN

5 NaN vlsi ce 9.00 7.0

7 NaN NaN cse 10.00 9.0

9 NaN git NaN NaN NaN

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [104… movies.sort_values(['year_of_release','title_x'],ascending=[True,False])

Out[104… [truncated table: movies sorted by year_of_release ascending and title_x descending, from Zubeidaa (2001) down to 22 Yards (2019)]

1629 rows × 18 columns

In [105… # rank(series) -->> assign rank on the basis of ...

batsman = pd.read_csv("batsman_runs_ipl.csv")
batsman.head()

Out[105… batter batsman_run

0 A Ashish Reddy 280

1 A Badoni 161

2 A Chandila 4

3 A Chopra 53

4 A Choudhary 25

In [106… batsman['batting_rank'] = batsman['batsman_run'].rank(ascending=False)


batsman.sort_values('batting_rank')

Out[106… batter batsman_run batting_rank

569 V Kohli 6634 1.0

462 S Dhawan 6244 2.0

130 DA Warner 5883 3.0

430 RG Sharma 5881 4.0

493 SK Raina 5536 5.0

... ... ... ...

512 SS Cottrell 0 594.0

466 S Kaushik 0 594.0

203 IC Pandey 0 594.0

467 S Ladda 0 594.0

468 S Lamichhane 0 594.0

605 rows × 3 columns
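
About the fractional ranks above (a note with a sketch, not an original cell): rank() uses method='average' by default, so tied run totals share the average of their positions; method='dense' hands out consecutive integers instead.

# alternative tie handling: every distinct run total gets the next integer rank
batsman['batsman_run'].rank(ascending=False, method='dense')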

In [107… # sort_index(series and dataframe)

marks = {
'maths':67,
'english':57,
'science':89,
'hindi':100
}


marks_series = pd.Series(marks)
marks_series

Out[107… maths 67
english 57
science 89
hindi 100
dtype: int64

In [108… marks_series.sort_index(ascending=False)

Out[108… science 89
maths 67
hindi 100
english 57
dtype: int64

In [109… movies.sort_index(ascending=False)

Out[109… [truncated table: movies sorted by index in descending order, from 1628 Humsafar down to 0 Uri: The Surgical Strike]

1629 rows × 18 columns

In [110… # set_index(dataframe) -> inplace

batsman.set_index('batter',inplace=True)

In [111… # reset_index(series + dataframe) -> drop parameter

batsman.reset_index(inplace=True)
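
The drop parameter mentioned in the comment above (a sketch, not an original cell): drop=True discards the old index instead of inserting it back as a column.

# the 'batter' labels are thrown away rather than restored as a column
batsman.set_index('batter').reset_index(drop=True).head()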

In [112… batsman

Out[112… batter batsman_run batting_rank

0 A Ashish Reddy 280 166.5

1 A Badoni 161 226.0

2 A Chandila 4 535.0

3 A Chopra 53 329.0

4 A Choudhary 25 402.5

... ... ... ...

600 Yash Dayal 0 594.0

601 Yashpal Singh 47 343.0

602 Younis Khan 3 547.5

603 Yuvraj Singh 2754 27.0

604 Z Khan 117 256.0

605 rows × 3 columns

In [113… # how to replace the existing index without losing it

batsman.reset_index().set_index('batting_rank')


Out[113… index batter batsman_run

batting_rank

166.5 0 A Ashish Reddy 280

226.0 1 A Badoni 161

535.0 2 A Chandila 4

329.0 3 A Chopra 53

402.5 4 A Choudhary 25

... ... ... ...

594.0 600 Yash Dayal 0

343.0 601 Yashpal Singh 47

547.5 602 Younis Khan 3

27.0 603 Yuvraj Singh 2754

256.0 604 Z Khan 117

605 rows × 3 columns

In [114… # series to dataframe using reset_index

marks_series.reset_index()

Out[114… index 0

0 maths 67

1 english 57

2 science 89

3 hindi 100

In [115… # rename(dataframe) -> index

movies.set_index('title_x',inplace=True)

In [116… movies.rename(columns = {'imdb_id':'imdb','poster_path':'link'},inplace = True)

In [117… movies.rename(index = {'Uri: The Surgical Strike':'URI',


'Battalion 609':'Battalion'},inplace = True)

In [118… movies.head(2)


Out[118… [truncated table: movies.head(2) after the rename, index title_x with labels URI and Battalion, columns imdb, link, ...]

In [119… # unique(series)

temp = pd.Series([1,1,2,2,3,3,4,4,5,5,np.nan,np.nan])
temp.unique()

Out[119… array([ 1., 2., 3., 4., 5., nan])

In [120… ipl['Season'].unique().shape

Out[120… (15,)

In [121… # nunique(series + dataframe) -> does not count nan -> dropna parameter

# nunique will not count null or missing values but unique will.
ipl['Season'].nunique()

Out[121… 15
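
A quick check of that difference on the temp series defined above (a sketch, not an original cell):

len(temp.unique())   # 6 -- the NaN entry is counted
temp.nunique()       # 5 -- NaN is ignored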

In [122… # isnull (series + DataFrame)

students[~students['name'].isnull()]

Out[122… name college branch cgpa package

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [123… students['name'][students['name'].isnull()]


Out[123… 3 NaN
5 NaN
7 NaN
9 NaN
Name: name, dtype: object

In [124… # Not Null(series+ Dataframe)

students['name'][students['name'].notnull()]

Out[124… 8 aditya
2 Suman
1 Saurabh
6 Sanoj
4 Pranav
0 Gourab
Name: name, dtype: object

In [125… # hasnans(series)

students['name'].hasnans

Out[125… True

In [126… students

Out[126… name college branch cgpa package

3 NaN NaN NaN NaN NaN

5 NaN vlsi ce 9.00 7.0

7 NaN NaN cse 10.00 9.0

9 NaN git NaN NaN NaN

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [127… students.isnull()


Out[127… name college branch cgpa package

3 True True True True True

5 True False False False False

7 True True False False False

9 True False True True True

8 False True False False True

2 False False False False False

1 False False False False False

6 False False False False False

4 False True False False False

0 False False False False False

In [128… students.notnull()

Out[128… name college branch cgpa package

3 False False False False False

5 False True True True True

7 False False True True True

9 False True False False False

8 True False True True False

2 True True True True True

1 True True True True True

6 True True True True True

4 True False True True True

0 True True True True True

In [129… # dropna(series + dataframe) -->> how parameter -->> works like or (drop a row if any value is missing)

students['name'].dropna()

Out[129… 8 aditya
2 Suman
1 Saurabh
6 Sanoj
4 Pranav
0 Gourab
Name: name, dtype: object
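
By contrast with the how='all' call below, the default how='any' is the "or" behaviour the comment refers to: a row is dropped as soon as any column is missing (a sketch, not an original cell).

students.dropna()   # only the fully filled rows (Suman, Saurabh, Sanoj, Gourab) survive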

In [130… students.dropna(how = 'all')


Out[130… name college branch cgpa package

5 NaN vlsi ce 9.00 7.0

7 NaN NaN cse 10.00 9.0

9 NaN git NaN NaN NaN

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [131… students.dropna(subset = ['name'])

Out[131… name college branch cgpa package

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [132… students.dropna(subset= ['name','college'])

Out[132… name college branch cgpa package

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

0 Gourab bit eee 6.66 4.0

In [133… # fillna (series + DataFrame)

students['name'].fillna('unknown')


Out[133… 3 unknown
5 unknown
7 unknown
9 unknown
8 aditya
2 Suman
1 Saurabh
6 Sanoj
4 Pranav
0 Gourab
Name: name, dtype: object

In [134… students.fillna(0)

Out[134… name college branch cgpa package

3 0 0 0 0.00 0.0

5 0 vlsi ce 9.00 7.0

7 0 0 cse 10.00 9.0

9 0 git 0 0.00 0.0

8 aditya 0 bio 7.40 0.0

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav 0 me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [135… students['package'].fillna(students['package'].mean())

Out[135… 3 6.428571
5 7.000000
7 9.000000
9 6.428571
8 6.428571
2 6.000000
1 5.000000
6 8.000000
4 6.000000
0 4.000000
Name: package, dtype: float64

In [136… students['name'].fillna(method = 'bfill')


Out[136… 3 aditya
5 aditya
7 aditya
9 aditya
8 aditya
2 Suman
1 Saurabh
6 Sanoj
4 Pranav
0 Gourab
Name: name, dtype: object

drop_duplicates (series + dataframe) ->


works like AND across the subset columns -> related method: duplicated()
e.g. marks.duplicated().sum() counts the duplicate rows
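A minimal sketch of how duplicated() relates to drop_duplicates() (here marks is a hypothetical Series, not one defined in this notebook):

marks = pd.Series([35, 40, 40, 50, 50, 50])   # hypothetical data

marks.duplicated()         # True for every occurrence that repeats an earlier value
marks.duplicated().sum()   # 3 -> count of duplicate entries
marks.drop_duplicates()    # keeps the first occurrence of each value: 35, 40, 50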

In [138… temp = pd.Series([1,1,1,2,2,3,3,4,5,5,6,6,7,7,8,8,8,9,9,0,0,])


temp.drop_duplicates()

Out[138… 0 1
3 2
5 3
7 4
8 5
10 6
12 7
14 8
17 9
19 0
dtype: int64

marks.drop_duplicates(keep = 'last')

In [139… # find the last match played by virat kohli in Delhi

ipl['all_players'] = ipl['Team1Players'] + ipl['Team2Players']


ipl.head()


Out[139… ID City Date Season MatchNumber Team1 Team2 Ven

Narend
2022- Rajasthan Gujarat M
0 1312200 Ahmedabad 2022 Final
05-29 Royals Titans Stadiu
Ahmedab

Narend
Royal
2022- Rajasthan M
1 1312199 Ahmedabad 2022 Qualifier 2 Challengers
05-27 Royals Stadiu
Bangalore
Ahmedab

Royal Lucknow Ed
2022-
2 1312198 Kolkata 2022 Eliminator Challengers Super Garde
05-25
Bangalore Giants Kolk

Ed
2022- Rajasthan Gujarat
3 1312197 Kolkata 2022 Qualifier 1 Garde
05-24 Royals Titans
Kolk

Wankhe
2022- Sunrisers Punjab
4 1304116 Mumbai 2022 70 Stadiu
05-22 Hyderabad Kings
Mum

5 rows × 21 columns

In [140… def did_kohli_play(players_list):
    return 'V Kohli' in players_list

In [141… ipl['did_kohli_play'] = ipl['all_players'].apply(did_kohli_play)

ipl[(ipl['City'] == "Delhi") & (ipl['did_kohli_play'] == True)].drop_duplicates(
    subset = ['City','did_kohli_play'], keep = 'first')

In [142… # drop (series + dataframe)

temp = pd.Series([10,2,3,16,45,78,10])

temp


Out[142… 0 10
1 2
2 3
3 16
4 45
5 78
6 10
dtype: int64

In [143… temp.drop(index=[0,6])

Out[143… 1 2
2 3
3 16
4 45
5 78
dtype: int64

In [144… students

Out[144… name college branch cgpa package

3 NaN NaN NaN NaN NaN

5 NaN vlsi ce 9.00 7.0

7 NaN NaN cse 10.00 9.0

9 NaN git NaN NaN NaN

8 aditya NaN bio 7.40 NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

0 Gourab bit eee 6.66 4.0

In [145… students.drop(columns = ['branch','cgpa'])


Out[145… name college package

3 NaN NaN NaN

5 NaN vlsi 7.0

7 NaN NaN 9.0

9 NaN git NaN

8 aditya NaN NaN

2 Suman vit 6.0

1 Saurabh iit 5.0

6 Sanoj ssit 8.0

4 Pranav NaN 6.0

0 Gourab bit 4.0

In [146… students.drop(index=[0,8])

Out[146… name college branch cgpa package

3 NaN NaN NaN NaN NaN

5 NaN vlsi ce 9.00 7.0

7 NaN NaN cse 10.00 9.0

9 NaN git NaN NaN NaN

2 Suman vit cse 6.41 6.0

1 Saurabh iit it 8.25 5.0

6 Sanoj ssit civ 7.40 8.0

4 Pranav NaN me 5.60 6.0

In [147… students.set_index('name').drop(index= 'Sanoj')


Out[147… college branch cgpa package

name

NaN NaN NaN NaN NaN

NaN vlsi ce 9.00 7.0

NaN NaN cse 10.00 9.0

NaN git NaN NaN NaN

aditya NaN bio 7.40 NaN

Suman vit cse 6.41 6.0

Saurabh iit it 8.25 5.0

Pranav NaN me 5.60 6.0

Gourab bit eee 6.66 4.0

In [148… # apply (series + Dataframe)

temp = pd.Series([10,20,30,40,50])

temp

Out[148… 0 10
1 20
2 30
3 40
4 50
dtype: int64

In [149… def sigmoid(value):
    # parentheses matter: sigmoid(x) = 1 / (1 + e^(-x))
    return 1/(1 + np.exp(-value))

In [150… temp.apply(sigmoid)

Out[150… 0 0.999955
1 1.000000
2 1.000000
3 1.000000
4 1.000000
dtype: float64

In [151… points_df = pd.DataFrame(


{
'1st point':[(3,4),(-6,5),(0,0),(-10,1),(4,5)],
'2nd point':[(-3,4),(0,0),(2,2),(10,10),(1,1)]
}
)

points_df


Out[151… 1st point 2nd point

0 (3, 4) (-3, 4)

1 (-6, 5) (0, 0)

2 (0, 0) (2, 2)

3 (-10, 1) (10, 10)

4 (4, 5) (1, 1)

In [152… def euclidean(row):
    pt_A = row['1st point']
    pt_B = row['2nd point']

    return ((pt_A[0] - pt_B[0])**2 + (pt_A[1] - pt_B[1])**2)**0.5

In [153… points_df['distance'] = points_df.apply(euclidean,axis = 1)


points_df

Out[153… 1st point 2nd point distance

0 (3, 4) (-3, 4) 6.000000

1 (-6, 5) (0, 0) 7.810250

2 (0, 0) (2, 2) 2.828427

3 (-10, 1) (10, 10) 21.931712

4 (4, 5) (1, 1) 5.000000

GroupBy
GroupBy is the study of groups -->> groupby is almost always applied on a categorical column: split the data into groups, apply an aggregation to each group, then combine the results.
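A minimal split-apply-combine sketch on a toy DataFrame (the column names here are made up for illustration):

team_df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B', 'B'],   # categorical column to group on
    'score': [10, 20, 5, 15, 25]
})

team_df.groupby('team')['score'].mean()   # split by team, apply mean, combine -> A: 15.0, B: 15.0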

In [154… movies = pd.read_csv("imdb-top-1000.csv")


movies.head()


Out[154… Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 No

The
Frank Tim
0 Shawshank 1994 142 Drama 9.3
Darabont Robbins
Redemption

Francis
The Marlon
1 1972 175 Crime 9.2 Ford
Godfather Brando
Coppola

The Dark Christopher Christian


2 2008 152 Action 9.0
Knight Nolan Bale

The Francis
Al
3 Godfather: 1974 202 Crime 9.0 Ford
Pacino
Part II Coppola

12 Angry Sidney Henry


4 1957 96 Crime 9.0
Men Lumet Fonda

In [155… genres = movies.groupby('Genre')

In [156… # Applying builtin aggregation functions on groupby objects


genres.sum()


Out[156… Series_Title Released_Yea

Genre

The Dark KnightThe Lord


Action 2008200320102001200219991980197719621954200019
of the Rings: The Retu...

InterstellarBack to the
Adventure 2014198520091981196819621959201319751963194819
FutureInglourious Bast...

Sen to Chihiro no
Animation kamikakushiThe Lion 2001199419882016201820172008199719952019200920
KingHota...

Schindler's
Biography ListGoodfellasHamiltonThe 1993199020202011200220171995198420182013201320
Intoucha...

GisaengchungLa vita è
Comedy 2019199719361931200919641940200120001973196019
bellaModern TimesCity Li...

The GodfatherThe
Crime Godfather: Part II12 Angry 1972197419571994200219991995199120192006199519
Me...

The Shawshank
Drama RedemptionFight 1994199919941975202019981946201420061998198819
ClubForrest Gump...

E.T. the Extra-


Family TerrestrialWilly Wonka & 1982197
the Ch...

Das Cabinet des Dr.


Fantasy 1920192
CaligariNosferatu

The Third ManThe Maltese


Film-Noir 19491941194
FalconShadow of a Doubt

PsychoAlienThe ThingThe
Horror 1960197919821973196819612017197819332004200
ExorcistNight of the L...

MementoRear
Mystery WindowVertigoShutter 20001954195820102012199519721938198820121998199
IslandKahaani...

Thriller Wait Until Dark 196


Series_Title Released_Yea

Genre

Il buono, il brutto, il
Western 196619681965197
cattivoOnce Upon a Tim...

In [157… genres.min()


Out[157… Series_Title Released_Year Runtime IMDB_Rating Director Star1 N

Genre

Abhishek Aamir
Action 300 1924 45 7.6
Chaubey Khan

2001: A
Akira Aamir
Adventure Space 1925 88 7.6
Kurosawa Khan
Odyssey

Adam Adrian
Animation Akira 1940 71 7.6
Elliot Molina

12 Years a Adam Adrien


Biography 1928 93 7.6
Slave McKay Brody

(500) Days Alejandro Aamir


Comedy 1921 68 7.6
of Summer G. Iñárritu Khan

12 Angry Akira Ajay


Crime 1931 80 7.6
Men Kurosawa Devgn

Aamir Abhay
Drama 1917 1925 64 7.6
Khan Deol

E.T. the
Gene
Family Extra- 1971 100 7.8 Mel Stuart
Wilder
Terrestrial

Das
F.W. Max
Fantasy Cabinet des 1920 76 7.9
Murnau Schreck
Dr. Caligari

Shadow of Alfred Humphrey


Film-Noir 1941 100 7.8
a Doubt Hitchcock Bogart

Alejandro Anthony
Horror Alien 1933 71 7.6
Amenábar Perkins

Bernard-
Alex
Mystery Dark City 1938 96 7.6 Pierre
Proyas
Donnadieu

Wait Until Terence Audrey


Thriller 1967 108 7.8
Dark Young Hepburn

Il buono, il
Clint Clint
Western brutto, il 1965 132 7.8
Eastwood Eastwood
cattivo

In [ ]:

In [ ]:

In [160… # find the top 3 genres by total earning

movies.groupby('Genre').sum()['Gross'].sort_values(ascending = False).head(3)


Out[160… Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64

In [161… # 2nd method

movies.groupby('Genre')['Gross'].sum().sort_values(ascending = False).head(3)

Out[161… Genre
Drama 3.540997e+10
Action 3.263226e+10
Comedy 1.566387e+10
Name: Gross, dtype: float64

In [163… # Find the genre with highest avg IMDB rating

movies.groupby('Genre')['IMDB_Rating'].mean().sort_values(ascending = False).head(1)

Out[163… Genre
Western 8.35
Name: IMDB_Rating, dtype: float64

In [164… # find director with most popularity

movies.groupby('Director')['No_of_Votes'].sum().sort_values(ascending = False).head(1)

Out[164… Director
Christopher Nolan 11578345
Name: No_of_Votes, dtype: int64

In [165… # Find the highest rated movie of each genre

movies.groupby('Genre')['IMDB_Rating'].max()

Out[165… Genre
Action 9.0
Adventure 8.6
Animation 8.6
Biography 8.9
Comedy 8.6
Crime 9.2
Drama 9.3
Family 7.8
Fantasy 8.1
Film-Noir 8.1
Horror 8.5
Mystery 8.4
Thriller 7.8
Western 8.8
Name: IMDB_Rating, dtype: float64

In [166… # find number of movies done by each actor

# movies['Star1'].value_counts()

movies.groupby('Star1')['Series_Title'].count().sort_values(ascending=False)


Out[166… Star1
Tom Hanks 12
Robert De Niro 11
Clint Eastwood 10
Al Pacino 10
Leonardo DiCaprio 9
..
Glen Hansard 1
Giuseppe Battiston 1
Giulietta Masina 1
Gerardo Taracena 1
Ömer Faruk Sorak 1
Name: Series_Title, Length: 660, dtype: int64

In [167… len(movies.groupby('Genre'))

Out[167… 14

In [168… movies['Genre'].nunique()

Out[168… 14

In [169… movies.groupby('Genre').size()

Out[169… Genre
Action 172
Adventure 72
Animation 82
Biography 88
Comedy 155
Crime 107
Drama 289
Family 2
Fantasy 2
Film-Noir 3
Horror 11
Mystery 12
Thriller 1
Western 4
dtype: int64

In [170… genres = movies.groupby('Genre')


# genres.first()
# genres.last()
genres.nth(5)


Out[170… Series_Title Released_Year Runtime Genre IMDB_Rating Director Sta

Lana L
14 The Matrix 1999 136 Action 8.7
Wachowski Wachow

Saving
Steven
24 Private 1998 169 Drama 8.6 Tom Han
Spielberg
Ryan

The Green Frank


25 1999 189 Crime 8.6 Tom Han
Mile Darabont

Ayla: The
54 Daughter 2017 125 Biography 8.4 Can Ulkay Erdem C
of War

Lee Adr
61 Coco 2017 105 Animation 8.4
Unkrich Mol

Dr.
Strangelove
Stanley Pe
78 or: How I 1964 95 Comedy 8.4
Kubrick Sell
Learned to
Stop Worr...

Lawrence Pe
116 1962 228 Adventure 8.3 David Lean
of Arabia O'Too

Twelve Terry Bru


393 1995 129 Mystery 8.0
Monkeys Gilliam Wi

The Jack Debor


707 1961 100 Horror 7.8
Innocents Clayton K

In [ ]:

In [175… movies['Genre'].value_counts()

Out[175… Genre
Drama 289
Action 172
Comedy 155
Crime 107
Biography 88
Animation 82
Adventure 72
Mystery 12
Horror 11
Western 4
Film-Noir 3
Fantasy 2
Family 2
Thriller 1
Name: count, dtype: int64

In [176… genres.get_group('Horror')


Out[176… Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 N

Alfred Anthony
49 Psycho 1960 109 Horror 8.5
Hitchcock Perkins

Ridley Sigourney
75 Alien 1979 117 Horror 8.4
Scott Weaver

John Kurt
271 The Thing 1982 109 Horror 8.1
Carpenter Russell

William Ellen
419 The Exorcist 1973 122 Horror 8.0
Friedkin Burstyn

Night of
George A. Duane
544 the Living 1968 96 Horror 7.9
Romero Jones
Dead

The Jack Deborah


707 1961 100 Horror 7.8
Innocents Clayton Kerr

Jordan Daniel
724 Get Out 2017 104 Horror 7.7
Peele Kaluuya

John Donald
844 Halloween 1978 91 Horror 7.7
Carpenter Pleasence

The
James Claude
876 Invisible 1933 71 Horror 7.7
Whale Rains
Man

James Cary
932 Saw 2004 103 Horror 7.6
Wan Elwes

Alejandro Nicole
948 The Others 2001 101 Horror 7.6
Amenábar Kidman

In [177… #genres.get_group('Fantasy')

movies[movies['Genre'] == 'Fantasy']

Out[177… Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 No_

Das
Robert Werner
321 Cabinet des 1920 76 Fantasy 8.1
Wiene Krauss
Dr. Caligari

F.W. Max
568 Nosferatu 1922 94 Fantasy 7.9
Murnau Schreck

In [178… genres.groups


Out[178… {'Action': [2, 5, 8, 10, 13, 14, 16, 29, 30, 31, 39, 42, 44, 55, 57, 59, 60, 6
3, 68, 72, 106, 109, 129, 130, 134, 140, 142, 144, 152, 155, 160, 161, 166, 16
8, 171, 172, 177, 181, 194, 201, 202, 216, 217, 223, 224, 236, 241, 262, 275, 2
94, 308, 320, 325, 326, 331, 337, 339, 340, 343, 345, 348, 351, 353, 356, 357,
362, 368, 369, 375, 376, 390, 410, 431, 436, 473, 477, 479, 482, 488, 493, 496,
502, 507, 511, 532, 535, 540, 543, 564, 569, 570, 573, 577, 582, 583, 602, 605,
608, 615, 623, ...], 'Adventure': [21, 47, 93, 110, 114, 116, 118, 137, 178, 17
9, 191, 193, 209, 226, 231, 247, 267, 273, 281, 300, 301, 304, 306, 323, 329, 3
61, 366, 377, 402, 406, 415, 426, 458, 470, 497, 498, 506, 513, 514, 537, 549,
552, 553, 566, 576, 604, 609, 618, 638, 647, 675, 681, 686, 692, 711, 713, 739,
755, 781, 797, 798, 851, 873, 884, 912, 919, 947, 957, 964, 966, 984, 991], 'An
imation': [23, 43, 46, 56, 58, 61, 66, 70, 101, 135, 146, 151, 158, 170, 197, 2
05, 211, 213, 219, 229, 230, 242, 245, 246, 270, 330, 332, 358, 367, 378, 386,
389, 394, 395, 399, 401, 405, 409, 469, 499, 510, 516, 518, 522, 578, 586, 592,
595, 596, 599, 633, 640, 643, 651, 665, 672, 694, 728, 740, 741, 744, 756, 758,
761, 771, 783, 796, 799, 822, 828, 843, 875, 891, 892, 902, 906, 920, 956, 971,
976, 986, 992], 'Biography': [7, 15, 18, 35, 38, 54, 102, 107, 131, 139, 147, 1
57, 159, 173, 176, 212, 215, 218, 228, 235, 243, 263, 276, 282, 290, 298, 317,
328, 338, 342, 346, 359, 360, 365, 372, 373, 385, 411, 416, 418, 424, 429, 484,
525, 536, 542, 545, 575, 579, 587, 600, 606, 614, 622, 632, 635, 644, 649, 650,
657, 671, 673, 684, 729, 748, 753, 757, 759, 766, 770, 779, 809, 810, 815, 820,
831, 849, 858, 877, 882, 897, 910, 915, 923, 940, 949, 952, 987], 'Comedy': [1
9, 26, 51, 52, 64, 78, 83, 95, 96, 112, 117, 120, 127, 128, 132, 153, 169, 183,
192, 204, 207, 208, 214, 221, 233, 238, 240, 250, 251, 252, 256, 261, 266, 277,
284, 311, 313, 316, 318, 322, 327, 374, 379, 381, 392, 396, 403, 413, 414, 417,
427, 435, 445, 446, 449, 455, 459, 460, 463, 464, 466, 471, 472, 475, 481, 490,
494, 500, 503, 509, 526, 528, 530, 531, 533, 538, 539, 541, 547, 557, 558, 562,
563, 565, 574, 591, 593, 594, 598, 613, 626, 630, 660, 662, 667, 679, 680, 683,
687, 701, ...], 'Crime': [1, 3, 4, 6, 22, 25, 27, 28, 33, 37, 41, 71, 77, 79, 8
6, 87, 103, 108, 111, 113, 123, 125, 133, 136, 162, 163, 164, 165, 180, 186, 18
7, 189, 198, 222, 232, 239, 255, 257, 287, 288, 299, 305, 335, 363, 364, 380, 3
84, 397, 437, 438, 441, 442, 444, 450, 451, 465, 474, 480, 485, 487, 505, 512,
519, 520, 523, 527, 546, 556, 560, 584, 597, 603, 607, 611, 621, 639, 653, 664,
669, 676, 695, 708, 723, 762, 763, 767, 775, 791, 795, 802, 811, 823, 827, 833,
885, 895, 921, 922, 926, 938, ...], 'Drama': [0, 9, 11, 17, 20, 24, 32, 34, 36,
40, 45, 50, 53, 62, 65, 67, 73, 74, 76, 80, 82, 84, 85, 88, 89, 90, 91, 92, 94,
97, 98, 99, 100, 104, 105, 121, 122, 124, 126, 138, 141, 143, 148, 149, 150, 15
4, 156, 167, 174, 175, 182, 184, 185, 188, 190, 195, 196, 199, 200, 203, 206, 2
10, 225, 227, 234, 237, 244, 248, 249, 253, 254, 258, 259, 260, 264, 265, 268,
269, 272, 274, 278, 279, 280, 283, 285, 286, 289, 291, 292, 293, 295, 296, 297,
302, 303, 307, 310, 312, 314, 315, ...], 'Family': [688, 698], 'Fantasy': [321,
568], 'Film-Noir': [309, 456, 712], 'Horror': [49, 75, 271, 419, 544, 707, 724,
844, 876, 932, 948], 'Mystery': [69, 81, 119, 145, 220, 393, 420, 714, 829, 89
9, 959, 961], 'Thriller': [700], 'Western': [12, 48, 115, 691]}

In [179… # agg method


# passing dict

genres.agg({
'Runtime' : 'mean',
'IMDB_Rating' : 'mean',
'No_of_Votes' : 'mean',
'Gross' : 'sum',
'Metascore' : 'min'
})


Out[179… Runtime IMDB_Rating No_of_Votes Gross Metascore

Genre

Action 129.046512 7.949419 420246.581395 3.263226e+10 33.0

Adventure 134.111111 7.937500 313557.819444 9.496922e+09 41.0

Animation 99.585366 7.930488 268032.073171 1.463147e+10 61.0

Biography 136.022727 7.938636 272805.045455 8.276358e+09 48.0

Comedy 112.129032 7.901290 178195.658065 1.566387e+10 45.0

Crime 126.392523 8.016822 313398.271028 8.452632e+09 47.0

Drama 124.737024 7.957439 212343.612457 3.540997e+10 28.0

Family 107.500000 7.800000 275610.500000 4.391106e+08 67.0

Fantasy 85.000000 8.000000 73111.000000 7.827267e+08 NaN

Film-Noir 104.000000 7.966667 122405.000000 1.259105e+08 94.0

Horror 102.090909 7.909091 340232.363636 1.034649e+09 46.0

Mystery 119.083333 7.975000 350250.333333 1.256417e+09 52.0

Thriller 108.000000 7.800000 27733.000000 1.755074e+07 81.0

Western 148.250000 8.350000 322416.250000 5.822151e+07 69.0

In [182… # split (apply) combine


# apply -> builtin function

genres.apply(min)


Out[182… Series_Title Released_Year Runtime Genre IMDB_Rating Director

Genre

Abhishek
Action 300 1924 45 Action 7.6
Chaubey

2001: A
Akira
Adventure Space 1925 88 Adventure 7.6
Kurosawa
Odyssey

Adam
Animation Akira 1940 71 Animation 7.6
Elliot

12 Years a Adam
Biography 1928 93 Biography 7.6
Slave McKay

(500) Days Alejandro


Comedy 1921 68 Comedy 7.6
of Summer G. Iñárritu

12 Angry Akira
Crime 1931 80 Crime 7.6
Men Kurosawa

Aamir
Drama 1917 1925 64 Drama 7.6
Khan

E.T. the
Family Extra- 1971 100 Family 7.8 Mel Stuart
Terrestrial

Das
F.W.
Fantasy Cabinet des 1920 76 Fantasy 7.9
Murnau
Dr. Caligari

Shadow of Alfred H
Film-Noir 1941 100 Film-Noir 7.8
a Doubt Hitchcock

Alejandro
Horror Alien 1933 71 Horror 7.6
Amenábar

Alex
Mystery Dark City 1938 96 Mystery 7.6
Proyas
Do

Wait Until Terence


Thriller 1967 108 Thriller 7.8
Dark Young

Il buono, il
Clint
Western brutto, il 1965 132 Western 7.8
Eastwood E
cattivo

In [183… # find number of movies starting with A for each group

def foo(group):
    return group['Series_Title'].str.startswith('A').sum()

In [184… genres.apply(foo)


Out[184… Genre
Action 10
Adventure 2
Animation 2
Biography 9
Comedy 14
Crime 4
Drama 21
Family 0
Fantasy 0
Film-Noir 0
Horror 1
Mystery 0
Thriller 0
Western 0
dtype: int64

In [185… # find ranking of each movie in the group according to IMDB score

def rank_movie(group):
    group['genre_rank'] = group['IMDB_Rating'].rank(ascending=False)
    return group

In [186… genres.apply(rank_movie)


Out[186… Series_Title Released_Year Runtime Genre IMDB_Rating Director

Genre

Action 2 The Dark Christopher


2008 152 Action 9.0
Knight Nolan

The Lord of
the Rings: Peter
5 2003 201 Action 8.9
The Return Jackson
of the King

Christopher
8 Inception 2010 148 Action 8.8
Nolan

The Lord of
the Rings:
Peter
10 The 2001 178 Action 8.8
Jackson
Fellowship
of the Ring

The Lord of
the Rings: Peter
13 2002 179 Action 8.7
The Two Jackson
Towers

... ... ... ... ... ... ... ...

Thriller 700 Wait Until Terence


1967 108 Thriller 7.8
Dark Young

Western 12 Il buono, il
Sergio
brutto, il 1966 161 Western 8.8
Leone
cattivo

Once Upon
Sergio
48 a Time in 1968 165 Western 8.5
Leone
the West

Per qualche
Sergio
115 dollaro in 1965 132 Western 8.3
Leone
più

The Outlaw Clint


691 1976 135 Western 7.8
Josey Wales Eastwood

1000 rows × 11 columns

In [187… # find normalized IMDB rating group wise

def normal(group):
    group['norm_rating'] = (group['IMDB_Rating'] - group['IMDB_Rating'].min())/(
        group['IMDB_Rating'].max() - group['IMDB_Rating'].min())
    return group

genres.apply(normal)


Out[187… Series_Title Released_Year Runtime Genre IMDB_Rating Director

Genre

Action 2 The Dark Christopher


2008 152 Action 9.0
Knight Nolan

The Lord of
the Rings: Peter
5 2003 201 Action 8.9
The Return Jackson
of the King

Christopher
8 Inception 2010 148 Action 8.8
Nolan

The Lord of
the Rings:
Peter
10 The 2001 178 Action 8.8
Jackson
Fellowship
of the Ring

The Lord of
the Rings: Peter
13 2002 179 Action 8.7
The Two Jackson
Towers

... ... ... ... ... ... ... ...

Thriller 700 Wait Until Terence


1967 108 Thriller 7.8
Dark Young

Western 12 Il buono, il
Sergio
brutto, il 1966 161 Western 8.8
Leone
cattivo

Once Upon
Sergio
48 a Time in 1968 165 Western 8.5
Leone
the West

Per qualche
Sergio
115 dollaro in 1965 132 Western 8.3
Leone
più

The Outlaw Clint


691 1976 135 Western 7.8
Josey Wales Eastwood

1000 rows × 11 columns

In [188… # Group by on multiple columns

duo = movies.groupby(['Director','Star1'])
duo

# Size
duo.size()

# get group
duo.get_group(('Aamir Khan','Amole Gupte'))


Out[188… Series_Title Released_Year Runtime Genre IMDB_Rating Director Star1 No_of_

Taare Aamir Amole


65 2007 165 Drama 8.4 1
Zameen Par Khan Gupte

In [189… # Find the most earning actor -->> director combo

duo['Gross'].sum().sort_values(ascending = False).head(1)

Out[189… Director Star1


Akira Kurosawa Toshirô Mifune 2.999877e+09
Name: Gross, dtype: float64

In [190… # find the best(in-terms of metascore(avg)) actor->genre combo

movies.groupby(['Star1','Genre'])['Metascore'].mean().reset_index().sort_values(
'Metascore',ascending = False).head(1)

Out[190… Star1 Genre Metascore

230 Ellar Coltrane Drama 100.0

Merge

MERGING PANDAS I/O


Merging, Joining & Concatenating
In [194… courses = pd.read_csv('courses.csv')
students = pd.read_csv('students.csv')
nov = pd.read_csv('reg-month1.csv')
dec = pd.read_csv('reg-month2.csv')
matches = pd.read_csv('matches.csv')
delivery = pd.read_csv('deliveries.csv')

In [195… # pd.concat -->> Vertically Merged

regs = pd.concat([nov,dec],ignore_index = True)


regs


Out[195… student_id course_id

0 23 1

1 15 5

2 18 6

3 23 4

4 16 9

5 18 1

6 1 1

7 7 8

8 22 3

9 15 1

10 19 4

11 1 6

12 7 10

13 11 7

14 13 3

15 24 4

16 21 1

17 16 5

18 23 3

19 17 7

20 23 6

21 25 1

22 19 2

23 25 10

24 3 3

25 3 5

26 16 7

27 12 10

28 12 1

29 14 9

30 7 7

31 7 2

32 16 3


student_id course_id

33 17 10

34 11 8

35 14 6

36 12 5

37 12 7

38 18 8

39 1 10

40 1 9

41 2 5

42 7 6

43 22 5

44 22 6

45 23 9

46 23 5

47 14 4

48 14 1

49 11 10

50 42 9

51 50 8

52 38 1

In [197… pd.concat([nov,dec]).shape

Out[197… (53, 2)

In [199… # Multiindex DataFrame

multi = pd.concat([nov,dec],keys=['Nov','Dec'])
# iloc will not work here because iloc is used for integer (position) based indexing
multi.loc['Nov']
# loc is used for labels.
multi.loc['Dec']


Out[199… student_id course_id

0 3 5

1 16 7

2 12 10

3 12 1

4 14 9

5 7 7

6 7 2

7 16 3

8 17 10

9 11 8

10 14 6

11 12 5

12 12 7

13 18 8

14 1 10

15 1 9

16 2 5

17 7 6

18 22 5

19 22 6

20 23 9

21 23 5

22 14 4

23 14 1

24 11 10

25 42 9

26 50 8

27 38 1

In [200… # for special row

multi.loc[('Dec',8)]

Out[200… student_id 17
course_id 10
Name: (Dec, 8), dtype: int64


In [201… # For a horizontal join (axis = 1)

pd.concat([nov,dec],axis = 1)

Out[201… student_id course_id student_id course_id

0 23.0 1.0 3 5

1 15.0 5.0 16 7

2 18.0 6.0 12 10

3 23.0 4.0 12 1

4 16.0 9.0 14 9

5 18.0 1.0 7 7

6 1.0 1.0 7 2

7 7.0 8.0 16 3

8 22.0 3.0 17 10

9 15.0 1.0 11 8

10 19.0 4.0 14 6

11 1.0 6.0 12 5

12 7.0 10.0 12 7

13 11.0 7.0 18 8

14 13.0 3.0 1 10

15 24.0 4.0 1 9

16 21.0 1.0 2 5

17 16.0 5.0 7 6

18 23.0 3.0 22 5

19 17.0 7.0 22 6

20 23.0 6.0 23 9

21 25.0 1.0 23 5

22 19.0 2.0 14 4

23 25.0 10.0 14 1

24 3.0 3.0 11 10

25 NaN NaN 42 9

26 NaN NaN 50 8

27 NaN NaN 38 1

In [202… # inner join


students.merge(regs,how = 'inner',on = 'student_id')


Out[202… student_id name partner course_id

0 1 Kailash Harjo 23 1

1 1 Kailash Harjo 23 6

2 1 Kailash Harjo 23 10

3 1 Kailash Harjo 23 9

4 2 Esha Butala 1 5

5 3 Parveen Bhalla 3 3

6 3 Parveen Bhalla 3 5

7 7 Tarun Thaker 9 8

8 7 Tarun Thaker 9 10

9 7 Tarun Thaker 9 7

10 7 Tarun Thaker 9 2

11 7 Tarun Thaker 9 6

12 11 David Mukhopadhyay 20 7

13 11 David Mukhopadhyay 20 8

14 11 David Mukhopadhyay 20 10

15 12 Radha Dutt 19 10

16 12 Radha Dutt 19 1

17 12 Radha Dutt 19 5

18 12 Radha Dutt 19 7

19 13 Munni Varghese 24 3

20 14 Pranab Natarajan 22 9

21 14 Pranab Natarajan 22 6

22 14 Pranab Natarajan 22 4

23 14 Pranab Natarajan 22 1

24 15 Preet Sha 16 5

25 15 Preet Sha 16 1

26 16 Elias Dodiya 25 9

27 16 Elias Dodiya 25 5

28 16 Elias Dodiya 25 7

29 16 Elias Dodiya 25 3

30 17 Yasmin Palan 7 7

31 17 Yasmin Palan 7 10

32 18 Fardeen Mahabir 13 6


student_id name partner course_id

33 18 Fardeen Mahabir 13 1

34 18 Fardeen Mahabir 13 8

35 19 Qabeel Raman 12 4

36 19 Qabeel Raman 12 2

37 21 Seema Kota 15 1

38 22 Yash Sethi 21 3

39 22 Yash Sethi 21 5

40 22 Yash Sethi 21 6

41 23 Chhavi Lachman 18 1

42 23 Chhavi Lachman 18 4

43 23 Chhavi Lachman 18 3

44 23 Chhavi Lachman 18 6

45 23 Chhavi Lachman 18 9

46 23 Chhavi Lachman 18 5

47 24 Radhika Suri 17 4

48 25 Shashank D’Alia 2 1

49 25 Shashank D’Alia 2 10

In [203… # left join

courses.merge(regs,how = 'left',on = 'course_id')


Out[203… course_id course_name price student_id

0 1 python 2499 23.0

1 1 python 2499 18.0

2 1 python 2499 1.0

3 1 python 2499 15.0

4 1 python 2499 21.0

5 1 python 2499 25.0

6 1 python 2499 12.0

7 1 python 2499 14.0

8 1 python 2499 38.0

9 2 sql 3499 19.0

10 2 sql 3499 7.0

11 3 data analysis 4999 22.0

12 3 data analysis 4999 13.0

13 3 data analysis 4999 23.0

14 3 data analysis 4999 3.0

15 3 data analysis 4999 16.0

16 4 machine learning 9999 23.0

17 4 machine learning 9999 19.0

18 4 machine learning 9999 24.0

19 4 machine learning 9999 14.0

20 5 tableau 2499 15.0

21 5 tableau 2499 16.0

22 5 tableau 2499 3.0

23 5 tableau 2499 12.0

24 5 tableau 2499 2.0

25 5 tableau 2499 22.0

26 5 tableau 2499 23.0

27 6 power bi 1899 18.0

28 6 power bi 1899 1.0

29 6 power bi 1899 23.0

30 6 power bi 1899 14.0

31 6 power bi 1899 7.0

32 6 power bi 1899 22.0


course_id course_name price student_id

33 7 ms sxcel 1599 11.0

34 7 ms sxcel 1599 17.0

35 7 ms sxcel 1599 16.0

36 7 ms sxcel 1599 7.0

37 7 ms sxcel 1599 12.0

38 8 pandas 1099 7.0

39 8 pandas 1099 11.0

40 8 pandas 1099 18.0

41 8 pandas 1099 50.0

42 9 plotly 699 16.0

43 9 plotly 699 14.0

44 9 plotly 699 1.0

45 9 plotly 699 23.0

46 9 plotly 699 42.0

47 10 pyspark 2499 7.0

48 10 pyspark 2499 25.0

49 10 pyspark 2499 12.0

50 10 pyspark 2499 17.0

51 10 pyspark 2499 1.0

52 10 pyspark 2499 11.0

53 11 Numpy 699 NaN

54 12 C++ 1299 NaN

In [204… # Right join

temp_df = pd.DataFrame({
'student_id':[26,27,28],
'name':['Gaurav','Saurav','Rohan'],
'partner':[28,26,17]
})

students = pd.concat([students,temp_df],ignore_index=True)

In [205… students.merge(regs,how = 'right',on = 'student_id')


Out[205… student_id name partner course_id

0 23 Chhavi Lachman 18.0 1

1 15 Preet Sha 16.0 5

2 18 Fardeen Mahabir 13.0 6

3 23 Chhavi Lachman 18.0 4

4 16 Elias Dodiya 25.0 9

5 18 Fardeen Mahabir 13.0 1

6 1 Kailash Harjo 23.0 1

7 7 Tarun Thaker 9.0 8

8 22 Yash Sethi 21.0 3

9 15 Preet Sha 16.0 1

10 19 Qabeel Raman 12.0 4

11 1 Kailash Harjo 23.0 6

12 7 Tarun Thaker 9.0 10

13 11 David Mukhopadhyay 20.0 7

14 13 Munni Varghese 24.0 3

15 24 Radhika Suri 17.0 4

16 21 Seema Kota 15.0 1

17 16 Elias Dodiya 25.0 5

18 23 Chhavi Lachman 18.0 3

19 17 Yasmin Palan 7.0 7

20 23 Chhavi Lachman 18.0 6

21 25 Shashank D’Alia 2.0 1

22 19 Qabeel Raman 12.0 2

23 25 Shashank D’Alia 2.0 10

24 3 Parveen Bhalla 3.0 3

25 3 Parveen Bhalla 3.0 5

26 16 Elias Dodiya 25.0 7

27 12 Radha Dutt 19.0 10

28 12 Radha Dutt 19.0 1

29 14 Pranab Natarajan 22.0 9

30 7 Tarun Thaker 9.0 7

31 7 Tarun Thaker 9.0 2

32 16 Elias Dodiya 25.0 3


student_id name partner course_id

33 17 Yasmin Palan 7.0 10

34 11 David Mukhopadhyay 20.0 8

35 14 Pranab Natarajan 22.0 6

36 12 Radha Dutt 19.0 5

37 12 Radha Dutt 19.0 7

38 18 Fardeen Mahabir 13.0 8

39 1 Kailash Harjo 23.0 10

40 1 Kailash Harjo 23.0 9

41 2 Esha Butala 1.0 5

42 7 Tarun Thaker 9.0 6

43 22 Yash Sethi 21.0 5

44 22 Yash Sethi 21.0 6

45 23 Chhavi Lachman 18.0 9

46 23 Chhavi Lachman 18.0 5

47 14 Pranab Natarajan 22.0 4

48 14 Pranab Natarajan 22.0 1

49 11 David Mukhopadhyay 20.0 10

50 42 NaN NaN 9

51 50 NaN NaN 8

52 38 NaN NaN 1

In [206… # regs.merge(students,how = 'left',on = 'student_id')


# Right join by changing the order
students.merge(regs,how = 'left',on = 'student_id')


Out[206… student_id name partner course_id

0 1 Kailash Harjo 23 1.0

1 1 Kailash Harjo 23 6.0

2 1 Kailash Harjo 23 10.0

3 1 Kailash Harjo 23 9.0

4 2 Esha Butala 1 5.0

5 3 Parveen Bhalla 3 3.0

6 3 Parveen Bhalla 3 5.0

7 4 Marlo Dugal 14 NaN

8 5 Kusum Bahri 6 NaN

9 6 Lakshmi Contractor 10 NaN

10 7 Tarun Thaker 9 8.0

11 7 Tarun Thaker 9 10.0

12 7 Tarun Thaker 9 7.0

13 7 Tarun Thaker 9 2.0

14 7 Tarun Thaker 9 6.0

15 8 Radheshyam Dey 5 NaN

16 9 Nitika Chatterjee 4 NaN

17 10 Aayushman Sant 8 NaN

18 11 David Mukhopadhyay 20 7.0

19 11 David Mukhopadhyay 20 8.0

20 11 David Mukhopadhyay 20 10.0

21 12 Radha Dutt 19 10.0

22 12 Radha Dutt 19 1.0

23 12 Radha Dutt 19 5.0

24 12 Radha Dutt 19 7.0

25 13 Munni Varghese 24 3.0

26 14 Pranab Natarajan 22 9.0

27 14 Pranab Natarajan 22 6.0

28 14 Pranab Natarajan 22 4.0

29 14 Pranab Natarajan 22 1.0

30 15 Preet Sha 16 5.0

31 15 Preet Sha 16 1.0

32 16 Elias Dodiya 25 9.0


student_id name partner course_id

33 16 Elias Dodiya 25 5.0

34 16 Elias Dodiya 25 7.0

35 16 Elias Dodiya 25 3.0

36 17 Yasmin Palan 7 7.0

37 17 Yasmin Palan 7 10.0

38 18 Fardeen Mahabir 13 6.0

39 18 Fardeen Mahabir 13 1.0

40 18 Fardeen Mahabir 13 8.0

41 19 Qabeel Raman 12 4.0

42 19 Qabeel Raman 12 2.0

43 20 Hanuman Hegde 11 NaN

44 21 Seema Kota 15 1.0

45 22 Yash Sethi 21 3.0

46 22 Yash Sethi 21 5.0

47 22 Yash Sethi 21 6.0

48 23 Chhavi Lachman 18 1.0

49 23 Chhavi Lachman 18 4.0

50 23 Chhavi Lachman 18 3.0

51 23 Chhavi Lachman 18 6.0

52 23 Chhavi Lachman 18 9.0

53 23 Chhavi Lachman 18 5.0

54 24 Radhika Suri 17 4.0

55 25 Shashank D’Alia 2 1.0

56 25 Shashank D’Alia 2 10.0

57 26 Gaurav 28 NaN

58 27 Saurav 26 NaN

59 28 Rohan 17 NaN

In [207… # Outer join

students.merge(regs,how = 'outer',on = 'student_id').tail(10)


Out[207… student_id name partner course_id

53 23 Chhavi Lachman 18.0 5.0

54 24 Radhika Suri 17.0 4.0

55 25 Shashank D’Alia 2.0 1.0

56 25 Shashank D’Alia 2.0 10.0

57 26 Gaurav 28.0 NaN

58 27 Saurav 26.0 NaN

59 28 Rohan 17.0 NaN

60 38 NaN NaN 1.0

61 42 NaN NaN 9.0

62 50 NaN NaN 8.0

In [208… # Find total revenue generated by company

total = regs.merge(courses,how = 'inner',on = 'course_id')['price'].sum()


total

Out[208… 154247

In [209… # Find month by month Revenue

temp_df = pd.concat([nov,dec],keys = ['Nov','Dec']).reset_index()


temp_df.merge(courses,on = 'course_id').groupby('level_0')['price'].sum()

Out[209… level_0
Dec 65072
Nov 89175
Name: price, dtype: int64

In [210… # Print the registration table


# cols -> name -> course -> price

regs.merge(students,on = 'student_id').merge(courses,on = 'course_id')[['name','course_name','price']]

In [212… # plot bar chart for revenue/courses

regs.merge(courses,on = 'course_id').groupby('course_name')['price'].sum().plot(kind = 'bar')

Out[212… <Axes: xlabel='course_name'>


In [213… # find students who enrolled in both the months

common_student_id = np.intersect1d(nov['student_id'],dec['student_id'])
common_student_id

Out[213… array([ 1, 3, 7, 11, 16, 17, 18, 22, 23], dtype=int64)

In [214… students[students['student_id'].isin(common_student_id)]


Out[214… student_id name partner

0 1 Kailash Harjo 23

2 3 Parveen Bhalla 3

6 7 Tarun Thaker 9

10 11 David Mukhopadhyay 20

15 16 Elias Dodiya 25

16 17 Yasmin Palan 7

17 18 Fardeen Mahabir 13

21 22 Yash Sethi 21

22 23 Chhavi Lachman 18

In [215… # find course that got no enrollment

# courses['course_id']
# regs['course_id']

course_id_list = np.setdiff1d(courses['course_id'],regs['course_id'])
courses[courses['course_id'].isin(course_id_list)]

Out[215… course_id course_name price

10 11 Numpy 699

11 12 C++ 1299

In [217… # find students who did not enroll into any courses

student_id_list = np.setdiff1d(students['student_id'],regs['student_id'])

In [218… students[students['student_id'].isin(student_id_list)].shape[0]

Out[218… 10

In [219… (10/28)*100

Out[219… 35.714285714285715

In [220… # Print student name -> partner name for all enrolled students
# self join

students.merge(students,how ='inner',left_on = 'partner',


right_on = 'student_id')[['name_x','name_y']]


Out[220… name_x name_y

0 Kailash Harjo Chhavi Lachman

1 Esha Butala Kailash Harjo

2 Parveen Bhalla Parveen Bhalla

3 Marlo Dugal Pranab Natarajan

4 Kusum Bahri Lakshmi Contractor

5 Lakshmi Contractor Aayushman Sant

6 Tarun Thaker Nitika Chatterjee

7 Radheshyam Dey Kusum Bahri

8 Nitika Chatterjee Marlo Dugal

9 Aayushman Sant Radheshyam Dey

10 David Mukhopadhyay Hanuman Hegde

11 Radha Dutt Qabeel Raman

12 Munni Varghese Radhika Suri

13 Pranab Natarajan Yash Sethi

14 Preet Sha Elias Dodiya

15 Elias Dodiya Shashank D’Alia

16 Yasmin Palan Tarun Thaker

17 Fardeen Mahabir Munni Varghese

18 Qabeel Raman Radha Dutt

19 Hanuman Hegde David Mukhopadhyay

20 Seema Kota Preet Sha

21 Yash Sethi Seema Kota

22 Chhavi Lachman Fardeen Mahabir

23 Radhika Suri Yasmin Palan

24 Shashank D’Alia Esha Butala

25 Gaurav Rohan

26 Saurav Gaurav

27 Rohan Yasmin Palan

In [222… # 9. find top 3 students who did the most enrollments

regs.merge(students,on = 'student_id').groupby(['student_id','name'])['name'].count().sort_values(ascending = False).head(3)


Out[222… student_id name


23 Chhavi Lachman 6
7 Tarun Thaker 5
1 Kailash Harjo 4
Name: name, dtype: int64

In [223… # 10. find top 3 students who spent the most money on courses

regs.merge(students,on = 'student_id').merge(courses,on = 'course_id').groupby(['student_id','name'])['price'].sum().sort_values(ascending = False).head(3)

Out[223… student_id name


23 Chhavi Lachman 22594
14 Pranab Natarajan 15096
19 Qabeel Raman 13498
Name: price, dtype: int64

In [224… # Alternate syntax for merge


# students.merge(regs)

pd.merge(students,regs,how='inner',on='student_id')


Out[224… student_id name partner course_id

0 1 Kailash Harjo 23 1

1 1 Kailash Harjo 23 6

2 1 Kailash Harjo 23 10

3 1 Kailash Harjo 23 9

4 2 Esha Butala 1 5

5 3 Parveen Bhalla 3 3

6 3 Parveen Bhalla 3 5

7 7 Tarun Thaker 9 8

8 7 Tarun Thaker 9 10

9 7 Tarun Thaker 9 7

10 7 Tarun Thaker 9 2

11 7 Tarun Thaker 9 6

12 11 David Mukhopadhyay 20 7

13 11 David Mukhopadhyay 20 8

14 11 David Mukhopadhyay 20 10

15 12 Radha Dutt 19 10

16 12 Radha Dutt 19 1

17 12 Radha Dutt 19 5

18 12 Radha Dutt 19 7

19 13 Munni Varghese 24 3

20 14 Pranab Natarajan 22 9

21 14 Pranab Natarajan 22 6

22 14 Pranab Natarajan 22 4

23 14 Pranab Natarajan 22 1

24 15 Preet Sha 16 5

25 15 Preet Sha 16 1

26 16 Elias Dodiya 25 9

27 16 Elias Dodiya 25 5

28 16 Elias Dodiya 25 7

29 16 Elias Dodiya 25 3

30 17 Yasmin Palan 7 7

31 17 Yasmin Palan 7 10

32 18 Fardeen Mahabir 13 6


student_id name partner course_id

33 18 Fardeen Mahabir 13 1

34 18 Fardeen Mahabir 13 8

35 19 Qabeel Raman 12 4

36 19 Qabeel Raman 12 2

37 21 Seema Kota 15 1

38 22 Yash Sethi 21 3

39 22 Yash Sethi 21 5

40 22 Yash Sethi 21 6

41 23 Chhavi Lachman 18 1

42 23 Chhavi Lachman 18 4

43 23 Chhavi Lachman 18 3

44 23 Chhavi Lachman 18 6

45 23 Chhavi Lachman 18 9

46 23 Chhavi Lachman 18 5

47 24 Radhika Suri 17 4

48 25 Shashank D’Alia 2 1

49 25 Shashank D’Alia 2 10

In [225… temp_df = delivery.merge(matches,left_on='match_id',right_on='id')

In [226… # stadium -> sixes


six_df = temp_df[temp_df['batsman_runs'] == 6]
num_sixes = six_df.groupby('venue')['venue'].count()

In [227… num_matches = matches['venue'].value_counts()

In [228… (num_sixes/num_matches).sort_values(ascending=False).head(10)

Out[228… venue
Holkar Cricket Stadium 17.600000
M Chinnaswamy Stadium 13.227273
Sharjah Cricket Stadium 12.666667
Himachal Pradesh Cricket Association Stadium 12.000000
Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium 11.727273
Wankhede Stadium 11.526316
De Beers Diamond Oval 11.333333
Maharashtra Cricket Association Stadium 11.266667
JSCA International Stadium Complex 10.857143
Sardar Patel Stadium, Motera 10.833333
dtype: float64

In [229… temp_df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending = False).drop_duplicates(subset = 'season',keep = 'first').sort_values('season')


Out[229… season batsman batsman_runs

115 2008 SE Marsh 616

229 2009 ML Hayden 572

446 2010 SR Tendulkar 618

502 2011 CH Gayle 608

684 2012 CH Gayle 733

910 2013 MEK Hussey 733

1088 2014 RV Uthappa 660

1148 2015 DA Warner 562

1383 2016 V Kohli 973

1422 2017 DA Warner 641

In [230… temp_df.groupby(['season','batsman'])['batsman_runs'].sum().reset_index().sort_values('batsman_runs',ascending = False)

Out[230… season batsman batsman_runs

1383 2016 V Kohli 973

1278 2016 DA Warner 848

910 2013 MEK Hussey 733

684 2012 CH Gayle 733

852 2013 CH Gayle 720

... ... ... ...

1467 2017 MM Patel 0

658 2012 AC Blizzard 0

475 2011 AB Dinda 0

1394 2017 AD Nath 0

58 2008 L Balaji 0

1531 rows × 3 columns

MultiIndex-Objects
Multi-indexing is a way of representing higher-dimensional data in a lower-dimensional object; it creates a hierarchy of index levels.

In [231… import numpy as np


import pandas as pd


Series is 1D and DataFrames are 2D objects


But why?
And what exactly is index?
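Before answering, a one-line reminder of what an index is (a small sketch with made-up values): the index is the set of row labels that pandas uses for lookups and alignment.

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])   # 'a', 'b', 'c' are the index labels
s['b']      # label-based access -> 20
s.index     # Index(['a', 'b', 'c'], dtype='object')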

In [233… # can we have multiple index? Let's try

index_val = [('cse',2019),('cse',2020),('cse',2021),
('cse',2022),('ece',2019),('ece',2020),('ece',2021),('ece',2022)]
a = pd.Series([1,2,3,4,5,6,7,8],index=index_val)
a

Out[233… (cse, 2019) 1


(cse, 2020) 2
(cse, 2021) 3
(cse, 2022) 4
(ece, 2019) 5
(ece, 2020) 6
(ece, 2021) 7
(ece, 2022) 8
dtype: int64

In [234… a[('cse',2022)]

Out[234… 4

In [235… # how to create multiindex object


# 1. pd.MultiIndex.from_tuples()
index_val = [('cse',2019),('cse',2020),('cse',2021),('cse',2022),('ece',2019),
('ece',2020),('ece',2021),('ece',2022)]
multiindex = pd.MultiIndex.from_tuples(index_val)
multiindex.levels[1]
# 2. pd.MultiIndex.from_product()
pd.MultiIndex.from_product([['cse','ece'],[2019,2020,2021,2022]])

Out[235… MultiIndex([('cse', 2019),


('cse', 2020),
('cse', 2021),
('cse', 2022),
('ece', 2019),
('ece', 2020),
('ece', 2021),
('ece', 2022)],
)

In [236… # creating a series with multiindex object


s = pd.Series([1,2,3,4,5,6,7,8],index=multiindex)
s


Out[236… cse 2019 1


2020 2
2021 3
2022 4
ece 2019 5
2020 6
2021 7
2022 8
dtype: int64

In [237… # how to fetch items from such a series


s['cse']

Out[237… 2019 1
2020 2
2021 3
2022 4
dtype: int64

Why do we need a MultiIndex Series or object?

When you create a MultiIndex you represent higher-dimensional data in a lower-dimensional object. A Series is 1-D, yet with a MultiIndex it can hold 2-D data; in the same way you can display 3-D, 5-D, or even 10-D data in a 2-D DataFrame by adding more index levels.
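For instance, a two-level row index lets a 2-D DataFrame hold a small 3-D cube (branch x year x metric); the numbers below are made up for illustration:

idx = pd.MultiIndex.from_product([['cse', 'ece'], [2019, 2020]])
cube = pd.DataFrame({'avg_package': [5, 6, 4, 7],
                     'students':    [60, 70, 50, 80]}, index=idx)
cube.loc[('cse', 2020)]   # one (branch, year) slice of the cube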

Stacking and Unstacking


In [238… # unstack

temp = s.unstack()
temp

Out[238… 2019 2020 2021 2022

cse 1 2 3 4

ece 5 6 7 8

In [239… # stack

temp.stack()

Out[239… cse 2019 1


2020 2
2021 3
2022 4
ece 2019 5
2020 6
2021 7
2022 8
dtype: int64


In [240… branch_df1 = pd.DataFrame(


[
[1,2],
[3,4],
[5,6],
[7,8],
[9,10],
[11,12],
[13,14],
[15,16],
],
index = multiindex,
columns = ['avg_package','students']
)

branch_df1

Out[240… avg_package students

cse 2019 1 2

2020 3 4

2021 5 6

2022 7 8

ece 2019 9 10

2020 11 12

2021 13 14

2022 15 16

In [241… branch_df1['students']

Out[241… cse 2019 2


2020 4
2021 6
2022 8
ece 2019 10
2020 12
2021 14
2022 16
Name: students, dtype: int64

In [242… # multiindex df from columns perspective


branch_df2 = pd.DataFrame(
[
[1,2,0,0],
[3,4,0,0],
[5,6,0,0],
[7,8,0,0],
],
index = [2019,2020,2021,2022],
columns = pd.MultiIndex.from_product([['delhi','mumbai'],
['avg_package','students']])
)


branch_df2

Out[242… delhi mumbai

avg_package students avg_package students

2019 1 2 0 0

2020 3 4 0 0

2021 5 6 0 0

2022 7 8 0 0

In [243… branch_df2.loc[2019]

Out[243… delhi avg_package 1


students 2
mumbai avg_package 0
students 0
Name: 2019, dtype: int64

In [244… # Multiindex df in terms of both cols and index

branch_df3 = pd.DataFrame(
[
[1,2,0,0],
[3,4,0,0],
[5,6,0,0],
[7,8,0,0],
[9,10,0,0],
[11,12,0,0],
[13,14,0,0],
[15,16,0,0],
],
index = multiindex,
columns = pd.MultiIndex.from_product([['delhi','mumbai'],
['avg_package','students']])
)

branch_df3


Out[244… delhi mumbai

avg_package students avg_package students

cse 2019 1 2 0 0

2020 3 4 0 0

2021 5 6 0 0

2022 7 8 0 0

ece 2019 9 10 0 0

2020 11 12 0 0

2021 13 14 0 0

2022 15 16 0 0

In [245… # unstack moves a row index level out into the columns

branch_df1.unstack().unstack()

Out[245… avg_package 2019 cse 1


ece 9
2020 cse 3
ece 11
2021 cse 5
ece 13
2022 cse 7
ece 15
students 2019 cse 2
ece 10
2020 cse 4
ece 12
2021 cse 6
ece 14
2022 cse 8
ece 16
dtype: int64

In [246… # stack does not work on a plain Series (there are no columns to move)


# stack converts a column level back into a row index level

branch_df1.unstack().stack().stack()


Out[246… cse 2019 avg_package 1


students 2
2020 avg_package 3
students 4
2021 avg_package 5
students 6
2022 avg_package 7
students 8
ece 2019 avg_package 9
students 10
2020 avg_package 11
students 12
2021 avg_package 13
students 14
2022 avg_package 15
students 16
dtype: int64

In [247… branch_df2.unstack()

Out[247… delhi avg_package 2019 1


2020 3
2021 5
2022 7
students 2019 2
2020 4
2021 6
2022 8
mumbai avg_package 2019 0
2020 0
2021 0
2022 0
students 2019 0
2020 0
2021 0
2022 0
dtype: int64

In [248… branch_df2.stack().stack()

Out[248… 2019 avg_package delhi 1


mumbai 0
students delhi 2
mumbai 0
2020 avg_package delhi 3
mumbai 0
students delhi 4
mumbai 0
2021 avg_package delhi 5
mumbai 0
students delhi 6
mumbai 0
2022 avg_package delhi 7
mumbai 0
students delhi 8
mumbai 0
dtype: int64


Working with multiindex dataframes


In [249… # Head and Tail

branch_df3.head()
branch_df3.tail()

# Shape
branch_df3.shape

# info
branch_df3.info()

# duplicated -> isnull


branch_df3.duplicated()
branch_df3.isnull()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 8 entries, ('cse', 2019) to ('ece', 2022)
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 (delhi, avg_package) 8 non-null int64
1 (delhi, students) 8 non-null int64
2 (mumbai, avg_package) 8 non-null int64
3 (mumbai, students) 8 non-null int64
dtypes: int64(4)
memory usage: 632.0+ bytes
Out[249… delhi mumbai

avg_package students avg_package students

cse 2019 False False False False

2020 False False False False

2021 False False False False

2022 False False False False

ece 2019 False False False False

2020 False False False False

2021 False False False False

2022 False False False False

In [250… # Extracting rows single

# when the DataFrame has a named (label) index, use loc

branch_df3.loc[('cse',2022)]

Out[250… delhi avg_package 7


students 8
mumbai avg_package 0
students 0
Name: (cse, 2022), dtype: int64


In [251… # multiple

branch_df3.loc[('cse',2019):('ece',2022):2]

Out[251… delhi mumbai

avg_package students avg_package students

cse 2019 1 2 0 0

2021 5 6 0 0

ece 2019 9 10 0 0

2021 13 14 0 0

In [252… # Using iloc

branch_df3.iloc[0:5:2]

Out[252… delhi mumbai

avg_package students avg_package students

cse 2019 1 2 0 0

2021 5 6 0 0

ece 2019 9 10 0 0

In [253… # Extracting columns

branch_df3['delhi']['students']

Out[253… cse 2019 2


2020 4
2021 6
2022 8
ece 2019 10
2020 12
2021 14
2022 16
Name: students, dtype: int64

In [254… branch_df3.iloc[:,1:3]


Out[254… delhi mumbai

students avg_package

cse 2019 2 0

2020 4 0

2021 6 0

2022 8 0

ece 2019 10 0

2020 12 0

2021 14 0

2022 16 0

In [255… # Extracting both

branch_df3.iloc[[0,4],[1,2]]

Out[255… delhi mumbai

students avg_package

cse 2019 2 0

ece 2019 10 0

In [256… # sort index


# both -> descending -> diff order
# based on one level

branch_df3.sort_index(ascending=False)
branch_df3.sort_index(ascending=[False,True])
branch_df3.sort_index(level=0,ascending=[False])

Out[256… delhi mumbai

avg_package students avg_package students

ece 2019 9 10 0 0

2020 11 12 0 0

2021 13 14 0 0

2022 15 16 0 0

cse 2019 1 2 0 0

2020 3 4 0 0

2021 5 6 0 0

2022 7 8 0 0


In [257… # multiindex dataframe(col) -> transpose

branch_df3.transpose()

Out[257… cse ece

2019 2020 2021 2022 2019 2020 2021 2022

delhi avg_package 1 3 5 7 9 11 13 15

students 2 4 6 8 10 12 14 16

mumbai avg_package 0 0 0 0 0 0 0 0

students 0 0 0 0 0 0 0 0

In [258… # swaplevel

branch_df3.swaplevel(axis=1)

Out[258… avg_package students avg_package students

delhi delhi mumbai mumbai

cse 2019 1 2 0 0

2020 3 4 0 0

2021 5 6 0 0

2022 7 8 0 0

ece 2019 9 10 0 0

2020 11 12 0 0

2021 13 14 0 0

2022 15 16 0 0

Long Vs Wide Data

Wide format is where we have a single row for every data point with multiple columns
to hold the values of various attributes.


Long format is where, for each data point we have as many rows as the number of
attributes and each row contains the value of a particular attribute for a given data point.
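A tiny round-trip sketch with made-up numbers: melt converts wide to long, and pivot converts the long table back to wide.

wide = pd.DataFrame({'branch': ['cse', 'ece'], '2020': [100, 150], '2021': [120, 130]})

long = wide.melt(id_vars='branch', var_name='year', value_name='students')   # wide -> long
back = long.pivot(index='branch', columns='year', values='students')         # long -> wide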

In [259… # melt -> simple example branch


# wide to long

pd.DataFrame({'cse':[120]}).melt()

Out[259… variable value

0 cse 120

In [260… # melt -> branch with year


pd.DataFrame({'cse':[120],'ece':[100],'mech':[50]}).melt(var_name='branch',
value_name='num_students')

Out[260… branch num_students

0 cse 120

1 ece 100

2 mech 50

In [261… pd.DataFrame(
{
'branch':['cse','ece','mech'],
'2020':[100,150,60],
'2021':[120,130,80],
'2022':[150,140,70]
}
).melt(id_vars=['branch'],var_name='year',value_name='students')

Out[261… branch year students

0 cse 2020 100

1 ece 2020 150

2 mech 2020 60

3 cse 2021 120

4 ece 2021 130

5 mech 2021 80

6 cse 2022 150

7 ece 2022 140

8 mech 2022 70

In [262… # melt -> real world example


death = pd.read_csv("time_series_covid19_deaths_global.csv")
confirm = pd.read_csv("time_series_covid19_confirmed_global.csv")

In [263… death = death.melt(id_vars =['Province/State','Country/Region','Lat','Long'],


var_name = 'date',value_name= 'num_death')

In [264… confirm=confirm.melt(id_vars =['Province/State','Country/Region','Lat','Long'],


var_name = 'date',value_name= 'num_cases')
confirm

Out[264… Province/State Country/Region Lat Long date num_cases

0 NaN Afghanistan 33.939110 67.709953 1/22/20 0

1 NaN Albania 41.153300 20.168300 1/22/20 0

2 NaN Algeria 28.033900 1.659600 1/22/20 0

3 NaN Andorra 42.506300 1.521800 1/22/20 0

4 NaN Angola -11.202700 17.873900 1/22/20 0

... ... ... ... ... ... ...

West Bank and


311248 NaN 31.952200 35.233200 1/2/23 703228
Gaza

Winter Olympics
311249 NaN 39.904200 116.407400 1/2/23 535
2022

311250 NaN Yemen 15.552727 48.516388 1/2/23 11945

311251 NaN Zambia -13.133897 27.849332 1/2/23 334661

311252 NaN Zimbabwe -19.015438 29.154857 1/2/23 259981

311253 rows × 6 columns

In [265… confirm.merge(death,on = ['Province/State','Country/Region','Lat','Long','date']


)[['Country/Region','date','num_cases','num_death']]


Out[265… Country/Region date num_cases num_death

0 Afghanistan 1/22/20 0 0

1 Albania 1/22/20 0 0

2 Algeria 1/22/20 0 0

3 Andorra 1/22/20 0 0

4 Angola 1/22/20 0 0

... ... ... ... ...

311248 West Bank and Gaza 1/2/23 703228 5708

311249 Winter Olympics 2022 1/2/23 535 0

311250 Yemen 1/2/23 11945 2159

311251 Zambia 1/2/23 334661 4024

311252 Zimbabwe 1/2/23 259981 5637

311253 rows × 4 columns

Pivot Table
The pivot table takes simple column-wise data as input, and groups the entries into a
two-dimensional table that provides a multidimensional summarization of the data.

In [267… import seaborn as sns

In [268… df = sns.load_dataset('tips')
df.head()

Out[268… total_bill tip sex smoker day time size

0 16.99 1.01 Female No Sun Dinner 2

1 10.34 1.66 Male No Sun Dinner 3

2 21.01 3.50 Male No Sun Dinner 3

3 23.68 3.31 Male No Sun Dinner 2

4 24.59 3.61 Female No Sun Dinner 4

In [269… df.groupby('sex')[['total_bill']].mean()

Out[269… total_bill

sex

Male 20.744076

Female 18.056897

In [270… df.groupby(['sex','smoker'])[['total_bill']].mean().unstack()


Out[270… total_bill

smoker Yes No

sex

Male 22.284500 19.791237

Female 17.977879 18.105185

In [271… df.pivot_table(index = 'sex',columns = 'smoker',values = 'total_bill')

Out[271… smoker Yes No

sex

Male 22.284500 19.791237

Female 17.977879 18.105185

In [272… # aggfunc

df.pivot_table(index = 'sex',columns = 'smoker',values = 'total_bill',


aggfunc = 'std')

Out[272… smoker Yes No

sex

Male 9.911845 8.726566

Female 9.189751 7.286455

In [274… df


Out[274… total_bill tip sex smoker day time size

0 16.99 1.01 Female No Sun Dinner 2

1 10.34 1.66 Male No Sun Dinner 3

2 21.01 3.50 Male No Sun Dinner 3

3 23.68 3.31 Male No Sun Dinner 2

4 24.59 3.61 Female No Sun Dinner 4

... ... ... ... ... ... ... ...

239 29.03 5.92 Male No Sat Dinner 3

240 27.18 2.00 Female Yes Sat Dinner 2

241 22.67 2.00 Male Yes Sat Dinner 2

242 17.82 1.75 Male No Sat Dinner 2

243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

In [275… df.dtypes

Out[275… total_bill float64


tip float64
sex category
smoker category
day category
time category
size int64
dtype: object

In [276… df.pivot_table(index='sex', columns='smoker', values='size')

Out[276… smoker Yes No

sex

Male 2.500000 2.711340

Female 2.242424 2.592593

In [ ]:

In [277… # Multidimensional

df.pivot_table(index = ['sex','smoker'],columns = ['day','time'],


aggfunc={'size':'mean','tip':'max','total_bill':'sum'}
,margins=True)


Out[277… size

day Thur Fri Sat Sun All

time Lunch Dinner Lunch Dinner Dinner Dinner Lu

sex smoker

Male Yes 2.300000 NaN 1.666667 2.400000 2.629630 2.600000 2.500000

No 2.500000 NaN NaN 2.000000 2.656250 2.883721 2.711340

Female Yes 2.428571 NaN 2.000000 2.000000 2.200000 2.500000 2.242424

No 2.500000 2.0 3.000000 2.000000 2.307692 3.071429 2.592593

All 2.459016 2.0 2.000000 2.166667 2.517241 2.842105 2.569672

5 rows × 23 columns

In [278… # margins
df.pivot_table(index='sex',columns='smoker',values='total_bill',
aggfunc='sum',margins=True)

Out[278… smoker Yes No All

sex

Male 1337.07 1919.75 3256.82

Female 593.27 977.68 1570.95

All 1930.34 2897.43 4827.77

In [280… # plotting graphs

df = pd.read_csv("expense_data.csv")
df.head()


Out[280… Date Account Category Subcategory Note INR Income/Expense No

CUB -
3/2/2022
0 online Food NaN Brownie 50.0 Expense N
10:11
payment

CUB - To
3/2/2022
1 online Other NaN lended 300.0 Expense N
10:11
payment people

CUB -
3/1/2022
2 online Food NaN Dinner 78.0 Expense N
19:50
payment

CUB -
3/1/2022
3 online Transportation NaN Metro 30.0 Expense N
18:56
payment

CUB -
3/1/2022
4 online Food NaN Snacks 67.0 Expense N
18:22
payment

In [281… df['Date'] = pd.to_datetime(df['Date'])


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 277 entries, 0 to 276
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 277 non-null datetime64[ns]
1 Account 277 non-null object
2 Category 277 non-null object
3 Subcategory 0 non-null float64
4 Note 273 non-null object
5 INR 277 non-null float64
6 Income/Expense 277 non-null object
7 Note.1 0 non-null float64
8 Amount 277 non-null float64
9 Currency 277 non-null object
10 Account.1 277 non-null float64
dtypes: datetime64[ns](1), float64(5), object(5)
memory usage: 23.9+ KB

In [282… df['month'] = df['Date'].dt.month_name()

In [283… df.pivot_table(index = 'month',columns = 'Category',values = 'INR',


aggfunc = 'sum',fill_value = 0
).plot()

Out[283… <Axes: xlabel='month'>


In [284… df.pivot_table(index = 'month',columns ='Income/Expense',


values = 'INR',aggfunc = 'sum',fill_value = 0
).plot()

Out[284… <Axes: xlabel='month'>


In [285… df.pivot_table(index='month',columns='Account',values='INR'
,aggfunc='sum',fill_value=0).plot()

Out[285… <Axes: xlabel='month'>

Pandas Strings
In [286… # What are vectorized operation

a = np.array([1,2,3,4])
a * 4

Out[286… array([ 4, 8, 12, 16])

In [287… # problem in vectorized opertions in vanilla python

s = ['cat','mat',None,'rat']

[i.startswith('c') for i in s]


---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[287], line 5
1 # problem in vectorized opertions in vanilla python
3 s = ['cat','mat',None,'rat']
----> 5 [i.startswith('c') for i in s]

Cell In[287], line 5, in <listcomp>(.0)


1 # problem in vectorized opertions in vanilla python
3 s = ['cat','mat',None,'rat']
----> 5 [i.startswith('c') for i in s]

AttributeError: 'NoneType' object has no attribute 'startswith'

In [288… # How pandas solves this issue?

s = pd.Series(['cat','mat',None,'rat'])
# string accessor
s.str.startswith('c')

# fast and optimized

Out[288… 0 True
1 False
2 None
3 False
dtype: object

In [289… # import titanic

df = pd.read_csv("titanic.csv")
df.head()


Out[289…   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket  ...
0                    1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171  ...
1                    2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  ...
2                    3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282  ...
3                    4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  ...
4                    5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450  ...

In [290… # Common Functions


# lower/upper/capitalize/title

df['Name'].str.lower()
df['Name'].str.upper()
df['Name'].str.capitalize()
df['Name'].str.title()

# len
df['Name'].str.len()
df['Name'].str.len().max()
df['Name'].str.len() == 82
df['Name'][df['Name'].str.len() == 82].values[0]

# Strip
" Gourab ".strip()
df['Name'].str.strip()


Out[290… 0 Braund, Mr. Owen Harris


1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [291… # split -->> get

df['lastname'] = df['Name'].str.split(',').str.get(0)
df.head()

Out[291…   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket  ...
0                    1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171  ...
1                    2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  ...
2                    3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282  ...
3                    4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  ...
4                    5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450  ...

In [293… df[['title','firstname']] = df["Name"].str.split(',').str.get(1).str.strip().str.split(' ', n=1, expand=True)

df.head()

df['title'].value_counts()


Out[293… title
Mr. 517
Miss. 182
Mrs. 125
Master. 40
Dr. 7
Rev. 6
Mlle. 2
Major. 2
Col. 2
the 1
Capt. 1
Ms. 1
Sir. 1
Lady. 1
Mme. 1
Don. 1
Jonkheer. 1
Name: count, dtype: int64

In [294… # Replace

df['title'] = df['title'].str.replace('Ms.','Miss.')
df['title'] = df['title'].str.replace('Mlle.','Miss.')

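The same normalization can also be written as a single mapping with Series.replace, which keeps all the substitutions in one place (a sketch covering only the two titles handled above):

# One pass instead of two chained str.replace calls
df['title'] = df['title'].replace({'Ms.': 'Miss.', 'Mlle.': 'Miss.'})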
In [295… df['title'].value_counts()

Out[295… title
Mr. 517
Miss. 185
Mrs. 125
Master. 40
Dr. 7
Rev. 6
Major. 2
Col. 2
Don. 1
Mme. 1
Lady. 1
Sir. 1
Capt. 1
the 1
Jonkheer. 1
Name: count, dtype: int64

In [296… # filtering
# startswith/endswith

df[df['firstname'].str.endswith('A')]

# isdigit/isalpha...
df[df['firstname'].str.isdigit()]

Out[296… PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin E


In [297… # applying regex


# contains
# search 'john' irrespective of case (case-insensitive match)
df[df['firstname'].str.contains('john',case=False)]

# find lastnames that start and end with a non-vowel character


df[df['lastname'].str.contains('^[^aeiouAEIOU].+[^aeiouAEIOU]$')]

Out[297…   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket  ...
0                    1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171  ...
1                    2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  ...
2                    3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282  ...
5                    6         0       3                                    Moran, Mr. James    male   NaN      0      0            330877  ...
6                    7         0       1                             McCarthy, Mr. Timothy J    male  54.0      0      0             17463  ...
..                 ...       ...     ...                                                 ...     ...   ...    ...    ...               ...  ...
884                885         0       3                              Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076  ...
887                888         1       1                        Graham, Miss. Margaret Edith  female  19.0      0      0            112053  ...
888                889         0       3            Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607  ...
889                890         1       1                               Behr, Mr. Karl Howell    male  26.0      0      0            111369  ...
890                891         0       3                                 Dooley, Mr. Patrick    male  32.0      0      0            370376  ...

671 rows × 15 columns


In [298… # slicing
df['Name'].str[::-1]

Out[298… 0 sirraH newO .rM ,dnuarB


1 )reyahT sggirB ecnerolF( yeldarB nhoJ .srM ,sg...
2 aniaL .ssiM ,nenikkieH
3 )leeP yaM yliL( htaeH seuqcaJ .srM ,ellertuF
4 yrneH mailliW .rM ,nellA
...
886 sazouJ .veR ,alivtnoM
887 htidE teragraM .ssiM ,maharG
888 "eirraC" neleH enirehtaC .ssiM ,notsnhoJ
889 llewoH lraK .rM ,rheB
890 kcirtaP .rM ,yelooD
Name: Name, Length: 891, dtype: object

Vectorized Date & Time

Timestamp Object
Time stamps reference particular moments in time (e.g., Oct
24th, 2022 at 7:00pm)
Creating Timestamp objects

In [299… # Creating a timestamp

pd.Timestamp('2023/03/05')

Out[299… Timestamp('2023-03-05 00:00:00')

In [300… type(pd.Timestamp('2023/03/05'))

Out[300… pandas._libs.tslibs.timestamps.Timestamp

In [301… # Variation
pd.Timestamp('2023-1-5')
pd.Timestamp('2023, 1, 5')

Out[301… Timestamp('2023-01-05 00:00:00')

In [302… # Only Year


pd.Timestamp('2023')

Out[302… Timestamp('2023-01-01 00:00:00')

In [303… # Using Text

pd.Timestamp('9th feb 2023')

Out[303… Timestamp('2023-02-09 00:00:00')

In [304… # Providing time also


pd.Timestamp('9th feb 2023 11:59AM ')

Out[304… Timestamp('2023-02-09 11:59:00')

In [305… # using datetime.datetime object

import datetime as dt

dt.datetime(2023,1,5,9,21,56)

Out[305… datetime.datetime(2023, 1, 5, 9, 21, 56)

In [306… x = pd.Timestamp(dt.datetime(2023,1,5,9,21,56))
x

Out[306… Timestamp('2023-01-05 09:21:56')

In [307… # Fetching attributes

x.year
x.month
x.day
x.hour
x.minute
x.second

Out[307… 56

Why separate objects to handle date and time when Python already has datetime functionality?

Syntax-wise, datetime is very convenient, but its performance takes a hit while working with huge data (the same trade-off as a Python list vs a NumPy array).

The weaknesses of Python's datetime format inspired the NumPy team to add a set of
native time series data types to NumPy.

The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of
dates to be represented very compactly.

In [308… date = np.array('2015-07-04', dtype=np.datetime64)


date

Out[308… array('2015-07-04', dtype='datetime64[D]')

In [309… date + np.arange(12)

Out[309… array(['2015-07-04', '2015-07-05', '2015-07-06', '2015-07-07',


'2015-07-08', '2015-07-09', '2015-07-10', '2015-07-11',
'2015-07-12', '2015-07-13', '2015-07-14', '2015-07-15'],
dtype='datetime64[D]')

Because of the uniform type in NumPy datetime64 arrays, this type of operation can be accomplished much more quickly than if we were working directly with Python's datetime objects, especially as arrays get large.

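A rough way to see that gap is to build the same run of dates both ways; a minimal sketch (the 100_000-day range is just an illustrative size, and the two approaches would be timed with %timeit):

import datetime as dt
import numpy as np

# Vectorized: one datetime64 value plus an integer range, evaluated in C
base = np.datetime64('2015-07-04')
vectorized = base + np.arange(100_000)   # datetime64[D] array

# Pure Python: the same dates built one at a time from datetime objects
start = dt.date(2015, 7, 4)
looped = [start + dt.timedelta(days=i) for i in range(100_000)]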
Pandas' Timestamp object combines the ease of use of Python's datetime with the efficient storage and vectorized interface of numpy.datetime64.

From a group of these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in a Series or DataFrame.

DatetimeIndex Object
A collection of pandas timestamp

In [310… # from strings

pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1'])

Out[310… DatetimeIndex(['2023-01-01', '2022-01-01', '2021-01-01'], dtype='datetime64[ns]', freq=None)

In [311… # from strings


type(pd.DatetimeIndex(['2023/1/1','2022/1/1','2021/1/1']))

Out[311… pandas.core.indexes.datetimes.DatetimeIndex

A Timestamp object is used to store a single date, whereas a DatetimeIndex is used to
store multiple dates.

In [312… # using python datetime object

pd.DatetimeIndex([dt.datetime(2023,1,1),dt.datetime(2022,1,1),
dt.datetime(2021,1,1)])

Out[312… DatetimeIndex(['2023-01-01', '2022-01-01', '2021-01-01'], dtype='datetime64[ns]', freq=None)

In [313… # using pd.timestamps

dt_index = pd.DatetimeIndex([pd.Timestamp(2023,1,1),
pd.Timestamp(2022,1,1),pd.Timestamp(2021,1,1)])

In [314… # using datatimeindex as series index

pd.Series([1,2,3],index=dt_index)

Out[314… 2023-01-01 1
2022-01-01 2
2021-01-01 3
dtype: int64

Date Range Function



In [315… # generate dates at a 2-year (year-end) frequency in a given range

pd.date_range(start ='1997/2/9',end = '2023/2/9',freq = '2Y')

Out[315… DatetimeIndex(['1997-12-31', '1999-12-31', '2001-12-31', '2003-12-31',


'2005-12-31', '2007-12-31', '2009-12-31', '2011-12-31',
'2013-12-31', '2015-12-31', '2017-12-31', '2019-12-31',
'2021-12-31'],
dtype='datetime64[ns]', freq='2YE-DEC')

In [316… # generate dates every 3 days in a given range

pd.date_range(start='2023/1/5',end='2023/2/28',freq='3D')

Out[316… DatetimeIndex(['2023-01-05', '2023-01-08', '2023-01-11', '2023-01-14',


'2023-01-17', '2023-01-20', '2023-01-23', '2023-01-26',
'2023-01-29', '2023-02-01', '2023-02-04', '2023-02-07',
'2023-02-10', '2023-02-13', '2023-02-16', '2023-02-19',
'2023-02-22', '2023-02-25', '2023-02-28'],
dtype='datetime64[ns]', freq='3D')

In [317… # B -> business days

pd.date_range(start='2023/1/5',end='2023/2/28',freq='B')

Out[317… DatetimeIndex(['2023-01-05', '2023-01-06', '2023-01-09', '2023-01-10',


'2023-01-11', '2023-01-12', '2023-01-13', '2023-01-16',
'2023-01-17', '2023-01-18', '2023-01-19', '2023-01-20',
'2023-01-23', '2023-01-24', '2023-01-25', '2023-01-26',
'2023-01-27', '2023-01-30', '2023-01-31', '2023-02-01',
'2023-02-02', '2023-02-03', '2023-02-06', '2023-02-07',
'2023-02-08', '2023-02-09', '2023-02-10', '2023-02-13',
'2023-02-14', '2023-02-15', '2023-02-16', '2023-02-17',
'2023-02-20', '2023-02-21', '2023-02-22', '2023-02-23',
'2023-02-24', '2023-02-27', '2023-02-28'],
dtype='datetime64[ns]', freq='B')

In [318… # W -> weekly frequency (W-MON anchors each week on Monday)

pd.date_range(start='2023/1/5',end='2023/2/28',freq='W-MON')

Out[318… DatetimeIndex(['2023-01-09', '2023-01-16', '2023-01-23', '2023-01-30',


'2023-02-06', '2023-02-13', '2023-02-20', '2023-02-27'],
dtype='datetime64[ns]', freq='W-MON')

In [319… # H -> hourly frequency (here in 6-hour steps)

pd.date_range(start='2023/1/5',end='2023/2/28',freq='6H')


Out[319… DatetimeIndex(['2023-01-05 00:00:00', '2023-01-05 06:00:00',


'2023-01-05 12:00:00', '2023-01-05 18:00:00',
'2023-01-06 00:00:00', '2023-01-06 06:00:00',
'2023-01-06 12:00:00', '2023-01-06 18:00:00',
'2023-01-07 00:00:00', '2023-01-07 06:00:00',
...
'2023-02-25 18:00:00', '2023-02-26 00:00:00',
'2023-02-26 06:00:00', '2023-02-26 12:00:00',
'2023-02-26 18:00:00', '2023-02-27 00:00:00',
'2023-02-27 06:00:00', '2023-02-27 12:00:00',
'2023-02-27 18:00:00', '2023-02-28 00:00:00'],
dtype='datetime64[ns]', length=217, freq='6h')

In [320… # M -> Month end

pd.date_range(start='2023/1/5',end='2023/2/28',freq='M')

Out[320… DatetimeIndex(['2023-01-31', '2023-02-28'], dtype='datetime64[ns]', freq='ME')

In [321… # MS -> Month start

pd.date_range(start='2023/1/5',end='2023/2/28',freq='MS')

Out[321… DatetimeIndex(['2023-02-01'], dtype='datetime64[ns]', freq='MS')

In [322… # A -> Year end

pd.date_range(start='2023/1/5',end='2030/2/28',freq='A')

Out[322… DatetimeIndex(['2023-12-31', '2024-12-31', '2025-12-31', '2026-12-31',


'2027-12-31', '2028-12-31', '2029-12-31'],
dtype='datetime64[ns]', freq='YE-DEC')

In [323… # using periods (the number of periods to generate)

pd.date_range(start='2023/1/5',periods=25,freq='M')

Out[323… DatetimeIndex(['2023-01-31', '2023-02-28', '2023-03-31', '2023-04-30',


'2023-05-31', '2023-06-30', '2023-07-31', '2023-08-31',
'2023-09-30', '2023-10-31', '2023-11-30', '2023-12-31',
'2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30',
'2024-05-31', '2024-06-30', '2024-07-31', '2024-08-31',
'2024-09-30', '2024-10-31', '2024-11-30', '2024-12-31',
'2025-01-31'],
dtype='datetime64[ns]', freq='ME')

to_datetime function
Converts an existing object to a pandas Timestamp/DatetimeIndex object.

In [325… # simple series example

s = pd.Series(['2023/1/1','2022/1/1','2021/1/1'])
pd.to_datetime(s).dt.day_name() #dt.year,.month,.day,month_name...


Out[325… 0 Sunday
1 Saturday
2 Friday
dtype: object

In [326… # With errors

s = pd.Series(['2023/1/1','2022/1/1','2021/130/1'])
pd.to_datetime(s,errors = 'coerce').dt.year

Out[326… 0 2023.0
1 2022.0
2 NaN
dtype: float64

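When the string layout is known in advance, an explicit format string is another common option: parsing becomes strict (non-matching rows raise unless errors='coerce' is also passed) and is usually faster. A minimal sketch:

# Explicit format: every value must match '%Y/%m/%d'
s = pd.Series(['2023/1/1', '2022/1/1', '2021/1/1'])
pd.to_datetime(s, format='%Y/%m/%d')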
In [327… df = pd.read_csv("expense_data.csv")
df.shape

Out[327… (277, 11)

In [328… df['Date'] = pd.to_datetime(df['Date'])

dt accessor
Accessor object for datetimelike properties of the Series values.

In [329… df['Date'].dt.is_quarter_start

Out[329… 0 False
1 False
2 False
3 False
4 False
...
272 False
273 False
274 False
275 False
276 False
Name: Date, Length: 277, dtype: bool

In [330… # plot graph

import matplotlib.pyplot as plt


plt.plot(df['Date'],df['INR'])

Out[330… [<matplotlib.lines.Line2D at 0x2661832ecd0>]


In [331… # day name wise bar chart/month wise bar chart

df['day_name'] = df['Date'].dt.day_name()

In [332… df.groupby('day_name')['INR'].mean().plot(kind='bar')

Out[332… <Axes: xlabel='day_name'>


In [333… df['month_name'] = df['Date'].dt.month_name()

In [334… df.groupby('month_name')['INR'].sum().plot(kind='bar')

Out[334… <Axes: xlabel='month_name'>


In [335… df[df['Date'].dt.is_month_end]


Out[335…                  Date               Account        Category  Subcategory                Note     INR Income/Expense
7    2022-02-28 11:56:00  CUB - online payment            Food          NaN               Pizza  339.15        Expense
8    2022-02-28 11:45:00  CUB - online payment           Other          NaN         From kumara  200.00         Income
61   2022-01-31 08:44:00  CUB - online payment  Transportation          NaN          Vnr to apk   50.00        Expense
62   2022-01-31 08:27:00  CUB - online payment           Other          NaN            To vicky  200.00        Expense
63   2022-01-31 08:26:00  CUB - online payment  Transportation          NaN      To ksr station  153.00        Expense
242  2021-11-30 14:24:00  CUB - online payment            Gift          NaN    Bharath birthday  115.00        Expense
243  2021-11-30 14:17:00  CUB - online payment            Food          NaN  Lunch with company  128.00        Expense
244  2021-11-30 10:11:00  CUB - online payment            Food          NaN           Breakfast   70.00        Expense

Time Series Analysis


In [336… # till now
# Timestamp
pd.Timestamp('6th jan 2023 8:10')

# DatetimeIndex -> df and series index


pd.DatetimeIndex([pd.Timestamp('6th jan 2023 8:10'
),pd.Timestamp('7th jan 2023 8:10'
),pd.Timestamp('8th jan 2023 8:10')
])[0]

# date_range()
pd.date_range(start='2023-1-6',end='2023-1-31',freq='D')

# to_datetime()
s = pd.Series(['2023/1/6','2023/1/7','2023/1/7'])
pd.to_datetime(s).dt.day_name()


Out[336… 0 Friday
1 Saturday
2 Saturday
dtype: object

Timedelta Object
Represents a duration, the difference between two dates or times.

In [337… # create using Timestamp objects


t1 = pd.Timestamp('6th Jan 2023 8:20:14')
t2 = pd.Timestamp('26th Jan 2023 10:00:00')

t2 - t1

Out[337… Timedelta('20 days 01:39:46')

In [338… # standalone creation


pd.Timedelta(days=2,hours=10,minutes=35)

Out[338… Timedelta('2 days 10:35:00')

In [339… # Arithmetic
pd.Timestamp('6th jan 2023') + pd.Timedelta(days=2,hours=10,minutes=35)

Out[339… Timestamp('2023-01-08 10:35:00')

In [340… pd.date_range(
start='2023-1-6',end='2023-1-31',freq='D') - pd.Timedelta(
days=2,hours=10,minutes=35)

Out[340… DatetimeIndex(['2023-01-03 13:25:00', '2023-01-04 13:25:00',


'2023-01-05 13:25:00', '2023-01-06 13:25:00',
'2023-01-07 13:25:00', '2023-01-08 13:25:00',
'2023-01-09 13:25:00', '2023-01-10 13:25:00',
'2023-01-11 13:25:00', '2023-01-12 13:25:00',
'2023-01-13 13:25:00', '2023-01-14 13:25:00',
'2023-01-15 13:25:00', '2023-01-16 13:25:00',
'2023-01-17 13:25:00', '2023-01-18 13:25:00',
'2023-01-19 13:25:00', '2023-01-20 13:25:00',
'2023-01-21 13:25:00', '2023-01-22 13:25:00',
'2023-01-23 13:25:00', '2023-01-24 13:25:00',
'2023-01-25 13:25:00', '2023-01-26 13:25:00',
'2023-01-27 13:25:00', '2023-01-28 13:25:00'],
dtype='datetime64[ns]', freq='D')

In [346… # real life example


df = pd.read_csv('deliveries.csv')
df.head()


Out[346… order_date delivery_date

0 5/24/98 2/5/99

1 4/22/92 3/6/98

2 2/10/91 8/26/92

3 7/21/92 11/20/97

4 9/2/93 6/10/98

In [345… df.columns

Out[345… Index(['match_id', 'inning', 'batting_team', 'bowling_team', 'over', 'ball',


'batsman', 'non_striker', 'bowler', 'is_super_over', 'wide_runs',
'bye_runs', 'legbye_runs', 'noball_runs', 'penalty_runs',
'batsman_runs', 'extra_runs', 'total_runs', 'player_dismissed',
'dismissal_kind', 'fielder'],
dtype='object')

In [347… df['order_date'] = pd.to_datetime(df['order_date'])


df['delivery_date'] = pd.to_datetime(df['delivery_date'])

In [348… df['delivery_time_period'] = df['delivery_date'] - df['order_date']

df['delivery_time_period'].mean()

Out[348… Timedelta('1217 days 22:53:53.532934128')

Time series
A time series is a data set that tracks a sample over time. In particular, a time series
allows one to see what factors influence certain variables from period to period. Time
series analysis can be useful to see how a given asset, security, or economic variable
changes over time.

Examples

Financial Data (Company stocks)


Natural Data (Rainfall measurement)
Event Data (Covid)
Medical Data (Heart rate monitoring)

Types of Operations done on Time Series

Time Series Analysis


Time Series Forecasting

In [349… google = pd.read_csv('google.csv')


google.head()


Out[349… Date Open High Low Close Adj Close Volume

0 2004-08-19 49.813290 51.835709 47.800831 49.982655 49.982655 44871361

1 2004-08-20 50.316402 54.336334 50.062355 53.952770 53.952770 22942874

2 2004-08-23 55.168217 56.528118 54.321388 54.495735 54.495735 18342897

3 2004-08-24 55.412300 55.591629 51.591621 52.239197 52.239197 15319808

4 2004-08-25 52.284027 53.798351 51.746044 52.802086 52.802086 9232276

In [350… google['Date'] = pd.to_datetime(google['Date'])

In [351… google.set_index('Date',inplace=True)

In [352… # fetch a specific date


google.loc['2021-12-30']

Out[352… Open 2929.000000


High 2941.250000
Low 2915.169922
Close 2920.050049
Adj Close 2920.050049
Volume 648900.000000
Name: 2021-12-30 00:00:00, dtype: float64

In [353… google['month_name'] = google.index.month_name()


google['weekday_name'] = google.index.day_name()
google['quarter'] = google.index.quarter

google.head()

Out[353…                 Open       High        Low      Close  Adj Close    Volume month_name  ...
Date
2004-08-19  49.813290  51.835709  47.800831  49.982655  49.982655  44871361     August  ...
2004-08-20  50.316402  54.336334  50.062355  53.952770  53.952770  22942874     August  ...
2004-08-23  55.168217  56.528118  54.321388  54.495735  54.495735  18342897     August  ...
2004-08-24  55.412300  55.591629  51.591621  52.239197  52.239197  15319808     August  ...
2004-08-25  52.284027  53.798351  51.746044  52.802086  52.802086   9232276     August  ...

In [354… # challenge: fetch info for a particular date every year (a limitation of Timedelta; DateOffset handles calendar years)
google.head()


google[google.index.isin(pd.date_range(
start='2005-1-6',end='2022-1-6',freq=pd.DateOffset(years=1)))]

Out[354…            Open         High          Low        Close    Adj Close    Volume  ...
Date
2005-01-06    97.175758    97.584229    93.509506    93.922951    93.922951  20852067  ...
2006-01-06   227.581970   234.371521   225.773743   231.960556   231.960556  35646914  ...
2009-01-06   165.868286   169.763687   162.585587   166.406265   166.406265  12898566  ...
2010-01-06   311.761444   311.761444   302.047852   302.994293   302.994293   7987226  ...
2011-01-06   304.199799   308.060303   303.885956   305.604523   305.604523   4131026  ...
2012-01-06   328.344299   328.767700   323.681763   323.796326   323.796326   5405987  ...
2014-01-06   554.426880   557.340942   551.154114   556.573853   556.573853   3551864  ...
2015-01-06   513.589966   514.761719   499.678131   500.585632   500.585632   2899940  ...
2016-01-06   730.000000   747.179993   728.919983   743.619995   743.619995   1947000  ...
2017-01-06   795.260010   807.900024   792.203979   806.150024   806.150024   1640200  ...
2020-01-06  1350.000000  1396.500000  1350.000000  1394.209961  1394.209961   1732300  ...
2021-01-06  1702.630005  1748.000000  1699.000000  1735.290039  1735.290039   2602100  ...
2022-01-06  2749.949951  2793.719971  2735.270020  2751.020020  2751.020020   1452500  ...

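An alternative sketch for the same "one date every year" lookup, filtering directly on the index instead of building an explicit date_range:

# Every row whose index falls on 6 January, regardless of the year
google[(google.index.month == 1) & (google.index.day == 6)]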
In [355… # viz a single col


google['Close'].plot()

Out[355… <Axes: xlabel='Date'>


In [356… google.loc['2021-12']['Close'].plot()

Out[356… <Axes: xlabel='Date'>

In [357… google.groupby('month_name')['Close'].mean().plot(kind='bar')

Out[357… <Axes: xlabel='month_name'>

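One caveat with the chart above: grouping on month_name orders the bars alphabetically, not chronologically. A sketch that reindexes the grouped means into calendar order (month_order is just a helper list introduced here):

# Reorder so the bar chart runs January..December
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
google.groupby('month_name')['Close'].mean().reindex(month_order).plot(kind='bar')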

In [358… # quarterly trend


google.groupby('quarter')['Close'].mean().plot(kind='bar')

Out[358… <Axes: xlabel='quarter'>


In [359… # frequency
google.index

Out[359… DatetimeIndex(['2004-08-19', '2004-08-20', '2004-08-23', '2004-08-24',


'2004-08-25', '2004-08-26', '2004-08-27', '2004-08-30',
'2004-08-31', '2004-09-01',
...
'2022-05-09', '2022-05-10', '2022-05-11', '2022-05-12',
'2022-05-13', '2022-05-16', '2022-05-17', '2022-05-18',
'2022-05-19', '2022-05-20'],
dtype='datetime64[ns]', name='Date', length=4471, freq=None)

In [360… # asfreq
google.asfreq('6H',method='bfill')


Out[360…                          Open         High          Low        Close    Adj Close    Volume  ...
Date
2004-08-19 00:00:00    49.813290    51.835709    47.800831    49.982655    49.982655  44871361  ...
2004-08-19 06:00:00    50.316402    54.336334    50.062355    53.952770    53.952770  22942874  ...
2004-08-19 12:00:00    50.316402    54.336334    50.062355    53.952770    53.952770  22942874  ...
2004-08-19 18:00:00    50.316402    54.336334    50.062355    53.952770    53.952770  22942874  ...
2004-08-20 00:00:00    50.316402    54.336334    50.062355    53.952770    53.952770  22942874  ...
...                           ...          ...          ...          ...          ...       ...  ...
2022-05-19 00:00:00  2236.820068  2271.750000  2209.360107  2214.909912  2214.909912   1459600  ...
2022-05-19 06:00:00  2241.709961  2251.000000  2127.459961  2186.260010  2186.260010   1878100  ...
2022-05-19 12:00:00  2241.709961  2251.000000  2127.459961  2186.260010  2186.260010   1878100  ...
2022-05-19 18:00:00  2241.709961  2251.000000  2127.459961  2186.260010  2186.260010   1878100  ...
2022-05-20 00:00:00  2241.709961  2251.000000  2127.459961  2186.260010  2186.260010   1878100  ...

25933 rows × 9 columns

Resampling
Resampling involves changing the frequency of your time series observations.

Two types of resampling are:

Upsampling: Where you increase the frequency of the samples, such as from minutes to
seconds.


Downsampling: Where you decrease the frequency of the samples, such as from days to
months.

In [361… # Upsampling
google['Close'].resample('12H').interpolate(method='spline',order=2).plot()

Out[361… <Axes: xlabel='Date'>

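The cell above upsamples; a downsampling counterpart (a sketch, assuming the same google DataFrame with its DatetimeIndex is still in scope) aggregates the daily closes down to one value per month:

# Downsampling: monthly mean of the daily closing prices
google['Close'].resample('M').mean().plot()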
Rolling Window (Smoothing)
Time series data in its original format can be quite volatile, especially at smaller
aggregation levels. Rolling, or moving, averages are a useful technique for smoothing
time series data.

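Rolling windows are exposed through the rolling() method; a minimal smoothing sketch on the same closing-price series (the 30-day window is just an illustrative choice, and it assumes the google DataFrame from the cells above):

# Raw daily closes vs. a 30-day moving average
google['Close'].plot(alpha=0.4)
google['Close'].rolling(window=30).mean().plot()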
Shifting
The shift() function in Pandas is used to shift the entire series up or down by the
desired number of periods.

In [362… # shift example


df = pd.read_csv('login.csv',header=None)
df = df[[1,2]]
df.head()
df.rename(columns={1:'user_id',2:'login_time'},inplace=True)
df.head()


Out[362… user_id login_time

0 466 2017-01-07 18:24:07

1 466 2017-01-07 18:24:55

2 458 2017-01-07 18:25:18

3 458 2017-01-07 18:26:21

4 592 2017-01-07 19:09:59

In [363… import warnings


warnings.filterwarnings('ignore')

In [364… user_df = df[df['user_id'] == 458]


user_df.head()

Out[364… user_id login_time

2 458 2017-01-07 18:25:18

3 458 2017-01-07 18:26:21

9 458 2017-01-09 11:13:12

10 458 2017-01-09 11:34:02

25 458 2017-01-10 12:14:11

In [365… user_df['login_time'] = pd.to_datetime(user_df['login_time'])


user_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 208 entries, 2 to 1018
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 208 non-null int64
1 login_time 208 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 4.9 KB

In [366… user_df['shifted'] = user_df['login_time'].shift(1)


(user_df['login_time'] - user_df['shifted']).mean()

Out[366… Timedelta('0 days 17:29:22.053140096')

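The same average gap between consecutive logins can be computed without the helper column: diff() subtracts each timestamp from the one before it (a sketch):

# Average time between consecutive logins for this user
user_df['login_time'].diff().mean()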
In [367… ax = df.plot(subplots=True,
layout=(3, 2),
sharex=False,
sharey=False,
linewidth=0.7,
fontsize=10,
legend=False,
figsize=(20,15))
