0% found this document useful (0 votes)
66 views13 pages

Data Frames and Charts: 2.1 Working With Dataframes

The document discusses working with pandas DataFrames. It shows how to load a dataset into a DataFrame, display the first few records, find metadata like column names and shapes. It also demonstrates slicing and indexing DataFrames, value counts, cross tabulations, sorting, creating new columns, grouping and aggregating like finding average sold price by age. Finally, it joins DataFrames to compare average auction prices for different player ages and roles.

Uploaded by

Pratyush Barua
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
66 views13 pages

Data Frames and Charts: 2.1 Working With Dataframes

The document discusses working with pandas DataFrames. It shows how to load a dataset into a DataFrame, display the first few records, find metadata like column names and shapes. It also demonstrates slicing and indexing DataFrames, value counts, cross tabulations, sorting, creating new columns, grouping and aggregating like finding average sold price by age. Finally, it joins DataFrames to compare average auction prices for different player ages and roles.

Uploaded by

Pratyush Barua
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Data Frames and Charts

2.1 Working with DataFrames

2.1.1 Loading the dataset onto a DataFrame

import pandas as pd

pd.set_option('display.float_format', lambda x: '%.3f' % x)

pd.read_csv?

ipl_auction_df = pd.read_csv( 'IPL IMB381IPL2013.csv' )

type(ipl_auction_df)
pandas.core.frame.DataFrame

2.1.2 Displaying first few records of the DataFrame

pd.set_option('display.max_columns', 7)

ipl_auction_df.head(5)

Sl.NO. PLAYER NAME AGE ... AUCTION YEAR BASE PRICE SOLD PRICE

0 1 Abdulla, YA 2 ... 2009 50000 50000

1 2 Abdur Razzak 2 ... 2008 50000 50000

2 3 Agarkar, AB 2 ... 2008 200000 350000

3 4 Ashwin, R 1 ... 2011 100000 850000

4 5 Badrinath, S 2 ... 2011 100000 800000

5 rows × 26 columns

2.1.3 Finding metadata of the DataFrame


list(ipl_auction_df.columns)
['Sl.NO.',
'PLAYER NAME',
'AGE',
'COUNTRY',
'TEAM',
'PLAYING ROLE',
'T-RUNS',
'T-WKTS',
'ODI-RUNS-S',
'ODI-SR-B',
'ODI-WKTS',
'ODI-SR-BL',
'CAPTAINCY EXP',
'RUNS-S',
'HS',
'AVE',
'SR-B',
'SIXERS',
'RUNS-C',
'WKTS',
'AVE-BL',
'ECON',
'SR-BL',
'AUCTION YEAR',
'BASE PRICE',
'SOLD PRICE']
ipl_auction_df.head(5).transpose()

0 1 2 3 4

Sl.NO. 1 2 3 4 5

PLAYER NAME Abdulla, YA Abdur Razzak Agarkar, AB Ashwin, R Badrinath, S

AGE 2 2 2 1 2

COUNTRY SA BAN IND IND IND

TEAM KXIP RCB KKR CSK CSK

PLAYING ROLE Allrounder Bowler Bowler Bowler Batsman

T-RUNS 0 214 571 284 63

T-WKTS 0 18 58 31 0

ODI-RUNS-S 0 657 1269 241 79

ODI-SR-B 0.000 71.410 80.620 84.560 45.930

ODI-WKTS 0 185 288 51 0

ODI-SR-BL 0.000 37.600 32.900 36.800 0.000

CAPTAINCY EXP 0 0 0 0 0

RUNS-S 0 0 167 58 1317

HS 0 0 39 11 71

AVE 0.000 0.000 18.560 5.800 32.930

SR-B 0.000 0.000 121.010 76.320 120.710

SIXERS 0 0 5 0 28

RUNS-C 307 29 1059 1125 0

WKTS 15 0 29 49 0

AVE-BL 20.470 0.000 36.520 22.960 0.000

ECON 8.900 14.500 8.810 6.230 0.000

SR-BL 13.930 0.000 24.900 22.140 0.000

AUCTION YEAR 2009 2008 2008 2011 2011

BASE PRICE 50000 50000 200000 100000 100000

SOLD PRICE 50000 50000 350000 850000 800000

ipl_auction_df.shape
(130, 26)

2.1.4 Finding Summary of the DataFrame


ipl_auction_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 26 columns):
Sl.NO. 130 non-null int64
PLAYER NAME 130 non-null object
AGE 130 non-null int64
COUNTRY 130 non-null object
TEAM 130 non-null object
PLAYING ROLE 130 non-null object
T-RUNS 130 non-null int64
T-WKTS 130 non-null int64
ODI-RUNS-S 130 non-null int64
ODI-SR-B 130 non-null float64
ODI-WKTS 130 non-null int64
ODI-SR-BL 130 non-null float64
CAPTAINCY EXP 130 non-null int64
RUNS-S 130 non-null int64
HS 130 non-null int64
AVE 130 non-null float64
SR-B 130 non-null float64
SIXERS 130 non-null int64
RUNS-C 130 non-null int64
WKTS 130 non-null int64
AVE-BL 130 non-null float64
ECON 130 non-null float64
SR-BL 130 non-null float64
AUCTION YEAR 130 non-null int64
BASE PRICE 130 non-null int64
SOLD PRICE 130 non-null int64
dtypes: float64(7), int64(15), object(4)
memory usage: 26.5+ KB

2.1.5 Slicing and Indexing a dataframe

Selecting Rows by Indexes

ipl_auction_df[0:5]

Sl.NO. PLAYER NAME AGE ... AUCTION YEAR BASE PRICE SOLD PRICE

0 1 Abdulla, YA 2 ... 2009 50000 50000

1 2 Abdur Razzak 2 ... 2008 50000 50000

2 3 Agarkar, AB 2 ... 2008 200000 350000

3 4 Ashwin, R 1 ... 2011 100000 850000

4 5 Badrinath, S 2 ... 2011 100000 800000

5 rows × 26 columns
ipl_auction_df[-5:]

Sl.NO. PLAYER NAME AGE ... AUCTION YEAR BASE PRICE SOLD PRICE

125 126 Yadav, AS 2 ... 2010 50000 750000

126 127 Younis Khan 2 ... 2008 225000 225000

127 128 Yuvraj Singh 2 ... 2011 400000 1800000

128 129 Zaheer Khan 2 ... 2008 200000 450000

129 130 Zoysa, DNT 2 ... 2008 100000 110000

5 rows × 26 columns

Selecting Columns by Column Names

ipl_auction_df['PLAYER NAME'][0:5]
0 Abdulla, YA
1 Abdur Razzak
2 Agarkar, AB
3 Ashwin, R
4 Badrinath, S
Name: PLAYER NAME, dtype: object

ipl_auction_df[['PLAYER NAME', 'COUNTRY']][0:5]

PLAYER NAME COUNTRY

0 Abdulla, YA SA

1 Abdur Razzak BAN

2 Agarkar, AB IND

3 Ashwin, R IND

4 Badrinath, S IND

Selecting Rows and Columns by indexes


ipl_auction_df.iloc[4:9, 1:4]

PLAYER NAME AGE COUNTRY

4 Badrinath, S 2 IND

5 Bailey, GJ 2 AUS

6 Balaji, L 2 IND

7 Bollinger, DE 2 AUS

8 Botha, J 2 SA

2.1.6 Value Counts and Cross Tabulations

Finding Unique Occurances of Values in Columns

ipl_auction_df.COUNTRY.value_counts()

IND 53
AUS 22
SA 16
SL 12
PAK 9
NZ 7
WI 6
ENG 3
ZIM 1
BAN 1
Name: COUNTRY, dtype: int64

ipl_auction_df.COUNTRY.value_counts(normalize=True)*100

IND 40.769
AUS 16.923
SA 12.308
SL 9.231
PAK 6.923
NZ 5.385
WI 4.615
ENG 2.308
ZIM 0.769
BAN 0.769
Name: COUNTRY, dtype: float64

Cross-tabulation between two columns


pd.crosstab( ipl_auction_df['AGE'], ipl_auction_df['PLAYING ROLE'] )

PLAYING ROLE Allrounde Batsman Bowler W. Keeper


r
AGE

1 4 5 7 0

2 25 21 29 11

3 6 13 8 1

2.1.7 Sorting dataframe by column values

ipl_auction_df[['PLAYER NAME', 'SOLD PRICE']].sort_values('SOLD PRICE')[0:5]

PLAYER NAME SOLD PRICE

73 Noffke, AA 20000

46 Kamran Khan 24000

0 Abdulla, YA 50000

1 Abdur Razzak 50000

118 Van der Merwe 50000

ipl_auction_df[['PLAYER NAME', 'SOLD PRICE']].sort_values('SOLD PRICE', ascendin


g = False)[0:5]

PLAYER NAME SOLD PRICE

93 Sehwag, V 1800000

127 Yuvraj Singh 1800000

50 Kohli, V 1800000

111 Tendulkar, SR 1800000

113 Tiwary, SS 1600000

2.1.8 Creating new columns

Which player got the maximum premium on the base price?


ipl_auction_df['premium'] = ipl_auction_df['SOLD PRICE'] - ipl_auction_df['BASE
PRICE']
ipl_auction_df[['PLAYER NAME', 'BASE PRICE', 'SOLD PRICE', 'premium']][0:5]

PLAYER NAME BASE PRICE SOLD PRICE premium

0 Abdulla, YA 50000 50000 0

1 Abdur Razzak 50000 50000 0

2 Agarkar, AB 200000 350000 150000

3 Ashwin, R 100000 850000 750000

4 Badrinath, S 100000 800000 700000

Which players got the maximum premium offering on their base price?

ipl_auction_df[['PLAYER NAME',
'BASE PRICE',
'SOLD PRICE', 'premium']].sort_values('premium',
ascending = False)[0:5]

PLAYER NAME BASE PRICE SOLD PRICE premium

50 Kohli, V 150000 1800000 1650000

113 Tiwary, SS 100000 1600000 1500000

127 Yuvraj Singh 400000 1800000 1400000

111 Tendulkar, SR 400000 1800000 1400000

93 Sehwag, V 400000 1800000 1400000

2.1.9 Grouping and Aggregating

What is the average SOLD PRICE for each age category?

ipl_auction_df.groupby('AGE')['SOLD PRICE'].mean()

AGE
1 720250.000
2 484534.884
3 520178.571
Name: SOLD PRICE, dtype: float64
soldprice_by_age = ipl_auction_df.groupby('AGE')['SOLD PRICE'].mean().reset_inde
x()
soldprice_by_age

AGE SOLD PRICE

0 1 720250.000

1 2 484534.884

2 3 520178.571

Average SOLD PRICE for Different Playing Roles in Each Age Category?

soldprice_by_age_role = ipl_auction_df.groupby(['AGE', 'PLAYING ROLE'])['SOLD PR


ICE'].mean().reset_index()
soldprice_by_age_role

AGE PLAYING ROLE SOLD PRICE

0 1 Allrounder 587500.000

1 1 Batsman 1110000.000

2 1 Bowler 517714.286

3 2 Allrounder 449400.000

4 2 Batsman 654761.905

5 2 Bowler 397931.034

6 2 W. Keeper 467727.273

7 3 Allrounder 766666.667

8 3 Batsman 457692.308

9 3 Bowler 414375.000

10 3 W. Keeper 700000.000

2.1.10 Joining dataframes

Compare the average auction price for different ages and playing roles.

soldprice_comparison = soldprice_by_age_role.merge( soldprice_by_age,


on = 'AGE',
how = 'outer')
soldprice_comparison

AGE PLAYING ROLE SOLD PRICE_x SOLD PRICE_y

0 1 Allrounder 587500.000 720250.000

1 1 Batsman 1110000.000 720250.000

2 1 Bowler 517714.286 720250.000

3 2 Allrounder 449400.000 484534.884

4 2 Batsman 654761.905 484534.884

5 2 Bowler 397931.034 484534.884

6 2 W. Keeper 467727.273 484534.884

7 3 Allrounder 766666.667 520178.571

8 3 Batsman 457692.308 520178.571

9 3 Bowler 414375.000 520178.571

10 3 W. Keeper 700000.000 520178.571

2.1.11 Re-naming columns

soldprice_comparison.rename( columns = { 'SOLD PRICE_x': 'SOLD_PRICE_AGE_ROLE',


'SOLD PRICE_y': 'SOLD_PRICE_AGE' }, inplace = Tru
e )

soldprice_comparison.head(5)

AGE PLAYING ROLE SOLD_PRICE_AGE_ROLE SOLD_PRICE_AGE

0 1 Allrounder 587500.000 720250.000

1 1 Batsman 1110000.000 720250.000

2 1 Bowler 517714.286 720250.000

3 2 Allrounder 449400.000 484534.884

4 2 Batsman 654761.905 484534.884

2.1.12 Applying Operations to multiple columns

Percentage change in SOLD PRICE

soldprice_comparison['change'] = soldprice_comparison.apply(lambda rec:


(rec.SOLD_PRICE_AGE_ROLE - rec.SOLD_PRICE_AGE) / rec.SOL
D_PRICE_AGE,
axis = 1)
soldprice_comparison

AGE PLAYING ROLE SOLD_PRICE_AGE_ROLE SOLD_PRICE_AGE change

0 1 Allrounder 587500.000 720250.000 -0.184

1 1 Batsman 1110000.000 720250.000 0.541

2 1 Bowler 517714.286 720250.000 -0.281

3 2 Allrounder 449400.000 484534.884 -0.073

4 2 Batsman 654761.905 484534.884 0.351

5 2 Bowler 397931.034 484534.884 -0.179

6 2 W. Keeper 467727.273 484534.884 -0.035

7 3 Allrounder 766666.667 520178.571 0.474

8 3 Batsman 457692.308 520178.571 -0.120

9 3 Bowler 414375.000 520178.571 -0.203

10 3 W. Keeper 700000.000 520178.571 0.346

2.1.13 Filtering Records from Dataframe based on conditions

Which players have hit more then 80 sixes in the IPL tournament so far?

ipl_auction_df[ipl_auction_df['SIXERS'] > 80 ][['PLAYER NAME', 'SIXERS']]

PLAYER NAME SIXERS

26 Gayle, CH 129

28 Gilchrist, AC 86

82 Pathan, YK 81

88 Raina, SK 97

97 Sharma, RG 82

2.1.14 Removing a column

ipl_auction_df.drop( 'Sl.NO.', inplace = True, axis = 1)


ipl_auction_df.columns
Index(['PLAYER NAME', 'AGE', 'COUNTRY', 'TEAM', 'PLAYING ROLE', 'T-R
UNS',
'T-WKTS', 'ODI-RUNS-S', 'ODI-SR-B', 'ODI-WKTS', 'ODI-SR-BL',
'CAPTAINCY EXP', 'RUNS-S', 'HS', 'AVE', 'SR-B', 'SIXERS', 'RU
NS-C',
'WKTS', 'AVE-BL', 'ECON', 'SR-BL', 'AUCTION YEAR', 'BASE PRIC
E',
'SOLD PRICE', 'premium'],

dtype='object')

You might also like