9 Libraries

The document provides an overview of various Python libraries, particularly focusing on the math and pandas libraries, which are used for mathematical functions and data analysis, respectively. It includes examples of functions from the math library and demonstrates how to load and manipulate car data using pandas, including filtering and displaying specific records. Additionally, it touches on the structure of data frames and methods for querying data in a similar manner to SQL.

Uploaded by

prpt2608

In [ ]: #### Libraries have a set of built-in functions which we can reuse

In [1]: import math #### math - a standard-library module that provides mathematical functions

In [3]: math.cos(90) # the argument is in radians, not degrees

Out[3]: -0.4480736161291701
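
math.cos expects its argument in radians, which is why the result above is not 0. A minimal sketch of converting degrees first with math.radians (the variable names are illustrative only):

```python
import math

angle_deg = 90
angle_rad = math.radians(angle_deg)       # 90 degrees -> pi/2 radians

cos_of_90_degrees = math.cos(angle_rad)   # very close to 0, up to floating-point rounding
cos_of_90_radians = math.cos(90)          # the value shown in Out[3]

print(round(cos_of_90_degrees, 10))
print(cos_of_90_radians)
```

math.degrees performs the reverse conversion when a result needs to be reported in degrees.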

In [4]: math.factorial(5) # factorial - product of all whole numbers from the given number down to 1 (5*4*3*2*1)

Out[4]: 120

In [5]: math.lcm(6,4)

Out[5]: 12

In [6]: math.gcd(6,4)

Out[6]: 2

In [7]: math.exp(4)

Out[7]: 54.598150033144236

In [9]: math.pow(5,2)

Out[9]: 25.0

In [10]: math.log(5) # natural logarithm (base e)

Out[10]: 1.6094379124341003
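
With a single argument, math.log returns the natural logarithm. A short sketch of the other bases available in the math module (the variable names are illustrative only):

```python
import math

x = 5
natural_log = math.log(x)      # base e, the same call as above
log_base_10 = math.log10(x)    # base 10
log_base_2 = math.log2(x)      # base 2
log_custom = math.log(x, 5)    # optional second argument sets the base

print(natural_log, log_base_10, log_base_2, log_custom)
```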

In [11]: math.sqrt(10) #square root

Out[11]: 3.1622776601683795

In [16]: math.pi

Out[16]: 3.141592653589793

In [5]: import scipy as sc ##(scientific python) (linear algebra, complex functions)

In [ ]: sc.

In [ ]: #### Pandas library ####

In [ ]: # Python data analysis library
# Name derived from "panel data"
# Powerful library for data analysis, mainly used for data manipulation
# Open source
# A table loaded into pandas is called a "DataFrame"

In [3]: import pandas as pd


car_data = pd.read_excel("D:\\Data Engineering\\Python\\Data sets\\Car_Data.xlsx")

In [128… car_data.head(5) # head(n) returns the first n records

Out[128… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

1 sx4 2013 4.75 9.54 43000 Diesel Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

3 wagon r 2011 2.85 4.15 5200 Petrol Dealer Manual

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

In [129… car_data.tail(5) # tail(n) returns the last n records

Out[129… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

296 city 2016 9.50 11.6 33988 Diesel Dealer Manual

297 brio 2015 4.00 5.9 60000 Petrol Dealer Manual

298 city 2009 3.35 11.0 87934 Petrol Dealer Manual

299 city 2017 11.50 12.5 9000 Diesel Dealer Manual

300 brio 2016 5.30 5.9 5464 Petrol Dealer Manual

In [130… car_data.shape # returns (number of rows, number of columns)

Out[130… (301, 8)

In [131… car_data.columns

Out[131… Index(['Car_Name', 'Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission'],
      dtype='object')

In [132… list(car_data.columns)

Out[132… ['Car_Name',
'Year',
'Selling_Price',
'Present_Price',
'Kms_Driven',
'Fuel_Type',
'Seller_Type',
'Transmission']

In [234… car_data.count()

Out[234… Car_Name 301
Year 301
Selling_Price 301
Present_Price 301
Kms_Driven 301
Fuel_Type 301
Seller_Type 301
Transmission 301
dtype: int64

In [133… type(car_data)
# DataFrame
# 2D tabular form (columns and rows)
# similar to a table or spreadsheet

Out[133… pandas.core.frame.DataFrame

In [134… car_data.dtypes

Out[134… Car_Name object
Year int64
Selling_Price float64
Present_Price float64
Kms_Driven int64
Fuel_Type object
Seller_Type object
Transmission object
dtype: object

In [ ]: # car_data.head(5) ------ similar to select * from table limit 5

In [ ]: # Q. return year wise car names with selling price

In [135… car_data[["Car_Name", "Year", "Selling_Price"]].head(5)

Out[135… Car_Name Year Selling_Price

0 ritz 2014 3.35

1 sx4 2013 4.75

2 ciaz 2017 7.25

3 wagon r 2011 2.85

4 swift 2014 4.60

In [ ]: # Q. return only 2014 cars

In [138… car_data[car_data["Year"] == 2014].head(5) # similar to (select * from car_data where year = 2014)

Out[138… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

23 alto k10 2014 2.50 3.46 45280 Petrol Dealer Manual

32 swift 2014 4.95 7.49 39000 Diesel Dealer Manual

33 ertiga 2014 6.00 9.95 45000 Diesel Dealer Manual

In [ ]: # Q. return only petrol cars

In [139… car_data[car_data["Fuel_Type"] == "Petrol"].head(5)

Out[139… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

3 wagon r 2011 2.85 4.15 5200 Petrol Dealer Manual

6 ciaz 2015 6.75 8.12 18796 Petrol Dealer Manual

10 alto 800 2017 2.85 3.60 2135 Petrol Dealer Manual

In [140… car_data[car_data["Kms_Driven"] <= 5000].head(5)

Out[140… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

5 vitara brezza 2018 9.25 9.83 2071 Diesel Dealer Manual

10 alto 800 2017 2.85 3.60 2135 Petrol Dealer Manual

21 ignis 2017 4.90 5.71 2400 Petrol Dealer Manual

100 Royal Enfield Thunder 500 2016 1.75 1.90 3000 Petrol Individual Manual

101 UM Renegade Mojave 2017 1.70 1.82 1400 Petrol Individual Manual

In [145… car_data[(car_data["Year"] == 2014) & (car_data["Fuel_Type"] == "Diesel")].head(5)

Out[145… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

32 swift 2014 4.95 7.49 39000 Diesel Dealer Manual

33 ertiga 2014 6.00 9.95 45000 Diesel Dealer Manual

34 dzire 2014 5.50 8.06 45000 Diesel Dealer Manual

43 dzire 2014 5.50 8.06 45780 Diesel Dealer Manual

In [146… car_data[(car_data["Year"]==2014) & (car_data["Fuel_Type"]=="Diesel") & (car_data["Transmission"]=="Automatic")].head(5)

Out[146… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

59 fortuner 2014 19.99 35.96 41000 Diesel Dealer Automatic

62 fortuner 2014 18.75 35.96 78000 Diesel Dealer Automatic

In [ ]: # Q. Return 2016 cars with automatic transmission

In [147… car_data[(car_data["Year"]== 2016) & (car_data["Transmission"]== "Automatic")].head(5)

Out[147… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

40 baleno 2016 5.85 7.87 24524 Petrol Dealer Automatic

96 innova 2016 20.75 25.39 29000 Diesel Dealer Automatic

165 Activa 3g 2016 0.45 0.54 500 Petrol Individual Automatic

177 Honda Activa 125 2016 0.35 0.57 24000 Petrol Individual Automatic

275 city 2016 10.90 13.60 30753 Petrol Dealer Automatic
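
The filters above combine conditions with & (AND). A minimal sketch of the related operators, | (OR), ~ (NOT) and isin, on a small hypothetical frame that only imitates the shape of car_data:

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on car_data
cars = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "swift", "fortuner"],
    "Year": [2014, 2013, 2017, 2014, 2014],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Diesel"],
    "Transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic"],
})

# OR: cars from 2013 or 2017 (each condition in its own parentheses)
either_year = cars[(cars["Year"] == 2013) | (cars["Year"] == 2017)]

# isin: the same filter written as a membership test
same_with_isin = cars[cars["Year"].isin([2013, 2017])]

# NOT: everything except manual transmission
not_manual = cars[~(cars["Transmission"] == "Manual")]

print(either_year)
print(not_manual)
```

The parentheses around each condition matter because & and | bind more tightly than == in Python.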

In [148… pd.set_option("display.max_rows",20) ## sets the maximum number of rows pandas prints before it truncates the display

In [149… car_data.head(200) # with max_rows set to 20, a 200-row result is shown truncated: only the first and last few rows are printed

Out[149… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

1 sx4 2013 4.75 9.54 43000 Diesel Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

3 wagon r 2011 2.85 4.15 5200 Petrol Dealer Manual

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

... ... ... ... ... ... ... ... ...

195 Bajaj ct 100 2015 0.18 0.32 35000 Petrol Individual Manual

196 Activa 3g 2008 0.17 0.52 500000 Petrol Individual Automatic

197 Honda CB twister 2010 0.16 0.51 33000 Petrol Individual Manual

198 Bajaj Discover 125 2011 0.15 0.57 35000 Petrol Individual Manual

199 Honda CB Shine 2007 0.12 0.58 53000 Petrol Individual Manual

200 rows × 8 columns
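
pd.set_option changes the display limit for the rest of the session. A small sketch of a temporary alternative, pd.option_context, which restores the previous value when the block ends (the 200-row frame here is hypothetical):

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(200)})   # hypothetical 200-row frame

limit_before = pd.get_option("display.max_rows")

# The limit applies only inside the with-block and is restored afterwards
with pd.option_context("display.max_rows", 20):
    limit_inside = pd.get_option("display.max_rows")
    truncated_text = repr(numbers)   # printed form of a frame longer than the limit

limit_after = pd.get_option("display.max_rows")

print(truncated_text)
```

Only the printed representation is shortened; the DataFrame itself still holds all 200 rows.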

In [ ]: ################################## Different ways of filtering ######################################

In [150… emp_data=pd.read_csv("D:\\Data Engineering\\Python\\Data sets\\emp1.csv")

In [151… type(emp_data)

Out[151… pandas.core.frame.DataFrame

In [152… emp_data.shape

Out[152… (50, 14)

In [153… emp_data.columns

Out[153… Index(['EmployeeID', 'NationalIDNumber', 'LoginID', 'Title', 'PhoneNumber',
       'BirthDate', 'MaritalStatus', 'Gender', 'HireDate', 'Dept', 'Salary',
       'Job Grade', 'CurrentFlag', 'rowguid'],
      dtype='object')

In [154… emp_data.head(3)

Out[154… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 M M 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

1 2 253022876 adventure-works\kevin0 Catherine Abel 9.453570e+09 12-03-1991 00:00 S M 31-08-2013 00:00 Sales 962.0 Management -1 {1B480240-95C0-410F-A717-EB29943C8886}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 M M 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

In [ ]: #Method 1

In [ ]: # Q. Get sales department with all the columns

In [155… emp_data[emp_data["Dept"]=="sales"].head(3)

Out[155… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

6 7 309738752 NaN Margaret Smith NaN 25-11-1959 00:00 S F 31-07-2014 00:00 sales NaN Operations -1 {2CC71B96-F421-485E-9832-8723337749BB}

18 19 9659517 NaN Milton Albury NaN 06-02-1960 00:00 M F 01-11-2014 00:00 sales 3408.0 Admin -1 {C334B2D2-0C56-4906-9095-F1D07A98CBEC}

40 41 885055826 NaN John Arthur 7.945558e+09 26-01-1980 00:00 M M 15-07-2018 00:00 sales 1355.0 Admin -1 {E249D613-36C9-4544-9B6F-6CE50E5E0DA5}

In [ ]: #Method 1a

In [ ]: # Q. Get sales dept with Title, Dept, Gender column

In [156… emp_data[emp_data["Dept"]=="sales"][["Title","Dept","Gender"]].head(3) ### the desired columns are passed as a list, hence the double square brackets

Out[156… Title Dept Gender

6 Margaret Smith sales F

18 Milton Albury sales F

40 John Arthur sales M

In [ ]: # Method 2 ---Query Method

In [ ]: #Q. Get sales Dept with all the columns

In [157… emp_data.query('Dept=="sales"').head(3)

Out[157… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

6 7 309738752 NaN Margaret Smith NaN 25-11-1959 00:00 S F 31-07-2014 00:00 sales NaN Operations -1 {2CC71B96-F421-485E-9832-8723337749BB}

18 19 9659517 NaN Milton Albury NaN 06-02-1960 00:00 M F 01-11-2014 00:00 sales 3408.0 Admin -1 {C334B2D2-0C56-4906-9095-F1D07A98CBEC}

40 41 885055826 NaN John Arthur 7.945558e+09 26-01-1980 00:00 M M 15-07-2018 00:00 sales 1355.0 Admin -1 {E249D613-36C9-4544-9B6F-6CE50E5E0DA5}

In [ ]: # Method 2a ----Query method 2

In [ ]: # Q. Get sales dept with Title, Dept, Gender column

In [158… emp_data.query('Dept=="sales"')[["Title", "Dept", "Gender"]].head(3)

Out[158… Title Dept Gender

6 Margaret Smith sales F

18 Milton Albury sales F

40 John Arthur sales M

In [ ]: # Method 3 ---- loc (label-based indexing; iloc is the position-based counterpart) --- shown for reference only

In [ ]: #Q. Get sales Dept with all the columns

In [159… emp_data.loc[emp_data["Dept"]=="sales"].head(3)

Out[159… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

6 7 309738752 NaN Margaret Smith NaN 25-11-1959 00:00 S F 31-07-2014 00:00 sales NaN Operations -1 {2CC71B96-F421-485E-9832-8723337749BB}

18 19 9659517 NaN Milton Albury NaN 06-02-1960 00:00 M F 01-11-2014 00:00 sales 3408.0 Admin -1 {C334B2D2-0C56-4906-9095-F1D07A98CBEC}

40 41 885055826 NaN John Arthur 7.945558e+09 26-01-1980 00:00 M M 15-07-2018 00:00 sales 1355.0 Admin -1 {E249D613-36C9-4544-9B6F-6CE50E5E0DA5}

In [ ]: # Method 3a

In [ ]: # Q. Get sales dept with Title, Dept, Gender column

In [160… emp_data.loc[emp_data["Dept"]=="sales"][["Title","Dept","Gender"]].head(3)

Out[160… Title Dept Gender

6 Margaret Smith sales F

18 Milton Albury sales F

40 John Arthur sales M
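
.loc also accepts the row condition and the column list in a single call, which avoids the second pair of brackets. A minimal sketch on hypothetical rows that imitate emp_data:

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on emp_data
emp = pd.DataFrame({
    "Title": ["Gustavo Achong", "Margaret Smith", "Milton Albury", "John Arthur"],
    "Dept": ["Sales", "sales", "sales", "sales"],
    "Gender": ["M", "F", "F", "M"],
    "Salary": [2295.0, None, 3408.0, 1355.0],
})

is_sales = emp["Dept"] == "sales"

# Two steps: filter rows, then pick columns
two_step = emp.loc[is_sales][["Title", "Dept", "Gender"]]

# One step: row mask and column list in the same .loc call
one_step = emp.loc[is_sales, ["Title", "Dept", "Gender"]]

print(one_step)
```

Both forms return the same rows and columns here; the one-step form is also the one to use when assigning values back into the frame.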

In [ ]: #### Filtering with multiple fields

In [161… emp_data.loc[(emp_data["Dept"]=="sales") & (emp_data["Gender"]=="M")].head(5)

Out[161… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

40 41 885055826 NaN John Arthur 7.945558e+09 26-01-1980 00:00 M M 15-07-2018 00:00 sales 1355.0 Admin -1 {E249D613-36C9-4544-9B6F-6CE50E5E0DA5}

In [162… emp_data.loc[(emp_data["Dept"]=="sales") & (emp_data["Gender"]=="M")][["Title","Dept","Gender"]].head(5)

Out[162… Title Dept Gender

40 John Arthur sales M

In [163… emp_data[(emp_data["Gender"]=="M") & (emp_data["MaritalStatus"]=="M")].head(3)

Out[163… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 M M 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 M M 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

4 5 480168528 NaN Pilar Ackerman NaN 07-06-1963 00:00 M M 16-07-2014 00:00 Human Resource 1932.0 Admin -1 {1D955171-E773-4FAD-8382-40FD898D5D4D}

In [164… emp_data[(emp_data["Gender"]=="M") & (emp_data["MaritalStatus"]=="M")][["Title","Dept","Gender"]].head(5)

Out[164… Title Dept Gender

0 Gustavo Achong Sales M

2 Kim Abercrombie Finance M

4 Pilar Ackerman Human Resource M

10 Samuel Agcaoili Logistics M

12 Robert Ahlering Production M

In [ ]: ### Query Method

In [165… emp_data.query('Gender == "M" and MaritalStatus == "M"').head(3) # In the query method the whole condition is written
# as one string inside single quotes ''; string values inside it use double quotes ""

Out[165… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 M M 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 M M 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

4 5 480168528 NaN Pilar Ackerman NaN 07-06-1963 00:00 M M 16-07-2014 00:00 Human Resource 1932.0 Admin -1 {1D955171-E773-4FAD-8382-40FD898D5D4D}

In [166… emp_data.query('Gender =="M" and MaritalStatus =="M"')[["Title","Dept","Gender"]].head(3)

Out[166… Title Dept Gender

0 Gustavo Achong Sales M

2 Kim Abercrombie Finance M

4 Pilar Ackerman Human Resource M

In [ ]: # Summary
# Normal method with single-column filtering (all columns, selected columns)
# Normal method with multiple-column filtering (all columns, selected columns)
# Query method with single-column filtering (all columns, selected columns)
# Query method with multiple-column filtering (all columns, selected columns)
# loc method with single-column filtering (all columns, selected columns)
# loc method with multiple-column filtering (all columns, selected columns)
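
The three methods in the summary can be checked against each other. The sketch below uses hypothetical rows that imitate emp_data and also shows the backtick syntax query() needs for a column name containing a space, such as Job Grade:

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on emp_data
emp = pd.DataFrame({
    "Title": ["Gustavo Achong", "Catherine Abel", "Margaret Smith", "John Arthur"],
    "Gender": ["M", "M", "F", "M"],
    "MaritalStatus": ["M", "S", "S", "M"],
    "Job Grade": ["Admin", "Management", "Operations", "Admin"],
})

condition = (emp["Gender"] == "M") & (emp["MaritalStatus"] == "M")

by_mask = emp[condition]                                         # normal method
by_loc = emp.loc[condition]                                      # loc method
by_query = emp.query('Gender == "M" and MaritalStatus == "M"')   # query method

# Column names containing a space need backticks inside query()
admins = emp.query('`Job Grade` == "Admin"')

print(by_query)
print(admins)
```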

In [63]: def add(a,b):
    c=a+b
    return c

In [64]: add(5,6)

Out[64]: 11

In [65]: def MaritalStatus(a):
    if a == "M":
        return "Married"
    else:
        return "Single"

In [67]: MaritalStatus("M")

Out[67]: 'Married'

In [68]: MaritalStatus("S")

Out[68]: 'Single'

In [74]: def Gender(a):
    if a=="M":
        return "Male"
    elif a=="F":
        return "Female"

In [75]: Gender("M")

Out[75]: 'Male'

In [76]: Gender("F")

Out[76]: 'Female'

In [167… emp_data.head(3)

Out[167… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 M M 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

1 2 253022876 adventure-works\kevin0 Catherine Abel 9.453570e+09 12-03-1991 00:00 S M 31-08-2013 00:00 Sales 962.0 Management -1 {1B480240-95C0-410F-A717-EB29943C8886}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 M M 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

In [168… emp_data["Gender"]=emp_data["Gender"].apply(Gender)

In [169… emp_data.head(3)

Out[169… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 M Male 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

1 2 253022876 adventure-works\kevin0 Catherine Abel 9.453570e+09 12-03-1991 00:00 S Male 31-08-2013 00:00 Sales 962.0 Management -1 {1B480240-95C0-410F-A717-EB29943C8886}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 M Male 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

In [170… emp_data["MaritalStatus"] = emp_data["MaritalStatus"].apply(MaritalStatus)

In [171… emp_data.head(3)

Out[171… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 Married Male 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

1 2 253022876 adventure-works\kevin0 Catherine Abel 9.453570e+09 12-03-1991 00:00 Single Male 31-08-2013 00:00 Sales 962.0 Management -1 {1B480240-95C0-410F-A717-EB29943C8886}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 Married Male 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}
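
The Gender function above returns None for any value other than "M" or "F", and MaritalStatus labels everything that is not "M" as "Single", including missing values. A possible alternative, sketched here on hypothetical values, is a dictionary lookup with Series.map, which leaves unmatched values visibly empty:

```python
import pandas as pd

# Hypothetical codes, including a missing value and an unexpected one
codes = pd.Series(["M", "F", "M", None, "X"])

gender_labels = {"M": "Male", "F": "Female"}

# map() looks each value up in the dictionary; anything not listed becomes NaN
mapped = codes.map(gender_labels)

# Optionally give unmatched values an explicit label
mapped_filled = mapped.fillna("Unknown")

print(mapped_filled)
```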

In [ ]: ############################################### Using a new dataset #########################################

In [86]: import pandas as pd

In [172… bike_data =pd.read_excel("D:\\Data Engineering\\Python\\Data sets\\Bike_Data.xlsx")

In [173… type(bike_data)

Out[173… pandas.core.frame.DataFrame

In [174… bike_data.shape

Out[174… (60919, 11)

In [175… bike_data.dtypes

Out[175… Region object


Country object
Customer object
Business Segment object
Category object
Model object
Color object
SalesDate datetime64[ns]
ListPrice float64
UnitPrice float64
OrderQty int64
dtype: object

In [176… bike_data.head(5) # the data has no measure columns such as sales or profit

Out[176… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2

In [177… # Q. Calculate cost of the product

bike_data["cost"] = bike_data["UnitPrice"] * bike_data["OrderQty"]

In [178… bike_data.head(5)

Out[178… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty cost

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1 183.94

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1 2039.99

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6 173.04

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2 1429.40

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2 1749.58

In [189… # Q. Calculate sales of the product

bike_data["Sales"] = bike_data["ListPrice"] * bike_data["OrderQty"]

In [190… bike_data.head(5)

Out[190… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty cost Profit Sales

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1 183.94 153.28 337.22

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1 2039.99 1360.00 3399.99

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6 173.04 126.90 299.94

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2 1429.40 1269.80 2699.20

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2 1749.58 1166.40 2915.98

In [192… # Q. Calculate Profit from the product

bike_data["Profit"] = bike_data["Sales"] - bike_data["cost"]

In [193… bike_data.head(5)

Out[193… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty cost Profit Sales

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1 183.94 153.28 337.22

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1 2039.99 1360.00 3399.99

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6 173.04 126.90 299.94

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2 1429.40 1269.80 2699.20

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2 1749.58 1166.40 2915.98

In [ ]: # I accidentally added an extra column ("sales") to the data frame, so the "sales" column is dropped below

In [184… bike_data.drop("sales", axis=1, inplace=True)

# drop("sales", axis=1): drops the column named "sales".
# The axis=1 parameter indicates that a column is being dropped (use axis=0 to drop rows).
# inplace=True: modifies the original DataFrame instead of returning a new one.
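
By default drop raises a KeyError when the named column is not present, for example when the cell is run a second time. A minimal sketch of a variant that returns a new DataFrame and tolerates a missing column (the frame and its stray column are hypothetical):

```python
import pandas as pd

bikes = pd.DataFrame({
    "UnitPrice": [183.94, 2039.99],
    "OrderQty": [1, 1],
    "sales": [0.0, 0.0],      # hypothetical stray column to remove
})

# Returns a new DataFrame; the original keeps its columns
cleaned = bikes.drop(columns=["sales"])

# errors="ignore" skips names that are not present instead of raising KeyError
cleaned_again = cleaned.drop(columns=["sales"], errors="ignore")

print(list(cleaned.columns))
```

The columns= keyword is equivalent to passing axis=1 and makes the intent explicit.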

In [196… bike_data.head(5)

Out[196… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty cost Profit Sales

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1 183.94 153.28 337.22

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1 2039.99 1360.00 3399.99

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6 173.04 126.90 299.94

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2 1429.40 1269.80 2699.20

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2 1749.58 1166.40 2915.98

In [197… # Assign categories (Bronze, Silver, Gold, Platinum) to the customers based on sales

def Sales_group(a):
    if a < 5000:
        return "Bronze"
    elif a < 10000:
        return "Silver"
    elif a < 15000:
        return "Gold"
    else:
        return "Platinum"

In [198… Sales_group(18000)

Out[198… 'Platinum'

In [199… bike_data["Customer_segment"] = bike_data["Sales"].apply(Sales_group)

In [201… bike_data.head(5)

Out[201… Region Country Customer Business Segment Category Model Color SalesDate ListPrice UnitPrice OrderQty cost Profit Sales Customer_segment

0 North America United States Advanced Bike Components Components Road Frames LL Road Frame Red 2020-04-01 337.22 183.94 1 183.94 153.28 337.22 Bronze

1 North America United States Central Discount Store Bikes Mountain Bikes Mountain-100 Silver 2020-04-01 3399.99 2039.99 1 2039.99 1360.00 3399.99 Bronze

2 North America United States Leading Sales & Repair Clothing Jerseys Long-Sleeve Logo Jersey Multi 2020-04-01 49.99 28.84 6 173.04 126.90 299.94 Bronze

3 North America United States Paint Supply Components Mountain Frames HL Mountain Frame Black 2020-04-01 1349.60 714.70 2 1429.40 1269.80 2699.20 Bronze

4 North America United States Scooters and Bikes Store Bikes Road Bikes Road-450 Red 2020-04-01 1457.99 874.79 2 1749.58 1166.40 2915.98 Bronze
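
The same banding can be written without a custom function by using pd.cut. This is a sketch on hypothetical sales values, not the method used in the notebook; the bin edges are copied from Sales_group:

```python
import pandas as pd

# Hypothetical sales values chosen to hit every band and one boundary
sales = pd.Series([337.22, 4999.99, 5000.0, 12000.0, 18000.0])

# right=False makes each bin include its lower edge and exclude its upper edge,
# which matches the "<" comparisons in Sales_group
segments = pd.cut(
    sales,
    bins=[float("-inf"), 5000, 10000, 15000, float("inf")],
    labels=["Bronze", "Silver", "Gold", "Platinum"],
    right=False,
)

print(segments)
```

pd.cut returns a categorical column, which keeps the band order (Bronze before Silver, and so on) for later sorting or grouping.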

In [204… bike_data.columns

Out[204… Index(['Region', 'Country', 'Customer', 'Business Segment', 'Category',
       'Model', 'Color', 'SalesDate', 'ListPrice', 'UnitPrice', 'OrderQty',
       'cost', 'Profit', 'Sales', 'Customer_segment'],
      dtype='object')

In [ ]: # Q. When do we use SQL and when do we use Python?

# Ans. When the data lives in a database we use SQL; when the data lives in files we use Python

In [112… import pandas as pd

In [205… crime_data=pd.read_excel("D:\\Data Engineering\\Python\\Data sets\\India_crimes.xlsx")

In [206… crime_data.head(5)

Out[206… S. No Category State/UT 2016 2017 2018 Total

0 1 State Andhra Pradesh 616 931 1207 2754

1 2 State Arunachal Pradesh 4 1 7 12

2 3 State Assam 696 1120 2022 3838

3 4 State Bihar 309 433 374 1116

4 5 State Chhattisgarh 90 171 139 400

In [ ]: # Q. Display the number of crimes in each State/UT

In [207… cdata = crime_data[["Category", "State/UT", "Total"]] # to display only state/UT and no of crimes

In [208… cdata.head(5)

Out[208… Category State/UT Total

0 State Andhra Pradesh 2754

1 State Arunachal Pradesh 12

2 State Assam 3838

3 State Bihar 1116

4 State Chhattisgarh 400

In [ ]: # Q. Sort the data by no of crimes

In [209… cdata.sort_values("Total").head(10) # sorts in ascending order by default

Out[209… Category State/UT Total

32 Union Territory Daman & Diu 0

31 Union Territory D&N Haveli 2

22 State Sikkim 3

18 State Nagaland 4

34 Union Territory Lakshadweep 4

1 State Arunachal Pradesh 12

29 Union Territory A & N Islands 13

17 State Mizoram 17

35 Union Territory Puducherry 21

25 State Tripura 35

In [ ]: # Q. sort the same data in descending order

In [210… cdata.sort_values("Total", ascending = False).head(10)

Out[210… Category State/UT Total

26 State Uttar Pradesh 13890

11 State Karnataka 10114

14 State Maharashtra 9495

2 State Assam 3838

21 State Rajasthan 3349

24 State Telangana 3007

0 State Andhra Pradesh 2754

19 State Odisha 1984

10 State Jharkhand 1909

6 State Gujarat 1522

In [ ]: # Assignment:-
# Add a new column named Risk
# for < 2000 crimes --> low risk
# for 2000 to 4000 crimes --> moderate risk
# for 4000 to 6000 crimes --> high risk
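
One possible approach to the assignment follows the same apply pattern as Sales_group. The rows below are copied from the outputs above; the function name risk_level and the handling of totals above 6000 are assumptions, since the assignment text does not cover that range:

```python
import pandas as pd

# A few rows copied from the outputs above
crimes = pd.DataFrame({
    "State/UT": ["Andhra Pradesh", "Arunachal Pradesh", "Assam", "Uttar Pradesh"],
    "Total": [2754, 12, 3838, 13890],
})

def risk_level(total):
    # Boundaries follow the assignment text. Totals of 4000 and above are all
    # labelled "High" here, which is an assumption (the text stops at 6000).
    if total < 2000:
        return "Low"
    elif total < 4000:
        return "Moderate"
    else:
        return "High"

crimes["Risk"] = crimes["Total"].apply(risk_level)

print(crimes)
```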

In [ ]: # Q Return the costliest car from car_data

In [211… car_data.head(5)

Out[211… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

1 sx4 2013 4.75 9.54 43000 Diesel Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

3 wagon r 2011 2.85 4.15 5200 Petrol Dealer Manual

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

In [220… car_data.sort_values("Selling_Price", ascending = False).head(5)

Out[220… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

189 Hero Super Splendor 2005 0.20 0.57 55000 Petrol Individual Manual

191 Bajaj Discover 125 2012 0.20 0.57 25000 Petrol Individual Manual

193 Hero Ignitor Disc 2013 0.20 0.65 24000 Petrol Individual Manual

190 Bajaj Pulsar 150 2008 0.20 0.75 60000 Petrol Individual Manual

195 Bajaj ct 100 2015 0.18 0.32 35000 Petrol Individual Manual
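
A descending sort followed by head only finds the costliest car if the DataFrame still holds every row, so it is worth checking car_data.shape first. The sketch below, on hypothetical rows with prices taken from earlier outputs, shows three equivalent ways to pick the highest Selling_Price:

```python
import pandas as pd

# Hypothetical subset; the prices appear in earlier outputs
cars = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "fortuner", "innova"],
    "Selling_Price": [3.35, 4.75, 7.25, 19.99, 20.75],
})

# Sort descending and keep the first row
top_by_sort = cars.sort_values("Selling_Price", ascending=False).head(1)

# nlargest does the same without sorting the whole frame
top_by_nlargest = cars.nlargest(1, "Selling_Price")

# idxmax returns the index label of the maximum, which .loc turns into a row
costliest_row = cars.loc[cars["Selling_Price"].idxmax()]

print(top_by_nlargest)
```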

In [ ]: # Q. sort car_data with two columns

In [ ]: car_data= car_data.sort_values(by=["Selling_Price", "Present_Price"]).head(5)

In [222… # The mistake in the line above: the sorted, truncated result was assigned back to car_data
# instead of to a new variable (e.g. car_data1), which overwrote the original DataFrame. Avoid this.
# It should be:
car_data.sort_values(by=["Selling_Price", "Present_Price"]).head(5)

Out[222… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

200 Bajaj Pulsar 150 2006 0.10 0.75 92233 Petrol Individual Manual

199 Honda CB Shine 2007 0.12 0.58 53000 Petrol Individual Manual

198 Bajaj Discover 125 2011 0.15 0.57 35000 Petrol Individual Manual

197 Honda CB twister 2010 0.16 0.51 33000 Petrol Individual Manual

196 Activa 3g 2008 0.17 0.52 500000 Petrol Individual Automatic

In [ ]: # Assignment:-
# Q. Sort by two columns, one ascending and one descending
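
sort_values accepts a list for ascending, with one entry per column in by. A minimal sketch on hypothetical rows (the choice of Year and Selling_Price as sort columns is only an example):

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on car_data
cars = pd.DataFrame({
    "Car_Name": ["ritz", "swift", "sx4", "ciaz"],
    "Year": [2014, 2014, 2013, 2017],
    "Selling_Price": [3.35, 4.60, 4.75, 7.25],
})

# Year ascending, and within each year Selling_Price descending
mixed = cars.sort_values(by=["Year", "Selling_Price"], ascending=[True, False])

print(mixed)
```

The result is assigned to a new name, so the original frame keeps its order.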

In [223… car_data.head(5)

Out[223… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

1 sx4 2013 4.75 9.54 43000 Diesel Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

3 wagon r 2011 2.85 4.15 5200 Petrol Dealer Manual

4 swift 2014 4.60 6.87 42450 Diesel Dealer Manual

In [229… car_data[car_data["Year"] == 2014]["Car_Name"].count()

Out[229… 38

In [232… car_data[car_data["Year"] == 2015]["Car_Name"].count()

Out[232… 61

In [230… car_data[car_data["Fuel_Type"] == "Diesel"]["Car_Name"].count()

Out[230… 60

In [ ]: ############################################ Group By in Pandas##############################

In [ ]: # SQL code -- select year, count(car_name) from car_data group by year

In [245… car_data.groupby("Year")[["Car_Name"]].count()

### use double square brackets [[ ]] so the result is a DataFrame; single brackets return a Series

Out[245… Car_Name

Year

2003 2

2004 1

2005 4

2006 4

2007 2

2008 7

2009 6

2010 15

2011 19

2012 23

2013 33

2014 38

2015 61

2016 50

2017 35

2018 1

In [238… car_data.groupby("Fuel_Type")[["Car_Name"]].count()

Out[238… Car_Name

Fuel_Type

CNG 2

Diesel 60

Petrol 239

In [239… car_data.groupby("Transmission")[["Car_Name"]].count()

Out[239… Car_Name

Transmission

Automatic 40

Manual 261

In [ ]: ###### Grouping by two columns

In [ ]: # to group by two columns, pass the column names as a list

In [244… car_data.groupby(["Fuel_Type", "Transmission"])[["Car_Name"]].count()

Out[244… Car_Name

Fuel_Type Transmission

CNG Manual 2

Diesel Automatic 12

Manual 48

Petrol Automatic 28

Manual 211
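
Grouping by two columns produces a two-level index, as in the output above. A sketch on hypothetical rows of how reset_index turns the group keys back into ordinary columns, and how agg computes several summaries at once:

```python
import pandas as pd

# Hypothetical sample rows, loosely modelled on car_data
cars = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz", "swift", "fortuner", "baleno"],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Diesel", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Manual", "Manual", "Automatic", "Automatic"],
    "Selling_Price": [3.35, 4.75, 7.25, 4.60, 19.99, 5.85],
})

# Count rows per (Fuel_Type, Transmission) pair, then flatten the index
counts = (
    cars.groupby(["Fuel_Type", "Transmission"])[["Car_Name"]]
    .count()
    .reset_index()
)

# Several aggregates at once for one column
price_stats = cars.groupby("Fuel_Type")["Selling_Price"].agg(["count", "mean", "max"])

print(counts)
print(price_stats)
```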

In [247… emp_data.head(3)

Out[247… EmployeeID NationalIDNumber LoginID Title PhoneNumber BirthDate MaritalStatus Gender HireDate Dept Salary Job Grade CurrentFlag rowguid

0 1 14417807 adventure-works\guy1 Gustavo Achong 8.185304e+09 21-02-1986 00:00 Married Male 02-02-2013 00:00 Sales 2295.0 Admin -1 {AAE1D04A-C237-4974-B4D5-935247737718}

1 2 253022876 adventure-works\kevin0 Catherine Abel 9.453570e+09 12-03-1991 00:00 Single Male 31-08-2013 00:00 Sales 962.0 Management -1 {1B480240-95C0-410F-A717-EB29943C8886}

2 3 509647174 NaN Kim Abercrombie 9.513804e+09 21-09-1978 00:00 Married Male 16-06-2014 00:00 Finance 4006.0 Admin -1 {9BBBFB2C-EFBB-4217-9AB7-F97689328841}

In [249… emp_data.groupby(["MaritalStatus", "Gender"])[["EmployeeID"]].count()

Out[249… EmployeeID

MaritalStatus Gender

Married Female 7

Male 15

Single Female 6

Male 22

In [253… emp_data.groupby("Dept")[["EmployeeID"]].count()

Out[253… EmployeeID

Dept

Finance 11

Human Resource 6

Logistics 8

Production 14

Sales 7

sales 4
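The output above lists both Sales and sales, which suggests the same department is stored with inconsistent capitalisation, so group by treats them as two groups. Assuming that is the cause, normalising the text before grouping merges them. The frame `emp` and the column `Dept_clean` below are hypothetical, used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for emp_data (values are made up).
emp = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Dept": ["Sales", "sales", "Finance"],
})

# Title-case every department name so "sales" and "Sales" match.
emp["Dept_clean"] = emp["Dept"].str.title()

counts = emp.groupby("Dept_clean")[["EmployeeID"]].count()
print(counts)
```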

In [258… Storedata = pd.read_excel("D:\\Data Engineering\\Python\\Data sets\\Superstore.xlsx")

In [256… import warnings ### used to suppress the warnings shown when loading a dataset

In [257… warnings.filterwarnings("ignore")
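`warnings.filterwarnings("ignore")` silences every warning for the rest of the session. Where that is too broad, the standard library's `warnings.catch_warnings()` context manager limits the effect to one block. The function `load_with_warning` below is a hypothetical stand-in for any call that emits a warning.

```python
import warnings

def load_with_warning():
    # Hypothetical stand-in for a loader that emits a warning.
    warnings.warn("demo warning", UserWarning)
    return "loaded"

# Warnings are silenced only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = load_with_warning()

# Record warnings in a second block to confirm they are raised again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    load_with_warning()

print(quiet_result, len(caught))
```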

In [261… Storedata.head(3)

Out[261… Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country/Region City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit

0 2698 CA-2016-145317 2016-03-18 2016-03-23 Standard Class SM-20320 Sean Miller Home Office United States Jacksonville ... 32216.0 South TEC-MA-10002412 Technology Machines Cisco TelePresence System EX90 Videoconferenci... 22638.48 6 0.5 -1811.0784

1 6827 CA-2018-118689 2018-10-02 2018-10-09 Standard Class TC-20980 Tamara Chand Corporate United States Lafayette ... 47905.0 Central TEC-CO-10004722 Technology Copiers Canon imageCLASS 2200 Advanced Copier 17499.95 5 0.0 8399.9760

2 8154 CA-2019-140151 2019-03-23 2019-03-25 First Class RB-19360 Raymond Buch Consumer United States Seattle ... 98115.0 West TEC-CO-10004722 Technology Copiers Canon imageCLASS 2200 Advanced Copier 13999.96 4 0.0 6719.9808

3 rows × 21 columns

In [262… Storedata.dtypes

Out[262… Row ID int64
Order ID object
Order Date datetime64[ns]
Ship Date datetime64[ns]
Ship Mode object
...
Product Name object
Sales float64
Quantity int64
Discount float64
Profit float64
Length: 21, dtype: object

In [264… sd = Storedata[["Category", "Sub-Category", "Region", "Segment", "Sales", "Profit", "Quantity"]]

In [265… sd.head(3)

Out[265… Category Sub-Category Region Segment Sales Profit Quantity

0 Technology Machines South Home Office 22638.48 -1811.0784 6

1 Technology Copiers Central Corporate 17499.95 8399.9760 5

2 Technology Copiers West Consumer 13999.96 6719.9808 4

In [ ]: # Q. Find the no. of Categories, Regions and Segments present in the dataframe

In [267… sd.groupby("Category")[["Category"]].count()

Out[267… Category

Category

Furniture 2121

Office Supplies 6026

Technology 1847

In [ ]: ### Instead of grouping, the 'unique' and 'nunique' functions can be used to find the categories and how many there are

In [273… sd["Category"].unique() # Returns the unique values

Out[273… array(['Technology', 'Office Supplies', 'Furniture'], dtype=object)

In [274… sd["Category"].nunique() # Returns the no. of unique values

Out[274… 3

In [275… sd["Region"].unique()

Out[275… array(['South', 'Central', 'West', 'East'], dtype=object)
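The group-by count shown earlier and `unique`/`nunique` answer two related questions: how many rows each value has, and how many distinct values exist. `value_counts()` is another pandas call that gives the per-value row count in one step. The sketch below uses a made-up frame; `sd_demo` is a hypothetical stand-in for sd.

```python
import pandas as pd

# Hypothetical stand-in for sd (values are made up).
sd_demo = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Technology",
                 "Office Supplies", "Technology"],
})

counts = sd_demo["Category"].value_counts()    # rows per category
n_categories = sd_demo["Category"].nunique()   # number of distinct categories

print(counts)
print(n_categories)
```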

In [ ]: # Q. Find the total sales for the Technology category

In [277… sd[sd["Category"] == "Technology"]["Sales"].sum()

Out[277… 826154.0329999999

In [278… sd[sd["Category"] == "Office Supplies"]["Sales"].sum()

Out[278… 719047.032

In [ ]: # Repeating this filter for every category doesn't scale, so we use group by to get all categories at once

In [ ]: # Method 1 - 1 column, 1 aggregation

In [279… sd.groupby(["Category"])[["Sales"]].sum()

Out[279… Sales

Category

Furniture 741999.7953

Office Supplies 719047.0320

Technology 826154.0330
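The grouped result holds the same totals as the per-category filters, computed in one pass. A small check on made-up data (the frame `sd_demo` and its numbers are assumptions) illustrates the equivalence.

```python
import pandas as pd

# Hypothetical stand-in for sd (values are made up).
sd_demo = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Technology", "Furniture"],
    "Sales": [100.0, 50.0, 25.0, 10.0],
})

# One filter-and-sum per category, as in the earlier cells.
by_filter = {
    cat: sd_demo[sd_demo["Category"] == cat]["Sales"].sum()
    for cat in sd_demo["Category"].unique()
}

# The same totals from a single group by.
by_group = sd_demo.groupby("Category")["Sales"].sum()

print(by_filter)
print(by_group)
```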

In [ ]: # Q. Return sales made within each category in each region

In [ ]: # Method 2 - Multiple group columns and 1 aggregation

In [280… sd.groupby(["Category", "Region"])[["Sales"]].sum()

Out[280… Sales

Category Region

Furniture Central 163797.1638

East 208291.2040

South 117298.6840

West 252612.7435

Office Supplies Central 167026.4150

East 205516.0550

South 125651.3130

West 220853.2490

Technology Central 170416.3120

East 254973.9810

South 148771.9080

West 251991.8320

In [ ]: # Q. Return Profit for each category and region

In [281… sd.groupby(["Category", "Region"])[["Profit"]].sum()

Out[281… Profit

Category Region

Furniture Central -2871.0494

East 3046.1658

South 6771.2061

West 11504.9503

Office Supplies Central 8879.9799

East 41014.5791

South 19986.3928

West 52609.8490

Technology Central 33697.4320

East 47462.0351

South 19991.8314

West 44303.6496

In [ ]: # Method 3 - Multiple group columns, several value columns, same aggregation

In [ ]: # Q. Return Sales and Profit for each category and region

In [282… sd.groupby(["Category", "Region"])[["Sales", "Profit"]].sum()

Out[282… Sales Profit

Category Region

Furniture Central 163797.1638 -2871.0494

East 208291.2040 3046.1658

South 117298.6840 6771.2061

West 252612.7435 11504.9503

Office Supplies Central 167026.4150 8879.9799

East 205516.0550 41014.5791

South 125651.3130 19986.3928

West 220853.2490 52609.8490

Technology Central 170416.3120 33697.4320

East 254973.9810 47462.0351

South 148771.9080 19991.8314

West 251991.8320 44303.6496

In [ ]: # Q. Return sales, profit and quantity of goods sold for every category and region

In [285… sd.groupby(["Category","Region"])[["Sales", "Profit", "Quantity"]].sum()

Out[285… Sales Profit Quantity

Category Region

Furniture Central 163797.1638 -2871.0494 1827

East 208291.2040 3046.1658 2214

South 117298.6840 6771.2061 1291

West 252612.7435 11504.9503 2696

Office Supplies Central 167026.4150 8879.9799 5409

East 205516.0550 41014.5791 6462

South 125651.3130 19986.3928 3800

West 220853.2490 52609.8490 7235

Technology Central 170416.3120 33697.4320 1544

East 254973.9810 47462.0351 1942

South 148771.9080 19991.8314 1118

West 251991.8320 44303.6496 2335

In [ ]: # Method 4 - Multiple columns with a different aggregation function for each column

In [287… sd.groupby(["Category", "Region"]).agg({"Sales":"sum", "Profit":"mean"})

# Pass a dictionary that maps each column (here Sales and Profit) to its aggregation function

# mean returns the average value

Out[287… Sales Profit

Category Region

Furniture Central 163797.1638 -5.968918

East 208291.2040 5.068496

South 117298.6840 20.395199

West 252612.7435 16.272914

Office Supplies Central 167026.4150 6.244712

East 205516.0550 23.957114

South 125651.3130 20.086827

West 220853.2490 27.733183

Technology Central 170416.3120 80.231981

East 254973.9810 88.714084

South 148771.9080 68.231506

West 251991.8320 73.962687

In [ ]: # Method 5 - One column with several aggregation functions

In [288… sd.groupby(["Category", "Region"]).agg({"Sales":["mean", "sum", "min", "max"]})

# Dictionary keys can't be repeated, so "Sales" can appear only once as a key.
# To apply several aggregate functions to the same column, pass them as a list.

Out[288… Sales

mean sum min max

Category Region

Furniture Central 340.534644 163797.1638 1.892 3504.900

East 346.574383 208291.2040 2.960 4416.174

South 353.309289 117298.6840 2.784 4297.644

West 357.302325 252612.7435 3.480 3610.848

Office Supplies Central 117.458801 167026.4150 0.444 9892.740

East 120.044425 205516.0550 0.852 4663.736

South 126.282727 125651.3130 1.167 6354.950

West 116.422377 220853.2490 1.080 8187.650

Technology Central 405.753124 170416.3120 1.980 17499.950

East 476.586880 254973.9810 2.970 10499.970

South 507.753952 148771.9080 1.584 22638.480

West 420.687533 251991.8320 0.990 13999.960
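Passing a list of functions gives the result two levels of column labels (Sales on top, the function names below). One possible way to flatten them into single names is to join each pair of labels. The sketch uses made-up data; `sd_demo` is a hypothetical stand-in for sd.

```python
import pandas as pd

# Hypothetical stand-in for sd (values are made up).
sd_demo = pd.DataFrame({
    "Category": ["Technology", "Technology", "Furniture"],
    "Region": ["West", "West", "East"],
    "Sales": [10.0, 30.0, 5.0],
})

stats = sd_demo.groupby(["Category", "Region"]).agg(
    {"Sales": ["mean", "sum", "min", "max"]}
)

# Each column label is a tuple such as ("Sales", "mean"); join the parts.
stats.columns = ["_".join(col) for col in stats.columns]
print(stats)
```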

In [ ]: # Giving alias names to the columns

In [292… sd.groupby(["Category", "Region"]).agg({"Sales":"sum", "Profit":"mean"}).rename(columns = {"Sales": "Total_Sales", "Profit": "Average_Profit"})

Out[292… Total_Sales Average_Profit

Category Region

Furniture Central 163797.1638 -5.968918

East 208291.2040 5.068496

South 117298.6840 20.395199

West 252612.7435 16.272914

Office Supplies Central 167026.4150 6.244712

East 205516.0550 23.957114

South 125651.3130 20.086827

West 220853.2490 27.733183

Technology Central 170416.3120 80.231981

East 254973.9810 88.714084

South 148771.9080 68.231506

West 251991.8320 73.962687
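As an alternative to calling `rename` afterwards, pandas also supports named aggregation, where each output column name is given as a keyword with a (column, function) pair. The sketch below runs both forms on made-up data (`sd_demo` is a hypothetical stand-in for sd) and compares them.

```python
import pandas as pd

# Hypothetical stand-in for sd (values are made up).
sd_demo = pd.DataFrame({
    "Category": ["Technology", "Technology", "Furniture"],
    "Region": ["West", "West", "East"],
    "Sales": [10.0, 30.0, 5.0],
    "Profit": [2.0, 4.0, -1.0],
})

# Named aggregation: output_name=(input_column, function)
summary = sd_demo.groupby(["Category", "Region"]).agg(
    Total_Sales=("Sales", "sum"),
    Average_Profit=("Profit", "mean"),
)

# The dictionary-plus-rename form used in the cell above.
renamed = (
    sd_demo.groupby(["Category", "Region"])
    .agg({"Sales": "sum", "Profit": "mean"})
    .rename(columns={"Sales": "Total_Sales", "Profit": "Average_Profit"})
)

print(summary)
```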

In [294… car_data.head(3)

Out[294… Car_Name Year Selling_Price Present_Price Kms_Driven Fuel_Type Seller_Type Transmission

0 ritz 2014 3.35 5.59 27000 Petrol Dealer Manual

1 sx4 2013 4.75 9.54 43000 Diesel Dealer Manual

2 ciaz 2017 7.25 9.85 6900 Petrol Dealer Manual

In [ ]: # Q. Return the no. of cars per year, with the count column named Total_Cars

In [9]: car_data.groupby(["Year"])[["Car_Name"]].count().rename(columns={"Car_Name":"Total_Cars"}) # used group by/ rename

Out[9]: Total_Cars

Year

2003 2

2004 1

2005 4

2006 4

2007 2

2008 7

2009 6

2010 15

2011 19

2012 23

2013 33

2014 38

2015 61

2016 50

2017 35

2018 1

In [ ]: # Sort by total cars, highest first

In [10]: car_data.groupby(["Year"])[["Car_Name"]].count().rename(columns={"Car_Name":"Total_Cars"}).sort_values("Total_Cars", ascending = False)

Out[10]: Total_Cars

Year

2015 61

2016 50

2014 38

2017 35

2013 33

2012 23

2011 19

2010 15

2008 7

2009 6

2005 4

2006 4

2003 2

2007 2

2004 1

2018 1

In [12]: Car1 = car_data.groupby(["Year"])[["Car_Name"]].count().rename(columns={"Car_Name":"Total_Cars"}).sort_values("Total_Cars", ascending = False)

In [14]: type(Car1)

Out[14]: pandas.core.frame.DataFrame

In [16]: Car1.head(5)

Out[16]: Total_Cars

Year

2015 61

2016 50

2014 38

2017 35

2013 33
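Sorting in descending order and then taking `head(5)` returns the years with the most cars. `DataFrame.nlargest` is another way to ask for the top rows by a column. The sketch below compares the two on made-up data; `cars` is a hypothetical stand-in for car_data.

```python
import pandas as pd

# Hypothetical stand-in for car_data (values are made up).
cars = pd.DataFrame({
    "Car_Name": list("abcdef"),
    "Year": [2015, 2015, 2015, 2016, 2016, 2014],
})

totals = (
    cars.groupby(["Year"])[["Car_Name"]]
    .count()
    .rename(columns={"Car_Name": "Total_Cars"})
)

# Two ways to get the two years with the most cars.
by_sort = totals.sort_values("Total_Cars", ascending=False).head(2)
by_nlargest = totals.nlargest(2, "Total_Cars")

print(by_nlargest)
```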

In [ ]:
