Unit-03: Capturing, Preparing and Working With Data
Unit-03.02: Let's Learn Pandas
Pandas is an open-source library built on top of NumPy.
It allows for fast data cleaning, preparation and analysis.
It excels in performance and productivity.
It also has built-in visualization features.
It can work with data from a wide variety of sources.
Install:
conda install pandas
OR pip install pandas
Topics
Series
Data Frames
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Missing Data
Group By
Merging, Joining & Concatenating
Operations
Series
A Series is a one-dimensional array with axis labels.
It supports both integer and label-based indexing, but the index must be of a hashable type.
If we do not specify an index, it will assign a zero-based integer index.
Syntax:
import pandas as pd
s = pd.Series(data, index, dtype, copy=False)
Parameters:
data = array-like iterable
index = array-like index
dtype = data-type
copy = bool, default is False
pandasSeries.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)
Output
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
Series (Cont.)
We can then access the elements inside a Series just like an array, using square-bracket notation.
pdSeriesEle.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)
Output
S[0] =  1
Sum =  4
Series (Cont.)
We can specify an index for a Series with the help of the index parameter.
pdSeriesdtype.py
import pandas as pd
i = ['name','address','phone','email','website']
d = ['dahod','dh','123','[email protected]','gecdahod.ac.in']
s = pd.Series(data=d,index=i)
print(s)
Output
name                  dahod
address                  dh
phone                   123
email      [email protected]
website     gecdahod.ac.in
dtype: object
Creating Time Series
We can use some of pandas' built-in date functions to create a time series.
pdSeriesEle.py
import numpy as np
import pandas as pd
dates = pd.to_datetime("27th of July, 2020")
i = dates + pd.to_timedelta(np.arange(5), unit='D')
d = [50,53,25,70,60]
time_series = pd.Series(data=d,index=i)
print(time_series)
Output
2020-07-27    50
2020-07-28    53
2020-07-29    25
2020-07-30    70
2020-07-31    60
dtype: int64
Data Frames
A data frame is a two-dimensional data structure, i.e. data is aligned in a tabular format in rows and columns.
A data frame also has labelled axes on rows and columns.
Features of Data Frame:
It is size-mutable
Has labelled axes
Columns can be of different data types
We can perform arithmetic operations on rows and columns
Structure:
      PDS  Algo  SE  INS
101
102
103
...
160
Data Frames (Cont.)
Syntax:
import pandas as pd
df = pd.DataFrame(data, index, columns, dtype, copy=False)
Parameters:
data = array-like iterable
index = array-like row index
columns = array-like column index
dtype = data-type
copy = bool, default is False
Example:
pdDataFrame.py
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Grabbing a column
dfGrabCol.py
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df['PDS'])
Output
101     0
102    85
103    35
104    66
105    65
Name: PDS, dtype: int32
Grabbing multiple columns
dfGrabMulCol.py
print(df[['PDS', 'SE']])
Output
     PDS  SE
101    0  93
102   85  31
103   35   6
104   66  70
105   65  87
Data Frames (Cont.)
Grabbing a row
dfGrabRow.py
print(df.loc[101])   # label-based row access
Deleting a row
dfDelCol.py
df.drop(103,inplace=True)
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Creating a new column
dfCreateCol.py
df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
print(df)
Output
     PDS  Algo  SE  INS  total
101    0    23  93   46    162
102   85    47  31   12    175
103   35    34   6   89    164
104   66    83  70   50    269
105   65    88  87   87    327
Deleting a column
dfDelCol.py
df.drop('total',axis=1,inplace=True)
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Getting a subset of a Data Frame
dfGrabSubSet.py
print(df.loc[[101,104], ['PDS','INS']])
Output
     PDS  INS
101    0   46
104   66   50
Conditional Selection
Similar to NumPy, we can do conditional selection in pandas.
dfCondSel.py
import numpy as np
import pandas as pd
np.random.seed(121)
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df)
print(df>50)
Output
     PDS  Algo  SE  INS
101   66    85   8   95
102   65    52  83   96
103   46    34  52   60
104   54     3  94   52
105   57    75  88   39
       PDS   Algo     SE    INS
101   True   True  False   True
102   True   True   True   True
103  False  False   True   True
104   True  False   True   True
105   True   True   True  False
Note: We have used the np.random.seed() method and set the seed to 121, so that the random numbers you generate match the ones generated here.
Conditional Selection (Cont.)
We can then use this boolean DataFrame to get the associated values.
dfCondSel.py
dfBool = df > 50
print(df[dfBool])
Output
      PDS  Algo    SE   INS
101  66.0  85.0   NaN  95.0
102  65.0  52.0  83.0  96.0
103   NaN   NaN  52.0  60.0
104  54.0   NaN  94.0  52.0
105  57.0  75.0  88.0   NaN
Note: It will set NaN (Not a Number) wherever the condition is False.
Setting/Resetting index (Cont.)
set_index(new_index)
dfCondSel.py
df.set_index('PDS')   # use inplace=True to modify df itself
Output
     Algo  SE  INS
PDS
66     85   8   95
65     52  83   96
46     34  52   60
54      3  94   52
57     75  88   39
Note: We have PDS as our index now.
reset_index()
dfCondSel.py
df.reset_index()
Output
   RollNo  PDS  Algo  SE  INS
0     101   66    85   8   95
1     102   65    52  83   96
2     103   46    34  52   60
3     104   54     3  94   52
4     105   57    75  88   39
Note: Our RollNo (index) becomes a new column, and we now have a zero-based numeric index.
Multi-Index DataFrame
Hierarchical indexes (AKA multi-indexes) help us to organize, find, and aggregate information faster at almost no cost.
An example where we need hierarchical indexes:
Numeric Index / Single Index
     Col  Dep  Sem   RN  S1  S2  S3
0    ABC   CE    5  101  50  60  70
1    ABC   CE    5  102  48  70  25
2    ABC   CE    7  101  58  59  51
3    ABC   ME    5  101  30  35  39
4    ABC   ME    5  102  50  90  48
5  Dahod   CE    5  101  88  99  77
6  Dahod   CE    5  102  99  84  76
7  Dahod   CE    7  101  88  77  99
8  Dahod   ME    5  101  44  88  99
Multi Index
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Multi-Index DataFrame (Cont.)
Creating multi-indexes is as simple as creating a single index using the set_index method; the only difference is that for multi-indexes we need to provide a list of indexes instead of a single string index. Let's see an example:
dfMultiIndex.py
dfMulti = pd.read_csv('MultiIndexDemo.csv')
dfMulti.set_index(['Col','Dep','Sem'],inplace=True)
print(dfMulti)
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Multi-Index DataFrame (Cont.)
Now we have a multi-indexed DataFrame from which we can access data using multiple indexes.
For example:
Sub DataFrame for all the students of Dahod
dfGrabDahodStu.py
print(dfMulti.loc['Dahod'])
Output (Dahod)
         RN  S1  S2  S3
Dep Sem
CE  5   101  88  99  77
    5   102  99  84  76
    7   101  88  77  99
ME  5   101  44  88  99
Sub DataFrame for Computer Engineering students from Dahod
dfGrabDahodCEStu.py
print(dfMulti.loc['Dahod','CE'])
Output (Dahod>CE)
      RN  S1  S2  S3
Sem
5    101  88  99  77
5    102  99  84  76
7    101  88  77  99
Reading in Multiindexed DataFrame directly from CSV
The read_csv function of pandas provides an easy way to create a multi-indexed DataFrame directly while fetching the CSV file.
dfMultiIndex.py
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',index_col=[0,1,2])
# for multi-index in cols we can use the header parameter
print(dfMultiCSV)
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Cross Sections in DataFrame
The xs() function is used to get a cross-section from the Series/DataFrame.
This method takes a key argument to select data at a particular level of a MultiIndex.
Syntax:
DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameters:
key = label
axis = axis to retrieve the cross-section from
level = level of the key
drop_level = False if you want to preserve the level
Example:
dfMultiIndex.py
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',index_col=[0,1,2])
print(dfMultiCSV)
print(dfMultiCSV.xs('CE',axis=0,level='Dep'))
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
            RN  S1  S2  S3
Col   Sem
ABC   5    101  50  60  70
      5    102  48  70  25
      7    101  58  59  51
Dahod 5    101  88  99  77
      5    102  99  84  76
      7    101  88  77  99
Dealing with Missing Data
There are many methods by which we can deal with missing data; some of the most common are listed below:
dropna will drop (delete) the missing data (rows/cols)
fillna will fill specified values in place of the missing data
interpolate will interpolate the missing data and fill the interpolated value in its place
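A minimal sketch of the three methods, on a small hypothetical marks table (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical marks table with two missing values (NaN)
df = pd.DataFrame({'PDS': [50, np.nan, 70],
                   'Algo': [60, 65, np.nan]},
                  index=[101, 102, 103])

print(df.dropna())       # keeps only row 101 (the only row with no NaN)
print(df.fillna(0))      # replaces every NaN with 0
print(df.interpolate())  # fills NaN from the neighbouring values, column-wise
```

Note that interpolate fills the missing PDS mark for 102 with 60.0, the value halfway between 50 and 70.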
Groupby in Pandas
Any groupby operation involves one of the following operations on the original object:
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some functionality on each subset.
We can perform the following operations:
Aggregation − computing a summary statistic
Transformation − performing some group-specific operation
Filtration − discarding the data based on some condition
Basic ways to use the groupby method:
df.groupby('key')
df.groupby(['key1','key2'])
df.groupby(key,axis=1)
Example data:
College  Enno  CPI
Dahod     123  8.9
Dahod     124  9.2
Dahod     125  7.8
Dahod     128  8.7
ABC       211  5.6
ABC       212  6.2
ABC       215  3.2
ABC       218  4.2
XYZ       312  5.2
XYZ       315  6.5
XYZ       315  5.8
Result of grouping by College and taking the mean CPI:
College  Mean CPI
Dahod    8.65
ABC      4.8
XYZ      5.83
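The split-apply-combine steps above can be sketched with the same College/CPI example, reproduced inline (no CSV file is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'College': ['Dahod']*4 + ['ABC']*4 + ['XYZ']*3,
    'Enno':    [123, 124, 125, 128, 211, 212, 215, 218, 312, 315, 315],
    'CPI':     [8.9, 9.2, 7.8, 8.7, 5.6, 6.2, 3.2, 4.2, 5.2, 6.5, 5.8],
})

# Split by College, apply mean to CPI, combine into one Series
meanCPI = df.groupby('College')['CPI'].mean()
print(meanCPI.round(2))
```

This reproduces the Mean CPI table: Dahod 8.65, ABC 4.8, XYZ 5.83.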
Groupby in Pandas (Cont.)
Example: Listing all the groups
dfGroup.py
Output
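The contents of dfGroup.py are not reproduced on the slide; a minimal sketch of listing groups via the .groups attribute, using inline stand-in data for IPLDataSet.csv (column names assumed from the later slides):

```python
import pandas as pd

# Inline stand-in for IPLDataSet.csv
dfIPL = pd.DataFrame({'Team':   ['Riders', 'Devils', 'Kings', 'Riders'],
                      'Year':   [2014, 2014, 2015, 2015],
                      'Points': [876, 863, 812, 789]})

# .groups maps each group key to the row labels that fall in that group
print(dfIPL.groupby('Year').groups)
```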
Groupby in Pandas (Cont.)
Example: Group by multiple columns
dfGroupMul.py
Output
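The contents of dfGroupMul.py are not reproduced on the slide; a minimal sketch of grouping by a list of columns, again with inline stand-in data:

```python
import pandas as pd

dfIPL = pd.DataFrame({'Team':   ['Riders', 'Riders', 'Kings', 'Kings'],
                      'Year':   [2014, 2015, 2014, 2015],
                      'Points': [876, 789, 741, 812]})

# Grouping on a list of columns produces one group per (Team, Year) pair
totals = dfIPL.groupby(['Team', 'Year'])['Points'].sum()
print(totals)
```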
Groupby in Pandas (Cont.)
Example: Iterating through groups
dfGroupIter.py
dfIPL = pd.read_csv('IPLDataSet.csv')
groupIPL = dfIPL.groupby('Year')
for name,group in groupIPL :
    print(name)
    print(group)
Output
2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
Groupby in Pandas (Cont.)
Example: Aggregating groups
dfGroupAgg.py
dfSales = pd.read_csv('SalesDataSet.csv')
print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])
Output
YEAR_ID
2003    1000
2004    1345
2005     478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34612
2004    46824
2005    17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34.612000
2004    34.813383
2005    36.884937
Name: QUANTITYORDERED, dtype: float64
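The three separate groupby calls above can be collapsed into a single .agg call; a sketch with inline stand-in data for SalesDataSet.csv:

```python
import pandas as pd

# Inline stand-in for SalesDataSet.csv
dfSales = pd.DataFrame({'YEAR_ID': [2003, 2003, 2004],
                        'QUANTITYORDERED': [30, 40, 25]})

# One .agg call computes count, sum and mean per year in a single table
stats = dfSales.groupby('YEAR_ID')['QUANTITYORDERED'].agg(['count', 'sum', 'mean'])
print(stats)
```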
Groupby in Pandas (Cont.)
Example: Describing groups
dfGroupDesc.py
dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').describe()['Points'])
Output
      count    mean        std    min    25%    50%     75%    max
Year
2014    4.0  795.25  87.439026  701.0  731.0  802.0  866.25  876.0
2015    4.0  769.50  65.035888  673.0  760.0  796.5  806.00  812.0
2016    2.0  725.00  43.840620  694.0  709.5  725.0  740.50  756.0
2017    2.0  739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
Concatenation basically glues DataFrames together.
Keep in mind that dimensions should match along the axis you are concatenating on.
You can use pd.concat and pass in a list of DataFrames to concatenate together:
dfConcat.py
dfCX = pd.read_csv('CX_Marks.csv',index_col=0)
dfCY = pd.read_csv('CY_Marks.csv',index_col=0)
dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0)
dfAllStudent = pd.concat([dfCX,dfCY,dfCZ])
print(dfAllStudent)
Output
     PDS  Algo  SE
101   50    55  60
102   70    80  61
103   55    89  70
104   58    96  85
201   77    96  63
202   44    78  32
203   55    85  21
204   69    66  54
301   11    75  88
302   22    48  77
303   33    59  68
304   44    55  62
Note: We can use the axis=1 parameter to concatenate columns.
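A minimal sketch of the axis=1 case, with two hypothetical mark sheets that share the same roll-number index:

```python
import pandas as pd

# Two hypothetical mark sheets with the same roll-number index
dfTheory = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
dfLab    = pd.DataFrame({'PDS_Lab': [60, 61]}, index=[101, 102])

# axis=1 glues the DataFrames side by side, aligning rows on the index
dfAll = pd.concat([dfTheory, dfLab], axis=1)
print(dfAll)
```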
Join in Pandas
The df.join() method will efficiently join multiple DataFrame objects by index (or by a specified column).
Some of the important parameters:
dfOther : right Data Frame
on (not recommended) : the column on which we want to join (default is the index)
how : how to handle the operation of the two objects
left : use the calling frame's index (default)
right : use dfOther's index
outer : form the union of the calling frame's index (or column if on is specified) with the other's index, and sort it lexicographically
inner : form the intersection of the calling frame's index (or column if on is specified) with the other's index, preserving the order of the calling frame's index
Join in Pandas (Example)
dfJoin.py
dfINS = pd.read_csv('INS_Marks.csv',index_col=0)
dfLeftJoin = dfAllStudent.join(dfINS)
print(dfLeftJoin)
dfRightJoin = dfAllStudent.join(dfINS,how='right')
print(dfRightJoin)
Output - 1
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
202   44    78  32   NaN
203   55    85  21  78.0
204   69    66  54  85.0
301   11    75  88  11.0
302   22    48  77  22.0
303   33    59  68  33.0
304   44    55  62  44.0
Output - 2
     PDS  Algo  SE  INS
301   11    75  88   11
302   22    48  77   22
303   33    59  68   33
304   44    55  62   44
101   50    55  60   55
102   70    80  61   66
103   55    89  70   77
104   58    96  85   88
201   77    96  63   66
203   55    85  21   78
204   69    66  54   85
Merge in Pandas
Merge DataFrame or named Series objects with a database-style join.
Similar to the join method, but used when we want to join/merge on columns instead of the index.
Some of the important parameters:
dfOther : right Data Frame
on : the column(s) to join on, which must be found in both DataFrames (defaults to the intersection of the columns)
left_on : the column(s) of the left Data Frame to join on
right_on : the column(s) of the right Data Frame to join on
how : how to handle the operation of the two objects
left : use only keys from the left frame
right : use only keys from the right frame
outer : use the union of keys from both frames, sorted lexicographically
inner : use the intersection of keys from both frames, preserving the order of the left keys (default)
Merge in Pandas (Example)
dfMerge.py
m1 = pd.read_csv('Merge1.csv')
print(m1)
m2 = pd.read_csv('Merge2.csv')
print(m2)
m3 = m1.merge(m2,on='EnNo')
print(m3)
Output
   RollNo      EnNo Name
0     101  11112222  Abc
1     102  11113333  Xyz
2     103  22224444  Def
       EnNo  PDS  INS
0  11112222   50   60
1  11113333   60   70
   RollNo      EnNo Name  PDS  INS
0     101  11112222  Abc   50   60
1     102  11113333  Xyz   60   70
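As a sketch of the how parameter, the same merge with how='outer' also keeps the unmatched EnNo, filling its marks with NaN (the data is reproduced inline; the CSV files are not assumed):

```python
import pandas as pd

# Inline stand-ins for Merge1.csv / Merge2.csv
m1 = pd.DataFrame({'RollNo': [101, 102, 103],
                   'EnNo':   [11112222, 11113333, 22224444],
                   'Name':   ['Abc', 'Xyz', 'Def']})
m2 = pd.DataFrame({'EnNo': [11112222, 11113333],
                   'PDS':  [50, 60],
                   'INS':  [60, 70]})

# how='outer' keeps EnNo 22224444 too, with NaN for its missing marks
m3 = m1.merge(m2, on='EnNo', how='outer')
print(m3)
```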
Read CSV in Pandas
read_csv() is used to read a Comma Separated Values (CSV) file into a pandas DataFrame.
Some of the important parameters:
filePath : str, path object, or file-like object
sep : separator (default is comma)
header : row number(s) to use as the column names
index_col : index column(s) of the data frame
readCSV.py
dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
print(dfINS)
Output
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
Read Excel in Pandas
Read an Excel file into a pandas DataFrame.
Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions, read from a local filesystem or URL.
Supports an option to read a single sheet or a list of sheets.
Some of the important parameters:
excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
sheet_name : sheet number (integer) or sheet name; can be a list of sheets
index_col : index column of the data frame
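A minimal round-trip sketch (the file name Marks.xlsx is hypothetical; assumes an Excel engine such as openpyxl is installed alongside pandas):

```python
import pandas as pd

# Write a small sheet, then read it back with read_excel
df = pd.DataFrame({'PDS': [50, 70], 'Algo': [55, 80]}, index=[101, 102])
df.to_excel('Marks.xlsx', sheet_name='CX')

# index_col=0 restores the roll numbers as the index
dfBack = pd.read_excel('Marks.xlsx', sheet_name='CX', index_col=0)
print(dfBack)
```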
Read from MySQL Database
We need two libraries for that:
conda install sqlalchemy
conda install pymysql
After installing both libraries, import create_engine from sqlalchemy and import pymysql.
importsForDB.py
from sqlalchemy import create_engine
import pymysql
Then, create a database connection string and create an engine using it.
createEngine.py
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
After getting the engine, we can fire any SQL query using the pd.read_sql method.
read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.).
readSQLDemo.py
df = pd.read_sql('SELECT * FROM cities', con=db_connection)
print(df)
Output
   CityID   CityName             CityDescription CityCode
0       1     Rajkot     Rajkot Description here      RJT
1       2  Ahemdabad  Ahemdabad Description here      ADI
2       3      Surat       Surat Description here      SRT
Web Scraping using Beautiful Soup
Beautiful Soup is a library that makes it easy to scrape information from web pages.
It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
webScrap.py
import requests
import bs4
req = requests.get('https://www.gecdahod.ac.in/Faculty')
soup = bs4.BeautifulSoup(req.text,'lxml')
allFaculty = soup.select('body > main > section:nth-child(5) > div > div > div.col-lg-8.col-xl-9 > div > div')
for fac in allFaculty :
    allSpans = fac.select('h2>a')
    print(allSpans[0].text.strip())
Output
Dr. Gopi Sanghani
Dr. Nilesh Gambhava
Dr. Pradyumansinh Jadeja
Prof. Hardik Doshi
Prof. Maulik Trivedi
Prof. Dixita Kagathara
Prof. Firoz Sherasiya
Prof. Rupesh Vaishnav
Prof. Swati Sharma
Prof. Arjun Bala
Prof. Mayur Padia
...