Unit-03: Capturing, Preparing and Working With Data
Unit-03.02: Let's Learn Pandas
Pandas is an open-source library built on top of NumPy.
It allows for fast data cleaning, preparation and analysis.
It excels in performance and productivity.
It also has built-in visualization features.
It can work with data from a wide variety of sources.
Install:
conda install pandas
OR pip install pandas
Topics
Series
Data Frames
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Missing Data
Group By
Merging, Joining & Concatenating
Operations
Series
A Series is a one-dimensional array with axis labels.
It supports both integer and label-based indexing, but the index must be of a hashable type.
If we do not specify an index, it will assign a zero-based integer index.
Syntax:
import pandas as pd
s = pd.Series(data, index, dtype, copy=False)
Parameters:
data = array-like iterable
index = array-like index
dtype = data-type
copy = bool, default is False
pandasSeries.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)
Output
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
Series (Cont.)
We can then access the elements inside a Series just like an array, using square-bracket notation.
pdSeriesEle.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)
Output
S[0] =  1
Sum =  4
Series (Cont.)
We can specify an index for a Series with the help of the index parameter.
pdSeriesdtype.py
import pandas as pd
i = ['name','address','phone','email','website']
d = ['dahod','dh','123','[email protected]','gecdahod.ac.in']
s = pd.Series(data=d,index=i)
print(s)
Output
name                  dahod
address                  dh
phone                   123
email      [email protected]
website     gecdahod.ac.in
dtype: object
Creating Time Series
We can use some of pandas' built-in date functions to create a time series.
pdSeriesEle.py
import numpy as np
import pandas as pd
dates = pd.to_datetime("27th of July, 2020")
i = dates + pd.to_timedelta(np.arange(5), unit='D')
d = [50,53,25,70,60]
time_series = pd.Series(data=d,index=i)
print(time_series)
Output
2020-07-27    50
2020-07-28    53
2020-07-29    25
2020-07-30    70
2020-07-31    60
dtype: int64
Data Frames
A data frame is a two-dimensional data structure, i.e. data is aligned in a tabular format in rows and columns.
A data frame also has labelled axes on rows and columns.
Features of Data Frame:
It is size-mutable
Has labelled axes
Columns can be of different data types
We can perform arithmetic operations on rows and columns
Structure:
      PDS  Algo  SE  INS
101
102
103
...
160
Data Frames (Cont.)
Syntax:
import pandas as pd
df = pd.DataFrame(data, index, columns, dtype, copy=False)
Parameters:
data = array-like iterable
index = array-like row index
columns = array-like column index
dtype = data-type
copy = bool, default is False
Example:
pdDataFrame.py
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Grabbing a column
dfGrabCol.py
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df['PDS'])
Output
101     0
102    85
103    35
104    66
105    65
Name: PDS, dtype: int32
Grabbing multiple columns
dfGrabMulCol.py
print(df[['PDS', 'SE']])
Output
     PDS  SE
101    0  93
102   85  31
103   35   6
104   66  70
105   65  87
Data Frames (Cont.)
Grabbing a row
dfGrabRow.py
print(df.loc[101])   # label-based row access
Deleting a row
dfDelCol.py
df.drop(103,inplace=True)
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Creating a new column
dfCreateCol.py
df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
print(df)
Output
     PDS  Algo  SE  INS  total
101    0    23  93   46    162
102   85    47  31   12    175
103   35    34   6   89    164
104   66    83  70   50    269
105   65    88  87   87    327
Deleting a column
dfDelCol.py
df.drop('total',axis=1,inplace=True)
print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Getting a subset of a Data Frame
dfGrabSubSet.py
print(df.loc[[101,104], ['PDS','INS']])
Output
     PDS  INS
101    0   46
104   66   50
Conditional Selection
Similar to NumPy, we can do conditional selection in pandas.
dfCondSel.py
import numpy as np
import pandas as pd
np.random.seed(121)
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df)
print(df>50)
Output
     PDS  Algo  SE  INS
101   66    85   8   95
102   65    52  83   96
103   46    34  52   60
104   54     3  94   52
105   57    75  88   39
       PDS   Algo     SE    INS
101   True   True  False   True
102   True   True   True   True
103  False  False   True   True
104   True  False   True   True
105   True   True   True  False
Note: We have used the np.random.seed() method and set the seed to 121, so that the random numbers you generate match the ones generated here.
Conditional Selection (Cont.)
We can then use this boolean DataFrame to get the associated values.
dfCondSel.py
dfBool = df > 50
print(df[dfBool])
Output
      PDS  Algo    SE   INS
101  66.0  85.0   NaN  95.0
102  65.0  52.0  83.0  96.0
103   NaN   NaN  52.0  60.0
104  54.0   NaN  94.0  52.0
105  57.0  75.0  88.0   NaN
Note: It will set NaN (Not a Number) wherever the condition is False.
Setting/Resetting index (Cont.)
set_index(new_index)
dfCondSel.py
df.set_index('PDS')   # use inplace=True to modify df itself
Output
     Algo  SE  INS
PDS
66     85   8   95
65     52  83   96
46     34  52   60
54      3  94   52
57     75  88   39
Note: We have PDS as our index now.
reset_index()
dfCondSel.py
df.reset_index()
Output
   RollNo  PDS  Algo  SE  INS
0     101   66    85   8   95
1     102   65    52  83   96
2     103   46    34  52   60
3     104   54     3  94   52
4     105   57    75  88   39
Note: Our RollNo (index) becomes a new column, and we now have a zero-based numeric index.
Multi-Index DataFrame
Hierarchical indexes (AKA multi-indexes) help us to organize, find, and aggregate information faster at almost no cost.
An example where we need hierarchical indexes:
Numeric Index / Single Index
     Col  Dep  Sem   RN  S1  S2  S3
0    ABC   CE    5  101  50  60  70
1    ABC   CE    5  102  48  70  25
2    ABC   CE    7  101  58  59  51
3    ABC   ME    5  101  30  35  39
4    ABC   ME    5  102  50  90  48
5  Dahod   CE    5  101  88  99  77
6  Dahod   CE    5  102  99  84  76
7  Dahod   CE    7  101  88  77  99
8  Dahod   ME    5  101  44  88  99
Multi Index
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Multi-Index DataFrame (Cont.)
Creating multi-indexes is as simple as creating a single index using the set_index method; the only difference is that for multi-indexes we need to provide a list of indexes instead of a single string index. Let's see an example:
dfMultiIndex.py
dfMulti = pd.read_csv('MultiIndexDemo.csv')
dfMulti.set_index(['Col','Dep','Sem'],inplace=True)
print(dfMulti)
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Multi-Index DataFrame (Cont.)
Now we have a multi-indexed DataFrame from which we can access data using multiple indexes.
For example:
Sub DataFrame for all the students of Dahod
dfGrabDahodStu.py
print(dfMulti.loc['Dahod'])
Output (Dahod)
         RN  S1  S2  S3
Dep Sem
CE  5   101  88  99  77
    5   102  99  84  76
    7   101  88  77  99
ME  5   101  44  88  99
Sub DataFrame for Computer Engineering students from Dahod
dfGrabDahodCEStu.py
print(dfMulti.loc['Dahod','CE'])
Output (Dahod>CE)
      RN  S1  S2  S3
Sem
5    101  88  99  77
5    102  99  84  76
7    101  88  77  99
Reading in Multiindexed DataFrame directly from CSV
The read_csv function of pandas provides an easy way to create a multi-indexed DataFrame directly while fetching the CSV file.
dfMultiIndex.py
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',index_col=[0,1,2])
# for multi-index in cols we can use the header parameter
print(dfMultiCSV)
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
Cross Sections in DataFrame
The xs() function is used to get a cross-section from the Series/DataFrame.
This method takes a key argument to select data at a particular level of a MultiIndex.
Syntax:
DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameters:
key = label
axis = axis to retrieve the cross-section from
level = level of the key
drop_level = False if you want to preserve the level
Example:
dfMultiIndex.py
dfMultiCSV = pd.read_csv('MultiIndexDemo.csv',index_col=[0,1,2])
print(dfMultiCSV)
print(dfMultiCSV.xs('CE',axis=0,level='Dep'))
Output
                RN  S1  S2  S3
Col    Dep Sem
ABC    CE  5   101  50  60  70
           5   102  48  70  25
           7   101  58  59  51
       ME  5   101  30  35  39
           5   102  50  90  48
Dahod  CE  5   101  88  99  77
           5   102  99  84  76
           7   101  88  77  99
       ME  5   101  44  88  99
            RN  S1  S2  S3
Col   Sem
ABC   5    101  50  60  70
      5    102  48  70  25
      7    101  58  59  51
Dahod 5    101  88  99  77
      5    102  99  84  76
      7    101  88  77  99
Dealing with Missing Data
There are many methods by which we can deal with missing data; some of the most common are listed below:
dropna will drop (delete) the missing data (rows/cols)
fillna will fill specified values in place of the missing data
interpolate will interpolate the missing data and fill the interpolated value in its place
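A minimal sketch of the three methods, on a small hypothetical marks table (the data below is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical marks table with two missing values (NaN)
df = pd.DataFrame({'PDS': [50, np.nan, 70],
                   'Algo': [60, 65, np.nan]},
                  index=[101, 102, 103])

print(df.dropna())       # keeps only row 101 (the only row with no NaN)
print(df.fillna(0))      # replaces every NaN with 0
print(df.interpolate())  # fills NaN from the neighbouring values, column-wise
```

Note that interpolate fills the missing PDS mark for 102 with 60.0, the value halfway between 50 and 70.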
Groupby in Pandas
Any groupby operation involves one of the following operations on the original object:
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some functionality on each subset.
We can perform the following operations:
Aggregation − computing a summary statistic
Transformation − performing some group-specific operation
Filtration − discarding the data based on some condition
Basic ways to use the groupby method:
df.groupby('key')
df.groupby(['key1','key2'])
df.groupby(key,axis=1)
Example data:
College  Enno  CPI
Dahod     123  8.9
Dahod     124  9.2
Dahod     125  7.8
Dahod     128  8.7
ABC       211  5.6
ABC       212  6.2
ABC       215  3.2
ABC       218  4.2
XYZ       312  5.2
XYZ       315  6.5
XYZ       315  5.8
Result of grouping by College and taking the mean CPI:
College  Mean CPI
Dahod    8.65
ABC      4.8
XYZ      5.83
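The split-apply-combine steps above can be sketched with the same College/CPI example, reproduced inline (no CSV file is assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'College': ['Dahod']*4 + ['ABC']*4 + ['XYZ']*3,
    'Enno':    [123, 124, 125, 128, 211, 212, 215, 218, 312, 315, 315],
    'CPI':     [8.9, 9.2, 7.8, 8.7, 5.6, 6.2, 3.2, 4.2, 5.2, 6.5, 5.8],
})

# Split by College, apply mean to CPI, combine into one Series
meanCPI = df.groupby('College')['CPI'].mean()
print(meanCPI.round(2))
```

This reproduces the Mean CPI table: Dahod 8.65, ABC 4.8, XYZ 5.83.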
Groupby in Pandas (Cont.)
Example: Listing all the groups
dfGroup.py
Output
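The contents of dfGroup.py are not reproduced on the slide; a minimal sketch of listing groups via the .groups attribute, using inline stand-in data for IPLDataSet.csv (column names assumed from the later slides):

```python
import pandas as pd

# Inline stand-in for IPLDataSet.csv
dfIPL = pd.DataFrame({'Team':   ['Riders', 'Devils', 'Kings', 'Riders'],
                      'Year':   [2014, 2014, 2015, 2015],
                      'Points': [876, 863, 812, 789]})

# .groups maps each group key to the row labels that fall in that group
print(dfIPL.groupby('Year').groups)
```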
Groupby in Pandas (Cont.)
Example: Group by multiple columns
dfGroupMul.py
Output
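The contents of dfGroupMul.py are not reproduced on the slide; a minimal sketch of grouping by a list of columns, again with inline stand-in data:

```python
import pandas as pd

dfIPL = pd.DataFrame({'Team':   ['Riders', 'Riders', 'Kings', 'Kings'],
                      'Year':   [2014, 2015, 2014, 2015],
                      'Points': [876, 789, 741, 812]})

# Grouping on a list of columns produces one group per (Team, Year) pair
totals = dfIPL.groupby(['Team', 'Year'])['Points'].sum()
print(totals)
```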
Groupby in Pandas (Cont.)
Example: Iterating through groups
dfGroupIter.py
dfIPL = pd.read_csv('IPLDataSet.csv')
groupIPL = dfIPL.groupby('Year')
for name,group in groupIPL :
    print(name)
    print(group)
Output
2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
Groupby in Pandas (Cont.)
Example: Aggregating groups
dfGroupAgg.py
dfSales = pd.read_csv('SalesDataSet.csv')
print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])
Output
YEAR_ID
2003    1000
2004    1345
2005     478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34612
2004    46824
2005    17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34.612000
2004    34.813383
2005    36.884937
Name: QUANTITYORDERED, dtype: float64
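The three separate groupby calls above can be collapsed into a single .agg call; a sketch with inline stand-in data for SalesDataSet.csv:

```python
import pandas as pd

# Inline stand-in for SalesDataSet.csv
dfSales = pd.DataFrame({'YEAR_ID': [2003, 2003, 2004],
                        'QUANTITYORDERED': [30, 40, 25]})

# One .agg call computes count, sum and mean per year in a single table
stats = dfSales.groupby('YEAR_ID')['QUANTITYORDERED'].agg(['count', 'sum', 'mean'])
print(stats)
```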
Groupby in Pandas (Cont.)
Example: Describing groups
dfGroupDesc.py
dfIPL = pd.read_csv('IPLDataSet.csv')
print(dfIPL.groupby('Year').describe()['Points'])
Output
      count    mean        std    min    25%    50%     75%    max
Year
2014    4.0  795.25  87.439026  701.0  731.0  802.0  866.25  876.0
2015    4.0  769.50  65.035888  673.0  760.0  796.5  806.00  812.0
2016    2.0  725.00  43.840620  694.0  709.5  725.0  740.50  756.0
2017    2.0  739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
Concatenation basically glues DataFrames together.
Keep in mind that dimensions should match along the axis you are concatenating on.
You can use pd.concat and pass in a list of DataFrames to concatenate together:
dfConcat.py
dfCX = pd.read_csv('CX_Marks.csv',index_col=0)
dfCY = pd.read_csv('CY_Marks.csv',index_col=0)
dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0)
dfAllStudent = pd.concat([dfCX,dfCY,dfCZ])
print(dfAllStudent)
Output
     PDS  Algo  SE
101   50    55  60
102   70    80  61
103   55    89  70
104   58    96  85
201   77    96  63
202   44    78  32
203   55    85  21
204   69    66  54
301   11    75  88
302   22    48  77
303   33    59  68
304   44    55  62
Note: We can use the axis=1 parameter to concatenate columns.
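A minimal sketch of the axis=1 case, with two hypothetical mark sheets that share the same roll-number index:

```python
import pandas as pd

# Two hypothetical mark sheets with the same roll-number index
dfTheory = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
dfLab    = pd.DataFrame({'PDS_Lab': [60, 61]}, index=[101, 102])

# axis=1 glues the DataFrames side by side, aligning rows on the index
dfAll = pd.concat([dfTheory, dfLab], axis=1)
print(dfAll)
```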
Join in Pandas
The df.join() method will efficiently join multiple DataFrame objects by index (or by a specified column).
Some of the important parameters:
dfOther : right Data Frame
on (not recommended) : the column on which we want to join (default is the index)
how : how to handle the operation of the two objects
left : use the calling frame's index (default)
right : use dfOther's index
outer : form the union of the calling frame's index (or column if on is specified) with the other's index, and sort it lexicographically
inner : form the intersection of the calling frame's index (or column if on is specified) with the other's index, preserving the order of the calling frame's index
Join in Pandas (Example)
dfJoin.py
dfINS = pd.read_csv('INS_Marks.csv',index_col=0)
dfLeftJoin = dfAllStudent.join(dfINS)
print(dfLeftJoin)
dfRightJoin = dfAllStudent.join(dfINS,how='right')
print(dfRightJoin)
Output - 1
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
202   44    78  32   NaN
203   55    85  21  78.0
204   69    66  54  85.0
301   11    75  88  11.0
302   22    48  77  22.0
303   33    59  68  33.0
304   44    55  62  44.0
Output - 2
     PDS  Algo  SE  INS
301   11    75  88   11
302   22    48  77   22
303   33    59  68   33
304   44    55  62   44
101   50    55  60   55
102   70    80  61   66
103   55    89  70   77
104   58    96  85   88
201   77    96  63   66
203   55    85  21   78
204   69    66  54   85
Merge in Pandas
Merge DataFrame or named Series objects with a database-style join.
Similar to the join method, but used when we want to join/merge on columns instead of the index.
Some of the important parameters:
dfOther : right Data Frame
on : the column(s) to join on, which must be found in both DataFrames (defaults to the intersection of the columns)
left_on : the column(s) of the left Data Frame to join on
right_on : the column(s) of the right Data Frame to join on
how : how to handle the operation of the two objects
left : use only keys from the left frame
right : use only keys from the right frame
outer : use the union of keys from both frames, sorted lexicographically
inner : use the intersection of keys from both frames, preserving the order of the left keys (default)
Merge in Pandas (Example)
dfMerge.py
m1 = pd.read_csv('Merge1.csv')
print(m1)
m2 = pd.read_csv('Merge2.csv')
print(m2)
m3 = m1.merge(m2,on='EnNo')
print(m3)
Output
   RollNo      EnNo Name
0     101  11112222  Abc
1     102  11113333  Xyz
2     103  22224444  Def
       EnNo  PDS  INS
0  11112222   50   60
1  11113333   60   70
   RollNo      EnNo Name  PDS  INS
0     101  11112222  Abc   50   60
1     102  11113333  Xyz   60   70
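As a sketch of the how parameter, the same merge with how='outer' also keeps the unmatched EnNo, filling its marks with NaN (the data is reproduced inline; the CSV files are not assumed):

```python
import pandas as pd

# Inline stand-ins for Merge1.csv / Merge2.csv
m1 = pd.DataFrame({'RollNo': [101, 102, 103],
                   'EnNo':   [11112222, 11113333, 22224444],
                   'Name':   ['Abc', 'Xyz', 'Def']})
m2 = pd.DataFrame({'EnNo': [11112222, 11113333],
                   'PDS':  [50, 60],
                   'INS':  [60, 70]})

# how='outer' keeps EnNo 22224444 too, with NaN for its missing marks
m3 = m1.merge(m2, on='EnNo', how='outer')
print(m3)
```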
Read CSV in Pandas
read_csv() is used to read a Comma Separated Values (CSV) file into a pandas DataFrame.
Some of the important parameters:
filePath : str, path object, or file-like object
sep : separator (default is comma)
header : row number(s) to use as the column names
index_col : index column(s) of the data frame
readCSV.py
dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
print(dfINS)
Output
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
Read Excel in Pandas
Read an Excel file into a pandas DataFrame.
Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions, read from a local filesystem or URL.
Supports an option to read a single sheet or a list of sheets.
Some of the important parameters:
excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
sheet_name : sheet number (integer) or sheet name; can be a list of sheets
index_col : index column of the data frame
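A minimal round-trip sketch (the file name Marks.xlsx is hypothetical; assumes an Excel engine such as openpyxl is installed alongside pandas):

```python
import pandas as pd

# Write a small sheet, then read it back with read_excel
df = pd.DataFrame({'PDS': [50, 70], 'Algo': [55, 80]}, index=[101, 102])
df.to_excel('Marks.xlsx', sheet_name='CX')

# index_col=0 restores the roll numbers as the index
dfBack = pd.read_excel('Marks.xlsx', sheet_name='CX', index_col=0)
print(dfBack)
```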
Read from MySQL Database
We need two libraries for that:
conda install sqlalchemy
conda install pymysql
After installing both libraries, import create_engine from sqlalchemy and import pymysql.
importsForDB.py
from sqlalchemy import create_engine
import pymysql
Then, create a database connection string and create an engine using it.
createEngine.py
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
After getting the engine, we can fire any SQL query using the pd.read_sql method.
read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.).
readSQLDemo.py
df = pd.read_sql('SELECT * FROM cities', con=db_connection)
print(df)
Output
   CityID   CityName             CityDescription CityCode
0       1     Rajkot     Rajkot Description here      RJT
1       2  Ahemdabad  Ahemdabad Description here      ADI
2       3      Surat       Surat Description here      SRT
Web Scraping using Beautiful Soup
Beautiful Soup is a library that makes it easy to scrape information from web pages.
It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
webScrap.py
import requests
import bs4
req = requests.get('https://www.gecdahod.ac.in/Faculty')
soup = bs4.BeautifulSoup(req.text,'lxml')
allFaculty = soup.select('body > main > section:nth-child(5) > div > div > div.col-lg-8.col-xl-9 > div > div')
for fac in allFaculty :
    allSpans = fac.select('h2>a')
    print(allSpans[0].text.strip())
Output
Dr. Gopi Sanghani
Dr. Nilesh Gambhava
Dr. Pradyumansinh Jadeja
Prof. Hardik Doshi
Prof. Maulik Trivedi
Prof. Dixita Kagathara
Prof. Firoz Sherasiya
Prof. Rupesh Vaishnav
Prof. Swati Sharma
Prof. Arjun Bala
Prof. Mayur Padia
...