Data Manipulation With Pandas
Data science unit 2
Uploaded by bonamkotaiah

PANDAS

INTRODUCTION
Pandas is an open-source Python library that provides high-performance data
manipulation and analysis tools through its powerful data structures. The name Pandas is
derived from the term Panel Data. It was developed by Wes McKinney in 2008.

Using Pandas, we can accomplish five typical steps in the processing and analysis of
data:

Load
Prepare
Manipulate
Model
Analyze

---

INTRODUCING PANDAS OBJECTS


Pandas objects are enhanced versions of NumPy structured arrays in which the rows
and columns are identified with labels rather than simple integer indices. The three fundamental
Pandas objects are: Series, DataFrame, and Index.

 The Pandas Series Object


A Series is a one-dimensional labelled array capable of holding data of any type (integer,
string, float, Python objects, etc.). A pandas Series can be created using the following
constructor:

pandas.Series(data, index, dtype, copy)

data - Data takes various forms like ndarray, list, constants
index - Index labels; must be hashable and the same length as data
dtype - Data type. If None, the data type will be inferred
copy - Copy the input data. Default False
Ex:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The Series wraps both a sequence of values and a sequence of indices, which we can
access with the values and index attributes. Like with a NumPy array, data can be accessed by
the associated index using Python square-bracket notation:
Ex:
data[1]
0.5
data[1:3]
1 0.50
2 0.75
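The values and index attributes mentioned above can be inspected directly; a short, self-contained sketch:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# .values exposes the underlying NumPy array; .index is a pd.Index object
print(data.values)
print(data.index)
```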

A series can be created using various inputs like – Array, Dictionary, Scalar value or
constant.

Series as generalized NumPy array


The NumPy array has an implicitly defined integer index used to access the values.
The Pandas Series has an explicitly defined index associated with the values. This explicit
index definition gives the Series object additional capabilities. For example, the index need
not be an integer, but can consist of values of any desired type.

Ex:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
We can even use non-contiguous or nonsequential indices.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00

data[5]
0.5

Series as specialized dictionary

A dictionary can be passed as input. If no index is specified, the dictionary keys are used
to construct the index (older pandas versions sorted the keys; since pandas 0.23 insertion
order is preserved). If an index is passed, the values in data corresponding to the labels
in the index will be pulled out.
Ex:
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)
print(s)
output
a    0.0
b    1.0
c    2.0

Dictionary keys are also used to construct index. In this case Index order is persisted
and the missing element is filled with NaN (Not a Number).
Ex:
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
b    1.0
c    2.0
d    NaN
a    0.0
Create a Series from Scalar

If data is a scalar value, an index must be provided. The value will be repeated to
match the length of index.

Ex:
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
output
0 5
1 5
2 5
3 5

 The Pandas DataFrame Object

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular
fashion in rows and columns. A pandas DataFrame can be created as:

pandas.DataFrame( data, index, columns, dtype, copy)


data - Data takes various forms like ndarray, Series, map, lists, dict, constants and also
another DataFrame.
index - Row labels; optional. Defaults to np.arange(n) if no index is passed.
columns - Column labels; optional. Defaults to np.arange(n) if no columns are passed.
dtype - Data type of each column.
copy - Copy the input data. Default False.

Create DataFrame:

A pandas DataFrame can be created using various inputs like -

Lists, dict, Series, Numpy ndarrays, Another DataFrame.


Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.
Ex:
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Name Age
0 Alex 10
1 Bob 12
2 Clarke 13

Create a DataFrame from Dict of ndarrays / Lists


All the ndarrays must be of the same length. If an index is passed, then the length of the
index should equal the length of the arrays.
Ex:
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

Name Age
rank1 Tom 28
rank2 Jack 34
rank3 Steve 29
rank4 Ricky 42

Create a DataFrame from List of Dicts


List of Dictionaries can be passed as input data to create a DataFrame. The dictionary
keys are by default taken as column names.
Ex:
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

a b c
first 1 2 NaN
second 5 10 20.0
Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a DataFrame. The resultant index is the
union of all the series indexes passed.
Ex:
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4

 The Pandas Index Object


This Index object can be thought of either as an immutable array or as an ordered set.
Index objects may contain repeated values.

Index as immutable array


The Index object in many ways operates like an array. For example, we can use
standard Python indexing notation to retrieve values or slices:
ind = pd.Index([2, 3, 5, 7, 11])
ind[1]
3

ind[::2]
Int64Index([2, 5, 11], dtype='int64')

Index as ordered set

indA = pd.Index([1, 3, 5, 7, 9])


indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
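Note that the &, |, and ^ operators on Index objects were deprecated as set operations in later pandas releases (and behave differently in pandas 2.x). The equivalent method calls, which work across versions, can be sketched as:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))          # elements in both
print(indA.union(indB))                 # elements in either
print(indA.symmetric_difference(indB))  # elements in exactly one
```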

---
DATA INDEXING AND SELECTION
Indexing in Pandas means selecting, accessing, or modifying values in Pandas Series
and DataFrames.

 Data selection in series


Series object acts in many ways like a one-dimensional NumPy array and also like a
standard Python dictionary.

 Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a
collection of values:

Ex:

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

data

a 0.25
b 0.50
c 0.75
d 1.00

data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:

Ex:

>>> 'a' in data
True

>>> data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
>>> list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

 Series as one-dimensional array

A Series object also provides array-style item selection.

Ex:

# slicing by explicit index. The final index is included in the slice.

data['a':'c']
a 0.25
b 0.50
c 0.75

# slicing by implicit integer index. The final index is not included in the slice.

data[0:2]
a 0.25
b 0.50

# masking

data[(data > 0.3) & (data < 0.8)]

b 0.50
c 0.75

# fancy indexing (the entry 'e' must exist first)

data['e'] = 1.25
data[['a', 'e']]

a    0.25
e    1.25

indexers ( loc, iloc and ix )

Slicing and indexing conventions can be confusing. If a Series has an explicit integer index,
an indexing operation such as data[1] will use the explicit index, but a slicing operation
like data[1:3] will use the implicit integer index.

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])


data
1 a
3 b
5 c

# explicit index when indexing

data[1]
'a'

# implicit index when slicing

data[1:3]
3 b
5 c

Because of this, pandas provides special indexer attributes to expose certain indexing
schemes.

loc

The loc attribute references the explicit index for indexing and slicing (label-location
based):

Ex:

data.loc[1]
'a'

data.loc[1:3]
1 a
3 b

iloc

The iloc attribute references the implicit Python-style index for indexing and slicing
(integer-location based)

Ex:

data.iloc[1]
'b'

data.iloc[1:3]
3 b
5 c
ix

The ix attribute is a hybrid of the two, and for Series objects is equivalent to
standard []-based indexing. (Note: .ix was deprecated in pandas 0.20 and removed in
pandas 1.0; use .loc or .iloc instead.)

 Data selection in DataFrames


There are some indexing methods in Pandas which help in getting an element from a
DataFrame. These indexing methods, called indexers, appear very similar but behave very
differently. Pandas supports four types of multi-axes indexing:

 DataFrame[ ] : also known as the indexing operator
 DataFrame.loc[ ] : used for label-based indexing
 DataFrame.iloc[ ] : used for position (integer)-based indexing
 DataFrame.ix[ ] : used for both label- and integer-based indexing (removed in modern pandas)

Indexing a Dataframe using indexing operator [] :

We can access columns of a DataFrame using the bracket ([]) operator. For example:

import pandas as pd

# create a DataFrame

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 18, 47, 33],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(data)

# access the Name column

names = df['Name']
print(names)

output

0 Alice
1 Bob
2 Charlie
3 David
4 Eve
We can also access multiple columns using the [] operator. For example,

# access multiple columns

name_city = df[['Name','City']]
print(name_city)

output

Name City
0 Alice New York
1 Bob Paris
2 Charlie London
3 David Tokyo
4 Eve Sydney

The [] operator, however, provides limited functionality. Even basic operations like
selecting rows, slicing DataFrames and selecting individual elements are quite tricky using
the [] operator only.
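One operation the [] operator does handle well is boolean row selection. Using the same example DataFrame as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 32, 18, 47, 33],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
})

# a boolean Series inside [] selects the matching rows
adults_over_30 = df[df['Age'] > 30]
print(adults_over_30)
```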

Indexing a DataFrame using .loc[ ] :

This function selects data by the label of the rows and columns. The df.loc indexer
selects data in a different way than just the indexing operator. It can select subsets of rows or
columns.

Selecting a single row

In order to select a single row using .loc[], we put a single row label in the .loc indexer.
(The examples below assume a DataFrame named data that is indexed by player name, e.g.
loaded from an nba.csv file.)

# retrieving rows by loc method

first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]

Selecting multiple rows

# retrieving multiple rows by loc method

first = data.loc[["Avery Bradley", "R.J. Hunter"]]


Selecting two rows and three columns

In order to select two rows and three columns, we select two rows which we want to
select and three columns and put it in a separate list like this:

Dataframe.loc[["row1", "row2"], ["column1", "column2", "column3"]]

# retrieving two rows and three columns by loc method

first = data.loc[["Avery Bradley", "R.J. Hunter"], ["Team", "Number", "Position"]]

Selecting all of the rows and some columns

In order to select all of the rows and some columns, we use a single colon [:] to select all
rows, and a list of the columns which we want to select:

first = data.loc[:, ["Team", "Number", "Position"]]
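The .loc snippets above reference an nba.csv-style dataset; here is a self-contained sketch with hypothetical player data (the team and jersey numbers are made up for illustration):

```python
import pandas as pd

data = pd.DataFrame(
    {'Team': ['Boston Celtics', 'Boston Celtics'],
     'Number': [0, 28],
     'Position': ['PG', 'SG']},
    index=['Avery Bradley', 'R.J. Hunter'])

first = data.loc['Avery Bradley']                  # a single row (returned as a Series)
rows = data.loc[['Avery Bradley', 'R.J. Hunter']]  # multiple rows
subset = data.loc[:, ['Team', 'Position']]         # all rows, selected columns
print(subset)
```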

Indexing a DataFrame using .iloc[ ] :

In Pandas, the .iloc property is used to access and modify data within a DataFrame
using integer-based indexing. It allows us to select specific rows and columns based on their
integer locations.

import pandas as pd

# create a DataFrame

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 18, 47, 33],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}

df = pd.DataFrame(data)

# access a single row

single_row = df.iloc[2]
print("Single Row:")
print(single_row)

output:

Name Charlie
Age 18
City London

# access rows 0, 3 and 4

row_list = df.iloc[[0, 3, 4]]
print("List of Rows:")
print(row_list)

output:

Name Age City


0 Alice 25 New York
3 David 47 Tokyo
4 Eve 33 Sydney

# access columns 0 and 2

column_list = df.iloc[:,[0,2]]
print(column_list)

output:

Name City
0 Alice New York
1 Bob Paris
2 Charlie London
3 David Tokyo
4 Eve Sydney

# access a specific value

specific_value = df.iloc[1, 0]
print(specific_value)

output:

Bob

The main differences between .loc and .iloc are as follows:

Basis             .loc                       .iloc
Indexing          Label-based indexing       Integer-based indexing
Endpoint          Endpoint is included       Endpoint is not included
Boolean indexing  Accepts a boolean Series   Accepts a boolean array or list
                  (aligned on labels)        (a boolean Series raises an error)
Indexing a DataFrame using .ix[ ] :

This indexer was capable of selecting both by label and by integer location. Generally, ix
was label-based and acted just like the .loc indexer; however, .ix also supported integer-based
selections (as in .iloc) when passed an integer. (.ix was deprecated in pandas 0.20 and
removed in pandas 1.0.)

# retrieving row by ix method

first = data.ix["Avery Bradley"]

---

OPERATING ON DATA IN PANDAS

NumPy provides quick element-wise operations via ufuncs, for basic arithmetic as well as
more complicated operations (e.g. trigonometric functions). Pandas inherits much of this
functionality, with a couple of key differences:

 For unary operations like negation and trigonometric functions, ufuncs preserve index
and column labels in the output.

 For binary operations such as addition and multiplication, Pandas will automatically
align indices when passing the objects to the ufunc.

This means that keeping the context of data and combining data from different
sources is much easier.

Ufunc: Index Preservation

Since pandas is designed to work with NumPy, NumPy ufuncs work with
pandas Series and DataFrame objects.

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0 6
1 3
2 7
3 4

df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])


df

A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4

Applying a NumPy ufunc on these objects produces Pandas objects with preserved indices:
np.exp(ser)
0 403.428793
1 20.085537
2 1096.633158
3 54.598150

np.sin(df * np.pi / 4)

A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16

Ufunc: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in
the process of performing the operation. This is very convenient when you are working with
incomplete data.

Suppose we are combining two different data sources,

area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')

population = pd.Series({'California': 38332521, 'Texas': 26448193,


'New York': 19651127}, name='population')

# Dividing these to compute population density

population / area

Alaska NaN
California 90.413926
New York NaN
Texas 38.018740

The resulting array contains the union of the indices of the two inputs.
This matching is implemented for any of Python's built-in arithmetic expressions:
missing values are filled in with NaN by default:

A = pd.Series([2, 4, 6], index=[0, 1, 2])


B = pd.Series([1, 3, 5], index=[1, 2, 3])
A+B
0 NaN
1 5.0
2 9.0
3 NaN

We can modify the fill value using appropriate object methods in place of the operators.

A.add(B, fill_value=0)

0 2.0
1 5.0
2 9.0
3 5.0

Index alignment in DataFrames:

A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames:

A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))


A
A B
0 1 11
1 5 1

B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))


B
B A C
0 4 0 9
1 5 8 0
2 9 2 6

A+B
A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
The indices are aligned correctly irrespective of their order in the two objects, and the indices
in the result are sorted.

We can use arithmetic methods with fill values instead of ending up with NaN. Here we’ll fill
with the mean of all values in A.

fill = A.stack().mean() # Stack rows of A first; mean is 4.5.


A.add(B, fill_value=fill)

A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5

Ufunc: Operations between DataFrames and Series

When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and one-dimensional
NumPy array.

This preservation and alignment of indices and columns means that operations on
data in Pandas will always maintain the data context, which prevents the types of silly errors
that might come up when you are working with heterogeneous and/or misaligned data in raw
NumPy arrays.
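As a short illustration of the row-wise broadcasting described above (a sketch with made-up numbers), subtracting one row of a DataFrame operates across each row, with column alignment preserved:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8]})

# subtract the first row from every row (broadcast along rows, aligned by column)
diff = df - df.iloc[0]
print(diff)
```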

---

HANDLING MISSING DATA


Missing Data can occur when no information is provided for one or more items or for
a whole unit. In DataFrame sometimes many datasets simply arrive with missing data, either
because it exists and was not collected or it never existed.
In Pandas missing data is represented by two values:
 None: None is a Python singleton object that is often used for missing data in Python
code.
 NaN: NaN (an acronym for Not a Number), is a special floating-point
value recognized by all systems that use the standard IEEE
floating-point representation.

There are several useful functions for detecting, removing, and replacing null values
in Pandas DataFrame :

 isnull()
 notnull()
 dropna()
 fillna()

Checking for missing values using isnull()

In order to check null values in a Pandas DataFrame, we use the isnull() function. This
function returns a DataFrame of Boolean values which are True for NaN values.
Ex:

# dictionary of lists (assumes import pandas as pd and import numpy as np)
dict = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}

# creating a dataframe from the dictionary
df = pd.DataFrame(dict)

# using the isnull() function

print(df)
print(df.isnull())

First Score Second Score Third Score


0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 NaN 56.0 80.0
3 95.0 NaN 98.0

First Score Second Score Third Score


0 False False True
1 False False False
2 True False False
3 False True False
Checking for missing values using notnull()

In order to check for non-null values in a Pandas DataFrame, we use the notnull() function.
This function returns a DataFrame of Boolean values which are True for non-NaN values.
Ex:
df.notnull()

Filling missing values using fillna()

In order to fill null values in a dataset, we use fillna(). This function replaces NaN
values with a specified value.
Ex:

df.fillna(0)

Filling null values with the previous ones

We can specify a forward-fill to propagate the previous value forward.

Ex:

df.fillna(method='pad')   # in newer pandas: df.ffill()

Filling null value with the next ones


we can specify a back-fill to propagate the next values backward.

Ex:
df.fillna(method='bfill')   # in newer pandas: df.bfill()
Filling null values using the replace() method

The replace() method substitutes the given value (here NaN) with another value.

Ex:

df.replace(to_replace = np.nan, value = -99)

Dropping missing values using dropna()

In order to drop null values from a dataframe, we use the dropna() function. This
function drops rows/columns of the dataset with null values in different ways.

Ex:
df.dropna()           # drop rows containing any null value
df.dropna(axis=1)     # drop columns containing any null value
df.dropna(how='all')  # drop only rows in which all values are null
---
HIERARCHICAL INDEXING
Hierarchical indexing is also known as 'multi-indexing'. Earlier versions of Pandas provided
Panel and Panel4D objects that could store three-dimensional and four-dimensional data (these
objects have since been removed). Higher-dimensional data can instead be indexed by making
use of hierarchical indexing, which incorporates multiple index levels within a single index.
MultiIndex contains multiple levels of indexing. In the following example the output
contains state names and years, as well as multiple labels for each data point which encode
these levels.
Ex:

index = [('California', 2000), ('California', 2010), ('New York', 2000),
         ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
(Newer pandas prints the index as a list of tuples instead of levels/labels.)

If we re-index our series with this MultiIndex, we see the hierarchical representation of the
data:

pop = pop.reindex(index)
pop
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561

To access all data for which the second index is 2010, we can simply use the Pandas slicing
notation:

pop[:, 2010]

California 37253956
New York 19378102
Texas 25145561

MultiIndex as extra dimension

The unstack() method will quickly convert a multiply indexed Series into a conventionally
indexed DataFrame:

pop_df = pop.unstack()
pop_df
The stack() method provides the opposite operation
pop_df.stack()
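A self-contained sketch of unstack() and stack(), using the state population figures from the example above:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

pop_df = pop.unstack()      # the year level becomes columns
print(pop_df)
roundtrip = pop_df.stack()  # back to a multiply indexed Series
```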

Methods of multiIndex creation


The simplest way is to pass a list of two or more index arrays to the constructor of
a Series or DataFrame.
df = pd.DataFrame(np.random.rand(4, 2), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])

Explicit MultiIndex constructors

We can also use the class method constructors available in the pd.MultiIndex.
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

From tuples
You can also construct it from a list of tuples.
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

From Cartesian product


You can even construct it from a Cartesian product of single indices.
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish
this by passing the names argument.

pop.index.names = ['state', 'year']

Multiply Indexed Series

We can access single elements by indexing with multiple terms.


pop['California', 2000]

Multiply Indexed DataFrames

A multiply indexed DataFrame behaves in a similar manner: the syntax used for multiply
indexed Series applies to the columns. For example, given a health_data DataFrame whose
columns are indexed by subject and measurement type:

health_data['Guido', 'HR']

Sorted and unsorted indices

Pandas provides a number of convenience routines to perform this type of sorting;
examples are the sort_index() and sortlevel() methods of the DataFrame. (sortlevel()
has since been removed; use sort_index(level=...) instead.)

data = data.sort_index()

---

COMBINING DATASETS: CONCAT AND APPEND, MERGE AND JOIN

The Series and DataFrame objects in pandas are powerful tools for exploring and
analyzing data. In many real-life situations, the data that we want to use comes in multiple
files. We need to combine these files into a single DataFrame to analyze the data. Pandas
provides facilities for easily combining Series and DataFrame objects.
 Concatenation with pd.concat
The concat() function in pandas is used to append either columns or rows from one
DataFrame to another.

Syntax:
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, copy=True)
(Older versions also accepted a join_axes argument, which was removed in pandas 1.0.)

Ex:
# helper used in the examples (as in the Python Data Science Handbook)
def make_df(cols, ind):
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, index=ind)

df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))

Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result will have duplicate indices.
The outcome is often undesirable and pd.concat() gives us a few ways to handle it:
 Ignore the index: By using ignore_index in pd.concat the concatenation will create a
new integer index.

 Adding multi-index keys: Another alternative is to use the keys option to specify a
label for the data sources.

 Concatenation with joins

By default, the join is a union of the input columns (join='outer'), but we can change
this to an intersection of the columns using join='inner'.

 The join_axes argument could be used to directly specify the index of the remaining
columns (this argument was removed in pandas 1.0).
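The ignore_index, keys, and join options described above can be sketched with two small made-up DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

# discard the original indices and build a fresh 0..n-1 index
c1 = pd.concat([df1, df2], ignore_index=True)

# label each source block with a multi-index key
c2 = pd.concat([df1, df2], keys=['x', 'y'])

# keep only the columns common to both inputs
c3 = pd.concat([df1, df2], join='inner')
print(c3)
```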
 The append() method
Series and DataFrame objects had an append method that could accomplish direct
concatenation: rather than calling pd.concat([df1, df2]), you could call df1.append(df2).
(append() was deprecated in pandas 1.4 and removed in pandas 2.0; use pd.concat instead.)

 Combining data using merge()

pd.merge() implements a number of join types: one-to-one, many-to-one, and many-to-many
joins. All three are accessed via identical calls to pd.merge(); the type of join performed
depends on the form of the input data.

 One-to-one joins

Perhaps the simplest type of merge is the one-to-one join, which is similar to the
column-wise concatenation.

pd.merge() recognises that each DataFrame has an employee column, and


automatically joins using this column as a key.
 Many-to-one joins

Many-to-one joins are joins in which one of the two key columns contains duplicate
entries. The resulting DataFrame will preserve those duplicate entries as appropriate.

 Many-to-many joins

If the key column in both the left and right array contains duplicates, then the result is
a many-to-many merge.
Specification of the Merge Key

The on keyword
You can explicitly specify the name of the key column using the on keyword,
which takes a column name or a list of column names.

The left_on and right_on keywords


These keywords are used to merge two datasets whose key columns have different names; for
example, when the employee name is labelled name in one dataset instead of employee.

The left_index and right_index keywords


These keywords let you merge on an index rather than on a column.
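The join types and key keywords described in this section can be sketched with hypothetical employee data:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})

# one-to-one join: 'employee' is detected as the common key column
merged = pd.merge(df1, df2)  # equivalent to on='employee'

# merging on differently named key columns
df3 = df2.rename(columns={'employee': 'name'})
merged2 = pd.merge(df1, df3, left_on='employee', right_on='name')
print(merged)
```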
---

AGGREGATION AND GROUPING


Aggregation in pandas provides various functions that perform a mathematical or
logical operation on our dataset and return a summary of that operation. Aggregation can be
used to get a summary of columns in our dataset, such as the sum, minimum, or maximum
of a particular column. The function used for aggregation is agg(); its parameter is the
function we want to perform.

Some functions used in the aggregation are:


 sum() :Compute sum of column values
 min() :Compute min of column values
 max() :Compute max of column values
 mean() :Compute mean of column
 size() :Compute column sizes
 describe() :Generates descriptive statistics
 first() :Compute first of group values
 last() :Compute last of group values
 count() :Compute count of column values
 std() :Standard deviation of column
 var() :Compute variance of column
For example

The sum() function is used to calculate the sum of every value.


df.sum()

The describe() function is used to get a summary of our dataset


df.describe()

We use the agg() function to calculate the sum, min, and max of each column in our dataset.
df.agg('sum')
df.agg(['sum', 'min', 'max'])
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It follows a
split-apply-combine strategy:

 Splitting the data into groups based on some criteria.


 Applying a function to each group independently.
 Combining the results into a data structure.

We use the groupby() function to group the data. It returns a DataFrameGroupBy object as
the result.

Ex:
df.groupby('Maths')
<pandas.core.groupby.DataFrameGroupBy object at 0x117272160>

To produce a result, we can apply an aggregate to this DataFrameGroupBy object.


df.groupby('Maths').sum()

The most important operations made available by a GroupBy are aggregate,


filter, transform, and apply.

Aggregation:
The aggregate() method takes a string, a function, or a list thereof, and computes all the
aggregates at once.

Ex:
df.groupby('key').aggregate(['min', np.median, max])

Filtering:

A filtering operation allows you to drop data based on the group properties.
Ex:
df.groupby('key').filter(filter_func)

Transformation:
Transformation can return a transformed version of the full data; that is, the output has
the same shape as the input.

Ex:
df.groupby('key').transform(lambda x: x - x.mean())

The apply() method:

The apply() method lets you apply an arbitrary function to the group results.

Ex:
df.groupby('key').apply(norm_by_data2)
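The snippets above reference an unspecified df; a runnable sketch of aggregate, filter, and transform with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                   'data': [1, 2, 3, 4]})

# aggregate: several summaries per group at once
agg = df.groupby('key')['data'].aggregate(['min', 'median', 'max'])

# filter: keep only groups whose data sums to more than 4
filt = df.groupby('key').filter(lambda g: g['data'].sum() > 4)

# transform: center each value on its group mean (same shape as input)
trans = df.groupby('key')['data'].transform(lambda x: x - x.mean())
print(agg)
```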

---

PIVOT TABLES

A pivot table is an operation commonly seen in spreadsheets and other programs that
operate on tabular data. The pivot table takes simple column-wise data as input, and groups
the entries into a two-dimensional table that provides a multidimensional summarization of
the data.

Creating Pivot Table

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
fill_value=None, margins=False, dropna=True, margins_name='All', observed=False,
sort=True)

data - DataFrame
values - list-like or scalar, optional
index - column, Grouper, array, or list of the previous. Keys to group by on the pivot table index.
columns - column, Grouper, array, or list of the previous. Keys to group by on the pivot table columns.
aggfunc - function, list of functions, dict; default 'mean'
fill_value - scalar, default None. Value to replace missing values with.
margins - bool, default False. If True, special 'All' columns and rows will be added.
dropna - bool, default True. Do not include columns whose entries are all NaN.
margins_name - Name of the row/column that will contain the totals when margins is True.
observed - If True, only show observed values for categorical groupers; if False, show all values.
sort - Specifies whether the result should be sorted.
Returns a DataFrame: an Excel-style pivot table.

table = pd.pivot_table(df, values='A', index=['B', 'C'], aggfunc='sum')
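A self-contained sketch with hypothetical sales data (the column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'year': [2020, 2021, 2020, 2021],
                   'sales': [10, 20, 30, 40]})

# rows become cities, columns become years, cells hold summed sales
table = pd.pivot_table(df, values='sales', index='city',
                       columns='year', aggfunc='sum')
print(table)
```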

---
VECTORIZED STRING OPERATIONS
String manipulation is the process of changing, parsing, splicing, pasting, or
analyzing strings. Nearly all Python’s built-in string methods are mirrored by a Pandas
vectorized string method.
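These vectorized methods are accessed through the .str attribute of a Series of strings; a quick runnable sketch:

```python
import pandas as pd

s = pd.Series(['peter', 'Paul', 'MARY'])

# each method is applied element-wise across the Series
print(s.str.len())
print(s.str.lower())
print(s.str.upper())
```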

(In the examples below, s denotes a Series of strings; the vectorized string methods are
accessed through the .str attribute of a Series or Index, not of a whole DataFrame.)

len():
Computes the length of each element in the Series/Index.
print(s.str.len())

lower():
Converts all uppercase characters in each string to lowercase.
s.str.lower()

upper():
Converts all lowercase characters in each string to uppercase.
s.str.upper()

ljust():
Returns each string padded on its right end with a fill character up to the specified
length.

s.str.ljust(50, '^')

startswith():
Used to search and filter text data in a Series. Returns True if a string starts with the
prefix, otherwise False.

s.str.startswith(search)
isnumeric():
Checks whether all characters in each string are numeric.
s.str.isnumeric()
index():
Returns the index information of the DataFrame, i.e. the labels of the rows.

print(df.index)
strip():
Removes the extra spaces at the beginning and end of each string.
s.str.strip()

Methods using regular expressions


match():
Determines whether each string starts with a match of the regular expression pattern;
returns a Boolean value per element (NaN for missing values).

s.str.match(pat, case=True, flags=0, na=nan)

pat : Regular expression pattern with capturing groups.
case : If True, case sensitive.
flags : A re module flag, for example re.IGNORECASE.
na : Fill value for missing values; default NaN.
Ex:
s.str.match('ab', flags=re.IGNORECASE)

extract():
This function is used to extract the capture groups in the regex pattern as columns in a
DataFrame. For each string in the Series, it extracts groups from the first match of the
regular expression pat.

s.str.extract(pat, flags=0, expand=True)

Ex:
s.str.extract('([aeiou].)')

findall():
Used to find all occurrences of a pattern or separator in each string. It returns a list of
matches per element; the size of the list is the number of times the pattern occurred.

s.str.findall(pat, flags=0)
Ex: s.str.findall(search)
replace():
It is used to replace a string, regex, list, dictionary, series, number, etc.
Ex:
df = {
"Array_1": [49.50, 70],
"Array_2": [65.1, 49.50]
}

data = pd.DataFrame(df)

print(data.replace(49.50, 60))

contains():
This function tests whether a pattern or regex is contained within each string.
Ex:
s.str.contains('is')

count():
It is used to count the no. of non-NA/null observations across the given axis.
Ex; df.count(axis = 0)

split():
Splits each string in the Series/Index from the beginning, at the specified delimiter
string.
Ex: s.str.split('t')

Miscellaneous methods
get() - str.get(i) extracts the element at position i of each string.
cat() - used to concatenate strings.
repeat() - repeats each element of a Series the specified number of times.

---
HIGH-PERFORMANCE PANDAS: EVAL() AND QUERY()

eval()
The eval() function in Pandas uses string expressions to efficiently compute
operations using DataFrames.
Syntax:

DataFrame.eval(expr, inplace=False)

expr : The expression string to evaluate.


inplace : If the expression contains an assignment, whether to perform the operation inplace
and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.

df.eval('D = A + B + C', inplace=True)

query():

The query() method allows you to query the DataFrame. It takes a query expression as a
string parameter, which has to evaluate to either True or False for each row, and it returns
the rows of the DataFrame where the result is True.

Column names are referenced directly in the expression, so names containing spaces must
either be rewritten (e.g. spaces replaced with '_') or, in newer pandas versions, wrapped
in backticks.
Syntax:

DataFrame.query(expr, inplace=False)

expr: Expression in string form to filter data.


inplace: Make changes in the original data frame if True

Ex:
data.query('Senior_Management == True', inplace=True)
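A combined sketch of eval() and query() with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [10, 20, 30],
                   'C': [100, 200, 300]})

# eval: compute a new column D from a string expression
df.eval('D = A + B + C', inplace=True)

# query: filter rows with a boolean string expression
big = df.query('D > 150')
print(big)
```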

---
