Data Manipulation With Pandas
INTRODUCTION
Pandas is an open-source Python library. It provides high-performance data
manipulation and analysis tools built on powerful data structures. The name Pandas is derived
from the term "Panel Data". It was developed by Wes McKinney in 2008.
Using Pandas, we can accomplish five typical steps in the processing and analysis of
data:
Load
Prepare
Manipulate
Model
Analyze
---
The Series wraps both a sequence of values and a sequence of indices, which we can
access with the values and index attributes. As with a NumPy array, data can be accessed by
the associated index using Python square-bracket notation:
Ex:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data[1]
0.5
data[1:3]
1 0.50
2 0.75
A Series can be created from various inputs such as an array, a dictionary, or a scalar
value (constant).
Ex:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
We can even use non-contiguous or nonsequential indices.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
2 0.25
5 0.50
3 0.75
7 1.00
data[5]
0.5
A dictionary can be passed as input; if no index is specified, the dictionary keys are
used to construct the index (older pandas versions sorted the keys, while modern versions
preserve insertion order). If an index is passed, the values in data corresponding to the labels
in the index will be pulled out.
Ex:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
output
a 0.0
b 1.0
c 2.0
Dictionary keys are also used to construct the index when an explicit index is passed:
the order of the passed index is preserved, and any label missing from the dictionary is filled
with NaN (Not a Number).
Ex:
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
b 1.0
c 2.0
d NaN
a 0.0
Create a Series from a Scalar
If data is a scalar value, an index must be provided. The value will be repeated to
match the length of the index.
Ex:
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
output
0 5
1 5
2 5
3 5
Create DataFrame:
A DataFrame can be created from many kinds of input: lists, dicts of lists or Series,
NumPy ndarrays, and so on. The outputs below illustrate three common cases.
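The first output can be produced from a list of lists with a columns argument (a sketch;
the code is reconstructed to match the output shown):
data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)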
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
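The second output can be produced from a dict of lists together with a custom row index
(again a sketch reconstructed to match the output; older pandas versions printed the columns
in sorted order):
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)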
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
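The third output can be produced from a list of dicts; keys that are missing in a row
appear as NaN (a sketch reconstructed to match the output):
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)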
a b c
first 1 2 NaN
second 5 10 20.0
Create a DataFrame from Dict of Series
A dictionary of Series can be passed to form a DataFrame. The resulting index is the
union of all the Series indexes passed.
Ex:
d= {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b',
'c', 'd'])}
df = pd.DataFrame(d)
print(df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Pandas Index objects also support array-style slicing. For example (the Index definition
is reconstructed to be consistent with the output):
ind = pd.Index([2, 3, 5, 7, 11])
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
---
DATA INDEXING AND SELECTION
Indexing in Pandas means simply selecting or accessing and modifying values
in Pandas series and dataframes.
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection of keys to a
collection of values:
Ex:
data
a 0.25
b 0.50
c 0.75
d 1.00
data['b']
0.5
We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values:
Ex:
data.keys()
Index(['a', 'b', 'c', 'd'])
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Ex:
# slicing by explicit index; the final index is included in the slice
data['a':'c']
a 0.25
b 0.50
c 0.75
# slicing by implicit integer index. The final index is not included in the slice.
data[0:2]
a 0.25
b 0.50
# masking
data[(data > 0.3) & (data < 0.8)]
b 0.50
c 0.75
# fancy indexing (a new entry is first added with dictionary-style assignment)
data['e'] = 1.25
data[['a', 'e']]
a 0.25
e 1.25
Slicing and indexing can be confusing when a Series has an explicit integer index: an
indexing operation such as data[1] uses the explicit index, while a slicing operation like
data[1:3] uses the implicit index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data[1]
'a'
data[1:3]
3 b
5 c
Because of this, pandas provides special indexer attributes to expose certain indexing
schemes.
loc
The loc attribute references the explicit index for indexing and slicing (label-location
based):
Ex:
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
iloc
The iloc attribute references the implicit Python-style index for indexing and slicing
(integer-location based)
Ex:
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
ix
The ix attribute was a hybrid of the two, and for Series objects was equivalent to
standard []-based indexing. Note that ix was deprecated in pandas 0.20 and removed in
pandas 1.0; use loc and iloc instead.
We can access columns of a DataFrame using the bracket ([]) operator. For example:
import pandas as pd
# create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 32, 18, 47, 33],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney']
}
df = pd.DataFrame(data)
names = df['Name']
print(names)
output
0 Alice
1 Bob
2 Charlie
3 David
4 Eve
We can also access multiple columns using the [] operator. For example,
name_city = df[['Name','City']]
print(name_city)
output
Name City
0 Alice New York
1 Bob Paris
2 Charlie London
3 David Tokyo
4 Eve Sydney
The [] operator, however, provides limited functionality. Even basic operations like
selecting rows, slicing DataFrames, and selecting individual elements are quite tricky using
the [] operator alone.
The .loc indexer selects data by the labels of the rows and columns. It selects data in a
different way than the plain indexing operator and can select subsets of rows or columns.
In order to select a single row using .loc[], we put a single row label inside .loc.
In order to select two rows and three columns, we put the list of the two row labels and
the list of the three column labels inside .loc, as the sketch below shows.
In order to select all rows and some columns, we use a single colon [:] for the rows and
a list of the columns we want.
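A minimal sketch of these .loc patterns, reusing the Name/Age/City DataFrame defined
above (the chosen labels are illustrative):
df.loc[2]                                # a single row by label
df.loc[[1, 3], ['Name', 'Age', 'City']]  # two rows, three columns
df.loc[:, ['Name', 'City']]              # all rows, some columns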
In Pandas, the .iloc property is used to access and modify data within a DataFrame
using integer-based indexing. It allows us to select specific rows and columns based on their
integer locations.
Ex (using the same Name/Age/City DataFrame defined above):
single_row = df.iloc[2]
print("Single Row:")
print(single_row)
output:
Name Charlie
Age 18
City London
We can also select particular columns by their integer locations:
column_list = df.iloc[:, [0, 2]]
print(column_list)
output:
Name City
0 Alice New York
1 Bob Paris
2 Charlie London
3 David Tokyo
4 Eve Sydney
specific_value = df.iloc[1, 0]
print(specific_value)
output:
Bob
The ix indexer was capable of selecting both by label and by integer location.
Generally, ix was label based and acted just like the .loc indexer; however, it also supported
integer-type selections (as in .iloc) when passed an integer. As noted above, ix has been
removed from modern pandas.
---
NumPy provides quick element-wise operations via ufuncs, for basic arithmetic as well
as more complicated operations (e.g. trigonometric functions). Pandas inherits much of this
functionality, with a couple of differences:
For unary operations like negation and trigonometric functions, ufuncs preserve index
and column labels in the output.
For binary operations such as addition and multiplication, Pandas automatically aligns
indices when passing the objects to the ufunc.
This means that keeping the context of data and combining data from different
sources is much easier.
Since pandas is designed to work with NumPy, NumPy ufuncs work with
pandas Series and DataFrame objects.
import numpy as np
import pandas as pd
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0 6
1 3
2 7
3 4
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
Applying a NumPy ufunc on these objects produces Pandas objects with preserved indices:
np.exp(ser)
0 403.428793
1 20.085537
2 1096.633158
3 54.598150
np.sin(df * np.pi / 4)
A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
For binary operations on two Series or DataFrame objects, Pandas will align indices in
the process of performing the operation. This is very convenient when you are working with
incomplete data.
Ex (assuming area and population Series indexed by state name, as in the classic
example; the values below are consistent with the output):
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
population / area
Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
The resulting object contains the union of the indices of the two inputs.
This index matching is implemented for any of Python's built-in arithmetic expressions;
missing values are filled in with NaN by default.
We can modify the fill value by using the appropriate object method in place of the
operator. For example (two partially overlapping Series, reconstructed to be consistent with
the output):
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A.add(B, fill_value=0)
0 2.0
1 5.0
2 9.0
3 5.0
A similar type of alignment takes place for both columns and indices when you are
performing operations on DataFrames (here A and B are small DataFrames with partially
overlapping rows and columns):
A + B
A B C
0 1.0 15.0 NaN
1 13.0 6.0 NaN
2 NaN NaN NaN
The indices are aligned correctly irrespective of their order in the two objects, and the indices
in the result are sorted.
We can use arithmetic methods with fill values instead of ending up with NaN. Here
we'll fill with the mean of all values in A, as in the sketch below.
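A sketch of the fill-with-mean pattern (stack() flattens the DataFrame so the mean is
taken over all of its values):
fill = A.stack().mean()
A.add(B, fill_value=fill)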
A B C
0 1.0 15.0 13.5
1 13.0 6.0 4.5
2 6.5 13.5 10.5
When you are performing operations between a DataFrame and a Series, the index
and column alignment is similarly maintained. Operations between a DataFrame and
a Series are similar to operations between a two-dimensional and one-dimensional
NumPy array.
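For example, subtracting one row of a DataFrame from the whole DataFrame broadcasts
the row across all rows by default (a minimal sketch using the df defined earlier):
df - df.iloc[0]  # subtract the first row from every row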
This preservation and alignment of indices and columns means that operations on
data in Pandas will always maintain the data context, which prevents the types of silly errors
that might come up when you are working with heterogeneous and/or misaligned data in raw
NumPy arrays.
---
There are several useful functions for detecting, removing, and replacing null values
in Pandas DataFrame :
isnull()
notnull()
dropna()
fillna()
In order to check null values in a Pandas DataFrame, we use the isnull() function. This
function returns a DataFrame of Boolean values which are True for NaN values.
Ex:
import numpy as np
import pandas as pd
# dictionary of lists
data = {'First Score': [100, 90, np.nan, 95], 'Second Score': [30, 45, 56, np.nan],
'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)
print(df)
print(df.isnull())
To select the non-null values, we use the notnull() function, which returns a DataFrame
of Boolean values that are True for non-NaN values.
Ex:
df.notnull()
In order to fill null values we use the fillna() function, which replaces NaN values with
a specified value or filling method.
Ex:
df.fillna(0)
Ex:
df.fillna(method='pad')   # forward fill: propagate the previous value
Ex:
df.fillna(method='bfill')  # backward fill: use the next value
(In newer pandas versions the method argument is deprecated in favour of df.ffill() and
df.bfill().)
Filling null values using the replace() method.
This method replaces NaN values with a value of our own.
Ex:
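A minimal sketch (the replacement value -99 is illustrative):
df.replace(to_replace=np.nan, value=-99)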
In order to drop null values from a DataFrame, we use the dropna() function. This
function drops rows/columns of the dataset with null values in different ways.
Ex:
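A minimal sketch of the main variants:
df.dropna()           # drop rows that contain any NaN
df.dropna(axis=1)     # drop columns that contain any NaN
df.dropna(how='all')  # drop rows only if all their values are NaN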
---
HIERARCHICAL INDEXING
Hierarchical indexing is also known as 'multi-indexing'. Pandas once provided Panel and
Panel4D objects that could store three-dimensional and four-dimensional data, but these have
been removed from modern pandas; such higher-dimensional data is instead indexed by making
use of hierarchical indexing, which incorporates multiple index levels within a single index.
MultiIndex contains multiple levels of indexing. In the following example the output
contains state names and years, as well as multiple labels for each data point which encode
these levels.
Ex:
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
If we build our population Series with this MultiIndex, we see the hierarchical
representation of the data:
pop = pd.Series([33871648, 37253956, 18976457, 19378102,
20851820, 25145561], index=index)
pop
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
To access all data for which the second index is 2010, we can simply use the Pandas slicing
notation:
pop[:, 2010]
California 37253956
New York 19378102
Texas 25145561
The unstack() method will quickly convert a multiply-indexed Series into a
conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
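With the population Series above, the output has states as rows and years as columns:
2000 2010
California 33871648 37253956
New York 18976457 19378102
Texas 20851820 25145561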
The stack() method provides the opposite operation
pop_df.stack()
We can also use the class-method constructors available on pd.MultiIndex. From arrays:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
From tuples
You can also construct it from a list of tuples.
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish
this by passing the names argument.
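A minimal sketch of naming levels (the level names are illustrative):
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['letter', 'number'])
The levels of an existing index can also be named after the fact, e.g.
pop.index.names = ['state', 'year'].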
A multiply indexed DataFrame behaves in a similar manner; the syntax used for
multiply indexed Series applies to the columns. For example, assuming a health_data
DataFrame whose column MultiIndex holds subject names and measurement types, a single
subject's measurement can be selected with:
health_data['Guido', 'HR']
Many MultiIndex slicing operations will fail if the index is not sorted, so it is often
useful to sort it first:
data = data.sort_index()
---
The Series and DataFrame objects in pandas are powerful tools for exploring and
analyzing data. In many real-life situations, the data that we want to use comes in multiple
files. We need to combine these files into a single DataFrame to analyze the data. Pandas
provide such facilities for easily combining Series or DataFrame.
Concatenation with pd.concat
The concat() function in pandas is used to append either columns or rows from one
DataFrame to another.
Syntax:
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, copy=True)
(Older pandas versions also accepted a join_axes argument, which has since been removed.)
Ex 1:
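A minimal first sketch, concatenating two small Series (the values are illustrative):
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])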
Ex 2:
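Note that make_df is not a pandas function but a small helper; a sketch of its usual
definition:
def make_df(cols, ind):
    # each cell is the column letter followed by the row label
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)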
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))
Duplicate indices
One important difference between np.concatenate and pd.concat is that Pandas
concatenation preserves indices, even if the result will have duplicate indices.
The outcome is often undesirable, and pd.concat() gives us a few ways to handle it (see
the sketch after this list):
Ignore the index: by using ignore_index=True in pd.concat, the concatenation creates a
new integer index.
Adding MultiIndex keys: another alternative is to use the keys option to specify labels
for the data sources.
By default, the join is a union of the input columns (join='outer'), but we can change
this to an intersection of the columns using join='inner'.
(In older pandas versions, the join_axes argument could be used to directly specify the
index of the remaining columns; it has since been removed.)
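A minimal sketch of these options (df1 and df2 as built with make_df above):
pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 integer index
pd.concat([df1, df2], keys=['x', 'y'])    # hierarchical index labelling the sources
pd.concat([df1, df2], join='inner')       # keep only the shared columns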
The append() method
Series and DataFrame objects also had an append method that accomplished direct
concatenation: rather than calling pd.concat([df1, df2]), you could simply call
df1.append(df2). Note that append was deprecated in pandas 1.4 and removed in pandas 2.0;
use pd.concat instead.
One-to-one joins
Perhaps the simplest type of merge is the one-to-one join, which is similar to the
column-wise concatenation.
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate
entries. The resulting DataFrame will preserve those duplicate entries as appropriate.
Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is
a many-to-many merge.
Specification of the Merge Key
The on keyword
You can explicitly specify the name of the key column using the on keyword,
which takes a column name or a list of column names.
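A minimal sketch of a merge with an explicit key (the employee/group data is
illustrative):
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
'group': ['Accounting', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
'hire_date': [2004, 2008, 2012]})
pd.merge(df1, df2, on='employee')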
Aggregation with agg()
We can use the agg() function to calculate, for example, the sum, min, and max of each
column in our dataset:
df.agg('sum')
df.agg(['sum', 'min', 'max'])
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It implements the
split-apply-combine strategy.
We use the groupby() function to group the data. It returns a DataFrameGroupBy object
as its result.
Ex:
df.groupby('Maths')
<pandas.core.groupby.DataFrameGroupBy object at 0x117272160>
Aggregation:
The aggregate() method takes a string, a function, or a list thereof, and computes all the
aggregates at once.
Ex:
df.groupby('key').aggregate(['min', np.median, max])
Filtering:
A filtering operation allows you to drop data based on the group properties.
Ex:
df.groupby('key').filter(filter_func)
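Here filter_func is not a pandas built-in but any user-defined function that takes a group
and returns a Boolean; a sketch (assuming a numeric column data2):
def filter_func(x):
    # keep only the groups whose data2 standard deviation exceeds 4
    return x['data2'].std() > 4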
Transformation:
Transformation can return some transformed version of the full data; that is, the output
has the same shape as the input.
Ex:
df.groupby('key').transform(lambda x: x - x.mean())
The apply() method lets you apply an arbitrary function to the group results.
Ex:
df.groupby('key').apply(norm_by_data2)
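norm_by_data2 is likewise a user-defined function; a sketch (assuming numeric columns
data1 and data2, following the classic example):
def norm_by_data2(x):
    # x is a DataFrame of group values; normalize data1 by the group's data2 sum
    x['data1'] /= x['data2'].sum()
    return x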
---
PIVOT TABLES
A pivot table is essentially a multidimensional version of GroupBy aggregation: the
data is split along two dimensions (the rows and the columns), and the chosen values are
aggregated into a two-dimensional table. The pd.pivot_table() function takes the DataFrame,
the column to aggregate (values), the row and column grouping keys (index and columns),
and the aggregation function (aggfunc):
table = pd.pivot_table(df, values='A', index=['B', 'C'], columns=['B'], aggfunc=np.sum)
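A minimal self-contained sketch (the column names and values are illustrative):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
'B': ['one', 'one', 'two', 'two', 'one'],
'C': ['x', 'y', 'x', 'y', 'x']})
table = pd.pivot_table(df, values='A', index=['B'], columns=['C'], aggfunc=np.sum)
print(table)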
---
VECTORIZED STRING OPERATIONS
String manipulation is the process of changing, parsing, splicing, pasting, or
analyzing strings. Nearly all Python’s built-in string methods are mirrored by a Pandas
vectorized string method.
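In the examples below, s denotes a small Series of strings (an illustrative sample). Note
that the vectorized string methods live under the .str accessor of a Series or Index, not of a
whole DataFrame:
s = pd.Series(['Tom', 'William Rose', 'John', '123'])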
len():
Compute the length of each element in the Series/Index.
s.str.len()
lower():
Converts all uppercase characters in strings to lowercase.
s.str.lower()
upper():
Converts all lowercase characters in strings to uppercase.
s.str.upper()
ljust():
It returns the strings padded on the right end with spaces up to the specified length.
s.str.ljust(10)
startswith():
It is used to search and filter text data in a Series or DataFrame. It returns True if a
string starts with the given prefix, otherwise False.
s.str.startswith(search)  # search is the prefix string to look for
isnumeric():
Check whether all characters in each string are numeric.
s.str.isnumeric()
index():
(This is the DataFrame index attribute rather than a string method.) It returns the index
information of the DataFrame, which contains the labels of the rows.
print(df.index)
strip():
Removes the extra spaces at the beginning and end of each string.
s.str.strip()
extract():
This function is used to extract capture groups in the regex pattern as columns in a
DataFrame. For each string in the Series, it extracts groups from the first match of the
regular expression pat.
s.str.extract(pat)
findall():
Used to find substrings or separators in each string. It returns a list of the substrings,
whose length is the number of times the pattern occurred.
s.str.findall(pat, flags=0)
Ex: s.str.findall(search)  # search is the pattern to look for
replace():
It is used to replace a string, regex, list, dictionary, series, number, etc.
Ex:
data = {
"Array_1": [49.50, 70],
"Array_2": [65.1, 49.50]
}
df = pd.DataFrame(data)
print(df.replace(49.50, 60))
contains():
This function is used to test whether a pattern or regex is contained within each string.
Ex:
s.str.contains(pat='is')
count():
It is used to count the number of non-NA/null observations across the given axis.
Ex: df.count(axis=0)
split():
Splits the string in the Series/Index from the beginning, at the specified delimiter
string.
Ex: s.str.split('t')
Miscellaneous methods
get() - index each element (e.g. s.str.get(i) returns the i-th character of each string).
cat() - used to concatenate strings.
repeat() - repeat the elements of a Series.
---
HIGH-PERFORMANCE PANDAS: EVAL() AND QUERY()
eval()
The eval() function in Pandas uses string expressions to efficiently compute
operations using DataFrames.
Syntax:
DataFrame.eval(expr, inplace=False)
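A minimal sketch (the column names are illustrative; df.eval lets you refer to columns
by name inside the string expression):
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.eval('D = (A + B) / C', inplace=True)
print(df)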
query():
The query() method allows you to query the DataFrame. It takes a query expression as
a string parameter, which has to evaluate to either True or False, and it returns the DataFrame
rows for which the result is True.
This method only works directly if the column name doesn't contain any spaces, so
before applying the method, spaces in column names are typically replaced with '_' (newer
pandas versions also allow wrapping such names in backticks inside the expression).
Syntax:
DataFrame.query(expr, inplace=False)
Ex (assuming a DataFrame data with a Boolean Senior_Management column):
data.query('Senior_Management == True', inplace=True)
---