Module 3 Pandas 2 (1)
Prepared By,
Dr. Anitha DB
Associate Professor & Head
Department of CSE-Data Science
ATME College of Engineering, Mysuru
Module 3 : Data Manipulation with Pandas II
Vectorized String Operations
Or we can do something more complicated, like finding all names that start and end with a consonant, making
use of the start-of-string (^) and end-of-string ($) regular expression characters:
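The pattern described above can be sketched as follows. The names Series is hypothetical sample data (the slide's own data is not shown), and "consonant" is approximated as "not a vowel".

```python
import pandas as pd

# Hypothetical sample data; the slide's own Series is not shown.
names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# ^[^AEIOU]  : the first character is not an uppercase vowel
# .*         : anything in between
# [^aeiou]$  : the last character is not a lowercase vowel
matches = names.str.findall(r'^[^AEIOU].*[^aeiou]$')
print(matches)
```

Each entry of the result is a list: it holds the full name when the pattern matches and is empty otherwise.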
The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities
for analysis and cleaning of data.
Regular Expression
Metacharacters
Indexing via df.str.get(i) and df.str[i] is similar. These get() and slice() methods also let you access elements of
arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
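A minimal sketch of slicing and of combining split() with get(), again using a hypothetical names Series:

```python
import pandas as pd

# Hypothetical sample data for illustration.
names = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Slicing through the str accessor; equivalent to names.str.slice(0, 3)
first_three = names.str[0:3]

# split() yields a list per entry; get(-1) picks the last element of each list
last_names = names.str.split().str.get(-1)
print(last_names)
```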
With these operations as building blocks, you can construct an endless range of string processing
procedures when cleaning your data.
3.2 Working with Time Series
3.2.1 Dates and Times in Python
The Python world has a number of available representations of dates, times, deltas, and timespans. While the
time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see
their relationship to other packages used in Python.
Once we have this date formatted, however, we can quickly do vectorized operations on it:
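As an illustration, assuming the date is stored as a NumPy datetime64 value (the date itself is an arbitrary example), adding an integer array produces a whole range of dates in one vectorized step:

```python
import numpy as np

# A single date stored with NumPy's datetime64 dtype (day resolution)
date = np.array('2015-07-04', dtype=np.datetime64)

# Vectorized arithmetic: adding integers adds that many days
following_days = date + np.arange(12)
print(following_days)
```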
For example, we can use Pandas tools to repeat the demonstration from above. We can parse a flexibly formatted
string date, and use format codes to output the day of the week:
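A short sketch of this step; the date string is an arbitrary example, and %A is the standard strftime code for the full weekday name:

```python
import pandas as pd

# Parse a free-form date string into a pandas Timestamp
date = pd.to_datetime("July 4, 2015")

# Format codes work as they do for datetime objects: %A is the weekday name
weekday = date.strftime('%A')
print(date, weekday)
```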
3.2.2 Pandas Time Series: Indexing by Time
Where the Pandas time series tools really become useful is when you begin to index data by timestamps. For
example, we can construct a Series object that has time indexed data:
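A minimal sketch with made-up dates and values, showing a time-indexed Series and a slice by date strings:

```python
import pandas as pd

# Illustrative dates and values; any timestamps would do
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)

# Slicing with date strings; both endpoints are included
subset = data['2014-07-04':'2015-07-04']
print(subset)
```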
There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from
that year:
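For instance, with the same kind of illustrative Series, passing only a year selects every entry from that year:

```python
import pandas as pd

# Illustrative time-indexed data
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)

# Partial-string indexing: a year alone selects all rows in that year
year_2015 = data.loc['2015']
print(year_2015)
```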
3.2.3 Pandas Time Series Data Structures
This section will introduce the fundamental Pandas data structures for working with time series data:
• For time stamps, Pandas provides the Timestamp type. It is essentially a replacement for Python’s native
datetime, but is based on the more efficient numpy.datetime64 data type. The associated index structure is
DatetimeIndex.
• For time periods, Pandas provides the Period type. This encodes a fixed frequency interval based on
numpy.datetime64. The associated index structure is PeriodIndex.
• For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for
Python’s native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is
TimedeltaIndex.
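The three scalar types and their index structures can be sketched like this (the dates and durations are arbitrary examples):

```python
import pandas as pd

# Scalar types
stamp = pd.Timestamp('2015-07-04')        # a point in time
period = pd.Period('2015-07', freq='M')   # a fixed-frequency interval (July 2015)
delta = pd.Timedelta(days=2, hours=12)    # a duration

# Corresponding index structures
dates = pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-06'])  # DatetimeIndex
periods = dates.to_period('D')                                      # PeriodIndex
durations = pd.to_timedelta(['0 days', '1 days', '3 days'])         # TimedeltaIndex

print(type(dates).__name__, type(periods).__name__, type(durations).__name__)
```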
A TimedeltaIndex is created, for example, when one date is subtracted from another:
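A small sketch with illustrative dates:

```python
import pandas as pd

dates = pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-06',
                        '2015-07-07', '2015-07-08'])

# Subtracting one date from every entry gives elapsed durations
elapsed = dates - dates[0]
print(elapsed)
```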
We can modify the spacing by altering the freq argument, which defaults to D. For example, here we will
construct a range of hourly timestamps:
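A minimal sketch comparing the default daily spacing with an hourly one. The start date is arbitrary, and the hourly code is written as 'h' here; older pandas versions document the same code as 'H'.

```python
import pandas as pd

# Default spacing is one day
daily = pd.date_range('2015-07-03', periods=8)

# Hourly spacing via the freq argument
hourly = pd.date_range('2015-07-03', periods=8, freq='h')
print(hourly)
```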
In the same way, you can modify the split-point of the weekly frequency by adding a three-letter
weekday code:
On top of this, codes can be combined with numbers to specify other frequencies. For example, for a
frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes (spelled h and min in
pandas 2.2 and later) as follows:
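Both ideas can be sketched as below. The start dates are arbitrary, and the combined code uses the 'h' and 'min' spellings, which is an assumption about the pandas version in use (older releases document 'H' and 'T').

```python
import pandas as pd

# Weekly frequency anchored on Wednesday via a three-letter weekday suffix
wednesdays = pd.date_range('2015-07-01', periods=3, freq='W-WED')

# Combined code: steps of 2 hours 30 minutes
steps = pd.timedelta_range('0 days', periods=9, freq='2h30min')

print(wednesdays)
print(steps)
```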
All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the
pd.tseries.offsets module. For example, we can create a business day offset directly as follows:
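A short sketch using the BDay offset object directly as the frequency (the start date is an arbitrary example):

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Five consecutive business days; the weekend of 4-5 July 2015 is skipped
business_days = pd.date_range('2015-07-01', periods=5, freq=BDay())
print(business_days)
```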
3.2.5 Resampling, Shifting, and Windowing
The ability to use dates and times as indices to organize and access data is an important piece of the Pandas
time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive
data slicing and access, etc.) still apply, and Pandas provides several additional time series–specific
operations.
Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data.
For example, the accompanying pandas-datareader package (installable via conda install pandas-datareader)
knows how to import financial data from a number of available sources, including Yahoo Finance, Google
Finance, and others. Here we will load Google’s closing price history:
We can visualize this using the plot() method, after the normal Matplotlib setup boilerplate (Figure 3-5):
The primary difference between resample() and asfreq() is that resample() is fundamentally a data aggregation,
while asfreq() is fundamentally a data selection.
Taking a look at the Google closing price, let’s compare what the two return when we down-sample the data.
Here we will resample the data at the end of business year (Figure 3-6):
For up-sampling, here we resample the business day data at a daily frequency (i.e., including weekends); see Figure 3-7:
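A minimal sketch of this up-sampling on a few synthetic business-day values. The forward-fill and back-fill variants are shown as common ways to fill the new weekend rows; they are an illustrative choice here.

```python
import pandas as pd

# Synthetic business-day data: Wed 1 July to Tue 7 July 2015
idx = pd.bdate_range('2015-07-01', periods=5)
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=idx)

daily = prices.asfreq('D')                        # weekend rows appear as NaN
daily_ffill = prices.asfreq('D', method='ffill')  # carry Friday's value forward
daily_bfill = prices.asfreq('D', method='bfill')  # pull Monday's value back

print(daily)
```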
Time-shifts
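Another common time series–specific operation is shifting of data in time. Pandas originally had two closely
related methods for computing this: shift() and tshift(). In short, the difference between them is that shift()
shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the
frequency. Note that tshift() was deprecated in pandas 1.1 and removed in pandas 2.0; in current versions,
passing a freq argument to shift() shifts the index instead.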
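The two kinds of shift can be sketched on a small synthetic Series. This uses shift(freq=...) for the index shift, which works in current pandas versions.

```python
import pandas as pd

idx = pd.date_range('2015-07-01', periods=5, freq='D')
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Shift the data: the index is unchanged, values move down by two rows
shifted_data = s.shift(2)

# Shift the index: values are unchanged, each timestamp moves two days later
shifted_index = s.shift(2, freq='D')

print(shifted_data)
print(shifted_index)
```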
A common context for this type of shift is computing differences over time. For example, we use shifted
values to compute the one-year return on investment for Google stock over the course of the dataset
(Figure 3-9):
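A sketch of the calculation on synthetic daily prices (the price series is invented; real stock data would replace it). Shifting the index back by 365 days lines up each date with the price one year later.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices rising linearly from 100
idx = pd.date_range('2014-01-01', periods=800, freq='D')
price = pd.Series(100 + 0.1 * np.arange(800), index=idx)

# Price one year ahead, relabelled with today's date, divided by today's price
roi = 100 * (price.shift(-365, freq='D') / price - 1)

print(roi.dropna().head())
```

Dates without a matching price one year later come out as NaN.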
Rolling windows
Rolling statistics are a third type of time series–specific operation implemented by Pandas. These can be
accomplished via the rolling() method of Series and DataFrame objects, which returns a view similar to what
we saw with the groupby operation. This rolling view makes available a number of aggregation operations by
default. For example, here is the one-year centered rolling mean and standard deviation of the Google stock
prices (Figure 3-10):
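A minimal sketch on synthetic daily prices, assuming a 365-row window stands in for "one year":

```python
import numpy as np
import pandas as pd

# Synthetic daily prices; real stock data would replace this
idx = pd.date_range('2014-01-01', periods=800, freq='D')
price = pd.Series(100 + 0.1 * np.arange(800), index=idx)

# Centered 365-row window; rows without a full window yield NaN
rolling = price.rolling(365, center=True)

summary = pd.DataFrame({'input': price,
                        'one-year rolling mean': rolling.mean(),
                        'one-year rolling std': rolling.std()})
print(summary.dropna().head())
```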
3.2.6 High-Performance Pandas: eval() and query()
Because NumPy evaluates each subexpression of a compound expression separately, a single line combining the
arrays x and y is roughly equivalent to computing and storing each intermediate result in turn.
In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large,
this can lead to significant memory and computational overhead. The Numexpr library gives you the ability
to compute this type of compound expression element by element, without the need to allocate full
intermediate arrays.
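The point can be illustrated with a compound mask on two random arrays. The arrays and the particular expression are illustrative choices, and the Numexpr call is guarded because the library may not be installed.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.random(1_000_000)
y = rng.random(1_000_000)

# Compound expression written in one line
mask = (x > 0.5) & (y < 0.5)

# What NumPy effectively does: each intermediate is a full temporary array
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_explicit = tmp1 & tmp2

# Numexpr evaluates the whole expression without full-size temporaries
try:
    import numexpr
    mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)',
                                    local_dict={'x': x, 'y': y})
except ImportError:
    mask_numexpr = mask  # fall back when numexpr is unavailable

print(np.array_equal(mask, mask_explicit))
```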
pandas.eval() for Efficient Operations
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For
example, consider the following DataFrames:
To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:
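A sketch of the setup and the plain sum. The shapes and random contents are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

nrows, ncols = 10_000, 100  # illustrative size
rng = np.random.default_rng(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.random((nrows, ncols)))
                      for _ in range(4))

# Typical Pandas approach: each "+" creates a temporary DataFrame
total = df1 + df2 + df3 + df4
print(total.shape)
```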
We can compute the same result via pd.eval by constructing the expression as a string:
The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same
result:
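A sketch of the string-based version, checked against the plain sum. The frames are random stand-ins, and the actual speed difference depends on data size and on whether Numexpr is installed, so no timing is asserted here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.random((10_000, 100)))
                      for _ in range(4))

direct = df1 + df2 + df3 + df4
# Same computation, expressed as a string and evaluated by pd.eval
via_eval = pd.eval('df1 + df2 + df3 + df4')

print(np.allclose(direct, via_eval))
```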
Operations supported by pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following
integer DataFrames:
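As a sketch, assuming small random integer frames (values start at 1 only to avoid division by zero in this example), arithmetic operators behave the same inside pd.eval() as outside it:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.integers(1, 100, (100, 3)))
                           for _ in range(5))

# Arithmetic operators: plain Pandas versus the string form
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')

print(np.allclose(result1, result2))
```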
Comparison operators. pd.eval() supports all comparison operators, including chained expressions:
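A sketch of a chained comparison, using random integer frames as stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1, df2, df3, df4 = (pd.DataFrame(rng.integers(0, 100, (100, 3)))
                      for _ in range(4))

# Plain Pandas needs explicit "&" between the pairwise comparisons
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
# pd.eval accepts the chained form directly
result2 = pd.eval('df1 < df2 <= df3 != df4')

print(np.allclose(result1, result2))
```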
Object attributes and indices. pd.eval() supports access to object attributes via the obj.attr syntax, and indexes
via the obj[index] syntax:
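A small sketch of attribute and index access inside the expression string (the frames are random stand-ins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df2, df3 = (pd.DataFrame(rng.integers(0, 100, (100, 3)))
            for _ in range(2))

# .T and .iloc are attributes; [0] and [1] are index operations
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')

print(np.allclose(result1, result2))
```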
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar
ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array
as an example:
Using pd.eval() as above, we can compute expressions with the three columns like this:
The DataFrame.eval() method allows much more succinct evaluation of expressions with the
columns:
Notice here that we treat column names as variables within the evaluated expression, and the result is what we
would wish.
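The comparison can be sketched as follows, assuming a random DataFrame with columns 'A', 'B', and 'C'; the particular expression is an illustrative choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# Plain Pandas
result1 = (df['A'] + df['B']) / (df['C'] - 1)
# Top-level pd.eval: columns must be reached through the df variable
result2 = pd.eval('(df.A + df.B) / (df.C - 1)')
# DataFrame.eval: column names act as variables
result3 = df.eval('(A + B) / (C - 1)')

print(np.allclose(result1, result2), np.allclose(result1, result3))
```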
Assignment in DataFrame.eval()
DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has
columns 'A', 'B', and 'C':
We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns:
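A minimal sketch, assuming a random DataFrame and an arbitrary formula for the new column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# Assignment inside the expression creates column 'D' in place
df.eval('D = (A + B) / C', inplace=True)
print(df.head())
```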
In the same way, any existing column can be modified:
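For example, with an illustrative DataFrame that already has a column 'D':

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])
df['D'] = 0.0  # existing column to be overwritten

# Assigning to an existing name replaces that column's values
df.eval('D = (A - B) / C', inplace=True)
print(df.head())
```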
Local variables in DataFrame.eval()
The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.
Consider the following:
The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate
expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python
objects. Notice that this @ character is only supported by the DataFrame.eval() method, not by the
pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.
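A sketch of the @ syntax; column_mean is a local variable introduced here for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# A local Python variable (not a column of df)
column_mean = df.mean(axis=1)

result1 = df['A'] + column_mean
# "A" is looked up among the columns, "@column_mean" among local variables
result2 = df.eval('A + @column_mean')

print(np.allclose(result1, result2))
```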
DataFrame.query() Method
The DataFrame has another method based on evaluated strings, called the query() method. Consider filtering
rows with a Boolean mask built from comparisons on the columns.
As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the
DataFrame. It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering
operation, you can use the query() method.
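A sketch of such a filter in three forms, on a random DataFrame; the 0.5 thresholds are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# Boolean-mask filtering in plain Pandas
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
# The same mask through the top-level pd.eval
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
# The query() method: column names only, with "and" joining the conditions
result3 = df.query('A < 0.5 and B < 0.5')

print(np.allclose(result1, result2), result1.equals(result3))
```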
In addition to being a more efficient computation, compared to the masking expression this is much easier
to read and understand. Note that the query() method also accepts the @ flag to mark local variables.
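For instance, with a hypothetical local variable Cmean holding the mean of column 'C':

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

Cmean = df['C'].mean()  # local variable, referenced with @ inside query()

result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')

print(result1.equals(result2))
```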
Performance: When to Use These Functions
When considering whether to use these functions, there are two considerations: computation time and memory
use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving
NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays. For example, a mask
such as (df.A < 0.5) & (df.B < 0.5) first materializes each comparison as its own temporary array.
If the size of the temporary DataFrames is significant compared to your available system memory (typically
several gigabytes), then it’s a good idea to use an eval() or query() expression. You can check the approximate
size of your array in bytes using the nbytes attribute of the underlying values array.
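A minimal sketch of the size check on an illustrative DataFrame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((1000, 3)), columns=['A', 'B', 'C'])

# Approximate size in bytes of the underlying array (index not included)
size_bytes = df.values.nbytes
print(size_bytes)
```

Here 1000 rows of 3 float64 values take 1000 × 3 × 8 bytes.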
On the performance side, eval() can be faster even when you are not maxing out your system memory.
The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your
system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some
potentially slow movement of values between the different memory caches. In practice, I find that the
difference in computation time between the traditional methods and the eval/query method is usually not
significant—if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is
mainly in the saved memory, and the sometimes cleaner syntax they offer.
THANK
YOU