Module 3 Pandas 2 (1)

The document outlines Module 3 of a course on Data Manipulation with Pandas, focusing on vectorized string operations, time series, and high-performance techniques. It covers various string methods in Pandas, including those for regular expressions and indicator variables, as well as tools for working with time series data. The module also includes practical examples, such as a recipe database and visualizing time series data.

Uploaded by

Sonia N.S

Exploratory Data Analysis- BDS613B

Prepared By,
Dr. Anitha DB
Associate Professor & Head
Department of CSE-Data Science
ATME College of Engineering, Mysuru

ATME College of Engineering, Mysuru
Department of CSE-DS, ATMECE
Module 3 : Data Manipulation with Pandas II

• Vectorized String Operations
• Working with Time Series
• High-Performance Pandas: eval() and query()

Vectorized String Operations

• Introducing Pandas String Operations
• Tables of Pandas String Methods
• Methods similar to Python string methods
• Methods using regular expressions
• Miscellaneous methods
• Vectorized item access and slicing.
• Indicator variables
• Example: Recipe Database
• A simple recipe recommender

Introducing Pandas String Operations
Pandas addresses both the need for vectorized string operations and the need to handle missing data correctly through the str attribute of Pandas Series and Index objects containing strings. For example, suppose we create a Pandas Series of strings in which some entries are missing.

We can now call a single method that will capitalize all the entries, while skipping over any missing values:
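A minimal sketch of this behavior, assuming an illustrative list of names with one missing entry (the values are an assumption, not taken from the slides):

```python
import pandas as pd

# Illustrative data (assumed): inconsistent capitalization plus one missing value
names = pd.Series(["peter", "Paul", None, "MARY", "gUIDO"])

# One vectorized call capitalizes every entry; the missing value stays missing
capitalized = names.str.capitalize()
print(capitalized)
```

The same pattern applies to the other str methods: the call is written once and applied to every non-missing element.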

Tables of Pandas String Methods
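Methods similar to Python string methods: Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. The Pandas str methods that mirror Python string methods include len(), lower(), upper(), strip(), split(), startswith(), endswith(), find(), capitalize(), and isdigit(), among others.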
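As a hedged illustration, the sketch below applies a few of these mirrored methods to a small Series. The name monte and its contents are assumptions chosen for the example; note that the return type varies by method (strings, integers, Booleans, or lists).

```python
import pandas as pd

# Illustrative data (assumed, not taken from the slides)
monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam",
                   "Eric Idle", "Terry Jones", "Michael Palin"])

lowered = monte.str.lower()            # Series of strings
lengths = monte.str.len()              # Series of integers
starts_t = monte.str.startswith("T")   # Series of Booleans
parts = monte.str.split()              # Series of lists of words

print(lengths.tolist())
print(starts_t.tolist())
```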

Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string
element, and follow some of the API conventions of Python’s built-in re module (see Table 3-4).

With these, we can do a wide range of interesting operations. For example, we can extract the first name from
each by asking for a contiguous group of characters at the beginning of each element:

Or we can do something more complicated, like finding all names that start and end with a consonant, making
use of the start-of-string (^) and end-of-string ($) regular expression characters:

The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities
for analysis and cleaning of data.
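The two regular-expression operations described above can be sketched as follows. The Series monte is an assumed example; str.extract() and str.findall() are the Pandas methods used, and the patterns are one possible way to express "first name" and "starts and ends with a consonant".

```python
import pandas as pd

# Illustrative data (assumed, not taken from the slides)
monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam",
                   "Eric Idle", "Terry Jones", "Michael Palin"])

# First name: the contiguous run of letters at the beginning of each element
first_names = monte.str.extract("([A-Za-z]+)", expand=False)

# Names that start and end with a consonant, using ^ and $ anchors;
# non-matching elements yield an empty list
consonant_names = monte.str.findall(r"^[^AEIOU].*[^aeiou]$")

print(first_names.tolist())
print(consonant_names.tolist())
```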

Regular Expression

Slides 11 to 14 are not part of the syllabus.


A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
RegEx can be used to check if a string contains the specified search pattern.

Metacharacters

Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations (see Table 3-5).

Vectorized item access and slicing. The get() and slice() operations, in particular, enable vectorized element access from each string. For example, we can get a slice of the first three characters of each string using str.slice(0, 3). Note that this behavior is also available through Python’s normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to df.str[0:3].

Indexing via df.str.get(i) and df.str[i] is similar. These get() and slice() methods also let you access elements of the lists returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
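A short sketch of both ideas, again using an assumed Series named monte (the data is illustrative only):

```python
import pandas as pd

# Illustrative data (assumed, not taken from the slides)
monte = pd.Series(["Graham Chapman", "John Cleese", "Terry Gilliam",
                   "Eric Idle", "Terry Jones", "Michael Palin"])

# Two equivalent ways to take the first three characters of each string
first_three = monte.str[0:3]
first_three_alt = monte.str.slice(0, 3)

# split() gives a list per element; get(-1) picks the last item of each list
last_names = monte.str.split().str.get(-1)

print(first_three.tolist())
print(last_names.tolist())
```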

Indicator variables. Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A=“born in America,” B=“born in the United Kingdom,” C=“likes cheese,” D=“likes spam”:

The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
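A minimal sketch, assuming a small DataFrame whose info column holds codes separated by the "|" character (the names, codes, and the column name info are illustrative assumptions):

```python
import pandas as pd

# Illustrative data (assumed): each row carries a "|"-separated set of codes
full_monte = pd.DataFrame({
    "name": ["Graham Chapman", "John Cleese", "Terry Gilliam",
             "Eric Idle", "Terry Jones", "Michael Palin"],
    "info": ["B|C|D", "B|D", "A|C", "B|D", "B|C", "B|C|D"],
})

# One indicator column per distinct code; 1 where the code is present
dummies = full_monte["info"].str.get_dummies("|")
print(dummies)
```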

With these operations as building blocks, you can construct an endless range of string processing
procedures when cleaning your data.

Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

3.2 Working with Time Series


Pandas was developed in the context of financial modeling, and it contains a fairly extensive set of tools for
working with dates, times, and time indexed data.
Topics
• Dates and Times in Python
• Pandas Time Series: Indexing by Time
• Pandas Time Series Data Structures
• Frequencies and Offsets
• Resampling, Shifting, and Windowing
• Example: Visualizing Seattle Bicycle Counts

3.2.1 Dates and Times in Python

The Python world has a number of available representations of dates, times, deltas, and timespans. While the
time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see
their relationship to other packages used in Python.

Native Python dates and times: datetime and dateutil


Python’s basic objects for working with dates and times reside in the built-in datetime module. Along with the third-party dateutil module, we can use it to quickly perform a host of useful operations on dates and times. For example, we can manually build a date using the datetime type:

Once we have a datetime object, we can do things like printing the day of the week:
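A small sketch of both steps. The specific date is an assumption for illustration; dateutil.parser.parse() is used here to show the string-parsing route alongside the manual constructor.

```python
from datetime import datetime
from dateutil import parser

# Build a date manually with the datetime type (the date itself is illustrative)
date = datetime(year=2015, month=7, day=4)

# dateutil can parse the same date from a loosely formatted string
parsed = parser.parse("4th of July, 2015")

# strftime with the %A format code gives the full weekday name
weekday = date.strftime("%A")
print(date, parsed, weekday)
```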

Typed arrays of times: NumPy’s datetime64


The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly. The datetime64 dtype requires a very specific input format:

Once we have this date formatted, however, we can quickly do vectorized operations on it:
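A minimal sketch, assuming an illustrative ISO-formatted date string as input:

```python
import numpy as np

# datetime64 expects an ISO-style string; here the unit is inferred as days
date = np.array("2015-07-04", dtype=np.datetime64)

# Vectorized arithmetic: adding integers advances the date by whole days
dates = date + np.arange(12)
print(dates.dtype)
print(dates)
```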

datetime64 imposes a trade-off between time resolution and maximum time span.

Dates and times in Pandas: Best of both worlds
Pandas builds upon all the tools just discussed to provide a Timestamp object, which combines the ease of use of
datetime and dateutil with the efficient storage and vectorized interface of numpy.datetime64. From a group of
these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in a Series or
DataFrame.

For example, we can use Pandas tools to repeat the demonstration from above. We can parse a flexibly formatted
string date, and use format codes to output the day of the week:
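A hedged sketch of the Pandas equivalent. The date string is an assumed example; pd.to_datetime() and pd.to_timedelta() are the only Pandas calls used.

```python
import numpy as np
import pandas as pd

# Parse a loosely formatted date string into a Timestamp
date = pd.to_datetime("4th of July, 2015")

# Format codes work as they do for Python's datetime
weekday = date.strftime("%A")

# NumPy-style vectorized arithmetic on the same object
dates = date + pd.to_timedelta(np.arange(12), "D")
print(weekday)
print(dates)
```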

3.2.2 Pandas Time Series: Indexing by Time
Where the Pandas time series tools really become useful is when you begin to index data by timestamps. For
example, we can construct a Series object that has time indexed data:

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from
that year:
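A minimal sketch of time-based indexing, assuming four illustrative dates and integer values. The .loc accessor is used for both the date-range slice and the year-only selection.

```python
import pandas as pd

# Illustrative time-indexed data (assumed values)
index = pd.DatetimeIndex(["2014-07-04", "2014-08-04",
                          "2015-07-04", "2015-08-04"])
data = pd.Series([0, 1, 2, 3], index=index)

# Slice with date strings; both endpoints are included
window = data.loc["2014-07-04":"2015-07-04"]

# Passing only a year selects every entry from that year
year_2015 = data.loc["2015"]

print(window)
print(year_2015)
```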

3.2.3 Pandas Time Series Data Structures
This section will introduce the fundamental Pandas data structures for working with time series data:

• For time stamps, Pandas provides the Timestamp type. It is essentially a replacement for Python’s native
datetime, but is based on the more efficient numpy.datetime64 data type. The associated index structure is
DatetimeIndex.
• For time periods, Pandas provides the Period type. This encodes a fixed frequency interval based on
numpy.datetime64. The associated index structure is PeriodIndex.
• For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for
Python’s native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is
TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. While these
class objects can be invoked directly, it is more common to use the pd.to_datetime() function, which can parse a
wide variety of formats. Passing a single date to pd.to_datetime() yields a Timestamp; passing a series of dates
by default yields a DatetimeIndex:
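The two cases can be sketched as follows. The dates are illustrative, and ISO-formatted strings are an assumption made here to keep the parsing unambiguous.

```python
import pandas as pd

# A single date yields a Timestamp
single = pd.to_datetime("2015-07-03")

# A list of dates yields a DatetimeIndex
dates = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06",
                        "2015-07-07", "2015-07-08"])

print(type(single).__name__)
print(type(dates).__name__)
```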

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function with the addition of a
frequency code; here we’ll use 'D' to indicate daily frequency:

A TimedeltaIndex is created, for example, when one date is subtracted from another:
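A short sketch of both conversions, assuming an illustrative DatetimeIndex named dates:

```python
import pandas as pd

# Illustrative dates (assumed)
dates = pd.to_datetime(["2015-07-03", "2015-07-04", "2015-07-06",
                        "2015-07-07", "2015-07-08"])

# DatetimeIndex -> PeriodIndex with daily frequency
periods = dates.to_period("D")

# Subtracting a date from the index gives a TimedeltaIndex
deltas = dates - dates[0]

print(periods)
print(deltas)
```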

Regular sequences: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this
purpose: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time
deltas.
pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates. By default, the frequency is one day:
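For instance, with assumed start and end dates:

```python
import pandas as pd

# Start and end are both included; the default spacing is one day
daily = pd.date_range("2015-07-03", "2015-07-10")
print(daily)
```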

Alternatively, the date range can be specified not with a start- and endpoint, but with a startpoint and a number of
periods:

We can modify the spacing by altering the freq argument, which defaults to D. For example, here we will
construct a range of hourly timestamps:
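A sketch of both variants, with an assumed start date. The hourly code is written in lowercase ("h") here because recent Pandas releases prefer it over the uppercase H mentioned in the text; that spelling choice is an assumption about the installed version.

```python
import pandas as pd

# A start point plus a number of periods (daily by default)
by_count = pd.date_range("2015-07-03", periods=8)

# The same, but spaced one hour apart
hourly = pd.date_range("2015-07-03", periods=8, freq="h")

print(by_count)
print(hourly)
```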

To create regular sequences of period or time delta values, the very similar pd.period_range() and
pd.timedelta_range() functions are useful. Here are some monthly periods:

And a sequence of durations increasing by an hour:
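Both functions can be sketched as follows; the start values and lengths are illustrative assumptions.

```python
import pandas as pd

# Eight consecutive monthly periods
months = pd.period_range("2015-07", periods=8, freq="M")

# Ten durations, each one hour longer than the previous
durations = pd.timedelta_range(start="0 days", periods=10, freq="h")

print(months)
print(durations)
```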

3.2.4 Frequencies and Offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw the D (day) and H (hour) codes previously, we can use such codes to specify any desired frequency spacing. The listing of Pandas frequency codes summarizes the main codes available.
The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an S suffix to any of these marks it instead at the beginning (Table 3-8).
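The difference between end-marked and start-marked frequencies can be sketched for the monthly case. The month-end range is built from the pd.offsets.MonthEnd() object rather than a string code, an assumption made so that the sketch does not depend on how a particular Pandas version spells that alias.

```python
import pandas as pd

# "MS" marks the start of each month
month_starts = pd.date_range("2015-01-01", periods=3, freq="MS")

# The month-end offset marks the last calendar day of each month
month_ends = pd.date_range("2015-01-01", periods=3, freq=pd.offsets.MonthEnd())

print(month_starts)
print(month_ends)
```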

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:

In the same way, you can modify the split-point of the weekly frequency by adding a three-letter
weekday code:

On top of this, codes can be combined with numbers to specify other frequencies. For example, for a
frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows:
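A hedged sketch of the three kinds of modified code described above. The dates are illustrative, and the combined frequency is written as "2h30min" on the assumption of a recent Pandas release, where lowercase h and min replace the H and T spellings used in the text.

```python
import pandas as pd

# Quarter starts anchored so that quarters begin in February
quarter_starts = pd.date_range("2015-01-01", periods=3, freq="QS-FEB")

# Weekly frequency anchored on Wednesdays
wednesdays = pd.date_range("2015-01-01", periods=3, freq="W-WED")

# Numbers combined with codes: steps of 2 hours 30 minutes
steps = pd.timedelta_range(start="0 days", periods=9, freq="2h30min")

print(quarter_starts)
print(wednesdays)
print(steps)
```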

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the
pd.tseries.offsets module. For example, we can create a business day offset directly as follows:
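A minimal sketch, with an assumed start date that falls on a Wednesday so the weekend skip is visible:

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Five consecutive business days starting 2015-07-01 (a Wednesday)
bdays = pd.date_range("2015-07-01", periods=5, freq=BDay())
print(bdays)
```

The weekend of 4 and 5 July is skipped, so the range runs from 1 July to 7 July.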

3.2.5 Resampling, Shifting, and Windowing

The ability to use dates and times as indices to organize and access data is an important piece of the Pandas
time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive
data slicing and access, etc.) still apply, and Pandas provides several additional time series–specific
operations.

Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data. For example, the accompanying pandas-datareader package (installable via conda install pandas-datareader) knows how to import financial data from a number of available sources, including Yahoo Finance, Google Finance, and others. Here we will load Google’s closing price history:

We can visualize this using the plot() method, after the normal Matplotlib setup boilerplate (Figure 3-5):

Resampling and converting frequencies


One common need for time series data is resampling at a higher or lower frequency. You can do this using the
resample() method, or the much simpler asfreq() method.

The primary difference between the two is that resample() is fundamentally a data aggregation, while
asfreq() is fundamentally a data selection.

Taking a look at the Google closing price, let’s compare what the two return when we down-sample the data.
Here we will resample the data at the end of business year (Figure 3-6):

Notice the difference: at each point, resample reports the average of the previous year, while asfreq reports the value at the end of the year.

Here we resample the business day data at a daily frequency (i.e., including weekends); see Figure 3-7:

The top panel is the default: non-business days are left as NA values and do not appear on the plot. The bottom panel shows the differences between two strategies for filling the gaps: forward-filling and backward-filling.
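The three behaviours can be sketched on a tiny business-day series (the dates and values are illustrative assumptions):

```python
import pandas as pd

# Five business days, Wednesday 2015-07-01 through Tuesday 2015-07-07
bdays = pd.bdate_range("2015-07-01", periods=5)
data = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=bdays)

# Up-sample to calendar days: the weekend rows appear as NA by default
daily = data.asfreq("D")

# Fill the weekend from the previous value or from the next value
ffilled = data.asfreq("D", method="ffill")
bfilled = data.asfreq("D", method="bfill")

print(pd.DataFrame({"default": daily, "ffill": ffilled, "bfill": bfilled}))
```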

Time-shifts
Another common time series–specific operation is shifting of data in time. Pandas has two closely related methods for computing this: shift() and tshift(). In short, the difference between them is that shift() shifts the data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency. Note that tshift() was deprecated in Pandas 1.1 and removed in Pandas 2.0; calling shift() with the freq argument performs the same index shift.

Here we will both shift() and tshift() by 900 days (Figure 3-8):
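The distinction can be sketched on a five-day series with a shift of 2 instead of 900 (the data is an illustrative assumption). The index shift is written as shift(freq=...), which works in current Pandas releases and plays the role of tshift().

```python
import pandas as pd

# Illustrative daily data (assumed)
days = pd.date_range("2015-01-01", periods=5, freq="D")
data = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=days)

# Shift the data: the index stays put, values move, NA fills the gap
shifted_data = data.shift(2)

# Shift the index: values stay put, every timestamp moves by two days
shifted_index = data.shift(2, freq="D")

print(shifted_data)
print(shifted_index)
```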

We see here that shift(900) shifts the data by 900 days, pushing some of it off the end of the graph (and leaving NA values at the other end), while tshift(900) shifts the index values by 900 days.

A common context for this type of shift is computing differences over time. For example, we use shifted values to compute the one-year return on investment for Google stock over the course of the dataset (Figure 3-9):
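A sketch of the calculation on synthetic data. The price series below grows by a fixed 0.1% per day and is an assumption standing in for the stock data; the index shift is written with shift(freq=...).

```python
import numpy as np
import pandas as pd

# Synthetic daily prices growing 0.1% per day (assumed, not real stock data)
days = pd.date_range("2014-01-01", periods=3 * 365, freq="D")
prices = pd.Series(100.0 * 1.001 ** np.arange(len(days)), index=days)

# Moving the index back by 365 days lines each date up with the price one
# year later, so the ratio is the one-year return in percent
roi = 100 * (prices.shift(-365, freq="D") / prices - 1)

valid = roi.dropna()
print(valid.head())
```

Because the synthetic growth rate is constant, every valid entry has the same return; with real prices the curve would vary over time.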

This helps us to see the overall trend in Google stock: thus far, the most profitable times to invest in Google have been (unsurprisingly, in retrospect) shortly after its IPO, and in the middle of the 2009 recession.

Rolling windows
Rolling statistics are a third type of time series–specific operation implemented by Pandas. These can be accomplished via the rolling() method of Series and DataFrame objects, which returns a view similar to what we saw with the groupby operation. This rolling view makes available a number of aggregation operations by default. For example, here is the one-year centered rolling mean and standard deviation of the Google stock prices (Figure 3-10):
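The same pattern on a small scale, using a three-day centered window over assumed data instead of a one-year window over stock prices:

```python
import numpy as np
import pandas as pd

# Illustrative daily data (assumed): the values 0.0 through 9.0
days = pd.date_range("2015-01-01", periods=10, freq="D")
data = pd.Series(np.arange(10, dtype=float), index=days)

# A centered window of three observations
rolling = data.rolling(3, center=True)

smoothed = pd.DataFrame({"input": data,
                         "mean": rolling.mean(),
                         "std": rolling.std()})
print(smoothed)
```

The first and last rows are NA because a full centered window is not available at the edges.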

Example: Visualizing Seattle Bicycle Counts

3.3 High-Performance Pandas: eval() and query()

Motivating query() and eval(): Compound Expressions


Computing compound expressions

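Because NumPy evaluates each subexpression separately, a compound expression over two arrays x and y is roughly equivalent to computing each subexpression into a temporary array and then combining the temporaries.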
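A minimal sketch of this equivalence, assuming two random arrays x and y and an illustrative compound Boolean expression:

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y = rng.rand(1000)

# A compound expression written in one line
mask = (x > 0.5) & (y < 0.5)

# What NumPy effectively does: one temporary array per subexpression
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_explicit = tmp1 & tmp2

print(np.array_equal(mask, mask_explicit))
```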

In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.

pandas.eval() for Efficient Operations
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For
example, consider the following DataFrames

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

We can compute the same result via pd.eval by constructing the expression as a string:

The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same
result:
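A sketch of the comparison, assuming four random DataFrames of an illustrative size. No timing is measured here; the sketch only checks that both routes give the same values. pd.eval() uses Numexpr when it is installed and otherwise falls back to a plain Python engine.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
nrows, ncols = 1000, 10  # illustrative size (assumed)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for _ in range(4))

# The typical Pandas approach
plain = df1 + df2 + df3 + df4

# The same expression passed to pd.eval() as a string
evaluated = pd.eval("df1 + df2 + df3 + df4")

print(np.allclose(plain, evaluated))
```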

Operations supported by pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following
integer DataFrames:

Arithmetic operators. pd.eval() supports all arithmetic operators. For example:
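A hedged sketch using five small integer DataFrames. The sizes and value range are assumptions, with values starting at 1 so that the division never hits zero.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Illustrative integer DataFrames (assumed size and range)
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(1, 1000, (100, 3)))
                           for _ in range(5))

result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval("-df1 * df2 / (df3 + df4) - df5")

print(np.allclose(result1, result2))
```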

Comparison operators. pd.eval() supports all comparison operators, including chained expressions:

Bitwise operators. pd.eval() supports the & and | bitwise operators:
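Both kinds of operator can be sketched together. The integer DataFrames are illustrative assumptions, and each pd.eval() result is compared with the equivalent expression written in ordinary Pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Illustrative integer DataFrames (assumed size and range)
df1, df2, df3, df4 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                      for _ in range(4))

# Chained comparison versus its expanded form
chained_plain = (df1 < df2) & (df2 <= df3) & (df3 != df4)
chained_eval = pd.eval("df1 < df2 <= df3 != df4")

# Bitwise & and | on Boolean results
bitwise_plain = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
bitwise_eval = pd.eval("(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)")

print(np.array_equal(chained_plain.to_numpy(), chained_eval.to_numpy()))
print(np.array_equal(bitwise_plain.to_numpy(), bitwise_eval.to_numpy()))
```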

In addition, it supports the use of the literal keywords "and" and "or" in Boolean expressions:

Object attributes and indices. pd.eval() supports access to object attributes via the obj.attr syntax, and indexes
via the obj[index] syntax:
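A sketch of both features, again on assumed integer DataFrames. The attribute and index expression shown is one illustrative combination, not the only form pd.eval() accepts.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Illustrative integer DataFrames (assumed size and range)
df1, df2, df3, df4 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
                      for _ in range(4))

# Literal "and" / "or" inside the evaluated string
bool_plain = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
bool_eval = pd.eval("(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)")

# Attribute access (obj.attr) and indexing (obj[index]) inside the string
attr_plain = df2.T[0] + df3.iloc[1]
attr_eval = pd.eval("df2.T[0] + df3.iloc[1]")

print(np.array_equal(bool_plain.to_numpy(), bool_eval.to_numpy()))
print(np.allclose(attr_plain, attr_eval))
```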

DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar
ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array
as an example:

Using pd.eval() as above, we can compute expressions with the three columns like this:

The DataFrame.eval() method allows much more succinct evaluation of expressions with the
columns:
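The three equivalent forms can be sketched as follows, assuming a random DataFrame with columns A, B, and C and an illustrative arithmetic expression:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Ordinary Pandas
result1 = (df["A"] + df["B"]) / (df["C"] - 1)

# Top-level pd.eval(): columns are reached through the DataFrame name
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")

# DataFrame.eval(): column names act as variables
result3 = df.eval("(A + B) / (C - 1)")

print(np.allclose(result1, result2), np.allclose(result1, result3))
```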

Notice here that we treat column names as variables within the evaluated expression, and the result is what we
would wish.
Assignment in DataFrame.eval()
DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has
columns 'A', 'B', and 'C'.

We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns:

In the same way, any existing column can be modified:
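A minimal sketch of creating and then overwriting a column. The DataFrame and the formulas are illustrative assumptions; inplace=True is passed so that the assignment modifies df directly.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Create a new column D from the existing columns
df.eval("D = (A + B) / C", inplace=True)
d_created = df["D"].copy()

# Overwrite the existing column D with a different formula
df.eval("D = (A - B) / C", inplace=True)

print(df.head())
```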

Local variables in DataFrame.eval()

The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.
Consider the following:
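A sketch of the local-variable syntax, assuming a random DataFrame and a local variable named column_mean (both are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# A local Python variable: the mean of each row
column_mean = df.mean(axis=1)

result1 = df["A"] + column_mean
# A is a column name; @column_mean refers to the local variable
result2 = df.eval("A + @column_mean")

print(np.allclose(result1, result2))
```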

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate
expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python
objects. Notice that this @ character is only supported by the DataFrame.eval() method, not by the
pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.

DataFrame.query() Method

The DataFrame has another method based on evaluated strings, called the query() method. Consider a masking expression that selects rows of a DataFrame by conditions on its columns.

As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the DataFrame. It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering operation, you can use the query() method.
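A minimal sketch comparing a masking expression with the equivalent query() call. The DataFrame and the 0.5 thresholds are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Filtering with a Boolean mask
result_mask = df[(df.A < 0.5) & (df.B < 0.5)]

# The same filter written as a query string
result_query = df.query("A < 0.5 and B < 0.5")

print(result_mask.equals(result_query))
```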

In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. Note that the query() method also accepts the @ flag to mark local variables.
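For instance, with an assumed local variable c_mean holding the mean of column C:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# A local Python variable used as the threshold (illustrative)
c_mean = df["C"].mean()

result_mask = df[(df.A < c_mean) & (df.B < c_mean)]
result_query = df.query("A < @c_mean and B < @c_mean")

print(result_mask.equals(result_query))
```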

Performance: When to Use These Functions

When considering whether to use these functions, there are two considerations: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays, one for each intermediate subexpression.

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it’s a good idea to use an eval() or query() expression. You can check the approximate size of your array in bytes using the nbytes attribute of the underlying values array.
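Both points can be sketched on an assumed random DataFrame: the temporaries hidden inside a compound mask, and the size check via df.values.nbytes.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# A compound masking expression
x = df[(df.A < 0.5) & (df.B < 0.5)]

# Roughly what happens underneath: three temporaries before the selection
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x_explicit = df[tmp3]

# Approximate memory footprint of the data, in bytes
size_in_bytes = df.values.nbytes
print(size_in_bytes)
```

For this 1000 x 3 array of 64-bit floats the figure is 24,000 bytes, far below the point where eval() or query() would matter for memory.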

On the performance side, eval() can be faster even when you are not maxing out your system memory.
The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your
system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some
potentially slow movement of values between the different memory caches. In practice, I find that the
difference in computation time between the traditional methods and the eval/query method is usually not
significant—if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is
mainly in the saved memory, and the sometimes cleaner syntax they offer.

THANK YOU