Module 3 Pandas 2 (1)

The document outlines Module 3 of a course on Data Manipulation with Pandas, focusing on vectorized string operations, time series, and high-performance techniques. It covers various string methods in Pandas, including those for regular expressions and indicator variables, as well as tools for working with time series data. The module also includes practical examples, such as a recipe database and visualizing time series data.

Uploaded by

Sonia N.S

Exploratory Data Analysis- BDS613B

Prepared By,
Dr. Anitha DB
Associate Professor & Head
Department of CSE-Data Science
ATME College of Engineering, Mysuru

Module 3 : Data Manipulation with Pandas II

• Vectorized String Operations
• Working with Time Series
• High-Performance Pandas: eval() and query()

Vectorized String Operations

• Introducing Pandas String Operations

• Tables of Pandas String Methods
• Methods similar to Python string methods
• Methods using regular expressions
• Miscellaneous methods
• Vectorized item access and slicing.
• Indicator variables
• Example: Recipe Database
• A simple recipe recommender
Introducing Pandas String Operations

Pandas includes features to address both the need for vectorized string operations and the need to handle
missing data correctly, via the str attribute of Pandas Series and Index objects containing strings. For example,
given a Pandas Series of names with inconsistent capitalization and one missing entry, we can call a single
method, str.capitalize(), that capitalizes all the entries while skipping over any missing values.
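The slide's own data did not survive extraction, so the following is a minimal sketch with hypothetical sample names (the variable names `names` and `capitalized` are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical sample data: mixed capitalization plus one missing value
names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# A single vectorized call capitalizes every string and leaves the missing entry missing
capitalized = names.str.capitalize()
print(capitalized)
```

Calling the plain Python method in a loop, for example `[s.capitalize() for s in names]`, would raise an AttributeError on the missing value, which is the problem the str accessor avoids.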

Tables of Pandas String Methods
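Methods similar to Python string methods: Nearly all of Python’s built-in string methods are mirrored by a Pandas
vectorized string method. The Pandas str methods that mirror Python string methods include len(), lower(),
upper(), capitalize(), swapcase(), strip(), lstrip(), rstrip(), split(), rsplit(), partition(), rpartition(),
find(), rfind(), index(), rindex(), startswith(), endswith(), center(), ljust(), rjust(), zfill(), translate(),
isalnum(), isalpha(), isdigit(), isdecimal(), isnumeric(), isspace(), islower(), isupper(), and istitle().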
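As a small illustration of a few of these mirrored methods, here is a sketch on a hypothetical Series of names (`monte` and the result names are assumptions, not taken from the slides):

```python
import pandas as pd

# Hypothetical sample names used only for illustration
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

lowered = monte.str.lower()            # mirrors str.lower(): returns strings
lengths = monte.str.len()              # mirrors len(): returns integers
starts_t = monte.str.startswith('T')   # mirrors str.startswith(): returns Booleans
parts = monte.str.split()              # mirrors str.split(): each element is a list
```

Note that the return type depends on the method: some return strings, others numbers, Booleans, or lists.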

Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string
element, and follow some of the API conventions of Python’s built-in re module (see Table 3-4).

With these, we can do a wide range of interesting operations. For example, we can extract the first name from
each by asking for a contiguous group of characters at the beginning of each element:

Or we can do something more complicated, like finding all names that start and end with a consonant, making
use of the start-of-string (^) and end-of-string ($) regular expression characters:
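Both operations can be sketched as follows, assuming a hypothetical Series of full names called `monte` (the data and the result names are illustrative assumptions):

```python
import pandas as pd

# Hypothetical sample names
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# First name: the first contiguous run of letters in each element
first_names = monte.str.extract('([A-Za-z]+)', expand=False)

# Names that start with a non-vowel capital and end with a non-vowel lowercase letter;
# findall returns a list of matches per element (empty when nothing matches)
consonant_names = monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
```

Here `expand=False` asks extract() for a Series rather than a one-column DataFrame.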

The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities
for analysis and cleaning of data.
Regular Expression

Slides 11 to 14 are not part of the syllabus.

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
RegEx can be used to check if a string contains the specified search pattern.

Metacharacters
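A few common metacharacters in action, as a minimal sketch using Python's built-in re module (the sample sentence and variable names are assumptions chosen for illustration):

```python
import re

text = "The rain in Spain"

# ^ anchors the start, . matches any character, * repeats it, $ anchors the end
starts_and_ends = re.search(r"^The.*Spain$", text)

# [a-m] is a character set: findall returns every lowercase letter from a to m
letters = re.findall(r"[a-m]", text)

# \s matches a whitespace character: split breaks the string at each match
words = re.split(r"\s", text)
```

re.search() returns a match object when the pattern is found and None otherwise, so it can be used directly in an if test.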

Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations (see Table 3-5)

Vectorized item access and slicing
The get() and slice() operations, in particular, enable vectorized element access from each array. For example,
we can get a slice of the first three characters of each array using str.slice(0, 3). Note that this behavior is
also available through Python’s normal indexing syntax; for example, df.str.slice(0, 3) is equivalent to
df.str[0:3]. Here df stands for a Series or Index of strings, since the str accessor is not defined on a DataFrame.

Indexing via df.str.get(i) and df.str[i] is similar. These get() and slice() methods also let you access elements of
arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get().
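A minimal sketch of both ideas, assuming a hypothetical Series of full names (`monte` and the result names are illustrative):

```python
import pandas as pd

# Hypothetical sample names
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam'])

# Slice syntax and slice() give the same first three characters
first_three = monte.str[0:3]
same_slice = monte.str.slice(0, 3)

# split() yields a list per element; get(-1) picks the last item of each list
last_names = monte.str.split().str.get(-1)
```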

Indicator variables
Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your
data has a column containing some sort of coded indicator. For example, we might have a dataset that contains
information in the form of codes, such as A=“born in America,” B=“born in the United Kingdom,”
C=“likes cheese,” D=“likes spam”.

The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
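A sketch of this, assuming a hypothetical DataFrame whose 'info' column stores the codes separated by "|" (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical coded data: each 'info' entry lists the codes that apply
full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam'],
    'info': ['B|C|D', 'B|D', 'A|C'],
})

# One indicator column per code; a nonzero entry means the code was present
dummies = full_monte['info'].str.get_dummies('|')
print(dummies)
```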

With these operations as building blocks, you can construct an endless range of string processing
procedures when cleaning your data.

Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the
Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on
some ingredients we have on hand.

3.2 Working with Time Series

Pandas was developed in the context of financial modeling, and it contains a fairly extensive set of tools for
working with dates, times, and time indexed data.
Topics
• Dates and Times in Python
• Pandas Time Series: Indexing by Time
• Pandas Time Series Data Structures
• Frequencies and Offsets
• Resampling, Shifting, and Windowing
• Example: Visualizing Seattle Bicycle Counts

3.2.1 Dates and Times in Python

The Python world has a number of available representations of dates, times, deltas, and timespans. While the
time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see
their relationship to other packages used in Python.

Native Python dates and times: datetime and dateutil

Python’s basic objects for working with dates and times reside in the built-in datetime module. Along with the
third-party dateutil module, we can use it to quickly perform a host of useful operations on dates and times. For
example, we can manually build a date using the datetime type:

Once we have a datetime object, we can do things like printing the day of the week:
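A minimal sketch with the standard library only (the chosen date and the variable names are assumptions for illustration):

```python
from datetime import datetime

# Build a date by hand
date = datetime(year=2015, month=7, day=4)

# %A is the strftime format code for the full weekday name
weekday = date.strftime('%A')
print(weekday)
```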

Typed arrays of times: NumPy’s datetime64

The datetime64 dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very
compactly. datetime64 requires a very specific input format:

Once we have this date formatted, however, we can quickly do vectorized operations on it:
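A short sketch of both steps, assuming an ISO-formatted date string as input (the date and variable names are illustrative):

```python
import numpy as np

# An ISO 8601 string parsed into a day-resolution datetime64 value
date = np.array('2015-07-04', dtype=np.datetime64)

# Vectorized arithmetic: adding integers adds that many days
next_days = date + np.arange(12)
print(next_days)
```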

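datetime64 imposes a trade-off between time resolution and maximum time span. For example, at nanosecond
resolution the 64 bits can encode only 2^64 nanoseconds, or just under 600 years.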
Dates and times in Pandas: Best of both worlds
Pandas builds upon all the tools just discussed to provide a Timestamp object, which combines the ease of use of
datetime and dateutil with the efficient storage and vectorized interface of numpy.datetime64. From a group of
these Timestamp objects, Pandas can construct a DatetimeIndex that can be used to index data in a Series or
DataFrame.

For example, we can use Pandas tools to repeat the demonstration from above. We can parse a flexibly formatted
string date, and use format codes to output the day of the week:
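A sketch of the Pandas version, assuming a loosely formatted date string (the string and variable names are illustrative):

```python
import pandas as pd

# Parse a flexibly formatted date string into a Timestamp
date = pd.to_datetime("July 4, 2015")

# The same strftime format codes work on a Timestamp
weekday = date.strftime('%A')

# NumPy-style vectorized arithmetic also works directly on the same object
following = date + pd.to_timedelta([0, 1, 2], unit='D')
```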

3.2.2 Pandas Time Series: Indexing by Time
Where the Pandas time series tools really become useful is when you begin to index data by timestamps. For
example, we can construct a Series object that has time indexed data:

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from
that year:
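A sketch of time-indexed data and date-based selection, using hypothetical dates and values:

```python
import pandas as pd

# A Series whose index is a DatetimeIndex
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04', '2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)

# Slicing with date strings includes both endpoints
window = data.loc['2014-07-04':'2015-07-04']

# Passing only a year selects every entry from that year
only_2015 = data.loc['2015']
```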

3.2.3 Pandas Time Series Data Structures
This section will introduce the fundamental Pandas data structures for working with time series data:

• For time stamps, Pandas provides the Timestamp type. It is essentially a replacement for Python’s native
datetime, but is based on the more efficient numpy.datetime64 data type. The associated index structure is
DatetimeIndex.
• For time periods, Pandas provides the Period type. This encodes a fixed frequency interval based on
numpy.datetime64. The associated index structure is PeriodIndex.
• For time deltas or durations, Pandas provides the Timedelta type. Timedelta is a more efficient replacement for
Python’s native datetime.timedelta type, and is based on numpy.timedelta64. The associated index structure is
TimedeltaIndex.

The most fundamental of these date/time objects are the Timestamp and DatetimeIndex objects. While these
class objects can be invoked directly, it is more common to use the pd.to_datetime() function, which can parse a
wide variety of formats. Passing a single date to pd.to_datetime() yields a Timestamp; passing a series of dates
by default yields a DatetimeIndex:

Any DatetimeIndex can be converted to a PeriodIndex with the to_period() function with the addition of a
frequency code; here we’ll use 'D' to indicate daily frequency:

A TimedeltaIndex is created, for example, when one date is subtracted from another:
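The three index types can be sketched together, assuming a small list of ISO-formatted dates (the dates and names are illustrative):

```python
import pandas as pd

# A single date gives a Timestamp; a list of dates gives a DatetimeIndex
single = pd.to_datetime('2015-07-03')
dates = pd.to_datetime(['2015-07-03', '2015-07-04', '2015-07-06',
                        '2015-07-07', '2015-07-08'])

# Convert to a PeriodIndex with daily frequency
periods = dates.to_period('D')

# Subtracting a date from the index gives a TimedeltaIndex
deltas = dates - dates[0]
```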

Regular sequences: pd.date_range()
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this
purpose: pd.date_range() for timestamps, pd.period_range() for periods, and pd.timedelta_range() for time
deltas.
pd.date_range() accepts a start date, an end date, and an optional frequency code to create a regular sequence
of dates. By default, the frequency is one day:

Alternatively, the date range can be specified not with a start- and endpoint, but with a startpoint and a number of
periods:

We can modify the spacing by altering the freq argument, which defaults to D. For example, here we will
construct a range of hourly timestamps:
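The three ways of calling pd.date_range() described here can be sketched as follows. The dates are illustrative, and the hour code is written in lowercase ('h') on the assumption that a recent pandas release is in use, since newer releases deprecate the uppercase spelling:

```python
import pandas as pd

# Start and end date, default daily frequency
by_end = pd.date_range('2015-07-03', '2015-07-10')

# Start date plus a number of periods gives the same eight days
by_count = pd.date_range('2015-07-03', periods=8)

# Changing freq changes the spacing: eight hourly timestamps
hourly = pd.date_range('2015-07-03', periods=8, freq='h')
```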

To create regular sequences of period or time delta values, the very similar pd.period_range() and
pd.timedelta_range() functions are useful. Here are some monthly periods:

And a sequence of durations increasing by an hour:
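Both sequences can be sketched like this (start values and lengths are illustrative assumptions; the hour code is again written in lowercase for recent pandas releases):

```python
import pandas as pd

# Eight consecutive monthly periods starting in July 2015
months = pd.period_range('2015-07', periods=8, freq='M')

# Ten durations, each one hour longer than the previous
hours = pd.timedelta_range('0 days', periods=10, freq='h')
```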

3.2.4 Frequencies and Offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset. Just as we saw
the D (day) and H (hour) codes previously, we can use such codes to specify any desired frequency
spacing. The table “Listing of Pandas frequency codes” summarizes the main codes available.
The monthly, quarterly, and annual frequencies are all marked at the end of the specified period. Adding an S
suffix to any of these marks it instead at the beginning (Table 3-8).

Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter
month code as a suffix:

In the same way, you can modify the split-point of the weekly frequency by adding a three-letter
weekday code:

On top of this, codes can be combined with numbers to specify other frequencies. For example, for a
frequency of 2 hours 30 minutes, we can combine the hour (H) and minute (T) codes as follows (pandas 2.2 and
later deprecate these uppercase spellings in favor of h and min):
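A sketch of a weekday-anchored weekly code and a combined code. It uses the newer lowercase spellings 'h' and 'min', which is an assumption about the installed pandas version; the dates are illustrative:

```python
import pandas as pd

# Weekly frequency anchored on Wednesday via a three-letter weekday suffix
wednesdays = pd.date_range('2015-01-01', periods=3, freq='W-WED')

# Hour and minute codes combined with numbers: steps of 2 hours 30 minutes
steps = pd.timedelta_range('0 days', periods=4, freq='2h30min')
```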

All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the
pd.tseries.offsets module. For example, we can create a business day offset directly as follows:
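A minimal sketch of using the business day offset object directly (the start date is an illustrative assumption):

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Five consecutive business days starting on Wednesday 1 July 2015;
# the weekend of 4 and 5 July is skipped
business_days = pd.date_range('2015-07-01', periods=5, freq=BDay())
print(business_days)
```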

3.2.5 Resampling, Shifting, and Windowing

The ability to use dates and times as indices to organize and access data is an important piece of the Pandas
time series tools. The benefits of indexed data in general (automatic alignment during operations, intuitive
data slicing and access, etc.) still apply, and Pandas provides several additional time series–specific
operations.

Because Pandas was developed largely in a finance context, it includes some very specific tools for financial
data. For example, the accompanying pandas-datareader package (installable via conda install
pandas-datareader) knows how to import financial data from a number of available sources, including Yahoo
Finance, Google Finance, and others. Here we will load Google’s closing price history:

We can visualize this using the plot() method, after the normal Matplotlib setup boilerplate (Figure 3.5):

Resampling and converting frequencies

One common need for time series data is resampling at a higher or lower frequency. You can do this using the
resample() method, or the much simpler asfreq() method.

The primary difference between the two is that resample() is fundamentally a data aggregation, while
asfreq() is fundamentally a data selection.

Taking a look at the Google closing price, let’s compare what the two return when we down-sample the data.
Here we will resample the data at the end of business year (Figure 3-6):

Notice the difference: at each point, resample reports the average of the previous year, while asfreq reports
the value at the end of the year.
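The aggregation versus selection distinction can be seen on a small synthetic series. This sketch makes two assumptions: the data are made-up daily values rather than the stock prices, and a weekly frequency ('W') stands in for the business-year-end frequency so the numbers are easy to check by hand:

```python
import numpy as np
import pandas as pd

# Four weeks of synthetic daily values 0..27, starting on Monday 1 January 2024
days = pd.date_range('2024-01-01', periods=28, freq='D')
daily = pd.Series(np.arange(28, dtype=float), index=days)

# resample() aggregates: the mean over each week ending on Sunday
weekly_mean = daily.resample('W').mean()

# asfreq() selects: the single value observed on each Sunday
weekly_value = daily.asfreq('W')
```

For the first week the mean of 0 through 6 is 3, while the value on Sunday is 6.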

For up-sampling, here we resample the business day data at a daily frequency (i.e., including weekends); see
Figure 3-7:

The top panel is the default: non-business days are left as NA values and do not appear on the plot. The
bottom panel shows the differences between two strategies for filling the gaps: forward-filling and
backward-filling.
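The same three behaviours can be sketched on a few synthetic business-day values (the prices and dates are made up for illustration):

```python
import pandas as pd

# Thursday, Friday, Monday, Tuesday: the weekend is absent from the index
bdays = pd.date_range('2024-01-04', periods=4, freq='B')
prices = pd.Series([10.0, 11.0, 12.0, 13.0], index=bdays)

# Default up-sampling leaves the new weekend rows as NA
daily = prices.asfreq('D')

# Forward-fill copies Friday's value; backward-fill copies Monday's value
forward = prices.asfreq('D', method='ffill')
backward = prices.asfreq('D', method='bfill')
```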
Time-shifts
Another common time series–specific operation is shifting of data in time. Pandas has two closely related
methods for computing this: shift() and tshift(). In short, the difference between them is that shift() shifts the
data, while tshift() shifts the index. In both cases, the shift is specified in multiples of the frequency. Note that
tshift() was deprecated in pandas 1.1 and removed in pandas 2.0; in current versions the same index shift is
written as shift() with a freq argument.

Here we will both shift() and tshift() by 900 days (Figure 3-8):

We see here that shift(900) shifts the data by 900 days, pushing some of it off the end of the graph (and
leaving NA values at the other end), while tshift(900) shifts the index values by 900 days.
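A small sketch of the two kinds of shift on synthetic data. Because tshift() no longer exists in pandas 2.0 and later, the index shift is written here as shift() with a freq argument; the data and the shift of 2 days are illustrative assumptions:

```python
import pandas as pd

days = pd.date_range('2024-01-01', periods=5, freq='D')
s = pd.Series([1, 2, 3, 4, 5], index=days)

# Shift the data: the index stays put, values move down, NA fills the start
moved_data = s.shift(2)

# Shift the index: values stay attached to their rows, dates move forward
moved_index = s.shift(2, freq='D')
```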

A common context for this type of shift is computing differences over time. For example, we use shifted
values to compute the one-year return on investment for Google stock over the course of the dataset
(Figure 3-9):

This helps us to see the overall trend in Google stock: thus far, the most profitable times to invest in Google
have been (unsurprisingly, in retrospect) shortly after its IPO, and in the middle of the 2009 recession.
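The return calculation can be sketched on made-up prices. This is an illustration under stated assumptions, not the slide's code: the horizon is one row instead of one year, and the prices are synthetic:

```python
import pandas as pd

# Synthetic prices growing by 10 percent per day
days = pd.date_range('2024-01-01', periods=4, freq='D')
prices = pd.Series([100.0, 110.0, 121.0, 133.1], index=days)

# shift(-1) lines each row up with the next row's price,
# so the ratio is (future price / current price)
roi = 100 * (prices.shift(-1) / prices - 1)
print(roi)
```

The last entry is NA because there is no future price to compare against.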

Rolling windows
Rolling statistics are a third type of time series–specific operation implemented by Pandas. These can be
accomplished via the rolling() method of Series and DataFrame objects, which returns a view similar to what
we saw with the groupby operation. This rolling view makes available a number of aggregation operations by
default. For example, here is the one-year centered rolling mean and standard deviation of the Google stock
prices (Figure 3-10):
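A minimal sketch of a centered rolling window, assuming synthetic data and a window of 3 rows in place of one year:

```python
import numpy as np
import pandas as pd

days = pd.date_range('2024-01-01', periods=10, freq='D')
s = pd.Series(np.arange(10, dtype=float), index=days)

# A centered window of 3 rows; the view exposes aggregations such as mean and std
rolling = s.rolling(3, center=True)
rolling_mean = rolling.mean()
rolling_std = rolling.std()
```

The first and last rows are NA because a full centered window is not available there.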

Example: Visualizing Seattle Bicycle Counts

3.3 High-Performance Pandas: eval() and query()

Motivating query() and eval(): Compound Expressions

Computing compound expressions

When NumPy computes a compound expression, such as a Boolean mask built from two comparisons, it evaluates
each subexpression separately, so the computation is roughly equivalent to storing each comparison in a
temporary array before combining them.

In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large,
this can lead to significant memory and computational overhead. The Numexpr library gives you the ability
to compute this type of compound expression element by element, without the need to allocate full
intermediate arrays.
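The intermediate allocations can be made visible with plain NumPy. This sketch uses small random arrays and hypothetical names (`tmp1`, `tmp2`); it does not call Numexpr itself:

```python
import numpy as np

rng = np.random.RandomState(42)
x = rng.rand(1000)
y = rng.rand(1000)

# The compound expression as it is usually written
mask = (x > 0.5) & (y < 0.5)

# What NumPy effectively does: one full temporary array per subexpression
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask_stepwise = tmp1 & tmp2
```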

pandas.eval() for Efficient Operations
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For
example, consider the following DataFrames

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

We can compute the same result via pd.eval by constructing the expression as a string:

The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same
result:
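A sketch comparing the two spellings on small random DataFrames (the sizes and names are assumptions; any speed difference depends on data size and on whether the optional Numexpr package is installed, so no timing is shown):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df1 = pd.DataFrame(rng.rand(1000, 10))
df2 = pd.DataFrame(rng.rand(1000, 10))
df3 = pd.DataFrame(rng.rand(1000, 10))
df4 = pd.DataFrame(rng.rand(1000, 10))

# Typical Pandas approach
plain = df1 + df2 + df3 + df4

# The same expression passed to pd.eval() as a string
evaluated = pd.eval('df1 + df2 + df3 + df4')

same = np.allclose(plain, evaluated)
```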

Operations supported by pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following
integer DataFrames:

Arithmetic operators. pd.eval() supports all arithmetic operators. For example

Comparison operators. pd.eval() supports all comparison operators, including chained expressions:

Bitwise operators. pd.eval() supports the & and | bitwise operators:

In addition, it supports the use of the literal “and” and “or” in Boolean expressions:

Object attributes and indices. pd.eval() supports access to object attributes via the obj.attr syntax, and indexes
via the obj[index] syntax:
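Each operator family can be checked against its ordinary Pandas equivalent. This is a sketch on hypothetical integer DataFrames; the sizes, the threshold 500, and the variable names are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Values start at 1 so the division below never divides by zero
df1 = pd.DataFrame(rng.randint(1, 1000, (100, 3)))
df2 = pd.DataFrame(rng.randint(1, 1000, (100, 3)))
df3 = pd.DataFrame(rng.randint(1, 1000, (100, 3)))
df4 = pd.DataFrame(rng.randint(1, 1000, (100, 3)))
df5 = pd.DataFrame(rng.randint(1, 1000, (100, 3)))

# Arithmetic operators
arith_plain = -df1 * df2 / (df3 + df4) - df5
arith_eval = pd.eval('-df1 * df2 / (df3 + df4) - df5')

# Comparison operators, including a chained expression
comp_plain = (df1 < df2) & (df2 <= df3) & (df3 != df4)
comp_eval = pd.eval('df1 < df2 <= df3 != df4')

# Bitwise operators, and the literal and/or spelling of the same test
bool_plain = (df1 < 500) & (df2 < 500) | (df3 < df4)
bool_bitwise = pd.eval('(df1 < 500) & (df2 < 500) | (df3 < df4)')
bool_literal = pd.eval('(df1 < 500) and (df2 < 500) or (df3 < df4)')

# Object attributes and indices
attr_plain = df2.T[0] + df3.iloc[1]
attr_eval = pd.eval('df2.T[0] + df3.iloc[1]')
```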

DataFrame.eval() for Column-Wise Operations

Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar
ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array
as an example:

Using pd.eval() as above, we can compute expressions with the three columns like this:

The DataFrame.eval() method allows much more succinct evaluation of expressions with the
columns:

Notice here that we treat column names as variables within the evaluated expression, and the result is what we
would wish.
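The three spellings can be compared side by side. This sketch assumes a random DataFrame with columns 'A', 'B', and 'C'; the expression itself is an illustrative choice:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# Ordinary column arithmetic
result1 = (df['A'] + df['B']) / (df['C'] - 1)

# Top-level pd.eval(): columns are reached through the df variable
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")

# DataFrame.eval(): column names act as variables
result3 = df.eval('(A + B) / (C - 1)')
```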
Assignment in DataFrame.eval()
DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has
columns 'A', 'B', and 'C'

We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns:

In the same way, any existing column can be modified:
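Creating and then overwriting a column can be sketched on a tiny hand-made DataFrame (the values and the two formulas are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0], 'C': [2.0, 4.0]})

# Create a new column 'D' from the other columns
df.eval('D = (A + B) / C', inplace=True)
d_created = df['D'].tolist()

# Modify the existing column 'D' in the same way
df.eval('D = (A - B) / C', inplace=True)
d_modified = df['D'].tolist()
```

With inplace=True the DataFrame itself is changed; without it, eval() returns a modified copy.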

Local variables in DataFrame.eval()

The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables.
Consider the following:

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate
expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python
objects. Notice that this @ character is only supported by the DataFrame.eval() method, not by the
pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace
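A sketch of mixing a column name with a local Python variable (the DataFrame and the variable `column_mean` are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0],
                   'B': [4.0, 5.0, 6.0],
                   'C': [7.0, 8.0, 9.0]})

# A local Python variable, not a column of df
column_mean = df.mean(axis=1)

# Ordinary Pandas spelling
result1 = df['A'] + column_mean

# In DataFrame.eval(), A is a column and @column_mean is the local variable
result2 = df.eval('A + @column_mean')
```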

DataFrame.query() Method

The DataFrame has another method based on evaluated strings, called the query() method. Consider the
following:

As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the
DataFrame. It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering
operation, you can use the query() method.

In addition to being a more efficient computation, compared to the masking expression this is much easier
to read and understand. Note that the query() method also accepts the @ flag to mark local variables
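The masking expression and its query() equivalents can be sketched as follows, assuming a random DataFrame with columns 'A', 'B', and 'C' and a hypothetical local variable `Cmean`:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# Filtering with a Boolean mask
masked = df[(df.A < 0.5) & (df.B < 0.5)]

# The same filter as a query string
queried = df.query('A < 0.5 and B < 0.5')

# The @ flag marks a local Python variable inside the query
Cmean = df['C'].mean()
queried_local = df.query('A < @Cmean and B < @Cmean')
```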

Performance: When to Use These Functions

When considering whether to use these functions, there are two considerations: computation time and memory
use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving
NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays. For example, a Boolean
mask built from two column comparisons allocates a temporary array for each comparison and another for
their combination.

If the size of the temporary DataFrames is significant compared to your available system memory (typically
several gigabytes), then it’s a good idea to use an eval() or query() expression. You can check the approximate
size of your array in bytes using the nbytes attribute of the underlying values (df.values.nbytes).
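Both points can be sketched on a small random DataFrame (the size, column names, and temporary variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# The compound masking expression...
compound = df[(df.A < 0.5) & (df.B < 0.5)]

# ...is roughly equivalent to building three temporaries first
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
stepwise = df[tmp3]

# Approximate size of the data in bytes: 1000 rows x 3 columns x 8 bytes per float64
size_in_bytes = df.values.nbytes
```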

On the performance side, eval() can be faster even when you are not maxing out your system memory.
The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your
system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some
potentially slow movement of values between the different memory caches. In practice, I find that the
difference in computation time between the traditional methods and the eval/query method is usually not
significant—if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is
mainly in the saved memory, and the sometimes cleaner syntax they offer.

THANK
YOU
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet
M2 Session 2
No ratings yet
M2 Session 2
26 pages
Problems on Trains
No ratings yet
Problems on Trains
9 pages
Ada_module 5_np 1 (1)
No ratings yet
Ada_module 5_np 1 (1)
48 pages
ADA_Module 3_MN
No ratings yet
ADA_Module 3_MN
39 pages
ADA_Module 5_NP
No ratings yet
ADA_Module 5_NP
4 pages
ADA_Module 4_PM
No ratings yet
ADA_Module 4_PM
79 pages
ADA_Module 2_MN 1
No ratings yet
ADA_Module 2_MN 1
39 pages
Module 4 Visua;Ization Using Matplotlib (1)
No ratings yet
Module 4 Visua;Ization Using Matplotlib (1)
74 pages
CONTROL-M Messages Manual
No ratings yet
CONTROL-M Messages Manual
198 pages
Band in A Box 2016 Mac Manual
No ratings yet
Band in A Box 2016 Mac Manual
416 pages
Curriculum Vitae: Nitya Joyce Viswasanathan
No ratings yet
Curriculum Vitae: Nitya Joyce Viswasanathan
4 pages
The Append Hint - Direct Path Load
No ratings yet
The Append Hint - Direct Path Load
2 pages
Final SPSS Record (1)
No ratings yet
Final SPSS Record (1)
44 pages
Worthy Goals PDF
No ratings yet
Worthy Goals PDF
2 pages
Catalogue - Water Meter Ultrasonic Modbus SAITEC
No ratings yet
Catalogue - Water Meter Ultrasonic Modbus SAITEC
2 pages
VSAT700 User Manual
No ratings yet
VSAT700 User Manual
93 pages
Lab 1 Introduction To UML
No ratings yet
Lab 1 Introduction To UML
8 pages
Design Boundaries For Metro Rail Project
No ratings yet
Design Boundaries For Metro Rail Project
3 pages
Trion 8000 Multi
No ratings yet
Trion 8000 Multi
7 pages
SXT Lite 5 Ac
No ratings yet
SXT Lite 5 Ac
4 pages
AVS Security Requirements-V1 - 2
No ratings yet
AVS Security Requirements-V1 - 2
2 pages
AppForm_1A4244C7-90A9-4E
No ratings yet
AppForm_1A4244C7-90A9-4E
4 pages
Reactive Streams PDF
No ratings yet
Reactive Streams PDF
4 pages
03-Subroutines and Stacks
No ratings yet
03-Subroutines and Stacks
53 pages
Signals and Systems: 18EC45 Model Question Paper-1 With Effect From 2019-20 (CBCS Scheme)
No ratings yet
Signals and Systems: 18EC45 Model Question Paper-1 With Effect From 2019-20 (CBCS Scheme)
3 pages
TA2020
No ratings yet
TA2020
13 pages
G5 Gemini
No ratings yet
G5 Gemini
1 page
Yap Lai Wah
No ratings yet
Yap Lai Wah
2 pages
System - Windows.Forms Namespace: Class Category Details
No ratings yet
System - Windows.Forms Namespace: Class Category Details
42 pages
STD 4 Computer Question Bank
No ratings yet
STD 4 Computer Question Bank
35 pages
30 Hrs Deep Learning CV Images Video
No ratings yet
30 Hrs Deep Learning CV Images Video
6 pages
assignment -1 with answer
No ratings yet
assignment -1 with answer
17 pages
Thesis Attendance Monitoring System
100% (2)
Thesis Attendance Monitoring System
6 pages
Found 465999304 2273280
No ratings yet
Found 465999304 2273280
55 pages
D SP Wiki German Course.14.Level I.section D.lesson 11.privileg Und Verantwortung
No ratings yet
D SP Wiki German Course.14.Level I.section D.lesson 11.privileg Und Verantwortung
4 pages
Linux Assignment Questions
No ratings yet
Linux Assignment Questions
6 pages

Exploratory Data Analysis - BDS613B

• Vectorized String Operations

• Introducing Pandas String Operations

Introducing Pandas String Operations

Vectorized String Operations

We can now call a single method that will capitalize all
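As a minimal sketch of the idea on this slide, the snippet below uses the `.str` accessor of a pandas Series to capitalize every entry with one call. The sample names are an assumption made for illustration; the point is that the method is applied element-wise and leaves missing values untouched instead of raising an error.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing entry.
names = pd.Series(["peter", "Paul", None, "MARY", "gUIDO"])

# One vectorized call capitalizes each string and skips the missing value.
capitalized = names.str.capitalize()
print(capitalized)
```

A plain list comprehension such as `[s.capitalize() for s in data]` would fail on the `None` entry, which is the main practical reason to prefer the `.str` methods.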

Slides 11 to 14 are not part of the syllabus.

3.2 Working with Time Series

Native Python dates and times: datetime and dateutil
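A short sketch of the two native tools named in this heading may help. It assumes the third-party `dateutil` package is installed (it ships as a pandas dependency); the date itself is an arbitrary example.

```python
from datetime import datetime
from dateutil import parser

# Build a date explicitly with the standard library.
date = datetime(year=2015, month=7, day=4)

# Let dateutil parse the same date from a free-form string.
parsed = parser.parse("4th of July, 2015")

# Format the day of the week; the name returned depends on the locale.
weekday_name = date.strftime("%A")
print(date, parsed, weekday_name)
```

These objects are flexible and easy to read, but they are ordinary Python objects, so large collections of them are slow to process compared with typed arrays.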

Typed arrays of times: NumPy’s datetime64

datetime64 imposes a
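The following sketch illustrates typed date arrays in NumPy. Because `datetime64` stores each value as a 64-bit integer in a fixed unit, arithmetic on whole arrays is vectorized, and the unit chosen (day, minute, nanosecond, and so on) determines how fine the resolution is and how wide a time span can be represented. The specific dates are assumptions chosen for illustration.

```python
import numpy as np

# A single date stored with day resolution.
date = np.array("2015-07-04", dtype=np.datetime64)

# Vectorized arithmetic: add 0..11 days in one operation.
dates = date + np.arange(12)
print(dates)

# The unit is inferred from the input string; here it is minutes.
minute_stamp = np.datetime64("2015-07-04 12:00")
print(minute_stamp.dtype)
```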

And a sequence of durations increasing by an hour:
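A minimal sketch of such a sequence uses `pd.timedelta_range`. The choice of ten periods is an assumption for illustration. Recent pandas releases prefer the lowercase alias `"h"` for hours, while older course material often writes `"H"`.

```python
import pandas as pd

# Ten durations starting at zero and increasing by one hour each.
durations = pd.timedelta_range(0, periods=10, freq="h")
print(durations)
```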

3.2.4 Frequencies and Offsets

Resampling and converting frequencies
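To make the distinction between the two approaches concrete, here is a small sketch on made-up data (a 14-day series starting on Monday, 1 January 2024, which is an assumption for illustration). `resample()` aggregates all values that fall in each new period, whereas `asfreq()` simply selects the value found at each new timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with values 0.0 .. 13.0.
idx = pd.date_range("2024-01-01", periods=14, freq="D")
ts = pd.Series(np.arange(14, dtype=float), index=idx)

# Aggregation: mean of every value in each week (weeks end on Sunday).
weekly_mean = ts.resample("W").mean()

# Selection: only the value recorded on each week-ending Sunday.
weekly_point = ts.asfreq("W")

print(weekly_mean)
print(weekly_point)
```

Both results share the same weekly index, but the first summarizes each week and the second keeps a single observation per week.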

Notice the difference:

The top panel is the default: non-business days are left

Here we will both shift() and tshift() by 900 days (Figure

We see here that shift(900)
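The difference between shifting the data and shifting the index can be sketched on a tiny made-up series (the dates and the shift of 2 days are assumptions for illustration). Note that `tshift()` was deprecated and, to the best of my knowledge, removed in pandas 2.0; passing `freq` to `shift()` is the current way to obtain the same index-shifting behaviour.

```python
import pandas as pd

# Hypothetical five-day series.
idx = pd.date_range("2024-01-01", periods=5, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# shift(n): the values move by n rows, the index stays fixed,
# so the first n entries become NaN and the last n values drop off.
shifted_data = ts.shift(2)

# shift(n, freq=...): the index moves by n periods, the values are unchanged.
shifted_index = ts.shift(2, freq="D")

print(shifted_data)
print(shifted_index)
```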

This helps us to see the overall trend in

Example: Visualizing Seattle Bicycle Counts

Motivating query() and eval(): Compound Expressions
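As a rough illustration of the motivation, the sketch below filters a DataFrame with a compound condition in two ways. The column names and random data are assumptions. The masking form builds each intermediate Boolean array explicitly, while `query()` takes the whole expression as a string and can evaluate it without the user spelling out those intermediates.

```python
import numpy as np
import pandas as pd

# Hypothetical data with three numeric columns.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Standard masking: each comparison allocates a temporary Boolean Series.
masked = df[(df["A"] < 0.5) & (df["B"] < 0.5)]

# Same filter written as a single string expression.
queried = df.query("A < 0.5 and B < 0.5")

print(masked.equals(queried))
```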

Arithmetic operators. pd.eval() supports all arithmetic operators. For example
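A minimal sketch of an arithmetic expression evaluated both ways is given here. The five random integer DataFrames are assumptions for illustration; values start at 1 so that the denominator is never zero.

```python
import numpy as np
import pandas as pd

# Five hypothetical DataFrames of positive integers.
rng = np.random.RandomState(42)
df1, df2, df3, df4, df5 = (
    pd.DataFrame(rng.randint(1, 1000, (100, 3))) for _ in range(5)
)

# Ordinary pandas arithmetic.
result1 = -df1 * df2 / (df3 + df4) - df5

# The same expression as a string; names are looked up in the calling scope.
result2 = pd.eval("-df1 * df2 / (df3 + df4) - df5")

arith_match = np.allclose(result1, result2)
print(arith_match)
```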

Bitwise operators. pd.eval() supports the & and | bitwise operators:
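The bitwise form can be sketched as follows, again on random DataFrames that are assumed purely for illustration. Parentheses around each comparison are needed because `&` and `|` bind more tightly than `<`.

```python
import numpy as np
import pandas as pd

# Four hypothetical DataFrames of floats in [0, 1).
rng = np.random.RandomState(1)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(100, 3)) for _ in range(4))

# Element-wise Boolean logic with pandas operators.
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)

# The same logic evaluated from a string.
result2 = pd.eval("(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)")

bitwise_match = result1.equals(result2)
print(bitwise_match)
```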

Operations supported by pd.eval()

In addition, it supports the use of the literal and or in Boolean expressions:
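The sketch below shows the word forms `and` and `or` inside a `pd.eval()` string, compared against the operator form in regular pandas code. The DataFrames are random data assumed for illustration. In ordinary Python, `and`/`or` would raise an error on DataFrames, so this spelling is only available inside the evaluated string.

```python
import numpy as np
import pandas as pd

# Four hypothetical DataFrames of floats in [0, 1).
rng = np.random.RandomState(2)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(100, 3)) for _ in range(4))

# Reference result using the bitwise operators.
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)

# Inside the string, 'and' and 'or' are treated like '&' and '|'.
result3 = pd.eval("(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)")

literal_match = result1.equals(result3)
print(literal_match)
```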

DataFrame.eval() for Column-Wise Operations
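A short sketch of the column-wise form follows; the column names `A`, `B`, `C` and the random data are assumptions for illustration. The benefit of `DataFrame.eval()` over the top-level `pd.eval()` is that columns can be referred to directly by name inside the expression.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame; values lie in [0, 1), so C - 1 is never zero.
rng = np.random.RandomState(3)
df = pd.DataFrame(rng.rand(1000, 3), columns=["A", "B", "C"])

# Plain pandas arithmetic on columns.
result1 = (df["A"] + df["B"]) / (df["C"] - 1)

# Top-level pd.eval needs the DataFrame name in the expression.
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")

# DataFrame.eval lets the expression use bare column names.
result3 = df.eval("(A + B) / (C - 1)")

column_match = np.allclose(result1, result2) and np.allclose(result1, result3)
print(column_match)
```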
