Skip to content

Add interpolation options to rolling quantile #20497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

kornilova203
Copy link
Contributor

@kornilova203 kornilova203 commented Mar 27, 2018

It version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

Regular (not rolling) quantile supports these interpolation options: linear, lower, higher, nearest and midpoint.
This commit adds the same options to moving quantile.

Performance issues of this commit (note: I re-run benchmarks, see message below)
This code has 15% worse performance on benchmarks with small values of window (window=10). This is because loop inside roll_quantile now contains switch.
I tried to replace switch with callback but it led to even worse performance. Even if I move some of the code to new function (without any change in logic) it still makes performance much worse.
How bad is it? Could you please give me an advice on how to arrange the code such that it has the same performance?

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@kornilova203 kornilova203 changed the title Add interpolation options to moving quantile Add interpolation options to rolling quantile Mar 27, 2018
@jreback
Copy link
Contributor

jreback commented Mar 27, 2018

what issue are you addressing here? if its about performance, pls show representative benchmarks (or run the asv)

it version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

if you read this: https://github.com/pandas-dev/pandas/pull/16247/files not sure what to say here, it had different defaults.

@jreback
Copy link
Contributor

jreback commented Mar 27, 2018

re-reading your changes I see that you are trying to fix perf. can you run the asv's here and show the results. (you may need to add one if the case you are addressing is not covered).

@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 27, 2018
@kornilova203
Copy link
Contributor Author

@jreback, hi,

There is no issue related to this pull request. Do I need to create an issue first?

I am aware about the BUG that you referenced. Company at which I work uses rolling quantile without interpolation, when we updated pandas version, we discovered that behavior of rolling quantile has changed and there is no way to get non-interpolated results.

I re-ran asv benchmarks for rolling quantile (default version uses linear interpolation) and it did not show significant changes, sorry for confusing you.

BENCHMARKS NOT SIGNIFICANTLY CHANGED

I added new benchmarks for different interpolation options. I cannot compare them with previous results because interpolation options were not supported (supporting them is the purpose of this commit). Here are new benchmarks:

[ 20.00%] ··· Running rolling.Quantile.time_quantile             1.37±0.2ms;...
[ 40.00%] ··· Running rolling.Quantile.time_quantile_higher      1.38±0.2ms;...
[ 60.00%] ··· Running rolling.Quantile.time_quantile_lower       1.37±0.2ms;...
[ 80.00%] ··· Running rolling.Quantile.time_quantile_midpoint    1.38±0.2ms;...
[100.00%] ··· Running rolling.Quantile.time_quantile_nearest     1.37±0.2ms;...

I am not sure if I should add this new benchmarks to the commit. Should all the code be covered by benchmarks?

@TomAugspurger
Copy link
Contributor

@codecov
Copy link

codecov bot commented Mar 28, 2018

Codecov Report

Merging #20497 into master will increase coverage by 0.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20497      +/-   ##
==========================================
+ Coverage   91.77%   91.83%   +0.06%     
==========================================
  Files         153      153              
  Lines       49263    49320      +57     
==========================================
+ Hits        45213    45295      +82     
+ Misses       4050     4025      -25
Flag Coverage Δ
#multiple 90.23% <100%> (+0.06%) ⬆️
#single 41.87% <60%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 93.9% <ø> (ø) ⬆️
pandas/core/frame.py 97.17% <ø> (ø) ⬆️
pandas/core/window.py 96.3% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.16% <0%> (-0.02%) ⬇️
pandas/core/strings.py 98.32% <0%> (-0.02%) ⬇️
pandas/api/extensions/__init__.py 100% <0%> (ø) ⬆️
pandas/core/arrays/base.py 84.14% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.63% <0%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.72% <0%> (+0.01%) ⬆️
pandas/core/common.py 92.04% <0%> (+0.02%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ae19a1...9cc7a71. Read the comment docs.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2018

@kornilova-l

There is no issue related to this pull request. Do I need to create an issue first?

Generally create an issue first. Since you already created a PR no need then.

Ok so perf is basically unchanged on small series. If you want to add an asv for that is ok.

@jreback jreback removed the Performance Memory or execution speed performance label Mar 28, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good. comments and please add a line in the Other Enhancements section.

@@ -1357,25 +1357,53 @@ cdef _roll_min_max(ndarray[numeric] input, int64_t win, int64_t minp,
return output


def _get_interpolation_id(str interpolation):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

look in algos.pxd for how we handle cdef enum TiebreakEnumType:

e.g you create an enum and a dict to map to that enum. This becomes a bit simpler then.

idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)

if interpolation_id == 0: # linear
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you would use the enums here


if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

# interpolation_id is needed to avoid string comparisons inside the loop
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't need the comments here

* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest.
* midpoint: (`i` + `j`) / 2.""")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a Returns, Examples and a See Also section (look at the Series.quantile doc-string for inspiration). Also add a reference from the Series.quantile doc-string to here

@@ -1135,7 +1135,22 @@ def test_rolling_quantile_series(self):
s = Series(arr)
q1 = s.quantile(0.1)
q2 = s.rolling(100).quantile(0.1).iloc[-1]
tm.assert_almost_equal(q1, q2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to a separate test and parameterize on the interpolation (inlcude linear as well)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are all coming to the same result value, is that right?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note (other enhancements section)

2 2.0
3 3.0
dtype: float64
>>> s.rolling(2).quantile(.4, interpolation='midpoint')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a blank line here

@jreback
Copy link
Contributor

jreback commented Mar 30, 2018

linting issue:

pandas/core/window.py:1289:1: W293 blank line contains whitespace

you can run
LINT=TRUE ci/lint.sh to see

@jreback
Copy link
Contributor

jreback commented Mar 30, 2018

@TomAugspurger comments?

@jreback
Copy link
Contributor

jreback commented Mar 30, 2018

@WillAyd if you'd have a look

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple comments on my end - is it also possible to use nogil? May help with some of the performance regression?


if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

try:
interpolation_type = interpolation_types[interpolation]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this raise a KeyError not a ValueError?

output[i] = ((vlow + (vhigh - vlow) *
(quantile * (nobs - 1) - idx)))
idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given everything else here is done with C syntax, can we use that casting convention here?

@@ -1138,6 +1138,20 @@ def test_rolling_quantile_series(self):

tm.assert_almost_equal(q1, q2)

@pytest.mark.parametrize('quantile', [0.0, 0.1, 0.45, 0.5, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add examples with NA data?

@kornilova203
Copy link
Contributor Author

@WillAyd I tried to replace range with prange but it does not compile because calling skiplist.get requires gil: Calling gil-requiring function not allowed without gil did I miss something?

@jreback
Copy link
Contributor

jreback commented Apr 1, 2018

ok we don't / can't use prange anywhere in the codebase. only you can use nogil

@kornilova203
Copy link
Contributor Author

@jreback but still using nogil is not possible because there are gil-requiring function calls

@WillAyd
Copy link
Member

WillAyd commented Apr 1, 2018

The only gil-requiring function call at the time of that comment was int which is why I asked if it was possible to use the C-style cast instead. I think if you did those in combination it would work. I see that you since added the round built-in so to get nogil to work you'd have to use a C-style version of that as well.

I did something similar in https://github.com/pandas-dev/pandas/pull/20405/files which you may be able to reference

# Tests that rolling window's quantile behavior is analogous to
# Series' quantile for each interpolation option
size = 100
s = Series(np.random.rand(size))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd advise against using random data in a test case as it could cause intermittent failures and make bugs harder to find. To inject missing data, could you not just parametrize the function with a second array of say [0., np.nan, 0.2, np.nan, 0.4] to match what you had earlier?


if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

try:
interpolation_type = interpolation_types[interpolation]
except KeyError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test case to cover that this raises the expected error message when passing an invalid argument? If not can you add?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

q2 = s.rolling(size, min_periods=1).quantile(
quantile, interpolation).iloc[-1]

tm.assert_almost_equal(q1, q2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is almost equal required here or is it possible to use input data so that float precision would not be an issue? I'm trying to be extra cautious that almost_equaldoesn't allow bugs to silently pass through some interpolation options

@kornilova203
Copy link
Contributor Author

About nogil: I changed int() to <int> and also removed round() function, but after adding nogil compiler still complains about skiplist methods (skiplist.insert, skiplist.remove, skiplist.get).

One of compiler messages:

Error compiling Cython file:
------------------------------------------------------------
...
                        elif idx_with_fraction - idx < 0.5:
                            output[i] = skiplist.get(idx)
                        else:
                            output[i] = skiplist.get(idx + 1)
                    elif interpolation_type == MIDPOINT:
                        vlow = skiplist.get(idx)
                                          ^
------------------------------------------------------------

pandas/_libs/window.pyx:1474:43: Calling gil-requiring function not allowed without gil

else:
output[i] = skiplist.get(idx + 1)
elif interpolation_type == MIDPOINT:
vlow = skiplist.get(idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make this work w/o the gil, you need to call skiplist_get(skiplist, idx) (there are examples in the file). skiplist.get invokes a python function call and is not allowed

@kornilova203
Copy link
Contributor Author

Finally code works with nogil

There is another problem:
Some tests in python 2.7 environment fail. All failed tests use midpoint interpolation. I am almost sure that it is related to numpy/numpy#7163
Tests compare results of quantile and rolling_quantile. quantile uses np.percentile which used to have a bug in 'midpoint' interpolation.

What should I do with failed tests? Is it okay to test midpoint separately and not compare it to quantile?

@jreback
Copy link
Contributor

jreback commented Apr 13, 2018

just skip if < numpy 1.12 on those tests

@@ -6,6 +6,7 @@
from datetime import datetime, timedelta
from numpy.random import randn
import numpy as np
from scipy._lib._version import NumpyVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use from pandas import _np_version_under1p12

q2 = s.rolling(len(data), min_periods=1).quantile(
quantile, interpolation).iloc[-1]

if np.isnan(q1):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is tm.assert_series_equal not an option here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd, it is not series type. I edited test data a bit and was able to get rid of round

[np.nan, np.nan, np.nan, np.nan],
[np.nan, 0.1, np.nan, 0.3, 0.4, 0.5],
[0.5], [np.nan, 0.7, 0.5]])
def test_rolling_quantile_interpolation_options(self, quantile,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this and the above test cover practically the same use case. If that's the case I'd get rid of the test above


if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

try:
interpolation_type = interpolation_types[interpolation]
except KeyError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple minor edits otherwise lgtm. Can you post updated ASVs?

* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest. Implementation uses
round() built-in.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove this comment about the round() built-in

-------
Series or DataFrame
Returned object type is determined by the caller of the %(name)s
calculation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor but description should end with a period


See Also
--------
pandas.Series.quantile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just add some quick descriptions after these (separated by colon). Can reference docstring guide:

https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-5-see-also

s = Series(data)

q1 = s.quantile(quantile, interpolation)
q2 = s.rolling(len(data), min_periods=1).quantile(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is just expanding instead of rolling so should use the former to stay idiomatic

arr = np.random.random(N).astype(dtype)
self.roll = getattr(pd, constructor)(arr).rolling(window)

def time_quantile(self, constructor, window, dtype, percentile):
self.roll.quantile(percentile)

def time_quantile_nearest(self, constructor, window, dtype, percentile):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha when I asked for updated ASVs I was just asking for a refresh of what you posted earlier in the thread but I suppose this is better. Is it possible to just add interpolation as an argument to setup? Would be more concise.

Here's an example from GroupBy that you could use for reference, though the existing Quantile class could clue you in on how to do this as well:

params = [['int', 'float', 'object', 'datetime'],

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also post results to a comment when done. I'm assuming that these will fail on master but would still be good to know how they baseline against the current implementation that you posted earlier in thread

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase and address small comments by @WillAyd, otherwise lgtm.

@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche do these refs work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback, how should they work? I built documentation and all references in the whatsnew.html are rendered as plain bold text.

@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add (this) issue number here

@kornilova203
Copy link
Contributor Author

In ASV benchmarks I added interpolation as a parameter to setup. Performance is the same compared to results that I posted earlier

[100.00%] ··· Running rolling.Quantile.time_quantile                 1.38±0.2ms;...

I will rebase branch

@kornilova203
Copy link
Contributor Author

Now there are lots of commits from master :( hope I did rebasing correctly

@WillAyd
Copy link
Member

WillAyd commented Apr 23, 2018

Shouldn't be including the commits of other contributors - how did you rebase this? Need to figure out how to undo that and only replay your commits on top of master

@kornilova203 kornilova203 force-pushed the add-interpolation-options-to-rolling-quantile branch 2 times, most recently from 277ab9f to 9212c9f Compare April 24, 2018 07:21
@kornilova203 kornilova203 force-pushed the add-interpolation-options-to-rolling-quantile branch from 9212c9f to 3a2e431 Compare April 24, 2018 07:26
@kornilova203
Copy link
Contributor Author

Finally I did rebasing correctly (I hope)
Should I squash these 13 commits into one?

@jreback jreback added this to the 0.23.0 milestone Apr 24, 2018
@jreback jreback merged commit ce8f6e8 into pandas-dev:master Apr 24, 2018
@jreback
Copy link
Contributor

jreback commented Apr 24, 2018

thanks @kornilova-l nice patch!

generally you don't need to squash, you can do it if you want to make things more readable / easier if you want. but not necessary to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants