Add interpolation options to rolling quantile #20497
what issue are you addressing here? if it's about performance, pls show representative benchmarks (or run the asv)
if you read this: https://github.com/pandas-dev/pandas/pull/16247/files not sure what to say here, it had different defaults.
re-reading your changes I see that you are trying to fix perf. can you run the asv's here and show the results. (you may need to add one if the case you are addressing is not covered).
@jreback, hi. There is no issue related to this pull request. Do I need to create an issue first? I am aware of the bug that you referenced. The company I work at uses rolling quantile without interpolation; when we updated the pandas version, we discovered that the behavior of rolling quantile had changed and that there is no way to get non-interpolated results. I re-ran the asv benchmarks for rolling quantile (the default version uses linear interpolation) and they did not show significant changes; sorry for confusing you.
I added new benchmarks for the different interpolation options. I cannot compare them with previous results because interpolation options were not supported (supporting them is the purpose of this commit). Here are new benchmarks:
I am not sure if I should add these new benchmarks to the commit. Should all the code be covered by benchmarks?
Codecov Report
@@            Coverage Diff             @@
##           master   #20497      +/-   ##
==========================================
+ Coverage   91.77%   91.83%   +0.06%
==========================================
  Files         153      153
  Lines       49263    49320      +57
==========================================
+ Hits        45213    45295      +82
+ Misses       4050     4025      -25
Generally create an issue first. Since you already created a PR, no need then. Ok so perf is basically unchanged on small series. If you want to add an asv for that, it's ok.
looking good. comments and please add a line in the Other Enhancements section.
pandas/_libs/window.pyx
Outdated
@@ -1357,25 +1357,53 @@ cdef _roll_min_max(ndarray[numeric] input, int64_t win, int64_t minp,
return output


def _get_interpolation_id(str interpolation):
look in algos.pxd for how we handle cdef enum TiebreakEnumType: e.g. you create an enum and a dict to map to that enum. This becomes a bit simpler then.
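The enum-plus-dict pattern suggested here can be sketched in plain Python. This is only an illustration of the idea, not the Cython code from the PR: the class name `InterpolationType` and the dict name `INTERPOLATION_TYPES` are assumptions made for this example, and the real change would use a `cdef enum`.

```python
from enum import IntEnum


class InterpolationType(IntEnum):
    # Hypothetical names for illustration; the PR defines its own enum in Cython.
    LINEAR = 0
    LOWER = 1
    HIGHER = 2
    NEAREST = 3
    MIDPOINT = 4


# One lookup table from the user-facing string to the enum member.
INTERPOLATION_TYPES = {
    "linear": InterpolationType.LINEAR,
    "lower": InterpolationType.LOWER,
    "higher": InterpolationType.HIGHER,
    "nearest": InterpolationType.NEAREST,
    "midpoint": InterpolationType.MIDPOINT,
}

# Resolve the string once, before any loop, then compare enum members only.
chosen = INTERPOLATION_TYPES["midpoint"]
is_midpoint = chosen == InterpolationType.MIDPOINT
```

Resolving the string once keeps string comparisons out of the inner loop, and the named members replace magic numbers such as `== 0  # linear`.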
pandas/_libs/window.pyx
Outdated
idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)

if interpolation_id == 0:  # linear
so you would use the enums here
pandas/_libs/window.pyx
Outdated
if quantile <= 0.0 or quantile >= 1.0:
    raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

# interpolation_id is needed to avoid string comparisons inside the loop
don't need the comments here
pandas/core/window.py
Outdated
* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest.
* midpoint: (`i` + `j`) / 2.""")
can you add a Returns, Examples and a See Also section (look at the Series.quantile doc-string for inspiration). Also add a reference from the Series.quantile doc-string to here
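For readers following the docstring, the five options can be sketched for a single window in pure Python. This is a minimal sketch, not the skiplist-based Cython implementation in the PR: the function name `window_quantile` is hypothetical, NaN values are assumed to be dropped beforehand, and `nearest` is assumed to follow the rounding of Python's built-in `round()`.

```python
import math


def window_quantile(window, q, interpolation="linear"):
    """Quantile of one window of numbers (hypothetical helper, NaN-free input)."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("quantile value {0} not in [0, 1]".format(q))
    values = sorted(window)
    if not values:
        return math.nan
    idx_with_fraction = q * (len(values) - 1)
    i = int(idx_with_fraction)            # index of the lower neighbour
    fraction = idx_with_fraction - i
    if fraction == 0:
        return values[i]                  # the quantile falls exactly on a data point
    vlow, vhigh = values[i], values[i + 1]
    if interpolation == "linear":
        return vlow + (vhigh - vlow) * fraction
    if interpolation == "lower":
        return vlow
    if interpolation == "higher":
        return vhigh
    if interpolation == "nearest":
        return values[round(idx_with_fraction)]
    if interpolation == "midpoint":
        return (vlow + vhigh) / 2
    raise ValueError("unknown interpolation: {0!r}".format(interpolation))


results = {
    name: window_quantile([1.0, 2.0, 3.0, 4.0], 0.4, name)
    for name in ("linear", "lower", "higher", "nearest", "midpoint")
}
```

With the window `[1.0, 2.0, 3.0, 4.0]` and `q=0.4` the fractional index is about 1.2, so `lower` and `nearest` pick 2.0, `higher` picks 3.0, `midpoint` gives 2.5 and `linear` gives roughly 2.2.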
pandas/tests/test_window.py
Outdated
@@ -1135,7 +1135,22 @@ def test_rolling_quantile_series(self):
s = Series(arr)
q1 = s.quantile(0.1)
q2 = s.rolling(100).quantile(0.1).iloc[-1]
tm.assert_almost_equal(q1, q2)
can you move to a separate test and parameterize on the interpolation (include linear as well)
these are all coming to the same result value, is that right?
can you add a whatsnew note (other enhancements section)
pandas/core/window.py
Outdated
2 2.0
3 3.0
dtype: float64
>>> s.rolling(2).quantile(.4, interpolation='midpoint')
can you add a blank line here
linting issue:
you can run
@TomAugspurger comments?
@WillAyd if you'd have a look
Couple comments on my end - is it also possible to use nogil? May help with some of the performance regression?
if quantile <= 0.0 or quantile >= 1.0:
    raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

try:
    interpolation_type = interpolation_types[interpolation]
Wouldn't this raise a KeyError not a ValueError?
pandas/_libs/window.pyx
Outdated
output[i] = ((vlow + (vhigh - vlow) *
             (quantile * (nobs - 1) - idx)))
idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)
Given everything else here is done with C syntax, can we use that casting convention here?
pandas/tests/test_window.py
Outdated
@@ -1138,6 +1138,20 @@ def test_rolling_quantile_series(self):

tm.assert_almost_equal(q1, q2)

@pytest.mark.parametrize('quantile', [0.0, 0.1, 0.45, 0.5, 1])
Can we add examples with NA data?
@WillAyd I tried to replace
ok we don't / can't use
@jreback but still using
The only gil-requiring function call at the time of that comment was I did something similar in https://github.com/pandas-dev/pandas/pull/20405/files which you may be able to reference
pandas/tests/test_window.py
Outdated
# Tests that rolling window's quantile behavior is analogous to
# Series' quantile for each interpolation option
size = 100
s = Series(np.random.rand(size))
I'd advise against using random data in a test case as it could cause intermittent failures and make bugs harder to find. To inject missing data, could you not just parametrize the function with a second array of say [0., np.nan, 0.2, np.nan, 0.4] to match what you had earlier?
if quantile <= 0.0 or quantile >= 1.0:
    raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

try:
    interpolation_type = interpolation_types[interpolation]
except KeyError:
Is there a test case to cover that this raises the expected error message when passing an invalid argument? If not can you add?
Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)
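The two review points above (catch `KeyError` from the dict lookup, and quote the offending value in the message) can be sketched together. This is a plain-Python illustration under assumptions: the helper name `get_interpolation_type`, the integer codes and the exact message wording are made up for the example and are not taken from the PR.

```python
# Hypothetical lookup table; the integer codes stand in for enum members.
INTERPOLATION_TYPES = {
    "linear": 0,
    "lower": 1,
    "higher": 2,
    "nearest": 3,
    "midpoint": 4,
}


def get_interpolation_type(interpolation):
    """Translate the keyword into its code, or raise a readable ValueError."""
    try:
        return INTERPOLATION_TYPES[interpolation]
    except KeyError:
        # A failed dict lookup raises KeyError, so that is what must be caught.
        # Single quotes make the bad value stand out from the rest of the text.
        raise ValueError(
            "Interpolation '{0}' is not supported".format(interpolation)
        ) from None


try:
    get_interpolation_type("cubic")
except ValueError as err:
    message = str(err)
```

A test for the invalid-argument case can then match on the quoted name, which is why the message change and the test update go together.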
pandas/tests/test_window.py
Outdated
q2 = s.rolling(size, min_periods=1).quantile(
    quantile, interpolation).iloc[-1]

tm.assert_almost_equal(q1, q2)
Is almost equal required here or is it possible to use input data so that float precision would not be an issue? I'm trying to be extra cautious that almost_equal doesn't allow bugs to silently pass through some interpolation options
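A small illustration of the concern, using only the standard library: decimal fractions such as 0.1 are not exactly representable in binary floating point, while values built from powers of two are, so test data chosen from the latter can be compared with exact equality. The helper `linear_interpolate` is hypothetical and only mirrors the `vlow + (vhigh - vlow) * fraction` expression from the diff.

```python
import math


def linear_interpolate(vlow, vhigh, fraction):
    # Same shape as the linear branch in the diff context.
    return vlow + (vhigh - vlow) * fraction


# Decimal inputs: the sum carries a rounding error, so only an
# approximate comparison succeeds.
inexact = 0.1 + 0.2
needs_tolerance = inexact != 0.3 and math.isclose(inexact, 0.3)

# Dyadic inputs (0.25, 0.75, 0.5): every intermediate value is exact,
# so a strict equality check is safe.
exact = linear_interpolate(0.25, 0.75, 0.5)
strictly_equal = exact == 0.5
```

With exactly representable inputs a strict comparison would catch an interpolation branch that is off by even one unit in the last place, which an approximate comparison might let through.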
About One of compiler messages:
pandas/_libs/window.pyx
Outdated
else:
    output[i] = skiplist.get(idx + 1)
elif interpolation_type == MIDPOINT:
    vlow = skiplist.get(idx)
to make this work w/o the gil, you need to call skiplist_get(skiplist, idx) (there are examples in the file). skiplist.get invokes a python function call and is not allowed
Finally code works with There is another problem: What should I do with failed tests? Is it okay to test
just skip if < numpy 1.12 on those tests
pandas/tests/test_window.py
Outdated
@@ -6,6 +6,7 @@
from datetime import datetime, timedelta
from numpy.random import randn
import numpy as np
from scipy._lib._version import NumpyVersion
use from pandas import _np_version_under1p12
pandas/tests/test_window.py
Outdated
q2 = s.rolling(len(data), min_periods=1).quantile(
    quantile, interpolation).iloc[-1]

if np.isnan(q1):
Is tm.assert_series_equal not an option here?
@WillAyd, it is not series type. I edited test data a bit and was able to get rid of round
pandas/tests/test_window.py
Outdated
[np.nan, np.nan, np.nan, np.nan],
[np.nan, 0.1, np.nan, 0.3, 0.4, 0.5],
[0.5], [np.nan, 0.7, 0.5]])
def test_rolling_quantile_interpolation_options(self, quantile,
I think this and the above test cover practically the same use case. If that's the case I'd get rid of the test above
Couple minor edits otherwise lgtm. Can you post updated ASVs?
pandas/core/window.py
Outdated
* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest. Implementation uses
  round() built-in.
Can remove this comment about the round() built-in
pandas/core/window.py
Outdated
-------
Series or DataFrame
    Returned object type is determined by the caller of the %(name)s
    calculation
Very minor but description should end with a period
pandas/core/window.py
Outdated
See Also
--------
pandas.Series.quantile
Just add some quick descriptions after these (separated by colon). Can reference docstring guide:
pandas/tests/test_window.py
Outdated
s = Series(data)

q1 = s.quantile(quantile, interpolation)
q2 = s.rolling(len(data), min_periods=1).quantile(
I suppose this is just expanding instead of rolling so should use the former to stay idiomatic
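The equivalence behind this remark can be checked with plain lists: a rolling window as long as the data with `min_periods=1` sees exactly the prefixes an expanding window sees. The two helpers below are hypothetical stand-ins written for this illustration, not pandas code, and they ignore NaN handling (pandas counts only non-NaN observations towards `min_periods`).

```python
def rolling_windows(values, window, min_periods):
    """Slices seen at each step of a fixed-size trailing window."""
    result = []
    for end in range(1, len(values) + 1):
        chunk = values[max(0, end - window):end]
        # A step with too few observations yields no result (None here).
        result.append(chunk if len(chunk) >= min_periods else None)
    return result


def expanding_windows(values, min_periods=1):
    """Slices seen at each step of a window that grows from the start."""
    result = []
    for end in range(1, len(values) + 1):
        chunk = values[:end]
        result.append(chunk if len(chunk) >= min_periods else None)
    return result


data = [0.0, 0.1, 0.2, 0.3]
same = rolling_windows(data, len(data), 1) == expanding_windows(data)
```

Since both produce the same windows, the expanding form states the intent of the test more directly.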
asv_bench/benchmarks/rolling.py
Outdated
arr = np.random.random(N).astype(dtype)
self.roll = getattr(pd, constructor)(arr).rolling(window)

def time_quantile(self, constructor, window, dtype, percentile):
    self.roll.quantile(percentile)

def time_quantile_nearest(self, constructor, window, dtype, percentile):
Ha when I asked for updated ASVs I was just asking for a refresh of what you posted earlier in the thread but I suppose this is better. Is it possible to just add interpolation as an argument to setup? Would be more concise.
Here's an example from GroupBy that you could use for reference, though the existing Quantile class could clue you in on how to do this as well:
pandas/asv_bench/benchmarks/groupby.py
Line 342 in 78fee04
params = [['int', 'float', 'object', 'datetime'],
Please also post results to a comment when done. I'm assuming that these will fail on master but would still be good to know how they baseline against the current implementation that you posted earlier in thread
can you rebase and address small comments by @WillAyd, otherwise lgtm.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword
@jorisvandenbossche do these refs work?
@jreback, how should they work? I built documentation and all references in the whatsnew.html are rendered as plain bold text.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword
can you add (this) issue number here
In ASV benchmarks I added
I will rebase the branch
Now there are lots of commits from master :( hope I did the rebasing correctly
Shouldn't be including the commits of other contributors - how did you rebase this? Need to figure out how to undo that and only replay your commits on top of master
277ab9f to 9212c9f
Add whatsnew Fix bug: catch KeyError not ValueError Add return type to roll_quantile
9212c9f to 3a2e431
Finally I did rebasing correctly (I hope)
thanks @kornilova-l nice patch! generally you don't need to squash, you can do it if you want to make things more readable / easier if you want. but not necessary to merge.
In version 0.21.0 rolling quantile started to use linear interpolation, which broke backward compatibility.
Regular (not rolling) quantile supports these interpolation options: linear, lower, higher, nearest and midpoint. This commit adds the same options to moving quantile.
Performance issues of this commit (note: I re-ran the benchmarks, see message below)
This code has 15% worse performance on benchmarks with small values of window (window=10). This is because the loop inside roll_quantile now contains a switch. I tried to replace the switch with a callback but it led to even worse performance. Even if I move some of the code to a new function (without any change in logic) it still makes performance much worse.
How bad is it? Could you please give me advice on how to arrange the code so that it has the same performance?
git diff upstream/master -u -- "*.py" | flake8 --diff