Add interpolation options to rolling quantile #20497

kornilova203 · 2018-03-27T09:34:52Z

It version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

Regular (not rolling) quantile supports these interpolation options: linear, lower, higher, nearest and midpoint.
This commit adds the same options to moving quantile.

Performance issues of this commit (note: I re-run benchmarks, see message below)
This code has 15% worse performance on benchmarks with small values of window (window=10). This is because loop inside roll_quantile now contains switch.
I tried to replace switch with callback but it led to even worse performance. Even if I move some of the code to new function (without any change in logic) it still makes performance much worse.
How bad is it? Could you please give me an advice on how to arrange the code such that it has the same performance?

closes #xxxx
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2018-03-27T10:19:26Z

what issue are you addressing here? if its about performance, pls show representative benchmarks (or run the asv)

it version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

if you read this: https://github.com/pandas-dev/pandas/pull/16247/files not sure what to say here, it had different defaults.

jreback · 2018-03-27T10:21:14Z

re-reading your changes I see that you are trying to fix perf. can you run the asv's here and show the results. (you may need to add one if the case you are addressing is not covered).

kornilova203 · 2018-03-27T11:48:39Z

@jreback, hi,

There is no issue related to this pull request. Do I need to create an issue first?

I am aware about the BUG that you referenced. Company at which I work uses rolling quantile without interpolation, when we updated pandas version, we discovered that behavior of rolling quantile has changed and there is no way to get non-interpolated results.

I re-ran asv benchmarks for rolling quantile (default version uses linear interpolation) and it did not show significant changes, sorry for confusing you.

BENCHMARKS NOT SIGNIFICANTLY CHANGED

I added new benchmarks for different interpolation options. I cannot compare them with previous results because interpolation options were not supported (supporting them is the purpose of this commit). Here are new benchmarks:

[ 20.00%] ··· Running rolling.Quantile.time_quantile             1.37±0.2ms;...
[ 40.00%] ··· Running rolling.Quantile.time_quantile_higher      1.38±0.2ms;...
[ 60.00%] ··· Running rolling.Quantile.time_quantile_lower       1.37±0.2ms;...
[ 80.00%] ··· Running rolling.Quantile.time_quantile_midpoint    1.38±0.2ms;...
[100.00%] ··· Running rolling.Quantile.time_quantile_nearest     1.37±0.2ms;...

I am not sure if I should add this new benchmarks to the commit. Should all the code be covered by benchmarks?

TomAugspurger · 2018-03-27T12:03:29Z

@kornilova-l some linting issues: https://travis-ci.org/pandas-dev/pandas/jobs/358799956#L3014

codecov · 2018-03-28T08:33:12Z

Codecov Report

Merging #20497 into master will increase coverage by 0.06%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20497      +/-   ##
==========================================
+ Coverage   91.77%   91.83%   +0.06%     
==========================================
  Files         153      153              
  Lines       49263    49320      +57     
==========================================
+ Hits        45213    45295      +82     
+ Misses       4050     4025      -25

Flag	Coverage Δ
#multiple	`90.23% <100%> (+0.06%)`	⬆️
#single	`41.87% <60%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
pandas/core/series.py	`93.9% <ø> (ø)`	⬆️
pandas/core/frame.py	`97.17% <ø> (ø)`	⬆️
pandas/core/window.py	`96.3% <100%> (ø)`	⬆️
pandas/core/dtypes/concat.py	`99.16% <0%> (-0.02%)`	⬇️
pandas/core/strings.py	`98.32% <0%> (-0.02%)`	⬇️
pandas/api/extensions/__init__.py	`100% <0%> (ø)`	⬆️
pandas/core/arrays/base.py	`84.14% <0%> (ø)`	⬆️
pandas/core/indexes/base.py	`96.63% <0%> (ø)`	⬆️
pandas/core/indexes/datetimelike.py	`96.72% <0%> (+0.01%)`	⬆️
pandas/core/common.py	`92.04% <0%> (+0.02%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ae19a1...9cc7a71. Read the comment docs.

jreback · 2018-03-28T10:25:05Z

@kornilova-l

There is no issue related to this pull request. Do I need to create an issue first?

Generally create an issue first. Since you already created a PR no need then.

Ok so perf is basically unchanged on small series. If you want to add an asv for that is ok.

jreback

looking good. comments and please add a line in the Other Enhancements section.

jreback · 2018-03-28T10:29:06Z

pandas/_libs/window.pyx

@@ -1357,25 +1357,53 @@ cdef _roll_min_max(ndarray[numeric] input, int64_t win, int64_t minp,
    return output


+def _get_interpolation_id(str interpolation):


look in algos.pxd for how we handle cdef enum TiebreakEnumType:

e.g you create an enum and a dict to map to that enum. This becomes a bit simpler then.

jreback · 2018-03-28T10:29:23Z

pandas/_libs/window.pyx

+                idx_with_fraction = quantile * <double> (nobs - 1)
+                idx = int(idx_with_fraction)
+
+                if interpolation_id == 0:  # linear


so you would use the enums here

jreback · 2018-03-28T10:29:41Z

pandas/_libs/window.pyx


    if quantile <= 0.0 or quantile >= 1.0:
        raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

+    # interpolation_id is needed to avoid string comparisons inside the loop


don't need the comments here

jreback · 2018-03-28T10:31:32Z

pandas/core/window.py

+            * lower: `i`.
+            * higher: `j`.
+            * nearest: `i` or `j` whichever is nearest.
+            * midpoint: (`i` + `j`) / 2.""")


can you add a Returns, Examples and a See Also section (look at the Series.quantile doc-string for inspiration). Also add a reference from the Series.quantile doc-string to here

jreback · 2018-03-28T10:32:06Z

pandas/tests/test_window.py

@@ -1135,7 +1135,22 @@ def test_rolling_quantile_series(self):
        s = Series(arr)
        q1 = s.quantile(0.1)
        q2 = s.rolling(100).quantile(0.1).iloc[-1]
+        tm.assert_almost_equal(q1, q2)


can you move to a separate test and parameterize on the interpolation (inlcude linear as well)

these are all coming to the same result value, is that right?

jreback

can you add a whatsnew note (other enhancements section)

jreback · 2018-03-30T20:12:55Z

pandas/core/window.py

+    2    2.0
+    3    3.0
+    dtype: float64
+    >>> s.rolling(2).quantile(.4, interpolation='midpoint')


can you add a blank line here

jreback · 2018-03-30T20:17:31Z

linting issue:

pandas/core/window.py:1289:1: W293 blank line contains whitespace

you can run
LINT=TRUE ci/lint.sh to see

jreback · 2018-03-30T20:17:43Z

@TomAugspurger comments?

jreback · 2018-03-30T20:18:33Z

@WillAyd if you'd have a look

WillAyd

Couple comments on my end - is it also possible to use nogil? May help with some of the performance regression?

WillAyd · 2018-03-30T21:13:58Z

pandas/_libs/window.pyx


    if quantile <= 0.0 or quantile >= 1.0:
        raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

+    try:
+        interpolation_type = interpolation_types[interpolation]


Wouldn't this raise a KeyError not a ValueError?

WillAyd · 2018-03-30T21:15:51Z

pandas/_libs/window.pyx

-                output[i] = ((vlow + (vhigh - vlow) *
-                             (quantile * (nobs - 1) - idx)))
+                idx_with_fraction = quantile * <double> (nobs - 1)
+                idx = int(idx_with_fraction)


Given everything else here is done with C syntax, can we use that casting convention here?

WillAyd · 2018-03-30T21:16:50Z

pandas/tests/test_window.py

@@ -1138,6 +1138,20 @@ def test_rolling_quantile_series(self):

        tm.assert_almost_equal(q1, q2)

+    @pytest.mark.parametrize('quantile', [0.0, 0.1, 0.45, 0.5, 1])


Can we add examples with NA data?

kornilova203 · 2018-04-01T13:25:23Z

@WillAyd I tried to replace range with prange but it does not compile because calling skiplist.get requires gil: Calling gil-requiring function not allowed without gil did I miss something?

jreback · 2018-04-01T13:33:43Z

ok we don't / can't use prange anywhere in the codebase. only you can use nogil

kornilova203 · 2018-04-01T14:40:32Z

@jreback but still using nogil is not possible because there are gil-requiring function calls

WillAyd · 2018-04-01T14:50:24Z

The only gil-requiring function call at the time of that comment was int which is why I asked if it was possible to use the C-style cast instead. I think if you did those in combination it would work. I see that you since added the round built-in so to get nogil to work you'd have to use a C-style version of that as well.

I did something similar in https://github.com/pandas-dev/pandas/pull/20405/files which you may be able to reference

WillAyd · 2018-04-01T14:55:08Z

pandas/tests/test_window.py

+        # Tests that rolling window's quantile behavior is analogous to
+        # Series' quantile for each interpolation option
+        size = 100
+        s = Series(np.random.rand(size))


I'd advise against using random data in a test case as it could cause intermittent failures and make bugs harder to find. To inject missing data, could you not just parametrize the function with a second array of say [0., np.nan, 0.2, np.nan, 0.4] to match what you had earlier?

WillAyd · 2018-04-01T14:56:27Z

pandas/_libs/window.pyx


    if quantile <= 0.0 or quantile >= 1.0:
        raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

+    try:
+        interpolation_type = interpolation_types[interpolation]
+    except KeyError:


Is there a test case to cover that this raises the expected error message when passing an invalid argument? If not can you add?

Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

WillAyd · 2018-04-01T15:00:23Z

pandas/tests/test_window.py

+        q2 = s.rolling(size, min_periods=1).quantile(
+            quantile, interpolation).iloc[-1]
+
+        tm.assert_almost_equal(q1, q2)


Is almost equal required here or is it possible to use input data so that float precision would not be an issue? I'm trying to be extra cautious that almost_equaldoesn't allow bugs to silently pass through some interpolation options

kornilova203 · 2018-04-13T08:15:33Z

About nogil: I changed int() to <int> and also removed round() function, but after adding nogil compiler still complains about skiplist methods (skiplist.insert, skiplist.remove, skiplist.get).

One of compiler messages:

Error compiling Cython file:
------------------------------------------------------------
...
                        elif idx_with_fraction - idx < 0.5:
                            output[i] = skiplist.get(idx)
                        else:
                            output[i] = skiplist.get(idx + 1)
                    elif interpolation_type == MIDPOINT:
                        vlow = skiplist.get(idx)
                                          ^
------------------------------------------------------------

pandas/_libs/window.pyx:1474:43: Calling gil-requiring function not allowed without gil

jreback · 2018-04-13T11:32:06Z

pandas/_libs/window.pyx

+                    else:
+                        output[i] = skiplist.get(idx + 1)
+                elif interpolation_type == MIDPOINT:
+                    vlow = skiplist.get(idx)


to make this work w/o the gil, you need to call skiplist_get(skiplist, idx) (there are examples in the file). skiplist.get invokes a python function call and is not allowed

kornilova203 · 2018-04-13T13:37:19Z

Finally code works with nogil

There is another problem:
Some tests in python 2.7 environment fail. All failed tests use midpoint interpolation. I am almost sure that it is related to numpy/numpy#7163
Tests compare results of quantile and rolling_quantile. quantile uses np.percentile which used to have a bug in 'midpoint' interpolation.

What should I do with failed tests? Is it okay to test midpoint separately and not compare it to quantile?

jreback · 2018-04-13T14:11:16Z

just skip if < numpy 1.12 on those tests

jreback · 2018-04-16T10:25:13Z

pandas/tests/test_window.py

@@ -6,6 +6,7 @@
 from datetime import datetime, timedelta
 from numpy.random import randn
 import numpy as np
+from scipy._lib._version import NumpyVersion


use from pandas import _np_version_under1p12

WillAyd · 2018-04-16T16:02:51Z

pandas/tests/test_window.py

+        q2 = s.rolling(len(data), min_periods=1).quantile(
+            quantile, interpolation).iloc[-1]
+
+        if np.isnan(q1):


Is tm.assert_series_equal not an option here?

@WillAyd, it is not series type. I edited test data a bit and was able to get rid of round

WillAyd · 2018-04-16T16:05:09Z

pandas/tests/test_window.py

+                                      [np.nan, np.nan, np.nan, np.nan],
+                                      [np.nan, 0.1, np.nan, 0.3, 0.4, 0.5],
+                                      [0.5], [np.nan, 0.7, 0.5]])
+    def test_rolling_quantile_interpolation_options(self, quantile,


I think this and the above test cover practically the same use case. If that's the case I'd get rid of the test above

WillAyd · 2018-04-16T16:07:01Z

pandas/_libs/window.pyx


    if quantile <= 0.0 or quantile >= 1.0:
        raise ValueError("quantile value {0} not in [0, 1]".format(quantile))

+    try:
+        interpolation_type = interpolation_types[interpolation]
+    except KeyError:


Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

WillAyd

Couple minor edits otherwise lgtm. Can you post updated ASVs?

WillAyd · 2018-04-18T22:38:36Z

pandas/core/window.py

+            * lower: `i`.
+            * higher: `j`.
+            * nearest: `i` or `j` whichever is nearest. Implementation uses
+              round() built-in.


Can remove this comment about the round() built-in

WillAyd · 2018-04-18T22:40:52Z

pandas/core/window.py

+    -------
+    Series or DataFrame
+        Returned object type is determined by the caller of the %(name)s
+        calculation


Very minor but description should end with a period

WillAyd · 2018-04-18T22:42:08Z

pandas/core/window.py

+
+    See Also
+    --------
+    pandas.Series.quantile


Just add some quick descriptions after these (separated by colon). Can reference docstring guide:

https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-5-see-also

WillAyd · 2018-04-18T22:43:12Z

pandas/tests/test_window.py

+        s = Series(data)
+
+        q1 = s.quantile(quantile, interpolation)
+        q2 = s.rolling(len(data), min_periods=1).quantile(


I suppose this is just expanding instead of rolling so should use the former to stay idiomatic

WillAyd · 2018-04-19T16:25:47Z

asv_bench/benchmarks/rolling.py

        arr = np.random.random(N).astype(dtype)
        self.roll = getattr(pd, constructor)(arr).rolling(window)

    def time_quantile(self, constructor, window, dtype, percentile):
        self.roll.quantile(percentile)
+
+    def time_quantile_nearest(self, constructor, window, dtype, percentile):


Ha when I asked for updated ASVs I was just asking for a refresh of what you posted earlier in the thread but I suppose this is better. Is it possible to just add interpolation as an argument to setup? Would be more concise.

Here's an example from GroupBy that you could use for reference, though the existing Quantile class could clue you in on how to do this as well:

pandas/asv_bench/benchmarks/groupby.py

Line 342 in 78fee04

params = [['int', 'float', 'object', 'datetime'],

Please also post results to a comment when done. I'm assuming that these will fail on master but would still be good to know how they baseline against the current implementation that you posted earlier in thread

jreback

can you rebase and address small comments by @WillAyd, otherwise lgtm.

jreback · 2018-04-21T22:07:52Z

doc/source/whatsnew/v0.23.0.txt

@@ -402,6 +402,7 @@ Other Enhancements
 - :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
  ``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
 - :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
+- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword


@jorisvandenbossche do these refs work?

@jreback, how should they work? I built documentation and all references in the whatsnew.html are rendered as plain bold text.

jreback · 2018-04-21T22:08:11Z

doc/source/whatsnew/v0.23.0.txt

@@ -402,6 +402,7 @@ Other Enhancements
 - :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
  ``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
 - :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
+- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword


can you add (this) issue number here

kornilova203 · 2018-04-23T11:24:32Z

In ASV benchmarks I added interpolation as a parameter to setup. Performance is the same compared to results that I posted earlier

[100.00%] ··· Running rolling.Quantile.time_quantile                 1.38±0.2ms;...

I will rebase branch

kornilova203 · 2018-04-23T11:37:54Z

Now there are lots of commits from master :( hope I did rebasing correctly

WillAyd · 2018-04-23T15:28:09Z

Shouldn't be including the commits of other contributors - how did you rebase this? Need to figure out how to undo that and only replay your commits on top of master

Add whatsnew Fix bug: catch KeyError not ValueError Add return type to roll_quantile

kornilova203 · 2018-04-24T08:01:55Z

Finally I did rebasing correctly (I hope)
Should I squash these 13 commits into one?

jreback · 2018-04-24T11:30:07Z

thanks @kornilova-l nice patch!

generally you don't need to squash, you can do it if you want to make things more readable / easier if you want. but not necessary to merge.

kornilova203 changed the title ~~Add interpolation options to moving quantile~~ Add interpolation options to rolling quantile Mar 27, 2018

jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 27, 2018

jreback removed the Performance Memory or execution speed performance label Mar 28, 2018

jreback requested changes Mar 28, 2018

View reviewed changes

jreback added the Enhancement label Mar 28, 2018

jreback requested changes Mar 30, 2018

View reviewed changes

WillAyd requested changes Mar 30, 2018

View reviewed changes

WillAyd requested changes Apr 1, 2018

View reviewed changes

jreback reviewed Apr 13, 2018

View reviewed changes

jreback requested changes Apr 16, 2018

View reviewed changes

WillAyd requested changes Apr 16, 2018

View reviewed changes

WillAyd requested changes Apr 18, 2018

View reviewed changes

WillAyd reviewed Apr 19, 2018

View reviewed changes

jreback requested changes Apr 21, 2018

View reviewed changes

kornilova203 force-pushed the add-interpolation-options-to-rolling-quantile branch 2 times, most recently from 277ab9f to 9212c9f Compare April 24, 2018 07:21

kornilova203 added 13 commits April 24, 2018 10:23

Add interpolation options to moving quantile

f31a448

Fix linting issues

7afa6b5

Use enum instead of integers

a02183a

Add parametrized test

412eb98

Update documentation

f5fb6cb

Fix bug in 'nearest' interpolation

4f80369

Add whatsnew Fix bug: catch KeyError not ValueError Add return type to roll_quantile

Add test with nan values. Do not interpolate if index is an integer

6d5c77d

Use nogil

dc5e74d

Skip tests if numpy < 1.12.0 (midpoint interpolation is broken there)

5280f95

Use _np_version_under1p12 instead of NumpyVersion

2ac734c

Remove redundant test. Place interpolation in single quotes

f21d21a

Update ASVs, docs and tests

f9b1e7e

Update whatsnew and benchmarks

3a2e431

kornilova203 force-pushed the add-interpolation-options-to-rolling-quantile branch from 9212c9f to 3a2e431 Compare April 24, 2018 07:26

jreback added 2 commits April 24, 2018 07:25

test rename

b986a5d

Merge branch 'master' into PR_TOOL_MERGE_PR_20497

fd9568a

jreback approved these changes Apr 24, 2018

View reviewed changes

jreback added this to the 0.23.0 milestone Apr 24, 2018

doc

9cc7a71

jreback merged commit ce8f6e8 into pandas-dev:master Apr 24, 2018

TomAugspurger mentioned this pull request Apr 25, 2018

Doc build failing on master #20822

Closed

		@@ -1357,25 +1357,53 @@ cdef _roll_min_max(ndarray[numeric] input, int64_t win, int64_t minp,
		return output


		def _get_interpolation_id(str interpolation):

		@@ -1138,6 +1138,20 @@ def test_rolling_quantile_series(self):

		tm.assert_almost_equal(q1, q2)

		@pytest.mark.parametrize('quantile', [0.0, 0.1, 0.45, 0.5, 1])

Uh oh!

Add interpolation options to rolling quantile #20497

Add interpolation options to rolling quantile #20497

Uh oh!

Conversation

kornilova203 commented Mar 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented Mar 27, 2018

Uh oh!

kornilova203 commented Mar 27, 2018

Uh oh!

TomAugspurger commented Mar 27, 2018

Uh oh!

codecov bot commented Mar 28, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jreback commented Mar 28, 2018

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Mar 30, 2018

Uh oh!

jreback commented Mar 30, 2018

Uh oh!

jreback commented Mar 30, 2018

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kornilova203 commented Apr 1, 2018

Uh oh!

jreback commented Apr 1, 2018

Uh oh!

kornilova203 commented Apr 1, 2018

Uh oh!

WillAyd commented Apr 1, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kornilova203 commented Apr 13, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kornilova203 commented Apr 13, 2018

Uh oh!

kornilova203 commented Mar 27, 2018 •

edited

Loading

jreback commented Mar 27, 2018 •

edited

Loading

codecov bot commented Mar 28, 2018 •

edited

Loading