Description
Many DataFrame methods (now including __getitem__) accept callables that take the DataFrame as input, e.g., df[lambda x: x.sepal_length > 3].
However, this is annoyingly verbose. I recently suggested (#13040) enabling argument-free lambdas like df[lambda: sepal_length > 3], but this isn't a viable solution (too much magic!) because it's impossible to implement with Python's standard scoping rules.
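To make the scoping problem concrete, here is a minimal sketch in plain Python (no pandas). The helpers apply_with_arg and apply_no_arg and the columns dict are hypothetical stand-ins for a DataFrame method and its columns, not pandas APIs. A bare name inside a lambda is looked up in the scope where the lambda was written, so the callee has no ordinary way to supply it:

```python
# Hypothetical stand-in for a DataFrame's columns.
columns = {"sepal_length": 5.1}


def apply_with_arg(func, data):
    # The pattern pandas supports today: the callable receives the data.
    return func(data)


def apply_no_arg(func, data):
    # The callee holds `data`, but cannot inject its keys as bare names
    # into func's scope without resorting to frame or globals hacks.
    return func()


works = apply_with_arg(lambda x: x["sepal_length"] > 3, columns)

error = None
try:
    apply_no_arg(lambda: sepal_length > 3, columns)
except NameError as exc:
    # sepal_length is resolved in the lambda's defining scope, where it
    # does not exist.
    error = exc
```

The one-argument form works because the data arrives as an explicit parameter. The argument-free form fails as soon as it is called.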
pandas-ply and dplython provide an alternative approach, based on a magic X operator, e.g.,
(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))
pandas-ply also introduces (injects onto pandas.DataFrame) two new DataFrame methods, ply_select and ply_where, that accept these symbolic expressions built from X. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.
I think we should consider introducing an object like X into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input.
I don't think we need to port ply_select and ply_where, because support for expressions in DataFrame.assign and indexing is a good substitute.
So my proposed syntax (after from pandas import X) looks like the following:
(flights
  .groupby(['year', 'month', 'day'])
  .assign(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  [(X.arr > 30) & (X.dep > 30)])
Indexing is a little uglier than using the ply_where method, but otherwise this is a nice improvement.
Best of all, we don't need to do any special tricks to introduce new scopes: we simply define X.__getattr__ to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
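A rough sketch of how such an object could work, written in plain Python so it runs without pandas. Everything here (the Expr class, the evaluate helper, and the SimpleNamespace stand-in for a DataFrame) is hypothetical and for illustration only. It is not the pandas-ply or dplython implementation, and not a proposed pandas API. The idea is that attribute access, calls, and operators on X build up a deferred expression, which is resolved only once a concrete context object is supplied:

```python
import operator
from types import SimpleNamespace


def evaluate(value, ctx):
    # Resolve deferred expressions against ctx; pass plain values through.
    return value._resolve(ctx) if isinstance(value, Expr) else value


def _binary(op):
    # Build an operator method that defers `op` until evaluation time.
    def method(self, other):
        return Expr(lambda ctx: op(self._resolve(ctx), evaluate(other, ctx)))
    return method


class Expr:
    """Deferred expression: nothing is computed until evaluate() is called."""

    def __init__(self, resolve):
        # resolve: function mapping a context object to a concrete value.
        self._resolve = resolve

    def __getattr__(self, name):
        # Only reached for names not found normally (so not for _resolve).
        if name.startswith("__"):
            raise AttributeError(name)
        return Expr(lambda ctx: getattr(self._resolve(ctx), name))

    def __call__(self, *args, **kwargs):
        # Calling records a deferred call, so X.col.mean() stays symbolic.
        return Expr(lambda ctx: self._resolve(ctx)(
            *[evaluate(a, ctx) for a in args],
            **{k: evaluate(v, ctx) for k, v in kwargs.items()}))

    __gt__ = _binary(operator.gt)
    __lt__ = _binary(operator.lt)
    __and__ = _binary(operator.and_)
    __add__ = _binary(operator.add)


# The root placeholder simply resolves to the context object itself.
X = Expr(lambda ctx: ctx)

# Stand-in for a DataFrame: any object exposing columns as attributes.
flight = SimpleNamespace(arr_delay=45, dep_delay=10, carrier="aa")

both_late = evaluate((X.arr_delay > 30) & (X.dep_delay > 30), flight)
arr_late = evaluate(X.arr_delay > 30, flight)
carrier = evaluate(X.carrier.upper(), flight)
```

One design consequence worth noting: because calling an expression has to mean "record a method call", these objects cannot double as ordinary callables that take the DataFrame. In this sketch, a method accepting them would need to recognize the expression type explicitly and evaluate it, which is what the evaluate helper stands in for.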
In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our X implementation in pandas would be something that could be reused by dplython.