API: port the magic X from pandas_ply/dplython to pandas proper? #13133

Many DataFrame methods ([now including `__getitem__`](https://github.com/pydata/pandas/issues/11485)) accept callables that take the DataFrame as input, e.g., `df[lambda x: x.sepal_length > 3]`.

However, this is annoyingly verbose. I [recently suggested](https://github.com/pydata/pandas/issues/13040) enabling argument-free lambdas like `df[lambda: sepal_length > 3]`, but this isn't a viable solution (too much magic!): a lambda body resolves names in the scope where it was defined, so it's impossible to implement with Python's standard scoping rules.
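To make the scoping problem concrete, here is a minimal sketch in plain Python. `apply_filter` is a hypothetical stand-in for something like `DataFrame.__getitem__`, not a pandas function; the point is only that a name local to the caller is invisible to the lambda's body.

``` python
def apply_filter(func):
    # Hypothetical stand-in for DataFrame.__getitem__: the "column" below is
    # local to this function, so the lambda's body cannot see it.
    sepal_length = 3.5
    return func()


try:
    # The lambda looks up `sepal_length` where the lambda was *defined*
    # (here: module scope), not inside apply_filter.
    apply_filter(lambda: sepal_length > 3)
    outcome = "evaluated"
except NameError:
    outcome = "NameError"

print(outcome)
```

Making this work would require rewriting the function's globals or inspecting its code object, which is the kind of magic I'd like to avoid.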

[pandas-ply](https://github.com/coursera/pandas-ply) and [dplython](https://github.com/dodger487/dplython) provide an alternative approach, based on a magic `X` operator, e.g.,

``` python
(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))
```

pandas-ply also introduces (injects onto `pandas.DataFrame`) two new DataFrame methods, `ply_select` and `ply_where`, that accept these symbolic expressions built from `X`. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.

I think we should consider introducing an object like `X` into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input. 

I don't think we need to port `ply_select` and `ply_where`, because support for expressions in `DataFrame.assign` and indexing is a good substitute.

So my proposed syntax (after `from pandas import X`) looks like the following:

``` python
(flights
 .groupby(['year', 'month', 'day'])
 .assign(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 [(X.arr > 30) & (X.dep > 30)])
```

Indexing is a little uglier than using the `ply_where` method, but otherwise this is a nice improvement. 

Best of all, we don't need to do any special tricks to introduce new scopes: we simply define `X.__getattr__` to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
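As a rough illustration of the `__getattr__` idea, here is a minimal sketch of a deferred-expression object. This is not the pandas-ply or dplython implementation, and none of these names (`Expr`, `Root`, `GetAttr`, `Call`, `BinOp`, `as_callable`) exist in pandas; they are hypothetical. The sketch records attribute access, method calls, and a few operators, then replays them against whatever object is supplied later. A `SimpleNamespace` stands in for the DataFrame so the example has no dependencies.

``` python
import operator
from types import SimpleNamespace


def _resolve(value, context):
    # Evaluate deferred expressions; pass plain values through unchanged.
    return value._eval(context) if isinstance(value, Expr) else value


class Expr:
    """Deferred expression: nothing is computed until _eval(context) runs."""

    def _eval(self, context):
        raise NotImplementedError

    def __getattr__(self, name):
        # Only reached when normal lookup fails. Private names are rejected
        # so copy/pickle-style probing does not build bogus expressions.
        if name.startswith("_"):
            raise AttributeError(name)
        return GetAttr(self, name)

    def __call__(self, *args, **kwargs):
        return Call(self, args, kwargs)

    def __gt__(self, other):
        return BinOp(operator.gt, self, other)

    def __lt__(self, other):
        return BinOp(operator.lt, self, other)

    def __and__(self, other):
        return BinOp(operator.and_, self, other)

    def __or__(self, other):
        return BinOp(operator.or_, self, other)


class Root(Expr):
    def _eval(self, context):
        return context  # X itself stands for the object supplied later


class GetAttr(Expr):
    def __init__(self, target, name):
        self._target, self._name = target, name

    def _eval(self, context):
        return getattr(self._target._eval(context), self._name)


class Call(Expr):
    def __init__(self, func, args, kwargs):
        self._func, self._args, self._kwargs = func, args, kwargs

    def _eval(self, context):
        func = self._func._eval(context)
        args = [_resolve(a, context) for a in self._args]
        kwargs = {k: _resolve(v, context) for k, v in self._kwargs.items()}
        return func(*args, **kwargs)


class BinOp(Expr):
    def __init__(self, op, left, right):
        self._op, self._left, self._right = op, left, right

    def _eval(self, context):
        return self._op(_resolve(self._left, context),
                        _resolve(self._right, context))


def as_callable(expr):
    # Hypothetical adapter: wrap an expression as an ordinary one-argument
    # function, the form that callable-accepting methods already take.
    return lambda obj: _resolve(expr, obj)


X = Root()

# Stand-in for a DataFrame; any object with attributes works here.
row = SimpleNamespace(arr=40, dep=10, label="delayed")
both_late = as_callable((X.arr > 30) & (X.dep > 30))
print(both_late(row))
print(X.label.upper()._eval(row))
```

In pandas itself, the methods that already accept callables could presumably check for an `Expr` and evaluate it against the DataFrame directly, so no adapter like `as_callable` would be needed on the user's side.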

In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our `X` implementation in pandas would be something that could be reused by dplython.

cc @joshuahhh @dodger487

