API: port the magic X from pandas_ply/dplython to pandas proper? #13133

Closed
@shoyer

Many DataFrame methods (now including __getitem__) accept callables that take the DataFrame as input, e.g., df[lambda x: x.sepal_length > 3].

However, this is annoyingly verbose. I recently suggested (#13040) enabling argument-free lambdas like df[lambda: sepal_length > 3], but this isn't a viable solution (too much magic!) because it's impossible to implement with Python's standard scoping rules.
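To make the scoping problem concrete, here is a minimal sketch in plain Python. The dict `row` is only a stand-in for a DataFrame, and the names `with_arg` and `no_arg` are made up for illustration. A bare name inside a lambda is resolved lexically where the lambda is defined, so the callee has no standard way to inject column names into it.

```python
# Stand-in for a DataFrame (hypothetical data, for illustration only).
row = {"sepal_length": 5.1}

# The callable receives its context as an argument, so the lookup is explicit.
with_arg = lambda x: x["sepal_length"] > 3

# The bare name is looked up in the enclosing/global scope at call time.
# Nothing named sepal_length exists there, so calling this fails.
no_arg = lambda: sepal_length > 3

result = with_arg(row)

try:
    no_arg()
    error = None
except NameError as exc:
    error = str(exc)

print(result)
print(error)
```

Making the second form work would require rewriting the function's globals or inspecting its code object, which is the "too much magic" referred to above.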

pandas-ply and dplython provide an alternative approach, based on a magic X operator, e.g.,

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

pandas-ply also introduces (injects onto pandas.DataFrame) two new dataframe methods, ply_select and ply_where, that accept these symbolic expressions built from X. dplython takes a different approach, introducing its own dplyr-like API for chaining expressions instead of using method chaining. The pandas-ply approach is much closer to what makes sense for pandas proper, given that we already support method chaining.

I think we should consider introducing an object like X into pandas proper and supporting its use on all pandas methods that accept callables that take the DataFrame as input.

I don't think we need to port ply_select and ply_where, because support for expressions in DataFrame.assign and indexing is a good substitute.

So my proposed syntax (after from pandas import X) looks like the following:

(flights
 .groupby(['year', 'month', 'day'])
 .assign(
     arr = X.arr_delay.mean(),
     dep = X.dep_delay.mean())
 [(X.arr > 30) & (X.dep > 30)])

Indexing is a little uglier than using the ply_where method, but otherwise this is a nice improvement.

Best of all, we don't need to do any special tricks to introduce new scopes -- we simply define X.__getattr__ to look up attributes as columns in the DataFrame context. I expect we could even reuse the expression engines from pandas-ply or dplython directly, perhaps with a few modifications.
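As a rough sketch of the idea (not the actual pandas-ply or dplython implementation), a deferred-expression object can be built from `__getattr__` plus a few operator overloads. The names `Expr`, `evaluate`, `_value` and `_binary` are hypothetical, and a `SimpleNamespace` stands in for the DataFrame so the example runs without pandas.

```python
import operator
from types import SimpleNamespace


class Expr:
    """A deferred expression; nothing is computed until it is evaluated."""

    def __init__(self, resolve):
        # resolve: function mapping a context object (e.g. a DataFrame) to a value
        self._resolve = resolve

    def __getattr__(self, name):
        # Only called for names not found normally; refuse private/dunder
        # names so protocol probing (copy, pickle, hasattr) behaves sanely.
        if name.startswith("_"):
            raise AttributeError(name)
        return Expr(lambda ctx: getattr(self._resolve(ctx), name))

    def __call__(self, *args, **kwargs):
        # Calling an expression defers a method call, e.g. X.arr_delay.mean().
        return Expr(lambda ctx: self._resolve(ctx)(
            *[_value(a, ctx) for a in args],
            **{k: _value(v, ctx) for k, v in kwargs.items()}))


def _value(obj, ctx):
    """Evaluate obj against ctx if it is an Expr, else return it unchanged."""
    return obj._resolve(ctx) if isinstance(obj, Expr) else obj


def _binary(op):
    def method(self, other):
        return Expr(lambda ctx: op(self._resolve(ctx), _value(other, ctx)))
    return method


# Attach a handful of operators; a full version would cover many more.
for _name in ("gt", "lt", "ge", "le", "and_", "or_", "add", "sub", "mul", "truediv"):
    setattr(Expr, "__%s__" % _name.rstrip("_"), _binary(getattr(operator, _name)))

X = Expr(lambda ctx: ctx)  # the root placeholder: resolves to the context itself


def evaluate(expr, ctx):
    return _value(expr, ctx)


# Stand-in context with made-up values; with pandas this would be a DataFrame.
row = SimpleNamespace(arr=45, dep=12, label="abc")
mask = (X.arr > 30) & (X.dep > 30)
print(evaluate(mask, row))
print(evaluate(X.label.upper(), row))
```

One design wrinkle in this sketch: because `__call__` is used to build deferred method calls, an `Expr` is not itself a callable that takes the DataFrame. Passing it to existing callable-accepting methods would need either a wrapper such as `lambda df: evaluate(expr, df)` or explicit recognition of the expression type inside pandas.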

In my mind, this would mostly obviate the need for pandas-ply, though the alternate API provided by dplython would still be independently useful. In an ideal world, our X implementation in pandas would be something that could be reused by dplython.

cc @joshuahhh @dodger487

Labels: Enhancement, Needs Discussion (requires discussion from core team before further action)
