Skip to content

BUG: Fix groupby duplicate column error message #8210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

bthyreau
Copy link
Contributor

@bthyreau bthyreau commented Sep 8, 2014

closes #7511

Though having duplicated column names in a dataframe is never a good idea, it may happen, and that shouldn't confuse groupby() with a meaningless message. Currently

>>> obj=pd.DataFrame(columns=["A","B","A", "C"], data=[range(4), range(2,6), range(0, 8, 2)])
>>> obj.groupby("A").mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 2846, in groupby
    sort=sort, group_keys=group_keys, squeeze=squeeze)
  File "pandas/core/groupby.py", line 1173, in groupby
    return klass(obj, by, **kwds)
  File "pandas/core/groupby.py", line 390, in __init__
    level=level, sort=sort)
  File "pandas/core/groupby.py", line 2086, in _get_grouper
    ping = Grouping(group_axis, gpr, obj=obj, name=name, level=level, sort=sort)
  File "pandas/core/groupby.py", line 1925, in __init__
    self.grouper = self.index.map(self.grouper)
  File "pandas/core/index.py", line 1524, in map
    return self._arrmap(self.values, mapper)
  File "generated.pyx", line 2203, in pandas.algos.arrmap_int64 (pandas/algos.c:72274)
TypeError: 'DataFrame' object is not callable

This patch fixes that.

>>> obj.groupby("A").mean()
     B  C
A A      
0 2  1  3
  4  2  6
2 4  3  5

Thanks.

@jreback
Copy link
Contributor

jreback commented Sep 8, 2014

this is issue #7511

@jreback jreback added Error Reporting Incorrect or improved errors from pandas Groupby labels Sep 8, 2014
@jreback jreback added this to the 0.15.0 milestone Sep 8, 2014
data=[range(4), range(2,6), range(0, 8, 2)])

grouped = df.groupby('A')
assert grouped.count().index.nlevels == 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to use self.assertTrue rather than bare asserts

@jreback
Copy link
Contributor

jreback commented Sep 8, 2014

This needs to raise if getattr(self.grouper,'ndim',None) != 1
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1943

instead of the specific check you are doing

@bthyreau
Copy link
Contributor Author

bthyreau commented Sep 9, 2014

ok, i updated following your request

  • The underlying ndim error in Grouping now properly raise (+ testcase) (though it's harder for end-user to trigger, as the problematic use-case itself is now correctly handled)
  • all assert -> self.assertTrue

@@ -1922,6 +1922,8 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,

# no level passed
if not isinstance(self.grouper, (Series, Index, np.ndarray)):
if getattr(self.grouper,'ndim', 1) != 1:
raise AssertionError("Grouper result with an ndim != 1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a ValueError

groupings.append(ping)
if isinstance(gpr, DataFrame) and gpr.ndim > 1:
for name, gpr in gpr.iteritems():
ping = Grouping(group_axis, gpr, obj=obj, name=name, level=level, sort=sort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this while section you added should be taken out. If their is a duplicate grouper then it should simply raise an error (can be even more informative if you'd like). also creates a multi-grouper which is really oddd (and in a weird order).

@bthyreau
Copy link
Contributor Author

Well, if pandas allow dataframes to have duplicated columns, ability to group like this seems a pretty user-expectable consequence, to me. In fact, I'm not even sure of an easy (user-friendly) way to workaround that limitation without altering/duplicating the data...

But anyway, i don't mind much about that feature, and it's trivial to put back in the future if needed. So the current code now focuses only on the error message thing, as requested. See last commit.

Thanks

@bthyreau bthyreau changed the title BUG: Fix groupby duplicate column BUG: Fix groupby duplicate column error message Sep 10, 2014
@jreback
Copy link
Contributor

jreback commented Sep 10, 2014

@bthyreau ok will merge this thanks.

In response to your comment, the ambiguity is that you are saying to pandas, hey I have multiple columns that I want to groupby on but they have the same NAME! (e.g. 'A','A') in your example. So the order if these is ambiguous.

If you wanted to group this way you could

df.groupby([df.iloc[:,3],df.iloc[:,4]]) or whatever. I honestly don't think this is very common at all and most likely a mistake; that's why we are raising.

@jreback
Copy link
Contributor

jreback commented Sep 10, 2014

merged via 77d5f04

@bthyreau
Copy link
Contributor Author

Ok Thanks
I noticed that you didn't merge my specific error-condition message before the Grouping() call; ie. instead of crashing with:

>>>df.groupby("A")
ValueError: Grouper result is not unique for 'A'
it would crashes with
ValueError: Grouper result with an ndim != 1

which, for the user point-of-view, is arguably no more helpful than before (#7511)

If that was not intended, i can reopen or create a new pr; just tell me.
Thanks

ps. code was:

+ if isinstance(gpr, DataFrame) and gpr.ndim > 1:
+   errmsg = "Grouper result is not unique for '%s'" % name
+   raise ValueError(errmsg)

@jreback
Copy link
Contributor

jreback commented Sep 11, 2014

I took that out as it is a very specific case; better to have a general message.

@bthyreau
Copy link
Contributor Author

I understand, but then there is no way for the user to know what causes the error.
Would you accept some trivial PR with at least the following error-message change ?
raise ValueError("Grouper result with an ndim != 1")
to
raise ValueError("Grouper result with an ndim != 1 for '%s'" % self.name)

@jreback
Copy link
Contributor

jreback commented Sep 14, 2014

sure

how about showing tyoe(self.object)

can't assume it would have a name

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unhelpful error message when groupby() called with a column that is duplicated
2 participants