BUG: Fix groupby duplicate column error message #8210

bthyreau · 2014-09-08T02:41:35Z

Though having duplicated column names in a dataframe is never a good idea, it may happen, and that shouldn't confuse groupby() with a meaningless message. Currently

>>> obj=pd.DataFrame(columns=["A","B","A", "C"], data=[range(4), range(2,6), range(0, 8, 2)])
>>> obj.groupby("A").mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/generic.py", line 2846, in groupby
    sort=sort, group_keys=group_keys, squeeze=squeeze)
  File "pandas/core/groupby.py", line 1173, in groupby
    return klass(obj, by, **kwds)
  File "pandas/core/groupby.py", line 390, in __init__
    level=level, sort=sort)
  File "pandas/core/groupby.py", line 2086, in _get_grouper
    ping = Grouping(group_axis, gpr, obj=obj, name=name, level=level, sort=sort)
  File "pandas/core/groupby.py", line 1925, in __init__
    self.grouper = self.index.map(self.grouper)
  File "pandas/core/index.py", line 1524, in map
    return self._arrmap(self.values, mapper)
  File "generated.pyx", line 2203, in pandas.algos.arrmap_int64 (pandas/algos.c:72274)
TypeError: 'DataFrame' object is not callable

This patch fixes that.

>>> obj.groupby("A").mean()
     B  C
A A      
0 2  1  3
  4  2  6
2 4  3  5

Thanks.

jreback · 2014-09-08T02:44:26Z

this is issue #7511

jreback · 2014-09-08T11:17:35Z

pandas/tests/test_groupby.py

+                data=[range(4), range(2,6), range(0, 8, 2)])
+
+        grouped = df.groupby('A')
+        assert grouped.count().index.nlevels == 2


need to use self.assertTrue rather than bare asserts

jreback · 2014-09-08T12:57:14Z

This needs to raise if getattr(self.grouper,'ndim',None) != 1
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1943

instead of the specific check you are doing

bthyreau · 2014-09-09T01:04:35Z

ok, i updated following your request

The underlying ndim error in Grouping now properly raise (+ testcase) (though it's harder for end-user to trigger, as the problematic use-case itself is now correctly handled)
all assert -> self.assertTrue

jreback · 2014-09-09T01:10:52Z

pandas/core/groupby.py

@@ -1922,6 +1922,8 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,

            # no level passed
            if not isinstance(self.grouper, (Series, Index, np.ndarray)):
+                if getattr(self.grouper,'ndim', 1) != 1:
+                    raise AssertionError("Grouper result with an ndim != 1")


make this a ValueError

jreback · 2014-09-09T17:08:06Z

pandas/core/groupby.py

-        groupings.append(ping)
+        if isinstance(gpr, DataFrame) and gpr.ndim > 1:
+            for name, gpr in gpr.iteritems():
+                ping = Grouping(group_axis, gpr, obj=obj, name=name, level=level, sort=sort)


this while section you added should be taken out. If their is a duplicate grouper then it should simply raise an error (can be even more informative if you'd like). also creates a multi-grouper which is really oddd (and in a weird order).

bthyreau · 2014-09-10T00:25:54Z

Well, if pandas allow dataframes to have duplicated columns, ability to group like this seems a pretty user-expectable consequence, to me. In fact, I'm not even sure of an easy (user-friendly) way to workaround that limitation without altering/duplicating the data...

But anyway, i don't mind much about that feature, and it's trivial to put back in the future if needed. So the current code now focuses only on the error message thing, as requested. See last commit.

Thanks

jreback · 2014-09-10T19:40:54Z

@bthyreau ok will merge this thanks.

In response to your comment, the ambiguity is that you are saying to pandas, hey I have multiple columns that I want to groupby on but they have the same NAME! (e.g. 'A','A') in your example. So the order if these is ambiguous.

If you wanted to group this way you could

df.groupby([df.iloc[:,3],df.iloc[:,4]]) or whatever. I honestly don't think this is very common at all and most likely a mistake; that's why we are raising.

jreback · 2014-09-10T19:50:50Z

merged via 77d5f04

bthyreau · 2014-09-11T06:13:38Z

Ok Thanks
I noticed that you didn't merge my specific error-condition message before the Grouping() call; ie. instead of crashing with:

>>>df.groupby("A")
ValueError: Grouper result is not unique for 'A'
it would crashes with
ValueError: Grouper result with an ndim != 1

which, for the user point-of-view, is arguably no more helpful than before (#7511)

If that was not intended, i can reopen or create a new pr; just tell me.
Thanks

ps. code was:

+ if isinstance(gpr, DataFrame) and gpr.ndim > 1:
+   errmsg = "Grouper result is not unique for '%s'" % name
+   raise ValueError(errmsg)

jreback · 2014-09-11T11:31:22Z

I took that out as it is a very specific case; better to have a general message.

bthyreau · 2014-09-14T13:33:06Z

I understand, but then there is no way for the user to know what causes the error.
Would you accept some trivial PR with at least the following error-message change ?
raise ValueError("Grouper result with an ndim != 1")
to
raise ValueError("Grouper result with an ndim != 1 for '%s'" % self.name)

jreback · 2014-09-14T14:30:10Z

sure

how about showing tyoe(self.object)

can't assume it would have a name

bthyreau added 2 commits September 8, 2014 11:05

BUG: fix groupby crash on duplicated columns

3c8e43f

BUG: groupby crash on dup columns, + test

cf79d6c

jreback added Error Reporting Incorrect or improved errors from pandas Groupby labels Sep 8, 2014

jreback added this to the 0.15.0 milestone Sep 8, 2014

jreback reviewed Sep 8, 2014
View reviewed changes

bthyreau added 3 commits September 9, 2014 09:26

catch a possible error-case

58b786e

use assertTrue; adds ref to GH bug

82eb71e

add test for multidim error-case

4156b06

jreback reviewed Sep 9, 2014
View reviewed changes

bthyreau added 2 commits September 9, 2014 10:17

s/AssertionError/ValueError

4ec5216

adds entry to changelog

0562a07

jreback reviewed Sep 9, 2014
View reviewed changes

bthyreau added 2 commits September 10, 2014 09:11

removing ability to groupby on duplicated column

8fce3fd

+ test

8697398

bthyreau changed the title ~~BUG: Fix groupby duplicate column~~ BUG: Fix groupby duplicate column error message Sep 10, 2014

jreback closed this Sep 10, 2014

jreback mentioned this pull request Sep 10, 2014

Unhelpful error message when groupby() called with a column that is duplicated #7511

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

BUG: Fix groupby duplicate column error message #8210

BUG: Fix groupby duplicate column error message #8210

Uh oh!

bthyreau commented Sep 8, 2014

Uh oh!

jreback commented Sep 8, 2014

Uh oh!

jreback Sep 8, 2014

Uh oh!

jreback commented Sep 8, 2014

Uh oh!

bthyreau commented Sep 9, 2014

Uh oh!

jreback Sep 9, 2014

Uh oh!

jreback Sep 9, 2014

Uh oh!

bthyreau commented Sep 10, 2014

Uh oh!

jreback commented Sep 10, 2014

Uh oh!

jreback commented Sep 10, 2014

Uh oh!

bthyreau commented Sep 11, 2014

Uh oh!

jreback commented Sep 11, 2014

Uh oh!

bthyreau commented Sep 14, 2014

Uh oh!

jreback commented Sep 14, 2014

Uh oh!

Uh oh!

Uh oh!

BUG: Fix groupby duplicate column error message #8210

BUG: Fix groupby duplicate column error message #8210

Uh oh!

Conversation

bthyreau commented Sep 8, 2014

Uh oh!

jreback commented Sep 8, 2014

Uh oh!

jreback Sep 8, 2014

Choose a reason for hiding this comment

Uh oh!

jreback commented Sep 8, 2014

Uh oh!

bthyreau commented Sep 9, 2014

Uh oh!

jreback Sep 9, 2014

Choose a reason for hiding this comment

Uh oh!

jreback Sep 9, 2014

Choose a reason for hiding this comment

Uh oh!

bthyreau commented Sep 10, 2014

Uh oh!

jreback commented Sep 10, 2014

Uh oh!

jreback commented Sep 10, 2014

Uh oh!

bthyreau commented Sep 11, 2014

Uh oh!

jreback commented Sep 11, 2014

Uh oh!

bthyreau commented Sep 14, 2014

Uh oh!

jreback commented Sep 14, 2014

Uh oh!

Uh oh!