Series.map should return default dictionary values rather than NaN #15999

Closed
dhimmel opened this issue Apr 14, 2017 · 7 comments
Labels: Enhancement, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

@dhimmel
Contributor

dhimmel commented Apr 14, 2017

collections.Counter and collections.defaultdict both have default values. However, pandas.Series.map does not respect these defaults and instead returns missing values.

The issue is illustrated below:

import pandas
from collections import Counter, defaultdict
input = pandas.Series(range(5))
counter = Counter()
counter[1] += 1
output = input.map(counter)
expected = input.map(lambda x: counter[x])
pandas.DataFrame({
    'input': input,
    'output': output,
    'expected': expected,
})

Here's the output:

   expected  input  output
0         0      0     NaN
1         1      1     1.0
2         0      2     NaN
3         0      3     NaN
4         0      4     NaN

The workaround is rather easy (lambda x: dictionary[x]), and the change shouldn't be too hard to implement. Are people on board with the change? Is there a performance concern with looking up each key independently?

@jreback
Contributor

jreback commented Apr 14, 2017

why would you do this?

@dhimmel
Contributor Author

dhimmel commented Apr 14, 2017

I've run into this issue several times with collections.Counter. Most recently, see cell 6 of this notebook. With counters, if you haven't observed a key, it defaults to zero (since they're used for counting occurrences).

By using a defaultdict or Counter, the user has chosen that they would like default values. If they don't want defaults, they should just convert or use dict.
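
For readers less familiar with these containers, here is a minimal standard-library sketch of the default-value behavior under discussion. The variable names are illustrative only. It shows that indexing a Counter or defaultdict yields the default, while a plain dict copy of the same data raises KeyError.

```python
from collections import Counter, defaultdict

counter = Counter()
counter[1] += 1

# Indexing a Counter with an unseen key returns 0 and does not insert the key.
seen = counter[1]
unseen = counter[42]
assert 42 not in counter

# defaultdict calls its factory for unseen keys (and stores the result).
dd = defaultdict(list)
empty = dd["missing"]

# Converting to a plain dict drops the default: unseen keys raise KeyError.
plain = dict(counter)
try:
    plain[42]
    raised = False
except KeyError:
    raised = True

print(seen, unseen, empty, raised)  # 1 0 [] True
```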

@jreback
Contributor

jreback commented Apr 14, 2017

.map does not accept a Counter. Sure, it's dict-like, but not sure why you would actually do this anyhow.

@jreback
Contributor

jreback commented Apr 14, 2017

Looks like you should just do .groupby(...).value_counts() anyhow.

@dhimmel
Contributor Author

dhimmel commented Apr 14, 2017

not sure why you would actually do this anyhow

Because I have a counter of occurrences that I want to add as a column to a dataframe. In many cases the counter cannot be created in pandas using .value_counts(). For example:

  • the counter is created by iteratively reading a file that won't fit in memory
  • code must deal with a counter that is returned by another function
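
The first case might look roughly like the sketch below, which builds a Counter one line at a time so the whole input never has to be held in memory. The helper name count_tokens and the whitespace tokenization are assumptions made for illustration, and io.StringIO stands in for a real file handle.

```python
import io
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens, consuming one line at a time."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# io.StringIO stands in for a large file opened with open(path).
fake_file = io.StringIO("a b a\nc a\n")
counts = count_tokens(fake_file)
print(counts["a"], counts["b"], counts["z"])  # 3 1 0
```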

Now you could always use series.map(counter).fillna(0).astype(int), but this forces the user to deal with the conversion of ints to floats when there's missing data (which is one of the most frustrating aspects of pandas and should be avoided when possible).
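
A small sketch of the int-to-float round trip described above. It maps through dict(counter) rather than the Counter itself, as an assumption to make the example reproducible: a plain dict has no defaults, so unmatched keys come back as NaN.

```python
from collections import Counter

import pandas as pd

counter = Counter({1: 1})
series = pd.Series(range(5))

# A plain dict has no defaults, so unmatched keys become NaN and the
# result is upcast to float64.
mapped = series.map(dict(counter))

# Filling and casting recovers integers, at the cost of two extra steps.
filled = mapped.fillna(0).astype(int)

# Looking up each key through the Counter keeps integers throughout.
direct = series.map(lambda x: counter[x])

print(mapped.dtype)     # float64
print(filled.tolist())  # [0, 1, 0, 0, 0]
print(direct.tolist())  # [0, 1, 0, 0, 0]
```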

.map does not accept a Counter

Map does accept a Counter, since it's a subclass of dict, and provides no warning.

@chris-b1
Contributor

Right now we take a fastpath, building an Index out of the dict keys.

arg = self._constructor(arg, index=arg.keys())

So probably either should add a slowpath that respects full semantics if passed a dict subclass, or just raise.
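
As a rough illustration of why a lookup built from the dict keys loses the defaults (this is a sketch, not the pandas internals): the name lookup is hypothetical, and pd.Series is used here in place of self._constructor.

```python
from collections import Counter

import pandas as pd

counter = Counter({1: 1})
values = pd.Series(range(5))

# Snapshot the mapping as a Series indexed by the keys that exist right now.
# The Counter's default of 0 is not part of this snapshot.
lookup = pd.Series(list(counter.values()), index=list(counter.keys()))

# Mapping through a Series aligns on its index; absent keys become NaN.
result = values.map(lookup)
print(result.tolist())  # [nan, 1.0, nan, nan, nan]
```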

@dhimmel
Contributor Author

dhimmel commented Apr 14, 2017

So probably either should add a slowpath that respects full semantics if passed a dict subclass, or just raise.

What about adding the following to the head of the function?

if isinstance(arg, (collections.Counter, collections.defaultdict)):
    dictionary = arg
    arg = lambda x: dictionary[x]

Note there are other ways of simplifying the function's code I would also explore.
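
One possible variation on the check above, sketched in plain Python outside of pandas: rather than naming Counter and defaultdict explicitly, test for a `__missing__` method, which would also cover user-defined dict subclasses. The helper name as_lookup is hypothetical, and this is not presented as the change that was eventually merged.

```python
from collections import Counter, defaultdict


def as_lookup(arg):
    """Hypothetical helper: turn a dict-like into a one-argument lookup function."""
    if isinstance(arg, dict) and hasattr(arg, "__missing__"):
        mapping = arg
        # Indexing goes through __missing__, so defaults are respected.
        return lambda key: mapping[key]
    # Plain dicts have no __missing__; .get returns None for absent keys.
    return arg.get


counter = Counter({1: 1})
with_defaults = [as_lookup(counter)(k) for k in range(3)]
without_defaults = [as_lookup(dict(counter))(k) for k in range(3)]
print(with_defaults)     # [0, 1, 0]
print(without_defaults)  # [None, 1, None]
```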

I'm happy to submit a PR and add tests if this is an enhancement that would be accepted.

@jreback jreback added Enhancement Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Apr 14, 2017
@jreback jreback modified the milestones: Next Major Release, 0.20.0 Apr 14, 2017
jreback pushed a commit that referenced this issue Apr 15, 2017
* series.map: support dicts with defaults
closes #15999