PERF: improves merge performance when key space exceeds i8 bounds #9151

Merged 1 commit on Dec 28, 2014

behzadnouri
Contributor

In join operations, current master switches to a less efficient path if the key space exceeds int64 bounds. This commit improves performance and memory usage:

on master:

In [1]: np.random.seed(2718281)

In [2]: left = DataFrame(np.random.randint(-1 << 10, 1 << 10, (1 << 20, 8)),
   ...:                  columns=list('ABCDEFG') + ['left'])

In [3]: i = np.random.permutation(len(left))

In [4]: right = left.iloc[i].copy()

In [5]: right.columns = right.columns[:-1].tolist() + ['right']

In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 13.8 s per loop

In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 1064.16 MiB, increment: 820.65 MiB

on branch:

In [6]: %timeit pd.merge(left, right, how='outer')
1 loops, best of 3: 1.42 s per loop

In [7]: %memit pd.merge(left, right, how='outer')
peak memory: 440.72 MiB, increment: 199.89 MiB
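
For context on why the benchmark above lands on the slow path, here is a minimal sketch of the key-space arithmetic. It assumes the combined key space is the product of the number of distinct values in each key column, and that the fast path needs that product to fit in a signed 64-bit integer; the exact check pandas performs may differ, and the helper names are hypothetical.

```python
from math import prod

# Largest value representable as a signed 64-bit integer (i8).
INT64_MAX = 2**63 - 1


def key_space_size(cardinalities):
    """Size of the combined key space: product of per-column distinct counts."""
    return prod(cardinalities)


def fits_in_int64(cardinalities):
    """True if every combined key can be encoded as a single int64 label."""
    return key_space_size(cardinalities) <= INT64_MAX


# The benchmark merges on 7 shared columns (A..G), each drawn from
# randint(-1 << 10, 1 << 10), i.e. up to 2**11 distinct values per column.
benchmark = [1 << 11] * 7

print(key_space_size(benchmark) == 2**77)  # 7 columns of 2**11 values each
print(fits_in_int64(benchmark))            # False: 2**77 exceeds int64
print(fits_in_int64([1 << 11] * 5))        # True: 2**55 still fits
```

With five such key columns the combined key still fits in an int64; with six or more it does not, which is the case this PR targets.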

join|merge benchmarks:

----------------------------------------------------------------------------------
Test name                                    |  head[ms] |   base[ms] |  ratio   |
----------------------------------------------------------------------------------
i8merge                                      | 1510.8590 | 13001.0207 |   0.1162 |
merge_2intkey_nosort                         |   20.9847 |    27.0010 |   0.7772 |
join_dataframe_index_single_key_small        |   16.4944 |    19.0903 |   0.8640 |
join_dataframe_integer_2key                  |    7.3093 |     8.0273 |   0.9106 |
merge_2intkey_sort                           |   60.6734 |    61.8456 |   0.9810 |
left_outer_join_index                        | 3165.2357 |  3214.8040 |   0.9846 |
join_dataframe_index_multi                   |   36.0967 |    36.5300 |   0.9881 |
join_dataframe_index_single_key_bigger_sort  |   24.7893 |    25.0640 |   0.9890 |
join_non_unique_equal                        |    0.9350 |     0.9403 |   0.9943 |
join_dataframe_index_single_key_bigger       |   24.8820 |    24.9500 |   0.9973 |
strings_join_split                           |   57.1507 |    56.9630 |   1.0033 |
join_dataframe_integer_key                   |    2.9247 |     2.8093 |   1.0411 |
----------------------------------------------------------------------------------
Test name                                    |  head[ms] |   base[ms] |  ratio   |
----------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [f2d9e17] : improves merge performance when key space exceeds i8 bounds
Base   [def58c9] : Merge pull request #9128 from hsperr/expanduser

ENH: Expanduser in to_file methods GH9066
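
As a quick reading aid for the table above, the ratio column is the head time divided by the base time, so its reciprocal is the speedup. A small sketch using the `i8merge` row (the helper name `ratio` is hypothetical):

```python
def ratio(head_ms, base_ms):
    """Benchmark ratio as reported in the table: head time over base time."""
    return head_ms / base_ms


# Timings copied from the i8merge row of the table.
i8merge = ratio(1510.8590, 13001.0207)

print(f"ratio   = {i8merge:.4f}")      # ratio   = 0.1162
print(f"speedup = {1 / i8merge:.1f}x")  # speedup = 8.6x
```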

@behzadnouri behzadnouri force-pushed the i8merge branch 2 times, most recently from d8a4043 to 847a6a1 Compare December 26, 2014 13:22
@jreback jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 28, 2014
@jreback jreback added this to the 0.16.0 milestone Dec 28, 2014
@jreback
Contributor

jreback commented Dec 28, 2014

looks good. pls add a release note in perf section.

@behzadnouri
Contributor Author

@jreback release notes added; travis build is all green

jreback added a commit that referenced this pull request Dec 28, 2014
PERF: improves merge performance when key space exceeds i8 bounds
@jreback jreback merged commit 773ee8b into pandas-dev:master Dec 28, 2014
@jreback
Contributor

jreback commented Dec 28, 2014

thanks!

@behzadnouri behzadnouri deleted the i8merge branch December 29, 2014 01:29
@jreback
Contributor

jreback commented Dec 29, 2014

4aa0e0a was necessary to fix the test comparison on 32-bit platforms (e.g. windows).

I think this was only a test issue, but can you confirm (maybe by adding a couple of tests) that this is ok when the originals have different dtypes?
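
One possible shape for such a check, as a rough sketch rather than the test that was actually added: merge on key columns whose dtypes differ between the two frames and compare against the same merge with matching dtypes. The frame size, the choice of `int32`/`int16` for columns `A` and `B`, and the comparison with `check_dtype=False` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(2718281)
keys = list("ABCDEFG")

# Smaller than the PR benchmark, but 7 key columns with well over a thousand
# distinct values each should still push the key space past int64.
left = pd.DataFrame(
    rng.randint(-1 << 10, 1 << 10, (1 << 12, 8)),
    columns=keys + ["left"],
).astype("int64")

right = left.iloc[rng.permutation(len(left))].copy()
right.columns = keys + ["right"]

# Assumed dtype mix: values lie in [-1024, 1023], so narrowing is lossless.
right_mixed = right.astype({"A": "int32", "B": "int16"})

expected = pd.merge(left, right, how="outer")
result = pd.merge(left, right_mixed, how="outer")


def normalized(df):
    """Sort rows by all columns so the comparison ignores row order."""
    return df.sort_values(list(df.columns)).reset_index(drop=True)


# Values must agree; dtypes of the key columns are allowed to differ.
pd.testing.assert_frame_equal(
    normalized(result), normalized(expected), check_dtype=False
)
print(len(result), len(expected))
```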
