
Performance issue with DataFrames with numpy fortran-order arrays and pd.concat #11958

Closed
@jennolsen84


Hi,

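When calling pd.concat() on multiple DataFrames backed by large Fortran-order numpy arrays, there is a big performance hit: most of the time goes into calling ravel(), which by default returns the elements in C order and therefore has to copy a Fortran-order array.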

See:
https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L4772

You can see that is_null(self) uses only a few values from the data after calling .ravel(), so the full copy is mostly wasted work.

An easy fix is to change that line to values_flat = values.ravel(order='K')
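
To illustrate why the flattening order should not matter for this kind of check, here is a minimal sketch. It is not the pandas source: `looks_all_null`, its `order` parameter, and the chunk size of 1000 are hypothetical names and values chosen for the example, and `np.isnan` stands in for whatever null test pandas actually uses.

```python
import numpy as np


def looks_all_null(values, order="C", chunk_len=1000):
    """Hypothetical stand-in for a chunked all-null check (not pandas code)."""
    # With order="C" a Fortran-order input is copied in full here;
    # with order="K" numpy can return a view over the same memory.
    flat = values.ravel(order=order)
    for start in range(0, flat.shape[0], chunk_len):
        # Stop at the first chunk that contains a non-null value.
        if not np.isnan(flat[start:start + chunk_len]).all():
            return False
    return True


data = np.asfortranarray(np.random.rand(2000, 500))
empty = np.asfortranarray(np.full((2000, 500), np.nan))

# The answer is the same for both orders; only the flattening cost differs.
print(looks_all_null(data), looks_all_null(data, order="K"))
print(looks_all_null(empty), looks_all_null(empty, order="K"))
```

Since the check only asks whether all values are null, the order in which the elements are visited does not change the result.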

Here is a link to numpy.ravel docs: http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ravel.html

‘K’ means to read the elements in the order they occur in memory, except for reversing the data when strides are negative. By default, ‘C’ index order is used.
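
A small sketch of the difference, using only numpy (the array and variable names are just for illustration): on a Fortran-order array, the default ravel() has to build a new C-ordered array, while ravel(order='K') can return a view over the existing memory.

```python
import numpy as np

# A small Fortran-order (column-major) array.
f_arr = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(3, 4))

flat_c = f_arr.ravel()           # default order='C': elements in row-major order
flat_k = f_arr.ravel(order="K")  # elements in the order they sit in memory

# The C-order result cannot share memory with a Fortran-order array, so it is a copy.
print(np.shares_memory(f_arr, flat_c))  # False
# The K-order result is a view on the same buffer; nothing is copied.
print(np.shares_memory(f_arr, flat_k))  # True

# The two results hold the same values in a different order.
print(flat_c[:4])  # [0. 1. 2. 3.]
print(flat_k[:4])  # [0. 4. 8. 1.]
```

The element order differs between the two results, so order='K' is only a drop-in replacement where the caller does not depend on C ordering.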

Labels: Performance, Reshaping, Usage Question
