
MemoryError with more than 1E9 rows #8252

Closed
mattdowle opened this issue Sep 12, 2014 · 3 comments · Fixed by #8331
Labels: Performance (Memory or execution speed), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@mattdowle

I have 240GB of RAM and nothing else running on the machine. I'm trying to create 1.5E9 rows, which I think should produce a data frame of around 100GB, but I'm getting the MemoryError below. It works fine with 1E9 rows but not 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9), but all 240GB appears exhausted (according to htop) somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.
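
Rough arithmetic behind the ~100GB estimate (assuming int64/float64 columns and 8-byte object pointers for the three string columns; the string objects themselves are shared between rows and comparatively small):

N = 1.5e9
numeric_bytes = 6 * 8 * N   # id4, id5, id6, v1, v2, v3: six 8-byte numeric columns -> 72 GB
pointer_bytes = 3 * 8 * N   # id1, id2, id3: object columns of 8-byte pointers -> 36 GB
print((numeric_bytes + pointer_bytes) / 1e9)   # ~108 GB, before any temporary copies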

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
>>> def randChar(f, numGrp, N) :
...    things = [f%x for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> def randFloat(numGrp, N) :
...    things = [round(100*np.random.random(),4) for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> N=int(1.5e9)       # N=int(1e9) works fine
>>> K=100
>>> DF = pd.DataFrame({
...   'id1' : randChar("id%03d", K, N),       # large groups (char)
...   'id2' : randChar("id%03d", K, N),       # large groups (char)
...   'id3' : randChar("id%010d", N//K, N),   # small groups (char)
...   'id4' : np.random.choice(K, N),         # large groups (int)
...   'id5' : np.random.choice(K, N),         # large groups (int)
...   'id6' : np.random.choice(N//K, N),      # small groups (int)
...   'v1' :  np.random.choice(5, N),         # int in range [0,4]
...   'v2' :  np.random.choice(5, N),         # int in range [0,4]
...   'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
... })
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 203, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 327, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4630, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3235, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3322, in form_blocks
    object_items, np.object_)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3346, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3410, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B
$

An earlier question on Stack Overflow is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas

@jreback (Contributor) commented Sep 12, 2014

You can try creating each column as a Series first, then putting them into a dict and creating the frame from that. However, you might be hitting a problem finding contiguous memory.
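
A minimal sketch of that suggestion (illustrative only; column names reused from the example above):

import numpy as np
import pandas as pd

N = int(1.5e9)
K = 100

cols = {}
cols['id4'] = pd.Series(np.random.choice(K, N))   # allocate one column at a time
cols['id5'] = pd.Series(np.random.choice(K, N))
cols['v1']  = pd.Series(np.random.choice(5, N))
# ...build the remaining columns the same way, each allocated independently...
DF = pd.DataFrame(cols)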

@jreback (Contributor) commented Sep 20, 2014

@mattdowle

I finally had time to look at this. I think there was an extra copy going on in certain cases.

So try this out using master (once I merge this change). It seems to scale much better.

Here is the slightly modified code:

# your original routines were using lots of extra memory as they were
# creating many python objects
import numpy as np
import pandas as pd

def randChar(f, num_group, N):
    things = np.array([f % x for x in range(num_group)])
    return things.take(np.random.choice(num_group, N)).astype('object')

def randFloat(num_group, N):
    things = (np.random.randn(num_group) * 100).round(4)
    return things.take(np.random.choice(num_group, N))

def f4(K, N):
    objects = pd.DataFrame({'id1' : randChar("id%03d", K, N),      # large groups (char)
                            'id2' : randChar("id%03d", K, N),      # large groups (char)
                            'id3' : randChar("id%010d", N//K, N)   # small groups (char)
                            })
    ints = pd.DataFrame({'id4' : np.random.choice(K, N),           # large groups (int)
                         'id5' : np.random.choice(K, N),           # large groups (int)
                         'id6' : np.random.choice(N//K, N),        # small groups (int)
                         'v1'  : np.random.choice(5, N),           # int in range [0,4]
                         'v2'  : np.random.choice(5, N)            # int in range [0,4]
                         })
    floats = pd.DataFrame({'v3' : randFloat(100, N)})              # numeric e.g. 23.5749

    return pd.concat([objects, ints, floats], axis=1, copy=False)
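
For scale, a hypothetical invocation matching the original report:

DF = f4(100, int(1.5e9))

The intent, per the comments above, is to avoid the extra copy: each dtype's block is built once in its own frame, and pd.concat(..., axis=1, copy=False) assembles the result without copying those blocks again.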

@jreback (Contributor) commented Sep 21, 2014

@mattdowle I updated the example to give a pretty simplified version that gives pretty good memory performance (i.e. just a bit over 1x the final data size) by not trying to create everything at once.
