
MemoryError with more than 1E9 rows #8252

Closed
mattdowle opened this issue Sep 12, 2014 · 3 comments · Fixed by #8331
Labels: Performance (Memory or execution speed), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@mattdowle

I have 240GB of RAM and nothing else running on the machine. I'm trying to create 1.5E9 rows, which I think should produce a data frame of around 100GB, but I'm getting the MemoryError below. It works fine with 1E9 rows but not 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9), but all 240GB appears exhausted (according to htop) somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.
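
Rough arithmetic behind the ~100GB estimate (assuming int64/float64 columns and 8-byte object pointers for the three string columns; the string objects themselves are shared between rows and comparatively small):

N = 1.5e9
numeric_bytes = 6 * 8 * N   # id4, id5, id6, v1, v2, v3: six 8-byte numeric columns -> 72 GB
pointer_bytes = 3 * 8 * N   # id1, id2, id3: object columns of 8-byte pointers -> 36 GB
print((numeric_bytes + pointer_bytes) / 1e9)   # ~108 GB, before any temporary copies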

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
>>> def randChar(f, numGrp, N) :
...    things = [f%x for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> def randFloat(numGrp, N) :
...    things = [round(100*np.random.random(),4) for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> N=int(1.5e9)       # N=int(1e9) works fine
>>> K=100
>>> DF = pd.DataFrame({
...   'id1' : randChar("id%03d", K, N),       # large groups (char)
...   'id2' : randChar("id%03d", K, N),       # large groups (char)
...   'id3' : randChar("id%010d", N//K, N),   # small groups (char)
...   'id4' : np.random.choice(K, N),         # large groups (int)
...   'id5' : np.random.choice(K, N),         # large groups (int)
...   'id6' : np.random.choice(N//K, N),      # small groups (int)
...   'v1' :  np.random.choice(5, N),         # int in range [0,4]
...   'v2' :  np.random.choice(5, N),         # int in range [0,4]
...   'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
... })
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 203, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 327, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4630, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3235, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3322, in form_blocks
    object_items, np.object_)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3346, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3410, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B
$

An earlier question on Stack Overflow is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas

@jreback (Contributor) commented Sep 12, 2014

You can try creating each column as a Series first, then putting them into a dict and creating the frame from that. However, you might be hitting a problem finding contiguous memory.
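
A minimal sketch of that suggestion (illustrative only; column names reused from the example above):

import numpy as np
import pandas as pd

N = int(1.5e9)
K = 100

cols = {}
cols['id4'] = pd.Series(np.random.choice(K, N))   # allocate one column at a time
cols['id5'] = pd.Series(np.random.choice(K, N))
cols['v1']  = pd.Series(np.random.choice(5, N))
# ...build the remaining columns the same way, each allocated independently...
DF = pd.DataFrame(cols)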

@jreback (Contributor) commented Sep 20, 2014

@mattdowle

I finally had time to look at this. I think there was an extra copy going on in certain cases.

So try this out using master (once I merge this change). It seems to scale much better.

Here is the slightly modified code:

# your original routines were using lots of extra memory as they were
# creating many python objects
import numpy as np
import pandas as pd

def randChar(f, num_group, N):
    things = np.array([f % x for x in range(num_group)])
    return things.take(np.random.choice(num_group, N)).astype('object')

def randFloat(num_group, N):
    things = (np.random.randn(num_group) * 100).round(4)
    return things.take(np.random.choice(num_group, N))

def f4(K, N):
    objects = pd.DataFrame({'id1' : randChar("id%03d", K, N),      # large groups (char)
                            'id2' : randChar("id%03d", K, N),      # large groups (char)
                            'id3' : randChar("id%010d", N//K, N)   # small groups (char)
                            })
    ints = pd.DataFrame({'id4' : np.random.choice(K, N),           # large groups (int)
                         'id5' : np.random.choice(K, N),           # large groups (int)
                         'id6' : np.random.choice(N//K, N),        # small groups (int)
                         'v1'  : np.random.choice(5, N),           # int in range [0,4]
                         'v2'  : np.random.choice(5, N)            # int in range [0,4]
                         })
    floats = pd.DataFrame({'v3' : randFloat(100, N)})              # numeric e.g. 23.5749

    return pd.concat([objects, ints, floats], axis=1, copy=False)
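
For scale, a hypothetical invocation matching the original report:

DF = f4(100, int(1.5e9))

The intent, per the comments above, is to avoid the extra copy: each dtype's block is built once in its own frame, and pd.concat(..., axis=1, copy=False) assembles the result without copying those blocks again.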

@jreback (Contributor) commented Sep 21, 2014

@mattdowle I updated the example to give a pretty simplified version that gives pretty good memory performance (i.e. just a bit over 1x the final data size) by not trying to create everything at once.
