MemoryError with more than 1E9 rows #8252

Closed
@mattdowle

I have 240GB of RAM and nothing else is running on the machine. I'm trying to create 1.5E9 rows, which I think should produce a data frame of around 100GB, but I'm getting the MemoryError below. It works fine with 1E9 rows but not with 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9) rows, but all 240GB seems to be exhausted (according to htop) somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
>>> def randChar(f, numGrp, N) :
...    things = [f%x for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> def randFloat(numGrp, N) :
...    things = [round(100*np.random.random(),4) for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> N=int(1.5e9)       # N=int(1e9) works fine
>>> K=100
>>> DF = pd.DataFrame({
...   'id1' : randChar("id%03d", K, N),       # large groups (char)
...   'id2' : randChar("id%03d", K, N),       # large groups (char)
...   'id3' : randChar("id%010d", N//K, N),   # small groups (char)
...   'id4' : np.random.choice(K, N),         # large groups (int)
...   'id5' : np.random.choice(K, N),         # large groups (int)
...   'id6' : np.random.choice(N//K, N),      # small groups (int)
...   'v1' :  np.random.choice(5, N),         # int in range [0,4]
...   'v2' :  np.random.choice(5, N),         # int in range [0,4]
...   'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
... })
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 203, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 327, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4630, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3235, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3322, in form_blocks
    object_items, np.object_)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3346, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3410, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B
$
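
As a rough cross-check of the "around 100GB" figure, the sketch below adds up 8 bytes per cell for the nine columns, then makes a pessimistic guess at the peak during construction. Everything here is an assumption rather than a measurement: a 64-bit build (8-byte pointers, int64, float64), the four Python lists and five integer arrays staying alive while `pd.DataFrame` runs, each list being converted to an array once, and each dtype group being copied once into a consolidated block. The function name `rough_peak_bytes` is hypothetical.

```python
# Back-of-envelope accounting only; assumes a 64-bit build where object
# pointers, int64 and float64 values all take 8 bytes per element.
BYTES_PER_ELEMENT = 8


def rough_peak_bytes(n, n_str_cols=3, n_int_cols=5, n_float_list_cols=1):
    """Return the final frame size and an assumed construction peak, in bytes."""
    col = n * BYTES_PER_ELEMENT  # one column of n elements

    # Inputs referenced by the dict literal while DataFrame() runs:
    # Python lists (id1, id2, id3, v3) hold at least one pointer per element.
    input_lists = (n_str_cols + n_float_list_cols) * col
    input_arrays = n_int_cols * col  # id4, id5, id6, v1, v2

    # Assumption: each list is first converted to an ndarray ...
    converted = (n_str_cols + n_float_list_cols) * col
    # ... and each dtype group is then copied into one 2-D block.
    blocks = (n_str_cols + n_int_cols + n_float_list_cols) * col

    return {
        "final_frame": blocks,
        "assumed_peak": input_lists + input_arrays + converted + blocks,
    }


for n in (int(1e9), int(1.5e9)):
    est = rough_peak_bytes(n)
    print(n, {k: round(v / 2**30, 1) for k, v in est.items()})  # sizes in GiB
```

Under these assumptions the finished frame for 1.5E9 rows is about 108E9 bytes, consistent with the estimate above, while the assumed peak is about 264E9 bytes, which would exceed the 240G reported by `free -h`. For 1E9 rows the same accounting gives about 176E9 bytes, which would fit. Whether pandas actually holds all of these copies at once is not established here.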

An earlier question on S.O. is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas
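
One possible way to lower the footprint of the test data, sketched here on a small `N`, is to avoid the intermediate Python lists and keep the string ids as integer codes. This assumes a pandas version with categorical support (introduced in 0.15, so not the 0.14.1 used in the report); the helper names `rand_cat` and `rand_float` are hypothetical, and whether this avoids the MemoryError at 1.5E9 rows has not been tested.

```python
import numpy as np
import pandas as pd


def rand_cat(fmt, num_grp, n):
    """Random labels stored as integer codes plus one list of distinct labels."""
    categories = [fmt % x for x in range(num_grp)]
    codes = np.random.randint(0, num_grp, size=n)
    return pd.Categorical.from_codes(codes, categories=categories)


def rand_float(num_grp, n):
    """Random floats drawn from num_grp distinct values, as a float64 array."""
    things = np.round(100 * np.random.random(num_grp), 4)
    # Fancy indexing yields an ndarray directly, with no Python list in between.
    return things[np.random.randint(0, num_grp, size=n)]


N = 1000  # small illustrative size; the report uses int(1.5e9)
K = 100
DF = pd.DataFrame({
    "id1": rand_cat("id%03d", K, N),        # large groups (categorical)
    "id3": rand_cat("id%010d", N // K, N),  # small groups (categorical)
    "id4": np.random.randint(0, K, size=N), # large groups (int)
    "v3": rand_float(100, N),               # numeric e.g. 23.5749
})
print(DF.dtypes)
```

The remaining columns from the report would follow the same pattern. The trade-off is that the id columns become categorical rather than object columns, which may change how later group-by timings compare with the original setup.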


Labels: Performance, Reshaping