Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
- Modin version (modin.__version__): 006ccd0 vs 0.12.0
- Python version: 3.8.12
- Engine: Ray
- Code we can use to reproduce:
MODIN_NPARTITIONS=32 python reproducer.py
reproducer.py:
import string
import random
import modin.pandas as pd
import numpy as np
from time import time

rows = 1500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)
df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

def test2():
    for _ in range(3):
        start = time()
        result = df.groupby(3, sort=False).agg(
            col1=(2, _format),
            col2=(1, _format),
        )
        print(f"end: {time()-start}")

test2()
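Describe the problem
The groupby aggregation in the reproducer is ~3.5x slower on 006ccd0 than on 0.12.0 (roughly 57-59 s vs. about 17 s per iteration).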
Source code / logs
0.12.0:
UserWarning: The size of /dev/shm is too small (369469804544 bytes). The required size at least half of RAM (405121079296 bytes). Please, delete files in /dev/shm or increase size of /dev/shm with --shm-size in Docker. Also, you can set the required memory size for each Ray worker in bytes to MODIN_MEMORY environment variable.
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
end: 16.87742853164673
end: 16.95181179046631
end: 17.432859182357788
006ccd0:
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
(_apply_func pid=645761)
end: 59.233707427978516
end: 58.498608350753784
end: 57.40613126754761
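For reference, the slowdown can be worked out from the timings logged above. This is only a small sketch: the helper name `slowdown` and the variable names are made up for illustration, and the numbers are copied verbatim from the two logs.

```python
from statistics import mean

# Per-iteration timings in seconds, copied from the logs above.
timings_0_12_0 = [16.87742853164673, 16.95181179046631, 17.432859182357788]
timings_006ccd0 = [59.233707427978516, 58.498608350753784, 57.40613126754761]


def slowdown(baseline, candidate):
    """Return the ratio of the mean candidate time to the mean baseline time."""
    return mean(candidate) / mean(baseline)


ratio = slowdown(timings_0_12_0, timings_006ccd0)
print(f"mean 0.12.0:  {mean(timings_0_12_0):.2f} s")
print(f"mean 006ccd0: {mean(timings_006ccd0):.2f} s")
print(f"slowdown:     {ratio:.2f}x")
```

On the three iterations logged here the ratio of the means comes out at about 3.4x, which is in line with the ~3.5x reported above.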