groupby.agg() regression #4345

Open
@anmyachev

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • Modin version (modin.__version__): 006ccd0 vs 0.12.0
  • Python version: 3.8.12
  • Engine: Ray
  • Code we can use to reproduce: MODIN_NPARTITIONS=32 python reproducer.py

reproducer.py:

import string
import random

import modin.pandas as pd
import numpy as np
from time import time

rows = 1500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)

df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan

def test2():
    for _ in range(3):
        start = time()
        result = df.groupby(3, sort=False).agg(
            col1=(2, _format),
            col2=(1, _format),
        )
        print(f"end: {time()-start}")

test2()
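
To see what the aggregation in the reproducer actually computes, here is a minimal sketch on a five-row frame. It assumes plain pandas rather than modin.pandas, and the frame `small` and its values are made up for illustration; only `_format` and the `groupby(...).agg(...)` call are taken from the reproducer. Groups with more than two rows are joined (dropping the last value), all other groups become NaN.

```python
import numpy as np
import pandas as pd  # assumption: plain pandas, not modin.pandas


def _format(x):
    # Same function as in reproducer.py: join all but the last value
    # of a group, or return NaN for groups with at most two rows.
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan


# Hypothetical toy data: column 3 is the group key, as in the reproducer.
small = pd.DataFrame({
    1: ["a", "b", "c", "d", "e"],
    2: ["p", "q", "r", "s", "t"],
    3: ["k1", "k1", "k1", "k2", "k2"],
})

out = small.groupby(3, sort=False).agg(
    col1=(2, _format),
    col2=(1, _format),
)
print(out)
```

Group `k1` has three rows, so `col1` and `col2` hold joined strings; group `k2` has two rows, so both are NaN.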

Describe the problem

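The groupby.agg() call in the reproducer is ~3.5x slower on commit 006ccd0 than on 0.12.0: about 57-59 s per call versus about 17 s (see the timings below).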

Source code / logs

0.12.0
UserWarning: The size of /dev/shm is too small (369469804544 bytes). The required size at least half of RAM (405121079296 bytes). Please, delete files in /dev/shm or increase size of /dev/shm with --shm-size in Docker. Also, you can set the required memory size for each Ray worker in bytes to MODIN_MEMORY environment variable.
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
end: 16.87742853164673
end: 16.95181179046631
end: 17.432859182357788


006ccd0
UserWarning: Distributing <class 'numpy.ndarray'> object. This may take some time.
(_apply_func pid=645761)
end: 59.233707427978516
end: 58.498608350753784
end: 57.40613126754761
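
The slowdown can be checked directly from the six timings logged above. This is a minimal sketch that uses only those numbers; the variable names are chosen here for illustration.

```python
from statistics import mean

# Per-call timings in seconds, copied from the logs above.
timings_0_12_0 = [16.87742853164673, 16.95181179046631, 17.432859182357788]
timings_006ccd0 = [59.233707427978516, 58.498608350753784, 57.40613126754761]

# Ratio of the mean run times: how many times slower 006ccd0 is.
slowdown = mean(timings_006ccd0) / mean(timings_0_12_0)
print(f"mean 0.12.0:  {mean(timings_0_12_0):.2f} s")
print(f"mean 006ccd0: {mean(timings_006ccd0):.2f} s")
print(f"slowdown:     {slowdown:.2f}x")
```

The ratio of the means comes out at roughly 3.4x, consistent with the reported ~3.5x.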

Labels

  • Blocked ❌: A pull request that is blocked
  • External: Pull requests and issues from people who do not regularly contribute to modin
  • P2: Minor bugs or low-priority feature requests
  • Performance 🚀: Performance related issues and pull requests.
  • Regression ↩️: Something that used to work but doesn't anymore
  • dependencies 🔗: Issues related to dependencies
