Creating a LocalCluster on a multi-core machine can take a while #2450


Open
mrocklin opened this issue Jan 7, 2019 · 7 comments

Comments

@mrocklin
Member

mrocklin commented Jan 7, 2019

I'm playing with a machine that has 80 logical cores. In this case creating a LocalCluster or raw Client can take a while. I believe that this is because it tries creating 80 workers at once with forkserver.

In [1]: from dask.distributed import Client

In [2]: %time client = Client()
distributed.nanny - WARNING - Worker process still alive after 47 seconds, killing
distributed.nanny - WARNING - Worker process 15200 was killed by unknown signal
<things crash>

There are a few potential approaches to solve this:

  1. We can try to reduce the cost of creating a new process running a worker. My guess is that we're spending a long time importing libraries, but I'm not sure. Someone could investigate this.
  2. We can use the multiprocessing fork approach rather than forkserver (though dask breaks for other reasons here); a config sketch for trying this is at the end of this comment
  3. We can change the mixture of threads and processes as we get to high core counts. For example, specifying the split manually already works fine:

In [1]: from dask.distributed import Client

In [2]: %time client = Client(n_workers=8, threads_per_worker=10)
CPU times: user 984 ms, sys: 252 ms, total: 1.24 s
Wall time: 6.43 s

Probably we should do some combination of 1 and 3. Any thoughts or suggestions?
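
Option 2 above can also be tried today by overriding the worker start method through dask's configuration. A minimal sketch, assuming the distributed.worker.multiprocessing-method config key (check the distributed.yaml shipped with your version) and that the nanny picks it up when the cluster starts:

import dask
from dask.distributed import Client

# Assumed config key -- verify the exact name in your version's distributed.yaml.
# This asks the nanny to start worker processes with fork instead of forkserver,
# which skips the forkserver startup/import overhead but is unsafe if threads are
# already running in the parent process (likely among the "other reasons" above).
with dask.config.set({"distributed.worker.multiprocessing-method": "fork"}):
    client = Client()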

@mrocklin
Member Author

mrocklin commented Jan 7, 2019

cc @mt-jones, who I think ran into this as well.

@mrocklin
Member Author

mrocklin commented Jan 8, 2019

So given a number of cores (like 32) we try to find some factors (like 8 and 4) that we can use for processes and threads. I'm inclined to split the difference and try to find numbers that are roughly equal, with a preference given to processes, especially for low numbers. Intuitively I like the following splits (a rough code sketch follows the list):

  • 4: (4, 1)
  • 8: (8, 1)
  • 12: (6, 2) or (4, 3), don't care
  • 24: (6, 4) or (8, 3), don't care
  • 32: (8, 4)
  • 40: (8, 5)
  • 80: (16, 5) or (10, 8)
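
A minimal sketch of one such heuristic (illustrative only; the function name and the small-core cutoff are assumptions, not necessarily what LocalCluster will actually ship):

import math

def split_cores(n):
    # Pick (n_workers, threads_per_worker) for n cores: a roughly square split,
    # with a preference for processes, especially at low core counts.
    if n <= 8:
        return n, 1
    workers = math.ceil(math.sqrt(n))
    while n % workers:        # walk up to the next divisor, so workers >= threads
        workers += 1
    return workers, n // workers

for n in (4, 8, 12, 24, 32, 40, 80):
    print(n, split_cores(n))  # (4, 1), (8, 1), (4, 3), (6, 4), (8, 4), (8, 5), (10, 8)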

@mrocklin
Member Author

mrocklin commented Jan 8, 2019

cc @jhamman who has some experience with high-core-count nodes.

mrocklin added a commit to mrocklin/distributed that referenced this issue Jan 8, 2019
When we have a large number of available cores we should set the default number
of threads per process to be higher than one.

This implements a policy that aims for a number of processes equal to the
square root of the number of cores, at least above a certain amount of cores.

Partially addresses dask#2450
@mrocklin
Member Author

mrocklin commented Jan 8, 2019

Partially addressed in #2452

@bluecoconut

Happened upon this and wanted to give some quick feedback.

I have been working with 80- and 160-core machines recently, and for me the heuristic that has worked best trends towards fewer workers and more threads (because I am sending millions of tasks).

My heuristic is to take sqrt(cores) and count down from there until I find a good divisor:

import numpy as np

def approx_sqrt_div(num):
    # Walk down from sqrt(num) and return the first divisor found,
    # yielding a (workers, threads) pair with workers <= threads.
    for i in range(int(np.sqrt(num)), 0, -1):
        if num % i == 0:
            return i, num // i

On a 160-core machine this gives (10, 16) (workers, threads), and on an 80-core machine (8, 10).
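
As a hypothetical usage example, plugging that split into Client (same keyword arguments as in the example at the top of this issue; approx_sqrt_div is the helper defined above):

from dask.distributed import Client

n_workers, threads_per_worker = approx_sqrt_div(160)   # (10, 16) on a 160-core box
client = Client(n_workers=n_workers, threads_per_worker=threads_per_worker)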

@jhamman
Member

jhamman commented Jan 11, 2019

I've played around with this a bit as well. In general, I find that we usually want more processes than threads. @bluecoconut's sqrt approach is interesting, but I would always make sure processes >= threads.

@mrocklin
Member Author

Thank you for chiming in with your experience, @bluecoconut. What you say is useful information.

The approach in #2452 is the one that @jhamman suggests. It's reassuring that we all came to almost the same approach. The exact right split will always be workload-dependent, of course.

Regardless, it would be nice to be able to reduce (or at least identify) the cost of creating new workers with forkserver if anyone has spare cycles (no obligation to anyone though of course).
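
If anyone does pick this up, here is a minimal sketch for measuring the bare process-creation cost of each start method (it only times empty processes on Linux and does not capture the library-import cost inside distributed's worker startup, which is probably where most of the time actually goes):

import time
import multiprocessing as mp

def noop():
    pass

def time_start_method(method, n=80):
    # Start and join n no-op processes using the given start method.
    ctx = mp.get_context(method)
    start = time.perf_counter()
    procs = [ctx.Process(target=noop) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    for method in ("fork", "forkserver", "spawn"):
        print(method, "%.2f s" % time_start_method(method))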

mrocklin added a commit that referenced this issue Jan 14, 2019
When we have a large number of available cores we should set the default number
of threads per process to be higher than one.

This implements a policy that aims for a number of processes equal to the
square root of the number of cores, at least above a certain amount of cores.

Partially addresses #2450