Creating a LocalCluster on a multi-core machine can take a while #2450


Open
mrocklin opened this issue Jan 7, 2019 · 7 comments

Comments

@mrocklin
Member

mrocklin commented Jan 7, 2019

I'm playing with a machine that has 80 logical cores. In this case creating a LocalCluster or raw Client can take a while. I believe that this is because it tries creating 80 workers at once with forkserver.

In [1]: from dask.distributed import Client

In [2]: %time client = Client()
distributed.nanny - WARNING - Worker process still alive after 47 seconds, killing
distributed.nanny - WARNING - Worker process 15200 was killed by unknown signal
<things crash>

There are a few potential approaches to solve this:

  1. We can try to reduce the cost of creating a new process running a worker. My guess is that we're spending a long time importing libraries, but I'm not sure. Someone could investigate this.
  2. We can use the multiprocessing fork approach rather than forkserver (though dask breaks for other reasons here); a config sketch for trying this is at the end of this comment
  3. We can change the mixture of threads and processes as we get to high core counts. For example, specifying the split manually already works fine:

In [1]: from dask.distributed import Client

In [2]: %time client = Client(n_workers=8, threads_per_worker=10)
CPU times: user 984 ms, sys: 252 ms, total: 1.24 s
Wall time: 6.43 s

Probably we should do some combination of 1 and 3. Any thoughts or suggestions?
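
Option 2 above can also be tried today by overriding the worker start method through dask's configuration. A minimal sketch, assuming the distributed.worker.multiprocessing-method config key (check the distributed.yaml shipped with your version) and that the nanny picks it up when the cluster starts:

import dask
from dask.distributed import Client

# Assumed config key -- verify the exact name in your version's distributed.yaml.
# This asks the nanny to start worker processes with fork instead of forkserver,
# which skips the forkserver startup/import overhead but is unsafe if threads are
# already running in the parent process (likely among the "other reasons" above).
with dask.config.set({"distributed.worker.multiprocessing-method": "fork"}):
    client = Client()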

@mrocklin
Member Author

mrocklin commented Jan 7, 2019

cc @mt-jones, who I think ran into this as well.

@mrocklin
Member Author

mrocklin commented Jan 8, 2019

So given a number of cores (like 32) we try to find some factors (like 8 and 4) that we can use for processes and threads. I'm inclined to split the difference and try to find numbers that are roughly equal, with a preference given to processes, especially for low numbers. Intuitively I like the following splits (a rough code sketch follows the list):

  • 4: (4, 1)
  • 8: (8, 1)
  • 12: (6, 2) or (4, 3), don't care
  • 24: (6, 4) or (8, 3), don't care
  • 32: (8, 4)
  • 40: (8, 5)
  • 80: (16, 5) or (10, 8)
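
A minimal sketch of one such heuristic (illustrative only; the function name and the small-core cutoff are assumptions, not necessarily what LocalCluster will actually ship):

import math

def split_cores(n):
    # Pick (n_workers, threads_per_worker) for n cores: a roughly square split,
    # with a preference for processes, especially at low core counts.
    if n <= 8:
        return n, 1
    workers = math.ceil(math.sqrt(n))
    while n % workers:        # walk up to the next divisor, so workers >= threads
        workers += 1
    return workers, n // workers

for n in (4, 8, 12, 24, 32, 40, 80):
    print(n, split_cores(n))  # (4, 1), (8, 1), (4, 3), (6, 4), (8, 4), (8, 5), (10, 8)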

@mrocklin
Member Author

mrocklin commented Jan 8, 2019

cc @jhamman who has some experience with high-core-count nodes.

mrocklin added a commit to mrocklin/distributed that referenced this issue Jan 8, 2019
When we have a large number of available cores we should set the default number
of threads per process to be higher than one.

This implements a policy that aims for a number of processes equal to the
square root of the number of cores, at least above a certain amount of cores.

Partially addresses dask#2450
@mrocklin
Member Author

mrocklin commented Jan 8, 2019

Partially addressed in #2452

@bluecoconut

Happened upon this and wanted to give some quick feedback.

I have been working with 80- and 160-core machines recently, and for me the heuristic that has worked best trends towards fewer workers and more threads (because I am sending millions of tasks).

My heuristic is to take sqrt(cores) and count down from there until I find a good divisor:

import numpy as np

def approx_sqrt_div(num):
    # Walk down from sqrt(num) and return the first divisor found,
    # yielding a (workers, threads) pair with workers <= threads.
    for i in range(int(np.sqrt(num)), 0, -1):
        if num % i == 0:
            return i, num // i

On a 160-core machine this gives (10, 16) (workers, threads), and on an 80-core machine (8, 10).
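
As a hypothetical usage example, plugging that split into Client (same keyword arguments as in the example at the top of this issue; approx_sqrt_div is the helper defined above):

from dask.distributed import Client

n_workers, threads_per_worker = approx_sqrt_div(160)   # (10, 16) on a 160-core box
client = Client(n_workers=n_workers, threads_per_worker=threads_per_worker)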

@jhamman
Member

jhamman commented Jan 11, 2019

I've played around with this a bit as well. In general, I find that we usually want more processes than threads. @bluecoconut's sqrt approach is interesting, but I would always make sure processes >= threads.

@mrocklin
Member Author

Thank you for chiming in with your experience, @bluecoconut. What you say is useful information.

The approach in #2452 is the one that @jhamman suggests. It's reassuring that we all came to almost the same approach. The exact right split will always be workload-dependent, of course.

Regardless, it would be nice to be able to reduce (or at least identify) the cost of creating new workers with forkserver if anyone has spare cycles (no obligation to anyone though of course).
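
If anyone does pick this up, here is a minimal sketch for measuring the bare process-creation cost of each start method (it only times empty processes on Linux and does not capture the library-import cost inside distributed's worker startup, which is probably where most of the time actually goes):

import time
import multiprocessing as mp

def noop():
    pass

def time_start_method(method, n=80):
    # Start and join n no-op processes using the given start method.
    ctx = mp.get_context(method)
    start = time.perf_counter()
    procs = [ctx.Process(target=noop) for _ in range(n)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    for method in ("fork", "forkserver", "spawn"):
        print(method, "%.2f s" % time_start_method(method))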

mrocklin added a commit that referenced this issue Jan 14, 2019
When we have a large number of available cores we should set the default number
of threads per process to be higher than one.

This implements a policy that aims for a number of processes equal to the
square root of the number of cores, at least above a certain amount of cores.

Partially addresses #2450