
dask_ml.decomposition.PCA: ValueError with data > 1 TB #592

Closed
@demaheim

Description


When I call dask_ml.decomposition.PCA().fit(x) on an array x larger than 1 TB, it fails with ValueError: output array is read-only.

I use

dask-ml                   1.1.1
distributed               2.9.0

The script

from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask_ml.decomposition import PCA
import dask.array as da

cluster = SLURMCluster()
nb_workers = 58
cluster.scale(nb_workers)
client = Client(cluster)
client.wait_for_workers(nb_workers)

x = da.random.random((1000000, 140000), chunks=(100000, 2000))
pca = PCA(n_components=64)
pca.fit(x)

gives the error

Traceback (most recent call last):
  File "value_error.py", line 48, in <module>
    pca.fit(x)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 338, in _fit
    raise e
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py", line 516, in svd_flip
    v *= signs[:, np.newaxis]
ValueError: output array is read-only

Note that

  • If I use x = da.random.random((1000000, 130000), chunks=(100000, 2000)) (1.0 TB), the error does not appear.
  • When I look at the dashboard, the PCA seems to run fine and the error appears at the very end of the computation.
  • I temporarily fixed the error in sklearn's extmath.py by changing

def svd_flip(u, v, u_based_decision=True):
    if u_based_decision:
        # columns of u, rows of v
        max_abs_cols = np.argmax(np.abs(u), axis=0)
        signs = np.sign(u[max_abs_cols, range(u.shape[1])])
        u *= signs
-        v *= signs[:, np.newaxis]
+        v_copy = np.copy(v)
+        v_copy *= signs[:, np.newaxis]
+        return u, v_copy
    else:

I don't think this is a good fix, though, because I assume the array v is deliberately made read-only by another function.
Is there a better way to fix the error?
