Description
When I do dask_ml.decomposition.PCA().fit(x), where the array x has a size > 1 TB, I get the error ValueError: output array is read-only.
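For scale, the sizes involved can be checked with a little arithmetic. This is a minimal sketch that assumes 8-byte float64 elements (the default for da.random.random) and uses the two array shapes and the chunk shape that appear in this report; dense_size_bytes is a hypothetical helper name, not a Dask function.

```python
import numpy as np


def dense_size_bytes(shape, dtype=np.float64):
    """Return the in-memory size of a dense array with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * np.dtype(dtype).itemsize


# Shapes taken from this report; float64 is an assumption about the dtype.
failing_bytes = dense_size_bytes((1000000, 140000))  # shape that raises the error
passing_bytes = dense_size_bytes((1000000, 130000))  # shape that works
chunk_bytes = dense_size_bytes((100000, 2000))       # one chunk

print(f"failing array: {failing_bytes / 1e12:.2f} TB")
print(f"passing array: {passing_bytes / 1e12:.2f} TB")
print(f"single chunk:  {chunk_bytes / 1e9:.2f} GB")
```

Under these assumptions the two shapes differ by only about 0.08 TB, so the threshold between working and failing lies somewhere between them.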
I use dask-ml 1.1.1 and distributed 2.9.0.
The script
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
from dask_ml.decomposition import PCA
import dask.array as da
cluster = SLURMCluster()
nb_workers = 58
cluster.scale(nb_workers)
client = Client(cluster)
client.wait_for_workers(nb_workers)
x = da.random.random((1000000, 140000), chunks=(100000, 2000))
pca = PCA(n_components=64)
pca.fit(x)
gives the error
Traceback (most recent call last):
  File "value_error.py", line 48, in <module>
    pca.fit(x)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 190, in fit
    self._fit(X)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 338, in _fit
    raise e
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask_ml/decomposition/pca.py", line 325, in _fit
    singular_values,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/dask/base.py", line 436, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 2573, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1873, in gather
    asynchronous=asynchronous,
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 768, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 334, in sync
    raise exc.with_traceback(tb)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/utils.py", line 318, in f
    result[0] = yield future
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/distributed/client.py", line 1729, in _gather
    raise exception.with_traceback(traceback)
  File "/home/dheim/miniconda3/lib/python3.7/site-packages/sklearn/utils/extmath.py", line 516, in svd_flip
    v *= signs[:, np.newaxis]
ValueError: output array is read-only
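The failing statement is an in-place multiplication, and NumPy rejects in-place writes into an array whose writeable flag is off. The sketch below reproduces that behaviour with plain NumPy only. It assumes that v reaches svd_flip in a read-only state and does not show why that happens at this array size; the small arrays are arbitrary stand-ins.

```python
import numpy as np

# Arbitrary stand-ins for the v matrix and the sign vector used in svd_flip.
v = np.arange(6, dtype=float).reshape(2, 3)
signs = np.array([1.0, -1.0])

# Assumption: mimic the state of v in the report by marking it read-only.
v.setflags(write=False)

try:
    v *= signs[:, np.newaxis]  # same in-place operation as in the traceback
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("in-place multiply raised:", error_message)

# An out-of-place multiplication allocates a new array, so it still works.
flipped = v * signs[:, np.newaxis]
print(flipped)
```

The out-of-place form at the end never writes into v, which is why copying v avoids the exception.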
Note that
- If I use x = da.random.random((1000000, 130000), chunks=(100000, 2000)) (1.0 TB), the error does not appear.
- When I look at the dashboard, the PCA seems to run fine and the error appears at the very end of the computation.
- I temporarily fixed the error in extmath.py by changing
   def svd_flip(u, v, u_based_decision=True):
       if u_based_decision:
           # columns of u, rows of v
           max_abs_cols = np.argmax(np.abs(u), axis=0)
           signs = np.sign(u[max_abs_cols, range(u.shape[1])])
           u *= signs
  -        v *= signs[:, np.newaxis]
  +        v_copy = np.copy(v)
  +        v_copy *= signs[:, np.newaxis]
  +        return u, v_copy
       else:
I don't think this is a good fix, because I assume that the array v is blocked by another function and my change only works around that.
Is there another way to fix the error?
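One conceivable variant of the workaround, shown only as a sketch: copy an array only when NumPy reports it as non-writeable, so that writeable inputs keep the original in-place behaviour. The names ensure_writeable and svd_flip_u_based are hypothetical helpers written for this illustration; they are not part of scikit-learn or dask-ml, and the sketch covers only the u-based branch shown in the diff.

```python
import numpy as np


def ensure_writeable(arr):
    """Return arr itself if it is writeable, otherwise a writeable copy."""
    return arr if arr.flags.writeable else arr.copy()


def svd_flip_u_based(u, v):
    """Hypothetical u-based sign flip that tolerates read-only inputs."""
    # Columns of u, rows of v: pick the sign of the largest-magnitude entry.
    max_abs_cols = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_cols, range(u.shape[1])])
    u = ensure_writeable(u)
    v = ensure_writeable(v)
    u *= signs
    v *= signs[:, np.newaxis]
    return u, v


# Small example with read-only inputs (values are arbitrary).
u_in = np.array([[1.0, -3.0], [-2.0, 1.0]])
v_in = np.array([[1.0, 2.0], [3.0, 4.0]])
u_in.setflags(write=False)
v_in.setflags(write=False)

u_out, v_out = svd_flip_u_based(u_in, v_in)
print(u_out)
print(v_out)
```

This still pays for a copy whenever the input is read-only, so it does not address why v is read-only in the first place.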