# DaskML vs. Sklearn LogisticRegression. Coefficients seem different. #860
Is an average coefficient difference of 0.008 large? That's what …
I ran the same experiment but now with 4 features.

```python
import time

from rich import print
import numpy as np
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000):
    X, y = make_classification(n_samples=size, n_features=4, n_informative=4, chunksize=1000)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic

    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time,
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_dask': mod_dask.coef_,
        'coef_sklearn': mod_sklearn.coef_,
    }

for size in [1000, 2000]:
    run_experiment(size=size)
```

I got these results.

```
{
    'size': 1000,
    'time_dask': 0.47808194160461426,
    'time_sklearn': 0.004954099655151367,
    'acc_dask': 0.645,
    'acc_sklearn': 0.645,
    'coef_dask': array([-0.04967113, -0.43258482,  0.10778322, -0.67010295]),
    'coef_sklearn': array([[-0.43595994,  0.10862159, -0.67551076,  0.05938726]])
}
{
    'size': 2000,
    'time_dask': 6.170458555221558,
    'time_sklearn': 0.008183717727661133,
    'acc_dask': 0.796,
    'acc_sklearn': 0.796,
    'coef_dask': array([ 0.01435567, -1.61027525, -0.04957294, -0.96870719]),
    'coef_sklearn': array([[-1.61070132, -0.04759224, -0.96795166, -1.03047095]])
}
```

Looking at these results, the coefficients seem seriously skewed, but I wonder ... it also looks like they appear in a different order. Would you agree?
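As a quick check of the ordering hunch (this snippet is my own aside, reusing the size-1000 numbers above): shifting the Dask coefficients left by one position lines the first three entries up with scikit-learn's, leaving one stray value that is presumably the misplaced intercept.

```python
# Aside: compare the size-1000 coefficients above after shifting the Dask
# vector left by one position. The first three entries then agree closely.
import numpy as np

coef_dask = np.array([-0.04967113, -0.43258482, 0.10778322, -0.67010295])
coef_sklearn = np.array([-0.43595994, 0.10862159, -0.67551076, 0.05938726])

print(np.roll(coef_dask, -1))  # [-0.43258482  0.10778322 -0.67010295 -0.04967113]
print(coef_sklearn)            # [-0.43595994  0.10862159 -0.67551076  0.05938726]
```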
Hm... I've modified the script a bit below, and get different results when I train Dask-ML's `LogisticRegression` on NumPy arrays vs. Dask arrays:

```
$ python train_glm.py
Training on NumPy arrays, rel_error = 0.3344027325490487
Training on Dask array of same data, rel_error = 1.8500723840924802
```

Is that expected @TomAugspurger? I'm not seeing any tests to make sure Scikit-learn and Dask-ML linear models produce the same results.

Test script (with hidden warning):

```python
import numpy as np
from sklearn.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
import dask.array as da
import numpy.linalg as LA

if __name__ == "__main__":
    n, d = 1000, 10
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d // 2,
        n_redundant=d // 2,
        random_state=42,
    )
    kwargs = dict(max_iter=10_000, random_state=68, C=1e8, tol=0e-10)  # note: 0e-10 == 0.0

    baseline = SklearnLogReg(solver="lbfgs", **kwargs)
    baseline.fit(X, y)

    est = DaskLogReg(solver="lbfgs", **kwargs)
    est.fit(X, y)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on NumPy arrays, rel_error =", rel_error)

    X_dist = da.from_array(X, chunks=(n, d))
    y_dist = da.from_array(y, chunks=n)
    assert X_dist.npartitions == y_dist.npartitions == 1
    est.fit(X_dist, y_dist)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on Dask arrays, rel_error =", rel_error)
```

I also get a warning.
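One editorial aside (not part of the original comment): a `rel_error` of ~1.85 is far too large to be optimizer noise, and it is roughly what a one-position rotation of the coefficient vector would produce.

```python
# Aside: the relative error between a random vector and a one-position
# rotation of itself is about sqrt(2) ~= 1.41 -- the same order as the 1.85
# reported above, consistent with an ordering bug rather than poor convergence.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)
print(np.linalg.norm(w - np.roll(w, 1)) / np.linalg.norm(w))
```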
I'm not too familiar with the GLM code; that would be Chris White. I do recall noticing that the default method can take a really long time to converge.
I am facing a similar problem. Are you trying it on multiple classes?
I have been using the …
I was also experiencing the same issue. I believe that this is due to the assignment of the intercept and the multiple dispatch of `add_intercept`: the intercept ends up in a different position of `_coef` depending on whether the input is a NumPy array or a Dask array.

```
$ python train_glm.py
NumPy array _coef: [ 0.44225  0.65241  0.28006 -0.24262  0.61215]
Dask array _coef: [ 0.61215  0.44225  0.65241  0.28006 -0.24262]
Baseline coef_: [ 0.44225  0.65241  0.28006 -0.24262]
NumPy array coef_: [ 0.44225  0.65241  0.28006 -0.24262]
Dask array coef_: [0.61215 0.44225 0.65241 0.28006]
Baseline intercept_: 0.61215
NumPy array intercept_: 0.61215
Dask array intercept_: -0.24262
NumPy Xbeta: -1.34381
Dask Xbeta: -1.34381
```

Test script
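The test script itself is collapsed in the original thread. To make the failure mode concrete, here is my own minimal sketch instead; the slicing mimics (as an assumption) a wrapper that treats the last entry of `_coef` as the intercept, which reproduces the scrambled output above exactly.

```python
# Sketch: if the solver returns _coef with the intercept *first* (Dask path)
# but the wrapper assumes it is *last* (NumPy path), coef_ and intercept_
# come out scrambled, matching the reported output.
import numpy as np

_coef_numpy = np.array([0.44225, 0.65241, 0.28006, -0.24262, 0.61215])  # ones column appended
_coef_dask = np.array([0.61215, 0.44225, 0.65241, 0.28006, -0.24262])   # ones column prepended

for name, _coef in [("NumPy", _coef_numpy), ("Dask", _coef_dask)]:
    coef_, intercept_ = _coef[:-1], _coef[-1]  # assumes the intercept is last
    print(f"{name} coef_: {coef_}  intercept_: {intercept_}")
```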
Thanks for that output @kchare. I think I've found the issue:

Details: `dask-ml/dask_ml/linear_model/glm.py`, line 191 in `1e811ce`
Script showing ordering issue:

```python
>>> import numpy as np
>>> import dask.array as da
>>> from dask_ml.linear_model.utils import add_intercept
>>> X = np.array([[2, 2], [3, 3], [4, 4]])
>>> Y = da.from_array(X)
>>> Xt = add_intercept(X)
>>> Yt = add_intercept(Y).compute()
>>> print(Xt)
[[2. 2. 1.]
 [3. 3. 1.]
 [4. 4. 1.]]
>>> print(Yt)
[[1 2 2]
 [1 3 3]
 [1 4 4]]
>>> assert np.allclose(Xt, Yt)  # fails: the ones column is in a different position
```

I think this diff in `dask_ml/linear_model/utils.py` fixes it:

```diff
 @dispatch(np.ndarray)
 def add_intercept(X):
-    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
+    return _add_intercept(X)


 def _add_intercept(x):
     ones = np.ones((x.shape[0], 1), dtype=x.dtype)
-    return np.concatenate([ones, x], axis=1)
+    return np.concatenate([x, ones], axis=1)
```

Dask-GLM also has some utils, and the column ordering is specified correctly there.
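For completeness, a minimal standalone sketch of the intended post-fix behavior; it mirrors the diff above but uses `map_blocks` directly rather than dask-ml's actual dispatch machinery, so names here are illustrative only.

```python
# Sketch: after the fix, both the NumPy and the Dask code paths append the
# ones column *last*, so the two results agree.
import numpy as np
import dask.array as da

def _add_intercept(x):
    ones = np.ones((x.shape[0], 1), dtype=x.dtype)
    return np.concatenate([x, ones], axis=1)

X = np.array([[2, 2], [3, 3], [4, 4]])
Xt = _add_intercept(X)

Y = da.from_array(X, chunks=3)
Yt = Y.map_blocks(_add_intercept, dtype=Y.dtype, chunks=(3, 3)).compute()

assert np.allclose(Xt, Yt)  # ones column is last in both
```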
Thank you for tracing that @stsievert. I agree that Dask-GLM defaults to the final coefficient being the intercept. I am happy to create a PR to implement the diff that you have suggested.
I'd love to see a PR for that! I think it'd be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn's solution (and to be frank, I'm surprised that test is missing!).
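Something like the following could serve as that test. This is a sketch under assumed settings: weak regularization, tight tolerance, and a 5% error threshold, none of which come from the thread.

```python
# Hypothetical convergence test: dask-ml's LogisticRegression should land
# close to scikit-learn's solution on the same, effectively unregularized,
# problem. The 0.05 threshold is an assumption, not an established value.
import numpy as np
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from dask_ml.linear_model import LogisticRegression as DaskLogReg

def test_converges_to_sklearn():
    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    kwargs = dict(C=1e4, tol=1e-8, max_iter=10_000)
    ref = SklearnLogReg(**kwargs).fit(X, y)
    est = DaskLogReg(**kwargs).fit(
        da.from_array(X, chunks=(500, 5)), da.from_array(y, chunks=500)
    )
    # ravel() because sklearn returns coef_ with shape (1, d) while
    # dask-ml returns a 1-D array, as seen in the outputs above
    rel_error = np.linalg.norm(ref.coef_.ravel() - est.coef_.ravel())
    rel_error /= np.linalg.norm(ref.coef_)
    assert rel_error < 0.05
```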
The coefficients from `dask_ml.linear_model.LogisticRegression` seem different from those of `sklearn.linear_model.LogisticRegression`. The script below demonstrates a benchmark that shows this.

The output in `data_hindsight` looks like this:

Although the accuracy on the train set seems to be roughly the same, I cannot help but wonder why the `coef_diff` is so large. I understand that the internal optimizer is bound to be different in a parallel setting, so there's got to be some difference here, but the difference seems too big for my gut feeling.

Environment: