DaskML vs. Sklearn LogisticRegression. Coefficients seem different. #860

Open · koaning opened this issue Oct 11, 2021 · 10 comments

@koaning commented Oct 11, 2021:

The coefficients from dask_ml.linear_model.LogisticRegression seem different from those of sklearn.linear_model.LogisticRegression.

The benchmark script below demonstrates this.

# %pip install dask_ml memo
import time
import numpy as np
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg 
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000, chunksize=1_000):
    X, y = make_classification(n_samples=size, n_features=20, n_informative=10, chunksize=chunksize)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic 

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic
    
    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time, 
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_diff': np.abs(mod_sklearn.coef_ - mod_dask.coef_).sum() # This is what I'm interested in understanding.
    }

for size in [1000, 2000]:
    for chunksize in [500, 1000]:
        run_experiment(size=size, chunksize=chunksize)

The output in data_hindsight looks like this:

{'size': 1000, 'chunksize': 500, 'time_dask': 1.2568867206573486, 'time_sklearn': 0.003863811492919922, 'acc_dask': 0.795, 'acc_sklearn': 0.795, 'coef_diff': 7.47177509238394}
{'size': 1000, 'chunksize': 1000, 'time_dask': 0.09700465202331543, 'time_sklearn': 0.005501270294189453, 'acc_dask': 0.784, 'acc_sklearn': 0.784, 'coef_diff': 8.645457699298978}
{'size': 2000, 'chunksize': 500, 'time_dask': 2.3598458766937256, 'time_sklearn': 0.007267951965332031, 'acc_dask': 0.757, 'acc_sklearn': 0.7605, 'coef_diff': 5.962742662561093}
{'size': 2000, 'chunksize': 1000, 'time_dask': 1.5501070022583008, 'time_sklearn': 0.010750293731689453, 'acc_dask': 0.7985, 'acc_sklearn': 0.7995, 'coef_diff': 8.450044499024445}

Although the accuracy on the training set is roughly the same, I cannot help but wonder why coef_diff is so large. I understand that the internal optimizer is bound to differ in a parallel setting, so some difference is expected, but it seems too big for my gut feeling.

Environment:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

dask==2021.9.1
dask-glm==0.2.0
dask-ml==1.9.0
scikit-learn==0.24.2
@stsievert (Member) commented:

Is an average coefficient difference of 0.008 large? That's what np.abs(mod_sklearn.coef_ - mod_dask.coef_).mean() would produce. Yes, the sum is large (8) but the number of examples is much larger (1000). Typically, when I examine convergence I look at the relative error, norm(coef_est - coef_true) / norm(coef_true).
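In code, that metric is roughly the following (a minimal sketch; rel_error is just a helper name, with coef_true standing in for whatever reference solution you trust, e.g. Scikit-learn's):

import numpy as np
from numpy.linalg import norm

def rel_error(coef_est, coef_true):
    # Relative error of an estimated coefficient vector against a reference;
    # ravel() because sklearn stores coef_ with shape (1, n_features).
    return norm(np.ravel(coef_est) - np.ravel(coef_true)) / norm(coef_true)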

@koaning (Author) commented Oct 11, 2021:

I ran the same experiment but now with 4 features.

import time
from rich import print
import numpy as np
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg 
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000):
    X, y = make_classification(n_samples=size, n_features=4, n_informative=4, chunksize=1000)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic 

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic
    
    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time, 
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_dask': mod_dask.coef_,
        'coef_sklearn': mod_sklearn.coef_ 
    }

for size in [1000, 2000]:
    run_experiment(size=size)

I got these results.

{
    'size': 1000,
    'time_dask': 0.47808194160461426,
    'time_sklearn': 0.004954099655151367,
    'acc_dask': 0.645,
    'acc_sklearn': 0.645,
    'coef_dask': array([-0.04967113, -0.43258482,  0.10778322, -0.67010295]),
    'coef_sklearn': array([[-0.43595994,  0.10862159, -0.67551076,  0.05938726]])
}
{
    'size': 2000,
    'time_dask': 6.170458555221558,
    'time_sklearn': 0.008183717727661133,
    'acc_dask': 0.796,
    'acc_sklearn': 0.796,
    'coef_dask': array([ 0.01435567, -1.61027525, -0.04957294, -0.96870719]),
    'coef_sklearn': array([[-1.61070132, -0.04759224, -0.96795166, -1.03047095]])
}

Looking at these results, the coefficients seem seriously skewed, but I wonder ... it also looks like they appear in a different order. Would you agree?
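A quick throwaway check of that apparent shift, using the size=1000 coefficients above:

import numpy as np

coef_dask = np.array([-0.04967113, -0.43258482, 0.10778322, -0.67010295])
coef_sklearn = np.array([-0.43595994, 0.10862159, -0.67551076, 0.05938726])

# Rotating the Dask coefficients left by one nearly reproduces sklearn's
# first three entries; only the wrapped-around last entry disagrees.
print(np.roll(coef_dask, -1) - coef_sklearn)
# -> roughly [ 0.003, -0.001,  0.005, -0.109]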

@stsievert (Member) commented:

Hm... I've modified the script a bit below, and get different results when I train Dask-ML's LogisticRegression on NumPy arrays and a Dask array of the same data:

$ python train_glm.py
Training on NumPy arrays, rel_error = 0.3344027325490487
Training on Dask array of same data, rel_error = 1.8500723840924802

Is that expected @TomAugspurger? I'm not seeing any tests to make sure Scikit-learn and Dask-ML linear models produce the same results.

Test script (the warning it produces is shown below):
import numpy as np
from sklearn.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
import dask.array as da
import numpy.linalg as LA


if __name__ == "__main__":
    n, d = 1000, 10
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d // 2,
        n_redundant=d // 2,
        random_state=42,
    )

    kwargs = dict(max_iter=10_000, random_state=68, C=1e8, tol=0e-10)

    baseline = SklearnLogReg(solver="lbfgs", **kwargs)
    baseline.fit(X, y)

    est = DaskLogReg(solver="lbfgs", **kwargs)
    est.fit(X, y)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on NumPy arrays, rel_error =", rel_error)

    X_dist = da.from_array(X, chunks=(n, d))
    y_dist = da.from_array(y, chunks=n)
    assert X_dist.npartitions == y_dist.npartitions == 1
    est.fit(X_dist, y_dist)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on Dask arrays, rel_error =", rel_error)

I also get a warning:

/Users/scott/anaconda3/lib/python3.8/site-packages/dask/config.py:588: UserWarning: Configuration key "fuse_ave_width" has been deprecated. Please use "optimization.fuse.ave-width" instead
  warnings.warn(

@TomAugspurger (Member) commented:

I'm not too familiar with the glm code. That would be Chris White.

I do recall noticing that the default method can take a really long time to converge.

@bingoko commented Oct 12, 2021:

I am facing a similar problem. Are you trying it with multiple classes?
It seems like dask-ml's LogisticRegression doesn't work correctly for multiclass classification.

@koaning (Author) commented Oct 12, 2021:

I have been using the make_classification from dask_glm, which only outputs two classes: [True, False].

@kchare (Contributor) commented Feb 10, 2022:

I was experiencing the same issue. I believe it is due to the assignment of the intercept and the multiple dispatch of dask_glm.utils.add_intercept(). I have adapted @stsievert's experiment above with a few small modifications, and it seems that the type of array determines whether the intercept column is concatenated as the first or the final column of the design matrix; dask_glm only takes the final entry as the intercept.

Because the decision_function() and predict() methods rely on add_intercept() as well, though, this doesn't affect the decision function, as shown for the first simulated data point.

$ python train_glm.py
NumPy array _coef: [ 0.44225  0.65241  0.28006 -0.24262  0.61215]
Dask array _coef: [ 0.61215  0.44225  0.65241  0.28006 -0.24262]

Baseline coef_: [ 0.44225  0.65241  0.28006 -0.24262]
NumPy array coef_: [ 0.44225  0.65241  0.28006 -0.24262]
Dask array coef_: [0.61215 0.44225 0.65241 0.28006]

Baseline intercept_: 0.61215
NumPy array intercept_: 0.61215
Dask array intercept_: -0.24262

NumPy Xbeta: -1.34381
Dask Xbeta: -1.34381

Test script:
import numpy as np
from sklearn.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
import dask.array as da


if __name__ == "__main__":
    n, d = 1000, 4
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d,
        n_redundant=0,
        random_state=42,
    )

    kwargs = dict(max_iter=10_000, random_state=68, C=1e8, tol=0e-10)

    baseline = SklearnLogReg(solver="lbfgs", **kwargs)
    baseline.fit(X, y)

    np_est = DaskLogReg(solver="lbfgs", **kwargs)
    np_est.fit(X, y)
    
    X_dist = da.from_array(X, chunks=(n, d))
    y_dist = da.from_array(y, chunks=n)
    assert X_dist.npartitions == y_dist.npartitions == 1
    
    dask_est = DaskLogReg(solver="lbfgs", **kwargs)
    dask_est.fit(X_dist, y_dist)

    np.set_printoptions(precision=5)
    print(f"NumPy array _coef: {np_est._coef}")
    print(f"Dask array _coef: {dask_est._coef}\n")
    
    print(f"Baseline coef_: {baseline.coef_[0]}")
    print(f"NumPy array coef_: {np_est.coef_}")
    print(f"Dask array coef_: {dask_est.coef_}\n")

    print(f"Baseline intercept_: {baseline.intercept_[0]:0.5f}")
    print(f"NumPy array intercept_: {np_est.intercept_:0.5f}")
    print(f"Dask array intercept_: {dask_est.intercept_:0.5f}\n")

    print(f"NumPy Xbeta: {np_est.decision_function(X)[0]:0.5f}")
    print(f"Dask Xbeta: {dask_est.decision_function(X_dist)[0].compute():0.5f}")

@stsievert (Member) commented Feb 14, 2022:

Thanks for that output @kchare. I think I've found the issue:

  1. Dask-ML's utils.add_intercept adds the column of ones representing the intercept as the first or the last column, depending on whether the input is a Dask or a NumPy array (respectively).
  2. But... the estimator always assumes the intercept is the last coefficient:

Details:

self.intercept_ = self._coef[-1]

Script showing the ordering issue:
>>> import numpy as np
>>> import dask.array as da
>>> from dask_ml.linear_model.utils import add_intercept
>>> X = np.array([[2, 2], [3, 3], [4, 4]])
>>> Y = da.from_array(X)
>>> Xt = add_intercept(X)
>>> Yt = add_intercept(Y).compute()
>>> print(Xt)
[[2. 2. 1.]
 [3. 3. 1.]
 [4. 4. 1.]]
>>> print(Yt)
[[1 2 2]
 [1 3 3]
 [1 4 4]]
>>> assert np.allclose(Xt, Yt)  # AssertionError: the column orderings differ

I think this diff in dask_ml/linear_model/utils.py resolves this issue:

 @dispatch(np.ndarray)
 def add_intercept(X):
-    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
+    return _add_intercept(X)

 def _add_intercept(x):
     ones = np.ones((x.shape[0], 1), dtype=x.dtype)
-    return np.concatenate([ones, x], axis=1)
+    return np.concatenate([x, ones], axis=1)
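
With this change both dispatches append the column of ones as the last column, which matches the intercept_ = self._coef[-1] assumption above; the REPL session would then pass its allclose check.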

Dask-GLM also has some utils, and the column ordering is specified correctly there.

@kchare (Contributor) commented Feb 28, 2022:

Thank you for tracing that @stsievert; I agree that Dask-GLM defaults to treating the final coefficient as the intercept. I am happy to create a PR implementing the diff you suggested in dask_ml/linear_model/utils.py, if that is an acceptable solution to this issue.

@stsievert (Member) commented:

I'd love to see a PR for that! I think it'd be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn's solution (and to be frank, I'm surprised that test is missing!).
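
Roughly, such a test might look like this (a hypothetical sketch, not an existing test; the names, sizes, and tolerance are made up):

import numpy as np
import dask.array as da
from numpy.linalg import norm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from dask_ml.linear_model import LogisticRegression as DaskLogReg


def test_logistic_regression_converges_to_sklearn():
    # Weak regularization and many iterations so both solvers approach
    # the same (near-unregularized) optimum.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    ref = SklearnLogReg(C=1e8, max_iter=10_000).fit(X, y)

    X_da = da.from_array(X, chunks=(250, 8))
    y_da = da.from_array(y, chunks=250)
    est = DaskLogReg(C=1e8, max_iter=10_000).fit(X_da, y_da)

    rel_error = norm(np.ravel(ref.coef_) - np.ravel(est.coef_)) / norm(ref.coef_)
    assert rel_error < 0.01  # tolerance is a judgment call

This is exactly the kind of test that would have caught the ordering bug above.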
