DaskML vs. Sklearn LogisticRegression. Coefficients seem different. #860

Open · koaning opened this issue Oct 11, 2021 · 10 comments

@koaning commented Oct 11, 2021:

The coefficients from dask_ml.linear_model.LogisticRegression seem different from those of sklearn.linear_model.LogisticRegression.

The benchmark script below demonstrates this.

# %pip install dask_ml memo
import time
import numpy as np
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg 
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000, chunksize=1_000):
    X, y = make_classification(n_samples=size, n_features=20, n_informative=10, chunksize=chunksize)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic 

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic
    
    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time, 
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_diff': np.abs(mod_sklearn.coef_ - mod_dask.coef_).sum() # This is what I'm interested in understanding.
    }

for size in [1000, 2000]:
    for chunksize in [500, 1000]:
        run_experiment(size=size, chunksize=chunksize)

The output in data_hindsight looks like this:

{'size': 1000, 'chunksize': 500, 'time_dask': 1.2568867206573486, 'time_sklearn': 0.003863811492919922, 'acc_dask': 0.795, 'acc_sklearn': 0.795, 'coef_diff': 7.47177509238394}
{'size': 1000, 'chunksize': 1000, 'time_dask': 0.09700465202331543, 'time_sklearn': 0.005501270294189453, 'acc_dask': 0.784, 'acc_sklearn': 0.784, 'coef_diff': 8.645457699298978}
{'size': 2000, 'chunksize': 500, 'time_dask': 2.3598458766937256, 'time_sklearn': 0.007267951965332031, 'acc_dask': 0.757, 'acc_sklearn': 0.7605, 'coef_diff': 5.962742662561093}
{'size': 2000, 'chunksize': 1000, 'time_dask': 1.5501070022583008, 'time_sklearn': 0.010750293731689453, 'acc_dask': 0.7985, 'acc_sklearn': 0.7995, 'coef_diff': 8.450044499024445}

Although the accuracy on the training set is roughly the same, I cannot help but wonder why coef_diff is so large. I understand that the internal optimizer is bound to differ in a parallel setting, so some difference is expected, but it seems too big for my gut feeling.

Environment:

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.27.0

Compiler    : GCC 9.3.0
OS          : Linux
Release     : 5.11.0-7614-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 12
Architecture: 64bit

dask==2021.9.1
dask-glm==0.2.0
dask-ml==1.9.0
scikit-learn==0.24.2
@stsievert (Member) commented:

Is an average coefficient difference of 0.008 large? That's what np.abs(mod_sklearn.coef_ - mod_dask.coef_).mean() would produce. Yes, the sum is large (8) but the number of examples is much larger (1000). Typically, when I examine convergence I look at the relative error, norm(coef_est - coef_true) / norm(coef_true).
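In code, that metric is roughly the following (a minimal sketch; rel_error is just a helper name, with coef_true standing in for whatever reference solution you trust, e.g. Scikit-learn's):

import numpy as np
from numpy.linalg import norm

def rel_error(coef_est, coef_true):
    # Relative error of an estimated coefficient vector against a reference;
    # ravel() because sklearn stores coef_ with shape (1, n_features).
    return norm(np.ravel(coef_est) - np.ravel(coef_true)) / norm(coef_true)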

@koaning (Author) commented Oct 11, 2021:

I ran the same experiment but now with 4 features.

import time
from rich import print
import numpy as np
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg 
from memo import grid, memlist, memfunc

data_hindsight = []

@memfunc(print)
@memlist(data=data_hindsight)
def run_experiment(size=1_000):
    X, y = make_classification(n_samples=size, n_features=4, n_informative=4, chunksize=1000)

    # Run everything in Dask
    tic = time.time()
    mod_dask = DaskLogReg()
    mod_dask.fit(X, y)
    dask_time = time.time() - tic 

    # Run everything in Scikit-Learn
    X_np, y_np = np.asarray(X), np.asarray(y)
    mod_sklearn = SklearnLogReg()
    tic = time.time()
    mod_sklearn.fit(X_np, y_np)
    sklearn_time = time.time() - tic
    
    # Return stats that are of interest
    return {
        'time_dask': dask_time,
        'time_sklearn': sklearn_time, 
        'acc_dask': (mod_dask.predict(X) == y).mean().compute(),
        'acc_sklearn': (mod_sklearn.predict(X_np) == y_np).mean(),
        'coef_dask': mod_dask.coef_,
        'coef_sklearn': mod_sklearn.coef_ 
    }

for size in [1000, 2000]:
    run_experiment(size=size)

I got these results.

{
    'size': 1000,
    'time_dask': 0.47808194160461426,
    'time_sklearn': 0.004954099655151367,
    'acc_dask': 0.645,
    'acc_sklearn': 0.645,
    'coef_dask': array([-0.04967113, -0.43258482,  0.10778322, -0.67010295]),
    'coef_sklearn': array([[-0.43595994,  0.10862159, -0.67551076,  0.05938726]])
}
{
    'size': 2000,
    'time_dask': 6.170458555221558,
    'time_sklearn': 0.008183717727661133,
    'acc_dask': 0.796,
    'acc_sklearn': 0.796,
    'coef_dask': array([ 0.01435567, -1.61027525, -0.04957294, -0.96870719]),
    'coef_sklearn': array([[-1.61070132, -0.04759224, -0.96795166, -1.03047095]])
}

Looking at these results, the coefficients seem seriously skewed, but I wonder ... it also looks like they appear in a different order. Would you agree?
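A quick throwaway check of that apparent shift, using the size=1000 coefficients above:

import numpy as np

coef_dask = np.array([-0.04967113, -0.43258482, 0.10778322, -0.67010295])
coef_sklearn = np.array([-0.43595994, 0.10862159, -0.67551076, 0.05938726])

# Rotating the Dask coefficients left by one nearly reproduces sklearn's
# first three entries; only the wrapped-around last entry disagrees.
print(np.roll(coef_dask, -1) - coef_sklearn)
# -> roughly [ 0.003, -0.001,  0.005, -0.109]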

@stsievert (Member) commented:

Hm... I've modified the script a bit below, and get different results when I train Dask-ML's LogisticRegression on NumPy arrays and a Dask array of the same data:

$ python train_glm.py
Training on NumPy arrays, rel_error = 0.3344027325490487
Training on Dask array of same data, rel_error = 1.8500723840924802

Is that expected @TomAugspurger? I'm not seeing any tests to make sure Scikit-learn and Dask-ML linear models produce the same results.

Test script (the warning it produces is shown below):
import numpy as np
from sklearn.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
import dask.array as da
import numpy.linalg as LA


if __name__ == "__main__":
    n, d = 1000, 10
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d // 2,
        n_redundant=d // 2,
        random_state=42,
    )

    kwargs = dict(max_iter=10_000, random_state=68, C=1e8, tol=0e-10)

    baseline = SklearnLogReg(solver="lbfgs", **kwargs)
    baseline.fit(X, y)

    est = DaskLogReg(solver="lbfgs", **kwargs)
    est.fit(X, y)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on NumPy arrays, rel_error =", rel_error)

    X_dist = da.from_array(X, chunks=(n, d))
    y_dist = da.from_array(y, chunks=n)
    assert X_dist.npartitions == y_dist.npartitions == 1
    est.fit(X_dist, y_dist)
    rel_error = LA.norm(baseline.coef_ - est.coef_) / LA.norm(baseline.coef_)
    print("Training on Dask arrays, rel_error =", rel_error)

I also get a warning:

/Users/scott/anaconda3/lib/python3.8/site-packages/dask/config.py:588: UserWarning: Configuration key "fuse_ave_width" has been deprecated. Please use "optimization.fuse.ave-width" instead
  warnings.warn(

@TomAugspurger (Member) commented:

I'm not too familiar with the glm code. That would be Chris White.

I do recall noticing that the default method can take a really long time to converge.

@bingoko commented Oct 12, 2021:

I am facing a similar problem. Are you trying it with multiple classes?
It seems like dask-ml's LogisticRegression doesn't work correctly for multiclass classification.

@koaning (Author) commented Oct 12, 2021:

I have been using the make_classification from dask_glm, which only outputs two classes: [True, False].

@kchare (Contributor) commented Feb 10, 2022:

I was experiencing the same issue. I believe it is due to the assignment of the intercept and the multiple dispatch of dask_glm.utils.add_intercept(). I have adapted @stsievert's experiment above with a few small modifications, and it seems that the type of array determines whether the intercept column is concatenated as the first or the final column of the design matrix; dask_glm only takes the final entry as the intercept.

Because the decision_function() and predict() methods rely on add_intercept() as well, though, this doesn't affect the decision function, as shown for the first simulated data point.

$ python train_glm.py
NumPy array _coef: [ 0.44225  0.65241  0.28006 -0.24262  0.61215]
Dask array _coef: [ 0.61215  0.44225  0.65241  0.28006 -0.24262]

Baseline coef_: [ 0.44225  0.65241  0.28006 -0.24262]
NumPy array coef_: [ 0.44225  0.65241  0.28006 -0.24262]
Dask array coef_: [0.61215 0.44225 0.65241 0.28006]

Baseline intercept_: 0.61215
NumPy array intercept_: 0.61215
Dask array intercept_: -0.24262

NumPy Xbeta: -1.34381
Dask Xbeta: -1.34381

Test script:
import numpy as np
from sklearn.datasets import make_classification
from dask_ml.linear_model import LogisticRegression as DaskLogReg
from sklearn.linear_model import LogisticRegression as SklearnLogReg
import dask.array as da


if __name__ == "__main__":
    n, d = 1000, 4
    X, y = make_classification(
        n_samples=n,
        n_features=d,
        n_informative=d,
        n_redundant=0,
        random_state=42,
    )

    kwargs = dict(max_iter=10_000, random_state=68, C=1e8, tol=0e-10)

    baseline = SklearnLogReg(solver="lbfgs", **kwargs)
    baseline.fit(X, y)

    np_est = DaskLogReg(solver="lbfgs", **kwargs)
    np_est.fit(X, y)
    
    X_dist = da.from_array(X, chunks=(n, d))
    y_dist = da.from_array(y, chunks=n)
    assert X_dist.npartitions == y_dist.npartitions == 1
    
    dask_est = DaskLogReg(solver="lbfgs", **kwargs)
    dask_est.fit(X_dist, y_dist)

    np.set_printoptions(precision=5)
    print(f"NumPy array _coef: {np_est._coef}")
    print(f"Dask array _coef: {dask_est._coef}\n")
    
    print(f"Baseline coef_: {baseline.coef_[0]}")
    print(f"NumPy array coef_: {np_est.coef_}")
    print(f"Dask array coef_: {dask_est.coef_}\n")

    print(f"Baseline intercept_: {baseline.intercept_[0]:0.5f}")
    print(f"NumPy array intercept_: {np_est.intercept_:0.5f}")
    print(f"Dask array intercept_: {dask_est.intercept_:0.5f}\n")

    print(f"NumPy Xbeta: {np_est.decision_function(X)[0]:0.5f}")
    print(f"Dask Xbeta: {dask_est.decision_function(X_dist)[0].compute():0.5f}")

@stsievert (Member) commented Feb 14, 2022:

Thanks for that output @kchare. I think I've found the issue:

  1. Dask-ML's utils.add_intercept adds the column of ones representing the intercept as the first or the last column, depending on whether the input is a Dask or a NumPy array (respectively).
  2. But... the estimator always assumes the intercept is the last coefficient:

Details:

self.intercept_ = self._coef[-1]

Script showing the ordering issue:
>>> import numpy as np
>>> import dask.array as da
>>> from dask_ml.linear_model.utils import add_intercept
>>> X = np.array([[2, 2], [3, 3], [4, 4]])
>>> Y = da.from_array(X)
>>> Xt = add_intercept(X)
>>> Yt = add_intercept(Y).compute()
>>> print(Xt)
[[2. 2. 1.]
 [3. 3. 1.]
 [4. 4. 1.]]
>>> print(Yt)
[[1 2 2]
 [1 3 3]
 [1 4 4]]
>>> assert np.allclose(Xt, Yt)  # AssertionError: the column orderings differ

I think this diff in dask_ml/linear_model/utils.py resolves this issue:

 @dispatch(np.ndarray)
 def add_intercept(X):
-    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)
+    return _add_intercept(X)

 def _add_intercept(x):
     ones = np.ones((x.shape[0], 1), dtype=x.dtype)
-    return np.concatenate([ones, x], axis=1)
+    return np.concatenate([x, ones], axis=1)
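
With this change both dispatches append the column of ones as the last column, which matches the intercept_ = self._coef[-1] assumption above; the REPL session would then pass its allclose check.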

Dask-GLM also has some utils, and the column ordering is specified correctly there.

@kchare (Contributor) commented Feb 28, 2022:

Thank you for tracing that @stsievert; I agree that Dask-GLM defaults to treating the final coefficient as the intercept. I am happy to create a PR implementing the diff you suggested in dask_ml/linear_model/utils.py, if that is an acceptable solution to this issue.

@stsievert (Member) commented:

I'd love to see a PR for that! I think it'd be a valuable PR, mostly because I think Dask-ML should have a test showing it converges to Scikit-learn's solution (and to be frank, I'm surprised that test is missing!).
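
Roughly, such a test might look like this (a hypothetical sketch, not an existing test; the names, sizes, and tolerance are made up):

import numpy as np
import dask.array as da
from numpy.linalg import norm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression as SklearnLogReg
from dask_ml.linear_model import LogisticRegression as DaskLogReg


def test_logistic_regression_converges_to_sklearn():
    # Weak regularization and many iterations so both solvers approach
    # the same (near-unregularized) optimum.
    X, y = make_classification(n_samples=500, n_features=8, random_state=0)
    ref = SklearnLogReg(C=1e8, max_iter=10_000).fit(X, y)

    X_da = da.from_array(X, chunks=(250, 8))
    y_da = da.from_array(y, chunks=250)
    est = DaskLogReg(C=1e8, max_iter=10_000).fit(X_da, y_da)

    rel_error = norm(np.ravel(ref.coef_) - np.ravel(est.coef_)) / norm(ref.coef_)
    assert rel_error < 0.01  # tolerance is a judgment call

This is exactly the kind of test that would have caught the ordering bug above.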
