API discussion #11

Open
cicdw opened this issue Jan 26, 2017 · 18 comments

Comments

@cicdw
Collaborator

cicdw commented Jan 26, 2017

One of the biggest open issues is to organize / discuss what the API will look like for dask_glm and what sorts of things we expose. This decision impacts algorithm structure going forward (as we incorporate other GLM families and regularizers), so I will need some help focusing the following discussion on specific action items. My initial thoughts were to:

  • Aim for generality so users can create their own GLM families / regularizers: My initial thought along these lines was to have an Optimizer(**initialization_args, **numerical_algo_args, **regularization_args) class that can be inherited by any class that implements func(), grad(), and hessian() methods (see the sketch after this list). This would allow us to focus on algorithm development separately from model specification and add additional model types (Normal, Poisson, etc.) easily. The user then calls Logistic(X, y, **kwargs) to initialize a model, and .fit(**algo_args) calls on the methods from Optimizer. This can get hairy, and could possibly prevent further optimizations that exploit specific structure inherent in the models.

  • Hard code the models users can choose from: Per my conversation with @jcrist, we might not want to aim for extreme flexibility in the model API, as most people probably don't need / want to define their own GLM families or their own regularizers. Related, we might want to implement the algorithms specifically for each family / regularizer.

  • Regularized vs. unregularized models: How do we manage the 'switch' from regularized to unregularized models? Should it be something as simple as model = LogisticModel(X, y, reg='l2') and the user chooses from a pre-defined list of regularizers, or should we allow for model = LogisticModel(X, y, reg=reg_func) where reg_func() is any regularization function on the parameters, allowing for customizations such as grouped lasso, non-negativity, etc? Again I was hoping to keep things as general as possible so that the user can mix and match, but on the algorithm side this can get difficult especially if we want speed and accuracy.

  • Inherit from scikit-learn: Per my conversation with @hussainsultan, we might want to try and stick as closely as possible to scikit-learn's API, as this is common enough to be comfortable for most people. However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications (thus requiring things like inferential stats that scikit-learn currently doesn't include).
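
To make that first bullet concrete, here is a minimal numpy sketch of what I have in mind (all names here are hypothetical, nothing in dask_glm yet):

import numpy as np


class Optimizer(object):
    """Generic fitting machinery; any subclass implementing func(), grad()
    (and optionally hessian()) plus _initialize() gets .fit() for free."""

    def fit(self, max_iter=250, tol=1e-8, step_size=1e-3):
        beta = self._initialize()
        for _ in range(max_iter):
            # plain gradient descent as a placeholder for fancier algorithms
            beta_new = beta - step_size * self.grad(beta)
            if np.linalg.norm(beta_new - beta) < tol:
                beta = beta_new
                break
            beta = beta_new
        self.coef_ = beta
        return self


class Logistic(Optimizer):
    def __init__(self, X, y):
        self.X, self.y = X, y

    def _initialize(self):
        return np.zeros(self.X.shape[1])

    def func(self, beta):
        # negative log-likelihood; used by solvers that need objective values
        Xbeta = self.X.dot(beta)
        return np.sum(np.log1p(np.exp(Xbeta))) - self.y.dot(Xbeta)

    def grad(self, beta):
        p = 1.0 / (1.0 + np.exp(-self.X.dot(beta)))
        return self.X.T.dot(p - self.y)

Adding a Normal or Poisson family would then only mean writing func / grad for that family, not another solver.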

Ultimately, I think my thought process boils down to: should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?

cc: @mrocklin @mcg1969

@mrocklin
Member

From the perspective of someone who is not intimately familiar with this field it would be useful to see an example algorithm for a particular case (perhaps un-regularized logistic regression using gradient descent) and then highlight which elements of the code we would change if we were to add L1 or L2 regularization, switch out logistic for some other GLM family, or replace gradient descent with something else.

If it is possible to create an example where those options are swappable, then that exercise may help to inform a general solution. I'm talking well outside of my expertise though.

Regarding scikit-learn is it possible to modestly extend the fit/predict API with some other verb to provide inferential information?

should we expose an API to access the underlying algorithms or should we bake them into the various specific GLM families we want to consider?

I generally try to build concrete things first and then generalize them. However it requires some finesse to build the concrete thing quickly enough to get valuable user feedback while also not painting yourself into a corner to the point where you can't refactor later. Sorry for the platitude; I have nothing concrete to provide here.

@hussainsultan
Collaborator

@moody-marlin Thanks for raising this.

However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications.

What types of output do you expect would be different from scikit-learn's?

@cicdw
Collaborator Author

cicdw commented Jan 27, 2017

@mrocklin Thanks for the feedback, that's a good exercise that I can work on writing up.

@hussainsultan I would expect output more similar to statsmodels, so something like a model object that contains p-values, AIC, BIC, Somers' D, adjusted R^2, outlier / residual tests, etc. Some of these values are still meaningful for regularized problems (e.g., R^2) while others are not (e.g., p-values).

@mcg1969

mcg1969 commented Jan 31, 2017

I'd have to cast my vote for something that hews closer to a statsmodels approach than scikit-learn, both in terms of its API and the types of outputs it provides.

@cicdw
Collaborator Author

cicdw commented Feb 15, 2017

In terms of what is exposed and where, here is my current thinking: https://github.com/moody-marlin/dask-glm/blob/api_build/dask_glm/logistic.py

cc: @mrocklin @jcrist @hussainsultan

@mrocklin
Member

Throwing Chris' example into a comment here for discussion

class LogisticModel(object):
    def hessian(self):
    def gradient(self):
    def loglike(self):
    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250):
    def _initialize(self):
    def _check_inputs(self):
    def _set_intercept(self):
    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True):

I'm not qualified to judge on most of this but I'm curious what some of these methods do. Do all models implement a hessian? Or is this just used during solving if it's implemented?

Is there a way to satisfy both the sklearn and statsmodels APIs? Roughly what is the statsmodels API?

Are the solvers built here tailored to Logistic regression or are they more general?

@cicdw
Collaborator Author

cicdw commented Feb 16, 2017

All good questions. The current solvers that are immediately generalizable to other models are gradient_descent, admm, and newton: they generalize by replacing the function / gradient / hessian evaluations with calls to the methods above, and by replacing the call to shrinkage in admm with a call to a proximal_operator that can be used for any regularizer.
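
Roughly, the generic regularized solver would look something like this sketch (the proximal_operator interface is hypothetical, not current dask_glm code):

import numpy as np


def proximal_gradient(model, regularizer, max_iter=250, step_size=1e-3, tol=1e-8):
    """Any model exposing gradient() / _initialize() and any regularizer
    exposing proximal_operator() can be plugged in."""
    beta = model._initialize()
    for _ in range(max_iter):
        # gradient step on the smooth (log-likelihood) part ...
        z = beta - step_size * model.gradient(beta)
        # ... followed by the regularizer's proximal operator
        beta_new = regularizer.proximal_operator(z, step_size)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta


class L1(object):
    def __init__(self, lamduh=0.1):
        self.lamduh = lamduh

    def proximal_operator(self, z, t):
        # soft-thresholding: the proximal operator of t * lamduh * ||beta||_1
        return np.sign(z) * np.maximum(np.abs(z) - t * self.lamduh, 0)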

In my mind, there are two big differences between sklearn and statsmodels:

  • Upon initialization, sklearn takes in algorithm parameters such as initialization, convergence tolerance, etc., and the data is then fed into the .fit() call. Statsmodels, on the other hand, is initialized with data, and the call to .fit() accepts the algorithm parameters.
  • The output of sklearn is a model object with a few new attributes (the coefficients and a few things about convergence). Statsmodels outputs a model object with many new attributes, including various inferential statistics and statistical tests, along with a .summary() method to see all of this (see the comparison sketched below).
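
Side by side, the two conventions look roughly like this (with small in-memory arrays just for illustration):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 3)
y = (X[:, 0] + np.random.rand(100) > 1.0).astype(int)

# sklearn: algorithm parameters at construction, data passed to .fit()
sk_model = LogisticRegression(tol=1e-8, max_iter=250)
sk_model.fit(X, y)
sk_model.coef_                 # a few new attributes on the fitted estimator

# statsmodels: data at construction, algorithm parameters passed to .fit()
sm_results = sm.Logit(y, X).fit(maxiter=250)
sm_results.summary()           # rich results object with inferential statistics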

In my mind, the organization of arguments in statsmodels makes more sense, but it could be that sklearn's interface is more amenable to pipelines and deployment?

Re: the methods, the loglike / gradient / hessian methods will be used both for fitting the model parameters and for producing output statistics. _initialize() will be used for creating initializations for the algorithms (there are some smarter choices than all zeros that could lead to quicker convergence). _check_inputs will check that the data is chunked appropriately and will handle differences between dask dataframes and dask arrays. _set_intercept will simply add a column of all 1's to the input data.
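
For example, _set_intercept could be as simple as the following sketch (the real helper may differ):

import dask.array as da


def _set_intercept(X):
    # append a column of ones so one coefficient acts as the intercept
    ones = da.ones((X.shape[0], 1), chunks=(X.chunks[0], 1))
    return da.concatenate([X, ones], axis=1)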

@hussainsultan
Collaborator

hussainsultan commented Feb 20, 2017

Ok... these are my thoughts on how to make this work with the sklearn interface:

import numpy as np
import dask_glm
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.extmath import safe_sparse_dot


def grad(w, X, y, lamduh):
    ...  # compute the objective value and its gradient here
    return out, grad


class Logistic(BaseEstimator, ClassifierMixin):

    def __init__(self, tol=1e-8, lamduh=0.1):
        # per sklearn convention, __init__ only stores its arguments
        self.tol = tol
        self.lamduh = lamduh

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.beta = self._solver(X, y)
        return self

    def predict(self, X):
        return np.sign(safe_sparse_dot(X, self.beta))

    def _solver(self, X, y):
        """This can be configurable based on methods in logistic.py;
        similarly we could add other methods like `def _hessian`."""
        return dask_glm.proximal_gradient(X, y)

    def summary(self):
        pass

The reason I like this is that it will work with sklearn as long as all of our methods work with numpy as well as dask.array.

Also, if it were a dask.array we could possibly use dask-learn pipelines.

@jcrist thoughts on the above?

@cicdw
Collaborator Author

cicdw commented Feb 20, 2017

Could you provide an example of a feature we gain by inheriting directly from sklearn? I don't clearly see what these sklearn classes provide us.

@jcrist
Member

jcrist commented Feb 20, 2017

Scikit-learn doesn't enforce inheriting from a base class, rather it requires a duck-typed interface. However, all of scikit-learn is written expecting immediate array-likes, so I'm not sure if dask-glm classes can really be mixed with scikit-learn stuff directly (at least when passing in dask-arrays). That said, I think matching the interface is a good idea, but not necessarily subclassing (could go either way here).

The required interface is described in detail here: http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator.
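
For reference, a minimal duck-typed estimator satisfying that contract (just a sketch, not dask-glm code) looks something like:

import numpy as np


class LogisticDuck(object):
    """Satisfies the scikit-learn estimator contract by convention only;
    no inheritance from BaseEstimator required."""

    def __init__(self, lamduh=0.1, tol=1e-8):
        # __init__ should only store its keyword arguments
        self.lamduh = lamduh
        self.tol = tol

    def get_params(self, deep=True):
        return {'lamduh': self.lamduh, 'tol': self.tol}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # real fitting logic would go here; fit must return self
        self.coef_ = np.zeros(X.shape[1])
        return self

    def predict(self, X):
        return (X.dot(self.coef_) > 0).astype(int)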

also, if it were a dask.array we could possibly use dask-learn pipelines.

Those have been removed, as they were difficult to maintain, and not that useful in practice. That package now only implements model-parallelism, accelerating grid-search/randomized-search across small data. dask-glm wouldn't really fit with the old pipelines anyway, as it evaluates immediately, while those were lazy.

@hussainsultan
Collaborator

hussainsultan commented Feb 20, 2017

Thanks @jcrist, I wasn't aware that duck typing was sufficient for the sklearn interface. This is super helpful!

@moody-marlin I am happy with what you proposed, with the slight modification that we try to adhere to the sklearn API. I think this is general enough that it will allow for the things you wanted to do. One example I could think of is using sklearn pipelines, which would work for numpy inputs (sketched below): http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html What do you think?
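
Concretely, assuming the Logistic wrapper sketched above is fleshed out to work on numpy inputs, that would look like:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = (np.random.rand(100) > 0.5).astype(int)

# 'Logistic' is the sklearn-style wrapper sketched earlier in this thread
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('logistic', Logistic(lamduh=0.1)),
])
pipe.fit(X, y)
predictions = pipe.predict(X)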

@mcg1969

mcg1969 commented Feb 20, 2017

class LogisticModel(object):
    def hessian(self):
    def gradient(self):
    def loglike(self):
    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250):
    def _initialize(self):
    def _check_inputs(self):
    def _set_intercept(self):
    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True):

This strikes me as an inappropriate mixture of categories. Methods like gradient and Hessian refer, I assume, to the ability to compute individual values and derivatives of a particular function. And of course, not every model we'd like to build fitting algorithms for is differentiable (e.g., regularized models). I know I'm late to the party, but is there a technical reason we need to conflate functions and models?

@TomAugspurger
Member

I did a bit of rough tinkering to hook up statsmodels and dask / dask-glm here. The two biggest tasks were getting the statsmodels Model class to accept dask arrays (that version is hacky, no error checks, no patsy), and using dask-glm in the .fit step.

The most awkward part was statsmodels' .fit methods (at least the likelihood-based ones) using something like scipy.minimize, which takes a function, starting parameters, and maybe a gradient or Hessian. Contrast that with dask-glm, which takes an X, y, and family.

I haven't really thought through whether dask-glm splitting the optimization part off from the data part is even desirable from dask-glm's point of view. Just wanted to share some feedback.

@mrocklin
Member

This notebook also ran fine on a distributed system (I think), at least locally. You might want to add a .persist() at the end of the from_array calls.
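
i.e. something along these lines (shapes and chunk sizes here are just placeholders):

import numpy as np
import dask.array as da

# .persist() keeps the chunks in (distributed) memory, so the repeated passes
# over the data during fitting don't recompute the from_array graph each time
X = da.from_array(np.random.rand(10000, 10), chunks=(1000, 10)).persist()
y = da.from_array(np.random.rand(10000), chunks=1000).persist()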

In general I like that this seems to be on a path towards figuring out a good hand-off point between the projects.

        # This section is a bit awkward. From statsmodels POV, dask-glm would
        # provide an optimizer f: llf, score -> opt
        optimizer = self._get_optimizer()()

cc @moody-marlin thoughts on this? @TomAugspurger is this the full expected interface below?

class DaskOptimizer(Optimizer):

    def _fit(self, f, score, start_params, fargs, kwargs, hessian,
             method, disp, maxiter, callback, retall, full_output):
        from dask_glm.algorithms import admm
        return admm(self.exog, self.endog)

Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?

class DaskGLM(GLM):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        from dask_glm.families import Logistic
        self.family = Logistic

@TomAugspurger
Member

is this the full expected interface below?

IIRC, this isn't perfectly unified across all of statsmodels. Certain fit methods / optimizers have their own arguments. I was meaning to look more closely but ran out of time.

Here you hard code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?

You provide a family argument when you instantiate the model:

sm.GLM(data.endog, data.exog, family=sm.families.Gamma())

I haven't looked, but I assume that statsmodels' version has all the info we need to look up the corresponding dask-glm version.
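
If so, the lookup could be as simple as something like this (hypothetical; assumes dask_glm grows Normal and Poisson families alongside Logistic):

import statsmodels.api as sm
from dask_glm import families

# hypothetical mapping from a statsmodels family class to a dask-glm family
_FAMILY_MAP = {
    sm.families.Binomial: families.Logistic,
    sm.families.Gaussian: families.Normal,
    sm.families.Poisson: families.Poisson,
}


def to_dask_glm_family(sm_family):
    return _FAMILY_MAP[type(sm_family)]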

@mrocklin
Member

@TomAugspurger what do you think is the right way forward here? Is this worth pursuing? What are things that should be worked on on the dask-glm side? Are there open questions for the statsmodels community?

@TomAugspurger
Member

TomAugspurger commented Mar 14, 2017 via email

@TomAugspurger
Member

If you're interested, I've taken a stab at this in #40
