API discussion #11
From the perspective of someone who is not intimately familiar with this field, it would be useful to see an example algorithm for a particular case (perhaps un-regularized logistic regression using gradient descent) and then highlight which elements of the code we would change if we were to add L1 or L2 regularization, switch out logistic for some other GLM family, or replace gradient descent with something else. If it is possible to create an example where those options are swappable, then that exercise may help to inform a general solution. I'm talking well outside of my expertise, though. Regarding scikit-learn: is it possible to modestly extend the fit/predict API with some other verb to provide inferential information?

I generally try to build concrete things first and then generalize them. However, it requires some finesse to build the concrete thing quickly enough to get valuable user feedback while also not painting yourself into a corner to the point where you can't refactor later. Sorry for the platitude; I have nothing concrete to provide here.
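A minimal sketch of the kind of example being asked for (hypothetical code, not part of dask-glm): logistic regression fit by gradient descent, with comments marking the pieces that would be swapped out for regularization, other GLM families, or other solvers.

```python
import numpy as np

def fit_logistic(X, y, reg_grad=None, lr=0.1, max_iter=500, tol=1e-8):
    """Logistic regression via gradient descent; `y` is 0/1.

    `reg_grad` is the swappable regularization piece: pass, e.g.,
    lambda b: lam * b for L2, or lambda b: lam * np.sign(b) for
    (sub)gradient L1.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # inverse link: swap for other GLM families
        grad = X.T @ (p - y) / len(y)        # gradient of the mean negative log-likelihood
        if reg_grad is not None:
            grad = grad + reg_grad(beta)     # regularization enters only here
        update = lr * grad                   # plain gradient step: swap for Newton, ADMM, ...
        beta = beta - update
        if np.linalg.norm(update) < tol:
            break
    return beta
```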
@moody-marlin Thanks for raising this.
What type of outputs do you expect that would be different from scikit-learn's?
@mrocklin Thanks for the feedback, that's a good exercise that I can work on writing up. @hussainsultan I would expect output more similar to statsmodels.
I'd have to cast my vote for something that hews closer to a statsmodels-style API.
In terms of what is exposed and where, here are my current thoughts: https://github.com/moody-marlin/dask-glm/blob/api_build/dask_glm/logistic.py
Throwing Chris' example into a comment here for discussion:

```python
class LogisticModel(object):

    def hessian(self): ...

    def gradient(self): ...

    def loglike(self): ...

    def fit(self, tol=1e-8, warm_start=None, random_state=None, max_iter=250): ...

    def _initialize(self): ...

    def _check_inputs(self): ...

    def _set_intercept(self): ...

    def __init__(self, X, y, reg=None, lamduh=0, fit_intercept=True): ...
```

I'm not qualified to judge on most of this, but I'm curious what some of these methods do. Do all models implement a Hessian? Or is this just used during solving if it's implemented? Is there a way to satisfy both the sklearn and statsmodels APIs? Roughly, what is the statsmodels API? Are the solvers built here tailored to logistic regression, or are they more general?
All good questions; the current solvers which are immediately generalizable to other models are:

In my mind, there are two big differences between sklearn and statsmodels:

To me, the organization of arguments in statsmodels makes more sense, but it could be that sklearn's interface is more amenable to pipelines and deployment. Re: the methods, the loglike / gradient / hessian methods will be used both for fitting the model parameters and for producing output statistics.
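To make that dual use concrete, here is a hypothetical helper (illustrative names, not dask-glm API) that turns the same Hessian a Newton-type solver would use into Wald-style inferential output:

```python
import numpy as np
from scipy import stats

def wald_summary(beta_hat, hessian):
    """Standard errors, z-statistics, and two-sided p-values from the
    Hessian of the negative log-likelihood evaluated at beta_hat."""
    cov = np.linalg.inv(hessian)       # asymptotic covariance of beta_hat
    se = np.sqrt(np.diag(cov))         # standard errors
    z = beta_hat / se                  # z-statistics
    p = 2 * stats.norm.sf(np.abs(z))   # two-sided p-values
    return se, z, p
```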
Ok... these are my thoughts on how to make this work with the sklearn interface:

```python
import numpy as np
import dask_glm
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.extmath import safe_sparse_dot


def grad(w, X, y, lambduh):
    ...
    return out, grad


class Logistic(BaseEstimator, ClassifierMixin):

    def __init__(self, tol=1e-8, lamduh=0.1):
        self.tol = tol
        self.lamduh = lamduh

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.beta = self._solver(X, y)
        return self

    def predict(self, X):
        return np.sign(safe_sparse_dot(X, self.beta))

    def _solver(self, X, y):
        """This can be configurable based on the methods in logistic.py;
        similarly, we could add other methods like `def _hessian`."""
        return dask_glm.proximal_gradient(X, y)

    def summary(self):
        pass
```

The reason I like this is that it will work with sklearn as long as all of our methods work with numpy arrays. Also, if it were a dask.array we could possibly use…

@jcrist, thoughts on the above?
Could you provide an example of a feature we gain by inheriting directly from `BaseEstimator`?
Scikit-learn doesn't enforce inheriting from a base class; rather, it requires a duck-typed interface. However, all of scikit-learn is written expecting immediate array-likes, so I'm not sure if dask-glm classes can really be mixed with scikit-learn stuff directly (at least when passing in dask-arrays). That said, I think matching the interface is a good idea, but not necessarily subclassing (could go either way here). The required interface is described in detail here: http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator.
Those have been removed, as they were difficult to maintain and not that useful in practice. That package now only implements model-parallelism, accelerating grid-search/randomized-search across small data.
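For reference, the duck-typed contract from that guide is small; a hypothetical estimator that satisfies it without subclassing might look like:

```python
class DuckLogistic:
    """No sklearn base class; just the expected methods."""

    def __init__(self, lamduh=0.1):
        self.lamduh = lamduh            # store constructor params verbatim

    def get_params(self, deep=True):
        return {"lamduh": self.lamduh}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        ...                             # estimate self.beta here
        return self                     # sklearn expects fit to return self

    def predict(self, X):
        ...
```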
Thanks @jcrist, I wasn't aware that duck typing was sufficient for the sklearn interface. This is super helpful! @moody-marlin I am happy with what you proposed, with the slight modification that we try to adhere to the sklearn API. I think this is general enough that it will allow for the thing that you wanted to do. One example that I could think of was using sklearn pipelines, which will work for…
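As a sketch of that pipeline point, assuming the `Logistic` class proposed above is fleshed out and the arrays fit in memory (`X`, `y` below are placeholder data):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Works because Logistic duck-types fit/predict and stores its params.
pipe = Pipeline([("scale", StandardScaler()), ("clf", Logistic(lamduh=0.1))])
pipe.fit(X, y)
preds = pipe.predict(X)
```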
This strikes me as an inappropriate mixture of categories. Methods like…
I did a bit of rough tinkering to hook up statsmodels and dask / dask-glm here. The two biggest tasks were getting the statsmodels Model class to accept dask arrays (that version is hacky: no error checks, no patsy), and using…

The most awkward part was statsmodels'…

I haven't really thought through whether dask-glm splitting the optimization part off from the data part is even desirable from dask-glm's point of view. Just wanted to share some feedback.
This notebook also ran fine on a distributed system (I think), at least locally. You might want to add a…

In general, I like that this seems to be on a path towards figuring out a good hand-off point between the projects.

```python
# This section is a bit awkward. From statsmodels' POV, dask-glm would
# provide an optimizer f: llf, score -> opt
optimizer = self._get_optimizer()()
```

cc @moody-marlin, thoughts on this?

@TomAugspurger, is this the full expected interface below?

```python
class DaskOptimizer(Optimizer):
    def _fit(self, f, score, start_params, fargs, kwargs, hessian,
             method, disp, maxiter, callback, retall, full_output):
        from dask_glm.algorithms import admm
        return admm(self.exog, self.endog)
```

Here you hard-code Logistic. Is there a general way to do this within statsmodels? Or is this patsy?

```python
class DaskGLM(GLM):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        from dask_glm.families import Logistic
        self.family = Logistic
```
IIRC, this isn't perfectly unified across all of statsmodels; certain fit methods / optimizers have their own arguments. I was meaning to look more closely but ran out of time.

You provide a family argument when you instantiate the model:

```python
sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
```

I haven't looked, but I assume that statsmodels' version has all the info we need to look up the corresponding dask-glm version.
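A minimal sketch of that lookup, assuming dask-glm's families module exposes `Logistic`, `Normal`, and `Poisson` (the mapping itself is illustrative, not an existing API):

```python
import statsmodels.api as sm
from dask_glm import families

# Hypothetical statsmodels-family -> dask-glm-family lookup.
FAMILY_MAP = {
    sm.families.Binomial: families.Logistic,
    sm.families.Gaussian: families.Normal,
    sm.families.Poisson: families.Poisson,
}

def to_dask_glm_family(sm_family):
    """Translate an instantiated statsmodels family to a dask-glm family."""
    return FAMILY_MAP[type(sm_family)]
```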
@TomAugspurger what do you think is the right way forward here? Is this worth pursuing? What are things that should be worked on, on the dask-glm side? Are there open questions for the statsmodels community?
I'll do some more experimentation, but it'll be a bit.

> Are there open questions for the statsmodels community?

I'd like to get a more cleaned-up version of dask + statsmodels working, to essentially replace https://github.com/statsmodels/statsmodels/blob/master/examples/notebooks/distributed_estimation.ipynb. I'd prefer statsmodels users use dask for larger datasets, rather than writing their own larger-than-memory handling.

> What are things that should be worked on on the dask-glm side?

Not really sure at this point. I still don't have a good feeling for how much additional work separating the optimizer would be. Should have some time in a few weeks.
If you're interested, I've taken a stab at this in #40
One of the biggest hanging issues is to organize / discuss what the API will look like for `dask_glm` and what sorts of things we expose. This decision impacts algorithm structure going forward (as we incorporate other GLM families and regularizers), so I will need some help focusing the following discussion on specific action items. My initial thoughts were to:

- **Aim for generality so users can create their own GLM families / regularizers:** My initial thought along these lines was to have an `Optimizer(**initialization_args, **numerical_algo_args, **regularization_args)` class that can be inherited by any class that implements `func()`, `grad()`, and `hessian()` methods (see the sketch after this list). This would allow us to focus on algorithm development separately from model specification and add additional model types (Normal, Poisson, etc.) easily. The user then calls `Logistic(X, y, **kwargs)` to initialize a model, and then `.fit(**algo_args)` calls on the methods from `Optimizer`. This can get hairy, and could possibly prevent further optimizations that exploit specific structure inherent in the models.
- **Hard-code the models users can choose from:** Per my conversation with @jcrist, we might not want to aim for extreme flexibility in the model API, as most people probably don't need / want to define their own GLM families or their own regularizers. Relatedly, we might want to implement the algorithms specifically for each family / regularizer.
- **Regularized vs. unregularized models:** How do we manage the 'switch' from regularized to unregularized models? Should it be something as simple as `model = LogisticModel(X, y, reg='l2')`, where the user chooses from a pre-defined list of regularizers, or should we allow for `model = LogisticModel(X, y, reg=reg_func)`, where `reg_func()` is any regularization function on the parameters, allowing for customizations such as grouped lasso, non-negativity, etc.? Again, I was hoping to keep things as general as possible so that the user can mix and match, but on the algorithm side this can get difficult, especially if we want speed and accuracy.
- **Inherit from scikit-learn:** Per my conversation with @hussainsultan, we might want to try and stick as closely as possible to `scikit-learn`'s API, as this is common enough to be comfortable for most people. However, we don't want to adhere to this too strongly (especially in output), because we started with the goal of intelligently toggling between inferential and predictive applications (thus requiring things like inferential stats that `scikit-learn` currently doesn't include).

Ultimately, I think my thought process boils down to: should we expose an API to access the underlying algorithms, or should we bake them into the various specific GLM families we want to consider?
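A minimal sketch of the first option, assuming the `Optimizer` / `func()` / `grad()` split described above (hypothetical code, also illustrating the pre-defined `reg='l2'` switch from the third bullet):

```python
import numpy as np

class Optimizer:
    """Generic fitting: relies only on the grad() supplied by the subclass."""

    def fit(self, step=0.1, max_iter=500, tol=1e-8):
        beta = np.zeros(self.nparams)
        for _ in range(max_iter):
            update = step * self.grad(beta)   # plain gradient step; other
            beta = beta - update              # algorithms would slot in here
            if np.linalg.norm(update) < tol:
                break
        self.beta = beta
        return self


class Logistic(Optimizer):
    """Model specification: supplies func()/grad() for the optimizer."""

    def __init__(self, X, y, reg=None, lamduh=0.0):
        self.X, self.y = X, y
        self.nparams = X.shape[1]
        self.reg, self.lamduh = reg, lamduh

    def func(self, beta):
        p = 1 / (1 + np.exp(-self.X @ beta))
        nll = -np.mean(self.y * np.log(p) + (1 - self.y) * np.log(1 - p))
        if self.reg == 'l2':                  # pre-defined regularizer switch
            nll += 0.5 * self.lamduh * beta @ beta
        return nll

    def grad(self, beta):
        p = 1 / (1 + np.exp(-self.X @ beta))
        g = self.X.T @ (p - self.y) / len(self.y)
        if self.reg == 'l2':
            g = g + self.lamduh * beta
        return g
```

With this split, `Logistic(X, y, reg='l2', lamduh=0.1).fit()` exercises both halves, and adding a Poisson family would touch only the model subclass, not the solver.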
cc: @mrocklin @mcg1969