Approximate exponents #23
@jcrist this may interest you.

In my experience we don't need to care about Inf/NaN checking, but we do need to make sure that it returns Inf if the input is too large.

I personally think that it could be useful to use inexact / approximate computations for a large piece of the algorithm and then use exact computations for the final steps. A high-level overview of what I mean:

1. Run most of the optimization with cheap, approximate computations.
2. Finish with exact computations, warm-started from that result.

The first piece will hopefully be faster, and the second piece ensures that we are still solving the true problem.
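A minimal sketch of that two-stage idea, assuming a hypothetical `fit` routine that accepts a pluggable `exp` implementation, a tolerance, and a warm start (none of these names or keyword arguments exist in the codebase; they are placeholders):

```python
import numpy as np

def hybrid_fit(X, y, fit, exp_approx, coarse_tol=1e-3, fine_tol=1e-8):
    # Stage 1: approximate exp and a loose tolerance give fast progress toward the optimum.
    beta0 = fit(X, y, exp=exp_approx, tol=coarse_tol)
    # Stage 2: exact exp with a tight tolerance, warm-started, so the reported answer
    # minimizes the true (unapproximated) objective.
    return fit(X, y, exp=np.exp, tol=fine_tol, beta0=beta0)
```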
It looks like `exp` appears in the following two places:

```python
def sigmoid(x):
    return 1 / (1 + exp(-x))
```

```python
def logistic_loss(beta, X, y):
    '''Logistic Loss, evaluated point-wise.'''
    beta, y = beta.ravel(), y.ravel()
    Xbeta = X.dot(beta)
    eXbeta = np.exp(Xbeta)
    return np.sum(np.log1p(eXbeta)) - np.dot(y, Xbeta)
```

In the sigmoid case it seems like we only care about accuracy close to 0 because anything farther away will get squashed. In the second case of …
It's also worth noting that the example above is an extreme case where there are only two columns. As the number of columns increases we expect the matvec to dominate over the exp.

There are plenty of pieces of advice that Stuart has offered above that would require no numerical compromise and would be worth considering. But for the rest I'm going to maintain my spoilsport position here. I'm opposed to doing anything that returns an approximate result. Statisticians want what they want; heck, most modelers want what they want. And if that's a logistic regression, that means the global minimum of a particular cost function.

I also feel that an attempt to design a hybrid approach that attempts to preserve global convergence while exploiting approximations in early stages introduces a lot of distracting research and speculation. As far as I know (and I could be wrong), a dependence on approximate exponentials is unprecedented in this area. That's true of every innovation, certainly. But I think it is worth asking: are any other well-respected or widely used GLM implementations approximating the exponential? And if not, why not? I took a look at H2O and didn't see any evidence that they use anything but the standard library exp. (That said, if we prove competitive or better with the exact model, and we can demonstrate significant improvements on top of that with an approximation, that would be attractive.)

Obviously, I'm all talk and no action right now (though I was actively working on test generation today). I'm not trying to exercise power over this project that I don't have. I just don't think the market is asking for approximations for the sake of speed, when there ought to be myriad ways to improve upon the current best workflow practices, of which the algorithm is just a part.
My understanding is that choices like the sigmoid were fairly qualitative to start with, and so outputs should be relatively robust to approximation here. Also, approximations of exp can be very, very accurate if constrained to certain ranges. There is evidence that people do approximate exp in practice. However, benchmarking shows that this is only valuable in the small-number-of-columns case. If common use cases are well over 10 columns then the matvecs take over in cost.
That's absolutely right! And yet I consider it completely irrelevant to this discussion. So it seems our disagreement here may be more philosophical than technical.

There is no doubt that a larger modeling effort might be robust to an approximation. The very act of selecting a model to apply to a set of data implies approximation and heuristic. How many people have chosen least squares simply because it's something they know how to compute, without any consideration as to whether or not the error distribution is approximately i.i.d. Gaussian? And with a binary model, there are several common alternatives to the logistic model. When to choose one over the other is not always going to be clear, and many will just settle for what they know.

But once I've settled on a model, I want what I want. If I've decided that's going to be a logistic model, I'm going to be suspicious of a package that insists on offering me an approximation. Or perhaps it's not so much that we would insist on it, but offering it prominently might communicate that an approximation is necessary to achieve decent performance, when no other package I'm aware of feels the need to make that claim.
I hear what you're saying, and I appreciate the aversion to approximations; however, I think that this may be approximate in a similar way to how floating point arithmetic is approximate. I suggest that we just try things out on real data and see if it has value in practice. I'm not suggesting that we go down this path, I'm suggesting that we explore it. I think it's cheap to try, and I'm entirely happy to back off from it if it proves unfruitful or significantly affects accuracy.

I suspect that with a couple hours of work we could create an `exp_approx` implementation in Numba that was arbitrarily accurate around the range relevant for sigmoid. We would then have the option to play with this function while we profile and benchmark different problems and see if it has positive or negative effects.
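As a rough illustration of what that could look like (a sketch under my own assumptions: the name `exp_approx`, the saturation cutoffs, and the polynomial degree are choices made here, not anything agreed on in this thread), a standard range-reduction-plus-polynomial scheme gets to roughly 1e-6 relative accuracy:

```python
import math
import numpy as np
from numba import njit

LN2 = 0.6931471805599453

@njit(cache=True)
def exp_approx(x):
    """Sketch of a fast exp: exp(x) = 2**k * exp(r), with |r| <= ln(2)/2."""
    # Saturate extreme inputs rather than relying on IEEE exception handling.
    if x > 709.0:
        return np.inf
    if x < -709.0:
        return 0.0
    k = int(math.floor(x / LN2 + 0.5))   # nearest integer to x / ln(2)
    r = x - k * LN2                      # remainder in [-ln(2)/2, ln(2)/2]
    # Degree-5 polynomial for exp(r) on that interval (~2e-6 relative error).
    p = 1.0 + r * (1.0 + r * (0.5 + r * (1.0 / 6.0 + r * (1.0 / 24.0 + r * (1.0 / 120.0)))))
    return p * 2.0 ** k

@njit(cache=True)
def sigmoid_approx(x):
    return 1.0 / (1.0 + exp_approx(-x))
```

Swapping this in for `np.exp` / `sigmoid` while profiling would be the cheap experiment described above; a Remez-fit polynomial (or a higher degree) would tighten the accuracy further if needed.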
Get me to within …

For ADMM, my understanding is that you don't have to be precise in the local updates for the solution to converge, which leads me to believe that some numeric testing will be helpful. @moody-marlin thoughts?
There's quite a bit of literature about inexact search methods. For instance, in a typical gradient/Newton/BFGS descent method, you'll still see progress towards the optimal solution as long as your descent directions exhibit a negative inner product with the gradient (and some other technical conditions). With large-scale Newton methods in particular you can sometimes get significant savings by solving the Newton system iteratively and inexactly. But approximate search directions need to be distinguished from approximations of the models themselves.
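For reference, the condition being alluded to is the standard sufficient-descent argument (stated here from memory, not from this thread): if

$$
\nabla f(x_k)^\top d_k < 0,
$$

then $f(x_k + \alpha d_k) < f(x_k)$ for all sufficiently small step sizes $\alpha > 0$, so even an inexactly computed direction $d_k$ makes progress as long as it stays on the descent side of the gradient; pairing it with an Armijo/Wolfe line search recovers the usual convergence guarantees under standard smoothness assumptions.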
This is an interesting discussion. I can't really comment on the domain-specific use case of what e.g. modellers want; however, as an engineer I have some thoughts on the specific issue of computing exp. Given …
@stuartarchibald which of these techniques would work well in a Python + Numba project that is strictly open source (no proprietary code)?
Sigmoid is very smooth; have you considered a lookup table?

Sigmoid has a neat Taylor series expansion, is super fast to evaluate, and is reasonably accurate even with a pretty rough lookup table. My interpolation is pretty lame, but it's just to show that one of you actual applied-math types could come up with something much more accurate. I'm pretty sure that any reasonable rational function of exponents of polynomials will behave in a similar way. How much precision do you need for function evals in GLM? 3 digits? Or more?
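A minimal version of the lookup-table idea (the grid size and clamping range below are arbitrary choices for illustration, not values from this discussion):

```python
import numpy as np

# 2048 points over [-8, 8]: linear interpolation error is roughly 1e-6,
# and clamping to the table ends outside that range costs about 3e-4.
_GRID = np.linspace(-8.0, 8.0, 2048)
_TABLE = 1.0 / (1.0 + np.exp(-_GRID))

def sigmoid_lut(x):
    """Piecewise-linear interpolation of sigmoid from a precomputed table."""
    # np.interp clamps to the table's end values outside [-8, 8].
    return np.interp(x, _GRID, _TABLE)
```

Whether the 3-4 digits a rough table like this guarantees is enough is exactly the precision question being asked.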
@mrocklin Enumerating the above techniques as 1-5.
```c
for (size_t i = 0; i < n; i++)
    exp_x[i] = exp(x[i]);
```

goes to

```c
void exp_func(double *input, double *output, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        // some inlined exp alg that's undergone the unswitch
        // possibly add in tiling here too, will need vector instruction support
    }
}

exp_func(x, exp_x, n);
```
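For what it's worth, a rough Python/Numba analogue of outlining the loop like that is a compiled ufunc (sketch only; `exp_kernel` and `x_values` are made-up names):

```python
import math
import numpy as np
from numba import vectorize

@vectorize(["float64(float64)"])
def exp_kernel(x):
    # Placeholder body: an inlined, unswitched, range-reduced approximation
    # would go here so the whole array is processed in one compiled loop.
    return math.exp(x)

x_values = np.linspace(-5.0, 5.0, 1000)  # example input
exp_x = exp_kernel(x_values)
```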
One MIT-licensed implementation of a really fast exponential is here: https://github.com/jhjourdan/SIMD-math-prims (it looks like you get a 20x speedup vs. expf).
Fun fact: the current ADMM implementation can spend almost half of its time computing `np.exp`.

Script

Profile results

I use a wrapped version of `np.exp` just so that it shows up in the profile results.
Here are some comments from @stuartarchibald:

- How many exp() are you doing at a time?
- If you are doing many, are the input values of similar magnitude?
- Do you care about IEEE 754 correctness over NaN and Inf?

When I've dealt with this previously, most of the time spent in a loop over exp() was in a) feraiseexcept or similar dealing with inf/nan, and b) branch misprediction; the split points are usually f(log_e(2)) IIRC.

a) goes away if you don't care.
b) goes away if you can guarantee a range or aren't too bothered about inaccuracy out of range. Remez algorithms are usually used to compute the coefficient table for a polynomial, and a bunch of shifts are done to get values into the appropriate range.
c) if you have a load of values to compute, perhaps try Intel VML?
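One concrete way to act on (a) and (b) from Python + Numba (a suggestion added here, not something proposed above) is `fastmath=True`, which relaxes strict IEEE 754 handling and can allow LLVM to drop some of that overhead or vectorize the loop:

```python
import numpy as np
from numba import njit

@njit(fastmath=True)
def exp_all(x):
    # fastmath relaxes NaN/Inf and rounding guarantees, trading accuracy in
    # corner cases for the chance to remove exception/branching overhead.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = np.exp(x[i])
    return out
```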
cc @seibert @mcg1969