Commit 48a6f41: typos fixed
kaipuolamaki committed on Sep 19, 2022 (1 parent: 03a334c)
1 changed file with 31 additions and 31 deletions: examples/how-to-optimize-with-torch.ipynb
"id": "4760bcc3",
"metadata": {},
"source": [
"# How to optimize with Torch\n",
"# How to optimise with Torch\n",
"Kai Puolamäki, 28 June 2022\n",
"\n",
"## Introduction\n",
"\n",
"This brief tutorial demonstrates how [PyTorch](https://pytorch.org) can be used to find minimum values of arbitrary functions, as is done in [SLISEMAP](https://github.com/edahelsinki/slisemap). The advantages of PyTorch include the use of autograd and optionally GPU acceleration. These may result in significant speedups when optimizing high-dimensional loss functions, which often happens in deep learning but also elsewhere.\n",
"This brief tutorial demonstrates how [PyTorch](https://pytorch.org) can be used to find minimum values of arbitrary functions, as is done in [SLISEMAP](https://github.com/edahelsinki/slisemap). The advantages of PyTorch include the use of autograd and optionally GPU acceleration. These may result in significant speedups when optimising high-dimensional loss functions, often in deep learning and elsewhere.\n",
"\n",
"The existing documentation of PyTorch is geared towards deep learning. It is currently difficult to find documentation of how to do \"simple\" optimization without any deep learning context, which is why I wrote this tutorial in the hope that it will be useful for someone.\n",
"The existing documentation of PyTorch is geared towards deep learning. It is currently difficult to find documentation of how to do \"simple\" optimisation without any deep learning context, which is why I wrote this tutorial in the hope that it will be helpful for someone.\n",
"\n",
"## Toy example\n",
"\n",
"Here we minimise a simple regularized least squares loss given by\n",
"Here we minimise a simple regularised least squares loss given by\n",
"$$\n",
"L = \\lVert {\\bf y}-{\\bf X}{\\bf b} \\rVert_2^2+\\lVert{\\bf{b}}\\rVert_2^2/10,\n",
"$$\n",
"where ${\\bf X}\\in{\\mathbb{R}}^{3\\times 2}$ and ${\\bf y}\\in{\\mathbb{R}}^3$ are constants and ${\\bf{b}}\\in{\\mathbb{R}}^2$ is a vector whose values are to be found by the optimiser. We could optimize any reasonably behaving function; here we picked the least squares loss for simplicity.\n",
"where ${\\bf X}\\in{\\mathbb{R}}^{3\\times 2}$ and ${\\bf y}\\in{\\mathbb{R}}^3$ are constants and ${\\bf{b}}\\in{\\mathbb{R}}^2$ is a vector whose values are to be found by the optimiser. We could optimise any reasonably behaving function; here, we picked the least squares loss for simplicity.\n",
"\n",
"In this example, we use the following values for the constant matrix and vector:\n",
"$$\n",
Expand All @@ -39,9 +39,9 @@
"\n",
"## Numpy and Scipy\n",
"\n",
"We first solve the problem with the [standard `scipy` optimizer](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), by using an arbitrarily chosen initial starting point.\n",
"We first solve the problem with the [standard `scipy` optimiser](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), by using an arbitrarily chosen initial starting point.\n",
"\n",
"We first define the matrices and vectors as Numpy arrays and then define a loss function `loss_fn0` that that takes the value of ${\\bf{b}}$ as input and outputs the value of the loss $L$."
"We first define the matrices and vectors as Numpy arrays and then define a loss function `loss_fn0` that takes the value of ${\\bf{b}}$ as input and outputs the value of the loss $L$."
]
},
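The corresponding code cell is collapsed in this view; below is a minimal sketch of this step. The numerical values of `X`, `y`, and the starting point `b0` are illustrative placeholders, not the notebook's actual values (which yield the loss $L=5.027$ quoted below).

```python
import numpy as np

# Illustrative placeholder values; the notebook defines its own X and y.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 0.0, -1.0])
b0 = np.array([1.0, 1.0])  # arbitrarily chosen starting point

def loss_fn0(b):
    """Regularised least squares loss L = ||y - X b||^2 + ||b||^2 / 10."""
    residual = y - X @ b
    return np.sum(residual**2) + np.sum(b**2) / 10

print(loss_fn0(b0))  # with the notebook's actual values this prints roughly 5.027
```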
{
"id": "c71c5414",
"metadata": {},
"source": [
"For this starting point the loss value is $L=5.027$, which is clearly larger than the optimal value.\n",
"For this starting point, the loss value is $L=5.027$, which is larger than the optimal value.\n",
"\n",
"We can find the value of ${\\bf{b}}$ that minimizes the loss $L$ by using a library optimization algorithm, BFGS in this case. We find the correct value of ${\\bf{b}}$ and the corresponding loss:"
"In this case, we can find the value of ${\\bf{b}}$ that minimizes the loss $L$ by using a library optimization algorithm, BFGS. We see the correct value of ${\\bf{b}}$ and the corresponding loss:"
]
},
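A sketch of the (collapsed) optimisation cell, assuming the `loss_fn0` and `b0` defined above:

```python
from scipy.optimize import minimize

# BFGS needs only the loss function and a starting point; gradients are
# estimated numerically unless a jac= argument is supplied.
res = minimize(loss_fn0, b0, method="BFGS")
print(res.x)        # value of b at the minimum
print(res.fun)      # loss at the minimum
print(res.success)  # True if the optimiser reports convergence
```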
{
"source": [
"## PyTorch\n",
"\n",
"We'll repeat the same with Pytorch. First we define a helper function `LBFGS` that takes in the loss function and the variables to be optimized as input and that as a side effect updates the variables to their values at the minimum of the loss function.\n",
"We'll repeat the same with Pytorch. First, we define a helper function `LBFGS` that takes in the loss function and the variables to be optimised as input and that, as a side effect, updates the variables to their values at the minimum of the loss function.\n",
"\n",
"The helper function uses the [Torch LBFGS optimizer](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html). The `closure` is a function that essentially evaluates the loss function and updates the gradient values. \n",
"The helper function uses the [Torch LBFGS optimiser](https://pytorch.org/docs/stable/generated/torch.optim.LBFGS.html). The `closure` is a function that essentially evaluates the loss function and updates the gradient values. \n",
"\n",
"You can use this helper function as a generic optimizer, much in the same way as you would use the `scipy.optimize.minimize` above by just cutting-and-pasting the LBGGS helper function into your code. The file [utils.py](https://github.com/edahelsinki/slisemap/blob/main/slisemap/utils.py) in the SLISEMAP source code contains a more advanced version of the helper function."
"You can use this helper function as a generic optimiser, much like you would use the `scipy. optimise. minimise` above by just cutting and pasting the LBGGS helper function into your code. The file [utils.py](https://github.com/edahelsinki/slisemap/blob/main/slisemap/utils.py) in the SLISEMAP source code contains a more advanced version of the helper function."
]
},
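The helper itself sits in a collapsed code cell. Below is a sketch consistent with the description above (up to 500 iterations per `step` call and a strong-Wolfe line search); the exact signature in the notebook and in SLISEMAP's `utils.py` may differ.

```python
import torch

def LBFGS(loss_fn, variables, max_iter=500, line_search_fn="strong_wolfe", **kwargs):
    """Minimise loss_fn by updating `variables` in place; returns the optimiser."""
    optimiser = torch.optim.LBFGS(variables, max_iter=max_iter,
                                  line_search_fn=line_search_fn, **kwargs)

    def closure():
        # Reset gradients, evaluate the loss, and backpropagate to fill .grad.
        optimiser.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss

    # A single .step() call runs up to max_iter LBFGS iterations.
    optimiser.step(closure)
    return optimiser
```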
{
"id": "96f70091",
"metadata": {},
"source": [
"Torch functions typically require that we define the variables torch tensors. The torch tensors correspond to Numpy arrays, but they carry autograd information and they can optionally be used within a GPU. Notice that we need to attach the slot for the gradients to ${\\bf{b}}$ tensor because we want to optimize it!"
"Torch functions typically require that we define the variables as torch tensors. The torch tensors correspond to Numpy arrays, but they carry autograd information and can optionally be used within a GPU. Notice that we need to attach the slot for the gradients to ${\\bf{b}}$ tensor because we want to optimize it!"
]
},
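A sketch of the tensor definitions (the names `Xt` and `yt` are mine; the notebook may simply reuse `X` and `y`):

```python
# Convert the Numpy arrays to torch tensors.
Xt = torch.as_tensor(X, dtype=torch.float32)
yt = torch.as_tensor(y, dtype=torch.float32)
# b is the variable we optimise, so it needs a slot for gradients.
b = torch.tensor(b0, dtype=torch.float32, requires_grad=True)
```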
{
"id": "da4abb6d",
"metadata": {},
"source": [
"The safe way to make Torch tensors Numpy arrays is to first move them to CPU, then detach any autograd part, and then make them numpy arrays:"
"The safe way to make Torch tensors Numpy arrays is first to move them to the CPU, then detach any autograd part and then make them NumPy arrays:"
]
},
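For example (standard PyTorch calls; `b_np` is just a name for the result):

```python
# cpu() is a no-op when the tensor already lives on the CPU;
# detach() drops the autograd graph so that numpy() is allowed.
b_np = b.cpu().detach().numpy()
```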
{
"id": "bc273b6d",
"metadata": {},
"source": [
"Next, we define the loss function that takes no parameters as an input and which outputs the loss (a tensor with only one real number as a value). If you want to evaluate the value of loss for different values of ${\\bf{b}}$ you must update the values in the corresponding tensor.\n",
"Next, we define the loss function that takes no parameters as an input and outputs the loss (a tensor with only one real number as a value). If you want to evaluate the value of loss for different values of ${\\bf{b}}$ you must update the values in the corresponding tensor.\n",
"\n",
"It is important to use only Torch arithmetic operations that support autograd. Luckily, there are enough operations to cover most needs. Instead of `sum` method in the [Tensor object](https://pytorch.org/docs/stable/tensors.html) as in the first example below we can alternatively use [torch.sum](https://pytorch.org/docs/stable/generated/torch.sum.html) (both of which supports torch tensors and autograd), but we cannot use, e.g., [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) (which does not support torch tensors and autograd)."
"It is essential to use only Torch arithmetic operations that support autograd. Luckily, there are enough operations to cover most needs. Instead of the `sum` method in the [Tensor object](https://pytorch.org/docs/stable/tensors.html) as in the first example below, we can alternatively use [torch.sum](https://pytorch.org/docs/stable/generated/torch.sum.html) (both of which support torch tensors and autograd), but we cannot use, e.g., [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) (which does not support torch tensors and autograd)."
]
},
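A sketch of such a loss function, written as a closure over the tensors defined above:

```python
def loss_fn():
    # Only autograd-aware torch operations are used here.
    residual = yt - Xt @ b
    return (residual**2).sum() + (b**2).sum() / 10
```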
{
"id": "9af4e9da",
"metadata": {},
"source": [
"If we want to have the loss value as a real number then the correct procedure is to first move the tensor to CPU (this matters if we use GPU, otherwise it is a null operation), then detach the autograd component, and then take the only item out as a real number:"
"If we want the loss value as a real number, the correct procedure is first to move the tensor to the CPU; this matters if we use GPU; otherwise, it is a null operation. Afterwards, we can detach the autograd component and take the only item as a real number."
]
},
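A sketch of this extraction:

```python
# .item() pulls the single scalar out of a one-element tensor.
loss_value = loss_fn().cpu().detach().item()
print(loss_value)
```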
{
"id": "7c8fe981",
"metadata": {},
"source": [
"We use the helper function `LBFGS` defined above to do the optimization. We need to give as parameters the loss function and a list of tensors to be optimized. As a result, the value of the tensor ${\\bf{b}}$ is updated to the value that minimizes the loss!"
"We use the helper function `LBFGS` defined above to do the optimization. We need to give as parameters the loss function and a list of tensors to be optimized. As a result, the tensor ${\\bf{b}}$ is updated to the value that minimizes the loss!"
]
},
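A sketch of the call, assuming the `LBFGS` helper sketched above (which, in my version, also returns the optimiser object):

```python
optimiser = LBFGS(loss_fn, [b])          # updates b in place
print(b.cpu().detach().numpy())          # optimised coefficients
print(loss_fn().cpu().detach().item())   # loss at the optimum
```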
{
"id": "47801939",
"metadata": {},
"source": [
"The optimum value of the loss function is the same as in the first example with Numpy and standard Scipy optimization function."
"The optimum value of the loss function is the same as in the first example with Numpy and the standard Scipy optimization function."
]
},
{
"id": "c2dcb966",
"metadata": {},
"source": [
"Again, it is good to check that the optimization converged successfully:"
"Again, it is good to check that the optimisation converged successfully:"
]
},
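One way to do this is to inspect the optimiser's internal state. The state keys below are an implementation detail of `torch.optim.LBFGS` rather than a documented API, so treat this as an assumption:

```python
# LBFGS stores per-parameter state such as the number of iterations it ran.
state = optimiser.state[b]
print(state.get("n_iter"), state.get("func_evals"))
```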
{
"id": "cf824a75",
"metadata": {},
"source": [
"The optimization took 6 iterations (cutoff being 500). Therefore, the optimization was probably terminated due to convergence to a local minumum and we should be fine. If there is no convergence you may need to increase the cutoff or study the matter further (e.g., the loss to be optimized could be badly behaving)."
"The optimisation took six iterations (cutoff being 500). Therefore, the optimisation was probably terminated due to convergence to a local minimum, and we should be fine. If there is no convergence, you may need to increase the cutoff or study the matter further (e.g., the loss to be optimised could be badly behaving)."
]
},
{
"cell_type": "markdown",
"id": "91158601",
"metadata": {},
"source": [
"## Addendum: differences between \"conventional\" optimization and optimization in deep learning\n",
"## Addendum: differences between \"conventional\" optimisation and optimisation in deep learning\n",
"\n",
"The idea in optimization is to find parameter values that minimize the value of a given target function (loss). When the parameters of the target function are real-valued numbers then typically gradient-based optimizers are used. \n",
"Optimisation aims to find parameter values that minimise the value of a given target function (loss). When the parameters of the target function are real-valued numbers, then typically, gradient-based optimisers are used. \n",
"\n",
"In deep learning applications the loss to be optimized is typically, e.g., classification error of the deep learning networks and the parameters are weights in the network. Below, I list some practical differences between optimization in deep learning and more traditional optimization problems.\n",
"In deep learning applications, the loss to be optimised is typically, e.g., classification error of the deep learning networks and the parameters are weights in the network. Below, I list some practical differences between optimisation in deep learning and more traditional optimisation problems.\n",
"\n",
"\n",
"### Stochastic gradient algorithms are more popular in deep learning\n",
"### Stochastic gradient algorithms are more prevalent in deep learning\n",
"\n",
"Deep learning problems are typically high dimensional (meaning there are lots of parameters). Stochastic gradient -based algorithms scale well for very high-dimensional datasets, while more conventional optimization methods may become too slow. However, for lower-dimensional problems the stochastic gradient -based algorithms may be slower to converge. LBFGS (which is not based on stochastic gradient) used above is very good conventional optimizer and me be a better default choice for conventional problems with a reasonable number of parameters to be optimised.\n",
"Deep learning problems are typically high dimensional (meaning there are many parameters). Stochastic gradient-based algorithms scale well for high-dimensional datasets, while more conventional optimisation methods may become too slow. However, the stochastic gradient-based algorithms may be slower to converge for lower-dimensional problems. LBFGS (which is not based on stochastic gradient) is an excellent conventional optimiser and might be a better default choice for traditional problems with a reasonable number of parameters to be optimised.\n",
"\n",
"\n",
"### In deep learning we do not want to find the minimum\n",
"### In deep learning, we do not want to find the minimum\n",
"\n",
"In deep learning we do not usually want to find the parameter values that minimize the loss, because this may result into overfitting to the training data. Instead, gradient optimization is typically run iteratively step by step. The optimization is stopped when validation loss stops decreasing. In a normal Torch workflow `optimiser.step(closure)` would be run repeatedly until validation loss stops decreasing. \n",
"In deep learning, we do not usually want to find the parameter values that minimise the loss because this may result in overfitting the training data. Instead, gradient optimisation is typically run iteratively step by step. The optimisation is stopped when validation loss stops decreasing. A typical Torch workflow ` optimiser.step(closure)` would be run repeatedly until validation loss stops falling. \n",
"\n",
"A more traditional approach for optimization (used also by the Scipy `minimize` above) is to run the optimizer until the pre-defined stopping criteria are met, which typically means that the solution is no longer improved indicating that the optimizer has found a local minimum of the loss function; this is what our LBFGS helper function does. We need to run LBFGS optimizer only one step, which is in most cases enough to converge, because the default maximum number of iterations within step is in LBFGS function set to 500, Torch default being 20. Often, the optimizer stops before 500 steps are used after convergence, but you should check this.\n",
"A more traditional approach for optimisation (also used by the Scipy `minimise` above) is to run the optimiser until the pre-defined stopping criteria are met. It typically means that the solution is no longer improved, indicating that the optimiser has found a local minimum of the loss function; this is what our LBFGS helper function does. We need to run LBFGS optimiser only one step, which is, in most cases, enough to converge because the default maximum number of iterations within the step is in the LBFGS function set to 500, Torch default being 20. Often, the optimiser stops before 500 steps are used after convergence, but you should check this.\n",
"\n",
"### In deep learning speed is considered more important than stability or robustness\n",
"### In deep learning, speed is considered more important than stability or robustness\n",
"\n",
"In deep learning stability or robustness of the algorithm is often considered less important than scalability. If the optimization does not converge it can be just restarted with different parameters, while in a more conventional setup you would be happy if the optimizer behaves predictably and you do not have to fiddle with parameters. Therefore line search is not by default used in Torch LBFGS optimizer. I have added line search option, which guarantees that the value of the loss function does not increase at any iteration. This results to better numerical stability with the penalty of slightly longer runtime."
"In deep learning, the stability or robustness of the algorithm is often considered less important than scalability. If the optimisation does not converge, it can be just restarted with different parameters, while in a more conventional setup, you would be happy if the optimiser behaves predictably. You do not have to fiddle with the parameters. Therefore line search is not by default used in Torch LBFGS optimiser. I have added a line search option, which guarantees that the value of the loss function does not increase at any iteration, resulting in better numerical stability with a slightly longer runtime penalty."
]
}
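As a rough, hypothetical illustration of the early-stopping workflow described in the addendum (not part of the notebook; `val_loss_fn` is a stand-in for whatever validation loss you track):

```python
def fit_with_early_stopping(optimiser, closure, val_loss_fn, max_epochs=1000, patience=5):
    """Run one optimiser step at a time; stop when the validation loss stops improving."""
    best, bad = float("inf"), 0
    for _ in range(max_epochs):
        optimiser.step(closure)
        val = float(val_loss_fn())
        if val < best:
            best, bad = val, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best
```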
],