# Physics 212, 2019: Lecture 16

Back to the main Teaching page.

## Multidimensional optimization

Multidimensional optimization is pretty similar to one-dimensional cases we considered before, but just much harder. All the same problems -- multiple minimal, flat regions of the parameter space -- show up, but now there are a lot more ways for them to show up, because the search space is so large. It is simply impossible to find a minimum by brute force search. Indeed, if one only were to try just about 10 points per variable we are optimizing with respect to, and there are ${\displaystyle d}$ such variables, one would need to explore ${\displaystyle 10^{d}}$ points, and this is infeasible for all but the smallest ${\displaystyle d}$. If the optimization is non-convex, there's simply no sure way of finding a global minimum, and even finding a local one is hard. And yet there are some good methods that, in practice, work quite well for not too hard problems.

As in the case of one-dimensional optimization, these methods are divided into ones that require being able to evaluate the function being minimized only, ${\displaystyle {\mathcal {L}}}$, the function and its derivatives, and the function and its first and second derivatives. They are additionally divided into stochastic and deterministic algorithms (the latter is the focus of our class).

### Simplex optimization

This is a relatively fool-proof method, also known as the Nelder-Mead algorithm, which allows one to find a local minimum, and should be the first, go-to method that you should use when optimizing a function about which you know very little. It is essentially equivalent to rolling downhill, without knowing derivatives of a function. In one dimension, to now which direction is down requires knowing the function at two points, and then comparing the two values. In two dimensions, to do the same would require knowing a function in three points (on a triangle). Indeed, if you only had two values, you'd be able to find the smaller value of the two, but nobody prevents a minimum of the function to be away from the line that connects these two points. To explore all directions, requires knowing the function on triangle. In three dimensions, one would want to know at least four values to know which direction is down -- and this would be knowing the function in vertexes of some tetrahendron. More generally, to know which way is down in ${\displaystyle d}$ dimensions requires knowing the function in ${\displaystyle d+1}$ points that are nongenerate (that is form an object, called a simplex, that has a nonzero ${\displaystyle d}$-dimensional volume). Then to roll downhill requires finding the highest point of the simplex and reflecting them through the remaining points, or, in other words, stepping downhill. If the step size is small, and the optimization proceeds slowly, then one may grow the simplex. Conversely, when one approaches the minimum, or tries to explore the bottom of a very small valley, one may need to shrink the simplex. There are well-established heuristics that suggest when to make such choices, and these can be found in standard textbooks. For example, the description of the algorithm in the Numerical Recipes textbook, Ch. 10.4, is very good.

When derivatives of the objective function can be evaluated (either analytically or numerically), one can calculate the gradient -- the direction of the maximum change of the function, and move in that direction. One should be careful to choose a good step size to move by, so that we do not step over a minimum, and, at the same time, make steps large enough to not waste computing time. There are additional considerations, so that sometimes one should not move down the gradient directly, if that direction has been explored recently. As for the simplex method, this algorithm and its relatives are very well understood, and the description can be found in Numerical Recipes, Ch. 10.5, 10.6.

### Newton method

If one can evaluate not just the first, but also the second derivative of the objective function, then one can estimate where the minimum of the function is using its Taylor expansion. In 1-d, this we studied this as the Newton method, and it translates with few changes to multi-dimensional optimization as well.

## Nonlinear Least Squares

Sometimes we know various properties of the function that we are trying to optimize, and these properties can be used to develop good optimization algorithms. One special example that we see often in scientific application is fitting curves to data, when the objective function is a sum of squares -- a sum of squared errors between the fitted model and the actual data ${\displaystyle {\mathcal {L}}=\sum _{i}(y_{i}-{\hat {y}}(x_{i},\theta ))^{2}}$, where, as always ${\displaystyle y_{i}}$ are the measured values, and ${\displaystyle {\hat {y}}(x_{i},\theta )}$ are the fitted values at points ${\displaystyle x_{i}}$ with the parameter values ${\displaystyle \theta }$. Fitting the curve means minimizing ${\displaystyle {\mathcal {L}}}$, or, in other words, finding the {\em least squares} curve to pass through data. This is similar to the linear reegression, which we solved using linear least squares approach. However, in this case the dependence of ${\displaystyle {\mathcal {L}}}$ on the unknown parameter ${\displaystyle \theta }$ is nonlinear. Nonetheless, can the fact that the objective function is a sum of squares be used to develop a minimization algorithm that will help us find the solution faster? Let's apply the Newton alfgorithm for minimization to the sum of squares and see where it leads (and let's keep things simple and focus on just one dimension).

### Gauss-Newton algorithm

Suppose that we already are near a minimum of the objective function. Then the entire ${\displaystyle {\mathcal {L}}}$ is quadratic in the deviation of ${\displaystyle \theta }$ from the minimum. But each of the individual residual terms is not (the minimum for each residual would be zero). We can thus approximate each residual term as ${\displaystyle r_{i}(\theta )=r_{i}(\theta _{0})+{\frac {dr_{i}}{d\theta }}(\theta -\theta _{0})}$. This turns the nonlinear least squares into the locally linear lease squares, which is an easily solvable problem. One then solves the problem, and repeats recursively the solution till the minimum is reached. See more details here.

### Levenberg-Marquardt algorithm

The Newton-Gauss method works well when our initial guess is close enough to the minimum of the objective function so that the latter can be viewed as a quadratic function of the parameters. This is not always the case. To address the issues, the Levenberg-Marquardt algorithm, see https://en.wikipedia.org/wiki/Levenberg–Marquardt_algorithm , interpolates between the Newton-Gauss method and gradient descent. It is skewed towards the Newton-Gauss method when the latter works well, and skews towards gradient descent otherwise. The function curve_fit from the optimize module, http://docs.scipy.org/doc/scipy/reference/optimize.html uses the Levenberg-Marquardt method by default when performing least squares fitting, and we will be using this method in our work as well.