Physics 212, 2018: Lecture 15

Back to the main Teaching page.

Multidimensional optimization

Multidimensional optimization is pretty similar to one-dimensional cases we considered before, but just much harder. All the same problems -- multiple minimal, flat regions of the parameter space -- show up, but now there are a lot more ways for them to show up, because the search space is so large. It is simply impossible to find a minimum by brute force search. Indeed, if one only were to try just about 10 points per variable we are optimizing with respect to, and there are ${\displaystyle d}$ such variables, one would need to explore ${\displaystyle 10^{d}}$ points, and this is infeasible for all but the smallest ${\displaystyle d}$. If the optimization is non-convex, there's simply no sure way of finding a global minimum, and even finding a local one is hard. And yet there are some good methods that, in practice, work quite well for not too hard problems.

As in the case of one-dimensional optimization, these methods are divided into ones that require being able to evaluate the function being minimized only, ${\displaystyle {\mathcal {L}}}$, the function and its derivatives, and the function and its first and second derivatives.

Simplex optimization

This is the fool-proof method, also known as the Nelder-Mead agorithm, which allows one to find a local minimum, and should be the first, go-to method that you should use when optimizing a function about which you know very little. It is essentially equivalent to rolling downhill, without knowing derivatives of a function. In one dimension, to now which direction is down requires knowing the function at two points, and then comparing the two values. In two dimensions, to do the same would require knowing a function in three points (on a triangle). Indeed, if you only had two values, you'd be able to find the smaller value of the two, but nobody prevents a minimum of the function to be away from the line that connects these two points. To explore all directions, requires knowing the function on triangle. In three dimensions, one would want to know at least four values to know which direction is down -- and this would be knowing the function in vertexes of some tetrahendron. More generally, to know which way is down in ${\displaystyle d}$ dimensions requires knowith the function in ${\displaystyle d+1}$ points that are nongenerate (that is form an object, called a simplex, that has a nonzero ${\displaystyle d}$-dimensional volume). Then to roll downhill requires finding the highest point of the simplex and reflecting them through the remaining points, or, in other words, stepping downhill. If the step size is small, and the optimization proceeds slowly, then one may grow the simplex. Conversely, when one approaches the minimum, or tries to explore the bottom of a very small valley, one may need to shrink the simplex. There are well-established heuristics that suggest when to make such choices, and these can be found in standard textbooks. For example, the description of the algorithm in the Numerical Recipes textbook, Ch. 10.4, is very good.

When derivatives of the objective function can be evaluated, one can calculate the gradient -- the direction of the maximum change of the function, and move in that direction. One should be careful to choose a good step size to move by, so that we do not step over a minimum, and, at the same time, make steps large enough to not waste computing time. There are additional considerations, so that sometimes one should not move down the gradient directly, if that direction has been explored recently. As for the simplex method, this algorithm and its relatives are very well understood, and the description can be found in Numerical Recipes, Ch. 10.5, 10.6.

Newton method

If one can evaluate not just the first, but also the second derivative of the objective function, then one can estimate where the minimum of the function is using its Taylor expansion. In 1-d, this we studied this as the Newton method, and it translates with few changes to multi-dimensional optimization as well.

Nonlinear Least Squares

Sometimes we know various properties of the function that we are trying to optimize, and these properties can be used to develop good optimization algorithms. One special example that we see often in scientific application is fitting curves to data, when the objective function is a sum of squares -- a sum of squared errors between the fitted model and the actual data ${\displaystyle {\mathcal {L}}=\sum _{i}(y_{i}-{\hat {y}}(x_{i},\theta ))^{2}}$, where, as always ${\displaystyle y_{i}}$ are the measured values, and ${\displaystyle {\hat {y}}(x_{i},\theta )}$ are the fitted values at points ${\displaystyle x_{i}}$ with the parameter values ${\displaystyle \theta }$. Fitting the curve means minimizing ${\displaystyle {\mathcal {L}}}$, or, in other words, finding the {\em least squares} curve to pass through data. This is similar to the linear reegression, which we solved using linear least squares approach. However, in this case the dependence of ${\displaystyle {\mathcal {L}}}$ on the unknown parameter ${\displaystyle \theta }$ is nonlinear. Nonetheless, can the fact that the objective function is a sum of squares be used to develop a minimization algorithm that will help us find the solution faster? Let's apply the Newton alfgorithm for minimization to the sum of squares and see where it leads (and let's keep things simple and focus on just one dimension).

Gauss-Newton algorithm

Suppose that we already are near a minimum of the objective function. Then the entire ${\displaystyle {\mathcal {L}}}$ is quadratic in the deviation of ${\displaystyle \theta }$ from the minimum. But each of the individual residual terms is not (the minimum for each residual would be zero). We can thus approximate each residual term as ${\displaystyle r_{i}(\theta )=r_{i}(\theta _{0})+{\frac {dr_{i}}{d\theta }}(\theta -\theta _{0})}$. This turns the nonlinear least squares into the locally linear lease squares, which is an easily solvable problem. One then solves the problem, and repeats recursively the solution till the minimum is reached. See more details here.

Levenberg-Marquardt algorithm

The Newton-Gauss method works well when our initial guess is close enough to the minimum of the objective function so that the latter can be viewed as a quadratic function of the parameters. This is not always the case. To address the issues, the Levenberg-Marquardt algorithm, see https://en.wikipedia.org/wiki/Levenberg–Marquardt_algorithm , interpolates between the Newton-Gauss method and gradient descent. It is skewed towards the Newton-Gauss method when the latter works well, and skews towards gradient descent otherwise. The function curve_fit from the optimize module, http://docs.scipy.org/doc/scipy/reference/optimize.html uses the Levenberg-Marquardt method by default when performing least squares fitting, and we will be using this method in our work as well.

Avoid blind fits

As we have stressed again and again, there are no guarantees that a nonlinear optimization routine is going to be able to find a global minimum, or even a "good" minimum, where, for example, the curve that we are trying to fit passes through the data points. More commonly, you will find a local minimum, where the parameter values produce a slightly better fit than the nearby parameters, but the fit overall will be bad and it won't make much sense. To avoid the problem, one must again put their thinking hat on, and try to figure out what reasonable values of parameters are. Remember that every optimization algorithm requires an initial guess from which to start searching, and if the initial guess is close to a good optimum, then the chances that the optimum will be found are much higher.

How do we find such good initial guesses? The main idea is, again, to explore special cases. The curves that you fit to data are generated by models, and it is frequently the case that different parts of the curve are affected differently by different model parameters. The trick is to find different ranges of the futted curves that look nearly linear, with the parameters of the full model corresponding to parameters of the linear fit. The beauty of linear models is that there is just one global minimum, which is easy to find. So one can do linear least squares fitting, find parameter guesses from pieces of the full curve, and then use these guesses as initial conditions for the full fit.

The following Module 3 Python script shows how we can do this using an example of a Michaelis-Menten enzymatic kinetics, where the change in the number of substrates is given by ${\displaystyle {\frac {dS}{dt}}=-{\frac {VS}{K_{M}+S}}}$. There are two unknown parameters -- the maximum reaction velocity ${\displaystyle V}$ and the Michaelis constant ${\displaystyle K_{M}}$. We notice that if the substrate concentration ${\displaystyle S}$ is large, then ${\displaystyle {\frac {dS}{dt}}\approx -V}$. Thus one can fit the initial part of the ${\displaystyle S}$ vs. ${\displaystyle t}$ data, and the slope of the linear fit will give us a good estimate of ${\displaystyle V}$. If the substrate concentration ${\displaystyle S}$ is small, then ${\displaystyle {\frac {dS}{dt}}=-{\frac {VS}{K_{M}}}}$. This has the usual exponential solution ${\displaystyle S\propto e^{-V/K_{M}t}}$. This exponential curve becomes linear in the semi-logarithmic coordinates: ${\displaystyle \log S=C-V/K_{M}t}$, where ${\displaystyle C}$ is some constant. One can do a linear fit of this semi-logarithmic data and get a good initial guess for ${\displaystyle V/K_{M}}$. But then, knowing ${\displaystyle V}$ from the first part of the curve, one can get an estimate of ${\displaystyle K_{M}}$ too. Finally, with these estimates, one can then do a 2-d global fit to find really good parameter values.

Which data to fit?

The Module 3 Python script also shows that fits often depend on what exactly is being fitted. If data comes as ${\displaystyle (x_{i},y_{i})}$ pairs, one can fit these data, or their transformed versions, such as ${\displaystyle (\log x_{i},y_{i})}$, ${\displaystyle (\log x_{i},\log y_{i})}$, ${\displaystyle (x_{i}^{2},y_{i})}$, ${\displaystyle (x_{i}^{2},y_{i}^{3})}$, or any other transformed combination. Which choice should we make? The sum-of-squares objective function assumes that the noise is of the same scale for every ${\displaystyle x}$. And you should transform your data to satisfy this property. For example, for exponentially decaying data with multiplicative noise, such transofrmation is ${\displaystyle (x_{i},\log y_{i})}$ -- and it produces much better fits, as the script shows.