Physics 212, 2019: Lecture 17
Avoid blind fits
Follow the Optimization notebook for this section.
As we have stressed again and again, there are no guarantees that a nonlinear optimization routine is going to be able to find a global minimum, or even a "good" minimum, where, for example, the curve that we are trying to fit passes through the data points. More commonly, you will find a local minimum, where the parameter values produce a slightly better fit than the nearby parameters, but the fit overall will be bad and it won't make much sense. To avoid the problem, one must again put their thinking hat on, and try to figure out what reasonable values of parameters are. Remember that every optimization algorithm requires an initial guess from which to start searching, and if the initial guess is close to a good optimum, then the chances that the optimum will be found are much higher.
How do we find such good initial guesses? The main idea is, again, to explore special cases. The curves that you fit to data are generated by models, and it is frequently the case that different parts of the curve are affected differently by different model parameters. The trick is to find different ranges of the futted curves that look nearly linear, with the parameters of the full model corresponding to parameters of the linear fit. The beauty of linear models is that there is just one global minimum, which is easy to find. So one can do linear least squares fitting, find parameter guesses from pieces of the full curve, and then use these guesses as initial conditions for the full fit.
The following Module 3 Python script shows how we can do this using an example of a Michaelis-Menten enzymatic kinetics, where the change in the number of substrates is given by Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{dS}{dt}=-\frac{VS}{K_M+S}} . There are two unknown parameters -- the maximum reaction velocity Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle V} and the Michaelis constant Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K_M} . We notice that if the substrate concentration is large, then . Thus one can fit the initial part of the vs. data, and the slope of the linear fit will give us a good estimate of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle V} . If the substrate concentration Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} is small, then Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{dS}{dt}=-\frac{VS}{K_M}} . This has the usual exponential solution Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S\propto e^{-V/K_M t}} . This exponential curve becomes linear in the semi-logarithmic coordinates: Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \log S = C-V/K_M t} , where Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle C} is some constant. One can do a linear fit of this semi-logarithmic data and get a good initial guess for . But then, knowing Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle V} from the first part of the curve, one can get an estimate of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle K_M} too. Finally, with these estimates, one can then do a 2-d global fit to find really good parameter values.
Which data to fit?
The Optimization notebook also shows that fits often depend on what exactly is being fitted. If data comes as Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (x_i,y_i)} pairs, one can fit these data, or their transformed versions, such as Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\log x_i,y_i)} , Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\log x_i,\log y_i)} , , Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (x_i^2,y_i^3)} , or any other transformed combination. Which choice should we make? The sum-of-squares objective function assumes that the noise is of the same scale for every Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} . And you should transform your data (typically the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y} coordinate) to satisfy this property. For example, for exponentially decaying data with multiplicative noise, such transformation is Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (x_i,\log y_i)} -- and it produces much better fits, as the script shows. Similarly, sum-of-squares algorithms usually work much better when the distribution of Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} points is not skewed, and there are no outliers -- and one often can achieve this by transforming the variable as well.