Physics 434, 2014: Central limit theorem

From Ilya Nemenman: Theoretical Biophysics @ Emory

Back to the main Teaching page.

Back to Physics 434, 2014: Information Processing in Biology.

The Central Limit Theorem will be a crucial tool for us in later studies of random walks, diffusion, and related phenomena. It also explains why we are so fascinated with Gaussian distributions. Roughly speaking, the theorem states that a sum of many i.i.d. (independent and identically distributed) random variables with finite variances approaches a Gaussian distribution. In my opinion, this is one of the most remarkable laws in probability theory. It is supposed to explain why experimental noises are often Gaussian distributed. It also provides an explanation for why universalities in physical laws exist -- that is, why distinct, seemingly very different phenomena are often described by the same simplified physical models. Richard Feynman, in his Messenger Lectures, has a very nice discussion of the subject.

Theorem statement

More precisely, suppose $x_1, x_2, \dots, x_N$ are i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Then the CLT says that $z = \frac{\sum_{i=1}^N x_i - N\mu}{\sigma\sqrt{N}}$ is distributed according to $P(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$ (called the standard normal distribution), provided $N$ is sufficiently large. Before proving this, we need to learn, as an aside, how to do Gaussian integrals.
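As a quick numerical sanity check (a minimal sketch, separate from the CLT.m code linked below; the unit exponential distribution and the values of $N$ and $M$ are arbitrary choices made here for illustration), one can standardize sums of i.i.d. variables in Matlab and verify that the first few moments look standard normal:

    % Minimal sketch: standardize sums of N i.i.d. unit exponentials
    % (mu = 1, sigma = 1) and check that z looks standard normal.
    N = 1000;                          % terms per sum
    M = 1e4;                           % number of independent sums
    x = -log(rand(N, M));              % unit exponentials via inverse CDF
    z = (sum(x, 1) - N) / sqrt(N);     % (sum - N*mu) / (sigma*sqrt(N))
    % for a standard normal, these should be near 0, 1, and 0:
    fprintf('mean = %.3f, var = %.3f, skewness = %.3f\n', ...
            mean(z), var(z), mean(z.^3));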

The Gaussian Integral

Suppose one wants to calculate $Z = \int_{-\infty}^{\infty} e^{-x^2/2\sigma^2}\,dx$. One writes its square as a two-dimensional integral and changes to polar coordinates: $Z^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^2+y^2)/2\sigma^2}\,dx\,dy = \int_0^{2\pi} d\phi \int_0^{\infty} e^{-r^2/2\sigma^2}\,r\,dr = 2\pi\sigma^2$. That is, $Z = \sqrt{2\pi\sigma^2}$. In other words, the Gaussian probability distribution $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-(x-\mu)^2/2\sigma^2}$, indeed, normalizes to one.
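This normalization is easy to confirm numerically (a sketch using base Matlab's integral(); the values of $\mu$ and $\sigma$ below are arbitrary):

    % Verify numerically that the Gaussian pdf integrates to one.
    mu = 0.5; sigma = 2;                   % arbitrary parameter choices
    P = @(x) exp(-(x - mu).^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2);
    integral(P, -Inf, Inf)                 % should return 1 to numerical precision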

Suppose now we want to calculate the mean or the variance of the Gaussian distribution, or other moments or cumulants. It turns out that it is easier to do this by first calculating its MGF (or the CGF), and then taking the derivatives. Completing the square in the exponent gives $M(t) = \langle e^{tx}\rangle = \int_{-\infty}^{\infty} \frac{dx}{\sqrt{2\pi\sigma^2}}\, e^{tx - (x-\mu)^2/2\sigma^2} = e^{\mu t + \sigma^2 t^2/2}$.
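One can check this closed form against the integral definition of the MGF (a sketch; the parameter values are again arbitrary):

    % Compare the MGF computed by direct integration with the formula
    % M(t) = exp(mu*t + sigma^2*t^2/2) derived by completing the square.
    mu = -1; sigma = 1.5; t = 0.7;         % arbitrary parameter choices
    P = @(x) exp(-(x - mu).^2 / (2*sigma^2)) / sqrt(2*pi*sigma^2);
    M_numeric = integral(@(x) exp(t*x) .* P(x), -Inf, Inf)
    M_formula = exp(mu*t + sigma^2 * t^2 / 2)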

In other words, the Moment Generating Function of a Gaussian distribution also has a polynomial of order two in the exponent, now in $t$: the CGF is $K(t) = \ln M(t) = \mu t + \sigma^2 t^2/2$, so that the mean and the variance are the only nonzero cumulants of a Gaussian.

The Central Limit Theorem

We prove only a special case of this theorem, assuming that none of the cumulants of the i.i.d. variables is infinite, and hence the moment generating functions exist. This is a stronger assumption than the finiteness of variances, but it will be sufficient for our purposes. Remember that, for independent variables, MGFs multiply (and CGFs add). Thus, for $z = \left(\sum_{i=1}^N x_i - N\mu\right)/(\sigma\sqrt{N})$, the CGF is $K_z(t) = -\frac{\sqrt{N}\mu t}{\sigma} + N K_x\!\left(\frac{t}{\sigma\sqrt{N}}\right) = -\frac{\sqrt{N}\mu t}{\sigma} + N\left[\frac{\mu t}{\sigma\sqrt{N}} + \frac{\sigma^2}{2}\frac{t^2}{\sigma^2 N} + O(N^{-3/2})\right] = \frac{t^2}{2} + O(N^{-1/2}) \to \frac{t^2}{2}$, which is the CGF of the standard normal distribution, and this proves the theorem.
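The convergence of the CGF can be watched explicitly (a sketch; unit exponentials are chosen here only because their CGF, $K_x(t) = -\ln(1-t)$, has a simple closed form):

    % For N i.i.d. unit exponentials (mu = sigma = 1), the CGF of the
    % standardized sum is K_z(t) = -sqrt(N)*t + N*K_x(t/sqrt(N)), with
    % K_x(t) = -log(1-t).  It should approach t^2/2 as N grows.
    t = 0.5;
    for N = [1 10 100 1000 10000]
        Kz = -sqrt(N)*t - N*log(1 - t/sqrt(N));
        fprintf('N = %5d: K_z = %.5f  (limit t^2/2 = %.5f)\n', N, Kz, t^2/2);
    end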

Generalizations

The theorem holds also for the following cases:

  • If the variables have different means and variances, but all the variances are bounded. Convergence will be slower, though (see the sketch after this list).
  • If only the first two moments of the constituent variables are defined, but the higher ones are not.
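A minimal sketch of the first generalization (the three component distributions below are an arbitrary choice made for illustration): summing variables drawn from different distributions, with different variances, still produces a Gaussian after standardization.

    % Each of the N steps adds one uniform, one centered exponential, and
    % one binary variable; the per-step variance is 1/12 + 1 + 1.
    N = 300; M = 1e4;
    S = zeros(1, M);
    for i = 1:N
        S = S + (rand(1, M) - 0.5) ...          % uniform:     var = 1/12
              + (-log(rand(1, M)) - 1) ...      % exponential: var = 1
              + (2*(rand(1, M) < 0.5) - 1);     % binary +-1:  var = 1
    end
    z = S / sqrt(N * (1/12 + 1 + 1));
    fprintf('mean(z) = %.3f, var(z) = %.3f\n', mean(z), var(z));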

In the case when the variance of the constituent variables is not defined, the central limit distribution is not a Gaussian. Some hints about what it would be were given in a homework problem. Namely, a sum of two Lorentzian variables is also a Lorentzian, suggesting that the Lorentzian is also a limit distribution of some sort. Indeed, there's a whole class of distributions with power law tails (the Lévy stable distributions), which are the limit distributions for sums of variables with power law tails.
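A quick simulation illustrates the stability of the Lorentzian (a sketch; the Cauchy samples are generated by the inverse-CDF method, and $N$ and $M$ are arbitrary):

    % Sums of N standard Lorentzian (Cauchy) variables, rescaled by 1/N
    % (not 1/sqrt(N)!), are again standard Lorentzian for any N.  Since a
    % Lorentzian has no mean or variance, compare quartiles instead.
    N = 100; M = 1e4;
    x = tan(pi * (rand(N, M) - 0.5));      % standard Cauchy samples
    z = sort(sum(x, 1) / N);
    fprintf('median = %.3f, IQR = %.3f  (standard Cauchy: 0 and 2)\n', ...
            z(round(0.5*M)), z(round(0.75*M)) - z(round(0.25*M)));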

Simulations

The attached code simulates the CLT numerically for sums of many exponential or binary variables: CLT.m

Relating this back to our favorite E. coli, we notice that the motion of the bacterium consists of many run-and-tumble steps, each of which has a finite variance. Thus the probability distribution of the end points of E. coli motion over a sufficiently long period of time is a Gaussian. To illustrate this, we will show in a homework that $\langle x^2 \rangle \propto t$ for E. coli. This is diffusive motion, just like the diffusion of small molecules. We demonstrate this by numerical simulations (homework, and also see this Matlab code).
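A minimal sketch of such a simulation (separate from the linked course code; the exponential run lengths and equal-probability one-dimensional tumble directions are modeling assumptions made here only for illustration):

    % 1D caricature of E. coli motion: each step is a run of exponentially
    % distributed length in a random (+/-1) direction.  Each step has a
    % finite variance (= 2 here), so the endpoints become Gaussian and
    % <x^2> grows linearly with the number of steps (diffusive scaling).
    M = 1e4;                                   % number of bacteria
    for steps = [10 100 1000]
        x = zeros(1, M);
        for s = 1:steps
            runs = -log(rand(1, M));           % exponential run lengths
            dirs = 2*(rand(1, M) < 0.5) - 1;   % random tumble direction
            x = x + runs .* dirs;
        end
        fprintf('steps = %4d: <x^2>/steps = %.3f\n', steps, mean(x.^2)/steps);
    end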