# Physics 434, 2015: Introduction to Information Theory


References (Nemenman, 2012) and (Levchenko and Nemenman, 2014) will be useful reading.

• Setting up the problem: How do we measure information transmitted by a biological signaling system?
1. Dose-response curves examples in (Levchenko and Nemenman, 2014) show us that this should be a property of the entire joint distribution.
2. Information is the difference between the uncertainty before and after the measurement.
3. How do we define the uncertainty?
• Shannon's axioms and the derivation of entropy: if a variable ${\displaystyle x}$ is observed from a distribution ${\displaystyle P(x)}$, then the amount of information we gain from this observation must obey the following properties.
1. For a uniform distribution, the measure of information must grow with the cardinality: more equally likely outcomes mean more uncertainty.
2. The measure of information must be a continuous function of the distribution ${\displaystyle P(x)}$
3. The measure of information is additive. That is, for a fine graining of ${\displaystyle x}$ into ${\displaystyle \xi }$, we should have ${\displaystyle S[\xi ]=S[x]+\sum P(x)S[\xi |x]}$.
• Up to a multiplicative constant, the measure of information is then ${\displaystyle S=-\sum P\log P}$, which is also called the Boltzmann-Shannon entropy. And we fix the constant by defining the entropy of a uniform binary distribution to be 1. Then ${\displaystyle S=-\sum P\log _{2}P}$. The entropy is then measured in bits.
• Meaning of entropy: Entropy of 1 bit means that we have gained enough information to answer one yes or no (binary) question about the variable ${\displaystyle x}$.
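The definition above is easy to evaluate directly. The following is a minimal sketch (the function name and example distributions are illustrative, not from the lecture) computing ${\displaystyle S=-\sum P\log _{2}P}$ in bits:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin carries exactly 1 bit, matching the normalization chosen above.
print(entropy([0.5, 0.5]))   # 1.0
# A biased coin is less uncertain, so observing it teaches us less.
print(entropy([0.9, 0.1]))
```

Note the `pi > 0` guard: terms with zero probability contribute nothing, since ${\displaystyle p\log p\to 0}$ as ${\displaystyle p\to 0}$.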
• Properties of entropy (nonnegative, bounded, concave):
1. ${\displaystyle 0\leq S[X]\leq \log _{2}k}$, where ${\displaystyle k}$ is the cardinality of the distribution. Moreover, the first inequality becomes an equality iff the variable is deterministic (that is, one event has a probability of 1), and the second inequality is an equality iff the distribution is uniform.
2. Entropy is a concave function of the distribution.
3. Entropies of independent variables add.
4. Entropy is an extensive quantity: for a joint distribution ${\displaystyle P(x_{1},x_{2},\dots ,x_{n})}$, we can define an entropy rate ${\displaystyle S_{0}=\lim _{n\to \infty }S[X_{1},\dots ,X_{n}]/n}$.
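These properties can be verified numerically. A sketch (distributions and the mixture weight are arbitrary illustrative choices) checking the bounds, additivity for independent variables, and concavity:

```python
import math
from itertools import product

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Bounds: 0 <= S <= log2 k, with equality for deterministic / uniform distributions.
k = 4
assert entropy([1, 0, 0, 0]) == 0.0
assert abs(entropy([1 / k] * k) - math.log2(k)) < 1e-12

# Additivity: for independent X and Y, the joint entropy is the sum of marginals.
px, py = [0.2, 0.8], [0.3, 0.3, 0.4]
joint = [a * b for a, b in product(px, py)]
assert abs(entropy(joint) - (entropy(px) + entropy(py))) < 1e-12

# Concavity: the entropy of a mixture is at least the mixture of entropies.
p, q, lam = [0.9, 0.1], [0.2, 0.8], 0.5
mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
assert entropy(mix) >= lam * entropy(p) + (1 - lam) * entropy(q)
```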
• Differential entropy: a continuous variable ${\displaystyle x}$ can be discretized with a step ${\displaystyle \Delta x}$, and then the entropy is ${\displaystyle S[X]=-\sum P(x)\Delta x\log _{2}\left(P(x)\Delta x\right)\to -\int dx\,P(x)\log _{2}P(x)+\log _{2}1/\Delta x}$. This formally diverges at fine discretization: we need infinitely many bits to fully specify a continuous variable. The integral in the above expression is called the differential entropy, and whenever we write ${\displaystyle S[X]}$ for continuous variables, we mean the differential entropy.
• Entropy of a normal distribution with variance ${\displaystyle \sigma ^{2}}$ is ${\displaystyle S=1/2\log _{2}\sigma ^{2}+{\rm {const}}}$.
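The constant can be made explicit: the differential entropy of a normal distribution is ${\displaystyle S={\frac {1}{2}}\log _{2}(2\pi e\sigma ^{2})}$. A sketch checking this against a direct Riemann-sum evaluation of ${\displaystyle -\int dx\,P(x)\log _{2}P(x)}$ (the step size and integration cutoff are arbitrary illustrative choices):

```python
import math

def gaussian_diff_entropy_numeric(sigma, dx=1e-3, cut=10.0):
    """Riemann-sum approximation of -∫ p(x) log2 p(x) dx for a zero-mean normal."""
    s = 0.0
    x = -cut * sigma
    norm = sigma * math.sqrt(2 * math.pi)
    while x < cut * sigma:
        p = math.exp(-x * x / (2 * sigma**2)) / norm
        if p > 0:
            s -= p * math.log2(p) * dx
        x += dx
    return s

sigma = 2.0
exact = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)  # = 1/2 log2 σ² + const
print(gaussian_diff_entropy_numeric(sigma), exact)
```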
• Multivariate entropy is defined with summation/integration of log-probability over multiple variables, cf. entropy rate above.
• Conditional entropy is the entropy of the conditional distribution, ${\displaystyle S[X|y]=-\sum _{x}P(x|y)\log _{2}P(x|y)}$, typically averaged over the conditioning variable: ${\displaystyle \langle S[X|Y]\rangle _{y}=-\sum _{x,y}P(x,y)\log _{2}P(x|y)}$.
• Mutual information: what if we want to know about a variable ${\displaystyle x}$, but instead are measuring a variable ${\displaystyle y}$. How much are we learning about ${\displaystyle x}$ then? This is given by the difference of entropies of ${\displaystyle x}$ before and after the measurement: ${\displaystyle {\begin{array}{ll}I[X;Y]&=S[X]-\langle S[X|Y]\rangle _{y}\\&=S[X]+S[Y]-S[X,Y]\\&=\langle \log _{2}{\frac {P(x,y)}{P(x)P(y)}}\rangle \end{array}}}$.
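The last form of the definition, ${\displaystyle I[X;Y]=\langle \log _{2}{\frac {P(x,y)}{P(x)P(y)}}\rangle }$, translates directly into code. A minimal sketch (function name and example joint distributions are illustrative):

```python
import math

def mutual_information(pxy):
    """I[X;Y] in bits from a joint distribution given as a nested list pxy[i][j]."""
    px = [sum(row) for row in pxy]            # marginal P(x)
    py = [sum(col) for col in zip(*pxy)]      # marginal P(y)
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(pxy)
               for j, p in enumerate(row) if p > 0)

# Perfectly correlated binary variables share exactly 1 bit.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# Independent variables share no information.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```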
• Meaning of mutual information: a mutual information of 1 bit between two variables means that, by observing one of them, we can learn the answer to one yes or no (binary) question about the other.
• Properties of mutual information
1. Limits: ${\displaystyle 0\leq I[X;Y]\leq \min(S[X],S[Y])}$. Note that the first inequality becomes an equality iff the two variables are completely statistically independent.
2. Mutual information is well-defined for continuous variables.
3. Reparameterization invariance: for any invertible ${\displaystyle \xi =\xi (x),\,\eta =\eta (y)}$, the following is true: ${\displaystyle I[X;Y]=I[\Xi ;\mathrm {H} ]}$.
4. Data processing inequality: For ${\displaystyle P(x,y,z)=P(x)P(y|x)P(z|y)}$, ${\displaystyle I[X;Z]\leq \min(I[X;Y],I[Y;Z])}$. That is, information cannot get created in a transformation of a variable, whether deterministic or probabilistic.
5. Information rate: Information is also an extensive quantity, so that it makes sense to define an information rate ${\displaystyle I_{0}=\lim _{n\to \infty }I[X_{1},\dots ,X_{n};Y_{1}\dots Y_{n}]/n}$.
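The data processing inequality can be checked numerically for a small Markov chain ${\displaystyle X\to Y\to Z}$. A sketch (the channel matrices below are arbitrary illustrative choices):

```python
import math

def mi(pxy):
    """I[X;Y] in bits from a joint distribution given as a nested list."""
    px = [sum(r) for r in pxy]
    py = [sum(c) for c in zip(*pxy)]
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, r in enumerate(pxy) for j, p in enumerate(r) if p > 0)

# Binary Markov chain X -> Y -> Z: P(x,y,z) = P(x) P(y|x) P(z|y).
px = [0.4, 0.6]
py_x = [[0.8, 0.2], [0.3, 0.7]]   # P(y|x)
pz_y = [[0.9, 0.1], [0.4, 0.6]]   # P(z|y)

pxy = [[px[i] * py_x[i][j] for j in range(2)] for i in range(2)]
py = [sum(pxy[i][j] for i in range(2)) for j in range(2)]
pyz = [[py[j] * pz_y[j][k] for k in range(2)] for j in range(2)]
pxz = [[sum(px[i] * py_x[i][j] * pz_y[j][k] for j in range(2))
        for k in range(2)] for i in range(2)]

# Information cannot be created along the chain.
assert mi(pxz) <= min(mi(pxy), mi(pyz)) + 1e-12
```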
• Mutual information of a bivariate normal with a correlation coefficient ${\displaystyle \rho }$ is ${\displaystyle I=-1/2\log _{2}(1-\rho ^{2})}$.
• For Gaussian variables ${\displaystyle y=g(x+\eta )}$, where ${\displaystyle x}$ is the signal, ${\displaystyle y}$ is the response, and ${\displaystyle \eta }$ is the noise added to the input, ${\displaystyle I[X;Y]={\frac {1}{2}}\log _{2}\left(1+{\frac {\sigma _{x}^{2}}{\sigma _{\eta }^{2}}}\right)={\frac {1}{2}}\log _{2}(1+SNR)}$.
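The two Gaussian results are consistent with each other: for ${\displaystyle y=x+\eta }$ with independent Gaussians, ${\displaystyle \rho ^{2}=\sigma _{x}^{2}/(\sigma _{x}^{2}+\sigma _{\eta }^{2})}$, so the bivariate-normal formula reduces to the SNR formula. A sketch checking this (the variances are arbitrary illustrative values):

```python
import math

# Signal and noise variances (illustrative choices).
sx2, sn2 = 3.0, 0.5

# Correlation coefficient of x and y = x + eta for independent Gaussians.
rho2 = sx2 / (sx2 + sn2)

i_corr = -0.5 * math.log2(1 - rho2)        # bivariate-normal formula
i_snr = 0.5 * math.log2(1 + sx2 / sn2)     # channel (SNR) formula
print(i_corr, i_snr)
```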