Information Bottleneck Method

From Ilya Nemenman: Theoretical Biophysics @ Emory

Bottleneck with finite samples

In the bottleneck algorithm, one assumes knowledge of the joint distribution of the predictor and the relevance variable. In practice, one almost never has this distribution, only samples drawn from it. How does the uncertainty of learning the pdf from samples influence the findings of the bottleneck?
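A minimal sketch (not taken from the papers below) of why this matters: the plug-in estimate of mutual information computed from an empirical joint distribution is biased upward, even when the two variables are actually independent, so a naive finite-sample bottleneck will "find" relevance that is not there.

```python
import numpy as np

def plugin_mutual_information(counts):
    """Plug-in MI estimate (in nats) from a 2D contingency table of counts."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)   # empirical marginal of the row variable
    py = p.sum(axis=0, keepdims=True)   # empirical marginal of the column variable
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px * py)[mask])).sum())

rng = np.random.default_rng(0)
K = 10  # cardinality of both variables
for n in (100, 1000, 100000):
    # X and Y independent and uniform, so the true MI is exactly 0.
    x = rng.integers(0, K, size=n)
    y = rng.integers(0, K, size=n)
    counts = np.zeros((K, K))
    np.add.at(counts, (x, y), 1)
    print(n, plugin_mutual_information(counts))
# The estimate shrinks toward 0, roughly like (K-1)^2 / (2n), as n grows.
```

The systematic overestimate, of order of the number of joint states over the sample size, is one concrete form of the "uncertainty of learning the pdf from samples" referred to above.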

  • Still and Bialek, 2002
  • O Shamir, S Sabato, and N Tishby. Learning and Generalization with the Information Bottleneck. Proc Intl Symp AI and Mathematics (ISAIM)-2008. PDF.
    The information bottleneck is an information theoretic framework, extending the classical notion of minimal sufficient statistics, that finds concise representations for an ‘input’ random variable that are as relevant as possible for an ‘output’ variable. This framework has been used successfully in various supervised and unsupervised applications. However, its learning theoretic properties and justification remained unclear as it differs from standard learning models in several crucial aspects, primarily its explicit reliance on the joint input-output distribution. In practice, an empirical plug-in estimate of the underlying distribution has been used, so far without any finite sample performance guarantees. In this paper we present several formal results that address these difficulties. We prove several non-uniform finite sample bounds that show that it can provide concise representations with good generalization based on smaller sample sizes than needed to estimate the underlying distribution. Based on these results, we can analyze the information bottleneck method as a learning algorithm in the familiar performance-complexity tradeoff framework. In addition, we formally describe the connection between the information bottleneck and minimal sufficient statistics.
    The paper proves that the errors in the bottleneck cost function depend on the cardinality of the compressor variable rather than that of the original predictor variable (the proof in this version is a bit fishy, I think). In addition, the paper shows that the probability of misclassifying the relevant variable on the basis of the compressed predictor is exponentially small in the mutual information between the two; in this sense, the quality of classification is known objectively rather than subjectively. It is further shown that the quality of the approximation is also governed by the mutual information between the compressed and the original variables, which is an argument for compressing. Finally, as the compression penalty goes to zero, the compressed variable becomes a sufficient statistic.
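For concreteness, a minimal sketch of the self-consistent iterations that optimize the bottleneck cost I(T;X) - beta*I(T;Y) (the standard equations of Tishby, Pereira, and Bialek; the variable names and the choice of a soft random initialization are my own). In the finite-sample setting discussed on this page, `pxy` would be the empirical plug-in estimate of the joint distribution:

```python
import numpy as np

def information_bottleneck(pxy, n_clusters, beta, n_iter=200, seed=0):
    """Return the encoder p(t|x), shape (n_clusters, |X|), for joint pxy of shape (|X|, |Y|)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                       # marginal p(x)
    py_x = pxy / px[:, None]                   # conditional p(y|x)
    # Random soft initialization of the encoder p(t|x); breaks symmetry.
    pt_x = rng.random((n_clusters, px.size))
    pt_x /= pt_x.sum(axis=0)
    for _ in range(n_iter):
        pt = pt_x @ px                         # p(t) = sum_x p(t|x) p(x)
        # Decoder: p(y|t) = sum_x p(t|x) p(x) p(y|x) / p(t)
        py_t = (pt_x * px) @ py_x / pt[:, None]
        # KL(p(y|x) || p(y|t)) for every (t, x) pair; small epsilon guards log(0).
        kl = (py_x[None] * (np.log(py_x[None] + 1e-30)
                            - np.log(py_t[:, None] + 1e-30))).sum(axis=2)
        # Self-consistent update: p(t|x) proportional to p(t) exp(-beta * KL).
        pt_x = pt[:, None] * np.exp(-beta * kl)
        pt_x /= pt_x.sum(axis=0)
    return pt_x
```

At large beta (small compression penalty) the fixed point keeps all information about the relevant variable, consistent with the sufficient-statistic limit described above; at small beta the encoder collapses, trading relevance for compression.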