# Diagnostics using microarray data

With the increasing availability of microarray data, many people have tried to use mRNA expression profiles as a diagnostic tool. Example questions are: given the expression profile, can the subject be classified as having a certain disease? Can the disease outcome be predicted? The next step, of course, will be to use other high-throughput modalities (proteomic or metabolic data) to make similar diagnostic predictions. So, is this approach reasonable? If done correctly, yes. If done the way most researchers do it, it's a disaster.

So what's the problem with the method? Consider that the number of possible predictors (different genes, metabolites, or proteins) in these experiments is usually ${\displaystyle M=10^{3}\dots 10^{6}}$. On the other hand, the number of training samples is ${\displaystyle N=10^{1}\dots 10^{3}}$, so that ${\displaystyle N/M\sim 10^{-2}}$ or smaller. So guess what the probability is that at least one of the predictors (genes) will, purely by chance, split the data into the correct classes? You guessed it, it is huge. Of course, people realize this, and they try to fight the problem. How? They do bootstrapping and/or cross-validation (which, for our purposes, are the same).
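To put a rough number on the chance-separation claim above, here is a minimal sketch under strong simplifying assumptions that are not in the original argument: every predictor is binary, the predictors are independent fair coin flips, and the samples carry a fixed two-class labeling. Under those assumptions a single predictor reproduces the labeling (or its mirror image) with probability $2^{1-N}$, so at least one of $M$ predictors does so with probability $1-(1-2^{1-N})^{M}$. The function name `chance_perfect_split` is hypothetical.

```python
import math

def chance_perfect_split(n_samples: int, n_predictors: int) -> float:
    """Probability that at least one of n_predictors independent, fair-coin
    binary predictors reproduces a fixed two-class labeling of n_samples
    exactly, up to swapping the two class names.

    Assumption (illustration only): real expression data are continuous
    and correlated, so this is a toy model, not an estimate for real arrays.
    """
    if n_samples < 2:
        raise ValueError("need at least two samples")
    # A single predictor matches the labeling or its complement.
    p_single = 2.0 ** (1 - n_samples)
    # 1 - (1 - p_single)**n_predictors, computed stably for tiny p_single.
    return -math.expm1(n_predictors * math.log1p(-p_single))

if __name__ == "__main__":
    for n, m in [(10, 10**4), (20, 10**6), (100, 10**6)]:
        print(f"N={n:4d}  M={m:8d}  P(chance split)={chance_perfect_split(n, m):.3g}")
```

In this toy model, ten samples and ten thousand predictors make a perfect chance split all but certain, and even twenty samples with a million predictors give a probability of roughly 0.85. Only when $N$ grows well past $\log_2 M$ does the probability collapse.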

Think about what cross-validation actually does, say, in its leave-one-out form. We throw one of the samples out of the data set. We design the classifier on the rest. We then test the classifier on the left-out sample and record the error. We repeat the entire procedure with each of the other points left out in turn, and we average the errors. Jorma Rissanen, in his book, correctly called this incest in the data: you do not validate on new data; instead, the same data serve as training data at one moment and as test data at another. We should expect problems, and we get them. In fact, Isaac Kohane (one of whose articles I cite below) mentioned in a discussion that, when different classification algorithms were applied to the same training data for a certain cancer diagnostics problem, the overlap of the predictors selected by the algorithms was exactly zero, raising the possibility that the choices of predictors are incidental.
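The leave-one-out procedure described above can be sketched as follows. This is only an illustration under assumptions that are not in the original text: the "classifier design" step is taken to be picking the `k` predictors with the largest difference in class means and then classifying by the nearest class centroid, and the data are pure Gaussian noise with arbitrary labels. The helper names `select_predictors` and `loo_error` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def select_predictors(X, y, k):
    """Indices of the k predictors with the largest class-mean difference."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(diff)[-k:]

def loo_error(X, y, k=10):
    """Leave-one-out error of a nearest-centroid classifier built on
    k selected predictors (a stand-in for any classifier design step)."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        train = np.arange(n) != i              # throw sample i out
        Xtr, ytr = X[train], y[train]
        # Design the classifier on the remaining n - 1 samples only.
        idx = select_predictors(Xtr, ytr, k)
        c0 = Xtr[ytr == 0][:, idx].mean(axis=0)
        c1 = Xtr[ytr == 1][:, idx].mean(axis=0)
        # Test on the left-out sample and record the error.
        x = X[i, idx]
        pred = int(np.sum((x - c1) ** 2) < np.sum((x - c0) ** 2))
        mistakes += int(pred != y[i])
    # Average over all n rounds; every sample was a test point once
    # and a training point in the other n - 1 rounds.
    return mistakes / n

# Pure-noise data: the labels carry no information about X.
rng = np.random.default_rng(0)
N, M = 20, 2000
X = rng.standard_normal((N, M))
y = np.array([0, 1] * (N // 2))
print("LOO error on noise:", loo_error(X, y))
```

Note that no sample in this loop is ever new: each one is used for testing once and for training in every other round, which is exactly the reuse the paragraph above objects to.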

## Bibliography

Discussions of why there may be problems of incidental classification/diagnostics:

1. Rissanen's book