Diagnostics using microarray data
With the increasing availability of microarray data, many people have tried to use mRNA expression profiles as a diagnostic tool. Example questions are: given the expression profile, can the subject be classified as having a certain disease? Or can the disease outcome be predicted? The next step, of course, will be to use other high-throughput modalities (proteomic or metabolic data) to make similar diagnostic predictions. So, is this approach reasonable? If done correctly, yes. If done the way most researchers do it, it's a disaster.
So what's the problem with the method? Consider that the number of possible predictors (different genes, metabolites, or proteins) in these experiments is usually very large. On the other hand, the number of training samples is far smaller, so that the ratio of samples to predictors is tiny. So guess what the probability is that one of the predictors (genes) will, by chance, shatter the data into the correct clusters? You guessed it: it is huge. Well, of course, people realize this, and they try to fight the problem. But what do they do about it? Well, they do bootstrapping and/or cross-validation (which, for our purposes, are the same).
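To put a rough number on the chance-separation argument, here is a minimal sketch. It rests on simplifying assumptions of mine, not figures from any particular study: two classes of equal size, predictors that are independent pure noise with continuous values, and "separating correctly" meaning that a single threshold on one predictor splits the classes perfectly. Under those assumptions, one noise predictor separates the classes with probability 2 / C(2m, m), where m is the number of samples per class. The values 10,000 predictors and 5 samples per class are hypothetical.

```python
import math
import random


def p_single_separates(n_per_class):
    """Probability that one pure-noise predictor perfectly separates two
    equal-sized classes with a single threshold (continuous values, no ties)."""
    return 2 / math.comb(2 * n_per_class, n_per_class)


def p_any_separates(n_predictors, n_per_class):
    """Probability that at least one of n_predictors independent noise
    predictors separates the classes perfectly."""
    q = p_single_separates(n_per_class)
    return 1 - (1 - q) ** n_predictors


def count_separating(n_predictors, n_per_class, rng):
    """Monte Carlo check: draw noise predictors and count perfect separators."""
    hits = 0
    for _ in range(n_predictors):
        a = [rng.gauss(0, 1) for _ in range(n_per_class)]  # class 0 values
        b = [rng.gauss(0, 1) for _ in range(n_per_class)]  # class 1 values
        if max(a) < min(b) or max(b) < min(a):
            hits += 1
    return hits


if __name__ == "__main__":
    # Hypothetical sizes, chosen only for illustration.
    n_predictors, n_per_class = 10_000, 5
    print("one predictor:", p_single_separates(n_per_class))
    print("at least one :", p_any_separates(n_predictors, n_per_class))
    print("simulated hits:", count_separating(n_predictors, n_per_class, random.Random(0)))
```

With 5 samples per class, a single noise predictor separates the classes with probability 2/252, or about 0.8%. With thousands of such predictors, the probability that at least one of them does so is practically 1.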
Well, think about it in the context of, say, leave-one-out cross-validation. We throw one of the samples out of the entire data set. We design the classifier. We then test the classifier on the left-out sample. We record the error, repeat the entire procedure with each of the other points left out in turn, and average the error. Jorma Rissanen, in his book, correctly called this incest in the data: you do not validate on new data; instead, at different moments, the same data serves as either the training or the test data. We should expect problems, and we get them. In fact, in a discussion, Isaac Kohane (one of whose articles I cite below) mentioned that, in a certain cancer diagnostics problem tackled by different classification algorithms on the same training data, the overlap of predictors selected by the algorithms was exactly zero, raising the possibility that the choices of predictors are incidental.
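The leave-one-out procedure described above can be sketched as follows. This is only an illustration under my own assumptions: the data are pure noise, the sizes (10 samples, 500 predictors) are hypothetical, and the "classifier" is a single-predictor threshold rule (a decision stump) picked to minimize training error, which is not the method of any particular study. The sketch also records which predictor is chosen in each fold, so one can see how stable the selection is.

```python
import random


def best_stump(X, y):
    """Return (training_errors, feature, threshold, sign) for the
    single-feature threshold rule with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive observed values.
        cuts = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in cuts:
            for sign in (1, -1):
                errors = sum(
                    (1 if sign * (row[j] - t) > 0 else 0) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, sign)
    return best


def predict(stump, row):
    """Apply a stump returned by best_stump to one sample."""
    _, j, t, sign = stump
    return 1 if sign * (row[j] - t) > 0 else 0


def loocv(X, y):
    """Leave-one-out: design the classifier on n-1 samples, test on the
    left-out one, and average the errors over all n choices."""
    errors, chosen = 0, []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        stump = best_stump(X_train, y_train)
        chosen.append(stump[1])  # which predictor was picked in this fold
        errors += predict(stump, X[i]) != y[i]
    return errors / len(X), chosen


# Hypothetical pure-noise data set: the labels carry no real signal.
rng = random.Random(0)
n_samples, n_features = 10, 500
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [0] * 5 + [1] * 5

error_rate, chosen = loocv(X, y)
print("leave-one-out error:", error_rate)
print("predictor chosen in each fold:", chosen)
```

Every sample is used for training in all folds but one and for testing in the remaining one, which is exactly the reuse of data that the paragraph above objects to. Comparing the list of chosen predictors across folds, or across different classification rules, is one simple way to probe whether a selection is incidental.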
Discussions of why there may be problems of incidental classification/diagnostics:
- Rissanen's book
- I Kohane, DR Masys, and RB Altman. The Incidentalome: A Threat to Genomic Medicine. JAMA 296, 212-215, 2006.
- Genomic medicine is poised to offer a broad array of new genome-scale screening tests. However, these tests may lead to a phenomenon in which multiple abnormal genomic findings are discovered, analogous to the "incidentalomas" that are often discovered in radiological studies. If practitioners pursue these unexpected genomic findings without thought, there may be disastrous consequences. First, physicians will be overwhelmed by the complexity of pursuing unexpected genomic measurements. Second, patients will be subjected to unnecessary follow-up tests, causing additional morbidity. Third, the cost of genomic medicine will increase substantially with little benefit to patients or physicians (but with great financial benefits to the genomic testing industry), thus throwing the overall societal benefit of genome-based medicine into question. In this article, we discuss the basis for these concerns and suggest several steps that can be taken to help avoid these substantive risks to the practice of genomically personalized medicine.
- References 3-10 provide examples of microarray-based classification tools. Note that the focus of this paper is slightly different from the current discussion, and "incidentalome" has a slightly different meaning.