[1644] False Discovery and Fairy Tales in Gene Expression Analysis

Colin J Rog, Mary E Edgerton. MD Anderson Cancer Center, Houston, TX

Background: Analysis of gene expression data can be misleading due to overfitting. Genes are highly connected in cellular networks with frequent duplication across pathways. We hypothesize that as more genes are determined to be relevant to a disease outcome, it becomes more likely that any two of them are connected in a known pathway. Because cancer is a heavily studied disease area, we also hypothesize that any given gene is highly likely to belong to a pathway in which at least one member has been studied in association with cancer.
Design: A master set of 12,000 genes was constructed using an Affymetrix microarray probe set. One hundred sets of five and ten genes each were created by random selection from the master set. Dijkstra's algorithm was used to build pathways in MetaCore, a commercial database of gene and protein interactions, with the five and ten gene sets as inputs. We measured the frequency with which pathways could be generated when a maximum of one, two, or three intervening steps was allowed between any two input genes. When networks were generated, we constructed a Boolean statement using the “or” operator to join the network genes and the “and” operator to join the gene set with the term “cancer”. This Boolean construct was used to search PubMed for publications relevant to cancer for any member of the network.
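The sampling and query-building steps in the Design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the placeholder master list, the fixed seed, sampling without replacement within a set, and the example gene symbols are all hypothetical, and the MetaCore network-building step is not reproduced here.

```python
import random


def sample_gene_sets(master, set_size, n_sets, seed=0):
    """Draw n_sets random gene sets of set_size genes from the master list.

    Assumption: genes are drawn without replacement within each set,
    and sets are drawn independently of one another.
    """
    rng = random.Random(seed)
    return [rng.sample(master, set_size) for _ in range(n_sets)]


def build_boolean_query(network_genes, term="cancer"):
    """Join network genes with OR, then join that group with the term using AND."""
    if not network_genes:
        raise ValueError("network_genes must contain at least one gene")
    joined = " OR ".join(network_genes)
    return f"({joined}) AND {term}"


# Hypothetical placeholder identifiers standing in for the 12,000-gene master set.
master_set = [f"GENE{i:05d}" for i in range(12000)]

five_gene_sets = sample_gene_sets(master_set, set_size=5, n_sets=100)
ten_gene_sets = sample_gene_sets(master_set, set_size=10, n_sets=100)

# Illustrative network members only; these are not results from the study.
query = build_boolean_query(["TP53", "BRCA1", "EGFR"])
print(query)
```

Because the gene terms are joined with “or”, every gene added to a network can only widen the search, which is consistent with the publication count growing with network size.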
Results: Frequency analysis is summarized below.

Table 1. Pathway Generation Frequency
Maximum intervening steps allowed    1      2      3
5-gene input                         0%     28%    67%
10-gene input                        6%     70%    96%


The number of publications retrieved using the Boolean search strings was linearly correlated with the number of genes in the network with a slope greater than 1000.


Conclusions: Given the connectivity of genes and the proliferation of literature on cancer, using pathways with relevant literature on cancer to support mechanisms elucidated from gene expression data analysis can be misleading. In particular, techniques that generate many hundreds of genes are likely to result in higher false discovery rates. This can create problems when using pathway databases and literature on cancer to develop relevant mechanisms to support gene discovery. Literature support or network connectivity cannot be used in isolation to support hypotheses generated from analysis of gene expression data.
Category: Informatics

Monday, March 19, 2012 8:45 AM

Platform Session: Section H, Monday Morning

 
