False Discovery and Fairy Tales in Gene Expression Analysis
Colin J Rog, Mary E Edgerton. MD Anderson Cancer Center, Houston, TX
Background: Analysis of gene expression data can be misleading because of overfitting. Genes are highly connected in cellular networks, with frequent duplication across pathways. We hypothesize that as more genes are determined to be relevant to a disease outcome, it becomes more likely that any two of those genes are connected in a known pathway. Because cancer is a heavily studied disease area, we also hypothesize that there is a high likelihood that any given gene belongs to a pathway in which at least one member has been studied in association with cancer.
Design: A master set of 12,000 genes was constructed from an Affymetrix microarray probe set. One hundred sets of five and ten genes each were created by random selection from the master set. With the five- and ten-gene sets as inputs, Dijkstra's algorithm was used to build pathways in MetaCore, a commercial database of gene and protein interactions. We measured the frequency with which pathways could be generated when a maximum of one, two, or three intervening steps was allowed between any two input genes. When networks were generated, we constructed a Boolean statement using the “or” operator to join the network genes and the “and” operator to join the gene set with the term “cancer”. This Boolean construct was used to search PubMed for publications relevant to cancer for any member of the network.
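The design steps can be illustrated with a minimal Python sketch. It is not the MetaCore workflow: the interaction graph is a toy adjacency dictionary, the gene names are placeholders, and all function names are hypothetical. It also assumes that "intervening steps" means the number of intermediate genes on a shortest path, and that interactions are undirected. Breadth-first search is used in place of Dijkstra's algorithm because the two give the same shortest paths when every edge has equal weight.

```python
import random
from collections import deque


def sample_gene_sets(master, set_size, n_sets, seed=0):
    """Draw n_sets random gene sets (without replacement within a set)."""
    rng = random.Random(seed)
    return [rng.sample(master, set_size) for _ in range(n_sets)]


def intervening_steps(graph, source, target):
    """Number of intermediate genes on a shortest path, or None if unreachable.

    Assumption: graph is an undirected adjacency dict with unit-weight edges.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])  # (gene, edges traversed so far)
    while queue:
        gene, dist = queue.popleft()
        for neighbor in graph.get(gene, ()):
            if neighbor == target:
                return dist  # dist + 1 edges means dist intermediate genes
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def connected_pairs(graph, genes, max_steps):
    """Pairs of input genes linked with at most max_steps intermediates."""
    pairs = []
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            steps = intervening_steps(graph, a, b)
            if steps is not None and steps <= max_steps:
                pairs.append((a, b))
    return pairs


def build_pubmed_query(network_genes, term="cancer"):
    """Join genes with OR, then join the gene group to the term with AND."""
    return "(" + " OR ".join(network_genes) + ") AND " + term


# Toy interaction graph: a chain A-B-C-D-E plus an isolated gene F.
toy_graph = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
    "F": [],
}

if __name__ == "__main__":
    master = [f"GENE{i}" for i in range(12000)]  # placeholder identifiers
    five_gene_sets = sample_gene_sets(master, 5, 100)
    print(len(five_gene_sets), "sets of", len(five_gene_sets[0]), "genes")
    print(connected_pairs(toy_graph, ["A", "C", "F"], max_steps=1))
    print(build_pubmed_query(["A", "B", "C"]))
```

In this sketch, a gene set counts as generating a pathway at a given threshold when `connected_pairs` returns at least one pair; whether the original analysis required one linked pair or a network spanning all input genes is not stated in the abstract, so that criterion is also an assumption.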
Results: Frequency analysis is summarized below.
| Maximum intervening steps allowed | 1 | 2 | 3 |