The Hidden Layer of Noncoding RNA in the Evolution and Genetic Programming of Complex Organisms

John S. Mattick
Institute for Molecular Bioscience, University of Queensland, Brisbane 4072 Australia

Recent evidence suggests that at least half of the genes in the mammalian genome do not encode proteins. Most of the mammalian genome is transcribed, and the vast majority (~98%) of this transcriptional output is non-protein-coding RNA, comprising introns of protein-coding transcripts and the introns and exons of non-protein-coding transcripts (ncRNAs). These transcripts include complex clusters of overlapping and antisense transcripts, "intergenic" transcripts and pseudogene transcripts that appear to participate in both local and long-distance regulatory networks. Many transcripts (including intronic RNAs) are processed to smaller RNAs. These include snoRNAs that edit other RNAs, and microRNAs that control many aspects of development, including embryonic patterning, adipocyte formation, hematopoietic differentiation, apoptosis and insulin secretion, and that are perturbed in a range of cancers. RNA signaling is also involved in chromosome dynamics and chromatin modification, epigenetic processes which, like alternative splicing (also likely to be controlled by trans-acting RNAs), are essential to differentiation and development. Interestingly, many ncRNAs with conserved functions, such as XIST and H19, are not highly conserved at the sequence level, suggesting that (like language) their sequences can drift easily and yet retain the same function. This in turn suggests that there may be many more microRNAs and other regulatory RNAs to be discovered. Many putative ncRNAs identified in the RIKEN mouse cDNA project exhibit tissue-specific expression patterns and are dynamically induced by physiological stimuli. There are also many nucleic-acid binding proteins that appear to interact with complexes containing RNA, but whose exact specificity is unknown.
In addition, a significant proportion of the human genome appears to be under evolutionary selection, including thousands of ultra-conserved noncoding sequences and transposon-free regions that have remained unchanged throughout mammalian evolution. This suggests extended regions of complex regulatory information that operate via unknown mechanisms, which is hard to reconcile with current models of gene regulation. Many noncoding regions are conserved between species in complex patterns that are not evident from pairwise comparisons alone, suggesting that many sequences are under negative or positive selection in different lineages, presumably reflecting their common ontogeny and the phenotypic differences driven by adaptive radiation, respectively.

These observations, and the increasing number of complex genetic phenomena shown to be directed by regulatory RNAs, suggest that the majority of the human genome, and of the genomes of other complex organisms, is in fact functional (not junk) and devoted to an advanced genetic regulatory system that is primarily transacted by RNA. This conclusion is also supported by an information-theoretic analysis and by empirical data showing that regulatory networks are accelerating networks, and that bacteria have likely been limited in their complexity by a regulatory system based simply on analogue controls (i.e., proteins). This implies that multicellular organisms must have breached this limit by evolving a new regulatory system based on sequence-specific RNA signaling. If this is correct, our current conceptions of the information content of the mammalian genome, and of the genetic programming of mammalian development and variation, will have to be completely reassessed, with enormous implications for biology and medicine.