Less than two percent of the human genome is made up of protein-coding genes. Fifty years ago, scientists launched an expedition into the other 98 percent. It has been a slow march for much of that time, but in recent years the pace has picked up, thanks to advances such as new ways to sequence DNA.

Scientists now generally agree that some of the non-coding DNA falls into several categories, including:

Sites where proteins can bind in order to switch nearby genes on and off.

Genes for RNA molecules. Instead of just serving as a template for turning genes into proteins, RNA actually plays lots of roles in the cell, such as sensing levels of different molecules in the cell and interfering with other RNA molecules to control levels of protein.

Old viruses and other genomic parasites. Some viruses can insert their genetic material into our genomes so that it becomes a permanent part of our DNA. These viruses and other parasitic stretches of DNA can, from time to time, make copies of themselves, which then get inserted back into the genome. In a few cases, these genomic parasites may be domesticated, evolving to do valuable things like help build placentas or fight off viruses. But for the most part they’re either useless or downright harmful, just like any other source of mutation.

Hobbled or dead genes. Sometimes mutations strike genes so that they can no longer produce proteins. Sometimes these mutations are fatal. Other times, we’re able to survive without a particular gene. The disabled gene, known as a pseudogene, may linger on in the genome for millions of years. In a few cases, pseudogenes may still be able to produce useful RNA molecules. But for the most part, they’re just baggage.

The first two categories include stretches of DNA that are useful. The last two include stretches that are, for the most part, useless. Now comes the hard part: figuring out just how much of the genome is made up of each. The question goes beyond mere census-taking, because it will help us understand how the genome works, in its entirety. And it will also reveal how much of the genome provides no benefit at all.

I wrote an article about this line of research for the New York Times in November 2008. I described some scientists who were betting that most of the genome wouldn’t be good for much, and others who believed that most of it was serving important functions. The latter group pointed to studies in which scientists tallied up all the RNA transcripts produced by one chunk of the genome. They found that most of the DNA they analyzed produced RNA. John Mattick, a member of the research team who works at the University of Queensland in Australia, claimed that most of that DNA encoded useful molecules. “My bet is the vast majority of it — I don’t know whether that’s 80 or 90 percent,” he said.

But it was just a bet. A lot of work remained to figure out what all that RNA really signified. This week scientists at the University of Toronto published a study that suggests that, contrary to Mattick’s bet, it’s full of sound and fury, signifying nothing. They used new methods to survey the RNA produced by the genome and compared their results to the ones from older methods. They found that most of their RNA came from regions of the genome that are already known to be protein-coding genes. Very little RNA came from elsewhere in the genome. They argue that the older methods were crude, so studies based on them were loaded with false positives. Protein-coding genes are not the only source of RNA transcripts in the genome, but a lot of the extra ones may just be the result of sloppiness. When proteins slide down DNA, making RNA transcripts, they sometimes grab onto the wrong stretches. The extra RNA gets broken down quickly, as useless and as inevitable as sparks flying off a grinding wheel.

Nature News has a nice write-up, as does PLOS Biology (from which I shamelessly lifted my Macbeth).

Originally published May 19, 2010. Copyright 2010 Carl Zimmer.