You Don’t Miss Those 8,000 Genes, Do You?

Science moves forward by flow. One experiment leads to another. Observations accrue. What seem like side trips or even dead ends may bring a fuzzy picture further into focus. Yet science often seems as if it moves forward one bombshell at a time, marked by scientific papers and press conferences. I can’t think of a bigger contrast between the bombshell illusion and the flowing reality of science than the day in 2000 when President Clinton announced the completion of the first draft of the human genome on the White House lawn. He declared it “an epoch-making triumph of science and reason.”

And yet, as James Shreeve described in his book The Genome War, the announcement did not mark any clear milestone, but represented more of a last-minute compromise between two rival genome-sequencing teams. After the press conference, everyone went back to perfecting the rough drafts, which turned out to be very rough indeed. In fact, scientists are still polishing the genome, years after many people may have gotten the impression it was actually finished.

When Craig Venter and his colleagues published their rough draft of the human genome in 2001, they identified 26,588 human genes. They then broke those genes down by their functions. Some were involved in building DNA, some in relaying signals, and so on. Remarkably, though, they classified 12,809 genes–almost half–as “molecular function unknown.” Last week I wanted to know if those numbers still hold. I’ve been working on a book on Escherichia coli, and I wanted to contrast just how well scientists understand that microbe with just how poorly we understand ourselves (biologically, in this case). I wanted some numbers to make my case.

They weren’t so easy to find. In 2003 some reports came out to the effect that the genome had shrunk down to 21,000 genes. But I couldn’t turn up much news in the past four years. I wondered what sort of artificial milestone I would have to wait for in order to get some fresh numbers.

Fortunately there are now some rivals to the milestone model of science. There are web sites where you can observe works in progress, such as the human genome. One of those sites is called PANTHER. I contacted the top scientist behind it, Paul D. Thomas, with my question, and he sent me a link. When I clicked on the link, I got the pie chart I’ve posted here (click on the image to go to the original page if it’s hard to read).

The pie shows that we’re now down to just 18,308 genes. That’s over 8,000 genes fewer than six years ago. Many sequences that once looked like full-fledged genes, capable of generating a protein, now don’t make the grade. Some genes turned out to be pseudogenes–vestiges of genes that once worked but have since been wrecked by mutations. In other cases, DNA segments that appeared to be parts of separate genes have turned out to be part of the same gene.

Today scientists still don’t know the function of 5,898 genes in the human genome. In other words, over the past six years about 7,000 genes either have been figured out or have vanished into the land of nevermind. That’s progress, of a sort. But unknown genes still represent a major slice of the human genome, because the total number of genes has fallen as well. The blue slice in the pie above represents 32.2% of all our known genes. For all the work that has poured into the genome, for all the grand announcements, we still don’t have the faintest idea of what about a third of our genes are for.
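For anyone who wants to check the arithmetic behind these comparisons, here is a minimal Python sketch. It uses only the four gene counts quoted in this post; the variable and function names are my own and are not part of PANTHER or any genome database.

```python
# Gene counts quoted in this post (names are illustrative, not from PANTHER).
genes_2001 = 26_588      # genes in the 2001 rough draft
unknown_2001 = 12_809    # classified "molecular function unknown" in 2001
genes_now = 18_308       # genes in the current PANTHER pie chart
unknown_now = 5_898      # genes still of unknown function


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# How many genes dropped out of the count altogether.
genes_dropped = genes_2001 - genes_now

# How many "unknown" genes were either figured out or removed.
unknowns_resolved = unknown_2001 - unknown_now

# The unknown slice of the pie, then and now.
share_2001 = share(unknown_2001, genes_2001)
share_now = share(unknown_now, genes_now)

print(f"Genes dropped from the count: {genes_dropped}")
print(f"Unknown genes resolved or removed: {unknowns_resolved}")
print(f"Unknown share in 2001: {share_2001}%")
print(f"Unknown share now: {share_now}%")
```

The unknown slice shrinks more slowly than the raw count of unknown genes because the denominator, the total number of genes, has fallen too.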

Actually, we don’t even know all that much about the “known” genes. A lot of the functions assigned to human genes actually come from research on other species. We share a common evolutionary history with mice and Drosophila flies and other organisms that scientists have studied carefully. We all descend from a common ancestor, but when our lineages diverged, those ancestral genes duplicated and diverged. Some disappeared and others took on new roles. It’s possible now to group the genes from many species into families. Within those families scientists can group genes into sub-families. Genes from the same sub-family tend to do the same thing, even if they are found in different species. So PANTHER assigns human genes functions that have been established for genes from the same sub-family through careful experiments on other organisms. That’s a good strategy, but the fact remains that few human genes have experimental evidence for their function in humans. In one study of 35,329 proteins, scientists estimated that only 2,784 met this gold standard.
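To put that last pair of numbers in proportion, here is a short Python sketch. It assumes nothing beyond the two figures quoted from the study; the variable names are mine.

```python
# Figures quoted from the study mentioned above (names are illustrative).
proteins_studied = 35_329      # proteins examined in the study
with_human_evidence = 2_784    # proteins with experimental evidence in humans

# Percentage of the studied proteins that met the gold standard.
gold_standard_share = round(100 * with_human_evidence / proteins_studied, 1)

print(f"Proteins with direct human evidence: {gold_standard_share}%")
```

By that estimate, fewer than one in ten of the proteins studied had a function established by experiments in humans.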

That 35,329 figure may seem confusing, since we only have 18,308 genes. But a single human gene can make more than one protein. Human genes come in pieces, separated by non-coding chunks of DNA, and those segments can be spliced together in different combinations. Scientists will discover many more splice variants. Each splice variant can have a significantly different function from the other proteins produced by the same gene. This pie doesn’t capture that dimension of our knowledge (or our ignorance).

And then there’s the whole matter of all the other DNA that doesn’t encode proteins (98.5% of the genome all told). A lot of it is most likely a mishmash of broken genes and viral DNA. It’s possible to cut huge swaths of it out of a mouse’s genome with no apparent ill effect. But there are also a lot of important players hiding in that wilderness–switches that proteins can use to turn genes on and off, and sequences that give rise not to proteins but to RNA molecules that create their own control system for a cell. In all of these complications, scientists will probably find the answer to the question, “How do roughly the same number of genes encode such different kinds of animals?” Complexity isn’t purely a matter of the number of genes you have. It’s also how you use them.

Getting an update on the human genome was interesting in itself, but the way I got it was interesting as well. I did not have to follow the traditional procedure, waiting for a highly guarded paper to finally be published and reported on. The latest statistics on the human genome are out there now for anyone who cares to look at them. But in order to get at this information, you do need a fair amount of acumen. I would not have been able to create this pie chart without Thomas’s help. Perhaps some science writers will become more like investigative political reporters who know how to sift through Federal election databases for the real news. If that gets us away from the illusion of the bombshell, it will be a good thing.