In 1991, a 21-year-old Finnish computer science student named Linus Torvalds got annoyed. He had bought a personal computer to use at home, but he couldn’t find an operating system for it that was as robust as Unix, the system he used on the computers at the University of Helsinki. So he wrote one. He posted it online, free for anyone to download. But he required that anyone who figured out a way to make it better would have to share the improvement with everyone else who used the system. Torvalds would later tell Wired that his motives were not noble. “I didn’t want the headache of trying to deal with parts of the operating system that I saw as the crap work,” he said. “I wanted help.”
In his quest to avoid crap work, Torvalds unleashed a monster. People began to download the system, dubbed Linux, all over the world. Within a few weeks, Torvalds was getting emails from hundreds of users, explaining how to fix bugs and how to add new bells and whistles. People began to write programs that would only work on Linux computers. They founded companies around Linux-based software. Millions of people chose Linux for their computers, and major computer companies like Microsoft and Dell began to support the system. Along the way, Linux evolved. Torvalds’s first version contained 10,000 lines of code. Linux now holds over 12 million lines.
Those 12 million lines may seem like a hopeless thicket of code, but they actually have a hidden structure. They’re divided up into chunks, each of which carries out a particular task. All told, they carry out 12,391 separate functions. The functions are also connected. If Linux carries out one function, the system will direct the computer to carry out other functions. You can think of Linux as a network, with the functions joined together by links of control. Computer programmers can map out that network as a so-called “call graph.”
Linux bears an uncanny resemblance to the genes in a living cell. Many genes make proteins that act as switches for other genes. The proteins clamp onto DNA near a target gene, allowing the cell to read the gene and make a new protein. And that new protein may, in turn, grab onto many other genes. Thanks to this hierarchy of switches, cells can respond to changes in their environment and quickly carry out complex behaviors, such as reorganizing themselves to feed on a new kind of food.
A number of scientists have begun to compare natural and manmade networks. A lot of the same rules appear to be at work in the growth of the Internet, airport connections, brain wiring, ecosystem food webs, and gene networks. But very often, scientists are finding, it’s the differences between natural and manmade networks that are most revealing, offering clues to the different ways in which people and evolution build complex things.
In the Proceedings of the National Academy of Sciences this week, Koon-Kiu Yang of Yale and his colleagues present the first detailed comparison of Linux’s network to a gene network. (The paper will be here.) Thanks to the open-source nature of Linux, the scientists could look at every line of code in every version of the system over the past two decades, from Torvalds’s first primitive stab to its current sophisticated form. And for a living cell, Yang and his colleagues turned to the living equivalent of Linux–a biological network they could analyze from top to bottom. They chose E. coli, since it is the best-studied species on Earth. (Why E. coli? There’s a certain book that will explain it to you.)
Over the past fifty years, scientists have mapped 1,378 interactions among E. coli genes. Out of that research, Yang and his colleagues built a microbial call graph. They assigned each gene to one of three categories. If a gene switched on one or more genes, but was not itself switched on by another gene, they called it a “master regulator.” If a gene was switched on by a different gene and then, in turn, switched on other genes, the scientists dubbed it a “middle manager.” And if the gene was switched on but did not then switch on any other genes, they called it a “workhorse.” The scientists drew the network of master regulators, middle managers, and workhorses.
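The three categories can be read straight off a network's links. What follows is only a minimal sketch of that sorting rule, not the scientists' own code: the function name `classify_nodes`, the toy gene names, and the choice to ignore a gene that switches on itself are all assumptions made for illustration.

```python
def classify_nodes(edges):
    """Sort the nodes of a control network into three tiers.

    edges: iterable of (regulator, target) pairs, meaning that
    `regulator` switches on `target`. Self-links are ignored, which is
    an assumption made for this sketch.
    """
    regulators = set()  # nodes that switch on at least one other node
    targets = set()     # nodes that are switched on by another node
    for src, dst in edges:
        if src == dst:
            continue
        regulators.add(src)
        targets.add(dst)

    tiers = {}
    for node in regulators | targets:
        if node in regulators and node in targets:
            tiers[node] = "middle manager"
        elif node in regulators:
            tiers[node] = "master regulator"
        else:
            tiers[node] = "workhorse"
    return tiers


# Hypothetical toy network: A switches on B and C, B switches on C and D.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("B", "D")]
toy_tiers = classify_nodes(toy_edges)
```

In this toy network, A ends up as the lone master regulator, B as a middle manager, and C and D as workhorses. The same rule works on a call graph if each pair is read as "function X calls function Y."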
The scientists sorted all the functions in Linux by the same rules. Here is the picture that emerged.
(N.B.: for the sake of clarity, the scientists only used 10% of the nodes in the full Linux call graph. But the complete picture would look the same.)
Both Linux and E. coli are organized into hierarchies. But their hierarchies have different shapes. E. coli’s genome is dominated by workhorses. Middle managers and master regulators make up less than 5% of the total number of genes. In Linux, by contrast, over 80% of the functions are in the upper echelons. Each workhorse in Linux is controlled by many middle managers. In E. coli, on the other hand, each workhorse gene is typically controlled either by a few genes or just one. And so in E. coli it’s the higher levels where genes have the most links, not the workhorses.
Once Yang and his colleagues had drawn the two networks, they looked at the paths information takes as it flows from master regulators down to workhorses. E. coli’s genes are organized into relatively distinct modules. When a master regulator swings into action–in response, say, to a spike in temperature–it switches on a set of other genes with relatively little overlap with the genes switched on by other master regulators. Linux, by contrast, has blurry boundaries. Four out of five Linux modules overlap, in contrast to 5% of E. coli’s.
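One way to picture that measurement is to treat each master regulator's module as everything it can reach downstream, then count how many modules share at least one member with another. The sketch below rests on that assumption; the paper's own definition of a module and of overlap may well differ, and the helper names and toy network are hypothetical.

```python
from collections import defaultdict


def modules_by_master(edges):
    """Map each master regulator to the set of nodes downstream of it."""
    children = defaultdict(set)
    targets = set()
    for src, dst in edges:
        if src != dst:  # ignore self-links (an assumption of this sketch)
            children[src].add(dst)
            targets.add(dst)

    # Master regulators switch on others but are never switched on themselves.
    masters = [node for node in children if node not in targets]

    modules = {}
    for master in masters:
        seen = set()
        stack = [master]
        while stack:  # depth-first walk down the hierarchy
            node = stack.pop()
            for child in children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        modules[master] = seen
    return modules


def overlapping_fraction(modules):
    """Fraction of modules that share at least one node with another module."""
    names = list(modules)
    if not names:
        return 0.0
    overlapping = sum(
        1
        for a in names
        if any(modules[a] & modules[b] for b in names if b != a)
    )
    return overlapping / len(names)


# Hypothetical toy network: M1 and M2 both control b, while M3 stands alone.
toy_edges = [("M1", "a"), ("M1", "b"), ("M2", "b"), ("M3", "c")]
toy_modules = modules_by_master(toy_edges)
toy_fraction = overlapping_fraction(toy_modules)
```

Here two of the three toy modules overlap, because M1 and M2 both reach b. A network built like E. coli’s would score near zero on this measure, and one built like Linux would score much higher.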
The networks in E. coli and Linux don’t just look different. They also grew in different ways. The oldest genes in E. coli’s network–the ones shared by many other species of microbes–are its workhorses. The genes higher up in the E. coli hierarchy have emerged more recently. Those higher-ranking genes have also been undergoing a lot of evolutionary change since they first emerged. The old genes, by contrast, have changed little.
The history of Linux has played out differently. A lot of the oldest functions in Linux are middle managers or master regulators, not workhorses as in E. coli. And while old genes in E. coli haven’t evolved much, programmers have heavily rewritten Linux’s old functions.
Both networks developed, step by step, as increasingly sophisticated systems for operating things–computers or cells. But the Linux network was the work of programmers, while E. coli is the product of four billion years of evolution. The differences in the history and shape of the two networks emerge from the ways in which they developed. The programmers who built Linux did not have the time to invent entirely new workhorse functions. It was simpler for them to just use the old workhorse functions in new modules. But this strategy leaves Linux a lot more fragile than a biological network. Its modules overlap, so that in many cases, a workhorse function is essential for many different modules at once. As a result, Linux gets buggy and prone to crashing. And so as programmers improve Linux, they’ve had to fine-tune its all-purpose functions at every step of the way.
E. coli is far more rugged. Mutations crop up all the time as the bacteria multiply, and yet they generally don’t suffer a catastrophic network crash. One reason E. coli is so robust is that its modules have evolved to be distinct. Overlapping modules make cells particularly vulnerable to mutations, because a single mutation can shut down a lot of their essential biology. Natural selection favors organisms with a more rugged network.
Because E. coli is the product of evolution, rather than of programmers, parts of its genome have changed relatively little over billions of years. The oldest parts of the network are the workhorse genes–the ones that encode primitive proteins that do the fundamental work of life, like building new pieces of DNA. They can tolerate very little change. It’s much easier instead for E. coli to evolve new ways of controlling those workhorses.
This kind of comparison is very new, and it’s not clear yet what scientists will find when they compare Linux to other genomes–particularly to the genomes of more complex species like ourselves. E. coli has only about 4,300 genes. We have 20,000 protein-coding genes. A lot of those genes control other genes. Indeed, a typical human gene has a lot of switches, all of which have to be thrown in order for the gene to make a protein in a certain situation. The human genome is also packed with thousands of genes that don’t encode proteins, but which may encode RNA molecules that also switch genes on and off. Scientists just don’t know enough yet about the human genome to map its network the way they’ve mapped E. coli. But it’s possible that when they finally do, it will be a lot more top-heavy, with a lot more overlapping modules and multi-tasking workhorses.
If that turns out to be the case, biologists will have a new question to keep them busy for a long time to come: how did Linus get to be so much like Linux?
[Update: Fixed Torvalds’s name and other typos. Thanks for the proofing!]
Originally published May 3, 2010. Copyright 2010 Carl Zimmer.