Less is more: correcting the map of microbial evolution

In 5 seconds

In a new study, UdeM computer scientist Miklós Csűrös offers a new, statistically sound foundation for understanding how the earliest life forms on Earth evolved.

In this era of Big Data, the prevailing wisdom is that more information leads to better answers. However, a new Canadian study shows that in the hunt for life's ancient ancestors, more data can actually lead to less truth.

Published in Proceedings of the National Academy of Sciences, the research by UdeM associate professor of computer science Miklós Csűrös reveals that standard methods for reconstructing the genomes of ancient microbes are being overwhelmed by an explosion of information.

This paradox causes current models to "hallucinate" evolutionary events—specifically, an implausibly high number of horizontal gene transfers—that are actually just statistical ghosts, the study shows.

In it, Csűrös identifies a crisis point in evolutionary biology: as researchers try to reconcile thousands of gene sequences across the entire tree of life, the actual evolutionary signal begins to vanish, replaced by mathematical noise.

'Don't zoom in too close'

"Traditional tools try to track every single mutation and exchange, but at this scale, the signal collapses," said Csűrös. "It's like trying to read a book where the ink has smeared; if you zoom in too close, you lose the letters entirely."

To solve this, Csűrös developed the GLD (Gain-Loss-Duplication) framework. Instead of getting lost in the "smeared ink" of individual sequences, GLD focuses on the demographics of gene families—tracking how they are born, how they die and how they move across time.

Using robust likelihood computations and stepping back from the noise of individual variations, the GLD framework provides a clearer, more stable map of the past, Csűrös said. “Overcoming the tricky math of birth-death processes, it becomes a powerful tool for genomic archaeology.”
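
As a rough illustration of what tracking the "demographics" of a gene family can mean, the sketch below simulates the copy number of a single family along one branch of a tree under a generic linear gain-loss-duplication process. It is not the study's method or code: the function name, the rate values and the branch length are all hypothetical, and the likelihood computations described in the paper are not reproduced here.

```python
import random


def simulate_family_size(n0, gain, loss, dup, t_end, rng):
    """Simulate one gene family's copy number along a branch (Gillespie method).

    Hypothetical illustration only; rates are not estimates from the study.
      n0    : copy number at the start of the branch
      gain  : rate at which a copy arrives from outside (independent of n)
      loss  : per-copy loss rate
      dup   : per-copy duplication rate
      t_end : branch length, in the same time units as the rates
    """
    n, t = n0, 0.0
    while True:
        total = gain + n * (loss + dup)  # total rate of any event happening
        if total == 0.0:
            return n  # nothing can happen any more
        t += rng.expovariate(total)  # waiting time until the next event
        if t > t_end:
            return n  # branch ends before the next event
        u = rng.random() * total  # pick which event occurred
        if u < gain + n * dup:
            n += 1  # gain from outside, or duplication of an existing copy
        else:
            n -= 1  # loss of one copy


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed rates: loss outpaces duplication, so family sizes stay bounded.
    sizes = [simulate_family_size(1, 1.0, 1.0, 0.5, 20.0, rng) for _ in range(2000)]
    print("mean copy number:", sum(sizes) / len(sizes))
    print("share of families absent at the end:", sizes.count(0) / len(sizes))
```

With these assumed rates, the long-run average copy number of such a process is gain / (loss - dup), so the simulated mean should settle close to 2. A reconstruction method works in the opposite direction, inferring rates and ancestral states from the copy numbers observed in present-day genomes.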

When applied to a massive dataset of 269 archaeal genomes, GLD corrected the distorted results of previous high-profile studies, Csűrös found. The study revealed that archaeal evolution isn't a chaotic swap-meet of genes, but a finely balanced and dynamic equilibrium.

Three components revealed

That equilibrium has three components:

  • A behind-the-scenes tug of war: Most of a genome's life is spent in a high-frequency cycle of streamlining (a constant "leak" of individual genes) balanced by a pervasive influx of transients. The study found that for every stable gene family, there are six times as many transient genes passing through—a discovery made possible by a new mathematical method that corrects for the bias of only looking at common genes.
  • Adaptive modular losses: Beyond the random background noise, the research identified modular patterns where entire sets of functional genes are shed together. These aren't random; they are strategic adaptations. For example, when an ancient microbe switches its diet, it discards the entire biological machinery it no longer needs in one coordinated evolutionary move.
  • Punctuated massive gains: Interspersed with these losses are rare, massive surges of new genetic material. These events act as "evolutionary founders," providing the raw material that allows new classes of organisms—like the Halobacteria—to thrive in entirely new environments.

By proving that less is more when it comes to phylogenetic signal, Csűrös said his study provides "a new, statistically sound foundation for understanding how the earliest life forms on Earth evolved. 

"It offers a vital quality-control mechanism for the next generation of evolutionary research, ensuring we don't mistake the noise of Big Data for the signal of life's history."
