In March, when covid cases began spiking around India, Bani Jolly went hunting for answers in the virus’s genetic code.
Researchers in the UK had just set the scientific world ablaze with news that a covid variant called B.1.1.7—soon to be referred to as alpha—was to blame for skyrocketing case counts there. Jolly, a third-year PhD student at the CSIR Institute of Genomics and Integrative Biology in New Delhi, expected to find that it was driving infections in her country too.
Because her institution is at the forefront of covid research in India, she had access to sequences from thousands of covid samples taken around the country. She began running them through software that grouped them according to branches of covid’s family tree.
Instead of dense clumps of B.1.1.7 cases, Jolly found a cluster of sequences that didn’t look quite like any known variant, some of them with two mutations of the spike protein that were already suspected to make the virus more dangerous.
Jolly talked to her advisor, who suggested that she reach out to other sequencing labs around India. Their data, too, showed signs that a local outbreak had given rise to a new family of the virus.
Before long, journalists got wind of the new development, and Jolly began to see articles about “double mutants” and the “Indian variant.”
She knew researchers could do more with a useful label than a “scariant” nickname. So she went to the place where a small group of scientists give new variants their names: a GitHub page staffed by a handful of volunteers around the world, led primarily by a PhD student in Scotland.
Those volunteers oversee a system called Pango, which has quietly become essential to global covid research. Its software tools and naming system have now helped scientists worldwide understand and classify nearly 2.5 million samples of the virus.
In April, Jolly posted her sequences to the GitHub page, along with an explanation of why they represented a significant change to the virus. (She was the second user to flag the new variant; the first flag had been waved a few days before, by a researcher in the UK.) The Pango team quickly came up with a new name, B.1.617. The family includes the infamously transmissible variant now known in the media as delta.
“Pango makes it really easy to see if other people are seeing what we’re seeing,” Jolly says. “If they’re not, it is really easy to report what’s being seen in India, so people can track it in other regions.”
Researchers, public health officers, and journalists around the world use Pango to understand covid’s evolution. But few realize that the entire endeavor—like much in the new field of covid genomics—is powered by a tiny team of young researchers who have often put their own work on hold to build it.
Too much data
You might assume that there’s long been an official, time-tested process for naming new branches of a virus’s family tree as it evolves, infecting one person after another. After all, researchers have been using genomic sequencing to study viruses for two decades.
But that work has historically had to cope with orders of magnitude less data, and little of it was shared collaboratively between scientists on different continents, as covid sequences have been. There had never been a pressing need to develop standardized names.
In March 2020, when the WHO declared a pandemic, the public sequence database GISAID held 524 covid sequences. Over the next month scientists uploaded 6,000 more. By the end of May, the total was over 35,000. (By contrast, scientists worldwide added 40,000 flu sequences to GISAID in all of 2019.)
“Without a name, forget about it—we cannot understand what other people are saying,” says Anderson Brito, a postdoc in genomic epidemiology at the Yale School of Public Health, who contributes to the Pango effort.
As the number of covid sequences spiraled, researchers trying to study them were forced to create entirely new infrastructure and standards on the fly. A universal naming system has been one of the most important elements of this effort: without it, scientists would struggle to talk to each other about how the virus’s descendants are traveling and changing—either to flag up a question or, even more critically, to sound the alarm.
Where Pango came from
In April 2020, a handful of prominent virologists in the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family. It had a logic, and a hierarchy, even though the names it generated—like B.1.1.7—were a bit of a mouthful.
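The hierarchy in a name like B.1.1.7 can be read mechanically: each number after a dot marks one more step down the family tree. The snippet below is a minimal sketch of that reading, not the Pango team's own code. The function names `lineage_ancestry` and `is_descendant` are hypothetical, and the sketch treats names as plain dotted strings without any special cases.

```python
def lineage_ancestry(name: str) -> list[str]:
    """Return the lineage followed by each of its parents, most specific first."""
    parts = name.split(".")
    # Each trailing number marks one further level of descent, so dropping
    # numbers from the right walks up the tree toward the root letter.
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def is_descendant(child: str, ancestor: str) -> bool:
    """True if `child` sits strictly below `ancestor` in the dotted hierarchy."""
    # Appending "." prevents a false match such as "B.1.61" against "B.1.617".
    return child != ancestor and child.startswith(ancestor + ".")


print(lineage_ancestry("B.1.1.7"))
# ['B.1.1.7', 'B.1.1', 'B.1', 'B']
print(is_descendant("B.1.1.7", "B.1"))
# True
```

Read this way, B.1.1.7 is a descendant of B.1.1, which descends from B.1, which descends from B.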
One of the authors on the paper was Áine O’Toole, a PhD candidate at the University of Edinburgh. Soon she’d become the primary person actually doing that sorting and classifying, eventually combing through hundreds of thousands of sequences by hand.
She says: “Very early on, it was just who was available to curate the sequences. That ended up being my job for a good bit. I guess I never understood quite the scale we were going to get to.”
She quickly set about building software to assign new genomes to the right lineages. Not long after that, another researcher, postdoc Emily Scher, built a machine-learning algorithm to speed things up even more.
They named the software Pangolin, a tongue-in-cheek reference to a debate about the animal origin of covid. (The whole system is now simply known as Pango.)
The naming system, along with the software to implement it, quickly became a global essential. Although the WHO has recently started using Greek letters for variants that seem especially concerning, like delta, those nicknames are for the public and the media. Delta actually refers to a growing family of variants, which scientists call by their more precise Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
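The AY names look unrelated to B.1.617.2 at first glance. Under the Pango convention, a fresh letter prefix stands in for a longer name once it gets too deep, and AY is shorthand for B.1.617.2 (the article itself does not spell this out). The sketch below shows how such shorthand could be expanded to check family membership. The one-entry `ALIASES` table and the helpers `expand` and `in_family` are illustrative assumptions, not the real Pango alias list or API.

```python
# Illustrative alias table with a single entry; the real list is far longer.
ALIASES = {"AY": "B.1.617.2"}


def expand(name: str) -> str:
    """Replace a shorthand letter prefix with the full lineage it stands for."""
    prefix, _, rest = name.partition(".")
    base = ALIASES.get(prefix)
    if base is None:
        return name  # not an alias; the name is already in full form
    return f"{base}.{rest}" if rest else base


def in_family(name: str, root: str) -> bool:
    """True if `name` is `root` itself or one of its descendants."""
    full, full_root = expand(name), expand(root)
    return full == full_root or full.startswith(full_root + ".")


delta_family = ["B.1.617.2", "AY.1", "AY.2", "AY.3"]
print([expand(n) for n in delta_family])
# ['B.1.617.2', 'B.1.617.2.1', 'B.1.617.2.2', 'B.1.617.2.3']
print(all(in_family(n, "B.1.617.2") for n in delta_family))
# True
```

Expanded in full, all four names sit under B.1.617.2, which is why scientists treat them as one family.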
“When alpha emerged in the UK, Pango made it very easy for us to look for those mutations in our genomes to see if we had that lineage in our country too,” says Jolly. “Ever since then, Pango has been used as the baseline for reporting and surveillance of variants in India.”
Because Pango offers a rational, orderly approach to what would otherwise be chaos, it may forever change the way scientists name viral strains—allowing experts from all over the world to work together with a shared vocabulary. Brito says: “Most likely, this will be a format we’ll use for tracking any other new virus.”
Many of the foundational tools for tracking covid genomes have been developed and maintained by early-career scientists like O’Toole and Scher over the last year and a half. As the need for worldwide covid collaboration exploded, scientists rushed to support it with ad hoc infrastructure like Pango. Much of that work fell to tech-savvy young researchers in their 20s and 30s. They used informal networks and tools that were open source—meaning they were free to use, and anyone could volunteer to add tweaks and improvements.
“The people on the cutting edge of new technologies tend to be grad students and postdocs,” says Angie Hinrichs, a bioinformatician at UC Santa Cruz who joined the Pangolin project earlier this year. For example, O’Toole and Scher work in the lab of Andrew Rambaut, a genomic epidemiologist who posted the first public covid sequences online after receiving them from Chinese scientists. “They just happened to be perfectly placed to provide these tools that became absolutely critical,” Hinrichs says.
Building fast
It hasn’t been easy. For most of 2020, O’Toole took on the bulk of the responsibility for identifying and naming new lineages by herself. The university was shuttered, but she and another of Rambaut’s PhD students, Verity Hill, got permission to come into the office. Her commute, walking 40 minutes to school from the apartment where she lived alone, gave her some sense of normalcy.
Every few weeks, O’Toole would download the entire covid repository from the GISAID database, which had grown exponentially each time. Then she would hunt around for groups of genomes with mutations that looked similar, or things that looked odd and might have been mislabeled.
When she got particularly stuck, Hill, Rambaut, and other members of the lab would pitch in to discuss the designations. But the grunt work fell on her.
Deciding when descendants of the virus deserve a new family name can be as much art as science. It was a painstaking process, sifting through an unheard-of number of genomes and asking time and again: Is this a new variant of covid or not?
“It was pretty tedious,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I saw sequences from places I’d never even heard of.”
As time went on, O’Toole struggled to keep up with the volume of new genomes to sort and name.
In June 2020, there were over 57,000 sequences stored in the GISAID database, and O’Toole had sorted them into 39 variants. In November 2020, a month after she was supposed to turn in her thesis, O’Toole took her last solo run through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (Although covid has overshadowed her research on other viruses, she’s putting a chapter on Pango in her thesis.)
Fortunately, the Pango software is built to be collaborative, and others have stepped up. An online community, the one that Jolly turned to when she noticed the variant sweeping across India, sprouted and grew. This year, O’Toole’s work has been much more hands-off. New lineages are now designated mostly when epidemiologists around the world contact O’Toole and the rest of the team through Twitter, email, or GitHub, her preferred method.
“Now it’s more reactionary,” says O’Toole. “If a group of researchers somewhere in the world is working on some data and they believe they’ve identified a new lineage, they can put in a request.”
The deluge of data has continued. This past spring, the team held a “pangothon,” a sort of hackathon in which they sorted 800,000 sequences into around 1,200 lineages.
“We gave ourselves three solid days,” says O’Toole. “It took two weeks.”
Since then, the Pango team has recruited a few more volunteers, like UCSC researcher Hinrichs and Yale researcher Brito, who both got involved initially by adding their two cents on Twitter and the GitHub page. A postdoc at the University of Cambridge, Chris Ruis, has turned his attention to helping O’Toole clear out the backlog of GitHub requests.
O’Toole recently asked them to formally join the organization as part of the newly created Pango Network Lineage Designation Committee, which discusses and makes decisions about variant names. Another committee, which includes lab leader Rambaut, makes higher-level decisions.
“We’ve got a website, and an email that’s not just my email,” O’Toole says. “It’s become a lot more formalized, and I think that will really help it scale.”
The future
A few cracks around the edges have started to show as the data has grown. As of today, there are nearly 2.5 million covid sequences in GISAID, which the Pango team has split into 1,300 branches. Each branch corresponds to a variant. Of those, eight are ones to watch, according to the WHO.
With so much to process, the software is starting to buckle. Things are getting mislabeled. Many strains look similar, because the virus evolves the most advantageous mutations over and over again.
As a stopgap measure, the team has built new software that uses a different sorting method and can catch things that Pango may miss.
It’s important to remember, though, that no system has ever dealt with such a deluge of data on how viruses morph. Covid has become the most-watched virus of all time. It’s also the first time scientists have been able to see exactly how the virus changes as it moves between countries.
“All this was possible because people were sharing their data, people were sharing their tools,” says Jolly.
As scientists have found ways to communicate with one another, they’ve also had to learn about public communication. It’s been “a bit surreal,” says O’Toole, watching the media use these highly technical names.
“We’d been using this nomenclature all year long, and it’s really useful for the scientific community, but a name like B.1.1.7 definitely wasn’t designed to be on BBC News,” she says. “It’s been a big learning experience to have this level of public scrutiny.”
Behind the scenes, the Pango team continues to track the evolution of covid so that scientists around the globe can work together on stopping the pandemic.
Says Brito: “The media is talking all the time about the delta variant, the alpha variant. CNN Brazil is talking about the genomes being sequenced and saying, ‘The lineage will be assigned and we’ll get a report in a few days’ … It would have been unimaginable two years ago.”
This story is part of the Pandemic Technology Project, supported by The Rockefeller Foundation.