The hits and misses of India’s ambitious COVID-19 genome sequencing project

Project assistants work on genome sequencing at the Indian Institute of Science Education and Research in Pune on 16 February 2022. Pratham Gokhale/Hindustan Times
13 May, 2022

The night of 29 January 2020 was one of the most strenuous of Pragya Yadav’s life. Yadav, a scientist at the National Institute of Virology, had already spent a few days and nights ensuring that her laboratory was prepared to test samples of the first two Indians thought to have COVID-19. Earlier that day, Yadav had conducted two rounds of tests in two different laboratories to confirm that the samples were indeed positive for the alarming new disease spreading across the world. By the evening, she was back in her laboratory to sequence the viral samples and to understand the genetic constitution of the virus.

“I remember that night very clearly,” Yadav said. She is a bespectacled, soft-spoken woman who patiently answered my barrage of questions, pausing to explain the jargon of genomic sequencing. “Me and my team had just wrapped up the testing work by 7 pm and went home, but were back on campus by 10:30 pm to discuss our sequencing strategy,” she recalled. Yadav and Priya Abraham, the director of the NIV, decided that the team would work through the night to complete sequencing. This was arduous work, and also harrowing. They had to handle samples of a highly contagious new pathogen that they knew little about apart from the fact that it was making people very ill and rapidly claiming lives. They also had to wear suffocating protective gear that made physical conditions inside the laboratory next to unbearable. Yadav and her team summoned all their strength to work through the night and the next day while ignoring anxious calls from their families. They finished sequencing by the morning of 30 January. “We finally had a whole genome sequence, a blueprint of the virus,” Yadav said. 

What Yadav had decoded was the structure and constitution of the virus from the sample she was sent, one of many similar sequences from similar samples around the world that could help unlock its mechanisms, which in turn could indicate how the disease could be controlled. The SARS-CoV-2 virus that causes COVID-19 is made up of a single RNA strand which contains all the information it needs to evolve and multiply in a host body. “Basically, we look at this strand of RNA and we see the pattern used to produce protein, replicate itself and infect a potential host,” Saket Choudhary, a postdoctoral student at the New York Genome Centre, said. “So we can then use this information to build drugs and develop new vaccines that can break these patterns of the virus. The more you know about how it functions, the more likely it is that you will be able to incapacitate it.”

India’s genomic sequencing efforts have advanced significantly since the time Yadav and her colleagues sequenced that first sample. Yadav’s team had to do this work at the only laboratory in the country equipped for testing and sequencing. By mid-2020, a few Indian scientists associated with the department of biotechnology, which falls under the union ministry of science and technology, had begun collecting a representative pool of blood samples from infected individuals across the country. They found that different mutations of the virus had already started emerging locally, especially in states that had witnessed surges in infections. “This is an RNA virus and we knew it is likely to mutate quickly but not at this scale and this fast,” Dr Saumitra Das, the former director of the National Institute of Biomedical Genomics, said. “By the latter half of 2020, it was more evident that there was a correlation between the surge in cases and variations in the virus. So an important question emerged: what is the exact link between variants and these surges?”

This link was one reason why the government of India decided to form a consortium of laboratories for COVID-19 genome sequencing. The other was to join the global effort to fight the pandemic, for which many other countries had already initiated concerted sequencing efforts. In December 2020, the department of biotechnology officially set up the Indian SARS-CoV-2 Consortium on Genomics, or INSACOG, an unprecedented nationwide effort to sequence COVID-19 samples. “When the UK variant emerged [in December 2020] and cases rose rapidly over there, that is when it was decided to concretise this plan and formally establish INSACOG,” Das, who also worked as a coordinator for the consortium, said.

INSACOG initially comprised ten government laboratories. Over a year, it expanded to 38 centres, including private laboratories. However, the consortium was slow to sequence significant numbers of samples. It has also not been able to effectively deploy sequencing information to prepare for and prevent future waves of infection. This is partly because of INSACOG’s own inability to establish clear systems for constituent laboratories and partly because of political apathy to the information provided by the consortium. The INSACOG project has come to exemplify the government’s approach to scientific counsel in tackling the pandemic: to ignore it. 

INSACOG’s mandate was to sequence five per cent of all samples from COVID-19 patients across the country to detect patterns of infection. Samples were sent to consortium laboratories closest to points of collection. Yadav told me that every new sequence was examined for mutations. “Basically, we look for the mutations or changes in the various gene segments in the entire genome of the virus—how those changes affect the different proteins of the virus such as the spike protein which is responsible for causing infection,” she said. The sequence was then compared to older ones already linked to particular patterns of infection or symptoms. Scientists checked whether the mutations in the sequence being examined were also present in previous variants of the virus. They checked for literature on how particular mutations affected virus functionality. “If it is a one-off variation in the genomic sequence somewhere, we don’t see it as a matter of concern, but if we see a pattern of the same mutations in a cluster of samples, we look further into it,” Dr Krishanpal Karmodiya, assistant professor of biology at the Indian Institute of Science Education and Research in Pune, said. IISER Pune is part of INSACOG and also part of Maharashtra’s state-level sequencing programme that started before INSACOG was set up. “We check whether a particular mutation has already been associated with clinical characteristics such as higher virulence or immune escape.” 

The point of linking genomic sequences to clinical information is to ultimately make more informed public health interventions. A genomic scientist based in Kerala told me how rigorous sequencing in the state led to the government pushing for double masking as a preventive tool. At the end of 2020, there was enough scientific evidence to show that the virus spread through the air. When Alpha emerged as a dominant variant, it was clear that cloth masks did not offer enough protection and that well-fitted N-95 masks offered much better protection against airborne particles. “But we knew we couldn’t advise everyone to wear N-95 masks, that could cause a shortage of masks for our essential workers,” the scientist said. “So we saw double-masking as a solution out of this, where people could wear a combination of a cloth mask and the more easily available surgical mask to cut down the rate of transmission.” A few months later, in April 2021, genomic surveillance provided early insights into the community spread of Alpha when scientists were able to link mutations in the virus with high infectivity. Vaccination rates were still very low in India at the time. “Double-masking as a non-pharmacological intervention was decided upon based on the fact that a more transmissible variant (Alpha) was gaining hold in the state,” the scientist said.

The efficiency Kerala displayed in using information gathered from genomic sequencing to implement preventive measures was not seen across India. In February 2021, just before the devastating second wave of the pandemic claimed millions of lives in India, members of the Maharashtra COVID-19 task force collected samples from patients in the Amravati and Yavatmal districts of the state. Both districts witnessed a sudden surge in cases and, when samples from these areas were tested in INSACOG labs, scientists found that they contained mutations that were linked to immune escape and increased infectivity. They were able to infer this because they had genomic data from other countries such as South Africa and Brazil where variants with the same mutations were driving up case numbers and hospitalisation rates. INSACOG scientists told reporters that they had warned the government by early March about a concerning new variant circulating in the population, but officials did nothing about it. Though the presence of a new variant is not solely responsible for driving waves of infection, it is one of the primary reasons for rising infection rates. The second wave was largely fuelled by the Delta variant, which was much more transmissible and more severe than earlier variants of the virus and quickly replaced them in the infected population. The government did not take preventive measures to minimise the spread of the Delta variant but instead allowed large election rallies and massive gatherings at the Kumbh Mela in April 2021.

When genomic scientists found novel mutations in SARS-CoV-2 variants that had not been reported from other countries, they looked for patterns that could reveal clinical implications. “If we don’t know much about these new mutations, we need to start collecting data on such clinical information, about how fast this mutation is spreading, how sick are people getting, and whether they are hospitalised etc,” Karmodiya said.

Finding similarities between genome sequences and variants reported abroad, and tracking mutations to see whether they would lead to surges or bad outcomes, required a certain number of samples to be collected, sequenced and reported. INSACOG set out to sequence five percent of all positive samples collected from every state. Shahid Jameel, the virologist who chaired the scientific advisory group of INSACOG until he resigned from the position in May 2021, told me the five-percent target was an aspirational one. “At the time, the best sequencing countries were sequencing in that range. This was not a mandate from the government, but was proposed in the funding application by 10 constituent laboratories of INSACOG,” he said. Till December 2021, the laboratories managed only a fifth of this target. Choudhary has closely analysed coronavirus sequencing data from India that has been uploaded to the Global Initiative on Sharing All Influenza Data or GISAID website. GISAID provides open access genomic data for a range of viruses including SARS-CoV-2, H1N1 and H5N1. He found that until late 2021, India was consistently sequencing only around one percent of all positive COVID-19 samples each month.

According to data uploaded on the INSACOG portal, from the time INSACOG was formed till the end of January 2022, Maharashtra sequenced 24,704 samples, while states like Bihar and Jharkhand sequenced only 594 and 783 respectively. So, more than 25 percent of sequencing data in India was from Maharashtra alone. The resulting national level sequencing data and prevalence of new variants mostly reflects what is going on in the states that are sequencing much more than the rest of the country. 

Scientists and researchers that I spoke to pointed to the lack of a coherent and consistent central strategy for genomic sequencing as the reason for the disparities in rates of sequencing in different states. INSACOG’s initial strategy in December 2020 was to ask every state to send five per cent of its positive samples to its laboratory network. In April 2021, the consortium changed its strategy to that of sentinel site surveillance. It asked each state to designate five laboratories and five tertiary care hospitals as sentinel sites, each of which would send 15 samples. 

In late December 2021, when cases were rising across the country again, a leading genomic scientist involved with INSACOG told me that the consortium was not adhering to either strategy. “Neither of these strategies are being followed, except for in the states of Maharashtra and Kerala, who have their own state level surveillance strategy,” the scientist said. Maharashtra and Kerala both had state-level genomic surveillance programmes that were started before INSACOG was formed and then adapted to work in coordination with INSACOG. These two states sequenced one-third of all samples in India. Karmodiya’s laboratory at IISER was part of INSACOG but also worked in parallel for the Maharashtra State Genomic Surveillance. “Apart from the mandate from the centre and INSACOG, we hold state-level meetings and co-ordinate with labs within the state to go the extra mile. We work in tandem with INSACOG but also go above and beyond that to sequence more and more samples across districts,” Karmodiya said. Kerala’s sequencing programme, called GENESCoV2 Kerala, was established in November 2020, a month before INSACOG was established. The idea was to sequence 100 samples from every district every month. “This constant number of samples provided a unique opportunity to understand diversity and evolution in a much more systematic way,” the Kerala-based genomic scientist told me. Since November 2021, however, Kerala’s sequencing strategy has been based on INSACOG’s sentinel site strategy.

Gathering epidemiological information from sequencing is not just a numbers game. Sequencing data is of limited use until clinical information is attached: the age, gender and comorbidities of the patient from whom the sample was taken, whether they were hospitalised or needed oxygen support, and whether they were vaccinated or boosted. “You would want more sequencing, you would want it faster, you would want to make sure that you have more detailed clinical information with the sequencing,” Gagandeep Kang, a virologist and professor at the Christian Medical College in Vellore, said. Kang said clinically correlated genomic data should be collected for more infectious diseases and not just for COVID-19. “The data that we have is inconsistent, we can’t track things over time or over geography,” she said. “If there is greater or lesser hospitalisation, for example, you know this is something you need to immediately respond to, rather than just saying a proportion of a particular variant is increasing, without having a link to particular characteristics.”

The COVID-19 genomic data available through the INSACOG website merely told us how many samples had been sequenced from each state so far, and how many of each variant were present. “The portal washed out many details and has no granular data,” Choudhary said. “Not even when these samples were collected or sequenced. The interface they provide is not useful to draw out any conclusions, it is limited to just absolute numbers.” INSACOG rarely presented analysis on sequencing data in the public domain. The few times it offered this information, it came months after sample collection.

In December of 2021, when cases surged due to the Omicron variant, India managed to sequence six percent of all positive samples in that month. Until then, the government was mostly sequencing samples taken from international passengers arriving at airports, the genomic scientist told me, adding that the focus on airport arrivals was misguided. The focus should have been on testing samples from local communities because of the breakneck speed of Omicron’s spread. At that point, on the cusp of the third national COVID-19 wave with almost ten thousand new cases each day, the strategy of sequencing only airport samples left India with no way of knowing how much the variant had already spread locally, especially since a number of cases were likely to be asymptomatic or mildly symptomatic. The genomic scientist was frustrated with the lack of foresight. “We can’t abandon our efforts when the cases are low and ramp it up only when the next wave hits us.”

The pattern of accelerating sequencing only in response to a rise in cases occurred during the second wave in 2021 and was repeated during the third wave in early 2022. INSACOG expanded its sequencing strategy in January when the number of daily COVID-19 cases rose from tens of thousands to more than a lakh within a few days. Anurag Agrawal, the former director of the Institute of Genomics and Integrative Biology, wrote in an email to me at the time that “the focus of sequencing has shifted from airports and foreign travelers to surveillance of Omicron fraction in India.” Agrawal also said that the big metros already seemed to be “Omicron dominant” and that information from smaller cities would soon become more important.

Genomic surveillance is much more beneficial at the trough stages of a pandemic than at its peaks, that is, when cases are at their lowest rather than their highest. “If you sequence a lot during the trough stage you will catch a larger diversity of variants early,” the genomic scientist said. “Whereas during the peak, even when you sequence a large number of samples, most of them will be found to be one or two dominant variants of the virus. Because it is the dominant variants that drive waves in the first place.” Though the national sequencing rate reached six per cent in December of 2021, it dropped to only 0.3 per cent in January 2022 and was at 0.9 per cent in February 2022, according to data on GISAID.

In February 2022, INSACOG announced that IISER Pune would test samples of hospitalised patients to understand whether their disease was caused by a particular sub-lineage of Omicron or another variant. I spoke to Karmodiya at the time and he confirmed that his institute and others across Pune were collaborating to extend genomic sequencing such that new mutations and variants could be quickly linked to clinical trends in hospitalised COVID-19 patients. INSACOG is now working towards expanding this to a network of hospitals across India, from where laboratories can assess samples of hospitalised COVID-19 patients. Saumitra Das, the director of the National Institute of Biomedical Genomics and a joint coordinator of INSACOG, said that this effort is spearheaded by the Translational Health Science and Technology Institute, an autonomous institute under the department of biotechnology.

The consortium’s inadequate data collection efforts add another layer of complication to its sequencing strategy. Choudhary said that while some delays in uploading data to the INSACOG or GISAID websites were understandable, the backlogs in data from India rendered the information mostly unusable. Data from samples that were collected in March 2021, for example, was uploaded at the end of December 2021. This could also explain the sudden spike in sequencing in that month. “One of the bigger public-funded labs strangely uploaded 6,000 sequences at once, in one day, on 6 December. This was data from samples that were collected on different days in the past six months. Six months later the data is almost useless, so we can definitely do a better job than this,” Choudhary said while speaking to me in early January. According to the GISAID website, there is an average lag of 63 days between when a sample is collected in India and when sequencing data from the sample is submitted. In comparison, the neighbouring countries of Bangladesh and Pakistan have average lags of 36 and 48 days respectively.

Karmodiya pointed out some challenges in data collection that can cause these lags. “We try to submit the sequencing data to both GISAID and INSACOG during the next three days of completion of the sequencing. However, many times it is delayed due to the non-availability of the metadata,” he said. “One reason could be that they have a bulk of samples from the airport, with no metadata attached to it. Data such as the location from where the sample was taken or where the passenger had traveled from. These labs have to go back to gather that data because you can’t submit sequencing data to these portals without attaching some basic metadata.”

Further, according to the data Choudhary accessed through GISAID, the majority of the samples India had sequenced in December 2021 and early January 2022 were Omicron samples. “If you just look at the absolute numbers of samples deposited, the numbers submitted from labs are very small. So we don’t know if it’s reflective of the prevalence of the variant or that labs are rushing to submit these samples because we are all majorly concerned with Omicron now,” he said. Choudhary was concerned that a selective bias in sequencing and submitting data only on Omicron samples would leave India blind to the presence of other variants that might already be circulating in the population.

Rakesh Mishra, a genomic scientist who retired as the director of Centre for Cellular and Molecular Biology—one of the first 10 INSACOG laboratories—in April of 2021, told me in mid-December that it was essential for the government to present data to the public in real time. “We shouldn't even wait for data to begin implementing measures because the time for action is now. But data is important, India needs to start collecting data on the variant and its clinical characteristics,” he said. “Data is our eye, and without it we are going into this war blind.” 

The first thing scientists around the world noticed about Omicron was that it was even more transmissible than Delta. “Most new waves are driven by highly infectious variants,” Mishra said. Preparedness for an infectious wave can be best achieved if this data is shared or uploaded on publicly accessible portals promptly. The INSACOG data portal has been up since early 2021, soon after the consortium was formed. But data shared through the portal remains limited and is not updated in real time. Since February 2022, this portal has no longer been updated and INSACOG data is hosted on a website run by the Indian Biological Data Center, which has restricted access and makes very limited data available to the public. The earlier INSACOG portal had weekly bulletins. The last regular bulletin was shared on 10 January 2022, after which there was no update for three months. In late April, three weekly bulletins for the month were uploaded together.

Before the pandemic, India’s genetic sequencing efforts were limited to particular laboratories and conducted sporadically. “TB is the one pathogen for which there was a nationwide effort to sequence. There were many samples collected at a national level but very few laboratories in the country that were associated with the actual sequencing effort,” Agrawal, the director of IGIB, said. Yadav’s lab at NIV was the only lab in India dedicated to sequencing novel and potentially dangerous pathogens. Yadav had trained at the Centers for Disease Control and Prevention in the United States of America for three weeks, where she learnt to use Next Generation Sequencing, or NGS, an advanced technology that allows scientists to sequence multiple strands of DNA or RNA in parallel. She came back to India, gathered a team of scientists at NIV and set up a lab in 2017 which uses NGS to conduct research on emerging pathogens. This is the same lab where she sequenced the first two COVID-19 samples in India. Since then, the country has ramped up sequencing capacity, adding almost forty laboratories to INSACOG and sporadically increasing the number of samples sequenced above its five per cent target.

But regardless of whether India consistently increases its sequencing capacity or meets the aspirational five percent target, INSACOG will have to streamline its strategy and implement it uniformly across the country in order to inform timely public health interventions. Most scientists I spoke with, who are working across different INSACOG laboratories, told me different versions of the strategy they followed while sequencing, and also claimed that the strategy mandated by INSACOG kept changing. Further, this strategy needs to provide good quality data linked to clinical information. This data needs to be uploaded and analysed in a timely manner so that public-health officials and administrators can use the information. All this also depends on whether the government works collaboratively with the scientific community, providing them with the resources to carry out their work and acting promptly when warned about concerning new mutations.

There are two ways of measuring the performance of a scientific endeavour: its progress compared to previous efforts and its progress compared to its goals. Kang said that INSACOG’s pace was satisfactory, even commendable, given that India had never before done genomic sequencing at this scale. “But there is obviously always room for improvement,” she said. According to Kang, these shortcomings should not deter the government from building better systems in the future. She is hopeful for the future of genomic sequencing in India. “If we start building a system for robust public-health surveillance and data collection, that has value. It will always have value when you have a disease that is not going away. The fact that we haven’t had such data available to us in the past one-and-a-half years is not a reason to not have data in the future as well.”

 This reporting was supported by a grant from the Thakur Family Foundation. Thakur Family Foundation has not exercised any editorial control over the contents of this reportage.