The Collision that Formed India

What genetics reveals about Indian origins

01 October, 2018


In the oldest text of Hinduism, the Rig Veda, the warrior god Indra rides against his “impure enemies,” or dasa, in a horse-drawn chariot, destroys their fortresses, or pur, and secures land and water for his people, the arya, or Arya. Composed between 4,000 and 3,000 years ago in Old Sanskrit, the Rig Veda was passed down orally for some two thousand years before being written down, much like the Iliad and Odyssey in Greece, which were composed several hundred years later in another early Indo-European language. The Rig Veda is an extraordinary window into the past, as it provides a glimpse of what Indo-European culture might have been like in a period far closer in time to when these languages radiated from a common source. But what did the stories of the Rig Veda have to do with real events? Who were the dasa, who were the arya, and where were the fortresses located? Did anything like this really happen? There was tremendous excitement about the possibility of using archaeology to gain insight into these questions in the 1920s and 1930s. In those years, excavations uncovered the remains of an ancient civilisation, walled cities at Harappa, Mohenjo-daro, and elsewhere in the Punjab and Sind that dated from 4,500 to 3,800 years ago. These cities and smaller towns and villages dotted the valley of the river Indus in present-day Pakistan and parts of India, and some of them sheltered tens of thousands of people. Were they perhaps the fortresses, or pur, of the Rig Veda?

Indus Valley Civilisation cities were surrounded by perimeter walls and laid out on grids. They had ample storage for grain supplied by farming of land in the surrounding river plains. The cities sheltered craftspeople skilled in working clay, gold, copper, shell and wood. The people of the Indus Valley Civilisation engaged in prolific trade and commerce, as reflected in the stone weights and measures they left behind and in the reach of their trading partners, who lived as far away as Afghanistan, Arabia, Mesopotamia and even Africa. They made decorative seals with images of humans or animals. The seals often carried signs or symbols whose meaning remains largely undeciphered.

Since the original excavations, many things about the Indus Valley Civilisation have remained enigmatic, not only its script. The greatest mystery is its decline. Around 3,800 years ago, the settlements of the Indus dwindled, with population centres shifting east toward the Ganges plain. Around this time, the Rig Veda was composed in Old Sanskrit, a language that is ancestral to the great majority of languages spoken in northern India today and that had diverged in the millennium before the Rig Veda was composed from the languages spoken in Iran. Indo-Iranian languages are in turn cousins of almost all of the languages spoken in Europe and with them make up the great Indo-European language family. The religion of the Rig Veda, with its pantheon of deities governing nature and regulating society, had unmistakable similarities to the mythology of other parts of Indo-European Eurasia, including Iran, Greece and Scandinavia, providing further evidence of cultural links across vast expanses of Eurasia.

Some have speculated that the collapse of the Indus Valley Civilisation was caused by the arrival in the region of migrants from the north and west who spoke Indo-European languages, the so-called Indo-Aryans. In the Rig Veda, the invaders had horses and chariots. We know from archaeology that the Indus Valley Civilisation was a pre-horse society. There is no clear evidence of horses at its sites, nor are there remains of spoke-wheeled vehicles, although there are clay figurines of wheeled carts pulled by cattle. Horses and spoke-wheeled chariots were the weapons of mass destruction of Bronze Age Eurasia. Did the Indo-Aryans use their military technology to put an end to the old Indus Valley Civilisation?

Since the original excavations at Harappa, the “Aryan invasion theory” has been seized on by nationalists in both Europe and India, which makes the idea difficult to discuss in an objective way. European racists, including the Nazis, were drawn to the idea of an invasion of India in which dark-skinned inhabitants were subdued by light-skinned warriors related to northern Europeans, who imposed on them a hierarchical caste system that forbade intermarriage across groups. To the Nazis and others, the distribution of the Indo-European language family, linking Europe to India and having little impact on the Near East with its Jews, spoke of an ancient conquest moving out of an ancestral homeland, displacing and subjugating the peoples of the conquered territories, an event that they wished to emulate. Some placed the ancestral homeland of the Indo-Aryans in northeast Europe, including Germany. They also adopted features of Vedic mythology as their own, calling themselves Aryans after the term in the Rig Veda, and appropriating the swastika, a traditional Hindu symbol of good fortune.

The Nazis’ interest in migrations and the spread of Indo-European languages has made it difficult for serious scholars in Europe to discuss the possibility of migrations spreading Indo-European languages. In India, the possibility that the Indus Valley Civilisation fell at the hands of migrating Indo-European speakers coming from the north is also fraught, as it suggests that important elements of South Asian culture might have been influenced from the outside.

The idea of a mass migration from the north has fallen out of favour among scholars not only because it has become so politicised, but also because archaeologists have realised that major cultural shifts in the archaeological record do not always imply major migrations. And, in fact, there is scant archaeological evidence for such a population movement. There are no obvious layers of ash and destruction around 3,800 years ago suggesting the burning and sacking of the Indus towns. If anything, there is evidence that the Indus Valley Civilisation’s decline played out over a long period, with emigration away from the towns and environmental degradation taking place over decades. But the lack of archaeological evidence does not mean that there were no major incursions from the outside. Between 1,600 and 1,500 years ago, the Western Roman Empire collapsed under the pressure of the Germanic expansions, with great political and economic blows dealt to the empire when the Visigoths and the Vandals each sacked Rome and took political control of Roman provinces. However, there seems so far to be little archaeological evidence for destruction of Roman cities in this time, and if not for the detailed historical accounts, we might not know these pivotal events occurred. It is possible that in the apparent depopulation of the Indus Valley, too, we might be limited by the difficulty archaeologists have in detecting sudden change. The patterns evident from archaeology may be obscuring more sudden triggering events.

What can genetics add? It cannot tell us what happened at the end of the Indus Valley Civilisation, but it can tell us if there was a collision of peoples with very different ancestries. Although mixture is not by itself proof of migration, the genetic evidence of mixture proves that dramatic demographic change and thus opportunity for cultural exchange occurred close to the time of the fall of Harappa.


The great Himalayas were formed around ten million years ago by the collision of the Indian continental plate, moving northward through the Indian Ocean, with Eurasia. India today is also the product of collisions of cultures and people.

Consider farming. The Indian subcontinent is one of the breadbaskets of the world—today it feeds a quarter of the world’s population—and it has been one of the great population centres ever since modern humans expanded across Eurasia after fifty thousand years ago. Yet farming was not invented in India. Indian farming today is born of the collision of the two great agricultural systems of Eurasia. The Near Eastern winter rainfall crops, wheat and barley, reached the Indus Valley sometime after nine thousand years ago according to archaeological evidence—as attested, for example, in ancient Mehrgarh on the western edge of the Indus Valley in present-day Pakistan. Around five thousand years ago, local farmers succeeded in breeding these crops to adapt to monsoon summer rainfall patterns, and the crops spread into peninsular India. The Chinese monsoon summer rainfall crops of rice and millet also reached peninsular India around five thousand years ago. India may have been the first place where the Near Eastern and the Chinese crop systems collided.

Language is another blend. The Indo-European languages of the north of India are related to the languages of Iran and Europe. The Dravidian languages, spoken mostly by southern Indians, are not closely related to languages outside South Asia. There are also Sino-Tibetan languages spoken by groups living in the mountains fringing the north of India, and small pockets of tribal groups in the east and centre that speak Austroasiatic languages related to Cambodian and Vietnamese, and that are thought to descend from the languages spoken by the peoples who first brought rice-farming to South Asia and parts of Southeast Asia. Words borrowed from ancient Dravidian and Austroasiatic languages, which linguists can detect as they are not typical of Indo-European languages, are present in the Rig Veda, implying that these languages have been in contact in India for at least three or four thousand years.

The people of India are also diverse in appearance, providing visual testimony to mixture. A stroll down a street in any Indian city makes it clear how diverse Indians are. Skin shades range from dark to pale. Some people have facial features like Europeans, others closer to Chinese. It is tempting to think that these differences reflect a collision of peoples who mixed at some point in the past, with different proportions of mixture in different groups living today. But it is also possible to over-interpret physical appearances, as it is known that appearances can also reflect environment and diet.

The first genetic work in India gave seemingly contradictory results. Researchers studying mitochondrial DNA, always passed down from mothers, found that the vast majority of mitochondrial DNA in Indians was unique to the subcontinent, and they estimated that the Indian mitochondrial DNA types only shared common ancestry with those predominant outside South Asia many tens of thousands of years ago. This suggested that on the maternal line, Indian ancestors had been largely isolated within the subcontinent for a long time, without mixing with neighbouring populations to the west, east or north. In contrast, a good fraction of Y chromosomes in India, passed from father to son, showed closer relatedness to West Eurasians—Europeans, central Asians and Near Easterners—suggesting mixture.

Some historians of India have thrown up their hands and discounted genetic information due to these apparently conflicting findings. The situation has not been helped by the fact that geneticists do not have formal training in archaeology, anthropology and linguistics—the fields that have dominated the study of human prehistory—and are prone to make elementary mistakes or to be tripped up by known fallacies when summarising findings from those fields. But it is foolhardy to ignore genetics. We geneticists may be the barbarians coming late to the study of the human past, but it is always a bad idea to ignore barbarians. We have access to a type of data that no one has had before, and we are wielding these data to address previously unapproachable questions about who ancient peoples were.


My research into the prehistory of India began in 2007 with a book and a letter.

The book was The History and Geography of Human Genes, Luca Cavalli-Sforza’s magnum opus, in which he mentions the “Negrito” people of the Andaman Islands in the Bay of Bengal, hundreds of kilometres from the mainland. The Andaman Islands have remained isolated by deep-sea barriers for most of the history of modern human dispersal through Eurasia, although the largest, Great Andaman, has been massively disrupted by mainland influence over the last few hundred years (the British used it as a colonial prison). North Sentinel Island is populated by one of the last largely uncontacted Stone Age peoples of the world—a group of several hundred people who are now protected from outside interference by the Indian government, and who are so not-of-our-world that they shot arrows at Indian helicopters sent to offer help after the Indian Ocean tsunami of 2004. The Andamanese speak languages that are so different from any others in Eurasia that they have no traceable connections. They also look very different from other humans living nearby, with slighter frames and tightly coiled hair. In one section of his book, Cavalli-Sforza speculated that the Andamanese might represent isolated descendants of the earliest expansions of modern humans out of Africa, perhaps having moved there before the migration that occurred after around fifty thousand years ago and that gave rise to most of the ancestry of non-Africans today.

On reading this, my colleagues and I wrote a letter to Lalji Singh and Kumarasamy Thangaraj of the Centre for Cellular and Molecular Biology in Hyderabad. A few years earlier, Singh and Thangaraj had published a paper on mitochondrial and Y-chromosome DNA from people of the Andaman Islands. Their study showed that the people of Little Andaman Island had been separated for tens of thousands of years from peoples of the Eurasian mainland. I asked them whether it would be possible to analyse whole genomes of the Andamanese, to gain a fuller picture.

Singh and Thangaraj were excited to collaborate and quickly convinced me that there was a broader picture to paint involving mainland Indians as well. They offered us access to a vast collection of DNA. In the freezers at the CCMB, they had assembled samples that represented the extraordinary human diversity of India—the last time I checked, the collection included more than 300 groups and more than 18,000 individual DNA samples. These had been assembled by students from all over India, who had visited villages and collected blood samples from people whose grandparents were from the same location and group. From the CCMB collection, we selected 25 groups that were as geographically, culturally and linguistically diverse as possible. The groups were of traditionally high as well as low social status in the Indian caste system, and also included a number of tribes entirely outside the caste system.

A few months later, Thangaraj came to our laboratory in Boston, bringing with him this unique and precious set of DNA samples. We analysed them using a single nucleotide polymorphism (SNP) microarray, a technology that had just recently become available in the United States but was not yet available in India. For this reason, Thangaraj had been granted permission by the Indian government to take the DNA outside India. (There are Indian regulations limiting export of biological material if the research can be achieved within the country.) An SNP microarray contains hundreds of thousands of microscopic pixels, each of which is covered by artificially synthesised stretches of DNA from the places in the genome that scientists have chosen to analyse. When a DNA sample is washed over the microarray, the fragments that overlap the artificial DNA sequences bind tightly, and the fragments that do not are washed away. Based on the relative intensity of binding to these bait sequences, a camera that detects fluorescent light can determine which possible genetic types a person carries in his or her genome. The SNP microarray that we used let us study many hundreds of thousands of positions in the genome that harbour a mutation carried by some people but not others. By studying these positions, it is possible to determine which people are most closely related to which others. The technique is much less expensive than sequencing a whole human genome since it zeroes in on points of interest: those that tend to differ among people and thus provide the greatest density of information about population history.

To obtain an initial picture of how the samples were related to each other, we used the mathematical technique of principal component analysis, which is also described in the previous chapter on West Eurasian population history, and which finds combinations of single-letter changes in DNA that are most informative about the differences among people. Using this method to display Indian genetic data on a two-dimensional graph, we found that the samples spread out along a line. At the far extreme of the line were West Eurasian individuals—Europeans, central Asians and Near Easterners—whom we had included in the analysis for the sake of comparison. We called the non–West Eurasian part of the line the “Indian Cline”: a gradient of variation among Indian groups that pointed on the plot like an arrow directly at West Eurasians.
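To make the idea of a cline concrete, here is a minimal sketch of principal component analysis on made-up data. It is not the analysis we ran: the allele frequencies, the six hypothetical individuals, their West Eurasian-related fractions and the function name `first_principal_component` are all assumptions chosen for illustration, and each individual is represented by expected allele counts with no sampling noise.

```python
import random

def first_principal_component(rows, iterations=200):
    """Return each row's coordinate on the first principal component.

    Uses power iteration on the sample-by-sample Gram matrix of the
    column-centred data, so only the standard library is needed.
    """
    n, m = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    centred = [[r[j] - col_means[j] for j in range(m)] for r in rows]
    gram = [[sum(a * b for a, b in zip(x, y)) for y in centred] for x in centred]
    v = [float(i + 1) for i in range(n)]  # arbitrary, non-constant start vector
    for _ in range(iterations):
        v = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    # Rayleigh quotient gives the top eigenvalue of the Gram matrix.
    eigenvalue = sum(v[i] * sum(gram[i][k] * v[k] for k in range(n)) for i in range(n))
    return [x * eigenvalue ** 0.5 for x in v]

# Toy allele frequencies in two hypothetical source populations.
rng = random.Random(0)
n_sites = 500
west = [rng.uniform(0.05, 0.95) for _ in range(n_sites)]
south = [rng.uniform(0.05, 0.95) for _ in range(n_sites)]

# Five mixed individuals plus one unmixed West Eurasian reference (1.0).
west_fractions = [0.2, 0.35, 0.5, 0.65, 0.8, 1.0]
dosages = [[2 * (a * w + (1 - a) * s) for w, s in zip(west, south)]
           for a in west_fractions]

scores = first_principal_component(dosages)
for fraction, score in zip(west_fractions, scores):
    print(fraction, round(score, 2))
```

In this toy setting the individuals fall on a straight line ordered by their mixture fraction, with the unmixed reference at one end. Real genotypes add sampling noise and further axes of variation, so a real plot is a scatter around such a line, not the line itself.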

A gradient in a principal component analysis plot can be caused by several quite different histories, but such a striking pattern led us to guess that many Indian groups today might be mixtures, in different proportions, of a West Eurasian-related ancestral population and another very different population. Seeing that the southernmost groups in India—which also spoke Dravidian languages—tended to be farthest away from West Eurasians in the plot, we explored a model in which Indians today are formed from a mixture of two ancestral populations, and we evaluated the consistency of this model with the data.

To test whether mixture occurred, we had to develop new methods. The methods that we applied in 2010 to show that mixture had occurred between Neanderthals and modern humans were, in fact, primarily developed to study Indian population history.

We first tested the hypothesis that Europeans and Indians descend from a common ancestral population that split at an earlier time from the ancestors of East Asians such as Han Chinese. We identified DNA letters where European and Indian genomes differed, and then measured how often Chinese samples had the genetic types seen in Europeans or Indians. We found that Chinese clearly share more DNA letters with Indians than they do with Europeans. That ruled out the possibility that Europeans and Indians descended from a common homogeneous ancestral population following their separation from the ancestors of Chinese.

We then tested the alternative hypothesis that Chinese and Indians descended from a common ancestral population since their separation from the ancestors of Europeans. However, this scenario did not hold up either: European groups are more closely related to all Indians than to all Chinese.

We found that the frequencies of the genetic mutations seen in all Indians are, on average, intermediate between those in Europeans and East Asians. The only way that this pattern could arise was through a mixture of ancient populations—one related to Europeans, central Asians and Near Easterners, and another related distantly to East Asians.
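The logic of these two tests can be sketched with simulated allele frequencies. The sketch below is an illustration only, not our actual method or data: the statistic `f4`, the use of a distant outgroup, the drift values and the function names are all assumptions. The statistic is expected to sit near zero when a simple tree is correct, and both candidate trees fail once the simulated Indians are a mixture.

```python
import random

def f4(a, b, c, d):
    """Mean of (a - b) * (c - d) over sites.

    Close to zero when the pair (a, b) and the pair (c, d) sit on
    separate branches of a simple tree; clearly non-zero otherwise.
    """
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)

def drift(freqs, sd, rng):
    """Perturb allele frequencies to mimic random genetic drift (toy model)."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sd))) for p in freqs]

def simulate(west_fraction, n_sites=20000, seed=1):
    """Toy allele frequencies for an outgroup, Europeans, Chinese and Indians."""
    rng = random.Random(seed)
    root = [rng.uniform(0.3, 0.7) for _ in range(n_sites)]
    outgroup = drift(root, 0.05, rng)
    non_african = drift(root, 0.05, rng)
    west = drift(non_african, 0.08, rng)   # West Eurasian-related branch
    east = drift(non_african, 0.08, rng)   # East Asian-related branch
    european = drift(west, 0.03, rng)
    chinese = drift(east, 0.03, rng)
    west_source = drift(west, 0.03, rng)
    east_source = drift(east, 0.03, rng)
    indian = [west_fraction * w + (1 - west_fraction) * e
              for w, e in zip(west_source, east_source)]
    return outgroup, european, chinese, indian

# Mixed scenario: Indians draw half their ancestry from each side.
out, eur, chi, ind = simulate(west_fraction=0.5)
indians_sister_to_europeans = f4(out, chi, eur, ind)  # first tree predicts ~0
indians_sister_to_chinese = f4(out, eur, chi, ind)    # second tree predicts ~0

# Control: with no mixture, the first tree really does hold.
out, eur, chi, unmixed = simulate(west_fraction=1.0)
control = f4(out, chi, eur, unmixed)

print(indians_sister_to_europeans, indians_sister_to_chinese, control)
```

With the mixed population both statistics come out clearly positive, so neither tree fits, while the unmixed control stays close to zero. That is the pattern of intermediate frequencies described above.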

We initially called the first population “West Eurasians,” as a way of referring to the large set of populations in Europe, the Near East and central Asia, among which there are only modest differences in the frequencies of genetic mutations from one group to another. These differences are typically about ten times smaller than the differences between Europeans and the people of East Asia. It was striking to find that one of the two populations contributing to the ancestry of Indians today grouped with West Eurasians. This looked to us like the easternmost edge of the ancient distribution of West Eurasian ancestry, where it had mixed with other very different people. We could see that the other population was more closely related to present-day East Asians such as the Chinese, but was also clearly tens of thousands of years separated from them. So it represented an early-diverging lineage that contributed to people living today in South Asia but not much to people living anywhere else.

Having identified the mixture, we searched for present-day Indian populations that might have escaped it. All the populations on the mainland had some West Eurasian-related ancestry. However, the people of Little Andaman Island had none. The Andamanese were consistent with being isolated descendants of an ancient East Asian-related population that contributed to South Asians. The indigenous people of Little Andaman Island, despite a census size of fewer than one hundred, turned out to be key to understanding the population history of India.


The tensest 24 hours of my scientific career came in October 2008, when my collaborator Nick Patterson and I travelled to Hyderabad to discuss these initial results with Singh and Thangaraj.

Our meeting on 28 October was challenging. Singh and Thangaraj seemed to be threatening to nix the whole project. Prior to the meeting, we had shown them a summary of our findings, which were that Indians today descend from a mixture of two highly divergent ancestral populations, one being “West Eurasians.” Singh and Thangaraj objected to this formulation because, they argued, it implied that West Eurasian people migrated en masse into India. They correctly pointed out that our data provided no direct evidence for this conclusion. They even reasoned that there could have been a migration in the other direction, of Indians to the Near East and Europe. Based on their own mitochondrial DNA studies, it was clear to them that the great majority of mitochondrial DNA lineages present in India today had resided in the subcontinent for many tens of thousands of years. They did not want to be part of a study that suggested a major West Eurasian incursion into India without being absolutely certain as to how the whole-genome data could be reconciled with their mitochondrial DNA findings. They also implied that the suggestion of a migration from West Eurasia would be politically explosive. They did not explicitly say this, but it had obvious overtones of the idea that migration from outside India had a transformative effect on the subcontinent.

Singh and Thangaraj suggested the term “genetic sharing” to describe the relationship between West Eurasians and Indians, a formulation that could imply common descent from an ancestral population. However, we knew from our genetic studies that a real and profound mixture between two different populations had occurred and made a contribution to the ancestry of almost every Indian living today, while their suggestion left open the possibility that no mixture had happened. We came to a standstill. At the time I felt that we were being prevented by political considerations from revealing what we had found.

That evening, as the fireworks of Diwali, one of the most important holidays of the Hindu year, crackled, and as young boys threw sparklers beneath the wheels of moving trucks outside our compound, Patterson and I holed up in his guest room at Singh and Thangaraj’s scientific institute and tried to understand what was going on. The cultural resonances of our findings gradually became clear to us. So we groped toward a formulation that would be scientifically accurate as well as sensitive to these issues.

The next day, the full group reconvened in Singh’s office. We sat together and came up with new names for ancient Indian groups. We wrote that the people of India today are the outcome of mixtures between two highly differentiated populations, “Ancestral North Indians” (ANI) and “Ancestral South Indians” (ASI), who before their mixture were as different from each other as Europeans and East Asians are today. The ANI are related to Europeans, central Asians, Near Easterners and people of the Caucasus, but we made no claim about the location of their homeland or any migrations. The ASI descend from a population not related to any present-day populations outside India. We showed that the ANI and ASI had mixed dramatically in India. The result is that everyone in mainland India today is a mix, albeit in different proportions, of ancestry related to West Eurasians, and ancestry more closely related to diverse East Asian and South Asian populations. No group in India can claim genetic purity.


Having come to this conclusion, we were able to estimate the fraction of West Eurasian-related ancestry in each Indian group.

To make these estimates, we measured the degree of the match of a West Eurasian genome to an Indian genome on the one hand and to a Little Andaman Islander genome on the other. The Little Andamanese were crucial here because they are related—albeit distantly—to the ASI but do not have the West Eurasian-related ancestry present in all mainland Indians, so we could use them as a reference point for our analysis. We then repeated the analysis, now replacing the Indian genome with the genome of a person from the Caucasus to measure the match rate we should expect if a genome was entirely of West Eurasian-related ancestry. By comparing the two numbers, we could ask: “How far is each Indian population from what we would expect for a population of entirely West Eurasian ancestry?” By answering this question we could estimate the proportion of West Eurasian-related ancestry in each Indian population.
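The comparison of the two match rates amounts to a ratio, which can be sketched as follows. This is a toy under stated assumptions, not our published pipeline: the simulated frequencies, the outgroup, the branch lengths, the true fraction of 0.4 and the names `f4` and `drift` are all hypothetical.

```python
import random

def f4(a, b, c, d):
    """Mean of (a - b) * (c - d) over sites (an allele-sharing contrast)."""
    return sum((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d)) / len(a)

def drift(freqs, sd, rng):
    """Perturb allele frequencies to mimic random genetic drift (toy model)."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sd))) for p in freqs]

true_west_fraction = 0.4  # hypothetical value the estimate should recover
rng = random.Random(7)
n_sites = 40000

root = [rng.uniform(0.3, 0.7) for _ in range(n_sites)]
outgroup = drift(root, 0.05, rng)
non_african = drift(root, 0.05, rng)
west = drift(non_african, 0.08, rng)
east = drift(non_african, 0.08, rng)
european = drift(west, 0.03, rng)
caucasus = drift(west, 0.03, rng)     # reference: entirely West Eurasian-related
andamanese = drift(east, 0.03, rng)   # reference: no West Eurasian-related ancestry
ani = drift(west, 0.03, rng)
asi = drift(east, 0.03, rng)
indian = [true_west_fraction * a + (1 - true_west_fraction) * s
          for a, s in zip(ani, asi)]

# Excess sharing between a West Eurasian genome and the Indian group,
# measured against the Andamanese baseline.
observed = f4(outgroup, european, indian, andamanese)
# The same quantity for a group of entirely West Eurasian-related ancestry.
fully_west = f4(outgroup, european, caucasus, andamanese)

estimate = observed / fully_west
print(round(estimate, 2))
```

The ratio recovers a value close to the fraction built into the simulation. It works here because the simulated Andamanese share a branch with the ASI and carry no West Eurasian-related ancestry, which is the role they play in the argument above.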

In this initial study and in subsequent studies with larger numbers of Indian groups, we found that West Eurasian-related mixture in India ranges from as low as 20 percent to as high as 80 percent. This continuum of West Eurasian-related ancestry in India is the reason for the Indian Cline, the gradient we had seen on our principal component plots. No group is unaffected by mixing: not the dominant castes, not the oppressed castes, and not the non-Hindu tribal populations living outside the caste system.

The mixture proportions provided clues about past events. For one thing, the genetic data hinted at the languages spoken by the ancient ANI and ASI. Groups in India that speak Indo-European languages typically have more ANI ancestry than those speaking Dravidian languages, who have more ASI ancestry. This suggested to us that the ANI probably spread Indo-European languages, while the ASI spread Dravidian languages.

The genetic data also hinted at the social status of the ancient ANI (higher social status on average) and ASI (lower social status on average). Groups of traditionally higher social status in the Indian caste system typically have a higher proportion of ANI ancestry than those of traditionally lower social status, even within states where everyone speaks the same language. For example, Brahmins, the priestly caste, tend to have more ANI ancestry than the groups they live among, even those speaking the same language. Although there are groups in India that are exceptions to these patterns, including well-documented cases where whole groups have shifted social status, the findings are statistically clear, and suggest that the ANI-ASI mixture in ancient India occurred in the context of social stratification.

People in the north primarily speak Indo-European languages and have relatively high proportions of West Eurasian-related ancestry. People in the south primarily speak Dravidian languages and have relatively low proportions of West Eurasian ancestry. Isolated tribal groups in the centre and east speak Austroasiatic languages.

The genetic data from Indians today also reveal something about the history of differences in social power between men and women. Around 20 to 40 percent of Indian men and around 30 to 50 percent of eastern European men have a Y-chromosome type that, based on the density of mutations separating people who carry it, descends in the last 6,800 to 4,800 years from the same male ancestor. In contrast, the mitochondrial DNA, passed down along the female line, is almost entirely restricted to India, suggesting that it may have nearly all come from the ASI, even in the north. The only possible explanation for this is major migration between West Eurasia and India in the Bronze Age or afterward. Males with this Y-chromosome type were extraordinarily successful at leaving offspring while female immigrants made far less of a genetic contribution.

The discrepancy between the Y chromosome and mitochondrial DNA patterns initially confused historians. But a possible explanation is that most of the ANI genetic input into India came from males. This pattern of sex-asymmetric population mixture is disturbingly familiar. Consider African Americans. The approximately 20 percent of ancestry that comes from Europeans derives in an almost four-to-one ratio from the male side. Consider Latinos from Colombia. The approximately 80 percent of ancestry that comes from Europeans is derived in an even more unbalanced way from males (a 50-to-1 ratio). I explore in part III what this means for the relationships among populations, and between males and females, but the common thread is that males from populations with more power tend to pair with females from populations with less. It is amazing that genetic data can reveal such profound information about the social nature of past events.


To understand what our findings about population mixture meant in the context of Indian history, we needed to know not just that population mixture had occurred, but also when.

One possibility we considered is that the mixtures we had detected were due to great human migrations at the end of the last ice age, after around fourteen thousand years ago, as improving climates changed deserts into habitable land and contributed to other environmental change that drove people across the landscape of Eurasia.

A second possibility is that the mixtures reflected movements of farmers of Near Eastern origin into South Asia, a migration that could be a possible explanation for the spread of Near Eastern farming into the Indus Valley after 9,000 years ago.

A third possibility is that the mixtures occurred in the last 4,000 years and were associated with the dispersal of Indo-European languages that are spoken today in India as well as in Europe. This possibility hints at events described in the Rig Veda. However, even if mixture occurred after 4,000 years ago, it is entirely possible that it took place between already-resident populations, one of which had migrated to the area from West Eurasia some centuries or even millennia earlier but had not yet interbred with the ASI.

All three of the possibilities involve migration at some point from West Eurasia into India. Although Singh and Thangaraj entertained the possibility of a migration out of India and into points as far west as Europe to explain the relatedness between the ANI and West Eurasian populations, I have always thought, based on the absence of any trace of ASI ancestry in the great majority of West Eurasians today and the extreme geographic position of India within the present-day distribution of peoples bearing West Eurasian-related ancestry, that the shared ancestry likely reflected ancient migrations into South Asia from the north or west. By dating the mixture, we could obtain more concrete information.

The challenge of getting a date prompted us to develop a series of new methods. Our approach was to take advantage of the fact that in the first generation, after the ANI and ASI mixed, their offspring would have had chromosomes of entirely ANI or ASI ancestry. In each subsequent generation, as individuals combined their mother’s and father’s chromosomes to produce the chromosome they passed on to their offspring, the stretches of ANI and ASI ancestry would have broken up, with one or two breakpoints per generation per chromosome. By measuring the typical size of stretches of ANI or ASI ancestry in Indians today, and determining how many generations would be needed to chop them down to their current size, Priya Moorjani, a graduate student in my laboratory, succeeded in estimating a date.
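The arithmetic behind this kind of dating can be sketched with a simplified single-pulse model. The function name, the input values and the generation time of 28 years are assumptions for illustration, and the method we actually developed is more involved than this.

```python
def generations_since_mixture(mean_tract_cm, ancestry_fraction):
    """Estimate generations since a single pulse of mixture.

    mean_tract_cm: average length, in centimorgans, of unbroken stretches
        of one ancestry (for example ANI).
    ancestry_fraction: genome-wide share of that same ancestry.

    Simplified model: recombination adds about one breakpoint per morgan
    per generation, and a breakpoint ends a stretch only when the adjoining
    segment carries the other ancestry, which happens with probability
    (1 - ancestry_fraction).
    """
    mean_tract_morgans = mean_tract_cm / 100.0
    return 1.0 / (mean_tract_morgans * (1.0 - ancestry_fraction))

YEARS_PER_GENERATION = 28  # assumed average; not a figure from the text

# Hypothetical inputs: 2 cM average stretches, half the genome from each source.
generations = generations_since_mixture(mean_tract_cm=2.0, ancestry_fraction=0.5)
years = generations * YEARS_PER_GENERATION
print(round(generations), round(years))
```

With these made-up inputs the estimate is about 100 generations, or roughly 2,800 years. Shorter stretches imply an older mixture, and longer stretches a more recent one.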

Analysis of the primary patterns of genetic variation in South Asia shows that the majority of Indian groups form a gradient of ancestry, with Indo-European speakers from the north clustering at one extreme, and Dravidian speakers from the south at the other.

We found that all Indian groups we analysed had ANI-ASI mixture dates between 4,000 and 2,000 years ago, with Indo-European-speaking groups having more recent mixture dates on average than Dravidian-speaking groups. The older mixture dates in Dravidian speakers surprised us. We had expected that the oldest mixtures would be found in Indo-European-speaking groups of the north, as it is presumably there that the mixture first occurred. We then realised that an older date in Dravidians actually makes sense, as the present-day locations of people do not necessarily reflect their past locations. Suppose that the first round of mixture in India happened in the north close to 4,000 years ago, and was followed by subsequent waves of mixture in northern India as previously established populations and people with much more West Eurasian ancestry came into contact repeatedly along a boundary zone. The people who were the products of the first mixtures in northern India could plausibly, over thousands of years, have mixed with or migrated to southern India, and thus the dates in southern Indians today would be those of the first round of mixture. Later waves of mixture of West Eurasian-related people into northern Indian groups would then cause the average date of mixture estimated in northern Indians today to be more recent than in southern Indians.

A hard look at the genetic data confirms the theory of multiple waves of ANI-related mixture into the north. Interspersed among the short stretches of ANI-derived DNA we find in northern Indians, we also find quite long stretches of ANI-derived DNA, which must reflect recent mixtures with people of little or no ASI ancestry.

Remarkably, the patterns we observed were consistent with the hypothesis that all of the mixture of ANI and ASI ancestry that occurred in the history of some present-day Indian groups happened within the last 4,000 years. This meant that the population structure of India before around 4,000 years ago was profoundly different from what it is today. Before then, there were unmixed populations, but afterward, there was convulsive mixture in India, which affected nearly every group.

So between 4,000 and 3,000 years ago—just as the Indus Civilisation collapsed and the Rig Veda was composed—there was a profound mixture of populations that had previously been segregated. Today in India, people speaking different languages and coming from different social statuses have different proportions of ANI ancestry. Today, ANI ancestry in India derives more from males than from females. This pattern is exactly what one would expect from an Indo-European-speaking people taking the reins of political and social power after 4,000 years ago and mixing with the local peoples in a stratified society, with males from the groups in power having more success in finding mates than those from the disenfranchised groups.


How is it that the genetic marks of these ancient events have not been blurred beyond recognition after thousands of years of history?

One of the most distinctive features of traditional Indian society is caste—the system of social stratification that determines whom one can marry and what privileges and roles one has in society. The repressive nature of caste has spawned in reaction major religions—Jainism, Buddhism, and Sikhism—each of which offered refuge from the caste system. The success of Islam in India was also fuelled by the escape it provided for low-social-status groups that converted en masse to the new religion of the Mughal rulers. Discrimination on the basis of caste was outlawed with the birth of democratic India, but it still shapes whom people choose to socialise with and marry today.

A sociological definition of a caste is a group that interacts economically with people outside it (through specialised economic roles), but segregates itself socially through endogamy (which prevents people from marrying outsiders). Jews in northeastern Europe, from whom I descend, were, prior to the “Jewish emancipation” beginning in the late-eighteenth century, a caste in lands where not all groups were castes. Jews served an economic function as moneylenders, liquor vendors, merchants and craftspeople for the population within which they lived. Religious Jews then, as now, segregated themselves socially through dietary rules (kosher laws), distinctive dress, body modification (circumcision of males), and strictures against marrying outsiders.

Caste in India is organised at two levels, varna and jati. The varna system involves stratification of all of society into at least four ranks: at the top the priestly group (Brahmins) and the warrior group (Kshatriyas); in the middle the merchants, farmers and artisans (Vaishya); and finally the lower castes (Shudras), who are labourers. There are also the Chandalas or Dalits, the “Scheduled Castes,” people who are considered so low that they are “untouchable” and excluded from normal society. Finally, there are the “Scheduled Tribes,” the official Indian government name for people outside Hinduism who are neither Muslim nor Christian. The caste system is a deep part of traditional Hindu society and is described in detail in the religious texts, the Vedas, that were composed subsequent to the Rig Veda.

The jati system, which few people outside India understand, is much more complicated, and involves a minimum of 4,600 and by some accounts around 40,000 endogamous groups. Each is assigned a particular rank in the varna system, but strong and complicated endogamy rules prevent people from most different jatis from mixing with each other, even if they are of the same varna level. It is also clear that in the past, whole jati groups have changed their varna ranks. For example, the Gujjar jati (from which the state of Gujarat in northwest India takes its name) have a variety of ranks depending on where in India they live, which is likely to reflect the fact that in some regions, Gujjars have successfully made the case to raise the status of their jati within the varna hierarchy.

How the varna and jati relate to each other is a much-debated mystery. One hypothesis suggested by the anthropologist Irawati Karve is that thousands of years ago, Indian peoples lived in effectively endogamous tribal groups that did not mix, much like tribal groups in other parts of the world today. Political elites then ensconced themselves at the top of the social system as priests, kings and merchants, creating a stratified system in which the tribal groups were incorporated into society in the form of labouring groups that remained at the bottom of society as Shudras and Dalits. The tribal organisation was thus fused with the system of social stratification to form early jatis, and eventually the jati structure percolated up to the higher ranks of society, so that today there are many jatis of higher as well as of lower castes. These ancient tribal groups have preserved their distinctiveness through the caste system and endogamy rules.

An alternative hypothesis is that strong endogamy rules are not very old at all. The theory of the caste system is undeniably old, as it is described in the ancient Manusmriti, a Hindu text composed some hundreds of years after the Rig Veda. The Manusmriti describes in exquisite detail the varna system of ranked social stratification, and within it the innumerable jati groups. It puts the whole system into a religious framework, justifying its existence as part of the natural order of life. However, revisionist historians, led by the anthropologist Nicholas Dirks, have argued that, in fact, strong endogamy was not practiced in ancient India, but instead is largely an innovation of British colonialism. Dirks and his colleagues showed how, as a way of effectively ruling India, British policy beginning in the eighteenth century was to strengthen the caste system, carving out a natural place within Indian society for British colonialists as a new caste group. To achieve this, the British strengthened the institution of caste in parts of India where it was not very important, and worked to harmonise caste rules across different regions. Given these efforts, Dirks suggested that strong endogamy restrictions as manifested in today’s castes might not be as old in practice as they seem.

To understand the extent to which the jatis corresponded to real genetic patterns, we examined the degree of differentiation of each jati from which we had data with all others based on differences in mutation frequencies. We found that the degree of differentiation was at least three times greater than that among European groups separated by similar geographic distances. This could not be explained by differences in ANI ancestry among groups, or differences in the region within India from which the population came, or differences in social status. Even comparing pairs of groups matched according to these criteria, we found that the degree of genetic differentiation among Indian groups was many times larger than that in Europe.
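One common way to turn differences in mutation frequencies into a single measure of differentiation is Wright's fixation index, F_ST. The text does not say which statistic was used, so the following is only a minimal sketch of the general idea for two groups at a single site, with made-up frequencies and a hypothetical function name.

```python
def fst_two_groups(p1: float, p2: float) -> float:
    """Wright's F_ST for two equally weighted groups at one biallelic site.

    p1 and p2 are the frequencies of the same mutation in each group.
    F_ST = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of
    the pooled groups and H_S is the average within-group heterozygosity.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # the site does not vary in either group
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Illustrative frequencies only.
    print(fst_two_groups(0.20, 0.40))
    print(fst_two_groups(0.30, 0.30))
```

In practice such a measure would be averaged over many thousands of sites; groups with identical frequencies give a value of zero, and larger frequency differences give larger values.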

These findings led us to surmise that many Indian groups today might be the products of population bottlenecks. These occur when relatively small numbers of individuals have many offspring and their descendants too have many offspring and remain genetically isolated from the people who surround them due to social or geographic barriers. Famous population bottlenecks in the history of people of European ancestry include the ones that contributed most of the ancestry of the Finnish population (around two thousand years ago), a large fraction of the ancestry of today’s Ashkenazi Jews (around six hundred years ago), and most of the ancestry of religious dissenters such as Hutterites and Amish who eventually migrated to North America (around three hundred years ago). In each case, a high reproductive rate among a small number of individuals caused the rare mutations carried in those individuals to rise in frequency in their descendants.

We looked for the telltale signs of population bottlenecks in India and found them: identical long stretches of sequence between pairs of individuals within the same group. The only possible explanation for such segments is that the two individuals descend from an ancestor in the last few thousand years who carried that DNA segment. What’s more, the average size of the shared DNA segments reveals how long ago in the past that shared ancestor lived, as the shared segments break up at a regular rate in each generation through the process of recombination.

The genetic data told a clear story. Around a third of Indian groups experienced population bottlenecks as strong as or stronger than the ones that occurred among Finns or Ashkenazi Jews. We later confirmed this finding in an even larger dataset that we collected while working with Thangaraj: genetic data from more than 250 jati groups spread throughout India.

Many of the population bottlenecks in India were also exceedingly old. One of the most striking we discovered was in the Vysya of the southern Indian state of Andhra Pradesh, a middle-caste group of approximately five million people whose population bottleneck we could date (from the size of segments shared between individuals of the same population) to between 3,000 and 2,000 years ago.

The observation of such a strong population bottleneck among the ancestors of the Vysya was shocking. It meant that after the population bottleneck, the ancestors of the Vysya had maintained strict endogamy, allowing essentially no genetic mixing into their group for thousands of years. Even an average rate of influx into the Vysya of as little as 1 percent per generation would have erased the genetic signal of a population bottleneck. The ancestors of the Vysya did not live in geographic isolation. Instead, they lived cheek by jowl with other groups in a densely populated part of India. Despite proximity to other groups, the endogamy rules and group identity in the Vysya have been so strong that they maintained strict social isolation from their neighbours, and transmitted that culture of social isolation to each and every subsequent generation.
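A back-of-the-envelope calculation shows why even a small steady influx matters. If a fraction m of each generation's ancestry comes from outside the group, the share of ancestry still tracing to the founders after n generations is (1 - m) raised to the power n. The sketch below assumes a constant influx rate and a 28-year generation time, both introduced here for illustration; it tracks only ancestry replacement, not the statistic used to detect bottlenecks.

```python
def founder_ancestry_remaining(influx_per_generation: float,
                               generations: int) -> float:
    """Fraction of ancestry still tracing to the founders.

    Assumes the same influx rate every generation (a simplification).
    """
    if not 0.0 <= influx_per_generation <= 1.0:
        raise ValueError("influx rate must lie between 0 and 1")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - influx_per_generation) ** generations


if __name__ == "__main__":
    years_per_generation = 28  # assumption, not from the article
    for years in (2000, 3000):
        n = round(years / years_per_generation)
        share = founder_ancestry_remaining(0.01, n)
        print(f"{years} years (~{n} generations): "
              f"{share:.0%} founder ancestry remains")
```

With 1 percent influx per generation, roughly half to two-thirds of the founders' ancestry would have been replaced over 2,000 to 3,000 years, diluting the long shared segments that mark a bottleneck.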

And the Vysya were not unique. A third of the groups we analysed gave similar signals, implying thousands of groups in India like this. Indeed, it is even possible that we were underestimating the fraction of groups in India affected by strong long-term endogamy. To show a signal, a group needed to have gone through a population bottleneck. Groups that descended from a larger number of founders but nevertheless maintained strict endogamy ever since would go undetected by our statistics. Rather than an invention of colonialism as Dirks suggested, long-term endogamy as embodied in India today in the institution of caste has been overwhelmingly important for millennia.

Learning this feature of Indian history had a strong resonance for me. When I started my work on Indian groups, I came to it as an Ashkenazi Jew, a member of an ancient caste of West Eurasia. I was uncomfortable with my affiliation but did not have a clear sense of what I was uncomfortable about. My work on India crystallised my discomfort. There is no escaping my background as a Jew. I was raised by parents whose highest priority was being open to the secular world, but they themselves had been raised in a deeply religious community and were children of refugees from persecution in Europe that left them with a strong sense of ethnic distinctiveness. When I was growing up, we followed Jewish dietary rules at home—I believe my parents did so in part in the hope that their own families would feel comfortable eating at our house—and I went for nine years to a Jewish school and spent many summers in Jerusalem. From my parents as well as from my grandparents and cousins I imbibed a strong sense of difference—a feeling that our group was special—and a knowledge that I would cause disappointment and embarrassment if I married someone non-Jewish (a conviction that I know also had a powerful effect on my siblings). Of course, my concern about disappointing my family is nothing compared to the shame, isolation and violence that many expect in India for taking a partner outside their group. And yet my perspective as a Jew made me empathise strongly with all the likely Romeos and Juliets over thousands of years of Indian history, whose loves across ethnic lines have been quashed by caste. My Jewish identity also helped me to understand on a visceral level how this institution had successfully perpetuated itself for so long.

The first migration was from the Near East after around nine thousand years ago, which brought farmers who mixed with local hunter-gatherers. The second migration was from the steppe after around five thousand years ago, which brought pastoralists who then mixed with local farmers. Mixtures of these groups then formed two gradients of ancestry: one in Europe and one in India.

What the data were showing us was that the genetic distinctions among jati groups within India were in many cases real, thanks to the longstanding history of endogamy in the subcontinent. People tend to think of India, with its more than 1.3 billion people, as having a tremendously large population, and indeed many Indians as well as foreigners see it this way. But genetically, this is an incorrect way to view the situation. The Han Chinese are truly a large population. They have been mixing freely for thousands of years. In contrast, there are few, if any, Indian groups that are demographically very large, and the degree of genetic differentiation among Indian jati groups living side by side in the same village is typically two to three times higher than the genetic differentiation between northern and southern Europeans. The truth is that India is composed of a large number of small populations.


The groups of European ancestry that have experienced strong population bottlenecks—Ashkenazi Jews, Finns, Hutterites, Amish, French Canadians of the Saguenay-Lac-Saint-Jean region and others—have been the subject of endless and productive study by medical researchers. Because of their population bottlenecks, rare disease-causing mutations that happened to have been carried in the founder individuals have dramatically increased in frequency. Rare mutations that are innocuous when a person inherits a copy from only one of their parents—they act recessively, which means that two copies are required to cause disease—can be lethal when a person inherits copies from both parents. However, once these mutations increase in frequency due to a population bottleneck, there is an appreciable chance that individuals in the population will inherit the same mutation from both of their parents. For example, in Ashkenazi Jews there is a high incidence of the devastating disease of Tay-Sachs, which causes brain degeneration and death within the first few years of life. One of my first cousins died within months of birth due to an Ashkenazi founder disease called Zellweger syndrome, and one of my mother’s first cousins died young of Riley-Day syndrome, or familial dysautonomia, another Ashkenazi founder disease. Hundreds of such diseases have been identified, and the responsible genes have been identified in European founder populations, including Ashkenazi Jews. These findings have led to important biological insights, and in a few cases to the development of drugs that counteract the effect of the damaged genes.

India, of course, has far more people who belong to groups that experienced strong bottlenecks, as the country’s population is huge, and as around one-third of Indian jati groups descend from bottlenecks as strong as or stronger than those that occurred in Ashkenazi Jews or Finns. Searches for the genes responsible for disorders in these Indian groups, therefore, have the potential to identify risk factors for thousands of diseases. Despite the fact that no one has systematically looked, a few such cases are already known. For example, the Vysya are known to have a high rate of prolonged muscle paralysis in response to muscle relaxants given prior to surgery. As a result, clinicians in India know not to give these drugs to people of Vysya ancestry. The condition is due to low levels of the protein butyrylcholinesterase in some Vysya. Genetic work has shown that this condition is due to a recessively acting mutation that occurs at about 20-percent frequency in the Vysya, a far higher rate than in other Indian groups, presumably because the mutation was carried in one of the Vysya’s founders. This frequency is sufficiently high that the mutation occurs in two copies in about 4 percent of the Vysya, causing disastrous reactions for people who carry two copies and go under anaesthesia.
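The step from a 20-percent mutation frequency to about 4 percent of people carrying two copies follows from simple probability, assuming that partners within the group pair up at random with respect to this mutation (the Hardy-Weinberg assumption, which is introduced here and not stated in the text). A minimal sketch, with a hypothetical function name:

```python
def genotype_frequencies(q: float) -> dict:
    """Expected genotype shares for a mutation at frequency q.

    Assumes random mating within the group (Hardy-Weinberg proportions).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    p = 1.0 - q
    return {
        "no_copies": p * p,       # unaffected, not a carrier
        "one_copy": 2.0 * p * q,  # carrier, unaffected for a recessive trait
        "two_copies": q * q,      # affected
    }


if __name__ == "__main__":
    for genotype, share in genotype_frequencies(0.20).items():
        print(f"{genotype}: {share:.0%}")
```

Under this assumption, a frequency of 0.20 gives 0.20 × 0.20 = 0.04 of people with two copies, and about a third of the group would be carriers of a single copy.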

As the Vysya example demonstrates, the history of India presents an important opportunity for biological discovery, as finding genes for rare recessive diseases is cheap with modern genetic technology. All it takes is access to a small number of people in a jati group with the disease, whose genomes can then be sequenced. Genetic methods can identify which of the thousands of groups in India have experienced strong population bottlenecks. Local doctors and midwives can identify syndromes that occur at high rates in specific groups. It is surely the case that local doctors, having delivered thousands of babies, will know that certain diseases and malformations occur more frequently in some groups than in others. This is all the information one needs to collect a handful of blood samples for genetic analysis. Once these samples are in hand, the genetic work to find the responsible genes is straightforward.

The opportunities for making a medical difference in India through surveys of rare recessive disease are particularly great because arranged marriage is very common. Much as I find restrictions on marriage discomfiting, arranged marriages are a fact in numerous communities in India—as they are in the ultra-Orthodox Jewish community. A number of my own first cousins in the Ashkenazi Jewish Orthodox community have found their spouses that way. In this religious community, a genetic-testing organisation founded by Rabbi Josef Ekstein in 1983, after he lost four of his own children to Tay-Sachs, has driven many recessive diseases almost to extinction. In many Orthodox religious high schools in the United States and Israel, nearly all teenagers are tested for whether they are carriers of the handful of rare recessive disease-causing mutations that are common in the Ashkenazi Jewish community. If they are carriers, they are never introduced by matchmakers to other teenagers carrying the same mutation. There is every opportunity to do the same in India, but instead of affecting a few hundred thousand people, in India the approach could have an impact on hundreds of millions.


Up until 2016, the genetic studies of Indian groups focussed on the ANI and the ASI: the two populations that mixed in different proportions to produce the great diversity of endogamous groups still living in India today.

But this changed in 2016, when several laboratories, including mine, published the first genome-wide ancient DNA from some of the world’s earliest farmers, people who lived between 11,000 and 8,000 years ago in present-day Israel, Jordan, Anatolia and Iran. When we studied how these early farmers of the Near East were related to people living today, we found that present-day Europeans have strong genetic affinity to early farmers from Anatolia, consistent with a migration of Anatolian farmers into Europe after 9,000 years ago. Present-day people from India have a strong affinity to ancient Iranian farmers, suggesting that the expansion of Near Eastern farming eastward to the Indus Valley after 9,000 years ago had as important an impact on the population of India. But our studies also revealed that present-day people in India have strong genetic affinities to ancient steppe pastoralists. How could the genetic evidence of an impact of an Iranian farming expansion on the population of India be reconciled with the evidence of steppe expansions? The situation was reminiscent of what we had found a couple of years before in Europe, where today’s populations are a mixture, not just of indigenous hunter-gatherers and migrant farmers but also of a third major group with an origin in the steppe.

To gain some insight, Iosif Lazaridis in my laboratory wrote down mathematical models for present-day Indian groups as mixtures of populations related to Little Andaman Islanders, ancient Iranian farmers and ancient steppe peoples. What he found is that almost every group in India has ancestry from all three populations. Nick Patterson then combined the data from almost 150 present-day Indian groups to come up with a unified model that allowed him to obtain precise estimates of the contribution of these three ancestral populations to present-day Indians.

When Patterson inferred what would have been expected for a population of entirely ANI ancestry—one with no Andamanese-related ancestry—he determined that they would be a mixed population of Iranian farmer-related ancestry and steppe pastoralist-related ancestry. But when he inferred what would have been expected for a population of entirely ASI ancestry—one with no Yamnaya-related ancestry—he found that they too must have had substantial Iranian farmer-related ancestry (the rest being Little Andamanese-related).

This was a great surprise. Our finding that both the ANI and ASI had large amounts of Iranian-related ancestry meant that we had been wrong in our original presumption that one of the two major ancestral populations of the Indian Cline had no West Eurasian ancestry. Instead, people descended from Iranian farmers made a major impact on India twice, admixing both into the ANI and the ASI.

Patterson proposed a major revision to our working model for deep Indian history. The ANI were a mixture of about 50 percent steppe ancestry related distantly to the Yamnaya, and 50 percent Iranian farmer-related ancestry from the groups the steppe people encountered as they expanded south. The ASI were also mixed, a fusion of a population descended from earlier farmers expanding out of Iran (around 25 percent of their ancestry), and previously established local hunter-gatherers of South Asia (around 75 percent of their ancestry).
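The revised model can be read as simple arithmetic: a group's ancestry from the three deeper sources follows from its proportion of ANI ancestry. The sketch below uses only the approximate percentages given above and assumes a group lies exactly on the two-way ANI-ASI mixture; the function and dictionary names are hypothetical.

```python
# Approximate compositions from the revised working model.
ANI = {"steppe": 0.50, "iranian_farmer": 0.50, "south_asian_hunter_gatherer": 0.00}
ASI = {"steppe": 0.00, "iranian_farmer": 0.25, "south_asian_hunter_gatherer": 0.75}


def three_way_ancestry(ani_fraction: float) -> dict:
    """Ancestry from the three deeper sources for a given ANI proportion.

    Assumes the group is a simple two-way mixture of ANI and ASI.
    """
    if not 0.0 <= ani_fraction <= 1.0:
        raise ValueError("ANI fraction must lie between 0 and 1")
    asi_fraction = 1.0 - ani_fraction
    return {
        source: ani_fraction * ANI[source] + asi_fraction * ASI[source]
        for source in ANI
    }


if __name__ == "__main__":
    # Illustrative ANI proportions, not measured values for any group.
    for ani in (0.2, 0.5, 0.8):
        parts = three_way_ancestry(ani)
        summary = ", ".join(f"{k} {v:.1%}" for k, v in parts.items())
        print(f"ANI {ani:.0%}: {summary}")
```

For example, a group with half ANI and half ASI ancestry would, under this model, carry about 25 percent steppe-related, 37.5 percent Iranian farmer-related and 37.5 percent South Asian hunter-gatherer-related ancestry.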

So the ASI were not likely to have been the previously established hunter-gatherer population of India, and instead may have been the people responsible for spreading Near Eastern agriculture across South Asia. Based on the high correlation of ASI ancestry to Dravidian languages, it seems likely that the formation of the ASI was the process that spread Dravidian languages as well.

These results reveal a remarkably parallel tale of the prehistories of two subcontinents of Eurasia: Europe and India. In both, farmers expanding from the Near East after 9,000 years ago, from Anatolia into Europe and from Iran into South Asia, interbred with the previously established hunter-gatherer populations to form new mixed groups between 9,000 and 4,000 years ago. Both subcontinents were then also affected by a second, later major migration with an origin in the steppe, in which Yamnaya pastoralists speaking an Indo-European language mixed with the previously established farming population they encountered along the way, in Europe forming the peoples associated with the Corded Ware culture, and in India eventually forming the ANI. These populations of mixed steppe and farmer ancestry then mixed with the previously established farmers of their respective regions, forming the gradients of mixture we see in both subcontinents today.

The Yamnaya—who the genetic data show were closely related to the source of the steppe ancestry in both India and Europe—are obvious candidates for spreading Indo-European languages to both these subcontinents of Eurasia. Remarkably, Patterson’s analysis of population history in India provided an additional line of evidence for this. His model of the Indian Cline was based on the idea of a simple mixture of two ancestral populations, the ANI and ASI. But when he looked harder and tested each of the Indian Cline groups in turn for whether it fit this model, he found that there were six groups that did not fit in the sense of having a higher ratio of steppe-related to Iranian farmer-related ancestry than was expected from this model. All six of these groups are in the Brahmin varna—with a traditional role in society as priests and custodians of the ancient texts written in the Indo-European Sanskrit language—despite the fact that Brahmins made up only about 10 percent of the groups Patterson tested. A natural explanation for this was that the ANI were not a homogeneous population when they mixed with the ASI, but instead contained socially distinct subgroups with characteristic ratios of steppe to Iranian-related ancestry. The people who were custodians of Indo-European language and culture were the ones with relatively more steppe ancestry, and because of the extraordinary strength of the caste system in preserving ancestry and social roles over generations, the ancient substructure in the ANI is evident in some of today’s Brahmins even after thousands of years. This finding provides yet another line of evidence for the steppe hypothesis, showing that not just Indo-European languages, but also Indo-European culture as reflected in the religion preserved over thousands of years by Brahmin priests, was likely spread by peoples whose ancestors originated in the steppe.

The picture of population movements in India is still far less crisp than our picture of Europe because of the lack of ancient DNA from South Asia. An outstanding mystery is the ancestry of the peoples of the Indus Valley Civilization, who were spread across the Indus Valley and parts of northern India between 4,500 and 3,800 years ago, and were at the crossroads of all these great ancient movements of people. We have yet to obtain ancient DNA from the people of the Indus Valley Civilization, but multiple research groups, including mine, are pursuing this as a goal. At a lab meeting in 2015, the analysts in our group went around the table placing bets on the likely genetic ancestry of the Indus Valley Civilization people, and the bets were wildly different. At the moment, three very different possibilities are still on the table. One is that Indus Valley Civilization people were largely unmixed descendants of the first Iranian-related farmers of the region, and spoke an early Dravidian language. A second possibility is that they were the ASI—already a mix of people related to Iranian farmers and South Asian hunter-gatherers—and if so, they would also probably have spoken a Dravidian language. A third possibility is that they were the ANI, already mixed between steppe and Iranian farmer-related ancestry, and thus would instead likely have spoken an Indo-European language. These scenarios have very different implications, but with ancient DNA, this and other great mysteries of the Indian past will soon be resolved.

This essay is excerpted from David Reich’s Who We Are and How We Got Here: Ancient DNA and the New Science of the Human Past, recently available in a South Asia edition from Oxford University Press.