Sunday, April 27, 2008

Correlation between major Indian Y-chromosome markers

Using 31 different population clusters from the Indian subcontinent, I wanted to see whether there are significant correlations (positive or negative) between the frequencies of any two haplogroups. Two positively correlated haplogroups may indicate a shared history and expansion, whereas a negative correlation may indicate opposite histories of those haplogroups. I considered the 5 major Indian haplogroups for the analysis: R1A1, H, R2, J2 and L.
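
As a rough illustration of this kind of pairwise test, here is a minimal Python sketch that computes the Pearson correlation coefficient for every pair of haplogroup columns. The frequencies are made-up placeholders for three haplogroups, not the data used in the analysis, and the helper names `pearson_r` and `pairwise_correlations` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pairwise_correlations(freqs):
    """Map each unordered pair of haplogroups to its correlation coefficient."""
    return {(a, b): pearson_r(freqs[a], freqs[b])
            for a, b in combinations(freqs, 2)}

# PLACEHOLDER frequencies (one value per population cluster),
# invented for illustration only; they are not the real data.
toy_freqs = {
    "R1A1": [0.10, 0.20, 0.30, 0.40],
    "H":    [0.40, 0.30, 0.20, 0.10],
    "L":    [0.05, 0.10, 0.15, 0.20],
}

for pair, r in pairwise_correlations(toy_freqs).items():
    print(pair, round(r, 3))
```

A coefficient near +1 means the two frequencies rise together across clusters, and one near -1 means one rises as the other falls. Whether a coefficient is statistically meaningful is a separate question that depends on the number of clusters.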


RESULTS


Dravidian castes:

  • R1A1 increases then L increases

  • R2 increases then H decreases

  • R2 increases then L decreases


Indo-European castes (IE):

  • R2 increases then R1A1 decreases


Dravidian + IE castes:

  • R1A1 increases then H decreases
  • R2 increases then R1A1 decreases


Interpretation:

Some members of H were replaced by a Holocene population consisting of R1A1 and L. Some time later (possibly during the Neolithic), members of R2 expanded over an R1A1/L/H background population.

The full analysis can be downloaded here:

http://www.zshare.net/download/11148490cde21020/

23 comments:

VM Weber said...

I think it's better if we upload those files to some groups.

Ravindra Mundkur said...

Do you mean that H is pre-Holocene, R1A1 and L Holocene (ca. 10K yrs BP) and R2 Neolithic (ca. 2000-1900 BC)?
Don't we have any markers for the post-Vedic period, say 800-500 BC or younger?
How old could be the pre-Holocene H?

Ibra said...

“Do you mean that H is pre-Holocene, R1A1 and L Holocene (ca. 10K yrs BP) and R2 Neolithic (ca. 2000-1900 BC)?”

I combined the samples of H* and H1 (most of which is H1). H is estimated to be 30 kyr old, and H1, at 10.5 kyr, likely arose in South West India based on its variance and frequency. I interpreted the data as the result of gene flow from an R1A1/L Holocene population to another Holocene population mainly of H1. R2 seems to have overridden R1a1/L/H mutually in the North and the South, mirroring the South Asian Neolithic. However, the analysis doesn’t imply anything about the timing of the R2 movement, only that R2 is very intrusive to other haplogroups; therefore other interpretations about the movement of R2 are possible.


“Don't we have any markers for the post-Vedic period, say 800-500 BC or younger?”

I haven’t seen any yet, but it will happen once typical Indian haplogroups are resolved into younger subclades. Another thing is that a population carrying a haplogroup may expand long after the haplogroup's coalescence time.

Maju said...

That looks like R1a and L are in positive correlation, the only ones. All other clades (and these two clades when taken together too) seem to have a negative correlation between them in all groups.

So it would translate as three layers: H, R2 and R1a+L. But I don't see any reason to suppose R2 arrived after R1a+L. In fact, given its geographic distribution (somehow between the R1a and H areas) and that it is peripheral in relation to haplogroup R as a whole (and P too), I have always assumed that R2 is older than R1a in India.

Ibra said...

"That looks like R1a and L are in positive correlation, the only ones. All other clades (and these two clades when taken together too) seem to have a negative correlation between them in all groups."

So for example H and L are negatively correlated?

Maju said...

So for example H and L are negatively correlated?

Yah, right. I didn't realize that Manjunat doesn't say that. :(

Still, following Sahoo et al., L has a stronger NW-West distribution (coincident with R1a and contrasting with H and R2), which suggests that it would be the case anyhow.

It actually contrasts less with H (both are relatively strong on the Western coast, though it doesn't look like their core area), and it's maybe more parallel in its South Asian distribution with J2. So I would think (at first sight at least) that L and J2 spread primarily in Neolithic and Chalcolithic times, correlating with IVC most closely. But while J2 looks West Asian in ultimate origin, L could well be of South Asian origin, being strongest in Pakistan.

In any case, I can make little sense of R2 being a late arrival. It's negatively correlated with everything else and it's too deep into India (looking from the NW main migratory gate) to be that. It would have needed to cross the whole subcontinent leaving only minor traces of its passage behind: too focused a leapfrog migration to make much sense.

VM Weber said...

Yah, right. I didn't realize that Manjunat doesn't say that. :(

What!?

I have always assumed that R2 is older than R1a in India.

A big chunk of R1a1 looks much younger (in fact post-Vedic). There are too many 12/12 matches between Indian and European R1a1. If you consider this calculation then by the most conservative estimation (89 generations and 25 years per generation length) those may be around 2500 years old.
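
The arithmetic behind such an estimate is simply generations multiplied by years per generation. A minimal sketch follows: the 89 generations and 25 years come from the comment above, while the function name is hypothetical and the 30-year generation length is only an assumed alternative, included to show how sensitive the result is to that choice.

```python
def years_from_generations(generations, years_per_generation):
    """Convert a generation count into elapsed years."""
    return generations * years_per_generation

# Values quoted in the comment: 89 generations at 25 years each.
print(years_from_generations(89, 25))  # 2225

# ASSUMED alternative generation length, for comparison only.
print(years_from_generations(89, 30))  # 2670
```

The two results bracket the "around 2500 years" figure, so the estimate depends noticeably on the generation length chosen.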

Maju said...

What!?

Sorry, should read "Ibra". I have yet to get used to this duality of bloggers. :)

A big chunk of R1a1 looks much younger (in fact post-Vedic). There are too many 12/12 matches between Indian and European R1a1. If you consider this calculation then by the most conservative estimation (89 generations and 25 years per generation length) those may be around 2500 years old.

We are in agreement then, right?

Not sure about TMRCA estimates anyhow (as always). Anyhow, if some R1a is post-Vedic, what migrations could it involve? Afghans? Saka? Scythians (Saka) could make for a very good Europe-India link, especially for those so recent near-exact matches.

It's a migration that is often overlooked (in comparison with Indo-Aryans) but it may have made a significant (cumulative) impact too.

VM Weber said...

I'll leave it to Ibra to comment on the validity of TMRCA calculation.

Maju said...

In any case, thanks for the link to that TMRCA site. It's a good read and a new toy to play with. :)

Ibra said...

Good points Maju, an expansion of an R1a1+L population into an H + R2 population is also well within the results of most of the data. However, how would you account for the negative correlation between H and R2 in the Dravidian castes? R2 is either really dominant or really passive to other markers. Since R2 contrasts with H, and R2 is implicated in 4/6 correlations, I tend to think of R2 as intrusive in nature rather than passive.

“All other clades (and these two clades when taken together too) seem to have a negative correlation between them in all groups.”

What is your interpretation of negative correlation? Out of 10 potential pairings of each marker from each group, only 2 on average are statistically correlated: 3/10 in the Dravidian castes, 1/10 in the IE castes and 2/10 in the combined castes.
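
The counts above can be checked with a short sketch: 5 markers give 10 unordered pairs, and the three per-group tallies average to 2. The marker list and the tallies are taken from the post and this comment; the variable names are just for illustration.

```python
from itertools import combinations

markers = ["R1A1", "H", "R2", "J2", "L"]

# Every unordered pair of the 5 markers.
pairs = list(combinations(markers, 2))
print(len(pairs))  # 10

# Statistically correlated pairings per group, as counted above.
significant = {"Dravidian": 3, "IE": 1, "combined": 2}
average = sum(significant.values()) / len(significant)
print(average)  # 2.0
```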

Maju said...

an expansion of an R1a1+L population into an H + R2 population is also well within the results of most of the data

Add J2 to the pack. In fact J2 and L might seem to correlate somewhat better. In any case the three seem to have pushed through Pakistan into (loosely speaking) NW India.

However, how would you account for the negative correlation between H and R2 in the Dravidian castes? R2 is either really dominant or really passive to other markers. Since R2 contrasts with H, and R2 is implicated in 4/6 correlations, I tend to think of R2 as intrusive in nature rather than passive.

Possibly. I really do not know enough about R2, nor have I thought much about it. Most of my thoughts about it are because of its relation within R and P. As the P urheimat should by logic be not far from Central Asia, and apparently there is some P* in that area, as well as further south in Pakistan and India (where exactly?), I imagine that R2, if immigrant, didn't come from too far away, maybe from South Asia itself, or Central Asia at the farthest.

The R2-H dichotomy may represent an old (possibly pre-Neolithic) structure in the subcontinent. As my best hunch (and just that) R2 may have entered South Asia (if it was not formed inside it) in the mid (or late) Paleolithic. H instead was possibly "always" there, since F split apart. R2 and its predecessors anyhow never went too far either: not farther than Central Asia.

What is your interpretation of negative correlation?

That they represent different prehistorical (or protohistorical) populations. Peoples, cultures... have tended to be dominated by males, at least recently. So I guess R2 and H represent two more or less competing cultures, whose distinction is no longer visible.

Instead L and J2 may have "hitchhiked" on the success of R1a somewhat, or were previously very successful on their own, managing to survive inside the new R1a-dominated Indo-Aryan society. That may also be the case of other clades in specific localities: we try to discern patterns but the success of each lineage was at some time surely tied to the fortunes of individuals or local groups. It may therefore change from place to place somewhat, and without a detailed history, we may never be able to know how such "quantum" changes happened at all.

Maju said...

Addendum:

Based on some notes I had forgotten about, I can, very speculatively, imagine a late Upper Paleolithic India in this way:

Y-DNA:
- H clan: dominant in West and South but also in the Middle Ganges
- R2 clan: dominant in Mid-East and SE India especially. Also present in South (along with H).
- L clan: dominant in Rajputana and Pakistan and maybe coastal West India, along H, (though this last is probably a late Neolithic arrival).

All this assuming none of these three groups expanded later, and that may be the case of L, at least for mid-West Indian coasts.

H might have arrived to the Ganges early on following the Narmada route but it's unclear how R2 could have arrived to SE India leaving such low traces in the Ganges area. This may suggest (tentatively) an H expansion over R2 in the Ganges area, after the spread of R2 from its Pakistani/Kashmiri/Central Asian source.

I have also notes on mtDNA clans:
- M6 is strong in Kashmir and mid-East India. It correlates well with R2.
- M2 is strong in SE and Bangladesh, correlating also well with R2, at least in the first case.
- U subclades (U2a, U2l and U7) could have some correlation with H. That's also the case of R5.
- M3 (mostly SE of the upper Ganges) does not correlate well with any Y-DNA clan.

Something that surprised me a little is how H (an F subclade) correlates better with R and U clades in India than R2 (much lower in the branching hierarchy), which goes along best with M clades. Intuitively I would have expected exactly the opposite, but it's true that M, R and U mtDNA lineages probably date from the first colonization indistinctly. R and U have a somewhat more western distribution than M and that is probably all there is to it.

But the presence of R2 so deep into India, behind the "lines" of H at the middle Ganges, is something that requires an explanation. An explanation may be found if Indian archaeology is eventually able to produce some cultural, temporal and geographical structure for the late Paleolithic. Otherwise we will probably remain blind.

Ibra said...

“Add J2 to the pack. In fact J2 and L might seem to correlate somewhat better. In any case the three seem to have pushed through Pakistan into (loosely speaking) NW India.”

Check the correlation coefficients between J2 and L: they are the most uncorrelated pair in the first and last groups, and the second most uncorrelated in the middle group. That says that there is not much of a relationship between J2 and L.


Anyway Maju, these are all interesting hypotheses that you put forth, but I think that the situation will become clear once we extract ancient DNA from Holocene and Neolithic people. Another thing is that cluster sampling is a better technique for studying large populations; see this Russian study, for example:

http://download.ajhg.org/AJHG/pdf/PIIS0002929707000250.pdf?intermediate=true

Maju said...

Hi again, Ibra.

I could not effectively download your rar archive. It was apparently downloaded but then I could not find it; tried again and the same. So I'm all the time working with only Sahoo's maps, that do suggest a quite similar distribution for L and J2 in South Asia (L is trivial outside it), but are maybe less comprehensive data than the one you are using. Sorry about that shortcoming.

Thanks for the link to the Russian study. It's interesting certainly.

I have some doubts about aDNA: I often wonder if the strange results obtained so far might represent not contamination but degradation. For example, I wonder if mtDNA N and N1a, found so frequently in European aDNA is not but a degraded something else downstream of N, that looks like odd upstream haplos precisely because of molecular degradation. But hopefully you are right and in due time we may get better archaeogenetical data that will enlighten us.

Ibra said...

http://www.sendspace.com/file/ji9hex

Also accessible from sendspace.

Maju said...

I actually found yesterday they were there... but in the wrong folder. That's why I could not find them. :)

I have just taken a quick look at the Word file for the moment (very interesting and neatly presented) and my impression is that only the scatterplots that involve R1a actually approach the trend lines, always showing an inverse correlation, except with L. The rest are rather unrelated to those lines, which look somewhat artificial in the middle of very different cases. This includes the R1a vs L graph.

Let's see:
- R1a vs L actually has three groups: (1) and (2) show direct correlation of L and R1a (in both L is larger, and they differ only in the proportion of L: rather low and high), but (3) shows exclusive R1a without any L. So these are two different cases: (1, 2) L accompanied by some R1a and (3) R1a on its own.

- H vs R2. The negative correlation trend camouflages three different groups: (1) H +/- dominant, low/no R, (2) R +/- dominant, low/no H and (3) hybrid: roughly 50% H + 30% R2. So it does look like the two groups have different origins, still very separated geographically, but they did not mind mixing in almost equivalent amounts either (in the contact zone, I assume). H+R2 (high for both) populations do exist. Talking about a negative correlation in this case can hide the equivalent admixture area and the almost exclusiveness of the two main groups (which is more than just negative correlation in a sense).

- L vs. R2: negative correlation certainly. But two cases: (1) negative correlated and (2) no L at all (varied R2).

Overall they don't seem very different from what you can infer from Sahoo's maps anyhow.

On the J2 vs L correlation issue, I am not sure how to read the figures but actually they seem to be rather positive (except among Indo-Aryans):

J2 vs L (Drav.): 0.120 (positive)
J2 vs L (IA): -0.031 (slightly negative)
J2 vs L (all): 0.047 (slightly positive)

Maybe in some populations the correlation is negative but it's not a clear issue overall. In the maps you can see they both have a west to east clinal decrease, though the details are somewhat different: for instance, Gujarat is rather low in L but rather high in J2. Their apparent "flow" and origins also seem different: J2 seems to come from West Asia via Pakistan into West and NW India especially, while L seems to originate near the Hindu Kush (Quetta has the highest peak) and be most important in two different areas that do not fully overlap with the J2 main area: central-north Pakistan and the western coast of India (south of Gujarat and north of Kerala). But both patterns are clearly "western" in distribution (and probably origin) anyhow.

J2 also correlates somewhat positively with R1a (another probably "western" lineage), except among Dravidians.

Overall, it's important to underline that the somewhat positive correlation of the three "western" haplos is not generalized and that it probably is more because of the common NW origin than because they might have arrived/expanded together. Though maybe they did in some cases.

VM Weber said...

I believe J2b probably was predominant among IVC farmers and traders. I would like to see its distribution among merchant communities in north and south India. I don't see how J2b could positively be correlated with R1a1 among Indo-Aryans. J2a may be but not J2b. No mathematics involved here.

Maju said...

@Manjunat: Ibra's data and Sahoo maps take J2 as a whole, so impossible to tell for me.

I'd be interested in an expansion of that idea of different roles for J2a and J2b in India and IVC. So far I can just take your word, which I do gladly but uncritically.

@All:

I've just been checking the excel table and:

1. I see more than just three communities where H and R are both strong: three Dravidian but also other three Indo-Aryan ones: Baniya (Bih), Chitapavan Brahmin (Mah) and Dhangar (Mah). So it's not like R2 and H exclude each other.

2. I've focused my attention a bit on the six Brahmin communities and one interesting fact is that J2 seems relatively high in all of them (c. 15-20%), except the eastern Oriya Brahmin, where it is also present anyhow. R1a is also high in all of them except one (Chitapavan Brahmin, dominated by H-R2) and L is variedly found: from absent (2 cases) to co-dominance with R1a (2 cases too). Strangely enough this co-dominance happens in eastern states (Orissa and Andhra), where L is overall rather low.

No strong conclusions from this anyhow, just food for thought. But certainly it suggests to me that J2 (whichever subclade) has been involved in Brahmin castes along with R1a since Vedic times (maybe because they were already important among IVC elites?).

Ibra said...

“On the J2 vs L correlation issue, I am not sure how to read the figures but actually they seem to be rather positive (except among Indo-Aryans):

J2 vs L (Drav.): 0.120 (positive)
J2 vs L (IA): -0.031 (slightly negative)
J2 vs L (all): 0.047 (slightly positive)”

J2 and L do not correlate. In order for 2 things to correlate the p value has to be <0.1, otherwise the relation is not statistically greater than chance. The smaller the p value the stronger the result. The p value is listed under the correlation coefficient.
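
To make the threshold concrete, here is a rough sketch that turns a correlation coefficient into an approximate two-sided p value using the Fisher z-transformation. This is a normal approximation and not necessarily the test used in the analysis; the sample size n = 31 is an assumption based on the 31 population clusters mentioned in the post, and the function name `approx_p_value` is hypothetical.

```python
from math import atanh, erfc, sqrt

def approx_p_value(r, n):
    """Approximate two-sided p value for a Pearson r from n points.

    Uses the Fisher z-transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal score under the null of no correlation.
    """
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))

# r = 0.047 is the J2 vs L (all) coefficient quoted above;
# n = 31 is ASSUMED, not taken from the analysis file.
p = approx_p_value(0.047, 31)
print(round(p, 2), p < 0.1)
```

Under these assumptions the p value comes out far above 0.1, which agrees with reading the J2 vs L coefficient as indistinguishable from chance.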

Maju said...

Ok, you are the expert statistician.

According to that, the only correlations that are statistically valid are the negative correlations of:

a. R2 with H and L among Dravidians
b. R1a with R2 among IEs
c. R1a with H and R2 overall

There's no positive statistical correlation at all. But that's at subcontinental or macroethnic level only. It can mask regional, microethnic or caste correlations, like that of J2 and R1a among Brahmins.

Still when you look at the maps you see a different picture because J2, L and R1a all seem to stem from the NW and, even if their patterns of distribution are somewhat different, they all three tend to be strong in the NW of the subcontinent and weaker in the South.

VM Weber said...

By Dienekes, J2a negatively correlates with both R1a1 and J2b and R1a1 positively correlates with J2b in Greece.

Maju said...

J2a1 and the other is J2(xJ2a1). Probably unimportant anyhow.

But he doesn't give P values for the correlations and my impression from the data is that R1a is mainly northern and J2a1 especially Cretan. There are a couple of exceptions to this rule though.