Human genetic diversity: Lewontin’s fallacy
In popular articles that play down the genetical differences among human populations, it is often stated that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been proposed that the division of Homo sapiens into these groups is not justified by the genetic data. This conclusion, due to R.C. Lewontin in 1972, is unwarranted because the argument ignores the fact that most of the information that distinguishes populations is hidden in the correlation structure of the data and not simply in the variation of the individual factors. The underlying logic, which was discussed in the early years of the last century, is here discussed using a simple genetical example.
“When a large number of individuals [of any kind of organism] are measured in respect of physical dimensions, weight, colour, density, etc., it is possible to describe with some accuracy the population of which our experience may be regarded as a sample. By this means it may be possible to distinguish it from other populations differing in their genetic origin, or in environmental circumstances. Thus local races may be very different as populations, although individuals may overlap in all characters; …” R.A. Fisher (1925).
“It is clear that our perception of relatively large differences between human races and subgroups, as compared to the variation within these groups, is indeed a biased perception and that, based on randomly chosen genetic differences, human races and populations are remarkably similar to each other, with the largest part by far of human variation being accounted for by the differences between individuals. Human racial classification is of no social value and is positively destructive of social and human relations. Since such racial classification is now seen to be of virtually no genetic or taxonomic significance either, no justification can be offered for its continuance”. R.C. Lewontin (1972).
“The study of genetic variations in Homo sapiens shows that there is more genetic variation within populations than between populations. This means that two random individuals from any one group are almost as different as any two random individuals from the entire world. Although it may be easy to observe distinct external differences between groups of people, it is more difficult to distinguish such groups genetically, since most genetic variation is found within all groups.” Nature (2001).
In popular articles that play down the genetical differences among human populations it is often stated, usually without any reference, that about 85% of the total genetical variation is due to individual differences within populations and only 15% to differences between populations or ethnic groups. It has therefore been suggested that the division of Homo sapiens into these groups is not justified by the genetic data. People the world over are much more similar genetically than appearances might suggest.
Thus an article in New Scientist reported that in 1972 Richard Lewontin of Harvard University “found that nearly 85 per cent of humanity’s genetic diversity occurs among individuals within a single population.” “In other words, two individuals are different because they are individuals, not because they belong to different races.” In 2001, the Human Genome edition of Nature came with a compact disc containing a similar statement, quoted above.
Such statements seem all to trace back to a 1972 paper by Lewontin in the annual review Evolutionary Biology. Lewontin analysed data from 17 polymorphic loci, including the major blood-groups, and 7 ‘races’ (Caucasian, African, Mongoloid, S. Asian Aborigines, Amerinds, Oceanians, Australian Aborigines). The gene frequencies were given for the 7 races but not for the individual populations comprising them, although the final analysis did quote the within-population variability.
“The results are quite remarkable. The mean proportion of the total species diversity that is contained within populations is 85.4%…. Less than 15% of all human genetic diversity is accounted for by differences between human groups! Moreover, the difference between populations within a race accounts for an additional 8.3%, so that only 6.3% is accounted for by racial classification.”
Lewontin concluded “Since … racial classification is now seen to be of virtually no genetic or taxonomic significance …, no justification can be offered for its continuance” (full quotation given above).
Lewontin included similar remarks in his 1974 book The Genetic Basis of Evolutionary Change: “The taxonomic division of the human species into races places a completely disproportionate emphasis on a very small fraction of the total of human diversity. That scientists as well as nonscientists nevertheless continue to emphasize these genetically minor differences and find new ‘scientific’ justifications for doing so is an indication of the power of socioeconomically based ideology over the supposed objectivity of knowledge.”
These conclusions are based on the old statistical fallacy of analysing data on the assumption that it contains no information beyond that revealed on a locus-by-locus analysis, and then drawing conclusions solely on the results of such an analysis. The ‘taxonomic significance’ of genetic data in fact often arises from correlations amongst the different loci, for it is these that may contain the information which enables a stable classification to be uncovered.
Cavalli-Sforza and Piazza coined the word ‘treeness’ to describe the extent to which a tree-like structure was hidden amongst the correlations in gene-frequency data. Lewontin’s superficial analysis ignores this aspect of the structure of the data and leads inevitably to the conclusion that the data do not possess such structure. The argument is circular. A contrasting analysis to Lewontin’s, using very similar data, was presented by Cavalli-Sforza and Edwards at the 1963 International Congress of Genetics. Making no prior assumptions about the form of the tree, they derived a convincing evolutionary tree for the 15 populations that they studied. Lewontin, though he participated in the Congress, did not refer to this analysis.
The statistical problem has been understood at least since the discussions surrounding Pearson’s ‘coefficient of racial likeness’ in the 1920s. It is mentioned in all editions of Fisher’s Statistical Methods for Research Workers from 1925 (quoted above). A useful review is that by Gower in a 1972 conference volume The Assessment of Population Affinities in Man. As he pointed out, “…the human mind distinguishes between different groups because there are correlated characters within the postulated groups.”
The original discussions involved anthropometric data, but the fallacy may equally be exposed using modern genetic terminology. Consider two haploid populations each of size n. In population 1 the frequency of a gene, say ‘+’ as opposed to ‘-’, at a single diallelic locus is p and in population 2 it is q, where p + q = 1. (The symmetry is deliberate.) Each population manifests simple binomial variability, and the overall variability is augmented by the difference in the means.
The natural way to analyse this variability is the analysis of variance, from which it will be found that the ratio of the within-population sum of squares to the total sum of squares is simply 4pq. Taking p = 0.3 and q = 0.7, this ratio is 0.84; 84% of the variability is within-population, corresponding closely to Lewontin’s figure. The probability of misclassifying an individual based on his gene is p, in this case 0.3. The genes at a single locus are hardly informative about the population to which their bearer belongs.
[Translator’s note: for either population the sum of squares is npq, so the within-population sum of squares is 2npq; the total sum of squares is ¼ · 2n(p + q)² = n/2; and 2npq/(n/2) = 4pq.]
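The ratio 4pq can be checked numerically. The following is a minimal sketch, assuming each gene is scored 1 for ‘+’ and 0 for ‘-’ and that expected counts are used in place of a random sample; the function name and the default population size are illustrative choices, not part of the original argument.

```python
def within_to_total_ratio(p: float, n: int = 1000) -> float:
    """Within-population SS divided by total SS for two haploid populations.

    Population 1 has '+' frequency p, population 2 has q = 1 - p, and both
    have size n. Genes are scored 1 ('+') or 0 ('-'); expected counts are used.
    """
    q = 1.0 - p
    # Binomial variability inside each population: n*p*q in each of the two.
    within = n * p * q + n * q * p
    # Overall frequency of '+' across all 2n individuals.
    grand_mean = (p + q) / 2
    n_plus = n * (p + q)          # expected number of '+' genes overall
    n_minus = 2 * n - n_plus      # expected number of '-' genes overall
    total = n_plus * (1 - grand_mean) ** 2 + n_minus * (0 - grand_mean) ** 2
    return within / total


print(round(within_to_total_ratio(0.3), 4))
```

The result does not depend on n, since n cancels between the two sums of squares.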
Now suppose there are k similar loci, all with gene frequency p in population 1 and q in population 2. The ratio of the within-to-total variability is still 84% at each locus. The total number of ‘+’ genes in an individual will be binomial with mean kp in population 1 and kq in population 2, with variance kpq in both cases. Continuing with the former gene frequencies and taking k = 100 loci (say), the mean numbers are 30 and 70 respectively, with variances 21 and thus standard deviations of 4.58. With a difference between the means of 40 and a common standard deviation of less than 4.6, there is virtually no overlap between the distributions, and the probability of misclassification is infinitesimal, simply on the basis of counting the number of ‘+’ genes. Fig. 1 shows how the probability falls off for up to 20 loci.
Figure 1. Graph showing how the probability of misclassification falls off as the number of gene loci increases, for the first example given in the text. The proportion of the variability within groups remains at 84% as in Lewontin’s data, but the probability of misclassification rapidly becomes negligible.
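The fall-off in the probability of misclassification can be sketched with an exact binomial calculation. The classification rule used here is an assumption, since the text does not spell one out: an individual is assigned to population 2 when the count of ‘+’ genes exceeds k/2, and ties are split evenly. The curve in Fig. 1 may have been drawn under a slightly different convention.

```python
from math import comb


def misclassification_probability(k: int, p: float = 0.3) -> float:
    """Chance of assigning an individual from population 1 to population 2.

    Assumed rule (not stated in the text): count the '+' genes over k loci and
    assign to population 2 when the count exceeds k/2, splitting ties evenly.
    By symmetry the error rate for an individual from population 2 is the same.
    """
    q = 1.0 - p
    prob = 0.0
    for x in range(k + 1):
        pmf = comb(k, x) * p**x * q ** (k - x)  # P(count = x) in population 1
        if 2 * x > k:
            prob += pmf
        elif 2 * x == k:
            prob += 0.5 * pmf
    return prob


for k in (1, 5, 10, 20, 100):
    print(f"k = {k:3d}: {misclassification_probability(k):.6f}")
```

With a single locus the function returns p, matching the single-locus error rate of 0.3 given earlier.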
One way of looking at this result is to appreciate that the total number of ‘+’ genes is like the first principal component in a principal component analysis (Box 1). For this component the between-population sum of squares is very much greater than the within-population sum of squares. For the other components the reverse will hold, so that overall the between-population sum of squares is only a small proportion (in this example 16%) of the total. But this must not beguile one into thinking that the two populations are not separable, which they clearly are.
[Editor’s note: Box 1, an introduction to principal components analysis (PCA), is not reproduced here.]
Each additional locus contributes equally to the within-population and between-population sums of squares, whose proportions therefore remain unchanged but, at the same time, it contributes information about classification which is cumulative over loci because their gene frequencies are correlated.
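The contrast between the locus-by-locus proportion and the behaviour of the total count of ‘+’ genes can be made concrete. This is a rough sketch under the symmetric assumptions of the example (two equal-sized populations, k loci with frequencies p and q = 1 − p); the function names are illustrative.

```python
def within_fraction_single_locus(p: float) -> float:
    """Within-population share of the variability at one locus (4pq)."""
    return 4 * p * (1 - p)


def within_fraction_total_score(k: int, p: float) -> float:
    """Within-population share of the variability of the total '+' count."""
    q = 1 - p
    within_var = k * p * q                 # binomial variance in each population
    # The population means kp and kq lie k(q - p)/2 either side of the grand mean.
    between_var = (k * (q - p) / 2) ** 2
    return within_var / (within_var + between_var)


for k in (1, 10, 100):
    print(k, within_fraction_single_locus(0.3), within_fraction_total_score(k, 0.3))
```

For k = 100 and p = 0.3 the within-population share of the total count is 21/421, roughly 5%, while the share at each individual locus stays at 84%.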
It might be supposed, though it would be wrong, that this example is prejudiced by the assumptions that membership of the two populations is known in advance and that, at each locus, it is the same population that has the higher frequency of the ‘+’ gene. In fact the only advantage of the latter simplifying assumption was that it made it obvious that the total number of ‘+’ genes is the best discriminant between the two populations.
To dispel these concerns, consider the same example but with ‘+’ and ‘-’ interchanged at each locus with probability 1⁄2, and suppose that there is no prior information as to which population each individual belongs. Clearly, the total number of ‘+’ genes an individual contains is no longer a discriminant, for the expected number is now the same in each group. A cluster analysis will be necessary in order to uncover the groups, and a convenient criterion is again based on the analysis of variance as in the method introduced by Edwards and Cavalli-Sforza. Here the preferred division into two clusters maximises the between-clusters sum of squares or, what is the same thing, minimises the sum of the within-clusters sums of squares.
As pointed out by these authors, it is extremely easy to compute these sums for binary data, for all the information is contained in the half-matrix of pairwise distances between the individuals, and at each locus this distance is simply 0 for a match and 1 for a mismatch of the genes. Since interchanging ‘+’ and ‘-’ makes no difference to the numbers of matches and mismatches, it is clear that the random changes introduced above are irrelevant.
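Both points in this paragraph can be checked on a small artificial sample: relabelling ‘+’ and ‘-’ at random loci leaves the half-matrix of pairwise distances unchanged, and a within-cluster sum of squares can be recovered from those distances alone. The sketch below assumes genes are coded 1 for ‘+’ and 0 for ‘-’; the sample size, number of loci and random seed are arbitrary illustrative choices.

```python
import random


def mismatches(a, b):
    """Distance between two individuals: number of loci at which genes differ."""
    return sum(x != y for x, y in zip(a, b))


def half_matrix(individuals):
    """Lower-triangular matrix of pairwise distances."""
    return [[mismatches(individuals[i], individuals[j]) for j in range(i)]
            for i in range(len(individuals))]


def within_ss_direct(cluster):
    """Sum over loci of squared deviations from the cluster mean at each locus."""
    n = len(cluster)
    total = 0.0
    for locus in zip(*cluster):
        mean = sum(locus) / n
        total += sum((g - mean) ** 2 for g in locus)
    return total


def within_ss_from_distances(cluster):
    """The same quantity from the half-matrix: sum of pairwise distances / n."""
    return sum(sum(row) for row in half_matrix(cluster)) / len(cluster)


rng = random.Random(0)  # fixed seed, arbitrary choice
k = 50
sample = [[int(rng.random() < 0.3) for _ in range(k)] for _ in range(6)]
flips = [rng.random() < 0.5 for _ in range(k)]  # loci where '+' and '-' swap
relabelled = [[1 - g if f else g for g, f in zip(ind, flips)] for ind in sample]

print(half_matrix(sample) == half_matrix(relabelled))
print(abs(within_ss_direct(sample) - within_ss_from_distances(sample)) < 1e-9)
```

The second check relies on the fact that, for 0/1 data, the squared difference between two genes equals 1 for a mismatch and 0 for a match.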
Continuing the symmetrical example, the probability of a match is p² + q² if the two individuals are from the same population and 2pq if they are from different populations. With k loci, therefore, the number of matches between two individuals from the same population will be binomial with mean k(p² + q²) and variance k(p² + q²)(1 − p² − q²), and if they are from different populations binomial with mean 2kpq and variance 2kpq(1 − 2pq). The distance, being the number of mismatches, is k minus the number of matches and so has the same variance. These variances are, of course, the same, because p² + q² = 1 − 2pq.
[Translator’s note: since p + q = 1, p² + q² = 1 − 2pq, so both variances equal 2kpq(1 − 2pq).]
Taking p = 0.3, q = 0.7 and k = 100 as before, the means are 58 and 42 respectively, a difference of 16; the variances are both 24.36 and the standard deviations both 4.936. The means are thus more than 3 standard deviations apart (about 3.24). The entries of the half-matrix of pairwise distances will therefore divide into two groups with very little overlap, and it will be possible to identify the two clusters with a risk of misclassification which tends to zero as the number of loci increases.
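The figures in this paragraph follow directly from the binomial formulas. A minimal sketch, with an illustrative function name, that reproduces them for any p and k:

```python
from math import sqrt


def match_statistics(k: int, p: float):
    """Mean match counts, common variance, s.d. and separation in s.d. units."""
    q = 1 - p
    same = p * p + q * q        # P(match) for two individuals, same population
    different = 2 * p * q       # P(match) for individuals from different populations
    mean_same = k * same
    mean_different = k * different
    variance = k * same * (1 - same)  # equals k * different * (1 - different)
    sd = sqrt(variance)
    separation = (mean_same - mean_different) / sd
    return mean_same, mean_different, variance, sd, separation


mean_same, mean_different, variance, sd, separation = match_statistics(100, 0.3)
print(round(mean_same, 2), round(mean_different, 2),
      round(variance, 2), round(sd, 3), round(separation, 2))
```

Because the separation grows in proportion to the square root of k, increasing the number of loci pushes the two groups of pairwise distances further apart.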
By analogy with the above example, it is likely that a count of the four DNA base frequencies in homologous tracts of a genome would prove quite a powerful statistical discriminant for classifying people into population groups.
There is nothing wrong with Lewontin’s statistical analysis of variation, only with the belief that it is relevant to classification. It is not true that “racial classification is … of virtually no genetic or taxonomic significance”. It is not true, as Nature claimed, that “two random individuals from any one group are almost as different as any two random individuals from the entire world”, and it is not true, as the New Scientist claimed, that “two individuals are different because they are individuals, not because they belong to different races” and that “you can’t predict someone’s race by their genes”. Such statements might only be true if all the characters studied were independent, which they are not.
Lewontin used his analysis of variation to mount an unjustified assault on classification, which he deplored for social reasons. It was he who wrote “Indeed the whole history of the problem of genetic variation is a vivid illustration of the role that deeply embedded ideological assumptions play in determining scientific ‘truth’ and the direction of scientific inquiry”.
In a 1970 article Race and intelligence he had earlier written “I shall try, in this article, to display Professor Jensen’s argument, to show how the structure of his argument is designed to make his point and to reveal what appear to be deeply embedded assumptions derived from a particular world view, leading him to erroneous conclusions.”
A proper analysis of human data reveals a substantial amount of information about genetic differences. What use, if any, one makes of it is quite another matter. But it is a dangerous mistake to premise the moral equality of human beings on biological similarity because dissimilarity, once revealed, then becomes an argument for moral inequality. One is reminded of Fisher’s remark in Statistical Methods and Scientific Inference “that the best causes tend to attract to their support the worst arguments, which seems to be equally true in the intellectual and in the moral sense.”
This article could, and perhaps should, have been written soon after 1974. Since then many advances have been made in both gene technology and statistical computing that have facilitated the study of population differences from genetic data. The magisterial book of Cavalli-Sforza, Menozzi and Piazza took the human story up to 1994, and since then many studies have amply confirmed the validity of the approach.
Very recent studies have treated individuals in the same way that Cavalli-Sforza and Edwards treated populations in 1963, namely by subjecting their genetic information to a cluster analysis thus revealing genetic affinities that have unsurprising geographic, linguistic and cultural parallels. As the authors of the most extensive of these comment, “it was only in the accumulation of small allele-frequency differences across many loci that population structure was identified.”