Inferring Human Phylogenies Using Three CODIS STR Markers (CSF1PO, TPOX and TH01)

Over the past several decades, polymorphic genetic loci have been discussed for their utility in human phylogenetic inference, and Short Tandem Repeat (STR) loci have shown promising results for this purpose. Unfortunately, allele frequency data for polymorphic loci are largely confined to a few populations, so the number of shared loci declines as the number of populations increases. We hypothesize that even a small number of STR loci can be used efficiently for phylogenetic purposes if an appropriate theoretical and statistical strategy is employed. Such a strategy provides a feasible and cost-effective way to choose appropriate STR loci for phylogenetic studies. To test this, an empirical study was conducted using allele frequency data for three STR loci (CSF1PO, TPOX and TH01) across 98 human populations from the literature (references are available at http://dnaa.bravehost.com/index.html and http://www.cstl.nist.gov/strbase/population/Omnipop). The markers were chosen for their polymorphism, high heterozygosity, low mutation rate, low frequency of artifacts and independence between the loci. Three measures of genetic distance between populations were used: Cavalli-Sforza's chord distance (D_C), Nei's genetic distance (D_A) and Nei's standard genetic distance (D_ST). The coefficient of variation (CV) of each distance measure was calculated across one hundred (100) datasets obtained by re-sampling the original dataset. CV was in the order D_ST > D_A > D_C. Therefore, a consensus tree based on D_C, the measure with the lowest CV, was constructed using the Neighbour Joining (NJ), Unweighted Pair Group Method with Arithmetic mean (UPGMA) and Maximum Likelihood (ML) methods. The NJ and UPGMA methods received more statistical support, that is, higher bootstrap values, than ML (NJ > UPGMA > ML). A validation study was performed using (A) Principal Component Analysis, (B) comparison with trees reported for other molecular markers and (C) STR genotyping of five Pakistani subpopulations. The results strongly supported our hypothesis that the three STR markers CSF1PO, TPOX and TH01 are successful in delineating ethnic, geographic and linguistic differentiation between the populations.


Introduction
Phylogenetic inferences are premised on the inheritance of ancestral characteristics and on the existence of an evolutionary history defined by changes in these characteristics (Li, Pearl, & Doss, 2000). Indeed, many human populations carry distinct genetic markers, and by following these markers through the generations their origin can be traced (Adams, 2008). For many decades, allele frequency data have been used to reconstruct the evolutionary histories of human populations (Ayub et al., 2003; Agrawal & Khan, 2005). A number of statistical problems related to the number of loci, the sample size of the populations, the degree of locus polymorphism, distance methods, methods of reconstructing phylogenetic trees and methods of ensuring the reliability of the trees have been addressed in the literature (Zhivotovsky & Feldman, 1995; Takezaki & Nei, 1996; Nei & Takezaki, 1996; Felsenstein, 2003; Holder & Lewis, 2003; Takezaki & Nei, 2008). However, polymorphic loci for which allele frequency data are available are largely confined to European, North American and East Asian populations (Nei & Roychoudhury, 1993). For this reason, the number of shared loci declines as the number of populations increases; if one wants to use a large number of loci, the number of populations that can be used becomes very small. Moreover, missing elements in the locus × population matrix often introduce unreasonable branching patterns in phylogenetic trees. In the present study we hypothesize that a minimal number of markers can perform efficiently for phylogenetic inference if the theoretical and statistical strategy applied is correct. For this purpose three microsatellite loci, also called Short Tandem Repeats (STRs), were chosen from the Combined DNA Index System (CODIS) of the American Federal Bureau of Investigation (Budowle, Moretti, Niezgoda, & Brown, 1998; Budowle, Moretti, Baumstark, Defenbaugh, & Keys, 1999; Butler, 2006). STRs are regions of tandemly repeated DNA segments found throughout the human genome that show length polymorphism with a core repeated DNA sequence (Butler & Hill, 2012). These markers are used for human identification in forensic casework.
In the present work, an empirical study was conducted using STR allele frequency data of 98 human populations (references available at http://dnaa.bravehost.com/index.html and http://www.cstl.nist.gov/strbase/population/Omnipop). To validate the phylogenetic inferences, five Pakistani subpopulations were genotyped for the three STR loci and their allele frequency data were incorporated into the world population data. The statistical analyses performed on the datasets of the empirical and validation studies are explained in the section 'Theory and Calculations'.

Choosing the STR Loci
Three STR loci (CSF1PO, TPOX and TH01) were chosen from the Combined DNA Index System (CODIS) of the American Federal Bureau of Investigation (FBI). The loci lie at different chromosomal locations, which minimizes the chance of linkage disequilibrium between them (Table 1). Moreover, the variance of the number of heterozygous loci was found to be within the 95% confidence interval, which indicates no detectable association between the loci under study. Two of the loci, TPOX and TH01, showed the lowest mutation rates among all the CODIS STR loci, making them suitable for phylogenetic purposes (Keim et al., 2004). The frequency of biological "artifacts" associated with these loci, such as null alleles, stutter products and non-template nucleotide addition, is low.

World Population Data
Allele frequency data of 62 world populations from the literature (http://dnaa.bravehost.com/index.html) and 36 world populations from the Omnipop Excel file (http://www.cstl.nist.gov/strbase/population/Omnipop) were available to reconstruct the phylogenetic trees. These populations encompass the major geographical areas of the world and show different ethnic and linguistic affiliations (Table 2 and Table 3). All three loci were reported to be in Hardy-Weinberg equilibrium across all the populations under study.

Genotyping of the Three STR Loci (CSF1PO, TPOX and TH01) across Five Pakistani Subpopulations
One hundred and seventy-five unrelated individuals (2n = 350 chromosomes) were chosen from five Pakistani subpopulations residing in Karachi. Individuals within each subpopulation were selected through randomization. The subpopulations, with sample sizes given as numbers of chromosomes, were Baloch (n = 64), Muhajir (Urdu-speaking Indian immigrants) (n = 94), Pathan (n = 60), Punjabi (n = 74) and Sindhi (n = 58). Each individual was genotyped for the three STR loci after informed consent had been obtained. All three loci were co-amplified in a single PCR reaction using CTT (CSF1PO, TPOX and TH01) primers in a 2400 thermal cycler. The protocols provided in the Promega Geneprint STR System Technical Manual (tm#004) were followed. Allele frequencies were estimated using the maximum likelihood method (Li, 1976; Hedrick, 2011).
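For a codominant locus at which every genotype is directly observable, the maximum likelihood estimate of an allele frequency reduces to simple gene counting. The following is a minimal sketch under that assumption; the function name `allele_frequencies` and the genotypes shown are hypothetical illustrations, not the study's data or software.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Gene-counting estimate of allele frequencies at one codominant locus.

    genotypes: list of (allele_1, allele_2) tuples, one per diploid individual.
    Returns a dict {allele: frequency}; the frequencies sum to 1.
    """
    # Count every allele copy across all individuals.
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = 2 * len(genotypes)  # 2n chromosomes for n diploid individuals
    return {allele: count / total for allele, count in sorted(counts.items())}


# Hypothetical TH01 genotypes for four individuals (illustrative only).
sample = [(6, 9), (9, 9), (7, 9.3), (6, 7)]
freqs = allele_frequencies(sample)
```

Each individual contributes two allele copies, so the denominator is the number of chromosomes (2n) rather than the number of individuals.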

Theory and Calculations
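STR Polymorphism
STR polymorphism was estimated by the following measures.

(i) Heterozygosity (h). Heterozygosity of each of the three loci was estimated by Nei's unbiased formula

$$h = \frac{2n\left(1 - \sum_i p_i^2\right)}{2n - 1} \tag{1}$$

where $p_i$ is the frequency of the ith allele in the sample, and n is the number of diploid individuals examined at the locus. The average heterozygosity (H) of each of the three loci was estimated by

$$H = \frac{1}{s}\sum_{k=1}^{s} h_k \tag{2}$$

where $h_k$ is the heterozygosity of the locus in the kth population and s is the total number of populations under study.

(ii) Average number of alleles per locus ($n_a$). $n_a$ was computed by the formula

$$n_a = \frac{1}{s}\sum_{k=1}^{s} N_{a,k} \tag{3}$$

where $N_{a,k}$ is the number of alleles of a locus in the kth population and s is the total number of populations under study.

(iii) Polymorphism Information Content (PIC) was calculated using the formula

$$PIC = 1 - \sum_i p_i^2 - 2\sum_{i<j} p_i^2 p_j^2 \tag{4}$$

where $p_i$ and $p_j$ stand for the frequencies of the ith and jth alleles of a locus (Shete, Tiwari, & Elston, 2000; Kobilinsky, Liotti, & Oeser Sweat, 2005).

(iv) Probability of Identity (PI) and Power of Discrimination (PD). PI is derived by the formula

$$PI = \sum_i x_i^2 + \sum_{i<j} x_{ij}^2 \tag{5}$$

where $x_i$ stands for the frequency of homozygotes and is equal to $p_i^2$, while $x_{ij}$ stands for the frequency of heterozygotes and is equal to $2 p_i p_j$; $p_i$ and $p_j$ stand for the frequencies of the ith and jth alleles of a locus. PD is defined as

$$PD = 1 - \left[\sum_i x_i^2 + \sum_{i<j} x_{ij}^2\right] = 1 - PI \tag{6}$$

Different measures of locus polymorphism across the world populations are shown in Table 4 and Table 5.

Genetic distances between the populations were inferred with three measures. (1) Nei's genetic distance D_A (Nei, Tajima, & Tateno, 1983) is calculated as

$$D_A = 1 - \frac{1}{r}\sum_{j}^{r}\sum_{i}^{m_j}\sqrt{x_{ij}\,y_{ij}} \tag{7}$$

where $x_{ij}$ and $y_{ij}$ are the frequencies of the ith allele at the jth locus in populations X and Y, respectively, $m_j$ is the number of alleles at the jth locus and r is the number of loci. D_A was computed through the statistical program Poptree. (2) D_ST is Nei's standard genetic distance (Nei, 1972), given by

$$D_{ST} = -\ln\left[\frac{J_{XY}}{\sqrt{J_X J_Y}}\right] \tag{8}$$

where $J_X = \sum_j^r \sum_i^{m_j} x_{ij}^2 / r$, $J_Y = \sum_j^r \sum_i^{m_j} y_{ij}^2 / r$ and $J_{XY} = \sum_j^r \sum_i^{m_j} x_{ij} y_{ij} / r$. (3) The chord distance D_C is defined by

$$D_C = \frac{2}{\pi r}\sum_{j}^{r}\sqrt{2\left(1 - \sum_{i}^{m_j}\sqrt{x_{ij}\,y_{ij}}\right)} \tag{9}$$

D_C was computed through Phylip version 3.68.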
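The polymorphism measures described in this section (Nei's unbiased heterozygosity, PIC, PI and PD) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the allele frequencies and the sample size are hypothetical, and the formulas are the standard forms assumed from the symbol definitions given in the text.

```python
from itertools import combinations


def unbiased_heterozygosity(p, n):
    """Nei's unbiased heterozygosity for n diploid individuals."""
    return 2 * n * (1 - sum(q * q for q in p)) / (2 * n - 1)


def pic(p):
    """PIC = 1 - sum_i p_i^2 - 2 * sum_{i<j} p_i^2 p_j^2."""
    homozygosity = sum(q * q for q in p)
    cross = sum(2 * (a * a) * (b * b) for a, b in combinations(p, 2))
    return 1 - homozygosity - cross


def probability_of_identity(p):
    """PI: sum of squared genotype frequencies expected under Hardy-Weinberg."""
    homozygotes = sum((q * q) ** 2 for q in p)                             # x_i = p_i^2
    heterozygotes = sum((2 * a * b) ** 2 for a, b in combinations(p, 2))   # x_ij = 2 p_i p_j
    return homozygotes + heterozygotes


def power_of_discrimination(p):
    """PD = 1 - PI."""
    return 1 - probability_of_identity(p)


# Hypothetical allele frequencies at one locus and a hypothetical sample size.
p = [0.5, 0.3, 0.2]
n = 50
h = unbiased_heterozygosity(p, n)
```

Averaging `h` over populations gives H, and averaging the allele counts per population gives the average number of alleles per locus.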

Coefficient of Variation (CV) of D_A, D_ST and D_C
Nei's distances and Cavalli-Sforza's distance are different estimators of the same quantity under the same model. Therefore a measure of relative variance, the coefficient of variation (CV), was used to compare the distance measures. For each of the three distance measures, CV was calculated across 100 replicates obtained by re-sampling the original dataset. A random sample of fourteen populations (a 14 × 14 population distance matrix) was used for this purpose.
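The comparison of the three distance measures by their CV can be sketched as follows. This is a hedged illustration rather than the study's procedure: the functions `d_a`, `d_st`, `d_c` and `bootstrap_cv` are hypothetical names, the distance formulas are the standard published forms, the re-sampling scheme (drawing loci with replacement) is an assumption because the text does not specify it, and the allele frequencies are invented.

```python
import math
import random
import statistics

# A population is a list of loci; each locus is a list of allele frequencies.
# The two populations being compared must list alleles in the same order.


def d_a(x, y):
    """Nei's D_A distance (Nei, Tajima, & Tateno, 1983)."""
    r = len(x)
    shared = sum(sum(math.sqrt(a * b) for a, b in zip(xj, yj)) for xj, yj in zip(x, y))
    return 1 - shared / r


def d_st(x, y):
    """Nei's standard genetic distance (Nei, 1972)."""
    r = len(x)
    jx = sum(sum(a * a for a in xj) for xj in x) / r
    jy = sum(sum(b * b for b in yj) for yj in y) / r
    jxy = sum(sum(a * b for a, b in zip(xj, yj)) for xj, yj in zip(x, y)) / r
    return -math.log(jxy / math.sqrt(jx * jy))


def d_c(x, y):
    """Cavalli-Sforza and Edwards chord distance."""
    r = len(x)
    total = 0.0
    for xj, yj in zip(x, y):
        # Clamp at 1 so rounding error cannot make the square root argument negative.
        cos_theta = min(1.0, sum(math.sqrt(a * b) for a, b in zip(xj, yj)))
        total += math.sqrt(2 * (1 - cos_theta))
    return 2 / (math.pi * r) * total


def bootstrap_cv(x, y, distance, replicates=100, seed=0):
    """CV (sd / mean) of a distance across replicates that re-sample loci.

    Assumption: loci are drawn with replacement; the mean must be non-zero.
    """
    rng = random.Random(seed)
    r = len(x)
    values = []
    for _ in range(replicates):
        idx = [rng.randrange(r) for _ in range(r)]
        values.append(distance([x[j] for j in idx], [y[j] for j in idx]))
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical allele frequencies for two populations at three biallelic loci.
pop_x = [[0.3, 0.7], [0.5, 0.5], [0.1, 0.9]]
pop_y = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
cv_by_measure = {f.__name__: bootstrap_cv(pop_x, pop_y, f) for f in (d_a, d_st, d_c)}
```

The measure with the smallest CV is the one whose estimates fluctuate least relative to their mean across replicates, which is the criterion used here to choose among the distances.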

Construction of Phylogenetic Trees
Three methods were used to reconstruct phylogenetic trees. (1) Neighbor Joining (NJ) method. The NJ method constructs a tree by successive clustering of lineages, setting branch lengths as the lineages join (Saitou & Nei, 1987). (2) Unweighted Pair Group Method using Arithmetic Mean (UPGMA). UPGMA merges the closest pair of taxa (by distance) and then recomputes the distances to the merged node as the arithmetic mean of the pairwise distances to the leaves it contains. (3) Continuous Character Maximum Likelihood method (CONTML). This is a program in PHYLIP which estimates phylogenies by the restricted maximum likelihood method based on the Brownian motion model. It assumes that each locus evolves independently by pure genetic drift.
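The UPGMA clustering step described above can be illustrated with a short sketch. It is a minimal pure-Python version written for clarity, not the PHYLIP implementation used in the study; the function name `upgma`, the population labels and the distance matrix are hypothetical.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels; dist: symmetric distance matrix (list of lists).
    Returns (tree, heights): the tree as nested tuples and the height of each merge.
    """
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (subtree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), dab = min(d.items(), key=lambda item: item[1])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the mean over all leaf-to-leaf distances.
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        heights.append(dab / 2)  # ultrametric tree: node height is half the distance
        next_id += 1
    (tree, _), = clusters.values()
    return tree, heights


# Hypothetical distance matrix for four populations.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
tree, heights = upgma(labels, matrix)
```

Unlike NJ, UPGMA assumes a constant rate of change along all lineages, which is why every merge can be assigned a single height.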

Consensus Tree
Consensus trees were generated by bootstrapping (100 to 1000 replications) the original data taken from http://dnaa.bravehost.com/index.html and http://www.cstl.nist.gov/strbase/population/Omnipop (Figure 1 and Figure 2). Consensus trees were also constructed for the populations that have either a strong ethnic affiliation (Figure 3) or a strong linguistic affiliation (Figure 4). Allele frequency data of the Pakistani subpopulations were incorporated into the data of the 62 world populations (http://dnaa.bravehost.com/index.html) as well as the data of the 36 world populations from the Omnipop file (http://www.cstl.nist.gov/strbase/population/Omnipop). Two measures of genetic distance were used, namely Nei's genetic distance (Figure 5 and Figure 6) and the Cavalli-Sforza chord distance (Figure 7 and Figure 8). Trees were constructed using the NJ method.
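Bootstrap support for a cluster is the number of replicate trees in which that cluster appears. A minimal sketch of the counting step is given below, assuming rooted trees stored as nested tuples; this is a simplification, since consensus programs for unrooted NJ trees count bipartitions instead. The function names and the replicate trees are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return frozenset().union(*(leaves(child) for child in tree))
    return frozenset([tree])


def clades(tree):
    """Leaf sets of all internal nodes of a rooted nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, tuple):
            found.add(leaves(node))
            for child in node:
                walk(child)

    walk(tree)
    return found


def clade_support(replicate_trees):
    """Count, for every clade, the number of replicate trees that contain it."""
    counts = Counter()
    for tree in replicate_trees:
        counts.update(clades(tree))
    return counts


# Three hypothetical bootstrap replicate trees for four populations.
replicates = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
]
support = clade_support(replicates)
```

With 1000 replicates, a cluster found in more than 950 replicate trees would be reported with a bootstrap value above 950.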

Comparison with other Phylogenetic Trees
The consensus tree was then compared with the trees obtained from other molecular markers, such as Alu insertions and RFLPs (Nei & Roychoudhury, 1993; Nei & Takezaki, 1996).

Principal Component Analysis (PCA)
PCA was performed on the allele frequency data of the 62 world populations (http://dnaa.bravehost.com/index.html) (Figure 9). The allele frequency data of the Pakistani subpopulations were then incorporated into the data of the 62 world populations and PCA was performed again (Figure 10).
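A score plot of PC1 against PC2 can be obtained from a population × allele-frequency table as sketched below. The sketch assumes a covariance-based PCA without scaling, computed by power iteration with deflation so that it needs only the standard library; the text does not state which PCA variant or software was used, and the function name `pca_scores` and the data are hypothetical.

```python
import math
import random


def _matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pca_scores(rows, n_components=2, iterations=500, seed=0):
    """Scores of each row (population) on the leading principal components.

    rows: one list of allele frequencies per population. Components are found
    by power iteration on the covariance matrix; fewer than n_components are
    returned if the data have no remaining variance.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(m)]
    centered = [[row[k] - means[k] for k in range(m)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
           for a in range(m)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = _unit([rng.gauss(0, 1) for _ in range(m)])  # random start vector
        for _ in range(iterations):
            w = _matvec(cov, v)
            if max(abs(x) for x in w) < 1e-12:
                break  # no variance left in this direction
            v = _unit(w)
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        if eigenvalue < 1e-12:
            break
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(m)]
               for a in range(m)]
    return [[sum(r[k] * comp[k] for k in range(m)) for comp in components]
            for r in centered]


# Hypothetical two-variable data with most variance along the first axis.
data = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
scores = pca_scores(data)
```

The sign of each component is arbitrary, so a score plot may appear mirrored between runs or programs without changing which populations group together.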

Results
• The coefficient of variation (CV) was in the order D_ST > D_A > D_C. Therefore, a consensus phylogenetic tree based on D_C was constructed using the NJ, UPGMA and Maximum Likelihood (ML) methods. The NJ method showed the highest bootstrap values. Comparison of the resultant trees showed that the tree topology was consistent with the trees reported for other molecular markers (Nei & Roychoudhury, 1993; Nei & Takezaki, 1996). Geographic, ethnic and linguistic demarcation between the populations was appreciable (Figure 1 through Figure 4).
• Tree topology was consistent with the 'out of Africa' theory of human origin. African populations formed a distinct cluster with a high bootstrap value (>950) and the remaining populations branched off from the African cluster.
• Geographical and ethnic demarcations between the populations were more obvious than linguistic demarcation, i.e. populations that are in close geographical proximity to each other or that have a common ethnic origin tended to form a separate cluster. For example, the Chinese, Tibetan, Bhutanese, Thai, Malay and Japanese populations formed a separate cluster although they have diverse linguistic affiliations. China, Thailand and Malaysia belong to the Austronesian class of languages while Bhutan, Tibet and Nepal belong to the Sino-Tibetan class. All these populations belong to the Mongoloid ethnic group. This showed that the STRs were more successful in delineating ethnic than linguistic partitioning.
• The phylogenetic efficiency of the three STRs for the populations and subpopulations of the Indian subcontinent was remarkable. Central Indian, South Indian and Eastern Indian populations were well differentiated according to their ethnic and linguistic backgrounds. All seven Dravidian-speaking Australoid Golla subpopulations of Andhra Pradesh consistently formed a separate cluster in the phylogenetic trees. Similarly, the Eastern Indian castes Brahmin and Kayasth consistently formed a single cluster, while another Eastern Indian caste, Garo, was closer to European Caucasians than to its neighboring populations. Likewise, Tamil Bohra Muslims did not cluster with their neighboring Tamil Sunni Muslims; instead they were closer to the Mongoloid populations.
• PCA showed a distinct position for all the African populations in the score plot of PC1 and PC2 (Figure 9). Indian populations and subpopulations lay in the left upper quadrant while Mongoloids were in the right lower quadrant. Caucasoids were dispersed in the right half around the median axis.
• The phylogenetic trees (Figure 5 through Figure 8) showed a distinct cluster of the five Pakistani subpopulations with high bootstrap values (≥ 840). They also showed the close affinity of the Pakistani subpopulations to the Caucasoid and Mongoloid populations.
• PCA placed all the Pakistani subpopulations in the left lower quadrant except Muhajir, which was in the left upper quadrant (Figure 10).

Discussion
Evolutionary histories and phylogenetic relationships of many extant human populations have been explored using microsatellite loci (Bowcock et al., 1994; Deka et al., 1995; Gonser, Donnelly, Nicholson, & Rienzo, 2000; Rowold & Herrera, 2003). Rowold and Herrera (2003) and Agrawal and Khan (2005) used five STR loci including CSF1PO, TPOX and TH01 for phylogenetic analyses and concluded that these STR loci are successful in reconstructing recent human evolutionary histories. However, they used only ten (10) and twenty-one (21) population groups respectively, in comparison with the ninety-eight (98) world populations used in the present study. Moreover, the strategy employed in the present study was more comprehensive and each decision-making step was explained logically. The findings were also supported by the validation studies.
Understanding the pattern and rate of mutations is highly relevant to the application of these hypervariable genetic markers in evolutionary studies as well as in gene mapping studies (Goldstein, Linares, Cavalli-Sforza, & Feldman, 1995; Shriver et al., 1995). The utility of a genetic marker for determining phylogenetic relationships within a given population is a function of the mutation rate of the marker and the overall genetic diversity of the examined population (Keim et al., 2004). When population genetic diversity is high, only markers with low mutation rates will yield accurate phylogenetic patterns. TPOX and TH01 showed the lowest mutation rates among all thirteen CODIS STR loci and are hence likely to be suitable for phylogenetic purposes.
The topology of the phylogenetic trees (Figure 1 through Figure 4) was consistent with those obtained from other molecular markers. For example, a phylogenetic tree for 26 human populations based on D_A using 29 polymorphic loci (Nei & Roychoudhury, 1993) showed the same partitioning of human populations as the trees reconstructed in the present study. The trees were also compared with trees based on RFLP data and Alu insertion polymorphism data using the D_A distance measure (Nei & Takezaki, 1996). Tree topology and the partitioning of the populations into ethnic groups were consistent with those obtained from RFLPs and Alu insertions. It should be mentioned that the performance of the D_A distance measure in obtaining the correct tree topology is considered to be the same as that of D_C (Takezaki & Nei, 1996). The major ethnic groups identified in the present study were more or less similar to those recognized by classical anthropologists (Nei & Roychoudhury, 1993). They were four in number, namely Negroid (Africans), Caucasoid (Europeans and their related populations), Mongoloid (East Asians) and Australoid (Andhra Pradesh Golla castes). Tree topology was also supportive of the 'out of Africa' theory, which has gained popularity among geneticists and anthropologists during the last two decades (for example, Nei, 1995; Templeton, 2002; Adams, 2008; Hanihara, 2008; Sun, Mullikin, Patterson, & Reich, 2009).
The stability of a tree topology and the adequacy of the data to validate the topology are assessed by bootstrap values (Berry & Gascuel, 1996). Tree topology showed higher bootstrap values when the methods were applied to datasets of populations that have a lesser degree of admixture and a strong affiliation with a single ethnic or linguistic group (Figure 3 and Figure 4). This indicates that, apart from the number of markers used, certain other factors affect the phylogenetic efficiency of a marker. These factors include STR locus polymorphism, distance measures and methods of reconstructing phylogenetic trees (Nei & Roychoudhury, 1974; Nei, Kumar, & Takahashi, 1998; Goldstein & Pollock, 1994; Tajima & Takezaki, 1994; Takezaki & Nei, 1996; Takezaki & Nei, 2008). Ethnic demarcation showed more statistical support than linguistic demarcation. Another study using 182 autosomal microsatellites could not reveal any phylogenetic relationship between two language isolate populations, namely the Hunza Burusho and the Basques (Ayub et al., 2003). It was argued then that microsatellites are best suited to the study of more recent population separations.
The phylogenetic efficiency of the three STR markers was noteworthy for the populations and subpopulations of the Indian subcontinent. The Eastern Indian castes Brahmin and Kayasth consistently formed a single cluster. These two population groups are considered upper classes of the Hindu caste system, between which intermarriage is not prohibited, while Garo, which was closer to European Caucasians, is a middle class of the Hindu caste system. Likewise, Tamil Bohra Muslims were closer to the Mongoloid populations than to their neighboring Tamil Sunni Muslims. Bohra and Sunni are two religious sects of Muslims between which marriages are generally prohibited. The Dravidian-speaking Golla castes of Andhra Pradesh remained separated from their neighboring subpopulations and branched off as a single cluster. Even within the Golla castes, the western Golla castes (APGolla1 and APGolla5) were closer to each other than to other Golla castes. The results established the efficiency of the three STRs (CSF1PO, TPOX and TH01) in delineating genetic relationships among the subpopulations of the Indian subcontinent. The finding was validated for the subpopulations of Pakistan, which is geographically a part of the Indian subcontinent. All five Pakistani subpopulations, namely Baloch, Muhajir, Pathan, Punjabi and Sindhi, were united in a single cluster with a high bootstrap value, which possibly suggests their common origin. Most of the Indian subcontinent populations are thought to be Caucasoid in origin (Cavalli-Sforza, Menozzi, & Piazza, 1994). Pakistani subpopulations also showed an affiliation with other Caucasoid and Mongoloid populations. According to one hypothesis of population evolution and migration, the Indian subcontinent has been invaded by both Caucasoid and Mongoloid populations (Nei & Roychoudhury, 1993). Mongoloids are also believed to have originated from a later split in the Caucasoid race. Theories of gene flow and varying degrees of admixture between South Asian Indian populations and Mongoloid populations have also been proposed (Bamshad et al., 2003; Watkins, 2003; Shriver et al., 2005). The close affiliation of the Pakistani subpopulations with both groups may be the result of gene admixture between the two populations. Another study, based on the HLA-A, -B, -C and -DRB, -DQB1 loci, has also shown admixture of Pakistani ethnic groups with Caucasoid and Oriental populations (Mohyuddin, 2000).
In April 2011 the American FBI recommended removing a few STR loci from the CODIS list owing to the lower polymorphism observed for them across world populations (Butler & Hill, 2012; Hares, 2012). TPOX was the least polymorphic of all the CODIS STR loci. In the present study, however, TPOX showed heterozygosity values higher than those for CSF1PO (Table 5). The tri-allelic pattern frequently observed for TPOX (Butler, 2005; Lane, 2008; Diaz, Rivas, & Carracedo, 2009) was not observed across the five subpopulations of Pakistan. These results emphasize that STR loci should be investigated extensively for their efficiency as human identification markers across the populations and subpopulations of the Indian subcontinent. The heterogeneity of the extant populations of the Indian subcontinent may warrant a separate standard set of STR loci. It is worth mentioning that the European standard set of STR loci is different from the American core set of STR loci (Butler & Hill, 2012).
It can be concluded that the three STRs successfully exhibited ethnic and linguistic as well as intra-ethnic differentiation across the populations and subpopulations. The results also suggest that a minimal number of markers can be used for reconstructing phylogenetic trees with high bootstrap values, provided the markers are efficient for this purpose and a correct statistical strategy is employed. This study may help to identify STR loci that can be used for forensic as well as phylogenetic purposes.
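In the definition of D_ST, $J_X = \sum_j^r \sum_i^{m_j} x_{ij}^2 / r$ and $J_Y = \sum_j^r \sum_i^{m_j} y_{ij}^2 / r$ are the average homozygosities over the loci for populations X and Y, respectively, and $J_{XY} = \sum_j^r \sum_i^{m_j} x_{ij} y_{ij} / r$. D_ST was computed through Phylip version 3.68. (3) D_C is the chord distance proposed by Cavalli-Sforza and Edwards (1967). D_C is defined by

$$D_C = \frac{2}{\pi r}\sum_{j}^{r}\sqrt{2\left(1 - \sum_{i}^{m_j}\sqrt{x_{ij}\,y_{ij}}\right)}$$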

Table 1. Chromosomal locations and other information of the three STR loci (CSF1PO, TPOX and TH01)
a Refers to the number of the reference provided at the website http://dnaa.bravehost.com/index.html

Table 4. Measures of locus polymorphism for the three STR loci (CSF1PO, TPOX and TH01) averaged over 62 world populations (http://dnaa.bravehost.com/index.html)

Heterozygosities of the three loci across each subpopulation of Pakistan are shown in Table 6.

Table 6. Observed heterozygosities of the three STR loci (CSF1PO, TPOX and TH01) across the five Pakistani subpopulations (Baloch, Muhajir, Pathan, Punjabi and Sindhi)

Three distance measures were used to infer genetic distances between the populations under study. (1) D_A (Nei's genetic distance) is formulated for the Infinite Allele Model (IAM), in which there is a rate of neutral mutation and each mutation gives rise to a distinguishable allele (Nei, Tajima, & Tateno, 1983).