EgyptSearch Forums » Egyptology » The brave new era of human genetic testing (2008) » Post A Reply
[QUOTE]Originally posted by xyyman:
[QB]Interesting read. I never saw that post, but I sort of suspected as much already; that is why I paid no mind to these statistical manipulation games, especially when they are based on autosomal SNP FREQUENCY and not autosomal STRs. I am surprised he included TreeMix, because I thought it was based on a totally different processing technique (or algorithm). I assumed TreeMix created some sort of "autosomal phylotree", like uniparental markers, thus tracing migration. All the others use straight-up frequency, which we know is nonsense.

Many times you don't need this statistical gamesmanship to see the genetic history displayed in the ADMIXTURE cluster charts. As we go from K=2 to K=20 and beyond, the separation and genetic history become clear. Thanks for the link, but I was never convinced by the results of these statistical manipulations. Now it seems I was right. I always said: genes mimic geography!!!!!! Sforza was correct! That will never change... until the last 400 years, as we see with the peoples of the New World.

Oh! And yes, I picked up on the Khoi-San being slaves. But the author needed to throw the word "slave" in there at all costs, even when it has no documented "historical" relevance.

---------------------------
Quote:

"Do we always need to remap published aDNA data?

The second step when another ancient human DNA paper is published ([b]the first one is obviously to read it[/b]) is usually to download the genomic data in order to use it for our own projects. Most papers (especially those concerned with Western Eurasian prehistory) try to include all previously published individuals, which is good for various reasons. [b]The data comes in various shapes[/b]: unprocessed raw reads, trimmed/merged reads, [b]bam files of already mapped reads[/b], or even genotype calls. The question is what kind of data to use. A recent discussion on whether to remap everything using our own pipeline inspired this post.
Various forms of f statistics are widely used in the population genetic analysis of aDNA data. [b]Essentially, they measure correlations between allele frequencies in different populations. The usual interpretation of correlated allele frequencies is some form of shared ancestry. xyyman comment: WHICH IS WRONG!!!!![/b] There is a certain [b]risk that such correlations might arise from technical artefacts[/b] in the handling of different data sets and individuals. The results of at least two papers during the last year may have been [b]substantially affected by some kind of technical artefact[/b] (Gallego Llorente et al. and Mondal et al.; see also here and here).

I downloaded the raw reads for 163 ancient individuals from Mathieson et al. and mapped the data to the human reference using bwa aln with two different settings for the -n parameter, which controls the mismatch rate to the reference. I used both the default -n 0.04 and -n 0.01, which allows for more mismatches and helps to use more of the (potentially degraded) sequences in the library. [b]Then I merged both versions of each individual into one[/b] tped file by randomly choosing one read per SNP and bam file (~1.2 million captured SNP sites in total). Then I used popstats to calculate the statistic D(Chimp, hg19_reference; IndividualX_n0.01, IndividualX_n0.04), i.e. I tested whether the hg19 reference genome shares more alleles with either of the two versions of the same individual. The answer should obviously be "no".

[Figure: (left) histogram of the Z scores (zero as a blue vertical line, the empirical mean in red); (right) QQ-plot of expected vs. observed Z scores.]

Across all 163 individuals, the results show an enrichment of positive Z scores (the expectation would be 0). In fact, the entire distribution seems to be shifted towards more positive values. The median is 0.59 and the mean is 0.62, which is significantly different from zero (t-test, p=3e-11).
Seventeen Z scores are larger than 2 and three are even larger than 3, so there is quite some room to wrongly interpret the results of D tests when the samples have been processed differently. This setting is quite artificial, as nobody will ask which individual is more or less "reference", but the human reference sequence itself is closer to some populations than to others, so spurious effects like this could also make a test individual seem more similar to some populations. I know this effect is relatively small, but it is quite a difference whether we claim that population X traces 2% of its ancestry to population Y or whether it is actually 0%.

The results make sense: allowing for fewer mismatches introduces some inherent reference bias, as the processing with more mismatches will accept the non-reference allele more often. The random sampling of one read per SNP position is based on the assumption that reference and alternative reads occur 50/50 at a heterozygous position, which seems to be violated due to the reference bias. The reference bias is likely to affect all sorts of f statistics and related methods (f3, f4, D, ABBA-BABA, qpAdm, [b]TreeMix[/b] etc.), and it might affect others as well. So I am definitely going to [b]make sure that all data is processed as equally as possible[/b], which means processing all samples through the same in silico pipeline, even if that can be quite time-consuming for some big aDNA papers. I am also assuming that other steps in the pipeline (e.g. different diploid genotype callers, base quality rescaling) could have similar effects of [b]causing spurious correlations[/b]."[/QB][/QUOTE]
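The quoted post uses popstats on pseudo-haploid calls; as a schematic illustration only (not the author's actual code, and with a hypothetical input format for the read data), the two core steps — randomly sampling one allele per SNP, then counting ABBA/BABA patterns for a D statistic — might look like this in Python:

```python
import random

def sample_pseudo_haploid(reads_at_snp, rng=None):
    """Pseudo-haploid call: pick one sequenced allele at random per SNP.

    reads_at_snp maps a SNP id to the list of alleles observed in reads
    covering that site (hypothetical input format for illustration).
    """
    rng = rng or random.Random(0)
    return {snp: rng.choice(alleles) for snp, alleles in reads_at_snp.items()}

def d_statistic(p1, p2, p3, p4):
    """D(P1, P2; P3, P4) over biallelic sites coded as 0/1 alleles.

    Counts ABBA sites (P1==P4, P2==P3) against BABA sites (P1==P3, P2==P4);
    a positive D indicates excess allele sharing between P2 and P3 --
    in the post's test D(Chimp, hg19; X_n0.01, X_n0.04), that would mean
    hg19 looks closer to the more reference-biased version of the sample.
    """
    abba = baba = 0
    for a, b, c, d in zip(p1, p2, p3, p4):
        if a == b:           # site not informative for this contrast
            continue
        if a == d and b == c:
            abba += 1
        elif a == c and b == d:
            baba += 1
    total = abba + baba
    return (abba - baba) / total if total else 0.0
```

A real analysis would additionally block-jackknife over the genome to turn D into the Z scores discussed above; this sketch only shows the point estimate.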
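The significance claim near the end (mean Z score of 0.62 across 163 individuals, p=3e-11) is a one-sample t-test of the Z scores against zero. A minimal stdlib-only sketch of that test, not the author's code:

```python
import math
import statistics

def one_sample_t(zscores, mu0=0.0):
    """One-sample t statistic for H0: the mean of zscores equals mu0.

    Returns the t statistic and the degrees of freedom (n - 1).
    """
    n = len(zscores)
    mean = statistics.fmean(zscores)
    sd = statistics.stdev(zscores)        # sample standard deviation
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1
```

With scipy available, scipy.stats.ttest_1samp would give the p-value directly instead of just the t statistic.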
EgyptSearch!
(c) 2015 EgyptSearch.com