It is becoming routine to obtain datasets on DNA sequence variation across
several thousands of chromosomes, providing unprecedented opportunity to infer
the underlying biological and demographic forces. Such data make it vital to
study summary statistics which offer enough compression to be tractable, while
preserving a great deal of information. One well-studied summary is the site
frequency spectrum---the empirical distribution, across segregating sites, of
the sample frequency of the derived allele. However, most previous theoretical
work has assumed that each site has experienced at most one mutation event in
its genealogical history, which becomes less tenable for very large sample
sizes. In this work we obtain, in closed form, the predicted frequency spectrum
of a site that has experienced at most two mutation events, under very general
assumptions about the distribution of branch lengths in the underlying
coalescent tree. Among other applications, we obtain the frequency spectrum of
a triallelic site in a model of historically varying population size. We
demonstrate the utility of our formulas in two settings: First, we show that
triallelic sites are more sensitive to the parameters of a population that has
experienced historical growth, suggesting that they will have use if they can
be incorporated into demographic inference. Second, we investigate a recently
proposed alternative mechanism of mutation in which the two derived alleles of
a triallelic site are created simultaneously within a single individual, and we
develop a test to determine whether it is responsible for the excess of
triallelic sites in the human genome.
%0 Generic
%1 jenkins2013general
%A Jenkins, Paul A.
%A Mueller, Jonas W.
%A Song, Yun S.
%D 2013
%K coalescent_theory demographic_inference site_frequency_spectrum triallelic
%T General triallelic frequency spectrum under demographic models with variable population size
%U http://arxiv.org/abs/1310.3444
@misc{jenkins2013general,
abstract = {It is becoming routine to obtain datasets on DNA sequence variation across
several thousands of chromosomes, providing unprecedented opportunity to infer
the underlying biological and demographic forces. Such data make it vital to
study summary statistics which offer enough compression to be tractable, while
preserving a great deal of information. One well-studied summary is the site
frequency spectrum---the empirical distribution, across segregating sites, of
the sample frequency of the derived allele. However, most previous theoretical
work has assumed that each site has experienced at most one mutation event in
its genealogical history, which becomes less tenable for very large sample
sizes. In this work we obtain, in closed form, the predicted frequency spectrum
of a site that has experienced at most two mutation events, under very general
assumptions about the distribution of branch lengths in the underlying
coalescent tree. Among other applications, we obtain the frequency spectrum of
a triallelic site in a model of historically varying population size. We
demonstrate the utility of our formulas in two settings: First, we show that
triallelic sites are more sensitive to the parameters of a population that has
experienced historical growth, suggesting that they will have use if they can
be incorporated into demographic inference. Second, we investigate a recently
proposed alternative mechanism of mutation in which the two derived alleles of
a triallelic site are created simultaneously within a single individual, and we
develop a test to determine whether it is responsible for the excess of
triallelic sites in the human genome.},
added-at = {2013-10-21T20:12:58.000+0200},
author = {Jenkins, Paul A. and Mueller, Jonas W. and Song, Yun S.},
biburl = {https://www.bibsonomy.org/bibtex/299caa567d3befc576e05166d13f415a2/peter.ralph},
interhash = {c4634d8586ea2d972c4788e44ac66288},
intrahash = {99caa567d3befc576e05166d13f415a2},
keywords = {coalescent_theory demographic_inference site_frequency_spectrum triallelic},
timestamp = {2013-10-21T20:12:58.000+0200},
title = {General triallelic frequency spectrum under demographic models with
variable population size},
url = {http://arxiv.org/abs/1310.3444},
year = 2013
}