Abstract
VDJ rearrangement and somatic hypermutation work together to produce
antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of
antigens. It is now possible to sequence these BCRs in high throughput, and
analysis of these sequences is bringing new insight into how antibodies
develop, in particular for broadly-neutralizing antibodies against HIV and
influenza. A fundamental step in such sequence analysis is to annotate each
base as coming from a specific one of the V, D, or J genes, or from a
non-templated insertion. Previous work has used simple parametric distributions
to model transitions from state to state in a hidden Markov model (HMM) of VDJ
recombination, and assumed that mutations occur via the same process across
sites. However, codon frame and other effects have been observed to violate
these parametric assumptions for such coding sequences, suggesting that a
non-parametric approach to modeling the recombination process could be useful.
Here we find that large modern data sets do indeed support a model that uses
parameter-rich per-allele categorical distributions for HMM transition
probabilities and per-allele-per-position mutation probabilities, and that
using such a model for inference leads to significantly improved results. We
present an accurate and efficient BCR sequence annotation software package
using a novel HMM "factorization" strategy. This package, called partis
(https://github.com/psathyrella/partis/), is built on a new general-purpose HMM
compiler that can perform efficient inference given a simple text description
of an HMM.
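
To make the annotation task concrete, the following is a minimal sketch of Viterbi decoding on a toy HMM that labels each base as V, J, or non-templated insertion (N). It is not the partis implementation and does not use its HMM compiler: the function names (build_toy_hmm, viterbi, annotate), the toy germline sequences, and all probabilities are assumptions made for illustration. The sketch omits the D gene and germline deletions, and it uses a single mutation rate for every site, which is the simpler assumption that the per-allele-per-position model described above replaces.

```python
import math

BASES = "ACGT"


def build_toy_hmm(v_gene, j_gene, mut_rate=0.05, p_insert=0.5, p_extend=0.5):
    """Build a left-to-right toy HMM (hypothetical, for illustration only).

    One state per germline position, plus a single insertion state "N".
    Assumes 0 < mut_rate < 1 and sequences over the alphabet ACGT.
    """
    v_states = [f"V{i}" for i in range(len(v_gene))]
    j_states = [f"J{i}" for i in range(len(j_gene))]
    states = v_states + ["N"] + j_states

    # Insertions emit uniformly; germline states emit their own base
    # unless mutated, with the same rate at every site (a simplification).
    emit = {"N": {b: 0.25 for b in BASES}}
    for name, base in zip(v_states + j_states, v_gene + j_gene):
        emit[name] = {
            b: (1 - mut_rate) if b == base else mut_rate / 3 for b in BASES
        }

    # Germline states advance one position at a time.
    trans = {s: {} for s in states}
    for chain in (v_states, j_states):
        for a, b in zip(chain, chain[1:]):
            trans[a][b] = 1.0
    # After the last V base: start an insertion or go straight to J.
    trans[v_states[-1]] = {"N": p_insert, j_states[0]: 1 - p_insert}
    # Inside the insertion: extend it or move on to J.
    trans["N"] = {"N": p_extend, j_states[0]: 1 - p_extend}
    return states, v_states[0], j_states[-1], trans, emit


def viterbi(seq, states, start, end, trans, emit):
    """Return the most probable state path from `start` to `end` (log space)."""
    neg_inf = float("-inf")
    score = {s: neg_inf for s in states}
    score[start] = math.log(emit[start][seq[0]])
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev, best = None, neg_inf
            for r in states:
                p = trans[r].get(s, 0.0)
                if p > 0.0 and score[r] + math.log(p) > best:
                    best_prev, best = r, score[r] + math.log(p)
            new_score[s] = (
                best + math.log(emit[s][base]) if best_prev else neg_inf
            )
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    if score[end] == neg_inf:
        raise ValueError("no complete path through the toy model")
    # Follow the back-pointers from the final state to the first base.
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def annotate(seq, v_gene, j_gene):
    """Label each base of `seq` as V, N (insertion), or J."""
    states, start, end, trans, emit = build_toy_hmm(v_gene, j_gene)
    path = viterbi(seq, states, start, end, trans, emit)
    return "".join(state[0] for state in path)
```

With the toy germline sequences "ACGT" for V and "TTGA" for J, annotate("ACCTGGTTGA", "ACGT", "TTGA") returns "VVVVNNJJJJ": four V bases (one of them mutated), two inserted bases, and four J bases. In a parameter-rich model of the kind the abstract describes, the fixed numbers in build_toy_hmm would instead be categorical distributions and mutation probabilities estimated from data for each allele and position.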