Joint modeling of rare variant genetic effects using deep learning and data-driven burden scores
Brian Clarke, Eva Holtkamp, Hakime Öztürk, Felix Brechtmann, Florian Hölzlwimmer, Julien Gagneur, Oliver Stegle
German Cancer Research Center (DKFZ)
Abstract
Emerging population-scale genomic resources provide novel opportunities to survey the effect of rare variants on phenotypes. Two major challenges in rare variant association studies (RVASs) are (i) the multiple testing problem caused by the large number of individual rare variants, and (ii) the sparsity of very rare variants. Common RVAS methods rely on strong assumptions about which variants exhibit phenotypic effects, which limits their efficacy. For example, burden tests require ad hoc variant filtering schemes to estimate the total load of rare variants within a genomic region, while variance-component tests (e.g., Lee et al., 2012; Monti et al., 2022) require hand-designed variant weighting schemes to estimate local genetic similarity.
Here, we propose DeepRVAT (Deep Rare Variant Association Testing), a data-driven framework that uses deep neural networks to learn a nonlinear rare variant aggregation function. Specifically, we build on set neural networks to flexibly model variant effects and interactions based on functional variant annotations. Our method treats each gene as the set of its variants, where each variant is represented as a vector of functional annotation scores. First, each variant annotation vector is projected into a latent space by a fully connected neural network $\varphi$, which learns nonlinear latent features that flexibly capture variant characteristics. Next, the latent variant representations are aggregated using a permutation-invariant function to yield a functional gene embedding. Finally, a second fully connected network $\rho$ transforms the functional gene embedding into a scalar gene impairment score, which is subsequently used in rare variant association testing.
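The three steps above ($\varphi$, permutation-invariant aggregation, $\rho$) can be illustrated with a minimal NumPy sketch. This is not the DeepRVAT implementation: the layer sizes, the ReLU activations, the random untrained weights, the use of a sum as the permutation-invariant function, and all function names (`init_mlp`, `mlp`, `gene_impairment_score`) are assumptions made purely for illustration.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a fully connected network (illustrative, untrained)."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Apply a fully connected network with ReLU on all hidden layers."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def gene_impairment_score(phi, rho, annotations):
    """Score one gene in one individual.

    annotations: array of shape (n_variants, n_annotations), one row per
    rare variant carried in the gene.
    """
    latent = mlp(phi, annotations)        # phi: per-variant latent features
    gene_embedding = latent.sum(axis=0)   # aggregation (assumed here: sum)
    return float(mlp(rho, gene_embedding)[0])  # rho: scalar score

# Hypothetical dimensions and random inputs, chosen only for the demo.
rng = np.random.default_rng(0)
n_annotations, latent_dim = 8, 4
phi = init_mlp(rng, [n_annotations, 16, latent_dim])
rho = init_mlp(rng, [latent_dim, 16, 1])

variants = rng.normal(size=(5, n_annotations))
score = gene_impairment_score(phi, rho, variants)
shuffled_score = gene_impairment_score(phi, rho, variants[rng.permutation(5)])
```

Because the aggregation is a sum over variants, `score` and `shuffled_score` agree up to floating-point error, which is the permutation invariance the set-based formulation relies on. A gene with no rare variants (an array with zero rows) yields an all-zero embedding under this choice of aggregation.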
Compared to existing methods (e.g., Sun et al., 2013; He et al., 2017; Susak et al., 2021), DeepRVAT offers the following advantages:
• Learning of variant effects without strong filtering or specifying a kernel
• Modeling of nonlinear and epistatic effects
• Ability to efficiently incorporate dozens of multi-model variant annotations
• Learning of reusable trait-specific burden scores
• Ability to utilize GPUs and minibatching for fast computation at scale
First, we benchmark DeepRVAT against alternative state-of-the-art methods using simulated data. We find substantial power benefits, in particular in regions where the proportion of causal variants is small (data not shown). Next, we apply DeepRVAT to multiple phenotypes on 167,000 whole-exome-sequenced samples from UK Biobank. DeepRVAT yields a substantially larger number of discoveries (e.g., 29 vs. 15 genes associated with human height; FDR < 0.05), while maintaining statistical calibration. Finally, we validate our results by multiple methods:
• Using a subset of samples from the UK Biobank study
• Through pathway enrichment analysis
• By comparison to other RVASs on sample sizes up to three times larger
Collectively, our results demonstrate that DeepRVAT represents a robust approach to extract the most informative rare variant burden effects from whole-genome or whole-exome sequencing data. Our framework also lends itself to future generalizations, including other methods of computing rare variant burden statistics and incorporation into nonlinear phenotype prediction models.
References
Backman, J. D. et al. (2021). “Exome sequencing and analysis of 454,787 UK Biobank participants”. In: Nature 599.7886, pp. 628–634.
He, Z., B. Xu, S. Lee, and I. Ionita-Laza (2017). “Unified sequence-based association tests allowing for multiple functional annotations and meta-analysis of noncoding variation in metabochip data”. In: The American Journal of Human Genetics 101.3, pp. 340–352.
Jurgens, S. J. et al. (2020). “Rare genetic variation underlying human diseases and traits: results from 200,000 individuals in the UK Biobank”. In: bioRxiv.
Kuleshov, M. V. et al. (2016). “Enrichr: a comprehensive gene set enrichment analysis web server 2016 update”. In: Nucleic Acids Research 44.W1, W90–W97.
Lee, S. et al. (2012). “Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies”. In: The American Journal of Human Genetics 91.2, pp. 224–237.
Monti, R. et al. (2022). “Identifying interpretable gene-biomarker associations with functionally informed kernel-based tests in 190,000 exomes”. In: bioRxiv.
Sun, J., Y. Zheng, and L. Hsu (2013). “A unified mixed-effects model for rare-variant association in sequencing studies”. In: Genetic Epidemiology 37.4, pp. 334–344.
Susak, H. et al. (2021). “Efficient and flexible Integration of variant characteristics in rare variant association studies using integrated nested Laplace approximation”. In: PLoS Computational Biology 17.2, e1007784.