Statistical Applications in Genetics and Molecular Biology

Margaret R Donald, Susan R Wilson
Output from analysis of a high-throughput 'omics' experiment very often is a ranked list. One commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, with a length of many hundreds of genes. There are numerous situations where interest is in the comparison of outputs following, say, two (or more) different experiments, or of different approaches to the analysis that produce different ranked lists. Rather than considering exact agreement between the rankings, following others, we consider two ranked lists to be in agreement if the rankings differ by some fixed distance...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Pei-Fang Su, Yu-Lin Mau, Yan Guo, Chung-I Li, Qi Liu, John D Boice, Yu Shyr
To assess the effect of chemotherapy on mitochondrial genome mutations in cancer survivors and their offspring, a study sequenced the full mitochondrial genome and determined the mitochondrial DNA heteroplasmic (mtDNA) mutation rate. To build a model for counts of heteroplasmic mutations in mothers and their offspring, bivariate Poisson regression was used to examine the relationship between mutation count and clinical information while accounting for the paired correlation. However, if the sequencing depth is not adequate, a limited fraction of the mtDNA will be available for variant calling...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Alexandre Bureau, Jordie Croteau
Polytomous phenotypes arise when a disease has multiple subtypes or when two dichotomous phenotypes are analyzed simultaneously. Few software programs offer the option to analyze such phenotypes in family studies, and none implements conditional polytomous logistic regression for within-family analysis robust to population stratification. We introduce Polyunphased, an extension to polytomous phenotypes of the Unphased package, a flexible software tool for genetic association analysis in nuclear families. Like Unphased, Polyunphased is written in C++ and runs from the command line or from a Java graphical user interface...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Mateen R Shaikh, Joseph Beyene
Microbiomes, populations of microscopic organisms, have been found to be related to human health and it is expected further investigations will lead to novel perspectives of disease. The data used to analyze microbiomes is one of the newest types (the result of high-throughput technology) and the means to analyze these data is still rapidly evolving. One of the distributions that have been introduced into the microbiome literature, the Dirichlet-Multinomial, has received considerable attention. We extend this distribution's use to uncover compositional relationships between organisms at a taxonomic level...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Xu Liu, Bin Gao, Yuehua Cui
Epidemiological studies have suggested the joint effect of simultaneous exposures to multiple environments on disease risk. However, how environmental mixtures as a whole jointly modify genetic effect on disease risk is still largely unknown. Given the importance of gene-environment (G×E) interactions on many complex diseases, rigorously assessing the interaction effect between genes and environmental mixtures as a whole could shed novel insights into the etiology of complex diseases. For this purpose, we propose a generalized partial linear varying multi-index coefficient model (GPLVMICM) to capture the genetic effect on disease risk modulated by multiple environments as a whole...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Venkateshan Kannan, Jesper Tegner
We propose a novel systematic procedure of non-linear data transformation for an adaptive algorithm in the context of network reverse-engineering using information theoretic methods. Our methodology is rooted in elucidating and correcting for the specific biases in the estimation techniques for mutual information (MI) given a finite sample of data. These are, in turn, tied to the lack of well-defined bounds for numerical estimation of MI for continuous probability distributions from finite data. The nature and properties of the inevitable bias are described, complemented by several examples illustrating their form and variation...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Audrey Qiuyan Fu, Lior Pachter
Gene expression is stochastic and displays variation ("noise") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identically regulated gene pairs in single cells. We examine established formulas [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): "Stochastic gene expression in a single cell," Science, 297, 1183-1186.] for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Katherine L Thompson, Catherine R Linnen, Laura Kubatko
A central goal in biological and biomedical sciences is to identify the molecular basis of variation in morphological and behavioral traits. Over the last decade, improvements in sequencing technologies coupled with the active development of association mapping methods have made it possible to link single nucleotide polymorphisms (SNPs) and quantitative traits. However, a major limitation of existing methods is that they are often unable to consider complex, but biologically-realistic, scenarios. Previous work showed that association mapping method performance can be improved by using the evolutionary history within each SNP to estimate the covariance structure among randomly-sampled individuals...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Chung-I Li, Yu Shyr
As RNA-seq rapidly develops and costs continually decrease, the quantity and frequency of samples being sequenced will grow exponentially. With proteomic investigations becoming more multivariate and quantitative, determining a study's optimal sample size is now a vital step in experimental design. Current methods for calculating a study's required sample size are mostly based on the hypothesis testing framework, which assumes each gene count can be modeled through Poisson or negative binomial distributions; however, these methods are limited when it comes to accommodating covariates...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Alexia Kakourou, Werner Vach, Simone Nicolardi, Yuri van der Burgt, Bart Mertens
Mass spectrometry based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple, yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data...
October 1, 2016: Statistical Applications in Genetics and Molecular Biology
Colin S Gillespie, Andrew Golightly
Solving the chemical master equation exactly is typically not possible, so instead we must rely on simulation based methods. Unfortunately, drawing exact realisations results in simulating every reaction that occurs. This will preclude the use of exact simulators for models of any realistic size and so approximate algorithms become important. In this paper we describe a general framework for assessing the accuracy of the linear noise and two moment approximations. By constructing an efficient space filling design over the parameter region of interest, we present a number of useful diagnostic tools that aid modellers in assessing whether the approximation is suitable...
October 1, 2016: Statistical Applications in Genetics and Molecular Biology
Jochen Kruppa, Frank Kramer, Tim Beißbarth, Klaus Jung
As part of the data processing of high-throughput-sequencing experiments, count data are produced representing the number of reads that map to specific genomic regions. Count data also arise in mass spectrometric experiments for the detection of protein-protein interactions. For evaluating new computational methods for the analysis of sequencing count data or spectral count data from proteomics experiments, artificial count data are thus required. Although some methods for the generation of artificial sequencing count data have been proposed, all of them simulate single sequencing runs, thus omitting the correlation structure between the individual genomic features, or they are limited to specific structures...
October 1, 2016: Statistical Applications in Genetics and Molecular Biology
Valentina Pugacheva, Alexander Korotkov, Eugene Korotkov
The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. A multiple alignment of periods was calculated with the help of the direct optimization of the position-weight matrix without using pairwise alignments. The developed algorithm was applied to analyze amino acid sequences of a small number of proteins...
October 1, 2016: Statistical Applications in Genetics and Molecular Biology
Yulan Liang, Arpad Kelemen
Construction of gene-gene interaction networks and potential pathways is a challenging and important problem in genomic research for complex diseases, while estimating the dynamic changes of the temporal correlations and non-stationarity is the key in this process. In this paper, we develop dynamic state space models with hierarchical Bayesian settings to tackle this challenge for inferring the dynamic profiles and genetic networks associated with disease treatments. We treat both the stochastic transition matrix and the observation matrix as time-variant and include temporal correlation structures in the covariance matrix estimations in the multivariate Bayesian state space models...
August 1, 2016: Statistical Applications in Genetics and Molecular Biology
Carsten Wiuf, Jonatan Schaumburg-Müller Pallesen, Leslie Foldager, Jakob Grove
In many areas of science it is customary to perform many, potentially millions, of tests simultaneously. To gain statistical power it is common to group tests based on a priori criteria such as predefined regions or by sliding windows. However, it is not straightforward to choose grouping criteria and the results might depend on the chosen criteria. Methods that summarize, or aggregate, test statistics or p-values, without relying on a priori criteria, are therefore desirable. We present a simple method to aggregate a sequence of stochastic variables, such as test statistics or p-values, into fewer variables without assuming a priori defined groups...
August 1, 2016: Statistical Applications in Genetics and Molecular Biology
Nolen Perualila-Tan, Adetayo Kasim, Willem Talloen, Bie Verbist, Hinrich W H Göhlmann, Ziv Shkedy
The modern drug discovery process involves multiple sources of high-dimensional data. This imposes the challenge of data integration. A typical example is the integration of chemical structure (fingerprint features), phenotypic bioactivity (bioassay read-outs) data for targets of interest, and transcriptomic (gene expression) data in early drug discovery to better understand the chemical and biological mechanisms of candidate drugs, and to facilitate early detection of safety issues prior to later and expensive phases of drug development cycles...
August 1, 2016: Statistical Applications in Genetics and Molecular Biology
Charles Laurin, Dorret Boomsma, Gitta Lubke
The Lasso is a shrinkage regression method that is widely used for variable selection in statistical genetics. Commonly, K-fold cross-validation is used to fit a Lasso model. This is sometimes followed by using bootstrap confidence intervals to improve precision in the resulting variable selections. Nesting cross-validation within bootstrapping could provide further improvements in precision, but this has not been investigated systematically. We performed simulation studies of Lasso variable selection precision (VSP) with and without nesting cross-validation within bootstrapping...
August 1, 2016: Statistical Applications in Genetics and Molecular Biology
Chamont Wang, Jana L Gevertz
Modern biological experiments often involve high-dimensional data with thousands or more variables. A challenging problem is to identify the key variables that are related to a specific disease. Confounding this task is the vast number of statistical methods available for variable selection. For this reason, we set out to develop a framework to investigate the variable selection capability of statistical methods that are commonly applied to analyze high-dimensional biological datasets. Specifically, we designed six simulated cancers (based on benchmark colon and prostate cancer data) where we know precisely which genes cause a dataset to be classified as cancerous or normal - we call these causative genes...
August 1, 2016: Statistical Applications in Genetics and Molecular Biology
Christopher D Steele, Matthew Greenhalgh, David J Balding
In recent years statistical models for the analysis of complex (low-template and/or mixed) DNA profiles have moved from using only presence/absence information about allelic peaks in an electropherogram, to quantitative use of peak heights. This is challenging because peak heights are very variable and affected by a number of factors. We present a new peak-height model with important novel features, including over- and double-stutter, and a new approach to dropin. Our model is incorporated in open-source R code likeLTD...
July 14, 2016: Statistical Applications in Genetics and Molecular Biology
Yuna Blum, Magalie Houée-Bigot, David Causeur
Inference on gene regulatory networks from high-throughput expression data turns out to be one of the main current challenges in systems biology. Such networks can be very insightful for the deep understanding of interactions between genes. Because gene-gene interactions are often viewed as joint contributions to known biological mechanisms, inference on the dependence among gene expressions is expected to be consistent to some extent with the functional characterization of genes which can be derived from ontologies (GO, KEGG, …)...
June 1, 2016: Statistical Applications in Genetics and Molecular Biology
