Statistical Applications in Genetics and Molecular Biology

Haixiang Zhang, Yinan Zheng, Grace Yoon, Zhou Zhang, Tao Gao, Brian Joyce, Wei Zhang, Joel Schwartz, Pantel Vokonas, Elena Colicino, Andrea Baccarelli, Lifang Hou, Lei Liu
In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A precision matrix obtained via the constrained ℓ1 minimization method is used to account for the within-subject correlation among multivariate outcomes...
July 26, 2017: Statistical Applications in Genetics and Molecular Biology
Shofiqul Islam, Sonia Anand, Jemila Hamid, Lehana Thabane, Joseph Beyene
Linear principal component analysis (PCA) is a widely used approach to reduce the dimension of gene or miRNA expression data sets. This method relies on the linearity assumption, which often fails to capture the patterns and relationships inherent in the data. Thus, a nonlinear approach such as kernel PCA might be optimal. We develop a copula-based simulation algorithm that takes into account the degree of dependence and nonlinearity observed in these data sets. Using this algorithm, we conduct an extensive simulation to compare the performance of linear and kernel principal component analysis methods towards data integration and death classification...
July 26, 2017: Statistical Applications in Genetics and Molecular Biology
Fadhaa Ali, Jian Zhang
Multilocus haplotype analysis of candidate variants with genome wide association studies (GWAS) data may provide evidence of association with disease, even when the individual loci themselves do not. Unfortunately, when a large number of candidate variants are investigated, identifying risk haplotypes can be very difficult. To meet the challenge, a number of approaches have been put forward in recent years. However, most of them are not directly linked to the disease-penetrances of haplotypes and thus may not be efficient...
July 26, 2017: Statistical Applications in Genetics and Molecular Biology
Zhongxue Chen, Shizhong Han, Kai Wang
Many gene- and pathway-based association tests have been proposed in the literature. Among them, the SKAT is widely used, especially for rare variants association studies. In this paper, we investigate the connection between SKAT and a principal component analysis. This investigation leads to a procedure that encompasses SKAT as a special case. Through simulation studies and real data applications, we compare the proposed method with some existing tests.
July 26, 2017: Statistical Applications in Genetics and Molecular Biology
Aaron T L Lun, Gordon K Smyth
RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons...
April 25, 2017: Statistical Applications in Genetics and Molecular Biology
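To make the degrees-of-freedom point in the abstract above concrete, here is a minimal sketch for the simplest case it mentions: a one-way layout in which one treatment group has all-zero counts. The function name `residual_df_one_way` and the adjustment rule (drop all-zero groups, along with their coefficients, from the count) are assumptions made for illustration. They are one plausible reading of the abstract, not the authors' published procedure.

```python
def residual_df_one_way(counts_by_group):
    """Return (standard_df, adjusted_df) for a one-way layout.

    counts_by_group maps a group label to that group's replicate counts.
    """
    n_obs = sum(len(c) for c in counts_by_group.values())
    n_coef = len(counts_by_group)  # one mean parameter per group
    standard_df = n_obs - n_coef

    # Assumption: a group whose counts are all zero has fitted values of
    # exactly zero, so neither its observations nor its coefficient count.
    informative = [c for c in counts_by_group.values() if any(x != 0 for x in c)]
    adjusted_df = sum(len(c) for c in informative) - len(informative)
    return standard_df, adjusted_df


example = {"control": [0, 0, 0], "treated": [5, 8, 6]}
print(residual_df_one_way(example))  # (4, 2)
```

In this toy example the standard formula reports 4 residual d.f., while only the three treated replicates carry information about variability, leaving 2.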
Jakub Pecanka, Jelle Goeman
A classical approach to experimental design in many scientific fields is to first gather all of the data and then analyze it in a single analysis. It has been recognized that in many areas such practice leaves substantial room for improvement in terms of the researcher's ability to identify relevant effects, in terms of cost efficiency, or both. Considerable attention has been paid in recent years to multi-stage designs, in which the user alternates between data collection and analysis and thereby sequentially reduces the size of the problem...
April 25, 2017: Statistical Applications in Genetics and Molecular Biology
Ana Arribas-Gil, Catherine Matias
We propose an approach for multiple sequence alignment (MSA) derived from the dynamic time warping viewpoint and recent techniques of curve synchronization developed in the context of functional data analysis. Starting from pairwise alignments of all the sequences (viewed as paths in a certain space), we construct a median path that represents the MSA we are looking for. We establish a proof of concept that our method could be an interesting ingredient to include in refined MSA techniques. We present a simple synthetic experiment as well as the study of a benchmark dataset, together with comparisons with two widely used MSA software packages...
April 25, 2017: Statistical Applications in Genetics and Molecular Biology
Shahla Faisal, Gerhard Tutz
High-dimensional data such as gene expression and RNA-sequencing data often contain missing values. The subsequent analysis and results based on these incomplete data can suffer strongly from the presence of these missing values. Several approaches to imputation of missing values in gene expression data have been developed, but the task is difficult due to the high dimensionality (number of genes) of the data. Here an imputation procedure is proposed that uses weighted nearest neighbors. Instead of using nearest neighbors defined by a distance that includes all genes, the distance is computed for genes that are apt to contribute to the accuracy of imputed values...
April 25, 2017: Statistical Applications in Genetics and Molecular Biology
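The idea of weighted nearest-neighbor imputation with a distance restricted to selected genes can be sketched as follows. This is an illustration only: the function `impute_weighted_nn`, the Gaussian kernel weights, the `bandwidth` parameter, and the pre-chosen gene subset `use_idx` are all hypothetical choices, and the abstract does not say how the authors select genes or compute weights.

```python
import math


def impute_weighted_nn(target, donors, missing_idx, use_idx, k=2, bandwidth=1.0):
    """Impute target[missing_idx] from the k nearest donor samples.

    Distances use only the coordinates in use_idx (a hypothetical,
    pre-chosen subset of genes that must not include missing_idx).
    Weights come from a Gaussian kernel of the distance.
    """
    def dist(row):
        return math.sqrt(sum((target[j] - row[j]) ** 2 for j in use_idx))

    nearest = sorted(donors, key=dist)[:k]
    weights = [math.exp(-((dist(row) / bandwidth) ** 2)) for row in nearest]
    total = sum(weights)
    # Weighted average of the donors' observed values for the missing gene.
    return sum(w * row[missing_idx] for w, row in zip(weights, nearest)) / total


sample = [1.0, 2.0, None]  # third gene is missing
complete = [[1.0, 2.0, 10.0], [1.0, 2.0, 20.0], [9.0, 9.0, 100.0]]
print(impute_weighted_nn(sample, complete, missing_idx=2, use_idx=[0, 1], k=2))
```

Closer donors receive larger weights, so a distant sample such as the third row above has no influence when `k=2`.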
Jiehuan Sun, Joshua L Warren, Hongyu Zhao
Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering due to the high dimensionality. In this article, we introduce a novel subtype identification method in the Bayesian setting based on gene expression profiles...
April 25, 2017: Statistical Applications in Genetics and Molecular Biology
Margaret R Donald, Susan R Wilson
Output from analysis of a high-throughput 'omics' experiment very often is a ranked list. One commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, with a length of many hundreds of genes. There are numerous situations where interest is in the comparison of outputs following, say, two (or more) different experiments, or of different approaches to the analysis that produce different ranked lists. Rather than considering exact agreement between the rankings, following others, we consider two ranked lists to be in agreement if the rankings differ by some fixed distance...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
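The distance-based notion of agreement described in the abstract above can be illustrated with a short sketch. It assumes that "differ by some fixed distance" means the two ranks of an item differ by at most a chosen threshold. The function name `rank_agreement` and the parameter `max_distance` are hypothetical, and the authors' actual statistic may differ.

```python
def rank_agreement(list_a, list_b, max_distance):
    """Items in both ranked lists whose ranks differ by at most max_distance."""
    rank_b = {item: r for r, item in enumerate(list_b)}
    return [
        item
        for r, item in enumerate(list_a)
        if item in rank_b and abs(r - rank_b[item]) <= max_distance
    ]


experiment_1 = ["g1", "g2", "g3", "g4"]
experiment_2 = ["g3", "g1", "g2", "g4"]
print(rank_agreement(experiment_1, experiment_2, max_distance=0))  # ['g4']
print(rank_agreement(experiment_1, experiment_2, max_distance=1))  # ['g1', 'g2', 'g4']
```

With `max_distance=0` this reduces to exact agreement. Larger thresholds tolerate small rank shifts between experiments.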
Pei-Fang Su, Yu-Lin Mau, Yan Guo, Chung-I Li, Qi Liu, John D Boice, Yu Shyr
To assess the effect of chemotherapy on mitochondrial genome mutations in cancer survivors and their offspring, a study sequenced the full mitochondrial genome and determined the mitochondrial DNA (mtDNA) heteroplasmic mutation rate. To build a model for counts of heteroplasmic mutations in mothers and their offspring, bivariate Poisson regression was used to examine the relationship between mutation count and clinical information while accounting for the paired correlation. However, if the sequencing depth is not adequate, a limited fraction of the mtDNA will be available for variant calling...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Alexandre Bureau, Jordie Croteau
Polytomous phenotypes arise when a disease has multiple subtypes or when two dichotomous phenotypes are analyzed simultaneously. Few software programs offer the option to analyze such phenotypes in family studies, and none implements conditional polytomous logistic regression for within-family analysis robust to population stratification. We introduce Polyunphased, an extension to polytomous phenotypes of the Unphased package, a flexible software tool for genetic association analysis in nuclear families. Like Unphased, Polyunphased is written in C++ and runs from the command line or from a Java graphical user interface...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Mateen R Shaikh, Joseph Beyene
Microbiomes, populations of microscopic organisms, have been found to be related to human health, and it is expected that further investigations will lead to novel perspectives on disease. The data used to analyze microbiomes are among the newest types (the result of high-throughput technology), and the means of analyzing these data are still rapidly evolving. One of the distributions that have been introduced into the microbiome literature, the Dirichlet-Multinomial, has received considerable attention. We extend this distribution's use to uncover compositional relationships between organisms at a taxonomic level...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Xu Liu, Bin Gao, Yuehua Cui
Epidemiological studies have suggested the joint effect of simultaneous exposures to multiple environments on disease risk. However, how environmental mixtures as a whole jointly modify genetic effect on disease risk is still largely unknown. Given the importance of gene-environment (G×E) interactions on many complex diseases, rigorously assessing the interaction effect between genes and environmental mixtures as a whole could shed novel insights into the etiology of complex diseases. For this purpose, we propose a generalized partial linear varying multi-index coefficient model (GPLVMICM) to capture the genetic effect on disease risk modulated by multiple environments as a whole...
March 1, 2017: Statistical Applications in Genetics and Molecular Biology
Ao Kong, Robert Azencott
For mass spectra acquired from cancer patients by MALDI or SELDI techniques, automated discrimination between cancer types or stages has often been implemented by machine learning algorithms. Nevertheless, these techniques typically lack interpretability in terms of biomarkers. In this paper, we propose a new mass spectra discrimination algorithm by parameterized Markov Random Fields to automatically generate interpretable classifiers with small groups of scored biomarkers. A dataset of 238 MALDI colorectal mass spectra and two datasets of 216 and 253 SELDI ovarian mass spectra respectively were used to test our approach...
February 11, 2017: Statistical Applications in Genetics and Molecular Biology
Venkateshan Kannan, Jesper Tegner
We propose a novel systematic procedure of non-linear data transformation for an adaptive algorithm in the context of network reverse-engineering using information theoretic methods. Our methodology is rooted in elucidating and correcting for the specific biases in the estimation techniques for mutual information (MI) given a finite sample of data. These are, in turn, tied to the lack of well-defined bounds for numerical estimation of MI for continuous probability distributions from finite data. The nature and properties of the inevitable bias are described, complemented by several examples illustrating their form and variation...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Audrey Qiuyan Fu, Lior Pachter
Gene expression is stochastic and displays variation ("noise") both within and between cells. Intracellular (intrinsic) variance can be distinguished from extracellular (extrinsic) variance by applying the law of total variance to data from two-reporter assays that probe expression of identically regulated gene pairs in single cells. We examine established formulas [Elowitz, M. B., A. J. Levine, E. D. Siggia and P. S. Swain (2002): "Stochastic gene expression in a single cell," Science, 297, 1183-1186.] for the estimation of intrinsic and extrinsic noise and provide interpretations of them in terms of a hierarchical model...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
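The decomposition named in the abstract above can be written out explicitly. As a sketch, let $C$ and $Y$ denote the two reporter expression levels in a cell and $Z$ the shared cellular state. These symbols are chosen here for illustration and are not taken from the paper. Assume also that the two reporters are conditionally independent and identically distributed given $Z$.

```latex
\begin{align}
\operatorname{Var}(C)
  &= \underbrace{\mathbb{E}\big[\operatorname{Var}(C \mid Z)\big]}_{\text{intrinsic (within-cell)}}
   + \underbrace{\operatorname{Var}\big(\mathbb{E}[C \mid Z]\big)}_{\text{extrinsic (between-cell)}}, \\
\operatorname{Cov}(C, Y)
  &= \mathbb{E}\big[\operatorname{Cov}(C, Y \mid Z)\big]
   + \operatorname{Cov}\big(\mathbb{E}[C \mid Z],\, \mathbb{E}[Y \mid Z]\big)
   = \operatorname{Var}\big(\mathbb{E}[C \mid Z]\big).
\end{align}
```

Under these assumptions the first term of the covariance vanishes, so the covariance between the two reporters isolates the extrinsic component, and the remainder of the variance is intrinsic.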
Katherine L Thompson, Catherine R Linnen, Laura Kubatko
A central goal in biological and biomedical sciences is to identify the molecular basis of variation in morphological and behavioral traits. Over the last decade, improvements in sequencing technologies coupled with the active development of association mapping methods have made it possible to link single nucleotide polymorphisms (SNPs) and quantitative traits. However, a major limitation of existing methods is that they are often unable to consider complex, but biologically-realistic, scenarios. Previous work showed that association mapping method performance can be improved by using the evolutionary history within each SNP to estimate the covariance structure among randomly-sampled individuals...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Chung-I Li, Yu Shyr
As RNA-seq rapidly develops and costs continually decrease, the quantity and frequency of samples being sequenced will grow exponentially. With proteomic investigations becoming more multivariate and quantitative, determining a study's optimal sample size is now a vital step in experimental design. Current methods for calculating a study's required sample size are mostly based on the hypothesis testing framework, which assumes each gene count can be modeled through Poisson or negative binomial distributions; however, these methods are limited when it comes to accommodating covariates...
December 1, 2016: Statistical Applications in Genetics and Molecular Biology
Alexia Kakourou, Werner Vach, Simone Nicolardi, Yuri van der Burgt, Bart Mertens
Mass spectrometry based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple, yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data...
October 1, 2016: Statistical Applications in Genetics and Molecular Biology