Helene Ruffieux, Anthony C Davison, Jorg Hager, Irina Irincheeva
Combined inference for heterogeneous high-dimensional data is critical in modern biology, where clinical and various kinds of molecular data may be available from a single study. Classical genetic association studies regress a single clinical outcome on many genetic variants one by one, but there is an increasing demand for joint analysis of many molecular outcomes and genetic variants in order to unravel functional interactions. Unfortunately, most existing approaches to joint modeling are either too simplistic to be powerful or are impracticable for computational reasons...
March 16, 2017: Biostatistics
Denis Agniel, Boris P Hejblum
As gene expression measurement technology is shifting from microarrays to sequencing, the statistical tools available for their analysis must be adapted since RNA-seq data are measured as counts. It has been proposed to model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity. In this vein, we propose tcgsaseq, a principled, model-free, and efficient method for detecting longitudinal changes in RNA-seq gene sets defined a priori. The method identifies those gene sets whose expression varies over time, based on an original variance component score test accounting for both covariates and heteroscedasticity without assuming any specific parametric distribution for the (transformed) counts...
March 10, 2017: Biostatistics
Panagiota Filippou, Giampiero Marra, Rosalba Radice
This article proposes a penalized likelihood method to estimate a trivariate probit model, which accounts for several types of covariate effects (such as linear, nonlinear, random, and spatial effects), as well as error correlations. The proposed approach also addresses the difficulty in estimating accurately the correlation coefficients, which characterize the dependence of binary responses conditional on covariates. The parameters of the model are estimated within a penalized likelihood framework based on a carefully structured trust region algorithm with integrated automatic multiple smoothing parameter selection...
March 4, 2017: Biostatistics
Joseph Antonelli, Corwin Zigler, Francesca Dominici
In comparative effectiveness research, we are often interested in the estimation of an average causal effect from large observational data (the main study). Often these data do not measure all the necessary confounders. On many occasions, an extensive set of additional covariates is measured for a smaller and non-representative population (the validation study). In this setting, standard approaches for missing data imputation might not be adequate due to the large number of missing covariates in the main data relative to the smaller sample size of the validation data...
March 3, 2017: Biostatistics
Ander Wilson, Yueh-Hsiu Mathilda Chiu, Hsiao-Hsien Leon Hsu, Robert O Wright, Rosalind J Wright, Brent A Coull
Epidemiological research supports an association between maternal exposure to air pollution during pregnancy and adverse children's health outcomes. Advances in exposure assessment and statistics allow for estimation of both critical windows of vulnerability and exposure effect heterogeneity. Simultaneous estimation of windows of vulnerability and effect heterogeneity can be accomplished by fitting a distributed lag model (DLM) stratified by subgroup. However, this can provide an incomplete picture of how effects vary across subgroups because it does not allow for subgroups to have the same window but different within-window effects or to have different windows but the same within-window effect...
February 27, 2017: Biostatistics
Amanda F Mejia, Mary Beth Nebel, Ani Eloyan, Brian Caffo, Martin A Lindquist
Outlier detection for high-dimensional (HD) data is a popular topic in modern statistical research. However, one source of HD data that has received relatively little attention is functional magnetic resonance images (fMRI), which consist of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing (primarily through large, publicly available grassroots datasets), automated quality control and outlier detection methods are greatly needed...
February 27, 2017: Biostatistics
Coraline Danieli, Nadine Bossard, Laurent Roche, Aurelien Belot, Zoe Uhry, Hadrien Charvat, Laurent Remontet
Net survival, the survival that would be observed if the disease under study were the only cause of death, is an important, useful, and increasingly used indicator in public health, especially in population-based studies. Estimates of net survival and effects of prognostic factors can be obtained by excess hazard regression modeling. Whereas various diagnostic tools were developed for overall survival analysis, few methods are available to check the assumptions of excess hazard models. We propose here two formal tests to check the proportional hazard assumption and the validity of the functional form of the covariate effects in the context of flexible parametric excess hazard modeling...
February 26, 2017: Biostatistics
Jing Ning, Yong Chen, Jin Piao
Publication bias occurs when the published research results are systematically unrepresentative of the population of studies that have been conducted, and is a potential threat to meaningful meta-analysis. The Copas selection model provides a flexible framework for correcting estimates and offers considerable insight into the publication bias. However, maximizing the observed likelihood under the Copas selection model is challenging because the observed data contain very little information on the latent variable...
February 25, 2017: Biostatistics
Vegard Nygaard, Einar Andreas Rødland, Eivind Hovig
No abstract text is available yet for this article.
February 20, 2017: Biostatistics
Jakub Pecanka, Marianne A Jonker, Zoltan Bochdanovits, Aad W Van Der Vaart
For over a decade functional gene-to-gene interaction (epistasis) has been suspected to be a determinant in the "missing heritability" of complex traits. However, searching for epistasis on the genome-wide scale has been challenging due to the prohibitively large number of tests which result in a serious loss of statistical power as well as computational challenges. In this article, we propose a two-stage method applicable to existing case-control data sets, which aims to lessen both of these problems by pre-assessing whether a candidate pair of genetic loci is involved in epistasis before it is actually tested for interaction with respect to a complex phenotype...
February 6, 2017: Biostatistics
Aaron T L Lun, John C Marioni
An increasing number of studies are using single-cell RNA-sequencing (scRNA-seq) to characterize the gene expression profiles of individual cells. One common analysis applied to scRNA-seq data involves detecting differentially expressed (DE) genes between cells in different biological groups. However, many experiments are designed such that the cells to be compared are processed in separate plates or chips, meaning that the groupings are confounded with systematic plate effects. This confounding aspect is frequently ignored in DE analyses of scRNA-seq data...
February 6, 2017: Biostatistics
Roland A Matsouaka, Eric J Tchetgen Tchetgen
We consider estimating causal odds ratios using an instrumental variable under a logistic structural nested mean model (LSNMM). Current methods for LSNMMs either rely heavily on possible "uncongenial" modeling assumptions or involve intricate numerical challenges, which have impeded their use. In this article, we present an alternative method that ensures a congenial parametrization, circumvents computational complexity of existing methods, and is easy to implement. We illustrate the proposed method to (1) estimate the causal effect of years of education on earnings using data from the NLSYM and (2) assess the impact that moving families from high- to low-poverty neighborhoods had on lifetime major depressive disorder among adolescents in the "Moving to Opportunity (MTO) for Fair Housing Demonstration Project" from the Department of Housing and Urban Development...
February 6, 2017: Biostatistics
Tianmeng Lyu, Eric F Lock, Lynn E Eberly
High-dimensional linear classifiers, such as distance weighted discrimination (DWD) and versions of the support vector machine (SVM), are commonly used in biomedical research to distinguish groups of subjects based on a large number of features. However, their use is limited to applications where a single vector of features is measured for each subject. In practice, data are often multi-way, or measured over multiple dimensions. For example, metabolite abundance may be measured over multiple regions or tissues, or gene expression may be measured over multiple time points, for the same subjects...
January 23, 2017: Biostatistics
Abhishek Kaul, Ori Davidov, Shyamal D Peddada
This paper is motivated by the recent interest in the analysis of high-dimensional microbiome data. A key feature of these data is the presence of "structural zeros", which are microbes missing from an observation vector due to an underlying biological process and not due to error in measurement. Typical notions of missingness are unable to model these structural zeros. We define a general framework which allows for structural zeros in the model and propose methods of estimating sparse high-dimensional covariance and precision matrices under this setup...
January 8, 2017: Biostatistics
Emily J Huang, Ethan X Fang, Daniel F Hanley, Michael Rosenblum
In many randomized controlled trials, the primary analysis focuses on the average treatment effect and does not address whether treatment benefits are widespread or limited to a select few. This problem affects many disease areas, since it stems from how randomized trials, often the gold standard for evaluating treatments, are designed and analyzed. Our goal is to learn about the fraction who benefit from a new treatment using randomized trial data. We consider the case where the outcome is ordinal, with binary outcomes as a special case...
December 26, 2016: Biostatistics
Sebastian Meyer, Leonhard Held
Routine public health surveillance of notifiable infectious diseases gives rise to weekly counts of reported cases, possibly stratified by region and/or age group. We investigate how an age-structured social contact matrix can be incorporated into a spatio-temporal endemic-epidemic model for infectious disease counts. To illustrate the approach, we analyze the spread of norovirus gastroenteritis over six age groups within the 12 districts of Berlin, 2011-2015, using contact data from the POLYMOD study...
December 26, 2016: Biostatistics
Duncan Lee, Sabyasachi Mukhopadhyay, Alastair Rushworth, Sujit K Sahu
In the United Kingdom, air pollution is linked to around 40000 premature deaths each year, but estimating its health effects is challenging in a spatio-temporal study. The challenges include spatial misalignment between the pollution and disease data; uncertainty in the estimated pollution surface; and complex residual spatio-temporal autocorrelation in the disease data. This article develops a two-stage model that addresses these issues. The first stage is a spatio-temporal fusion model linking modeled and measured pollution data, while the second stage links these predictions to the disease data...
December 26, 2016: Biostatistics
Wei Fu, Jeffrey S Simonoff
Tree methods (recursive partitioning) are a popular class of nonparametric methods for analyzing data. One extension of the basic tree methodology is the survival tree, which applies recursive partitioning to censored survival data. There are several existing survival tree methods in the literature, which are mainly designed for right-censored data. We propose two new survival trees for left-truncated and right-censored (LTRC) data, which can be seen as a generalization of the traditional survival tree for right-censored data...
December 26, 2016: Biostatistics
No abstract text is available yet for this article.
January 2017: Biostatistics
Sunghwan Kim, Steffi Oesterreich, Seyoung Kim, Yongseok Park, George C Tseng
With the rapid advances in microarray and massively parallel sequencing technologies, data from multiple omics sources for a large patient cohort are now frequently seen in many consortium studies. Effective multi-level omics data integration has brought new statistical challenges. One important biological objective of such integrative analysis is to cluster patients in order to identify clinically relevant disease subtypes, which will form the basis for tailored treatment and personalized medicine. Several methods have been proposed in the literature for this purpose, including the popular iCluster method used in many cancer applications...
January 2017: Biostatistics