Statistics and its Interface

Baolin Wu, James S Pankow
More and more large cohort studies have conducted or are conducting genome-wide association studies (GWAS) to reveal the genetic components of many complex human diseases. These large cohort studies often collected a broad array of correlated phenotypes that reflect common physiological processes. By jointly analyzing these correlated traits, we can gain more power by aggregating multiple weak effects and shed light on the mechanisms underlying complex human diseases. The majority of existing multi-trait association test methods are based on jointly modeling the multivariate traits conditional on the genotype as a covariate, and can readily accommodate imputed SNPs by using their imputed dosage as a covariate...
2017: Statistics and its Interface
Thaddeus Tarpey, Eva Petkova, Liangyu Zhu
Understanding heterogeneity in phenotypical characteristics, symptom manifestations, and response to treatment of subjects with psychiatric illnesses is a continuing challenge in mental health research. A long-standing goal of medical studies is to identify groups of subjects characterized by a particular trait or quality and to distinguish them from other subjects in a clinically relevant way. This paper develops and illustrates a novel approach to this problem based on a method of optimal partitioning (clustering) of functional data...
July 1, 2016: Statistics and its Interface
Yang Li, Yanqing Sun
Longitudinal data frequently arise in many fields such as medical follow-up studies focusing on specific longitudinal responses. In such situations, the responses are recorded only at discrete observation times. Most existing approaches for longitudinal data analysis assume that the observation or follow-up times are independent of the underlying response process, either completely or given some known covariates. We present a joint analysis approach in which possible correlations among the responses, observation and follow-up times can be characterized by time-dependent random effects...
2016: Statistics and its Interface
Kun Chen
Reduced-rank methods are very popular in high-dimensional multivariate analysis for conducting simultaneous dimension reduction and model estimation. However, the commonly-used reduced-rank methods are not robust, as the underlying reduced-rank structure can be easily distorted by only a few data outliers. Anomalies are bound to exist in big data problems, and in some applications they themselves could be of the primary interest. While naive residual analysis is often inadequate for outlier detection due to potential masking and swamping, robust reduced-rank estimation approaches could be computationally demanding...
2016: Statistics and its Interface
Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, Jun Yan
Big data are data on a massive scale in terms of volume, intensity, and complexity that exceed the capacity of standard analytic tools. They present opportunities as well as challenges to statisticians. The role of computational statisticians in scientific discovery from big data analyses has been under-recognized even by peer statisticians. This article summarizes recent methodological and software developments in statistics that address the big data challenges. Methodologies are grouped into three classes: subsampling-based, divide and conquer, and online updating for stream data...
2016: Statistics and its Interface
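As an illustration of the divide-and-conquer class named in the abstract by Wang et al. above, the following is a minimal sketch for ordinary least squares, where the per-block summaries X'X and X'y can be added across blocks and the normal equations solved once. The function names and the simulated data are hypothetical and are not taken from the article; the sketch assumes every block fits in memory and that the pooled X'X is invertible.

```python
import numpy as np


def chunk_summaries(X_chunk, y_chunk):
    """Least-squares sufficient statistics (X'X, X'y) for one data block."""
    return X_chunk.T @ X_chunk, X_chunk.T @ y_chunk


def divide_and_conquer_ols(chunks):
    """Add the per-block summaries, then solve the normal equations once."""
    xtx_total, xty_total = None, None
    for X_chunk, y_chunk in chunks:
        xtx, xty = chunk_summaries(X_chunk, y_chunk)
        xtx_total = xtx if xtx_total is None else xtx_total + xtx
        xty_total = xty if xty_total is None else xty_total + xty
    return np.linalg.solve(xtx_total, xty_total)


# Simulated data (illustrative only): intercept plus two predictors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Split into four blocks of 250 rows, as if each were processed separately.
chunks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
beta_dc = divide_and_conquer_ols(chunks)

# Reference fit on the full data for comparison.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

For least squares the combined estimate agrees with the full-data fit up to floating-point error, because the summaries are additive. The same loop can process blocks as they arrive, which is the idea behind online updating for stream data.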
Wan-Min Tsai, Heping Zhang, Eugenia Buta, Stephanie O'Malley, Ralitza Gueorguieva
The tree-based methodology has been widely applied to identify predictors of health outcomes in medical studies. However, the classical tree-based approaches do not pay particular attention to treatment assignment and thus do not consider prediction in the context of treatment received. In recent years, attention has been shifting from average treatment effects to identifying moderators of treatment response, and tree-based approaches to identify subgroups of subjects with enhanced treatment responses are emerging...
2016: Statistics and its Interface
Taoyun Cao, Xueqin Wang, Heping Zhang
This paper introduces Energy Bagging Tree (EBT) for multivariate nonparametric regression problems. The EBT makes use of a measure of dispersion constructed from a generalized Gini's mean difference as node impurity, and the tree split function therefore corresponds to the product of energy distance and descendants' proportions. As a non-parametric extension of the between-sample variation in the analysis of variance, this measure of dispersion serves well for EBT in understanding certain complex data. Extensive simulation studies indicate that EBT is highly competitive with existing regression tree methods...
2016: Statistics and its Interface
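The split function described in the abstract by Cao, Wang and Zhang above can be sketched as follows. This is one reading of "the product of energy distance and descendants' proportions", using the plug-in (V-statistic) form of the energy distance, 2E|X-Y| - E|X-X'| - E|Y-Y'|. The function names are hypothetical, and the exact constants and estimator used in the article may differ.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Average Euclidean distance between every row of a and every row of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).mean()


def energy_distance(a, b):
    """Plug-in energy distance between two multivariate samples."""
    return (2.0 * mean_pairwise_distance(a, b)
            - mean_pairwise_distance(a, a)
            - mean_pairwise_distance(b, b))


def split_score(y_left, y_right):
    """Assumed split criterion: child proportions times energy distance."""
    n_left, n_right = len(y_left), len(y_right)
    p_left = n_left / (n_left + n_right)
    p_right = 1.0 - p_left
    return p_left * p_right * energy_distance(y_left, y_right)


# Illustrative bivariate responses in three candidate child nodes.
rng = np.random.default_rng(0)
node_a = rng.normal(loc=0.0, size=(50, 2))
node_b = rng.normal(loc=3.0, size=(50, 2))   # clearly shifted from node_a
node_c = rng.normal(loc=0.0, size=(50, 2))   # same distribution as node_a

score_separated = split_score(node_a, node_b)
score_similar = split_score(node_a, node_c)
```

A candidate split whose children have clearly different response distributions receives a larger score than one whose children look alike, so a tree-growing routine would prefer the former.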
Jiwei Zhao, Heping Zhang
The need for analysis of multiple responses arises from many applications. In behavioral science, for example, comorbidity is a common phenomenon where multiple disorders occur in the same person. The advantage of jointly analyzing multiple correlated responses has been examined and documented. Due to the difficulties of modeling multiple responses, nonparametric tests such as generalized Kendall's Tau have been developed to assess the association between multiple responses and risk factors. These procedures have been applied to genome-wide association studies of multiple complex traits...
2016: Statistics and its Interface
Qingrun Zhang, Chris Tyler-Smith, Quan Long
To identify evolutionary events from the footprints left in the patterns of genetic variation in a population, people use many statistical frameworks, including neutrality tests. In datasets from current high throughput sequencing and genotyping platforms, it is common to have missing data and low-confidence SNP calls at many segregating sites. However, the traditional statistical framework for neutrality tests does not allow for these possibilities; therefore the usual way of treating missing data is to ignore segregating sites with missing/low confidence calls, regardless of the good SNP calls at these sites in other individuals...
October 1, 2015: Statistics and its Interface
Le Bao, Adrian E Raftery, Amala Reddy
In most countries in the world outside of sub-Saharan Africa, HIV is largely concentrated in sub-populations whose behavior puts them at higher risk of contracting and transmitting HIV, such as people who inject drugs, sex workers and men who have sex with men. Estimating the size of these sub-populations is important for assessing overall HIV prevalence and designing effective interventions. We present a Bayesian hierarchical model for estimating the sizes of local and national HIV key affected populations...
April 1, 2015: Statistics and its Interface
Yanming Di
We consider negative binomial (NB) regression models for RNA-Seq read counts and investigate an approach where such NB regression models are fitted to individual genes separately and, in particular, the NB dispersion parameter is estimated from each gene separately without assuming commonalities between genes. This single-gene approach contrasts with the more widely-used dispersion-modeling approach where the NB dispersion is modeled as a simple function of the mean or other measures of read abundance, and then estimated from a large number of genes combined...
2015: Statistics and its Interface
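To make the single-gene idea in the abstract by Di above concrete, here is a minimal sketch that estimates the NB dispersion from each gene separately. It uses a simple method-of-moments estimator based on Var(Y) = mu + phi * mu^2, which is an assumption made for brevity; the abstract does not state the estimator, and the article's approach may well be likelihood-based. It also assumes no covariates and equal library sizes. All names and simulated values are hypothetical.

```python
import numpy as np


def moment_dispersion(counts):
    """Method-of-moments NB dispersion for one gene, truncated at zero.

    Solves sample_variance = mean + phi * mean**2 for phi.
    """
    mean = counts.mean()
    variance = counts.var(ddof=1)
    return max((variance - mean) / mean ** 2, 0.0)


# Simulate read counts for three genes with a known dispersion (illustrative).
rng = np.random.default_rng(1)
true_phi, mu = 0.2, 50.0
nb_size = 1.0 / true_phi                 # NumPy's "n" parameter
nb_prob = nb_size / (nb_size + mu)       # NumPy's "p" parameter
counts = rng.negative_binomial(nb_size, nb_prob, size=(3, 5000))

# One dispersion estimate per gene, with no information shared across genes.
phi_hat = [moment_dispersion(gene_counts) for gene_counts in counts]
```

The sketch uses 5000 replicates per gene only so that the estimates are stable. With the handful of replicates typical of RNA-Seq experiments, per-gene estimates are far noisier, which is the trade-off against the dispersion-modeling approach that pools many genes.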
Hui Jiang, Julia Salzman
Ultra high-throughput sequencing of transcriptomes (RNA-Seq) has enabled the accurate estimation of gene expression at individual isoform level. However, systematic biases introduced during the sequencing and mapping processes as well as incompleteness of the transcript annotation databases may cause the estimates of isoform abundances to be unreliable, and in some cases, highly inaccurate. This paper introduces a penalized likelihood approach to detect and correct for such biases in a robust manner. Our model extends those previously proposed by introducing bias parameters for reads...
2015: Statistics and its Interface
Xianbin Zeng, Shuangge Ma, Yichen Qin, Yang Li
In this paper, we consider the variable selection problem in semiparametric additive partially linear models for longitudinal data. Our goal is to identify relevant main effects and corresponding interactions associated with the response variable. Meanwhile, we enforce the strong hierarchical restriction on the model, that is, an interaction can be included in the model only if both of the associated main effects are included. Based on a B-spline basis approximation for the nonparametric components, we propose an iterative estimation procedure for the model by penalizing the likelihood with a partial group minimax concave penalty (MCP), and use BIC to select the tuning parameter...
2015: Statistics and its Interface
Victor H Lachos, Ming-Hui Chen, Carlos A Abanto-Valle, Caio L N Azevedo
HIV RNA viral load measures are often subject to upper and lower detection limits, depending on the quantification assays. Hence, the responses are either left or right censored. Linear/nonlinear mixed-effects models, with slight modifications to accommodate censoring, are routinely used to analyze this type of data. Usually, the inference procedures are based on normality (or elliptical distribution) assumptions for the random terms. However, those analyses might not provide robust inference when the distribution assumptions are questionable...
2015: Statistics and its Interface
Eugene Urrutia, Seunggeun Lee, Arnab Maity, Ni Zhao, Judong Shen, Yun Li, Michael C Wu
Analysis of rare genetic variants has focused on region-based analysis wherein a subset of the variants within a genomic region is tested for association with a complex trait. Two important practical challenges have emerged. First, it is difficult to choose which test to use. Second, it is unclear which group of variants within a region should be tested. Both depend on the unknown true state of nature. Therefore, we develop the Multi-Kernel SKAT (MK-SKAT) which tests across a range of rare variant tests and groupings...
2015: Statistics and its Interface
Dennis Kostka, Tara Friedrich, Alisha K Holloway, Katherine S Pollard
Next-generation sequencing technology enables the identification of thousands of gene regulatory sequences in many cell types and organisms. We consider the problem of testing if two such sequences differ in their number of binding site motifs for a given transcription factor (TF) protein. Binding site motifs impart regulatory function by providing TFs the opportunity to bind to genomic elements and thereby affect the expression of nearby genes. Evolutionary changes to such functional DNA are hypothesized to be major contributors to phenotypic diversity within and between species; but despite the importance of TF motifs for gene expression, no method exists to test for motif loss or gain...
2015: Statistics and its Interface
Hongtu Zhu, Joseph G Ibrahim, Qingxia Chen
We establish a connection between Bayesian case influence measures for assessing the influence of individual observations and Bayesian predictive methods for evaluating the predictive performance of a model and comparing different models fitted to the same dataset. Based on such a connection, we formally propose a new set of Bayesian case-deletion model complexity (BCMC) measures for quantifying the effective number of parameters in a given statistical model. Its properties in linear models are explored. Adding some functions of BCMC to a conditional deviance function leads to a Bayesian case-deletion information criterion (BCIC) for comparing models...
October 1, 2014: Statistics and its Interface
Qingxia Chen, Joseph G Ibrahim
Multiple Imputation, Maximum Likelihood and Fully Bayesian methods are the three most commonly used model-based approaches in missing data problems. Although it is easy to show that when the responses are missing at random (MAR), the complete case analysis is unbiased and efficient, the aforementioned methods are still commonly used in practice for this setting. To examine the performance of and relationships between these three methods in this setting, we derive and investigate small sample and asymptotic expressions of the estimates and standard errors, and fully examine how these estimates are related for the three approaches in the linear regression model when the responses are MAR...
July 1, 2014: Statistics and its Interface
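The setting in the abstract by Chen and Ibrahim above, linear regression with responses missing at random, can be illustrated with a small simulation of the complete case analysis. The sketch assumes a single covariate, a missingness probability that depends only on that fully observed covariate, and hypothetical function and variable names. It checks only that the complete-case coefficients land near the true values and does not compare the three model-based methods.

```python
import numpy as np


def complete_case_ols(x, y):
    """Fit y = b0 + b1 * x by least squares using rows with observed y only."""
    observed = ~np.isnan(y)
    design = np.column_stack([np.ones(observed.sum()), x[observed]])
    return np.linalg.lstsq(design, y[observed], rcond=None)[0]


rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true intercept 1, true slope 2

# MAR mechanism: the chance that y is missing depends on x, never on y itself.
p_missing = 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(rng.uniform(size=n) < p_missing, np.nan, y)

beta_cc = complete_case_ols(x, y_obs)
```

Roughly half of the responses are dropped here, and more often for large x, yet the fitted intercept and slope stay close to 1 and 2 because the missingness carries no information about y once x is given.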
Binbing Yu, A James O'Malley, Pulak Ghosh
Multivariate outcomes with heavy skewness and thick tails often arise from clustered experiments or longitudinal studies. Linear mixed models with multivariate skew-t (MST) distributions for the random effects and the error terms are a popular tool for robust modeling of such outcomes. However, the usual MST distribution allows only a single degrees-of-freedom parameter shared by all marginal distributions, which is appropriate only when each marginal has the same amount of tail heaviness. In this paper, we introduce a new class of extended MST distributions, which allow different degrees of freedom and thereby can accommodate heterogeneity in tail-heaviness across outcomes...
2014: Statistics and its Interface
Himel Mallick, Nengjun Yi
Park and Casella (2008) provided the Bayesian lasso for linear models by assigning scale mixture of normal (SMN) priors on the parameters and independent exponential priors on their variances. In this paper, we propose an alternative Bayesian analysis of the lasso problem. A different hierarchical formulation of Bayesian lasso is introduced by utilizing the scale mixture of uniform (SMU) representation of the Laplace density. We consider a fully Bayesian treatment that leads to a new Gibbs sampler with tractable full conditional posterior distributions...
2014: Statistics and its Interface
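The scale mixture of uniform (SMU) representation mentioned in the abstract by Mallick and Yi above can be checked numerically. A standard identity, assumed here to be the one the abstract refers to, is that if u ~ Gamma(shape 2, rate lambda) and beta | u ~ Uniform(-u, u), then beta has the Laplace density (lambda/2) exp(-lambda |beta|). This follows because integrating (1/(2u)) * lambda^2 * u * exp(-lambda u) over u > |beta| gives exactly that density. The function name and parameter values below are hypothetical, and the sketch shows only the representation, not the article's Gibbs sampler.

```python
import numpy as np


def sample_laplace_via_smu(lam, n, rng):
    """Draw Laplace(0, scale 1/lam) variates as a scale mixture of uniforms."""
    # Mixing variable: Gamma with shape 2 and rate lam (NumPy uses scale = 1/rate).
    u = rng.gamma(shape=2.0, scale=1.0 / lam, size=n)
    # Conditional on u, the coefficient is uniform on (-u, u).
    return rng.uniform(-u, u)


rng = np.random.default_rng(3)
lam = 2.0
draws = sample_laplace_via_smu(lam, 200_000, rng)

# Moments implied by the Laplace density with rate lam.
expected_variance = 2.0 / lam ** 2
expected_mean_abs = 1.0 / lam
```

The sample variance and mean absolute value of the draws can be compared with these Laplace moments. In a Gibbs sampler this representation swaps the normal-exponential hierarchy of Park and Casella (2008) for uniform conditionals on the coefficients.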
