Journal of the American Statistical Association

Lan Wang, Yu Zhou, Rui Song, Ben Sherwood
Finding the optimal treatment regime (or a series of sequential treatment regimes) based on individual characteristics has important applications in areas such as precision medicine, government policies and active labor market interventions. In the current literature, the optimal treatment regime is usually defined as the one that maximizes the average benefit in the potential population. This paper studies a general framework for estimating the quantile-optimal treatment regime, which is of importance in many real-world applications...
2018: Journal of the American Statistical Association
Daniel Backenroth, Jeff Goldsmith, Michelle D Harran, Juan C Cortes, John W Krakauer, Tomoko Kitago
We propose a novel method for estimating population-level and subject-specific effects of covariates on the variability of functional data. We extend the functional principal components analysis framework by modeling the variance of principal component scores as a function of covariates and subject-specific random effects. In a setting where principal components are largely invariant across subjects and covariate values, modeling the variance of these scores provides a flexible and interpretable way to explore factors that affect the variability of functional data...
2018: Journal of the American Statistical Association
Quan Zhou, Yongtao Guan
We show that under the null, the 2 log(Bayes factor) is asymptotically distributed as a weighted sum of chi-squared random variables with a shifted mean. This claim holds for Bayesian multi-linear regression with a family of conjugate priors, namely, the normal-inverse-gamma prior, the g-prior, and the normal prior. Our results have three immediate impacts. First, we can compute analytically a p-value associated with a Bayes factor without the need for permutation. We provide a software package that can evaluate the p-value associated with a Bayes factor efficiently and accurately...
2018: Journal of the American Statistical Association
Ross P Hilton, Yuchen Zheng, Nicoleta Serban
We introduce a modeling approach for characterizing heterogeneity in healthcare utilization using massive medical claims data. We first translate the medical claims observed for a large study population and across five years into individual-level discrete events of care called utilization sequences. We model the utilization sequences using an exponential proportional hazards mixture model to capture heterogeneous behaviors in patients' healthcare utilization. The objective is to cluster patients according to their longitudinal utilization behaviors and to determine the main drivers of variation in healthcare utilization while controlling for the demographic, geographic, and health characteristics of the patients...
2018: Journal of the American Statistical Association
Dungang Liu, Heping Zhang
Ordinal outcomes are common in scientific research and everyday practice, and we often rely on regression models to make inference. A long-standing problem with such regression analyses is the lack of effective diagnostic tools for validating model assumptions. The difficulty arises from the fact that an ordinal variable has discrete values that are labeled with, but are not, numerical values. The values merely represent ordered categories. In this paper, we propose a surrogate approach to defining residuals for an ordinal outcome Y...
2018: Journal of the American Statistical Association
Kun Chen, Neha Mishra, Joan Smyth, Haim Bar, Elizabeth Schifano, Lynn Kuo, Ming-Hui Chen
Necrotic enteritis (NE) is a serious disease of poultry caused by the bacterium C. perfringens. To identify proteins of C. perfringens that confer virulence with respect to NE, the protein secretions of four NE disease-producing strains and one baseline non-disease-producing strain of C. perfringens were examined. The problem then becomes a clustering task, for the identification of two extreme groups of proteins that were produced at either concordantly higher or concordantly lower levels across all four disease-producing strains compared to the baseline, when most of the proteins do not exhibit significant change across all strains...
2018: Journal of the American Statistical Association
Mark S Handcock
2018: Journal of the American Statistical Association
Danielle Braun, Malka Gorfine, Hormuzd A Katki, Argyrios Ziogas, Giovanni Parmigiani
Mismeasured time to event data used as a predictor in risk prediction models will lead to inaccurate predictions. This arises in the context of self-reported family history, a time to event predictor often measured with error, used in Mendelian risk prediction models. Using validation data, we propose a method to adjust for this type of error. We estimate the measurement error process using a nonparametric smoothed Kaplan-Meier estimator, and use Monte Carlo integration to implement the adjustment. We apply our method to simulated data in the context of both Mendelian and multivariate survival prediction models...
2018: Journal of the American Statistical Association
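The Braun et al. abstract above builds its measurement error model on a Kaplan-Meier estimator. As a rough illustration only, the following is a minimal sketch of the plain (unsmoothed) product-limit estimator for right-censored data. It is not the authors' smoothed estimator or their Monte Carlo adjustment, and the function name `kaplan_meier` and its argument layout are assumptions made for this example.

```python
import numpy as np

def kaplan_meier(times, events):
    """Plain Kaplan-Meier (product-limit) survival estimate.

    times  : observed follow-up times (event or censoring time)
    events : 1/True if the event was observed, 0/False if censored
    Returns the distinct event times and the estimated survival
    probability just after each of them.
    NOTE: illustrative sketch, unsmoothed; not the method of the paper.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])  # sorted distinct event times
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # still under observation at t
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk              # product-limit update
        surv.append(s)
    return event_times, np.array(surv)
```

Calling `kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])` returns the three event times 1, 2 and 3 together with the step heights of the estimated survival curve; censored observations only shrink the risk set.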
Kin Yau Wong, Donglin Zeng, D Y Lin
Structural equation modeling is commonly used to capture complex structures of relationships among multiple variables, both latent and observed. We propose a general class of structural equation models with a semiparametric component for potentially censored survival times. We consider nonparametric maximum likelihood estimation and devise a combined Expectation-Maximization and Newton-Raphson algorithm for its implementation. We establish conditions for model identifiability and prove the consistency, asymptotic normality, and semiparametric efficiency of the estimators...
2018: Journal of the American Statistical Association
HaiYing Wang, Rong Zhu, Ping Ma
For massive data, subsampling algorithms are a popular way to downsize the data volume and reduce the computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this paper, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and asymptotic normality of the estimator from a general subsampling algorithm, and then derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator...
2018: Journal of the American Statistical Association
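The Wang, Zhu and Ma abstract above describes approximating the logistic regression maximum likelihood estimate from a nonuniform subsample. A minimal sketch of one such two-step scheme might look as follows. The specific sampling probabilities used here, proportional to |y_i - p_i| times the norm of x_i, are an illustrative assumption and are not stated in the abstract; the function names and the parameters `r0` and `r` are likewise hypothetical.

```python
import numpy as np

def fit_weighted_logistic(X, y, w, n_iter=50, tol=1e-10):
    """Weighted logistic MLE by Newton-Raphson (no intercept is added)."""
    w = w / w.mean()  # rescaling weights does not change the maximizer
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def two_step_subsample(X, y, r0, r, rng):
    """Pilot estimate from a uniform subsample, then a nonuniform subsample.

    ASSUMPTION: probabilities proportional to |y - p| * ||x|| are an
    illustrative choice, not taken from the abstract.
    """
    n = X.shape[0]
    # Step 1: uniform pilot subsample of size r0.
    idx0 = rng.choice(n, size=r0, replace=True)
    beta0 = fit_weighted_logistic(X[idx0], y[idx0], np.ones(r0))
    # Step 2: data-dependent probabilities based on the pilot fit.
    p = 1.0 / (1.0 + np.exp(-X @ beta0))
    score = np.abs(y - p) * np.linalg.norm(X, axis=1)
    probs = score / score.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Inverse-probability weights correct for the unequal sampling.
    return fit_weighted_logistic(X[idx], y[idx], 1.0 / probs[idx])
```

The inverse-probability weights are what keep the subsample estimate targeting the full-data estimate; without them the nonuniform sampling would bias the fit.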
Alexander R Luedtke, Mark J van der Laan
Suppose one has a collection of parameters indexed by a (possibly infinite dimensional) set. Given data generated from some distribution, the objective is to estimate the maximal parameter in this collection evaluated at the distribution that generated the data. This estimation problem is typically non-regular when the maximizing parameter is non-unique, and as a result standard asymptotic techniques generally fail in this case. We present a technique for developing parametric-rate confidence intervals for the quantity of interest in these non-regular settings...
2018: Journal of the American Statistical Association
Abhra Sarkar, Debdeep Pati, Antik Chakraborty, Bani K Mallick, Raymond J Carroll
We consider the problem of multivariate density deconvolution when interest lies in estimating the distribution of a vector-valued random variable X but precise measurements on X are not available, observations being contaminated by measurement errors U. The existing sparse literature on the problem assumes the density of the measurement errors to be completely known. We propose robust Bayesian semiparametric multivariate deconvolution approaches when the measurement error density of U is not known but replicated proxies are available for at least some individuals...
2018: Journal of the American Statistical Association
BaoLuo Sun, Eric J Tchetgen Tchetgen
The development of coherent missing data models to account for nonmonotone missing at random (MAR) data by inverse probability weighting (IPW) remains to date largely unresolved. As a consequence, IPW has essentially been restricted for use only in monotone missing data settings. We propose a class of models for nonmonotone missing data mechanisms that spans the MAR model, while allowing the underlying full data law to remain unrestricted. For parametric specifications within the proposed class, we introduce an unconstrained maximum likelihood estimator for estimating the missing data probabilities which can be easily implemented using existing software...
2018: Journal of the American Statistical Association
Zhao Chen, Jianqing Fan, Runze Li
Error variance estimation plays an important role in statistical inference for high-dimensional regression models. This paper is concerned with error variance estimation in high-dimensional sparse additive models. We study the asymptotic behavior of the traditional mean squared error, the naive estimate of error variance, and show that it may significantly underestimate the error variance due to spurious correlations, which are even higher in nonparametric models than in linear models. We further propose an accurate estimate for error variance in ultrahigh-dimensional sparse additive models by effectively integrating sure independence screening and refitted cross-validation techniques (Fan, Guo and Hao, 2012)...
2018: Journal of the American Statistical Association
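The Chen, Fan and Li abstract above combines screening with refitted cross-validation to avoid the downward bias of the naive variance estimate. The idea can be sketched in a simplified setting: the code below uses a linear model rather than a sparse additive model, and top-k marginal correlation screening as a stand-in for sure independence screening. Both simplifications, and all function names, are assumptions made for illustration.

```python
import numpy as np

def screen_top_k(X, y, k):
    """Indices of the k columns with the largest absolute marginal correlation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[-k:]

def ols_residual_variance(X, y, cols):
    """Residual variance of an OLS fit (with intercept) on the chosen columns."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return resid @ resid / (len(y) - Z.shape[1])

def naive_variance(X, y, k):
    """Screen and refit on the same data; prone to underestimation."""
    return ols_residual_variance(X, y, screen_top_k(X, y, k))

def rcv_variance(X, y, k, rng):
    """Refitted cross-validation: screen on one half, refit on the other."""
    perm = rng.permutation(len(y))
    a, b = perm[: len(y) // 2], perm[len(y) // 2:]
    v1 = ols_residual_variance(X[b], y[b], screen_top_k(X[a], y[a], k))
    v2 = ols_residual_variance(X[a], y[a], screen_top_k(X[b], y[b], k))
    return 0.5 * (v1 + v2)  # average the two half-sample estimates
```

Because the variables are selected on one half and the residuals are computed on the other, columns that were picked for their chance correlation with the noise no longer absorb that noise in the refit.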
Yuanjia Wang, Haoda Fu, Donglin Zeng
Individualized medical decision making is often complex due to patient treatment response heterogeneity. Pharmacotherapy may exhibit distinct efficacy and safety profiles for different patient populations. An "optimal" treatment that maximizes clinical benefit for a patient may also raise safety concerns due to a high risk of adverse events. Thus, to guide individualized clinical decision making and deliver optimal tailored treatments, maximizing clinical benefit should be considered in the context of controlling for potential risk...
2018: Journal of the American Statistical Association
Jeffrey W Miller, Matthew T Harrison
A natural Bayesian approach for mixture models with an unknown number of components is to take the usual finite mixture model with symmetric Dirichlet weights, and put a prior on the number of components, that is, to use a mixture of finite mixtures (MFM). The most commonly used method of inference for MFMs is reversible jump Markov chain Monte Carlo, but it can be nontrivial to design good reversible jump moves, especially in high-dimensional spaces. Meanwhile, there are samplers for Dirichlet process mixture (DPM) models that are relatively simple and are easily adapted to new applications...
2018: Journal of the American Statistical Association
Quefeng Li, Guang Cheng, Jianqing Fan, Yuyan Wang
Factor modeling is an essential tool for exploring intrinsic dependence structures among high-dimensional random variables. Much progress has been made for estimating the covariance matrix from a high-dimensional factor model. However, the blessing of dimensionality has not yet been fully embraced in the literature: much of the available data are often ignored in constructing covariance matrix estimates. If our goal is to accurately estimate a covariance matrix of a set of targeted variables, shall we employ additional data, which are beyond the variables of interest, in the estimation? In this article, we provide sufficient conditions for an affirmative answer, and further quantify its gain in terms of Fisher information and convergence rate...
2018: Journal of the American Statistical Association
Yin Xia, Tianxi Cai, T Tony Cai
Making accurate inference for gene regulatory networks, including inference about pathway-by-pathway interactions, is an important and difficult task. Motivated by such genomic applications, we consider multiple testing for conditional dependence between subgroups of variables. Under a Gaussian graphical model framework, the problem is translated into simultaneous testing for a collection of submatrices of a high-dimensional precision matrix with each submatrix summarizing the dependence structure between two subgroups of variables...
2018: Journal of the American Statistical Association
Eric B Laber, Ana-Maria Staicu
Evidence-based personalized medicine formalizes treatment selection as an individualized treatment regime that maps up-to-date patient information into the space of possible treatments. Available patient information may include static features such as race, gender, family history, genetic and genomic information, as well as longitudinal information including the emergence of comorbidities, waxing and waning of symptoms, side-effect burden, and adherence. Dynamic information measured at multiple time points before treatment assignment should be included as input to the treatment regime...
2017: Journal of the American Statistical Association
Yuan Huang, Qingzhao Zhang, Sanguo Zhang, Jian Huang, Shuangge Ma
For data with high-dimensional covariates but small sample sizes, the analysis of single datasets often generates unsatisfactory results. The integrative analysis of multiple independent datasets provides an effective way of pooling information and outperforms single-dataset and several alternative multi-dataset methods. Under many scenarios, multiple datasets are expected to share common important covariates, that is, the corresponding models have similarity in their sparsity structures. However, the existing methods do not have a mechanism to promote the similarity in sparsity structures in integrative analysis...
2017: Journal of the American Statistical Association
