P Ding, T J VanderWeele, J M Robins
Drawing causal inference with observational studies is the central pillar of many disciplines. One sufficient condition for identifying the causal effect is that the treatment-outcome relationship is unconfounded conditional on the observed covariates. It is often believed that the more covariates we condition on, the more plausible this unconfoundedness assumption is. This belief has had a huge impact on practical causal inference, suggesting that we should adjust for all pretreatment covariates. However, when there is unmeasured confounding between the treatment and outcome, estimators adjusting for some pretreatment covariate might have greater bias than estimators without adjusting for this covariate...
June 1, 2017: Biometrika
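The claim in the entry above, that adjusting for a pretreatment covariate can increase bias under unmeasured confounding, can be illustrated with a small simulation. This is a sketch under assumptions that do not come from the abstract: a linear model with unit coefficients, a covariate Z that affects only the treatment A, an unmeasured confounder U, and a true treatment effect of zero. In this setup the unadjusted regression coefficient has population bias 1/3, while adjusting for Z raises it to 1/2.

```python
import numpy as np

def ols_coef_on_treatment(A, Y, covariates=None):
    """Return the OLS coefficient of A when regressing Y on an intercept, A, and optional covariates."""
    cols = [np.ones_like(A), A]
    if covariates is not None:
        cols.append(covariates)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[1]

# Hypothetical data-generating process (illustrative only).
rng = np.random.default_rng(0)
n = 200_000
Z = rng.normal(size=n)            # pretreatment covariate, affects treatment only
U = rng.normal(size=n)            # unmeasured confounder
A = Z + U + rng.normal(size=n)    # treatment
Y = U + rng.normal(size=n)        # outcome; the true effect of A on Y is 0

unadjusted = ols_coef_on_treatment(A, Y)      # population value: 1/3
adjusted = ols_coef_on_treatment(A, Y, Z)     # population value: 1/2
print(f"bias without adjusting for Z: {unadjusted:.3f}")
print(f"bias after adjusting for Z:   {adjusted:.3f}")
```

Because the true effect is zero, each estimate equals its own bias. Conditioning on Z removes treatment variation unrelated to U, so the confounded share of the remaining variation grows.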
Seunggeun Lee, Wei Sun, Fred A Wright, Fei Zou
Unobserved environmental, demographic, and technical factors can negatively affect the estimation and testing of the effects of primary variables. Surrogate variable analysis, proposed to tackle this problem, has been widely used in genomic studies. To estimate hidden factors that are correlated with the primary variables, surrogate variable analysis performs principal component analysis either on a subset of features or on all features, but weighting each differently. However, existing approaches may fail to identify hidden factors that are strongly correlated with the primary variables, and the extra step of feature selection and weight calculation makes the theoretical investigation of surrogate variable analysis challenging...
June 1, 2017: Biometrika
Xin Gao, Raymond J Carroll
We consider situations where the data consist of a number of responses for each individual, which may include a mix of discrete and continuous variables. The data also include a class of predictors, where the same predictor may have different physical measurements across different experiments depending on how the predictor is measured. The goal is to select which predictors affect any of the responses, where the number of such informative predictors tends to infinity as the sample size increases. There are marginal likelihoods for each experiment; we specify a pseudolikelihood combining the marginal likelihoods, and propose a pseudolikelihood information criterion...
June 2017: Biometrika
D M Farewell, C Huang, V Didelez
Likelihood factors that can be disregarded for inference are termed ignorable. We demonstrate that close ties exist between ignorability and identification of causal effects by covariate adjustment. A graphical condition, stability, plays a role analogous to that of missingness at random, but is applicable to general longitudinal data. Our formulation of ignorability does not depend on any notion of missing data, so is appealing in situations where missing data may not actually exist. Several examples illustrate how stability may be assessed...
June 2017: Biometrika
Q Zhou, H Zhou, J Cai
The case-cohort design has been widely used as a means of cost reduction in assembling or measuring expensive covariates in large cohort studies. The existing literature on the case-cohort design is mainly focused on right-censored data. In practice, however, the failure time is often subject to interval-censoring; it is known only to fall within some random time interval. In this paper, we consider the case-cohort study design for interval-censored failure time and develop a sieve semiparametric likelihood approach for analyzing data from this design under the proportional hazards model...
March 2017: Biometrika
Kean Ming Tan, Yang Ning, Daniela M Witten, Han Liu
In classical statistics, much thought has been put into experimental design and data collection. In the high-dimensional setting, however, experimental design has been less of a focus. In this paper, we stress the importance of collecting multiple replicates for each subject in this setting. We consider learning the structure of a graphical model with latent variables, under the assumption that these variables take a constant value across replicates within each subject. By collecting multiple replicates for each subject, we are able to estimate the conditional dependence relationships among the observed variables given the latent variables...
December 2016: Biometrika
Anirban Bhattacharya, Antik Chakraborty, Bani K Mallick
We propose an efficient way to sample from a class of structured multivariate Gaussian distributions. The proposed algorithm only requires matrix multiplications and linear system solutions. Its computational complexity grows linearly with the dimension, unlike existing algorithms that rely on Cholesky factorizations with cubic complexity. The algorithm is broadly applicable in settings where Gaussian scale mixture priors are used on high-dimensional parameters. Its effectiveness is illustrated through a high-dimensional regression problem with a horseshoe prior on the regression coefficients...
December 2016: Biometrika
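A sampler of the kind described in the entry above can be sketched as follows. The abstract does not specify the target distribution, so this sketch assumes the form N(mu, Sigma) with Sigma = (Phi^T Phi + D^{-1})^{-1} and mu = Sigma Phi^T alpha, where Phi is n by p and D is a positive diagonal matrix; the function and argument names are hypothetical. The only linear solve is n by n, so the cost grows linearly in the dimension p for fixed n.

```python
import numpy as np

def sample_structured_gaussian(Phi, d, alpha, rng):
    """Draw one sample from N(mu, Sigma), assuming
    Sigma = (Phi^T Phi + D^{-1})^{-1}, mu = Sigma Phi^T alpha, D = diag(d)."""
    n, p = Phi.shape
    u = rng.normal(size=p) * np.sqrt(d)    # u ~ N(0, D)
    delta = rng.normal(size=n)             # delta ~ N(0, I_n)
    v = Phi @ u + delta
    M = (Phi * d) @ Phi.T + np.eye(n)      # Phi D Phi^T + I_n, an n x n matrix
    w = np.linalg.solve(M, alpha - v)      # the only linear system solved
    return u + d * (Phi.T @ w)             # theta = u + D Phi^T w

# Illustrative usage with made-up dimensions.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(3, 6))
d = rng.uniform(0.5, 2.0, size=6)
alpha = rng.normal(size=3)
theta = sample_structured_gaussian(Phi, d, alpha, rng)
```

The draw is correct because (u, v) is jointly Gaussian and the returned vector has the conditional law of u given v = alpha, whose mean and covariance match mu and Sigma by the Woodbury identity.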
Gongjun Xu, Lifeng Lin, Peng Wei, Wei Pan
Several two-sample tests for high-dimensional data have been proposed recently, but they are powerful only against certain limited alternative hypotheses. In practice, since the true alternative hypothesis is unknown, it is unclear how to choose a powerful test. We propose an adaptive test that maintains high power across a wide range of situations, and study its asymptotic properties. Its finite sample performance is compared with existing tests. We apply it and other tests to detect possible associations between bipolar disease and a large number of single nucleotide polymorphisms on each chromosome based on a genome-wide association study dataset...
September 2016: Biometrika
Ai Ni, Jianwen Cai, Donglin Zeng
Case-cohort designs are widely used in large cohort studies to reduce the cost associated with covariate measurement. In many such studies the number of covariates is very large, so an efficient variable selection method is necessary. In this paper, we study the properties of a variable selection procedure using the smoothly clipped absolute deviation penalty in a case-cohort design with a diverging number of parameters. We establish the consistency and asymptotic normality of the maximum penalized pseudo-partial-likelihood estimator, and show that the proposed variable selection method is consistent and has an asymptotic oracle property...
September 2016: Biometrika
S Yang, J J Lok
Coarse structural nested mean models are tools to estimate treatment effects from longitudinal observational data with time-dependent confounding. There is, however, no guidance on how to specify the treatment effect model, and model misspecification can lead to bias. We derive a goodness-of-fit test based on modified overidentification restrictions tests for evaluating a treatment effect model, and show that our test statistic is doubly-robust in the sense that, with a correct treatment effect model, the test has the correct type-I error if either the treatment initiation model or a nuisance regression outcome model is correctly specified...
September 2016: Biometrika
Peng Ding, Tyler J VanderWeele
It is often of interest to decompose the total effect of an exposure into a component that acts on the outcome through some mediator and a component that acts independently through other pathways. Said another way, we are interested in the direct and indirect effects of the exposure on the outcome. Even if the exposure is randomly assigned, it is often infeasible to randomize the mediator, leaving the mediator-outcome confounding not fully controlled. We develop a sensitivity analysis technique that can bound the direct and indirect effects without parametric assumptions about the unmeasured mediator-outcome confounding...
June 2016: Biometrika
Wang Miao, Eric J Tchetgen Tchetgen
Suppose we are interested in the mean of an outcome variable missing not at random. Suppose however that one has available a fully observed shadow variable, which is associated with the outcome but independent of the missingness process conditional on covariates and the possibly unobserved outcome. Such a variable may be a proxy or a mismeasured version of the outcome and is available for all individuals. We have previously established necessary and sufficient conditions for identification of the full data law in such a setting, and have described semiparametric estimators including a doubly robust estimator of the outcome mean...
June 2016: Biometrika
Jae Kwang Kim, Yongchan Kwon, Myunghee Cho Paik
Weighting adjustment is commonly used in survey sampling to correct for unit nonresponse. In cluster sampling, the missingness indicators are often correlated within clusters and the response mechanism is subject to cluster-specific nonignorable missingness. Based on a parametric working model for the response mechanism that incorporates cluster-specific nonignorable missingness, we propose a method of weighting adjustment. We provide a consistent estimator of the mean or totals in cases where the study variable follows a generalized linear mixed-effects model...
June 2016: Biometrika
M Oguz-Alper, Y G Berger
Survey data are often collected with unequal probabilities from a stratified population. In many modelling situations, the parameter of interest is a subset of a set of parameters, with the others treated as nuisance parameters. We show that in this situation the empirical likelihood ratio statistic follows a chi-squared distribution asymptotically, under stratified single and multi-stage unequal probability sampling, with negligible sampling fractions. Simulation studies show that the empirical likelihood confidence interval may achieve better coverages and has more balanced tail error rates than standard approaches involving variance estimation, linearization or resampling...
June 2016: Biometrika
C Hennig, C Viroli
Classification with small samples of high-dimensional data is important in many application areas. Quantile classifiers are distance-based classifiers that require a single parameter, regardless of the dimension, and classify observations according to a sum of weighted componentwise distances of the components of an observation to the within-class quantiles. An optimal percentage for the quantiles can be chosen by minimizing the misclassification error in the training sample. It is shown that this choice is consistent for the classification rule with the asymptotically optimal quantile and that under some assumptions, as the number of variables goes to infinity, the probability of correct classification converges to unity...
June 2016: Biometrika
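The quantile classifier in the entry above can be sketched in a few lines. The abstract does not give the distance formula, so this sketch assumes the asymmetric check-loss distance to the within-class quantile, with weight theta above the quantile and 1 - theta below it; all function names are hypothetical. The quantile level theta is chosen from a grid by minimizing the training misclassification rate, as the abstract describes.

```python
import numpy as np

def check_distance(x, q, theta):
    """Asymmetric distance from x to quantile q: theta*(x-q) above q, (1-theta)*(q-x) below."""
    diff = x - q
    return np.where(diff > 0, theta * diff, (theta - 1.0) * diff)

def fit_quantiles(X, y, theta):
    """Componentwise theta-quantiles within each class; returns (classes, k x p array)."""
    classes = np.unique(y)
    quantiles = np.stack([np.quantile(X[y == c], theta, axis=0) for c in classes])
    return classes, quantiles

def predict(X, classes, quantiles, theta):
    """Assign each row of X to the class with the smallest summed componentwise distance."""
    dist = check_distance(X[:, None, :], quantiles[None, :, :], theta).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def choose_theta(X, y, grid):
    """Pick the grid value of theta with the lowest training misclassification rate."""
    errors = []
    for theta in grid:
        classes, quantiles = fit_quantiles(X, y, theta)
        errors.append(np.mean(predict(X, classes, quantiles, theta) != y))
    best = int(np.argmin(errors))
    return grid[best], errors[best]

# Illustrative toy data: two well-separated classes in two dimensions.
X = np.vstack([np.arange(10).reshape(5, 2) / 10, 5 + np.arange(10).reshape(5, 2) / 10])
y = np.array([0] * 5 + [1] * 5)
theta, train_error = choose_theta(X, y, np.array([0.25, 0.5, 0.75]))
```

Only the single parameter theta is tuned, whatever the number of variables, which is what makes the approach attractive for small samples of high-dimensional data.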
G Alexandrovich, H Holzmann, A Leister
Nonparametric identification and maximum likelihood estimation for finite-state hidden Markov models are investigated. We obtain identification of the parameters as well as the order of the Markov chain if the transition probability matrices have full-rank and are ergodic, and if the state-dependent distributions are all distinct, but not necessarily linearly independent. Based on this identification result, we develop a nonparametric maximum likelihood estimation theory. First, we show that the asymptotic contrast, the Kullback-Leibler divergence of the hidden Markov model, also identifies the true parameter vector nonparametrically...
June 2016: Biometrika
Ting Fung Ma, Chun Yip Yau
This paper develops a composite likelihood-based approach for multiple changepoint estimation in multivariate time series. We derive a criterion based on pairwise likelihood and minimum description length for estimating the number and locations of changepoints and for performing model selection in each segment. The number and locations of the changepoints can be consistently estimated under mild conditions and the computation can be conducted efficiently with a pruned dynamic programming algorithm. Simulation studies and real data examples demonstrate the statistical and computational efficiency of the proposed method...
June 2016: Biometrika
E Olusegun George, Kyeongmi Cheon, Yilian Yuan, Aniko Szabo
We derive an expression for the joint distribution of exchangeable multinomial random variables, which generalizes the multinomial distribution based on independent trials while retaining some of its important properties. Unlike de Finetti's representation theorem for a binary sequence, the exchangeable multinomial distribution derived here does not require that the finite set of random variables under consideration be a subset of an infinite sequence. Using expressions for higher moments and correlations, we show that the covariance matrix for exchangeable multinomial data has a different form from that usually assumed in the literature, and we analyse data from developmental toxicology studies...
June 2016: Biometrika
Jeng-Min Chiou, Hans-Georg Müller
Functional data vectors consisting of samples of multivariate data where each component is a random function are encountered increasingly often but have not yet been comprehensively investigated. We introduce a simple pairwise interaction model that leads to an interpretable and straightforward decomposition of multivariate functional data and of their variation into component-specific processes and pairwise interaction processes. The latter quantify the degree of pairwise interactions between the components of the functional data vectors, while the component-specific processes reflect the functional variation of a particular functional vector component that cannot be explained by the other components...
June 2016: Biometrika
Shu-Ching Chang, Dale L Zimmerman
Antedependence models, also known as transition models, have proven to be useful for longitudinal data exhibiting serial correlation, especially when the variances and/or same-lag correlations are time-varying. Statistical inference procedures associated with normal antedependence models are well-developed and have many nice properties, but they are not appropriate for longitudinal data that exhibit considerable skewness. We propose two direct extensions of normal antedependence models to skew-normal antedependence models...
June 2016: Biometrika
