Causal analyses based on non-representative, non-randomized samples with a large number of measured confounders are commonly conducted in health sciences research. Selection bias, baseline imbalance or confounding, and high dimensionality of the confounding covariates are important problems to be addressed in these analyses. Currently available methods fall short because they either make adjustments relative to the biased sample population or require the biased sample to be augmented with a representative sample of the target population.
We developed a statistical method to concurrently adjust for selection bias, case-mix imbalance, and high-dimensional baseline measurements in generalized linear models for ordinal response measures when the sample inclusion probability is a monotone (directed) function of the ordinal response (e.g., the more "severe" a case is, the less likely it will be in the sample). This laid the foundation for developing methodology applicable to the family of generalized linear models with ordinal or continuous response. Development and empirical assessment of flexible, robust analytical methods for simultaneously addressing these issues will directly benefit VA HSR&D and indirectly benefit the populations served by the VA.
The research comprised two phases:
(1) Theoretical development: We used weighted distributions to model the likelihood of selection-biased samples as a function of the distribution in the target population and a weight function that is monotone in the outcome. We investigated nonparametric and potential parametric methods to model the selection mechanism. We parametrized the outcome distribution in the target population using an ordinal logit model and then identified efficient algorithms to find the maximum likelihood estimates of the parameters. To facilitate comparisons of outcomes among levels of an intervention, we developed a selection-adjusted propensity to address confounding and high-dimensional covariates. Using the adjusted propensity, we developed a stratified selection-adjusted logistic model that combines these pieces of information into an overall measure of intervention effect. This covariate reduction and stratification should greatly simplify maximum likelihood estimation.
(2) Empirical performance: We used both simulated and real data sets to demonstrate how the developed methodologies perform with respect to the proposed theory. We then demonstrated the applicability of our methodology in ancillary analyses of some "real world" problems.
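As a rough illustration of the weighted-distribution idea in phase (1), the sketch below computes category probabilities under a proportional-odds (ordinal logit) model and then tilts them by a weight function that decreases with the outcome. The result is the distribution one would expect to observe in a selection-biased sample. The function names, cutpoints, linear predictor, and weights are hypothetical values chosen for illustration; they are not quantities or code from the study.

```python
import math


def ordinal_logit_probs(cutpoints, linear_predictor):
    """Category probabilities under a proportional-odds model.

    Uses P(Y <= k | x) = logistic(cutpoint_k - linear_predictor),
    with cutpoints given in increasing order.
    """
    cdf = [1.0 / (1.0 + math.exp(-(c - linear_predictor))) for c in cutpoints]
    cdf.append(1.0)  # the highest category closes the distribution
    # Differences of the cumulative probabilities give category probabilities.
    return [cdf[0]] + [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]


def selection_biased_probs(target_probs, weights):
    """Weighted distribution: p_w(k) = w(k) p(k) / sum_j w(j) p(j)."""
    unnormalized = [w * p for w, p in zip(weights, target_probs)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def log_likelihood(observed_counts, target_probs, weights):
    """Log-likelihood of category counts observed in the biased sample."""
    biased = selection_biased_probs(target_probs, weights)
    return sum(n * math.log(p) for n, p in zip(observed_counts, biased))


# Hypothetical inputs: four ordered categories (0 = least severe).
cutpoints = [-1.0, 0.5, 2.0]
weights = [1.0, 0.8, 0.5, 0.2]  # assumed: inclusion less likely as severity rises

target = ordinal_logit_probs(cutpoints, linear_predictor=0.3)
biased = selection_biased_probs(target, weights)
loglik = log_likelihood([40, 30, 20, 10], target, weights)
```

With weights that decrease in the outcome, the biased-sample distribution shifts mass toward the less severe categories, which is why a naive analysis of such a sample would understate severity in the target population. In practice the weights are unknown and must be modeled or estimated, which this sketch does not attempt.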
The developed methods were applied to a set of data on self-reported male sexual assaults. Such self-reports are usually directionally biased: subjects who experience assault tend to underreport it, and this underreporting can vary with other predictors such as age. Using likelihood and Bayesian methods, we estimated the prevalence of cases using GLM modeling. We also provided a framework for causal inference by treating the unknown balanced covariate distribution as a nuisance parameter and showed how some of the elimination methods (Bayesian marginalization, sufficiency, and ancillarity) can be applied to achieve this goal. These results were demonstrated for the members of the exponential family that are commonly used in generalized linear modeling.
Covariate imbalance in causal analysis was reformulated as a problem of eliminating nuisance variables. We showed, within a counterfactual balanced setting, how averaging, conditioning, and marginalization techniques can be used to reduce bias due to a possibly large number of imbalanced baseline confounders. Examples for exponential families and elliptically symmetric families of distributions were provided.
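To make the averaging idea concrete, the sketch below uses direct standardization over a common covariate distribution, a deliberately simple stand-in for the more general elimination techniques described above. It is not the method developed in this project. The two strata, their sizes, and their outcome means are hypothetical numbers constructed so that the intervention is imbalanced across a single binary confounder.

```python
def crude_difference(strata):
    """Treated-minus-control difference in mean outcome, ignoring strata."""
    n_t = sum(s["n_treated"] for s in strata)
    n_c = sum(s["n_control"] for s in strata)
    mean_t = sum(s["n_treated"] * s["mean_treated"] for s in strata) / n_t
    mean_c = sum(s["n_control"] * s["mean_control"] for s in strata) / n_c
    return mean_t - mean_c


def standardized_difference(strata):
    """Average the within-stratum differences over the pooled stratum sizes.

    Both arms are evaluated against the same (balanced) covariate
    distribution, so the confounder no longer drives the comparison.
    """
    total = sum(s["n_treated"] + s["n_control"] for s in strata)
    result = 0.0
    for s in strata:
        share = (s["n_treated"] + s["n_control"]) / total
        result += share * (s["mean_treated"] - s["mean_control"])
    return result


# Hypothetical data: treatment is concentrated in the low-risk stratum.
strata = [
    {"n_treated": 80, "mean_treated": 0.30, "n_control": 20, "mean_control": 0.20},
    {"n_treated": 20, "mean_treated": 0.70, "n_control": 80, "mean_control": 0.60},
]

crude = crude_difference(strata)
adjusted = standardized_difference(strata)
```

In this constructed example the crude difference is about -0.14, while the standardized difference is about 0.10, matching the within-stratum difference in both strata. The sign reversal comes entirely from the imbalance in the confounder, which is the kind of bias that balancing is meant to remove.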
Because selection bias is a sampling issue, we also showed how placing a prior distribution on the selection order in ordered designs can be used to define prior distributions over the population. For such priors, the Bayesian analysis uses information that standard frequentist methods incorporate through the sampling design. The resulting methods will often have good frequentist properties. We applied the methodology to two data sets and performed simulations to study the performance of the proposed methods.
Case-mix imbalance, selection bias, and large numbers of explanatory measures are common problems that threaten the validity of findings from health services research studies. Statistical methodology development addressing these challenges will benefit the field of health services research broadly, and the health care of veterans specifically, by helping researchers avoid improper methods that may yield misleading decisions or estimates of the effects of interventions and healthcare programs. This research responds to the methodological needs that VA statisticians face in their collaborative activities in HSR&D-funded studies. The results, as seen in examples presented in a number of papers, demonstrate the need to adjust our inferences and to participate in developing the innovative methodologies needed to answer our research questions.
- Noorbaloochi S. Hypotheses Testing as a Fuzzy Set Estimation Problem. Communications in Statistics: Theory and Methods. 2013 Apr 11; 42(10):1806-1820.
- Meeden G, Noorbaloochi S. Ordered Design and Bayesian Inference in Survey Sampling. Sankhya. 2010 Feb 1; 72-A(Part 1):119-135.
- Noorbaloochi S, Nelson D, Asgharian M. Balancing and Elimination of Nuisance Variables. The International Journal of Biostatistics. 2010 Jan 1; 6(2):Article 6.
- Noorbaloochi S, Nelson DB. Bias Reduction for High Dimensional Multivariate Data. Paper presented at: Joint Statistical Annual Meeting; 2010 Aug 5; Vancouver, Canada.
- Noorbaloochi S, Nelson DB, Asgharian M. Sufficiency, ancillarity and bias reduction for high dimensional predictor spaces. Presented at: Causal Inference in Statistics and the Quantitative Sciences Workshop; 2009 May 4; Banff, Canada.