Statistical inference in research studies conducted within the VA system, which inform VA policy decisions, healthcare initiatives, and delivery of patient care, frequently requires complex generalized linear models (GLMs). The validity of the inferences drawn from these models depends on the fit and appropriateness of the models. Effective, powerful model diagnostics are therefore critical for establishing the validity and trustworthiness of the inferences drawn from fitting these statistical models to research data.
For multiple linear regression models, there is a sizable body of statistical theory and corresponding methodology for assessing whether a regression model fits the data well or whether there are problems with the specification of the model. One of the primary and most often applied diagnostic methods for linear regression is to plot residuals against fitted values and individual predictors. These residual plots provide a simple, readily interpretable, and powerful method for assessing model fit.
There is a growing body of theory and methodology for assessing the fit of GLMs. Much of the diagnostic theory and methods for GLMs, such as logistic regression and Poisson regression, stem from direct modifications of the residual-based theory for linear models. However, these adaptations of residual-based diagnostic methods tend not to perform as well for GLMs. In particular, residual plots for GLMs are neither as easy to interpret nor as useful for assessing fit as they are for linear regression models.
The objective of this research project was to develop an easy-to-use, readily interpretable graphical statistical methodology for assessing the fit of a GLM that addresses many of the shortcomings of current residual-based graphical methods. An additional objective was to investigate the development of a formal testing procedure for assessing the fit of a GLM, based on assessing the conditional independence of the outcome and predictors given the regression function.
For a correctly specified GLM, we can demonstrate that the predictors in the model and the outcome are independent conditional on the value of the regression function. This simple result opens possibilities for developing diagnostic techniques. We investigated and developed mathematical foundations for diagnostic methods based on this result and empirically examined the performance of these methods.
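In symbols, the conditional independence result can be sketched as follows. The notation is an assumption introduced here for illustration, not taken from the project: $m(X)$ denotes the regression function, $g$ the link function, and $\phi$ a dispersion parameter that is assumed not to vary with the predictors.

```latex
% Assumed notation: m(x) = E[Y | X = x] = g^{-1}(x^\top \beta); dispersion \phi does not depend on x.
\begin{align*}
  f_{Y \mid X}(y \mid x) &= f\bigl(y;\, m(x), \phi\bigr)
    && \text{(correctly specified GLM)} \\
  f_{Y \mid X,\, m(X)}(y \mid x, m) &= f_{Y \mid m(X)}(y \mid m)
    && \text{(no dependence on } x \text{ beyond } m\text{)} \\
  Y &\perp\!\!\!\perp X \mid m(X)
    && \text{(conditional independence)}
\end{align*}
```

The first line says the distribution of the outcome depends on the predictors only through the regression function; the second and third lines restate that fact as conditional independence.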
The conditional independence of the outcome and predictors implies that, when the estimated regression function is close to the true regression function, simple scatterplots of the outcome against the predictors, conditional on the value of the estimated regression function (say, restricted to a small neighborhood of values), should exhibit independence. For misspecified models, several of these plots could be expected to exhibit systematic patterns of association that highlight the nature or form of the model misspecification. In practice, though, this approach was difficult to use because of the large number of plots to examine and because, for misspecified models, the observed patterns varied across plots as a result of conditioning on different ranges of the predictors in the different plots.
However, the patterns observed in these conditional plots of the outcome against a covariate, in cases where the fitted model was missing a functional component of that covariate, led to further investigation. Mathematical results demonstrate that graphing smoothed differences between the estimated regression function and nonparametric estimates of the regression function, formed within small neighborhoods around values of the estimated regression function, against the respective covariates provides an easy-to-implement, readily interpretable diagnostic method. The procedure first computes the difference between the estimated regression function at a given sample point and a nonparametric local regression estimate of the same value, formed within a small neighborhood of sample points whose estimated regression function values are comparable. The procedure then aggregates these differences, plots them against the corresponding values of the predictors, and plots smooths of these values with respect to individual predictors. For well-specified models, the points in these plots tend to be centered around zero. For models omitting a functional component of the predictors, these plots tend to exhibit curvature consistent with the missing functional component. This general behavior is similar to the behavior of residual plots for linear regression models. Simulation studies and application of this method to existing data sets provide empirical evidence that this methodology is a useful approach to assessing the fit of a GLM.
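The steps of this procedure can be sketched in a few lines of Python. This is a rough illustration of one way to read the description, not the published splice plot algorithm. The function names, the neighborhood rule (the `k` sample points nearest in rank of fitted value), the local least-squares line used as the nonparametric estimate, and the equal-count binned mean used as the smoother are all assumptions made for the example. The fitted values `mu_hat` are assumed to come from a GLM fitted elsewhere.

```python
def local_line_estimate(xs, ys, x0):
    """Least-squares line through (xs, ys), evaluated at x0.

    Falls back to the plain mean of ys when xs has no spread.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        return mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return mean_y + (sxy / sxx) * (x0 - mean_x)


def fit_differences(y, mu_hat, x, k=150):
    """Return d_i = mu_hat_i - local estimate of E[Y | x_i].

    The local estimate uses only a window of k sample points whose fitted
    values are adjacent in rank to mu_hat_i (the assumed "small neighborhood").
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: mu_hat[i])
    diffs = [0.0] * n
    for rank, i in enumerate(order):
        # Window of k consecutive ranks, shifted inward at the two ends.
        start = max(0, min(rank - k // 2, n - k))
        hood = order[start:start + k]
        local = local_line_estimate(
            [x[j] for j in hood], [y[j] for j in hood], x[i]
        )
        diffs[i] = mu_hat[i] - local
    return diffs


def binned_smooth(x, d, n_bins=20):
    """Crude smooth of d against x: mean of d within equal-count bins of x.

    Returns a list of (mean x in bin, mean d in bin), ordered by x.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = max(1, len(order) // n_bins)
    curve = []
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        curve.append((sum(x[i] for i in idx) / len(idx),
                      sum(d[i] for i in idx) / len(idx)))
    return curve
```

Plotting the pairs `(x[i], diffs[i])` together with the curve from `binned_smooth` gives a display in the spirit described above: roughly flat around zero when the model is adequate, and curved when a functional component of `x` is missing. A kernel or loess smoother could replace the binned mean without changing the idea.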
In addition, using the conditional independence of the outcome and predictors given the regression function, we can show that the covariance between the expected value of the outcome given the regression function and the expected values of the predictors given the regression function is identical to the covariance between the outcome and the covariates. This simple result can be used to develop a formal test for lack of fit for a GLM. Specifically, we can use bootstrap resampling methods to compare the observed difference in covariance vectors to the difference expected under a well-fitting model. This testing procedure performs better if we consider the covariance between the outcome and the absolute values of the centered predictors.
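The covariance identity follows from the law of total covariance. The short derivation below is a sketch in assumed notation, with $m(X)$ standing for the regression function; the symbols are introduced here for illustration and are not taken from the project.

```latex
% Assumed notation: m(X) is the regression function; Y is the outcome; X is the predictor vector.
\begin{align*}
  \operatorname{Cov}(Y, X)
    &= \operatorname{E}\bigl[\operatorname{Cov}(Y, X \mid m(X))\bigr]
     + \operatorname{Cov}\bigl(\operatorname{E}[Y \mid m(X)],\, \operatorname{E}[X \mid m(X)]\bigr) \\
    &= \operatorname{Cov}\bigl(\operatorname{E}[Y \mid m(X)],\, \operatorname{E}[X \mid m(X)]\bigr)
\end{align*}
```

The first term vanishes because conditional independence of $Y$ and $X$ given $m(X)$ implies $\operatorname{Cov}(Y, X \mid m(X)) = 0$. For a misspecified model the first term need not be zero, which is what a test based on the difference between the two covariance vectors would aim to detect.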
These new diagnostic methods will lead to improved reliability of the models used in VA research and increased reliability of the inferences drawn from these research projects. The developed methodology will lead to more soundly established medical interventions and health programs.
- Nelson DB, Noorbaloochi S. Splice plots for generalized linear models. Communications in statistics: theory and methods. 2016 Jun 17; 45(12):3524-3540.
- Nelson DB, Noorbaloochi S. Splice plots for generalized linear models. Communications in statistics: theory and methods. 2015 Sep 3; doi: 10.1080/03610926.2014.895842.
- Nelson DB, Noorbaloochi S. Information preserving sufficient summaries for dimension reduction. Journal of multivariate analysis. 2013 Mar 1; 115(March):347-358.
- Nelson DB, Noorbaloochi S. Splice plots for proportional hazards models. [Abstract]. Proceedings / American Statistical Association. 2018 Jan 5; 1.
- Nelson DB, Noorbaloochi S. Graphical Diagnostic Methods for Generalized Linear Models and Proportional Hazards Models. Presented at: VA HSR&D / QUERI National Meeting; 2015 Jul 8; Philadelphia, PA.
- Nelson DB. Simple sensitivity methods to validly and appropriately address missing cessation outcomes. Paper presented at: American Public Health Association Annual Meeting and Exposition; 2010 Nov 8; Denver, CO.