147-31: An Evaluation of Splines in Linear Regression

Authors

  • Deborah Hurley
  • James Hussey
  • Robert McKeown
  • Cheryl Addy
Abstract

Linear regression is an analytic approach commonly used in public health to examine the relationship between continuous dependent (e.g., blood pressure) and independent (e.g., body mass index (BMI)) variables. For any regression procedure it is desirable to use models that closely fit the data. Transformations of the response variable can improve the fit and may correct violations of model assumptions such as constant error variance. Predictor variables may be separated into logical categories (e.g., age categories), or we may add terms that are functions of the existing predictors, such as quadratic or cubic terms. Other approaches, such as spline modeling, may provide a better fit by accounting for the variation in the relationship between the predictor variable and the response variable, both within and between levels of the predictor variable. There is no one best approach, however, as some modeling methods may produce better results for predicted values (e.g., narrower confidence intervals) than others, depending on the data. Analyses using splines are often cumbersome and their interpretations complex. Given these challenges, this project undertakes a simulation study to examine and compare several “traditional” models with spline models, under varying conditions (e.g., different sample sizes and magnitudes of variation) and for different data structures (e.g., true quadratic, cubic, or other data patterns), in an effort to determine whether spline regression models provide a significantly better fit under these conditions than a regression model employing a simple linear relationship, or one with power term(s). Each scenario was assessed for model “fit” versus model simplicity (i.e., whether the more complex spline regression models provide any real advantage over what can be obtained with SLR or power models). In general, the best model has the same structure as the data.
For these data, the choice of the best modeling method should take into account preliminary plots and the estimated standard deviation. Results show that splines are most appropriate when the plots of the data clearly indicate that they are needed (i.e., when the standard deviation is small enough that we can detect knots and changes in structure). When the plots do not show much detail (i.e., when the standard deviation is large), a simpler model (e.g., polynomial) is recommended. Results also reinforce the need to look at a plot of the predicted values for the model, as some of the usual selection criteria (MSE, PRESS, R²) can give similar results for various models while the coverage for these models may differ substantially.

INTRODUCTION/BACKGROUND

Linear regression is an analytic approach commonly used in public health to examine the relationship between numeric dependent (e.g., blood pressure) and independent (e.g., body mass index) variables. For any regression procedure, it is desirable to use models that closely fit the data. Transformations of the response variable can improve the fit and may correct violations of model assumptions such as constant error variance. We may also consider separating a predictor variable into logical categories (e.g., age categories), or adding terms that are functions of the existing predictors, such as quadratic or cubic terms. Still other methods, such as spline modeling, may provide a better fit by accounting for the variation in the relationship between the predictor variable and the response variable, both within and between levels of the predictor variable. There is no one best approach, however, as some modeling methods may produce better results for predicted values (e.g., narrower confidence intervals) than others, depending on the data.
Greenland (1995) suggests using spline regression (and fractional polynomial regression) as an alternative to categorical analysis for dose-response and trend analysis, stating that categorical analysis does not make use of within-category information and is based on an unrealistic model for dose-response and trends. Spline regression, he contends, is based on more realistic category-specific models that are especially worthwhile when nonlinearities are expected. Splines are lines or curves, usually required to be continuous and smooth. Univariate polynomial splines are piecewise polynomials in one variable of some degree d whose function values and first d − 1 derivatives agree at the points where they join. The join points (or abscissa values) that mark one transition to the next are referred to as break points, interior knots, or simply knots (Poirier, 1976; Eubank, 1999). Knots give the curve freedom to bend and more closely follow the data. Splines with few knots are generally smoother than splines with many knots; however, increasing the number of knots usually improves the fit of the spline function to the data (Hansen and Kooperberg, 2002). Spline functions can be applied to medical and epidemiological investigations. Such studies frequently involve survival analysis, linear dose-response problems, latency patterns, and data smoothing (to detect trends), among others. For example, to assess mortality in colon cancer using survival analysis methods, Bolard et al. (2002) used restricted cubic splines to model time-by-covariate interactions. In another study, of linear dose-response, Thurston et al. (2002) applied penalized spline methodology to a cohort study of autoworkers exposed to metalworking fluids to examine the linearity assumption for prostate and brain cancer mortality.
Hauptmann et al. (2001) used spline function models to investigate latency patterns for radon progeny exposure and lung cancer in a cohort of uranium miners in Colorado. In a study estimating longitudinal immunological and virological markers in HIV patients with individual antiretroviral treatment strategies, Brown et al. (2001) proposed univariate and bivariate cubic smoothing splines to fit CD4+ count and plasma viral load. There are many types of splines and estimation procedures (Gu, 2002; Eubank, 1999). The analyses presented in this paper focus on univariate splines in ordinary least squares regression. Knot selection (number and location of knots) can be accomplished by various methods. One can use predetermined knots, natural division points, or visual inspection of the data. There are also other (more complex) methods for knot selection, such as nonlinear least squares (Eubank, 1999). Predetermined knots are used in this paper. Because analyses using splines are often cumbersome and their interpretations complex, it is necessary to weigh model complexity against model fit in order to assess whether a much more complex model provides a significantly better fit. This project uses five criteria for the comparison of models. The first two are based on confidence intervals (CI) for the mean. These were examined to determine the proportion of times the true mean of the distribution was contained in the CI (i.e., coverage). The widths of these CIs were also compared between models. Coverage proportions were cross-referenced with confidence interval widths to assess any relationship (e.g., wider confidence intervals associated with greater coverage proportions, or deviation from what would be expected). The other three criteria are the mean square error (MSE), the PRESS statistic (prediction sum of squares), and the R² statistic.
Under the assumption that the correct model should give an unbiased estimate of the variance, values for MSE were compared to the variance used to generate the data to see if the variance was being over- or under-estimated. The PRESS statistic can be a good indication of the predictive power of a model. The PRESS residual for a point is the difference between its observed value and the value predicted by the model fit without that point; the PRESS statistic is the sum of the squared PRESS residuals. Since this is a sum of squared errors, a good model has a small PRESS statistic. Finally, recall that R² is the proportion of variation in the dependent variable explained by the model fit using the independent variable(s); we can examine this proportion to get an idea of how well the fitted equation describes the data. Together, these measures were used to assess whether or not the more complex spline regression models provide any real advantage over what can be obtained with SLR or power models.

METHODS

To address the tradeoff between model complexity and model fit, we conducted a simulation study to compare “traditional” regression models with spline models under varying conditions (e.g., different sample sizes and magnitudes of variation), for different data structures (e.g., true quadratic, cubic, or other data patterns). The goal was to determine if the added complexity of the spline regression models is justified by a significantly better fit (under certain conditions) than a regression model employing a simple linear relationship, or one with power term(s). Data were simulated for five different structures (patterns), using one dependent variable and one independent variable. Each of these five structures was generated with three different sample sizes (n) and two different standard deviations, for a total of 6 scenarios per structure. For each scenario, 2000 simulated data sets were generated.
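The leave-one-out PRESS computation described earlier can be sketched as follows. This is a Python sketch under our own naming (the paper's computations were done in SAS); it uses the standard identity that each PRESS residual equals eᵢ/(1 − hᵢᵢ), where hᵢᵢ is the i-th leverage, so the model never has to be refit n times:

```python
import numpy as np

# Sketch of the PRESS statistic for an OLS fit (function name is ours, not
# from the paper). Each PRESS residual e_i / (1 - h_ii) equals the residual
# obtained when the model is fit without observation i.
def press_statistic(X, y):
    """Sum of squared leave-one-out (PRESS) residuals for OLS."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta                      # ordinary residuals
    # Leverages h_ii = diag(X (X'X)^{-1} X') without forming the full hat matrix.
    hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    return float(np.sum((resid / (1.0 - hat)) ** 2))
```

Because |eᵢ/(1 − hᵢᵢ)| ≥ |eᵢ|, PRESS is never smaller than the ordinary residual sum of squares; a PRESS much larger than the SSE flags models whose fit depends heavily on individual points.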
Six different regression models were evaluated for each of the 30 scenarios. These models were simple linear regression (SLR), polynomial regression (quadratic and cubic), and spline regression (linear, quadratic, and cubic). All data generation and analyses were completed using SAS version 8.2.

DATA STRUCTURES

The data sets were designed to follow somewhat realistic patterns and to encompass the various scenarios and restrictions. Data for the first three structures were generated to follow a general quadratic-type pattern, while data for the last two structures were generated to follow a general cubic-type pattern. Because the values of the independent variable (x) are evenly spaced from 0 to 100 (0 ≤ x ≤ 100), the different sample sizes (n = 51, 201, 1001) also correspond to the density of the points in terms of the independent variable. For each of the five data structures, two knots were chosen to allow the function to vary on up to three segments per data structure. The knots were placed at x = 32 and x = 68, which creates nearly equal intervals between 0 and 100; this keeps the number of observations within each segment approximately the same. The values for the mean of y were calculated using the quadratic, cubic, or piecewise equations needed to produce the shape of each segment for the data structure of interest. Normally distributed random error was added to each mean value to produce the data point for the dependent variable, using yᵢ = f(xᵢ) + εᵢ, where yᵢ is the dependent variable, f(xᵢ) is the function used to generate the mean value of y, xᵢ is a value between 0 and 100 (generated in increments of 100/(n − 1), depending on the sample size n), and εᵢ is the random error, produced by multiplying the standard deviation (SD) by a random standard normal value.
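A minimal Python sketch of this generating recipe (the paper used SAS; the quadratic coefficients below are ours, chosen only to satisfy the stated constraints f(0) = 0.9, minimum ≈ 0.1, f(100) = 0.5, and are not taken from the paper):

```python
import numpy as np

def simulate(f, n=201, sd=0.03, seed=None):
    """Generate y_i = f(x_i) + eps_i on an evenly spaced grid over [0, 100]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 100.0, n)        # increments of 100/(n - 1)
    eps = sd * rng.standard_normal(n)     # N(0, SD^2) random error
    return x, f(x) + eps

# An illustrative quadratic mean curve with f(0) = 0.9, min f ≈ 0.1,
# f(100) = 0.5 (not the paper's exact curve):
quad = lambda x: 0.9 - 0.0273137 * x + 2.33137e-4 * x ** 2
x, y = simulate(quad, n=51, sd=0.03, seed=1)
```

With sd=0.03 a plot of (x, y) shows the quadratic shape clearly; with sd=0.1 the pattern is visible only on close inspection, matching the two variability levels described above.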
Different values of the SD correspond to different amounts of variation of the points (y) about the true, underlying relationship. Visual inspection of randomly selected plots of the data was used to determine the desired amount of variation. The larger of the two values chosen for the SD was 0.1; this value produced enough variability that the original pattern was discernible only on close inspection. The smaller value was 0.03; this produced data with some variability but where the original pattern was clearly evident when plotted. Structures one through three are similar in that all take on a quadratic or quadratic-like form (see Figure 1). Structure one consists of three linear segments. Structure two consists of a middle linear segment sandwiched between two quadratic segments. Both structures are decreasing, constant, and increasing on the same intervals, with changes in structure occurring at the predetermined knots. Structure three is purely quadratic over the same sample space. All three structures pass through y = 0.9 at x = 0, have a minimum of y = 0.1, and pass through y = 0.5 at x = 100. Structures four and five have a general cubic form (see Figure 1). Structure four consists of three linear segments (increasing, constant, then increasing), with changes in structure occurring at the predetermined knots. Structure five is purely cubic (increasing, decreasing, then increasing), with a local maximum at x = 32 and a local minimum at x = 68 (the knots). Both structures pass through y = 0.1 at x = 0 and y = 0.9 at x = 100.

REGRESSION MODELS

The six regression models were fit to each simulated data set using the PROC REG procedure in SAS. The SLR and power models were of the usual form (e.g., cubic model: y = β0 + β1x + β2x² + β3x³ + ε), including the highest-order term and all lower-order terms. The three spline models were a linear spline, a quadratic spline, and a cubic spline.
For example, the cubic spline model is y = β0 + β1x + β2x² + β3x³ + β4x4 + β5x5 + ε, where x4 and x5 are the terms that make the cubic piecewise. Here, x4 represents the change in the cubic term when x is greater than 32 (x4 = (x − 32)³ for x > 32, and 0 otherwise), and x5 represents the change in the cubic term when x is greater than 68 (x5 = (x − 68)³ for x > 68, and 0 otherwise).

ANALYSES

Predicted means and their corresponding 95 percent CIs were calculated at seven “test” points of interest (x = 0, 16, 32, 50, 68, 84, 100). These points include the knots (x = 32 and 68), the two endpoints (x = 0 and 100), the midpoint (x = 50), and two points midway between each endpoint and the knot closest to it (x = 16 and 84). The observed proportion, p̂, of the 2000 replications in which the true mean values were contained in the corresponding CI was recorded, and a Wald confidence interval was then calculated for this proportion using the formula p̂ ± 1.96 √(p̂(1 − p̂)/2000). An interval that does not contain 0.95 indicates that the actual coverage differs from the nominal 95%. Values of p̂ in the range (0.9396, 0.9587) produce CIs that include 0.95. To get an idea of overall coverage, and for comparative purposes, this interval was also calculated for the set of all points in each data set, not just the individual “test” points. (Note that the derivation of the CI is for 2000 repetitions at a single point and is not strictly applicable to “all x”.) Additionally, means and SDs of the widths of the CIs were calculated. Finally, univariate analyses were done for the mean square error (MSE), the PRESS statistic, and the R² statistic. Results for MSE were compared by model within each of the 6 scenarios to see if the empirical values were similar to the true variance. The results for the PRESS statistic were compared by model within each scenario (smaller is better), as were the R² statistics (larger is better). Results for all three statistics were also graphed as histograms.
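As an illustration of the cubic spline design and the prediction step described above, here is a Python sketch under our own naming (the paper fit these models in SAS PROC REG): the truncated cubic terms at the knots x = 32 and x = 68 are appended to the ordinary cubic columns, the model is fit by ordinary least squares, and the mean is predicted at the seven test points.

```python
import numpy as np

# Truncated-power cubic spline basis with knots at 32 and 68 (x4 and x5 in
# the text): columns 1, x, x^2, x^3, (x-32)^3 for x>32 else 0, (x-68)^3 for
# x>68 else 0.
def cubic_spline_design(x, knots=(32.0, 68.0)):
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - k, 0.0) ** 3 for k in knots]
    return np.column_stack(cols)

# Fit by OLS on a toy data set (illustrative mean curve, not one of the
# paper's five structures) and predict at the seven "test" points.
x = np.linspace(0.0, 100.0, 201)
mean = 0.1 + 0.8 * (x / 100.0) ** 3
rng = np.random.default_rng(2024)
y = mean + 0.03 * rng.standard_normal(x.size)
X = cubic_spline_design(x)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
test_points = np.array([0.0, 16.0, 32.0, 50.0, 68.0, 84.0, 100.0])
predicted = cubic_spline_design(test_points) @ beta
```

Because the basis contains the full global cubic, a truly cubic mean is reproduced (up to noise); the truncated terms contribute only when the curve bends differently across the knots.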
The histograms were visually inspected for any unusual or noteworthy features.
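The coverage check described under ANALYSES can be sketched directly (Python, our own function name): for each model and test point, the observed coverage p̂ over the 2000 replications is turned into a Wald interval, and nominal coverage is rejected when that interval misses 0.95.

```python
import math

def coverage_ci(p_hat, n_reps=2000, z=1.96):
    """Wald interval p_hat +/- z*sqrt(p_hat*(1 - p_hat)/n_reps) for coverage."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n_reps)
    return p_hat - half, p_hat + half

# Per the text, any observed coverage in roughly (0.9396, 0.9587) yields an
# interval that contains the nominal 0.95.
edge_low = coverage_ci(0.9396)    # upper endpoint just reaches 0.95
edge_high = coverage_ci(0.9587)   # lower endpoint just reaches 0.95
```

Observed coverages outside that range (e.g., 0.93 or 0.97) produce intervals excluding 0.95, flagging under- or over-coverage for that model and test point.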

Similar articles

Application of Fuzzy Bicubic Splines Interpolation for Solving Two-Dimensional Linear Fuzzy Fredholm Integral Equations

In this paper, we first review the approximation of fuzzy functions by fuzzy bicubic splines interpolation and present a new approach based on two-dimensional fuzzy splines interpolation and an iterative method to approximate the solution of the two-dimensional linear fuzzy Fredholm integral equation (2DLFFIE). We also prove convergence analysis and numerical stability analysis ...

A Comparison of Thin Plate and Spherical Splines with Multiple Regression

Thin plate and spherical splines are nonparametric methods suitable for spatial data analysis. Thin plate splines provide practical, high-precision solutions for spatial interpolation. Two components are considered in the model fitting: the spatial deviations of the data and the model roughness. On the other hand, in parametric regression, the relationship between explanatory and response v...

Restricted Cubic Spline Regression: A Brief Introduction

Sometimes, the relationship between an outcome (dependent) variable and the explanatory (independent) variable(s) is not linear. Restricted cubic splines are a way of testing the hypothesis that the relationship is not linear or summarizing a relationship that is too non-linear to be usefully summarized by a linear relationship. Restricted cubic splines are just a transformation of an independe...

Investigation of electron and hydrogenic-donor states confined in a permeable spherical box using B-splines

Effects of quantum size and potential shape on the spectra of an electron and a hydrogenic donor at the center of a permeable spherical cavity have been calculated using a linear variational method. B-splines have been used as basis functions. By extensive convergence tests and comparison with other results given in the literature, the validity and efficiency of the method were confirmed.

A FILTERED B-SPLINE MODEL OF SCANNED DIGITAL IMAGES

We present an approach for modeling and filtering digitally scanned images. The digital contour of an image is segmented to identify the linear segments, the nonlinear segments, and critical corners. The nonlinear segments are modeled by B-splines. To remove the contour noise, we propose a weighted least squares model to account for both the fitness of the splines as well as their approximate cur...

ESTIMATING DRYING SHRINKAGE OF CONCRETE USING A MULTIVARIATE ADAPTIVE REGRESSION SPLINES APPROACH

In the present study, the multivariate adaptive regression splines (MARS) technique is employed to estimate the drying shrinkage of concrete. For this purpose, a very large database (the RILEM Data Bank) drawn from different experimental studies is used. Several effective parameters, such as the age at onset of shrinkage measurement, the age at the start of drying, the ratio of the volume of the sample on its drying...


Journal:

Volume   Issue

Pages  -

Publication date: 2006