Estimating Explained Variation of a Latent Scale Dependent Variable Underlying a Binary Indicator of Event Occurrence

The coefficient of determination, also known as the $R^2$ statistic, is widely used as a measure of the proportion of explained variation in the context of a linear regression model. In many real-life settings, interest lies in measuring the proportion of explained variation, $\rho^2$, of a latent scale dependent variable $U$ that follows a multiple regression model. In practice, however, $U$ may not be observable and is represented by its binary proxy. In such situations, logistic regression analysis is a popular choice. Many $R^2$ analogs have been proposed to measure explained variation in the context of logistic regression. McFadden's $R^2$ measure stands out from the others because of its intuitive interpretation and its independence of the proportion of successes in the sample. It does, however, severely underestimate the proportion of explained variation of the underlying linear model. In this research we present a method for estimating the explained variation of the underlying linear model using McFadden's $R^2$ statistic. When applied to a real-life dataset, our method estimated $\rho^2$ of the underlying model within an acceptable margin of error.


Introduction
Logistic regression modeling is a popular and powerful tool for describing the relationship between a binary outcome variable and several independent variables. The logistic formulation is also motivated by considering the dependent variable $Y$ to be a binary proxy for a latent continuous variable $U$ that follows a multiple linear regression model. This formulation of the logistic model is explained below.
Many diseases, including several mental and health disorders, are progressive in nature. Health practitioners use some predefined criteria to determine whether a person has a particular disease or some mental/health condition. In many instances, researchers may have information on whether a subject has a particular health condition or not but they may not have access to the actual measurements on the degree of progression of the condition. Under such circumstances it is reasonable to assume the existence of a latent scale dependent variable, which is not observable but is represented by its binary proxy. This situation can be modeled as follows.
Let $U$ be a continuous random variable and let $Y$ be its binary proxy, such that
$$Y = \begin{cases} 1 & \text{if } U > c, \\ 0 & \text{otherwise,} \end{cases} \tag{1}$$
where $c$ is a fixed cutoff value. Let $X' = (X_1, X_2, \cdots, X_p)$ be a vector of $p$ predictors. We may assume that $U$ is related to $X$ through an ordinary linear model
$$U = \beta_0 + \beta' X + \varepsilon, \tag{2}$$
where $\varepsilon$ is a random error term such that $\varepsilon \sim N(0, \sigma^2_\varepsilon)$. The usual coefficient of determination $\rho^2 = 1 - E[\mathrm{Var}(U \mid x)]/\mathrm{Var}(U)$ can be used to measure the extent to which the covariates of interest explain the underlying outcome variable $U$. As mentioned above, measurements on $U$ are not available. Consequently, the answer to the question "How well do the predictors $(X_1, X_2, \cdots, X_p)$ explain $U$?" has to be based on the proportion of explained variation obtained from a logistic regression analysis of $Y$ on $X$. In such a situation it is desirable to compute an $R^2$ analog from the logistic model and use it as an estimate of $\rho^2$. There are, however, two main issues that need to be addressed first.
First, unlike ordinary least squares (OLS) regression analysis, where the $R^2$ statistic is almost unanimously used as the measure of explained variation, many $R^2$ analogs have been suggested for logistic regression models (Mittlböck & Schemper, 1996; Menard, 2000; DeMaris, 2002; Liao & McGee, 2003; Sharma, 2006). Mittlböck & Schemper (1996) reviewed 12 $R^2$ analogs for logistic regression, Menard (2000) six, DeMaris (2002) seven, and Sharma (2006) 14, with some overlap. Other authors have proposed adjusted $R^2$ analogs (see, for example, Mittlböck & Schemper, 2002; Liao & McGee, 2003). But there is no clear consensus on the "best" $R^2$ measure for use with logistic models. Second, almost all of the measures of explained variation for logistic regression analysis severely underestimate the explained variation in the underlying latent scale variable, if one exists (Hosmer & Lemeshow, 2000).
A "good" $R^2$ measure should (i) have an intuitively reasonable interpretation (interpretability); (ii) be numerically consistent with the $R^2$ of an underlying model; and (iii) depend as little as possible on the proportion of successes in the sample (base rate sensitivity) (Sharma, 2006; Menard, 2000). McFadden's $R^2$ (McFadden, 1974) has clear advantages over the others because of its intuitively reasonable interpretation as a proportional reduction in error measure, parallel to the $R^2$ in linear regression analysis (Menard, 2000), and because it has the lowest base rate sensitivity (Menard, 2000; Sharma et al., 2011). McFadden's $R^2$ measure is defined as
$$R^2_L = 1 - \frac{\ln L_M}{\ln L_0},$$
where $L_0$ and $L_M$ are the likelihoods of the null and full logistic models, respectively. In spite of its many advantages over other $R^2$ measures, $R^2_L$ cannot be used directly as an estimator of $\rho^2$, as it severely underestimates the parameter of interest (Hosmer & Lemeshow, 2000).
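The definition of McFadden's measure can be illustrated with a minimal Python sketch. The function names and the toy data below are hypothetical and are not taken from the paper; the null model is taken to be the intercept-only model, whose fitted probability is the sample proportion of successes.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given fitted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p_full):
    """McFadden's R^2_L = 1 - ln(L_M) / ln(L_0).

    The null (intercept-only) model predicts the sample mean of y
    for every observation.
    """
    y_bar = sum(y) / len(y)
    ll_null = log_likelihood(y, [y_bar] * len(y))
    ll_full = log_likelihood(y, p_full)
    return 1.0 - ll_full / ll_null

# Hypothetical outcomes and fitted probabilities from some full model
y = [0, 0, 0, 1, 1, 1]
p_full = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
r2 = mcfadden_r2(y, p_full)
```

A full model that predicts no better than the sample proportion gives a value of zero, and the value approaches one only as the fitted probabilities approach the observed outcomes.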
In this paper we propose a computational method for estimating the proportion of explained variation $\rho^2$ of the underlying linear model using the $R^2_L$ obtained from the logistic regression analysis. In Section 2 we explain the two-level nonlinear model used for estimating $\rho^2$. The simulation study, the results of the model fit and the model validation are discussed in Section 3. An application to a real dataset is presented in Section 4, and some concluding remarks are given in Section 5.

Method
Consider $n$ observations on a binary response variable $Y$ as defined in Eq. (1) and a covariate vector $X' = (X_1, \ldots, X_p)$. The relationship between $Y$ and $X$ is modeled by the logistic model
$$\pi(x) = \Pr(Y = 1 \mid x) = \frac{\exp(\beta_0 + \beta' x)}{1 + \exp(\beta_0 + \beta' x)},$$
with the unconditional mean
$$\bar{\pi} = E[\pi(X)] = \Pr(Y = 1),$$
where $\beta$ is a vector of $p$ regression parameters. For a logistic model with binary $y$, it can be shown that the mean of the conditional probability of success over all possible combinations of the covariate values ($\bar{y}$) equals the probability of success in the population, $\bar{\pi}$.

Two Level Nonlinear Model
We propose a two-level nonlinear model to estimate the explained variation $\rho^2$ of the underlying linear model using the $R^2_L$ obtained from logistic regression analysis. Results of a preliminary simulation study suggest a nonlinear relationship between $\rho^2$ and $R^2_L$. In addition, the dependent variable is a measure of explained variation and needs to be constrained to $[0, 1]$. Therefore, we propose the following Chapman-Richards function for the level-I model.
Level-I Model:
$$\rho^2 = \theta_0 \left[ 1 - \theta_1 \exp\left(-\theta_2 R^2_L\right) \right]^{1/(1 - \theta_3)} \tag{6}$$
In the above model, $\theta_0$ is the maximum attainable value of $\rho^2$ and hence is set to 1. $\theta_1$ is related to the initial value of the response variable. $\theta_2$ governs the rate at which the response variable approaches its potential maximum. $\theta_3$ affects near which asymptote maximum growth occurs, and so determines the shape of the curve and the location of the inflection point.
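As a rough illustration of the level-I curve, the sketch below evaluates a Chapman-Richards function in Python. It assumes the standard form $\rho^2 = \theta_0 [1 - \theta_1 \exp(-\theta_2 R^2_L)]^{1/(1-\theta_3)}$, which agrees with the expression for $\rho^2(0)$ used later when starting values are chosen. The parameter values are hypothetical and are not the fitted values from the paper.

```python
import math

def chapman_richards(r2_l, theta1, theta2, theta3, theta0=1.0):
    """Assumed level-I curve: theta0 * (1 - theta1*exp(-theta2*r2_l))**(1/(1-theta3))."""
    base = 1.0 - theta1 * math.exp(-theta2 * r2_l)
    return theta0 * base ** (1.0 / (1.0 - theta3))

# Hypothetical parameter values chosen only to show the shape of the curve
theta1, theta2, theta3 = 0.95, 5.0, 0.5

# Predicted rho^2 over a grid of McFadden R^2_L values from 0 to 1
curve = [chapman_richards(k / 10, theta1, theta2, theta3) for k in range(11)]
```

With $\theta_0 = 1$ the curve stays inside $[0, 1]$ and rises toward 1 as $R^2_L$ grows, which is the constraint the level-I model is meant to enforce.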
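The level-II model assumes that $\theta_i$, $i = 1, 2, 3$, are linear functions of the probability of success $\bar{\pi}$ and the sample size $n$.
Level-II Model:
$$\theta_i = \beta_{i0} + \beta_{i1} \bar{\pi} + \beta_{i2} n + \varepsilon_i, \quad i = 1, 2, 3, \tag{7}$$
where $\beta_{ij}$ are regression coefficients and $\varepsilon_i$ is the random error term of the $i$th level-II model.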

Parameter Estimation
Parameter estimation for the proposed model involves the following steps.
Step 1 - Simulating Datasets
A Monte Carlo study was designed to simulate datasets of various sample sizes from populations with different levels of $\bar{\pi}$. For a binary dependent variable $Y$ (Eq. 1) representing an unobservable latent scale continuous random variable $U$ (Eq. 2), the probability of success $\bar{\pi}$ is given by
$$\bar{\pi} = \Pr(Y = 1) = \Pr(U > c) = \Pr\left(Z > \frac{c - \mu}{\sigma}\right),$$
where $\mu$ and $\sigma$ are, respectively, the mean and standard deviation of $U$, and $Z \sim N(0, 1)$. Therefore, $\bar{\pi}$ can be expressed as a function of three key parameters, the cutoff value $c$ and the mean and standard deviation of $U$:
$$\bar{\pi} = 1 - \Pr\left(Z \le \frac{c - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{c - \mu}{\sigma}\right),$$
where $\Phi$ is the standard normal cumulative distribution function (CDF). It is, therefore, possible to simulate two populations with different proportions of successes by varying any combination of these three parameters. However, in many practical situations the cutoff value is held fixed. It is also reasonable to assume that the underlying latent scale variable $U$ has different means but the same spread in two subgroups of a population. For example, in a study of determinants of diabetes in male and female populations, the same cutoff value of fasting plasma glucose (FPG) level is used for classifying diabetes status in both populations. However, studies have shown that while the mean FPG level is usually higher among men than among women, the standard deviation mostly remains the same (see, for example, Faerch et al., 2010). Therefore, in order to generate datasets with different proportions of successes, we first simulated $U$s with different means but the same standard deviation and then generated binary $Y$s using Eq. 1 with fixed $c$.
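The relation between the success probability and the three parameters (cutoff, mean and standard deviation of $U$) can be checked numerically. The following is a small sketch using only the Python standard library; the function names and the numerical values of the cutoff and standard deviation are hypothetical. The second function inverts the relation to find the mean of $U$ that yields a target success probability when the cutoff and standard deviation are held fixed, as in the simulation design described above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Z ~ N(0, 1)

def success_probability(c, mu, sigma):
    """pi_bar = 1 - Phi((c - mu) / sigma), i.e. Pr(U > c) for U ~ N(mu, sigma^2)."""
    return 1.0 - STD_NORMAL.cdf((c - mu) / sigma)

def mean_for_target(pi_bar, c, sigma):
    """Mean of U that gives Pr(U > c) = pi_bar for a fixed cutoff c and fixed sigma."""
    return c - sigma * STD_NORMAL.inv_cdf(1.0 - pi_bar)

# Hypothetical fixed cutoff and standard deviation
c, sigma = 100.0, 10.0
targets = [0.05, 0.20, 0.35, 0.50]
means = [mean_for_target(p, c, sigma) for p in targets]
```

Shifting the mean toward the cutoff raises the success probability, and a mean equal to the cutoff gives a success probability of one half.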
We manipulated three variables in our simulation: $\rho^2$ of the underlying linear model (Eq. 2), the proportion of successes ($\bar{\pi}$), and the sample size ($n$). We used 19 levels of $\rho^2$, varying by 0.05 from 0.05 to 0.95; 10 levels of $\bar{\pi}$, varying by 0.05 from 0.05 to 0.5; and five sample sizes: 50, 100, 250, 500 and 1000. The simulation variables were completely crossed, creating a total of 950 simulation conditions. Each simulation condition was replicated 10,000 times, resulting in a total of 9,500,000 logistic models.
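The size of the fully crossed design can be verified with a few lines of Python. This is only a bookkeeping sketch, and the variable names are hypothetical.

```python
from itertools import product

# Levels of the three manipulated simulation variables
rho2_levels = [round(0.05 * k, 2) for k in range(1, 20)]   # 0.05, 0.10, ..., 0.95
pi_levels = [round(0.05 * k, 2) for k in range(1, 11)]     # 0.05, 0.10, ..., 0.50
sample_sizes = [50, 100, 250, 500, 1000]

# Completely crossed design: one condition per (rho2, pi_bar, n) triple
conditions = list(product(rho2_levels, pi_levels, sample_sizes))

replications = 10_000
n_models = len(conditions) * replications
```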
Step 2 - Estimating Level-I Model Parameters
We used PROC NLIN of SAS to fit the level-I model to the simulated data and estimate its parameters. The Marquardt (1963) iterative method was used because it represents a compromise between the linearization (Gauss-Newton) method and the steepest descent method, and appears to combine the best features of both while avoiding their most serious limitations. The Marquardt method, however, requires an initial value for each model parameter. There are four parameters in the level-I model, and the methods used to determine their starting values are described below. $\theta_0$ is the maximum possible value of the dependent variable, which in our case is $\rho^2$, and was therefore set to 1. $\theta_2$ is the rate constant at which the response variable approaches its maximum possible value of 1. On the basis of this definition we used the expression $(u_2 - u_1)/(v_2 - v_1)$ to estimate the starting value of $\theta_2$, where $u_1$ and $u_2$ are the values of $\rho^2$ corresponding to two large $R^2_L$ values, $v_1$ and $v_2$. For the classical Chapman-Richards model, $\theta_3$ lies between zero and one ($0 < \theta_3 < 1$). $\theta_1$ depends on the initial value of the response variable, $\rho^2$, and can be thought of as the "intercept" on the vertical axis at $R^2_L = 0$. Its starting value can be specified by evaluating $\rho^2$ at $R^2_L = 0$. From equation (6) we get $\rho^2(0) = (1 - \theta_1)^{1/(1 - \theta_3)}$, where $\rho^2(0)$ is ideally zero, but in practice one should choose a small positive number close to zero.

Step 3 - Estimating Level-II Model Parameters
Estimates of the $\theta_i$'s obtained in Step 2 are regressed on the corresponding sample sizes ($n$) and probabilities of success ($\bar{\pi}$) to obtain estimates of $\beta_{ij}$, $i = 1, 2, 3$, $j = 0, 1, 2$, for the level-II models in Eq. 7.
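The starting-value rules described in Step 2 can be written out as a short Python sketch. The helper names and all numerical inputs are hypothetical; the paper does not report the starting values actually supplied to PROC NLIN. The starting value of $\theta_1$ is obtained by inverting $\rho^2(0) = (1 - \theta_1)^{1/(1 - \theta_3)}$ for a chosen small $\rho^2(0)$.

```python
def theta2_start(u1, u2, v1, v2):
    """Slope (u2 - u1) / (v2 - v1) between two points at large R^2_L values."""
    return (u2 - u1) / (v2 - v1)

def theta1_start(rho2_at_zero, theta3):
    """Solve rho2(0) = (1 - theta1)**(1 / (1 - theta3)) for theta1."""
    return 1.0 - rho2_at_zero ** (1.0 - theta3)

# Hypothetical inputs, for illustration only
theta0_0 = 1.0                      # maximum attainable rho^2
theta3_0 = 0.5                      # a value inside the interval (0, 1)
theta1_0 = theta1_start(rho2_at_zero=0.001, theta3=theta3_0)
theta2_0 = theta2_start(u1=0.90, u2=0.95, v1=0.60, v2=0.70)
```

A poor choice of starting values can keep an iterative nonlinear least squares routine from converging, which is why each parameter is tied to an interpretable feature of the curve.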

Results and Model Validation
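Scatter plots for the level-II model suggest a nonlinear effect of $\bar{\pi}$ on $\theta_i$. An inverse square-root transformation of $\bar{\pi}$ appeared to address the problem of nonlinearity. Accordingly, we fit the following system of linear equations to obtain the least squares estimates of the level-II model parameters:
$$\theta_i = \beta_{i0} + \beta_{i1} \frac{1}{\sqrt{\bar{\pi}}} + \beta_{i2} n + \varepsilon_i, \quad i = 1, 2, 3. \tag{10}$$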
Model fit summary statistics and the estimated level-II model parameters, along with their respective p-values, are presented in Table 1. The relationship between the level-I model parameters and the proportion of successes, $\bar{\pi}$, is statistically significant ($p < 0.001$ for the coefficient of the transformed $\bar{\pi}$ in all three level-II equations). Though the coefficients for $n$ are very small, they are statistically significant for estimating $\theta_1$ and $\theta_3$. The results presented in Table 1 indicate that the proportion of successes in a dataset and the sample size are good predictors of the level-I model parameters, which are used to estimate $\rho^2$ of the underlying linear model.
In order to validate our model, we simulated a validation dataset using the same 19 levels of $\rho^2$ ranging from 0.05 to 0.95, three sample sizes ($n$ = 50, 100 and 500) and four levels of $\bar{\pi}$ (0.05, 0.2, 0.35 and 0.5). Use of a sample size smaller than 50 (e.g. $n = 30$) caused numerical problems, including no variation in the dependent variable, complete separation and quasi-complete separation. These numerical problems were more frequent for low values of $\bar{\pi}$, especially when $\bar{\pi} = 0.05$. The simulation was implemented using the statistical software SAS© 8.1, and PROC LOGISTIC was used to fit the logistic models. The simulation algorithm is outlined below. For each combination of levels of $\rho^2$ and $n$:
1. Simulate the underlying linear model $U = \beta_0 + \beta' X + \varepsilon$ by generating $X \sim N(\mu_x, \sigma^2_x)$ and $\varepsilon \sim N(0, \sigma^2_\varepsilon)$.
2. Generate $U$ such that the coefficient of determination for the linear model is $\rho^2$.
3. Generate the binary dependent variable $Y$ such that the proportion of successes in the dataset is $\bar{\pi}$ ($\bar{\pi}$ = 0.05, 0.2, 0.35 and 0.5).
4. For each dataset thus generated, fit a logistic model and compute $R^2_L$.
5. For each combination of $\bar{\pi}$ and $n$, obtain estimates of the $\theta_i$'s using equation (10), with the $\beta_{ij}$'s replaced by the regression coefficients from Table 1.
6. Obtain $\hat{\rho}^2$ using Eq. 6.
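Steps 1 to 4 of the algorithm can be sketched in plain Python for a single replication. This is a simplified stand-in for the SAS implementation, not a reproduction of it, and it rests on several assumptions that are not in the paper: a single standard normal predictor with unit slope, so that $\sigma^2_\varepsilon = (1 - \rho^2)/\rho^2$ yields the target $\rho^2$; a population cutoff $c$ chosen so that $\Pr(U > c) = \bar{\pi}$; and a hand-written Newton-Raphson fit in place of SAS. Steps 5 and 6 need the Table 1 coefficients, which are not reproduced here, so they are omitted.

```python
import math
import random
from statistics import NormalDist

def simulate_dataset(rho2, pi_bar, n, rng):
    """Steps 1-3: draw X and eps, build U with the target rho^2, dichotomize at c."""
    sigma_eps = math.sqrt((1.0 - rho2) / rho2)        # assumes slope 1 and Var(X) = 1
    sigma_u = math.sqrt(1.0 / rho2)                   # Var(U) = 1 + sigma_eps^2
    c = sigma_u * NormalDist().inv_cdf(1.0 - pi_bar)  # Pr(U > c) = pi_bar, E[U] = 0
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [xi + rng.gauss(0.0, sigma_eps) for xi in x]
    y = [1 if ui > c else 0 for ui in u]
    return x, y

def log_lik(b0, b1, x, y):
    """Bernoulli log-likelihood of the logistic model in a numerically stable form."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = b0 + b1 * xi
        total += yi * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit_logistic(x, y, iters=25):
    """Step 4: Newton-Raphson for a logistic model with intercept and one slope."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p                 # score for the intercept
            g1 += (yi - p) * xi          # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def mcfadden_r2(x, y):
    """R^2_L = 1 - ln(L_M) / ln(L_0), with an intercept-only null model."""
    b0, b1 = fit_logistic(x, y)
    y_bar = sum(y) / len(y)
    ll_null = log_lik(math.log(y_bar / (1.0 - y_bar)), 0.0, x, y)
    return 1.0 - log_lik(b0, b1, x, y) / ll_null

rng = random.Random(2024)  # fixed seed so the sketch is reproducible
x, y = simulate_dataset(rho2=0.5, pi_bar=0.35, n=2000, rng=rng)
r2_l = mcfadden_r2(x, y)
```

In a run of this kind, the computed $R^2_L$ should fall noticeably below the $\rho^2$ used to generate the data, which is the underestimation that the proposed correction is meant to address. The sketch has no safeguard against complete or quasi-complete separation, so it is only suitable for conditions where the fit is well behaved.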
The graphs in Fig. 1 compare $\hat{\rho}^2$ and $R^2_L$ as estimators of $\rho^2$, the proportion of explained variation of the underlying linear model, for selected simulation conditions. Graphs for $n = 50$ show similar patterns and are not presented here. The solid 45° line was obtained by plotting $\rho^2$ against itself. The distance of a point from this ideal 45° line indicates how well or how poorly the prediction performed. As can be seen in Fig. 1, our model clearly outperforms $R^2_L$ in estimating $\rho^2$ for all eight simulation conditions shown. In order to evaluate the quality of our estimate we computed the relative root mean square error (RRMSE) of $\hat{\rho}^2$. RRMSE is a relative measure of prediction accuracy and is calculated as
$$\mathrm{RRMSE} = \frac{1}{\theta} \sqrt{\frac{1}{R} \sum_{r=1}^{R} \left(\hat{\theta}_r - \theta\right)^2},$$
where $\theta$ and $\hat{\theta}_r$ are, respectively, the desired value and the estimated value (in the $r$th replication) of the parameter of interest, and $R$ is the number of simulation replications. The RRMSE has a minimum value of 0.0 for a perfect prediction, and values closer to 0.0 indicate better prediction. RRMSEs of the estimates for all twelve simulation conditions are presented in Table 2.
Except for very small values of $\rho^2$, the RRMSEs are acceptably small (less than 5%). When $\rho^2 = 0.05$ the RRMSEs range from 8% to 13.5%. However, it should be noted that a model with $\rho^2 = 0.05$ is not very useful and thus is unlikely to be used in practice.
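The RRMSE described above can be computed as in the following Python sketch. It assumes the usual definition, the root mean square error over the replications divided by the true parameter value; the function name and the example estimates are hypothetical.

```python
import math

def rrmse(estimates, true_value):
    """Root mean square error of the estimates, relative to the true value."""
    r = len(estimates)
    mse = sum((est - true_value) ** 2 for est in estimates) / r
    return math.sqrt(mse) / true_value

# Hypothetical estimates of rho^2 from four replications with true rho^2 = 0.5
example = rrmse([0.45, 0.55, 0.48, 0.52], 0.5)
```

Because the error is divided by the true value, the same absolute error produces a larger RRMSE when $\rho^2$ is small, which is consistent with the larger values reported at $\rho^2 = 0.05$.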

Application to a Real Life Dataset
The dataset used in our example comes from Exam 3 of the Framingham Offspring Study (Feinleib et al., 1975). We used Exam 3 data mainly because information about fasting blood glucose (FBG), which was used as the unobserved latent scale variable in our model, was collected starting at this point of the offspring study. The dataset consists of 3371 men and women who were not taking any diabetes medicine at the time of the exam and had not previously been identified as diabetic. For the purpose of this example we selected five potential predictors of FBG: gender (SEX), age at the time of Exam 3 (AGE3), hypertension (HYP: 1 if hypertensive, 0 otherwise), smoking (SMOKE: 1 if currently smoking, 0 otherwise) and body mass index (BMI). In the standard model formula syntax, our model is
FBG ~ SEX + AGE3 + HYP + SMOKE + BMI    (12)
A multiple linear regression analysis of the above model resulted in $R^2 = 0.1861$, with all of the predictors being statistically significant. A 95% bootstrap confidence interval for $\rho^2$, based on 1000 bootstrap samples, was (0.1629, 0.2092).
According to the American Diabetes Association criteria, a person is classified as having impaired fasting glucose (IFG), a type of prediabetes, if the FBG level is between 100 mg/dL and 125 mg/dL, inclusive. We used this criterion to create a binary variable IFG, a proxy for the continuous dependent variable FBG to be explained by the above-mentioned predictors, such that IFG = 1 if 100 ≤ FBG ≤ 125 and IFG = 0 if FBG < 100. In our dataset 17.83% of the subjects were identified as having IFG (i.e. $\bar{\pi} = 0.1783$). A logistic regression analysis of IFG on the five predictors resulted in a model with $R^2_L = 0.1080$, which, as expected, is considerably smaller than $R^2$, the proportion of variation in FBG explained by the underlying linear model (12). The predicted value of $\rho^2$ using our proposed method is 0.16403, which lies within the 95% bootstrap confidence interval for $\rho^2$.
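The bootstrap interval used as the benchmark in this example can be illustrated with a short Python sketch. The paper does not state which bootstrap interval was used, so the percentile method is an assumption here. The data are synthetic, with a single hypothetical predictor, and stand in for the Framingham variables, which are not reproduced.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def bootstrap_ci(x, y, n_boot=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for R^2, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(r_squared([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lower = stats[int((1.0 - level) / 2.0 * n_boot)]
    upper = stats[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

# Synthetic data with a population R^2 of 0.2 (slope 0.5, unit error variance)
data_rng = random.Random(7)
x = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
y = [0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

r2_hat = r_squared(x, y)
lower, upper = bootstrap_ci(x, y)
```

An estimate obtained from the binary proxy alone can then be compared against such an interval, in the same way that the predicted value 0.16403 is compared against (0.1629, 0.2092) above.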

Conclusion
Researchers are often interested in estimating how well a set of predictors explains a continuous dependent variable. If the relationship is modeled using a linear regression model, then the coefficient of determination can be used to estimate the proportion of variation in the dependent variable explained by the predictors. In practice, however, the dependent variable of interest may not be observable and is represented by its binary proxy. In such situations, interest may lie in estimating the proportion of explained variation through a logistic regression analysis. In this paper, we have proposed a computational method for this purpose. We used McFadden's $R^2$ measure mainly because of its intuitive interpretation and its base rate invariance. In addition, it is easy to compute from the standard logistic regression output of most statistical analysis software. When applied to a real-life dataset, our method estimated the proportion of explained variation of the underlying model within an acceptable margin of error.