Self-Selecting Robust Logistic Regression Model

The logistic regression (LR) model is the most common model used for the analysis of binary data. However, atypical observations in the data can have an undue effect on the parameter estimates. Many researchers have developed robust statistical models to address this problem of outliers. Gelman (2004) proposed GRLR, a robust model obtained by trimming the probability of success in LR. The trimming value in that model is fixed, and the user is required to specify it in advance. This study develops the SsRLR model, which allows the data itself to select the alpha value. We also propose a Restricted LR model to substitute for LR in the presence of outliers. We show that the SsRLR model is the most robust of the models considered to the presence of leverage points in the data. Parameter estimation is done using a full Bayesian approach implemented in WinBUGS 14 software.


Introduction
Many dependent variables of interest in the social sciences are not continuous. In most cases, the outcomes are categorical with two levels, such as yes/no, success/failure or 0/1. Such variables are called binary or dichotomous responses. Binary logistic regression is a helpful way of explaining the relationship between one or more independent variables and a binary response. For a binary variable $y$ with probability of success $\pi = \Pr(y = 1)$, the classical LR model is defined as:

$$\text{LR}: \quad \pi = \operatorname{logit}^{-1}(X\beta) = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

where X is a vector of p independent variables and β is a p-dimensional vector of regression coefficients for the predictor variables.
Robustness is a highly developed subject in the estimation of location and scale and in simple and multiple regression. Attention has also been paid to robust logistic regression, an area where outliers may likewise appear. Pregibon (1981) started by developing an analytical measure to assist in the detection of outliers and leverage points and to quantify their effect on diverse aspects of the maximum likelihood fit. Thereafter, a good number of robust estimation procedures in the context of logistic regression have been examined. Gelman (2004) proposed the GRLR model using a trimming approach. The approach uses a trimming value α = 0.01, that is, a 0.01 chance of random error in both directions of the interval [0, 1]. Gelman's Robust Logistic Regression model is defined as:

$$\text{GRLR}: \quad \pi = 0.01 + 0.98\,\operatorname{logit}^{-1}(X\beta)$$

where 0.01 and 0.98 are fixed. That model requires the statistician to specify these values beforehand. SsRLR solves this problem in the GRLR model by relaxing this restriction and letting these probabilities be selected by the data at hand, so that only a prior distribution, say Uniform[a, b] with a and b belonging to [0, 1], is given.
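To make the effect of trimming concrete, the following minimal Python sketch compares the ordinary logistic probability with a trimmed probability of the form α + (1 − 2α) logit⁻¹(η). It assumes α = 0.01, so that 1 − 2α = 0.98 as in GRLR. The function names `inv_logit` and `trimmed_prob` are illustrative and do not come from the original study.

```python
import math


def inv_logit(eta):
    """Standard logistic function, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def trimmed_prob(eta, alpha=0.01):
    """Trimmed success probability: alpha + (1 - 2*alpha) * inv_logit(eta).

    alpha = 0.01 is the fixed GRLR value quoted in the text (assumed here).
    """
    return alpha + (1.0 - 2.0 * alpha) * inv_logit(eta)


if __name__ == "__main__":
    # For extreme linear predictors the untrimmed probability reaches 0 or 1,
    # while the trimmed probability stays inside [alpha, 1 - alpha].
    for eta in (-40.0, -2.0, 0.0, 2.0, 40.0):
        print(eta, inv_logit(eta), trimmed_prob(eta))
```

Under these assumptions, an observation with an extreme linear predictor can never be assigned a probability below α or above 1 − α, which is what limits the influence of a single leverage point on the fit.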

Robust Logistic Regression
Robust means remaining resistant to irregular deviations. In statistics, models are simplified approximations of reality. The models that underlie many statistical procedures are very optimistic, and in real data large errors occur with unexpectedly high frequency. An observation that lies at an abnormal distance from the bulk of the data is called an outlier. Robustness means insensitivity to departures from the assumed model. Robust procedures originated in the work of Tukey (1960), and formal models of robustness were further developed in the 1970s.
In regression models, the purpose of robust methods is to detect outliers and highly influential data points (leverage points), and then to describe the goodness of fit for the data. One of the earliest works on a robust alternative to least squares was carried out by Edgeworth (1887), who improved on the proposal of Boscovich (1757) (Koenker & Bassett, 1985). This estimator is the least absolute deviation (LAD) estimator.
Logistic regression is concerned with explaining the probability of a specific response in terms of a number of regressors using a sample of relevant data. Pregibon (1981) affirmed that the estimated LR fit may be strongly influenced by outliers. Dealing with outliers requires robust logistic regression models that overcome their influence on the LR model. Research in this direction has been conducted by Hubert (1973), Pregibon (1981), Rousseeuw et al. (1987), Yohai (1987), Copas (1988) and Rousseeuw (2003).
Trimming is a widely used approach to robustifying statistical procedures. It allows one to detect outliers and eliminate them from the data used in the estimation procedure. Trimming has been developed extensively by different authors in least squares regression, multivariate analysis and other areas (Rousseeuw, 1984; Rousseeuw & Van Driessen, 1999, where additional references can be found). It appears attractive to apply trimming in logistic regression as well, to find outliers and to control their influence.
Outliers can disturb statistical models and cause the estimated model to differ significantly from the true one. Outliers in LR may occur in the Y-space, where they are called misclassification-type errors (Copas, 1988), in the X-space, where they are considered leverage points, or in both spaces. Outlying cases in this study are based only on covariate corruption.
The robust LR model in this study is based on the trimmed-probability approach, with estimation carried out by Bayesian inference using the Gibbs sampler and the Metropolis-Hastings algorithm.

Proposed SsRLR Model
In this work, we improve the model of Gelman (2004) by developing a self-selecting robust logistic regression. Suppose $y = (y_1, y_2, \ldots, y_n)$ are n independent observations, where the $y_i$ are binary responses defined as:

$$y_i = \begin{cases} 1 & \text{if the outcome is a success} \\ 0 & \text{otherwise} \end{cases}$$

Binary regression models assume that $y_i \sim \mathrm{Ber}(\pi_i)$, with $\pi_i = \Pr(y_i = 1)$ the probability of success for each observation.
From that, the robust model we are developing is as follows:

$$\text{SsRLR}: \quad \pi_i = \alpha + (1 - 2\alpha)\,\operatorname{logit}^{-1}(X_i\beta)$$

where X is a vector of p independent variables and β is a p-dimensional vector of regression coefficients for the predictor variables.
In contrast to other studies, where the value of α is set beforehand by the statistician, we allow it to be determined from the data itself. Since we are working in the Bayesian paradigm, we give α a uniform prior distribution.

Estimation using Bayesian Approach
The Bayesian approach to estimation is used to minimize the estimation risk and to obtain optimal estimates. To proceed, we follow the usual pattern for all Bayesian analyses: we write down the likelihood function of the data, form a prior distribution over all unknown parameters, and use Bayes' theorem to find the posterior distribution over all parameters.

Likelihood function
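Once the probability of success depending on the covariates is obtained, the likelihood function is:

$$L(\pi \mid y) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}$$

where $\pi_i$ represents the probability of success and $y_i$ the binary response. In our model we have:

$$\pi_i = \alpha + (1 - 2\alpha)\,\operatorname{logit}^{-1}(X_i\beta)$$

Hence the likelihood function of the binary response data of n independent observations is:

$$L(\beta, \alpha \mid y) = \prod_{i=1}^{n} \left[\alpha + (1 - 2\alpha)\,\operatorname{logit}^{-1}(X_i\beta)\right]^{y_i} \left[1 - \alpha - (1 - 2\alpha)\,\operatorname{logit}^{-1}(X_i\beta)\right]^{1 - y_i}$$

where:

$$\operatorname{logit}^{-1}(X_i\beta) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}$$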
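As an illustration of how such a likelihood behaves at a leverage point, the sketch below evaluates a Bernoulli log-likelihood with a trimmed success probability. It assumes the form π_i = α + (1 − 2α) logit⁻¹(X_i β) with 0 < α < 0.5. The function name `ssrlr_log_lik` and the toy data are hypothetical and are not taken from the study.

```python
import math


def inv_logit(eta):
    """Standard logistic function, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def ssrlr_log_lik(beta, alpha, X, y):
    """Bernoulli log-likelihood with trimmed success probabilities.

    Assumes 0 < alpha < 0.5, so every probability lies strictly in (0, 1).
    Each row of X must already contain a leading 1 for the intercept.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, x_i))
        p = alpha + (1.0 - 2.0 * alpha) * inv_logit(eta)
        total += math.log(p) if y_i == 1 else math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # Hypothetical leverage point: very large covariates with an "unexpected" y = 0.
    beta = [0.0, 2.0, 2.0]
    X = [[1.0, 10.0, 10.0]]
    y = [0]
    # The contribution is bounded below by log(alpha) instead of diverging.
    print(ssrlr_log_lik(beta, 0.01, X, y))
```

With an untrimmed probability, the same observation would contribute the logarithm of a number that is numerically zero; this is one plausible reason why an ordinary LR fit can break down on heavily contaminated covariates.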

Prior Distributions
In this analysis, a non-informative normal prior was assigned to the regression coefficients β. The parameter under study, α, is given a uniform prior distribution U[a, b].

Posterior Distributions
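To derive the posterior distribution, we multiply the prior distribution over all parameters by the likelihood function. Thus we have:

$$\pi(\beta, \alpha \mid y) \propto L(\beta, \alpha \mid y)\,\pi(\beta)\,\pi(\alpha)$$

where $\pi(\beta)$ and $\pi(\alpha)$ are the prior distributions of β and α respectively.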
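The following sketch shows one way a posterior of this kind could be explored numerically. It is not the WinBUGS implementation used in the study: it substitutes a plain random-walk Metropolis sampler, and it assumes the trimmed probability α + (1 − 2α) logit⁻¹(X_i β), independent N(0, prior_sd²) priors on β, and a Uniform(a, b) prior on α with 0 ≤ a < b ≤ 0.5. All function names, step sizes and default values are hypothetical.

```python
import math
import random


def inv_logit(eta):
    """Standard logistic function, written to avoid overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def log_posterior(beta, alpha, X, y, a, b, prior_sd):
    """Unnormalized log posterior: log-likelihood + normal prior + uniform prior."""
    if not (a < alpha < b):
        return -math.inf  # outside the support of the Uniform(a, b) prior
    lp = sum(-0.5 * (bj / prior_sd) ** 2 for bj in beta)
    for x_i, y_i in zip(X, y):
        eta = sum(bj * xj for bj, xj in zip(beta, x_i))
        p = alpha + (1.0 - 2.0 * alpha) * inv_logit(eta)
        lp += math.log(p) if y_i == 1 else math.log(1.0 - p)
    return lp


def metropolis(X, y, n_iter=2000, a=0.0, b=0.1, prior_sd=10.0, step=0.3, seed=1):
    """Random-walk Metropolis over (beta, alpha); returns a list of draws."""
    rng = random.Random(seed)
    beta = [0.0] * len(X[0])
    alpha = 0.5 * (a + b)
    current = log_posterior(beta, alpha, X, y, a, b, prior_sd)
    draws = []
    for _ in range(n_iter):
        prop_beta = [bj + rng.gauss(0.0, step) for bj in beta]
        prop_alpha = alpha + rng.gauss(0.0, 0.1 * (b - a))
        candidate = log_posterior(prop_beta, prop_alpha, X, y, a, b, prior_sd)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, alpha, current = prop_beta, prop_alpha, candidate
        draws.append((list(beta), alpha))
    return draws
```

A call such as `metropolis(X, y, a=0.0, b=0.1)` returns pairs of (β, α) draws; every retained α lies inside (a, b) because proposals outside the prior support are always rejected.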
Most often, Bayes estimators of θ cannot be computed explicitly, and we must resort to Monte Carlo simulation, here the Gibbs sampler algorithm, with which computing Bayes estimators does not pose great difficulty. For each model, we ran 10,000 Markov chain Monte Carlo (MCMC) iterations, discarding the initial 1,000 as burn-in and thereafter keeping every tenth sample value. MCMC convergence of all model parameters was assessed by checking trace plots and autocorrelation plots of the MCMC output.
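The burn-in and thinning arithmetic can be checked with a short sketch. It assumes that "keeping every tenth sample value" starts with the first post-burn-in iteration; the exact indexing used by WinBUGS may differ. The helper names `retained_indices` and `autocorr` are illustrative only.

```python
def retained_indices(n_iter=10_000, burn_in=1_000, thin=10):
    """Zero-based indices of the iterations kept after burn-in and thinning."""
    return list(range(burn_in, n_iter, thin))


def autocorr(chain, lag):
    """Sample autocorrelation of a chain at a given lag (0 <= lag < len(chain))."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((v - mean) ** 2 for v in chain)
    if var == 0.0:
        return 0.0  # constant chain: autocorrelation is undefined, report 0
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / var


if __name__ == "__main__":
    kept = retained_indices()
    print(len(kept))  # number of retained draws per parameter
```

Under this indexing, 900 draws per parameter remain for posterior summaries. Autocorrelations of the retained chain that decay quickly towards zero are the pattern one looks for in an autocorrelation plot.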

Gibbs Sampler Algorithm
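Consider the basic case of a joint density f(x, y), and assume that the conditional densities f(x|y) and f(y|x) are available. We can then generate a so-called Gibbs sequence as follows: starting from a value $x_0$, $y_0$ is generated from $f(\cdot \mid x_0)$, then $x_1$ from $f(\cdot \mid y_0)$, then $y_1$ from $f(\cdot \mid x_1)$, and so on.
After M iterations of this scheme, we obtain a sequence $(x_0, y_0, x_1, y_1, \ldots, x_M, y_M)$. For M large enough, $x_M$ is approximately a realization of X.
In the Bayesian framework, the Gibbs algorithm (Geman, 1984) makes it possible to obtain a realization of the parameter $\theta = (\theta_1, \ldots, \theta_m)$ from the posterior distribution $\pi(\theta \mid x)$ as soon as one can express the conditional distributions $\pi(\theta_i \mid \theta_j; x)$, $j \neq i$. Gibbs sampling thus consists of the following. Start from an initial vector $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_m^{(0)})$. At the (p + 1)th step, with the vector $\theta^{(p)} = (\theta_1^{(p)}, \ldots, \theta_m^{(p)})$ available, simulate:

$$\begin{aligned}
\theta_1^{(p+1)} &\sim \pi(\theta_1 \mid \theta_2^{(p)}, \ldots, \theta_m^{(p)}; x) \\
\theta_2^{(p+1)} &\sim \pi(\theta_2 \mid \theta_1^{(p+1)}, \theta_3^{(p)}, \ldots, \theta_m^{(p)}; x) \\
&\;\;\vdots \\
\theta_m^{(p+1)} &\sim \pi(\theta_m \mid \theta_1^{(p+1)}, \ldots, \theta_{m-1}^{(p+1)}; x)
\end{aligned}$$

Successive iterations of this algorithm generate the states of a Markov chain $\{\theta^{(p)}, p > 0\}$ with values in $\aleph^{\otimes m}$. The transition probability from $\theta'$ to $\theta$ is expressed as:

$$K(\theta', \theta) = \prod_{i=1}^{m} \pi(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta'_{i+1}, \ldots, \theta'_m; x)$$

where:

$$\int K(\theta', \theta)\,\pi(\theta' \mid x)\,d\theta' = \pi(\theta \mid x)$$

This shows that the chain admits an invariant measure, which is the posterior. For a sufficiently large number of iterations, the vector θ thus obtained may be considered a realization from the posterior.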
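The two-variable Gibbs sequence can be illustrated on a textbook target that is not part of the study: a standard bivariate normal with correlation ρ, whose full conditionals are x | y ~ N(ρy, 1 − ρ²) and y | x ~ N(ρx, 1 − ρ²). The sketch below follows the same order as the text (y from its conditional given x, then x given y). The function name and defaults are hypothetical.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, x0=0.0, seed=1):
    """Gibbs sequence (x_0, y_0), (x_1, y_1), ... for a standard bivariate normal.

    Each pair is produced by drawing y from f(. | x), recording (x, y),
    and then drawing the next x from f(. | y).
    """
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)
    x = x0
    draws = []
    for _ in range(n_iter):
        y = rng.gauss(rho * x, cond_sd)   # y_k ~ f(. | x_k)
        draws.append((x, y))
        x = rng.gauss(rho * y, cond_sd)   # x_{k+1} ~ f(. | y_k)
    return draws


if __name__ == "__main__":
    sample = gibbs_bivariate_normal(rho=0.8, n_iter=20_000)[1_000:]  # drop burn-in
    mean_x = sum(x for x, _ in sample) / len(sample)
    mean_y = sum(y for _, y in sample) / len(sample)
    print(mean_x, mean_y)
```

After the initial draws are discarded, the sample means should be close to zero and the sample correlation close to ρ, which is the practical meaning of the chain having the target as its invariant distribution.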

Model Diagnostics
The goodness of fit of the models was compared using the Deviance Information Criterion (DIC), as suggested by Spiegelhalter (2002). The best-fitting model is the one with the smallest DIC. The DIC value is given by $\mathrm{DIC} = \bar{D}(\theta) + p_D$, where $\bar{D}(\theta)$ is the posterior mean of the deviance, which measures goodness of fit, and $p_D$ is the effective number of parameters in the model, which penalizes model complexity. Several authors have stated that two models whose DIC values differ by less than 3 cannot be distinguished, while a difference between 3 and 7 differentiates them only weakly.
For further model assessment, we also used the Bayesian Information Criterion (BIC). In statistics, the Bayesian information criterion, or Schwarz criterion, is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred (Schwarz, 1978). It is based in part on the likelihood function and is closely related to the Akaike information criterion (AIC). The BIC value is given by $\mathrm{BIC} = \hat{D} + 2p \log(n)$, where $\hat{D} = -2 \log L(\theta^* \mid y)$, with $L(\theta^* \mid y)$ the likelihood of each model, p the number of parameters and n the sample size.
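A minimal sketch of the DIC computation from MCMC output is given below. It assumes the usual definition of the effective number of parameters, p_D = D̄ − D(θ̄), that is, the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters. The function name `dic` and the numbers in the example are made up for illustration.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from sampled deviances and the plug-in deviance.

    deviance_draws: deviance evaluated at each retained MCMC draw.
    deviance_at_posterior_mean: deviance evaluated at the posterior mean
    of the parameters (assumed definition of the plug-in deviance).
    """
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d


if __name__ == "__main__":
    # Made-up deviance values, for illustration only.
    value, p_d = dic([100.0, 102.0, 104.0], 100.0)
    print(value, p_d)
```

Two fitted models can then be compared by the difference of their DIC values, read against the thresholds quoted in the text.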

Simulation Study
We carried out a simulation study to investigate the robustness of three models: the Self-Selecting Robust Logistic Regression (SsRLR) model, Gelman's Robust Logistic Regression (GRLR) model and the ordinary Logistic Regression (LR) model. Following the simulation study carried out by Croux & Haesbroeck (2003), the LR model is generated with two independent normally distributed covariates. The additive noise $\varepsilon_i$ is drawn from a logistic distribution defined as:

$$F(\varepsilon) = \frac{1}{1 + e^{-\varepsilon}}$$

The true parameter values are β = (0, 2, 2), with sample size n = 200. The study covered a variety of situations. First, data without contamination were generated with two independent normally distributed covariates with zero mean and unit variance. Second, to examine the robustness properties of all models, we introduced outliers by contaminating the data, following the idea proposed by Victori (2002). We generated the outliers in R by corrupting the covariates. This consists of randomly choosing a certain proportion t (3%, 5%, 7%) of both covariates and replacing them with a sample $X_i$ drawn from N(t, 10, 2). The response variable for each proportion was then generated from the new corrupted covariates. Finally, the generated binary response data were contaminated under different percentages of leverage points. Thereafter, the three logistic models were applied to the generated data. To better handle those outliers, our robust model lets the contaminated binary response data itself select the value of the probability α. After obtaining that alpha value for the robust model, we compared the goodness of fit of the three logistic regression models.
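The data-generating scheme can be sketched in Python as follows; the study itself used R, so this is only an approximation under several labeled assumptions. The response is generated through a latent variable, y_i = 1 when β₀ + β₁x₁ + β₂x₂ + ε_i > 0; contamination replaces both covariates of the selected rows; and the replacement values are drawn from a normal distribution with mean 10 and standard deviation 2 (whether 2 is a standard deviation or a variance is not stated in the text). The function names are hypothetical.

```python
import math
import random


def logistic_noise(rng):
    """Draw from the standard logistic distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return math.log(u / (1.0 - u))


def simulate(n=200, beta=(0.0, 2.0, 2.0), prop=0.0, seed=1):
    """Return (X, y, contaminated_rows) for one simulated data set.

    prop is the proportion of rows whose covariates are replaced by
    leverage points (assumed here to be N(mean=10, sd=2) draws).
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]
    bad_rows = rng.sample(range(n), round(prop * n))
    for i in bad_rows:
        X[i] = [rng.gauss(10.0, 2.0), rng.gauss(10.0, 2.0)]
    y = []
    for x1, x2 in X:
        eta = beta[0] + beta[1] * x1 + beta[2] * x2
        # The response is generated from the (possibly corrupted) covariates.
        y.append(1 if eta + logistic_noise(rng) > 0.0 else 0)
    return X, y, sorted(bad_rows)


if __name__ == "__main__":
    for prop in (0.0, 0.03, 0.05, 0.07):
        X, y, bad = simulate(prop=prop)
        print(prop, len(bad), sum(y))
```

Each call produces one data set of 200 observations; varying `prop` over 0, 0.03, 0.05 and 0.07 mirrors the contamination levels described in the text.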

Results
For each simulated data set, we estimated and recorded the parameters β and α. In particular, we focus on how well each model performs in the presence of outliers in the binary response data. To assess that performance, we compute and compare the DIC and BIC of the models.

Model Assessment and Comparison
The first finding involved the LR model. The generated outlier values between 5 and 10 caused the LR model not to run, giving a "Trap Message" and no output, while the SsRLR model handled the leverage points without any problem. We obtained output and summary statistics by using a Restricted Logistic Regression (RLR) model, where $y_i \sim \mathrm{Bern}(\pi_i)$, defined as in Equation (14). It can be deduced that fitting ordinary logistic regression to data with outliers can produce a "Trap Message" and no output unless Minmax is used in WinBUGS.
Tables 1-2 show the simulated results of all the fitted models for data with various percentages of leverage points. In the absence of outliers (0% of leverage points), there is no significant difference between the restricted logistic model and the robust models based on the DIC value, but the SsRLR model seems to give better estimates of the parameters. The covariates in the simulation were x_1 ∼ N(n, 0, 1) and x_2 ∼ N(n, 0, 1).
The Restricted LR model was immediately affected by 3% of leverage points, giving the highest DIC value. Gelman's model was influenced as well, showing parameter estimates that fell short of the expected values, while the SsRLR model let the data itself select an alpha value of 2.664E-5, which improved the parameter estimates.
It is interesting to observe that 5% of leverage points has no effect on the SsRLR model. This confirms its robustness, as it gives better simulated results with the smallest DIC value.
The α values 5.014E-5 and 2.059E-3, self-selected in the presence of outliers (5% and 7% of leverage points, respectively), allowed the data to minimize the influence of those outliers on the parameter estimation.
Based on the criterion that a difference in DIC values of 3 to 4 between two models indicates a better fit, it can be concluded that the best-fitting model is the Self-Selecting Robust Logistic Regression (SsRLR) model, which has the smallest DIC value when outliers are present in the binary response data.
Furthermore, based on the BIC, the SsRLR model, with the lowest BIC value, is the preferred model (Schwarz, 1978).

Discussion
This study uses Bayesian techniques to develop a robust logistic regression model for binary response data in which outliers are present. The robust model is intended to improve parameter estimation and fit. The approach used in the robust model is based on a trimming value α, the chance of random error in both directions of the interval [0, 1].
Building on the contribution of Gelman (2004), who fixed α and (1 − 2α) in his model, we extended the model by letting these probability values be self-selected depending on the data at hand, and gave them a Uniform[a, b] prior distribution.
In this study, we confirmed that these probability values can be determined by the data itself. In other words, depending on the binary data at hand, the data can itself select α and (1 − 2α).
We found that the smaller the assigned values of a and b, the smaller the self-selected α and the more efficient the estimates obtained from the simulation results, compared to those obtained from both the GRLR and the LR models, whether the data are clean or contaminated.
Another finding is that the self-selecting robust logistic regression model fits better than the Restricted LR model, based on the DIC value obtained using the Bayesian approach implemented in WinBUGS.
The SsRLR model provides a reliable fit, with the lowest BIC value compared to the RLR and GRLR models.
We also found that the Restricted LR model minimized the effect of the outliers present in the data and allowed better results to be achieved. Despite this, the Self-Selecting Robust Logistic Regression model gave more reliable results than the Restricted LR model, in contrast to Gelman's robust logistic regression model.

Conclusions
This work aims to extend the performance of logistic regression for binary data. Ordinary LR with arbitrary outliers was shown to fail. We proposed a robust SsRLR model that deals with such contamination. It was also observed that, with the value of alpha fixed, the GRLR model was not very robust to influential observations.
In this study we proposed a novel robust LR model to solve this issue. To proceed, we developed a self-selecting robust logistic model and then investigated its robustness. We proposed a clear way of specifying the trimming value, as opposed to requiring the user to fix it.
One finding across the simulation results is that the SsRLR model performs well precisely because it lets the binary data itself select the alpha value needed to improve the quality of the parameter estimates. Based on the smallest DIC and BIC values, our SsRLR model was found to be the best-fitting model for contaminated binary data sets.
We found that the smaller the α value self-selected by the data at hand, the more robust the SsRLR model. That is our contribution to Gelman's robust logistic regression model.
The figures give a visual representation of the distribution of the data set. The second histogram of Figure 1 clearly confirms the presence of outliers located between 5 and 10, as noted earlier.

Figure 1. Histogram of X in both the clean and contaminated cases

Table 1. Description of variables X and assumed values of parameters manipulated in simulation

Table 2. Simulated results of all models for data with leverage points (0% and 3%)