A Sub-Model Theorem for Ordinary Least Squares

Variable selection, or subset selection, is an important step in model fitting. There are many ways to select a subset of variables, including forward selection and backward elimination. Ordinary least squares (OLS) is one of the most commonly used methods for fitting the final model. The final sub-model can perform poorly if the variable selection process fails to choose the right number of variables. This paper gives a new theorem and a mathematical proof that illustrate the reason for this poor performance when least squares is used after variable selection.


Introduction
The use of OLS for multiple linear regression models after variable selection can result in poor models. First, we describe the multiple linear regression (MLR) model in Section 1. In Section 2, we discuss variable selection, and in Section 3 we introduce a new theorem and its proof to illustrate the reason for the poor performance of some OLS sub-models. This paper closely follows the author's related work: Pelawa Watagoda (2017), Pelawa Watagoda and Olive (2018), and Pelawa Watagoda and Olive (2018a).

Multiple Linear Regression Model
Suppose that the response variable $Y_i$ and at least one predictor variable $x_{i,j}$ are quantitative with $x_{i,1} \equiv 1$. Let $\boldsymbol{x}_i^T = (x_{i,1}, \ldots, x_{i,p}) = (1 \;\; \boldsymbol{u}_i^T)$ and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$, where $\beta_1$ corresponds to the intercept. Then the multiple linear regression (MLR) model is

$$Y_i = \beta_1 + x_{i,2}\beta_2 + \cdots + x_{i,p}\beta_p + e_i = \boldsymbol{x}_i^T \boldsymbol{\beta} + e_i \qquad (1)$$

for $i = 1, \ldots, n$. This model is also called the full model. Here $n$ is the sample size, and assume that the random variables $e_i$ are independent and identically distributed (iid) with variance $V(e_i) = \sigma^2$.
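In matrix notation, these $n$ equations become

$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e} \qquad (2)$$

where $\boldsymbol{Y}$ is an $n \times 1$ vector of response variables, $\boldsymbol{X}$ is an $n \times p$ matrix of predictors, $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown coefficients, and $\boldsymbol{e}$ is an $n \times 1$ vector of unknown errors.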
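The $i$th fitted value is $\hat{Y}_i = \boldsymbol{x}_i^T \hat{\boldsymbol{\beta}}$ and the $i$th residual is $r_i = Y_i - \hat{Y}_i$, where $\hat{\boldsymbol{\beta}}$ is an estimator of $\boldsymbol{\beta}$. Ordinary least squares (OLS) is often used for inference if $n/p$ is large.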
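The fitted values and residuals defined above can be sketched numerically. The following is a minimal illustration only: the sample size, coefficient values, and simulated normal predictors and errors are hypothetical choices made for the example, not quantities taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, for reproducibility only
n, p = 100, 4                    # hypothetical sample size and number of terms

# Design matrix whose first column is the constant x_{i,1} = 1.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, 0.0, -1.0])   # hypothetical true coefficients
Y = X @ beta + rng.normal(size=n)         # full model Y = X beta + e

# OLS estimator: the coefficient vector minimizing the residual sum of squares.
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ beta_hat   # fitted values
r = Y - Y_hat          # residuals
```

For an OLS fit, the residual vector is orthogonal to every column of the design matrix, which gives a quick sanity check on the computation.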
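It is often convenient to use the centered response $\boldsymbol{Z} = \boldsymbol{Y} - \overline{\boldsymbol{Y}}$, where $\overline{\boldsymbol{Y}} = \overline{Y}\boldsymbol{1}$, and the $n \times (p - 1)$ matrix of standardized nontrivial predictors $\boldsymbol{W} = (W_{ij})$. For $j = 1, \ldots, p - 1$, let $W_{ij}$ denote the $(j + 1)$th variable standardized so that $\sum_{i=1}^n W_{ij} = 0$ and $\sum_{i=1}^n W_{ij}^2 = n$. Note that the sample correlation matrix of the nontrivial predictors $\boldsymbol{u}_i$ is $\boldsymbol{R}_u = \boldsymbol{W}^T \boldsymbol{W}/n$. Then regression through the origin is used for the model

$$\boldsymbol{Z} = \boldsymbol{W}\boldsymbol{\eta} + \boldsymbol{e} \qquad (3)$$

where the vector of fitted values is $\hat{\boldsymbol{Y}} = \overline{\boldsymbol{Y}} + \hat{\boldsymbol{Z}}$.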
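The centering and standardization just described can be sketched as follows. This is an illustration under assumptions: the helper name `standardize` and the simulated data are hypothetical and do not come from the paper.

```python
import numpy as np

def standardize(U):
    """Center each column and scale it so that its sum of squares equals n."""
    n = U.shape[0]
    centered = U - U.mean(axis=0)
    scale = np.sqrt((centered ** 2).sum(axis=0) / n)
    return centered / scale

rng = np.random.default_rng(1)
U = rng.normal(size=(50, 3))   # hypothetical nontrivial predictors u_i
Y = rng.normal(size=50)        # hypothetical response

W = standardize(U)             # standardized nontrivial predictors
Z = Y - Y.mean()               # centered response
R_u = W.T @ W / U.shape[0]     # sample correlation matrix of the predictors
```

With this scaling, the diagonal of `R_u` is one and the off-diagonal entries are the usual sample correlations.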
There are many methods for estimating $\boldsymbol{\beta}$, including forward selection with OLS, principal components regression (PCR), partial least squares (PLS) due to Wold (1975), lasso due to Tibshirani (1996), and ridge regression (RR); see Hoerl and Kennard (1970). There is also a variant of relaxed lasso that applies OLS to a constant and the predictors that had nonzero lasso coefficients. This variant is the LARS-OLS hybrid estimator of Efron et al. (2004), also called the relaxed lasso ($\phi = 0$) estimator by Meinshausen (2007).
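These six methods produce $M$ models and use a criterion, such as $C_p$ or 10-fold cross validation (CV), to select the final model. The number of models $M$ depends on the method. Lasso and ridge regression have a parameter $\lambda$. When $\lambda = 0$, the OLS full model is used. These methods also use a maximum value $\lambda_M$ of $\lambda$ and a grid of $M$ values

$$0 \le \lambda_1 < \lambda_2 < \cdots < \lambda_M \qquad (4)$$

where often $\lambda_1 = 0$. For lasso, $\lambda_M$ is the smallest value of $\lambda$ such that $\hat{\boldsymbol{\eta}}_{\lambda_M} = \boldsymbol{0}$. Hence $\hat{\boldsymbol{\eta}}_{\lambda_i} \neq \boldsymbol{0}$ for $i < M$.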

Variable Selection
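Variable selection is the search for a subset of predictor variables that can be deleted with little loss of information if $n/p$ is large, and so that the model with the remaining predictors is useful for prediction. Following Olive and Hawkins (2005), a model for variable selection can be described by

$$\boldsymbol{x}^T \boldsymbol{\beta} = \boldsymbol{x}_S^T \boldsymbol{\beta}_S + \boldsymbol{x}_E^T \boldsymbol{\beta}_E = \boldsymbol{x}_S^T \boldsymbol{\beta}_S \qquad (5)$$

where $\boldsymbol{x} = (\boldsymbol{x}_S^T, \boldsymbol{x}_E^T)^T$, $\boldsymbol{x}_S$ is an $a_S \times 1$ vector, and $\boldsymbol{x}_E$ is a $(p - a_S) \times 1$ vector. Given that $\boldsymbol{x}_S$ is in the model, $\boldsymbol{\beta}_E = \boldsymbol{0}$, and $E$ denotes the subset of terms that can be eliminated given that the subset $S$ is in the model. Let $\boldsymbol{x}_I$ be the vector of $a$ terms from a candidate subset indexed by $I$, and let $\boldsymbol{x}_O$ be the vector of the remaining predictors (out of the candidate submodel). Suppose that $S$ is a subset of $I$ and that model (5) holds. Then

$$\boldsymbol{x}^T \boldsymbol{\beta} = \boldsymbol{x}_S^T \boldsymbol{\beta}_S = \boldsymbol{x}_S^T \boldsymbol{\beta}_S + \boldsymbol{x}_{I/S}^T \boldsymbol{\beta}_{(I/S)} + \boldsymbol{x}_O^T \boldsymbol{0} = \boldsymbol{x}_I^T \boldsymbol{\beta}_I \qquad (6)$$

where $\boldsymbol{x}_{I/S}$ denotes the predictors in $I$ that are not in $S$. Since this is true regardless of the values of the predictors,

$$\boldsymbol{\beta}_O = \boldsymbol{0} \ \text{ if } \ S \subseteq I. \qquad (7)$$

Forward selection forms a sequence of submodels $I_1, \ldots, I_M$ where $I_j$ uses $j$ predictors including the constant. Let $I_1$ use $x_1^* = x_1 \equiv 1$: the model has a constant but no nontrivial predictors. To form $I_2$, consider all models $I$ with two predictors including $x_1^*$. Compute

$$Q_2(I) = SSE(I) = \boldsymbol{r}^T(I)\,\boldsymbol{r}(I) = \sum_{i=1}^n r_i^2(I) = \sum_{i=1}^n \big(Y_i - \hat{Y}_i(I)\big)^2.$$

Let $I_2$ minimize $Q_2(I)$ for the $p - 1$ models $I$ that contain $x_1^*$ and one other predictor. Denote the predictors in $I_2$ by $x_1^*, x_2^*$. In general, to form $I_j$ consider all models $I$ with $j$ predictors including variables $x_1^*, \ldots, x_{j-1}^*$. Compute $Q_j(I) = \boldsymbol{r}^T(I)\,\boldsymbol{r}(I)$ and let $I_j$ minimize $Q_j(I)$ for the $p - j + 1$ models $I$ that contain $x_1^*, \ldots, x_{j-1}^*$ and one other predictor not already selected. Denote the predictors in $I_j$ by $x_1^*, \ldots, x_j^*$. Continue in this manner for $j = 2, \ldots, M$. Often $M = \min(\lceil n/J \rceil, p)$ for some integer $J$ such as $J = 5$, $10$, or $20$. Here $\lceil x \rceil$ is the smallest integer $\geq x$, e.g., $\lceil 7.7 \rceil = 8$.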
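The forward selection procedure described above can be sketched in a few lines. This is an illustrative sketch, not the author's implementation: the residual sum of squares of the OLS fit is assumed as the criterion minimized at each step, and the function names `sse` and `forward_selection` and the simulated data are hypothetical.

```python
import numpy as np

def sse(X, Y, cols):
    """Residual sum of squares of the OLS fit of Y on the columns `cols` of X."""
    coef, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
    resid = Y - X[:, cols] @ coef
    return float(resid @ resid)

def forward_selection(X, Y, M):
    """Return nested index sets I_1, ..., I_M (requires M <= number of columns).

    Column 0 of X is assumed to be the constant, so I_1 = [0].
    """
    selected = [0]
    path = [list(selected)]
    for _ in range(2, M + 1):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        # Add the single predictor that gives the smallest residual sum of squares.
        best = min(remaining, key=lambda j: sse(X, Y, selected + [j]))
        selected.append(best)
        path.append(list(selected))
    return path

rng = np.random.default_rng(2)
n, p = 200, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.0, 3.0, 0.0, -2.0, 0.0])   # hypothetical: S = {0, 2, 4}
Y = X @ beta + 0.5 * rng.normal(size=n)

path = forward_selection(X, Y, M=p)
```

Each entry of `path` lists the column indices of one submodel, so `path[j - 1]` plays the role of $I_j$. A separate criterion is still needed to pick the final submodel from this sequence.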
Consider the six methods: forward selection with OLS, PCR, PLS, lasso, relaxed lasso, and ridge regression. When there is a sequence of $M$ submodels, the final submodel $I_d$ needs to be selected. Let the candidate model $I$ contain $a$ terms, including a constant. For example, let $\boldsymbol{x}_I$ and $\hat{\boldsymbol{\beta}}_I$ be $a \times 1$ vectors for the methods excluding PCR and PLS. Then there are many criteria used to select the final submodel $I_d$.

OLS Sub-Model Theorem and Proof
This section proves Theorem 1 below and discusses its implications.
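Theorem 1. The model $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{e}$ can be written as

$$\boldsymbol{Y} = \boldsymbol{X}_I \boldsymbol{\beta}_I + \boldsymbol{X}_O \boldsymbol{\beta}_O + \boldsymbol{e},$$

where $\boldsymbol{X}_I$ is the matrix of the $a$ terms from a candidate subset indexed by $I$ and $\boldsymbol{X}_O$ is the matrix of the predictors that are out of the candidate submodel. If $S \not\subseteq I$, then

$$E(\hat{\boldsymbol{\beta}}_I) = \boldsymbol{\beta}_I + (\boldsymbol{X}_I^T \boldsymbol{X}_I)^{-1} \boldsymbol{X}_I^T \boldsymbol{X}_O \boldsymbol{\beta}_O.$$

Proof. Assume this is an arbitrary submodel, and $I$ does not contain $S$. Then, treating the predictors as fixed and taking $E(\boldsymbol{e}) = \boldsymbol{0}$,

$$\hat{\boldsymbol{\beta}}_I = (\boldsymbol{X}_I^T \boldsymbol{X}_I)^{-1} \boldsymbol{X}_I^T \boldsymbol{Y}$$

and

$$E(\hat{\boldsymbol{\beta}}_I) = (\boldsymbol{X}_I^T \boldsymbol{X}_I)^{-1} \boldsymbol{X}_I^T E(\boldsymbol{Y}) = (\boldsymbol{X}_I^T \boldsymbol{X}_I)^{-1} \boldsymbol{X}_I^T (\boldsymbol{X}_I \boldsymbol{\beta}_I + \boldsymbol{X}_O \boldsymbol{\beta}_O) = \boldsymbol{\beta}_I + (\boldsymbol{X}_I^T \boldsymbol{X}_I)^{-1} \boldsymbol{X}_I^T \boldsymbol{X}_O \boldsymbol{\beta}_O.$$

According to Theorem 1, when $S \not\subseteq I$, i.e., when the final submodel does not contain enough predictors, $E(\hat{\boldsymbol{\beta}}_I) \neq \boldsymbol{\beta}_I$ in general, and OLS will produce a poor final submodel. On the other hand, following equations (6) and (7), when the submodel contains the set of predictors $S$, $\boldsymbol{\beta}_O = \boldsymbol{0}$. Then $E(\hat{\boldsymbol{\beta}}_I) = \boldsymbol{\beta}_I$.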
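The effect of omitting a needed predictor can be checked by simulation. The sketch below assumes the standard omitted-variable identity $E(\hat{\beta}_I) = \beta_I + (X_I^T X_I)^{-1} X_I^T X_O \beta_O$ for fixed predictors and zero-mean errors, taken here as the form of the bias discussed in this section. The data, coefficient values, and number of replications are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Correlated predictors, so that omitting one shifts the coefficient of the other.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + 0.6 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x2, x3])
beta = np.array([1.0, 2.0, 3.0])   # hypothetical: every term is needed

I, O = [0, 1], [2]                 # candidate submodel omits x3, so S is not in I
X_I, X_O = X[:, I], X[:, O]
beta_I, beta_O = beta[I], beta[O]

A = np.linalg.solve(X_I.T @ X_I, X_I.T)   # (X_I^T X_I)^{-1} X_I^T
expected = beta_I + A @ X_O @ beta_O      # assumed expectation of the submodel OLS estimator

# Monte Carlo average of the submodel OLS estimator, with X held fixed.
reps = 2000
draws = np.empty((reps, len(I)))
for k in range(reps):
    Y = X @ beta + rng.normal(size=n)
    draws[k] = A @ Y
mc_mean = draws.mean(axis=0)

bias = expected - beta_I                  # nonzero because beta_O != 0
```

In this setup the simulated mean of the submodel estimator should track `expected` rather than `beta_I`, and the gap `bias` grows with the correlation between the omitted and retained predictors.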

Conclusions
This work showed mathematically why ordinary least squares can perform poorly when the submodel does not contain enough predictors.