Unbiased Estimation for Linear Regression When n < v

In this paper a new method is proposed for solving the linear regression problem when the number of observations n is smaller than the number of predictors v. The method uses ideas from graphical models and provides unbiased parameter estimates under certain conditions, whereas existing methods such as ridge regression, the least absolute shrinkage and selection operator (LASSO) and least angle regression (LARS) give biased estimates. The new method can also provide a detailed graphical correlation structure for the predictors, so that the real causal relationships between predictors and the response can be identified. In contrast, existing methods often cannot identify the real important predictors which have possible causal effects on the response variable. Unlike existing methods based on graphical models, the proposed method can identify the potential networks while doing regression even if the data do not follow a multivariate normal distribution. The new method is compared with existing methods such as ridge regression, LASSO and LARS on simulated and real data sets. Our experiments reveal that the new method outperforms all the other methods when n < v.


Introduction
Consider a linear regression model with a univariate response, v covariates and n independent and identically distributed (i.i.d.) observations. Let y = (y_1, ..., y_n)′ and x_j = (x_{1j}, ..., x_{nj})′ (j = 1, ..., v) be the observations for the response and the jth covariate, respectively. Denote the matrix of covariate observations by x = (x_1, ..., x_v). Assume the following linear regression model for the relationship between the response and the covariates,

y = x β + ϵ,  (1)

where β = (β_1, ..., β_v)′ is the vector of regression coefficients and the elements of ϵ = (ϵ_1, ..., ϵ_n)′ are i.i.d. random variables with mean 0 and variance σ².
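A small numerical sketch of why n < v is problematic for model (1): with fewer observations than covariates, x′x is singular, so the ordinary least squares estimate does not exist. The dimensions below are borrowed from the simulation later in the paper; the coefficient values are arbitrary placeholders.

```python
import numpy as np

# Generate data from model (1) with n = 15 < v = 20 (illustrative values).
rng = np.random.default_rng(0)
n, v = 15, 20
beta = np.zeros(v)
beta[:5] = 1.0                 # a few non-zero coefficients (hypothetical)
sigma = 0.5

x = rng.standard_normal((n, v))
x -= x.mean(axis=0)            # centre the covariates, as assumed later
y = x @ beta + sigma * rng.standard_normal(n)
y -= y.mean()                  # centre the response

# The v x v matrix x'x has rank at most n < v, so it cannot be inverted
# and the saturated least-squares estimate (x'x)^{-1} x'y is unavailable.
ssd = x.T @ x
rank = np.linalg.matrix_rank(ssd)   # at most n = 15, so ssd is singular
```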
When n < v, many methods have been proposed for the above model, such as the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani, 1996), Least Angle Regression (LARS) (Efron et al., 2004) and ridge regression (Hoerl & Kennard, 1970). These methods have mostly focused on the case where the number of non-zero coefficients is much smaller than n or v, usually called the 'sparse' case. These methods, however, provide biased estimates. In addition, the selected model based on LASSO and LARS can contain at most n covariates (Zou & Hastie, 2005; McCann & Welsch, 2007). This is problematic in areas where more, or even all, covariates have to be included in the model. Another problem with LASSO is overshrinking of the final coefficients, which may produce inaccurate estimates (James & Radchenko, 2009).
Ridge regression can include all covariates in the model, but the biased estimates make it difficult to judge the significance level of each covariate. It can also lead to a non-sparse model which is difficult to interpret when the number of features is large (Yuan et al., 2007). Other related approaches include the recent research of Candes & Tao (2007), Meinshausen & Yu (2009), Bickel et al. (2009), Zhang (2010) and Lin et al. (2014). However, their estimates are still biased, which may not be recommended in general (Washington et al., 2010; Zhang, 2010).
For regression analysis with n > v, the standard Least Squares Estimate (LSE) effectively uses the saturated model assumption, i.e. any pair of covariates is correlated conditional on the other covariates. When n < v, the fully saturated least squares approach cannot be used. A good alternative is an unsaturated model, obtained by adding constraints to the conditional dependency structure of the covariates. However, existing methods do not consider the conditional correlation structure and thus cannot identify the true model when n < v. This can be very important in genetic studies where the identification of gene networks is needed. Traditional graphical methods only focus on networks based on multivariate normal distributions; if the distribution is not multivariate normal and we need to do regression, traditional methods will not work.
This paper develops an unbiased estimation method via graphical models (GLSE), which can provide a much better solution than existing methods. The proposed theory has a strong connection with graphical models and takes into account the conditional correlation between covariates. Therefore it can possibly include the causal interpretation between response and covariates and identify the true regression model. Under certain mild assumptions, the estimator produced by the new method is unbiased, in addition to identifying the required networks.
Another advantage of GLSE is that it considers many more candidate models than existing methods. For a regression model with covariates (X_1, ..., X_v), the estimate produced when all X_j are independent should be different from the estimate produced when some X_j are correlated. By taking into account the detailed conditional correlation structure, the new method searches a much larger space of potential regression models than existing methods. With v covariates, existing methods carry out model selection over at most 2^v different sets of models. In contrast, if all covariates are to be included in the model, the GLSE can search over 2^{v(v−1)/2} different models, one for each conditional correlation structure among the covariates X_j.
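A quick check of these counts (pure arithmetic, using v = 20 as in the simulation later):

```python
# Number of candidate models for v = 20 covariates: subset selection
# chooses among at most 2^v covariate subsets, while the conditional
# correlation structures correspond to the 2^(v(v-1)/2) undirected
# graphs on v vertices (when all covariates stay in the model).
v = 20
subset_models = 2 ** v                     # 1,048,576
graph_models = 2 ** (v * (v - 1) // 2)     # 2^190, about 1.6e57
```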
This paper is structured as follows. We provide the necessary notation and definitions for graphical models in Section 2.1. In Section 2.2 we present the main methodology of GLSE and its properties. Model selection for the underlying graph structure is discussed in Section 2.3. Simulation studies and a real data analysis are given in Section 3.
The paper ends with a conclusion in Section 4.

Graphs
We follow the notation in Lauritzen (1996). One may also refer to Whittaker (2009) or Dawid and Lauritzen (1993) for more details on graphical models.
An undirected graph is denoted by G = (V, E), where V is the set of vertices, say V = {1, 2, ..., v}, and E is the set of edges, a subset of V × V. We use the notation {i, j} to denote the edge between vertices i and j. The elements of V correspond to the index set of all covariates in the regression model, so each covariate corresponds to a vertex in the graph G.
A graph is complete if all vertices are joined by an edge, and a subset is complete if it induces a complete subgraph of G. A complete subset that is maximal (with respect to ⊆) is called a clique.
A triple (A, B, C) of disjoint subsets of the vertex set V of an undirected graph G is said to form a decomposition of G if V = A ∪ B ∪ C and the following conditions hold (Lauritzen, 1996): • B separates A from C, i.e. every path from a vertex in A to a vertex in C passes through B; • B is a complete subset of V. The decomposition is proper if A and C are both non-empty. An undirected graph G is decomposable if it satisfies one of the following: • G is complete; • There is a proper decomposition (A, B, C) into decomposable subgraphs G_{A∪B} and G_{B∪C}.
Consider a sequence of sets C_1, ..., C_q and define

H_j = C_1 ∪ · · · ∪ C_j,   S_j = H_{j−1} ∩ C_j.

If the following conditions are satisfied, the sequence C_1, ..., C_q is said to be a perfect sequence (Lauritzen, 1996): • For every j > 1, there exists an i < j such that S_j ⊆ C_i; • The sets S_j are complete for all values of j. The sets S_j are called the separators of the sequence. A decomposable graph G allows a perfect ordering of its cliques (Golumbic, 2004).
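The first condition above (the running intersection property) is easy to verify mechanically. A minimal sketch, not from the paper, with two illustrative clique sequences:

```python
# Check the running-intersection part of the perfect-sequence definition:
# for every j > 1 the separator S_j = (C_1 ∪ ... ∪ C_{j-1}) ∩ C_j must be
# contained in some earlier clique C_i. (Completeness of the S_j depends
# on the graph and is not checked here.)
def has_running_intersection(cliques):
    for j in range(1, len(cliques)):
        h = set().union(*cliques[:j])         # H_{j-1} = C_1 ∪ ... ∪ C_{j-1}
        s_j = h & set(cliques[j])             # separator S_j
        if not any(s_j <= set(c) for c in cliques[:j]):
            return False
    return True

# Cliques of the decomposable path graph 1-2-3 in a perfect order.
assert has_running_intersection([{1, 2}, {2, 3}])
# The 4-cycle 1-2-3-4-1 is not decomposable; listing its edges as
# "cliques" violates the running intersection property.
assert not has_running_intersection([{1, 2}, {2, 3}, {3, 4}, {4, 1}])
```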

Matrices and Vectors
Throughout this paper, we reserve the capital letters A, B, C and S to denote subsets of V, and |·| to denote the number of elements of an edge or vertex set. The lower-case letters i, j, k are reserved for indices.
The following notation mainly follows the style of Lauritzen (1996). A vector Z ∈ R^v can also be written as (Z_j)_{j∈V}, and its jth element Z_j as (Z)_j. Denote by Z_A the a-dimensional subvector (a = |A|) of Z with Z_A = (Z_j)_{j∈A}, and by [Z_A]^V the v-dimensional vector obtained by filling up with 0s, i.e. ([Z_A]^V)_j = Z_j for j ∈ A and 0 otherwise. We define similar notation for matrices. A v × v matrix z can also be written as (z_{kj})_{k,j∈V}. We may also write z_{AB} = (z_{kj})_{k∈A,j∈B} for the submatrix of z with rows indexed by A and columns by B, and [z_{AA}]^V for the v × v matrix obtained from z_{AA} by filling the remaining entries with 0s. With the above notation, x_A denotes the covariate matrix containing only the covariates with indices in the set A.

The Idea of Unbiased Estimation via Graphical Model
In this section, we present the basic idea of the GLSE for the regression model (1). We assume that all response and covariate values have been centered and that the covariate variables are quantitative. First we introduce a condition.

Condition 1 Suppose that the set V can be partitioned into disjoint sets A, B and C and that the sample size satisfies n > max{|A ∪ B|, |B ∪ C|}.

Denote the sample covariance matrix by ssd = x′x. Given that Condition 1 holds, an estimator for β can be defined as

β̂_s = ( [(ssd_{A∪B,A∪B})^{−1}]^V + [(ssd_{B∪C,B∪C})^{−1}]^V − [(ssd_{B,B})^{−1}]^V ) x′y.  (5)

Clearly, under Condition 1 the matrix inversions in the above formula are available.
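A numerical sketch of a three-block estimator of this kind, assuming it combines the zero-padded inverses [·]^V of ssd over the blocks A∪B and B∪C and subtracts the separator term over B before applying x′y. The index sets A, B, C and all data below are illustrative, not from the paper.

```python
import numpy as np

def pad_inverse(ssd, idx, v):
    """Embed inv(ssd[idx, idx]) into a v x v zero matrix (the [.]^V padding)."""
    out = np.zeros((v, v))
    ix = np.ix_(list(idx), list(idx))
    out[ix] = np.linalg.inv(ssd[ix])
    return out

def glse_three_block(x, y, A, B, C):
    v = x.shape[1]
    ssd = x.T @ x
    k_hat = (pad_inverse(ssd, sorted(A | B), v)
             + pad_inverse(ssd, sorted(B | C), v)
             - pad_inverse(ssd, sorted(B), v))
    return k_hat @ x.T @ y

rng = np.random.default_rng(1)
n, v = 15, 20                                  # n < v, as in the paper
A, B, C = set(range(8)), set(range(8, 12)), set(range(12, 20))
x = rng.standard_normal((n, v))
x -= x.mean(axis=0)                            # centred covariates
y = x @ np.ones(v) + 0.5 * rng.standard_normal(n)
beta_hat = glse_three_block(x, y, A, B, C)     # needs n > |A∪B| and n > |B∪C|
```

Only the 12 × 12 blocks of ssd are inverted, which is why the estimator exists even though the full 20 × 20 matrix ssd is singular.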
To discuss the unbiasedness property of the above estimator, we first introduce some definitions. For an estimator β̂, we introduce three different types of unbiasedness regarding the regression model given in (1).
1. Unbiased estimate (UE): If E_{x,y}(β̂) = β, then the estimator β̂ is said to be unbiased.
2. Strong-conditional unbiased estimate (SUE): If E_y(β̂ | x) = β, then the estimator β̂ is said to be strongly conditionally unbiased.
3. Weak-conditional unbiased estimate (WUE): If E_{x_{V\A},y}(β̂ | x_A) = β, then the estimator β̂ is said to be weakly conditionally unbiased, where A denotes a proper non-empty subset of V.
It is easy to show that SUE ⇒ WUE ⇒ UE, where ⇒ means "implies".
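The implication SUE ⇒ UE is just the tower property of expectations:

```latex
% If E_y(\hat\beta \mid x) = \beta holds for every design matrix x (SUE),
% then averaging over x gives unconditional unbiasedness (UE):
\[
  E_{x,y}\bigl(\hat\beta\bigr)
  = E_x\!\bigl[\, E_y(\hat\beta \mid x) \,\bigr]
  = E_x[\beta]
  = \beta .
\]
% SUE => WUE follows in the same way, averaging only over x_{V \setminus A}
% while conditioning on x_A.
```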
Write the covariate matrix as x = (x_A, x_B, x_C); then under the following condition we can show that (5) is an unbiased estimate.

Condition 2
The sets A, B and C form a decomposition, with B as the separator, and x_A and x_C are conditionally independent given x_B, such that (a) E(x_C | x_B) = x_B r_{BC} and E(x_A | x_B) = x_B r_{AB}, where r_{BC} and r_{AB} are constant matrices; (b) the residual matrices x_A − x_B r_{AB} and x_C − x_B r_{BC} are independent of each other and of x_B.

We are now ready to introduce the following theorem.

Theorem 1 Under Condition 1 and Condition 2, the estimator β̂_s in (5) is unbiased: E_{x,y}(β̂_s) = β.
Remark 1 The standard least squares estimate (available when n > v) is unbiased. When n < v, all existing methods such as ridge regression, LASSO, or LARS are biased, i.e. E_{x,y}(β̂_other) ≠ β.
Remark 2 Note that Condition 2 will be satisfied in many cases. For example, it holds if X = (X_1, ..., X_v) follows a multivariate normal distribution N(0, Σ) and the distribution of X factorizes according to a decomposable graph g = (V, E_g) with cliques A ∪ B and B ∪ C and separator B. Condition 2 implies the conditional independence of x_A and x_C given x_B and a linear relationship between x_A (and x_C) and x_B. Given a different underlying graph structure for the covariates, the proposed estimator will be different. In practice, we can search for the best underlying graph structure; this is discussed in Section 2.3.

Remark 3
The linearity assumption in Condition 2 does not limit the applicability of the method, since one can always use a variable transformation to achieve a linear relationship when the covariates are quantitative. In addition, as any non-linear relation can be approximated by a polynomial, a nonlinear dependence on X can be viewed as a linear dependence on X, X², ....
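The polynomial device mentioned above amounts to adding powers of a covariate as extra columns of the design matrix. A minimal sketch (the degree is an illustrative choice, not from the paper):

```python
import numpy as np

def polynomial_expand(x_col, degree=2):
    """Return [x, x^2, ..., x^degree] as columns of a design matrix, so a
    nonlinear dependence on x becomes a linear dependence on its powers."""
    return np.column_stack([x_col ** d for d in range(1, degree + 1)])

x1 = np.linspace(-1.0, 1.0, 10)
design = polynomial_expand(x1, degree=3)   # columns: x1, x1^2, x1^3
```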

The General Unbiased Estimates
In general, the associated graph g may decompose into many cliques and separators. Suppose that the graph g is decomposable, and let C denote the set of its cliques and S the set of its separators. The general formula of the GLSE is

β̂_g = ( Σ_{C∈C} [(ssd_{C,C})^{−1}]^V − Σ_{S∈S} [(ssd_{S,S})^{−1}]^V ) x′y.  (11)

Obviously, the GLSE exists under the following condition.
Condition 3 The sample size satisfies n > max_{C∈C} |C|.
Under the following additional condition, we can also show that β̂_g is unbiased.
Condition 4 Write the cliques and separators of g in a perfect ordering C_1, S_2, C_2, ..., S_q, C_q, where for each j = 2, ..., q the triple (H_{j−1} \ S_j, S_j, C_j \ S_j), with H_{j−1} = C_1 ∪ · · · ∪ C_{j−1}, satisfies Condition 2.

Proposition 1 Under Condition 3 and Condition 4, the estimator in (11) is unbiased: E_{x,y}(β̂_g) = β. The proof of Proposition 1 can be done recursively, in a similar way to Theorem 1.
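A hedged sketch of the general estimator: sum the zero-padded inverses of ssd over the cliques, subtract those over the separators, and apply the result to x′y. The chain graph on 5 vertices below (cliques of size 2, separators of size 1) is an illustrative choice, not from the paper.

```python
import numpy as np

def pad_inverse(ssd, idx, v):
    """Embed inv(ssd[idx, idx]) into a v x v zero matrix (the [.]^V padding)."""
    out = np.zeros((v, v))
    ix = np.ix_(list(idx), list(idx))
    out[ix] = np.linalg.inv(ssd[ix])
    return out

def glse(x, y, cliques, separators):
    """Clique terms minus separator terms, applied to x'y."""
    v = x.shape[1]
    ssd = x.T @ x
    k_hat = sum(pad_inverse(ssd, c, v) for c in cliques)
    k_hat = k_hat - sum(pad_inverse(ssd, s, v) for s in separators)
    return k_hat @ x.T @ y

rng = np.random.default_rng(3)
n, v = 4, 5                              # n > max clique size 2 (Condition 3)
cliques = [(0, 1), (1, 2), (2, 3), (3, 4)]
separators = [(1,), (2,), (3,)]
x = rng.standard_normal((n, v))
y = x @ np.ones(v) + 0.5 * rng.standard_normal(n)
beta_hat = glse(x, y, cliques, separators)
```

With a single clique containing all of V and no separators (the saturated graph) and n > v, this reduces to the ordinary least squares estimate, since the sum collapses to (x′x)^{−1} x′y.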

Covariance Matrix Estimation for the GLSE
It is not easy to simplify the variance of β̂_g and derive a simple estimator for it. The bootstrap estimate is a good alternative. We can also use the plug-in estimator

v̂ar(β̂_g | x) = σ̂² K̂ ssd K̂′,  (14)

where K̂ = Σ_{C∈C} [(ssd_{C,C})^{−1}]^V − Σ_{S∈S} [(ssd_{S,S})^{−1}]^V is the matrix applied to x′y in (11) and σ̂² = ‖y − x β̂_g‖²/n, since any point estimate can always be plugged in for its own mean. We will show that (14) does not work well for small sample sizes, but when the sample size is large the estimator works well.
When we use the bootstrap to estimate var(β̂_g), care should be taken that a bootstrapped sample is used only if the number of distinct observations in the resampled data is greater than the number of variables in the maximum clique.
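The resampling rule above can be sketched as follows. `estimator` is a stand-in for the GLSE (here a minimum-norm least-squares fit, an illustrative substitute), and the clique-size bound is a hypothetical value.

```python
import numpy as np

def bootstrap_var(x, y, estimator, max_clique, n_boot=100, seed=0):
    """Bootstrap variance of an estimator, keeping a resample only if its
    number of distinct rows exceeds the largest clique size, so that the
    clique-wise matrix inversions would remain well defined."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    draws = []
    while len(draws) < n_boot:
        rows = rng.integers(0, n, size=n)       # resample rows with replacement
        if np.unique(rows).size <= max_clique:  # too few distinct observations
            continue
        draws.append(estimator(x[rows], y[rows]))
    return np.var(np.stack(draws), axis=0, ddof=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((15, 20))
y = x[:, 0] + 0.5 * rng.standard_normal(15)
fit = lambda xb, yb: np.linalg.lstsq(xb, yb, rcond=None)[0]  # placeholder
se = np.sqrt(bootstrap_var(x, y, fit, max_clique=5))
```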

Discussion on the Unbiasedness and Condition 4
Condition 4 is reasonable in many real applications. When there are many predictors, some of them are likely to be conditionally independent given a subset of the variables. For example, in finance the return of a portfolio may depend on a large number of assets. If we simply look at the correlations of these assets, they may all be correlated marginally; however, some of them may be independent given a certain subset (for further examples see Carvalho & West (2007)). Also, in bioinformatics studies the covariates may be a pool of genes, where some genes may be related through an intermediate gene. There is a vast literature on graphical models, which can learn the conditional dependency structure of many random variables even when the sample size is very small. For further reading see Dobra et al. (2004); Markowetz & Spang (2007); Kramer et al. (2009); Talluri & Shete (2014) and the works cited there.
Condition 4 allows us to study the detailed conditional independence/correlation structure under a regression framework, and this helps solve the challenges when n < v. Traditional analyses do not consider the conditional dependency between the covariates. It is fine to ignore this and apply the saturated model (any pair of covariates is conditionally correlated given the remaining ones) when the sample size is large enough. But when n < v, certain conditions have to be added to achieve a reasonable result. Traditional methods use other constraints or penalty terms and trade off bias against variance. This distorts the real conditional correlation between covariates, which is why existing methods for n < v may produce spurious correlations between the predictors and the response and cannot identify the real important causal predictors. For instance, Sun & Zhang (2010) showed that with at least a 50% chance LASSO selects variables for predicting y that may be totally non-significant, leading to spurious correlations. Researchers in these fields often select certain variables manually without any statistical justification.

Graphical Model Selection
In practice, the underlying graph structure g is unknown, and we need an algorithm to find the graph g that best fits the data. This is another advantage of GLSE, since the underlying graph structure can lead to causal interpretations, which cannot be obtained with existing methods.
There have been some covariance model selection methods developed under a Bayesian framework, such as Jones et al. (2005) and Dai (2008). Here we propose a new method for graphical model selection under a regression framework. It is natural to consider the following criterion for model selection: the best graph is found by minimizing a target function T(β, g, λ),
(β̂, ĝ, λ̂) = arg min_{β, g∈G, λ} T(β, g, λ),  (15)

where T(β, g, λ) = ‖y − x β‖² + λ|E_g|, G is the set of all possible graphs and |E_g| is the number of edges in graph g. This criterion searches for the estimate and graph structure corresponding to the minimum of the sum of squared errors plus a penalty term.
The penalty term controls how much the graph is shrunk. In practice, we may fix the value of λ and only search over β and g. If one prefers a very sparse graph, a large value of λ should be chosen; if one prefers a more saturated graph, a small value of λ should be chosen.
We recommend considering more saturated graphs in practice, because more saturated graphs allow more information on the correlation structure. To show this, a graphical comparison was made on a data set generated as in Section 3.1. For convenience, the generated graph is allowed at most 30 added edges, shown on the y-axis of Figure 1. The figure shows that a smaller value of λ gives a smaller sum of squared errors, i.e. more accurate prediction. Therefore we may simply set λ = 0 and ignore the penalty term, except in situations where a sparse graph is preferred. In the analysis of the real data, λ = 0 is used.
When searching the whole graph space G, we can use a stepwise selection procedure. The method considers adding/deleting edges one by one to/from the current graph. When an edge under consideration is not in the current graph, it is added if the addition improves the predetermined criterion; otherwise it is not added. When an edge under consideration is in the current graph, it is deleted if the deletion improves the predetermined criterion; otherwise it is not deleted. Similar graphical model selection algorithms for multivariate Gaussian models have been proposed by Jones et al. (2005).
As an example, we used this algorithm to find the underlying graph structure for the data set generated in Section 3.1. The true graph structure is given in Figure 3; the two selected graphs for different sample sizes are given in Figure 2.

Algorithm 1 Pseudocode of the GLSE graph selection
1: Start with a graph g = (V, E), which can be the empty graph (or any given decomposable graph) such that n > max_{C∈C} |C|.
2: Generate all possible graphs g_i that differ from the current graph g by exactly one edge, keeping only those g_i that have no chordless K-cycles (K ≥ 4, so that they remain decomposable) and satisfy n > max_{C∈C} |C|.
3: Find the graph g*_i that minimises the target function T(·) given in (15).
4: Go to step 2 with the selected graph g*_i and iterate until the best graph is found.
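A runnable sketch of this stepwise search on a toy problem with v = 3 covariates, where every undirected graph is decomposable and the clique/separator decompositions can be written out by hand. The target follows the sum-of-squared-errors-plus-penalty form of (15); the GLSE formula, the data and the λ value are illustrative.

```python
import numpy as np
from itertools import combinations

def pad_inverse(ssd, idx, v):
    """Embed inv(ssd[idx, idx]) into a v x v zero matrix."""
    out = np.zeros((v, v))
    ix = np.ix_(list(idx), list(idx))
    out[ix] = np.linalg.inv(ssd[ix])
    return out

def glse(x, y, cliques, separators):
    """Clique terms minus separator terms, applied to x'y."""
    v = x.shape[1]
    ssd = x.T @ x
    k = sum(pad_inverse(ssd, c, v) for c in cliques)
    k = k - sum(pad_inverse(ssd, s, v) for s in separators)
    return k @ x.T @ y

def decompose(edges):
    """Cliques and separators for each graph on vertices {0, 1, 2}."""
    e = frozenset(frozenset(t) for t in edges)
    if len(e) == 3:                    # triangle: one clique, no separator
        return [(0, 1, 2)], []
    if len(e) == 2:                    # path: the shared vertex separates
        a, b = [tuple(sorted(t)) for t in e]
        return [a, b], [tuple(set(a) & set(b))]
    if len(e) == 1:                    # one edge plus an isolated vertex
        (a,) = [tuple(sorted(t)) for t in e]
        return [a] + [(i,) for i in {0, 1, 2} - set(a)], []
    return [(0,), (1,), (2,)], []      # empty graph

def target(x, y, edges, lam):
    """T(g) = sum of squared errors of the GLSE fit + lam * |E_g|."""
    beta = glse(x, y, *decompose(edges))
    return np.sum((y - x @ beta) ** 2) + lam * len(edges)

def stepwise(x, y, lam=0.0):
    """Greedily toggle single edges while the target keeps improving."""
    all_edges = list(combinations(range(3), 2))
    current = set()
    best = target(x, y, current, lam)
    while True:
        moves = [current ^ {e} for e in all_edges]   # add or delete one edge
        scores = [target(x, y, m, lam) for m in moves]
        if min(scores) >= best:
            return current, best
        best = min(scores)
        current = moves[int(np.argmin(scores))]

rng = np.random.default_rng(2)
x = rng.standard_normal((10, 3))
y = x @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(10)
g_best, t_best = stepwise(x, y, lam=0.0)
```

For v = 3 the hand-written `decompose` suffices; for larger v, step 2 of Algorithm 1 additionally requires a chordality check on every candidate graph.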
In the original graph there are 13 edges with conditional correlations greater than 0.1 in absolute value. The graph chosen by the algorithm for n = 15, Figure 2(a), detected 9 of these edges, and also detected 4 of the edges of the original graph with conditional correlation less than or equal to 0.1 in absolute value. The graph for n = 100 detected 11 edges with conditional correlations greater than 0.1 in absolute value and 7 edges with conditional correlation less than or equal to 0.1 in absolute value. It follows that by increasing the sample size, most of the original edges can be detected by the algorithm.
The model selection step is the most challenging part of the GLSE method, since once a good graph structure is identified, the GLSE for the regression parameters is straightforward. Searching the whole graph space is computationally heavy, but step 2 of Algorithm 1 can be sped up significantly via parallel computation.

Results
This section provides simulation studies and a real data analysis to assess the proposed method empirically.

Simulation
In this section, we provide a detailed simulation study showing that the GLSE has a much smaller bias than existing methods such as LASSO, ridge regression and LARS. We compare the results of our proposal on the simulation model with those of LASSO, LARS and ridge regression, where the penalty parameters are obtained by cross-validation. Note that the GLSE β̂_g depends on the graph g; we therefore provide simulation results both when the true graph is known and when it is unknown.
We simulate data from model (1) with v = 20 and (X_1, ..., X_v) following a multivariate normal distribution with mean 0 and variance-covariance matrix Σ. The concentration matrix Σ^{−1} is associated with the graph shown in Figure 3, where the conditional correlations are indicated by the numbers above each edge (for the case when the true graph is known). The response y is then generated from model (1), where the random errors ϵ are normally distributed with mean 0 and standard deviation σ = 0.5. We choose n = 15 < v = 20. The true parameter values of β are given in Table 1.
Table 1. Results from the simulation (with known and unknown graphs), n = 15, v = 20; SD: the Monte Carlo standard error over the 500 replicates; SE: the mean of the 500 standard error estimates.

Table 1 gives coefficient estimates and Monte Carlo estimates of standard errors for ridge, LASSO, LARS and the proposed GLSE method, as well as the standard error estimates for the GLSE (with known and unknown graph structures). When the true graph is known, GLSE outperforms the other three methods in the coefficient estimates. Although the Monte Carlo estimates of the standard errors for ridge, LASSO and LARS are smaller than that of the GLSE, the new method, being unbiased, should be preferred. We can also see from the results that the mean of the 500 bootstrap standard error estimates is very close to the Monte Carlo standard deviation of the 500 replicated estimates, which implies that the bootstrap estimation of the standard error performs well. However, we also noted that the standard error estimate based on (14) does not work well for small sample sizes; this problem can be fixed by increasing n.
In the case when the graph is estimated, the results given in the last 4 columns of Table 1 lead to the same conclusion. Here also, the GLSE outperforms the other methods in coefficient estimation.
The computational cost of the proposed algorithm is not heavy with modern parallel computing technology. Both serial and parallel computing times were considered for the case when the true graph is unknown. On a machine with 8 GB of memory and a 2.7 GHz processor, the serial computation took approximately 10 hours; with parallel processing on 60 cores, the computation time was reduced to approximately 9 minutes. When the true graph structure is known, parallel computing is not needed, as the serial computation (using one processor) takes just a few seconds.

Real Data Analysis
We consider the data used in Scheetz et al. (2006), which consist of gene expressions from eye tissues of 12-week-old rats. The data set has 120 observations on more than 31,000 genes. It is available within the R library "flare" (Li et al., 2014), where the dimension of the data set is reduced to the 200 genes most likely to be related to TRIM32. The gene TRIM32 carries significant biological information; for example, this gene might lead to Bardet-Biedl syndrome, a heterogeneous genetic disease affecting several organs including the eye retina (Huang et al., 2008).
In summary, the data analysis is a regression problem with n = 120 and v = 200, where identifying the correlation structure in the data set is desirable since most of the variables are correlated. To show this, 10 variables were randomly selected from the data; their scatter plot matrix, depicting the marginal correlations, is given in Figure 4. We applied ridge, LASSO, LARS and the proposed GLSE method to this data set. For ridge, LASSO and LARS, cross-validation was used to obtain the penalty parameter. LASSO found 24 non-zero coefficients, while LARS found 25 coefficients to be non-zero. Ridge regression selected all the variables; its results are not reported here as it cannot give any information about significance levels. The proposed method kept all 200 variables in the model, among which 7 coefficients are significantly non-zero at the 10% significance level. The estimated gene network (conditional correlation structure) for the covariates is given in Figure 5.

Our proposed method is based on a greedy search algorithm which searches a huge space of potential graphs. This algorithm can be sped up significantly if parallel computing is used (step 2 of Algorithm 1). A similar discussion can be found in Jones et al. (2005).
To improve the efficiency of Algorithm 1, we may also focus on the most saturated graphs. This is because our simulation studies showed that increasing the number of edges decreases the sum of squared errors (Figure 1). Although this is quite reasonable, it is non-trivial to prove that increasing the number of edges in the graph gives the GLSE a more accurate predicted value for the response. We leave this to future work.

Figure 1 .
Figure 1. The effect of choosing different values of λ for the penalty term. Solid line: sum of squared errors; dotted line: penalty function; dashed line: the number of edges corresponding to the graph giving the minimum target function value.

Figure 4 .
Figure 4. Scatter plot (marginal correlations) for 10 randomly chosen variables of the data. Most of the variables are correlated with each other.

Figure 5 .
Figure 5. Estimated gene network (conditional correlation structure) for the covariates in the real data.