The Hannan-Quinn Proposition for Linear Regression

We consider the variable selection problem in linear regression. Suppose we have random variables $X_1,\dots,X_m,Y,\epsilon$ such that $Y=\sum_{k\in \pi}\alpha_kX_k+\epsilon$, where $\pi\subseteq \{1,\dots,m\}$ and $\alpha_k\in {\mathbb R}$ are unknown, and $\epsilon$ is independent of any linear combination of $X_1,\dots,X_m$. Given $n$ examples $\{(x_{i,1},\dots,x_{i,m},y_i)\}_{i=1}^n$ emitted from $(X_1,\dots,X_m, Y)$, we wish to estimate the true $\pi$ using information criteria of the form $H+(k/2)d_n$, where $H$ is the likelihood with respect to $\pi$ multiplied by $-1$, and $\{d_n\}$ is a positive real sequence. If $d_n$ is too small, consistency fails because of overestimation. For autoregression, Hannan and Quinn proved that, in their setting of $H$ and $k$, the rate $d_n=2\log\log n$ is the minimum satisfying strong consistency. This paper proves the corresponding statement for linear regression, which has a completely different setting.


Introduction
Information criteria such as AIC and MDL/BIC are used for model selection, and each such problem amounts to estimating, from finitely many given examples, how many independent parameters exist: on how many variables another variable depends in linear regression (LR); on how many previous values the next value depends in autoregression (AR); and so on.
For each model $g$, we evaluate two factors: 1. how well the examples explain the model $g$; and 2. how simple the model $g$ is; and we balance them numerically. Let $\{d_n\}_{n=1}^\infty$ be nonnegative reals such that $d_n/n \to 0$, $H(g)$ the empirical entropy, which is the maximum likelihood multiplied by $-1$, and $k(g)$ the number of parameters in model $g$. By an information criterion, we mean the quantity
$$H(g)+\frac{k(g)}{2}d_n\ ,\qquad (1)$$
and we estimate the model $g$ by finding the one with the minimum value. For example, $d_n = 2$ for AIC, and $d_n = \log n$ for MDL/BIC. Hence, there are as many information criteria as there are sequences $\{d_n\}_{n=1}^\infty$, so it is impossible to list all information criteria of the form (1).
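As a numerical illustration of the selection rule (1) (a minimal sketch; the $(H, k)$ pairs below are made-up values, not data from the paper):

```python
import math

def information_criterion(H, k, d_n):
    """The quantity (1): H(g) + (k(g)/2) * d_n."""
    return H + 0.5 * k * d_n

# common choices of the sequence d_n
def d_aic(n): return 2.0                          # AIC
def d_bic(n): return math.log(n)                  # MDL/BIC
def d_hq(n):  return 2.0 * math.log(math.log(n))  # Hannan-Quinn rate

# hypothetical (H, k) pairs for three nested candidate models
models = [(-120.0, 1), (-125.0, 2), (-126.5, 3)]
n = 1000
best = min(models, key=lambda m: information_criterion(m[0], m[1], d_bic(n)))
```

With these illustrative values, MDL/BIC selects the two-parameter model, while AIC, whose penalty grows more slowly, would select the three-parameter one.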
In model selection, in particular for theoretical analyses, we often ask whether consistency holds for each $\{d_n\}$, namely, whether the sequence of selected models converges to the correct one as $n \to \infty$ in the following senses: 1. the probability that the model selected for each $n$ is correct converges to one (weak consistency); and 2. the set (event) of infinite sequences in which at most finitely many errors occur has probability one (strong consistency).
Both properties are satisfied by MDL/BIC ($d_n = \log n$), whereas neither is satisfied by AIC ($d_n = 2$). In general, if $d_n$ is too small, strong consistency fails because of overestimation. This paper addresses the minimum order of $\{d_n\}$ satisfying strong consistency; seeking such a condition is of theoretical interest in model selection (in fact, many information criteria are considered satisfactory even though consistency is not achieved).
The definitions of the empirical entropy and the number of parameters differ in each problem under consideration. In 1979, Hannan and Quinn proved that for AR, $d_n = 2\log\log n$ is the minimum order satisfying strong consistency (the Hannan-Quinn proposition). However, the same $d_n = 2\log\log n$ has been applied to problems other than AR. In fact, the proof of the Hannan-Quinn proposition depends essentially on the properties of the AR problem, as is clear from the original paper by Hannan and Quinn, and the proposition had not been proved for any other problem, including the LR problem. Nevertheless, without noticing this, the information criterion HQ was applied to those problems.
Recently, the Hannan-Quinn proposition was proved for estimating classification rules, which has many applications such as Markov order estimation, data mining, and pattern recognition (Suzuki, 2006). This paper shows that the Hannan-Quinn proposition is true for estimating dependencies in LR, which seems to be of great significance: otherwise, there would be no reason to use HQ in LR. Several authors suggested that $d_n = c\log\log n$ with some positive constant $c$ would suffice (Rao-Wu, 1989), so there has been evidence that the proposition is true, although no formal proof had appeared. This paper proves that such a $c$ is any constant strictly greater than two.
In Section 2, we briefly overview how the Hannan-Quinn proposition was proved in AR. In Section 3, we derive the asymptotic error probability of model selection in LR when information criteria are applied, which will be an important step to prove the main result. In Section 4, we give a proof of the Hannan-Quinn proposition for LR. Section 5 summarizes the results in this paper and gives a future problem.
Throughout the paper, we denote by X(Ω) the image {X(ω)|ω ∈ Ω} of a random variable X : Ω → R, where Ω is the underlying sample space.

Auto Regression
Let $\{\epsilon_i\}_{i=-\infty}^{\infty}$ be a sequence of independent and identically distributed random variables with expectation zero and variance one, and let $\{X_i\}_{i=-\infty}^{\infty}$ be defined by
$$X_i=\sum_{m=1}^{k}\lambda_m X_{i-m}+\sigma_k\epsilon_i\ ,$$
where we assume the expectation of each $X_i$ to be zero. Since $\{X_i\}$ is stationary, we obtain for $m \ge 1$ the following equation (Yule-Walker):
$$\gamma_m=\sum_{j=1}^{k}\lambda_j\gamma_{m-j}\ ,$$
where $\gamma_m := EX_iX_{i+m}$ does not depend on $i$. Using Cramer's formula, from the values of $\{\gamma_m\}_{m=0}^{k}$ we obtain the values of $\lambda_0 := \sigma_k^2$ and $\{\lambda_m\}_{m=1}^{k}$ as a solution of the $(k+1)\times(k+1)$ linear equations:
$$\gamma_0=\lambda_0+\sum_{j=1}^{k}\lambda_j\gamma_j\ ,\qquad \gamma_m=\sum_{j=1}^{k}\lambda_j\gamma_{m-j}\quad(1\le m\le k)\ .\qquad (2)$$
Since the values of $\{\gamma_m\}_{m=0}^{k}$ are generally unknown, we need to estimate
$$\bar\gamma_m:=\frac{1}{n}\sum_{i=1}^{n-m}x_ix_{i+m}$$
from the examples $x_1,\dots,x_n$. Then, we obtain the Yule-Walker equation with $\{\gamma_m\}$ replaced by $\{\bar\gamma_m\}$. In particular, if the order $k$ is unknown, we solve the above linear equation for each $k$ to calculate the value of
$$\frac{n}{2}\log\bar\sigma_k^2+\frac{k}{2}d_n\ .\qquad (3)$$
We estimate the true $k=k^*$ by the $k=\bar k$ that minimizes (3). This process is called estimating the AR order. Then, we also obtain the solutions $\bar\lambda_{0,\bar k}:=\bar\sigma_{\bar k}^2$ and $\{\bar\lambda_{m,\bar k}\}_{m=1}^{\bar k}$ of (2) with $k=\bar k$.
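The estimation step above can be sketched numerically (a minimal sketch; the simulated AR(1) series, its coefficient 0.5, and the variable names are illustrative assumptions, not the paper's data):

```python
import numpy as np

def sample_autocov(x, m):
    # \bar{gamma}_m = (1/n) * sum_{i=1}^{n-m} x_i x_{i+m}
    n = len(x)
    return float(np.dot(x[:n - m], x[m:])) / n

def yule_walker(x, k):
    # Solve the k x k block of (2) with estimated autocovariances,
    # then recover lambda_0 := sigma_k^2 from the first equation.
    g = [sample_autocov(x, m) for m in range(k + 1)]
    G = np.array([[g[abs(i - j)] for j in range(k)] for i in range(k)])
    rhs = np.array(g[1:k + 1])
    lam = np.linalg.solve(G, rhs)
    sigma2 = g[0] - float(lam @ rhs)
    return lam, sigma2

# simulate an AR(1) series X_i = 0.5 X_{i-1} + eps_i and re-estimate it
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.5 * x[i - 1] + rng.standard_normal()
lam, sigma2 = yule_walker(x, 1)  # lam[0] near 0.5, sigma2 near 1
```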
In general, for $k \le k^*-1$, the ratio $\bar\sigma_{k^*}^2/\bar\sigma_k^2$ almost surely converges to a value less than one, so with probability one the order is underestimated at most finitely often. On the other hand, for $k \ge k^*+1$, $\bar\sigma_{k^*}^2/\bar\sigma_k^2$ almost surely converges to one, and Hannan and Quinn (1979) proved from the law of the iterated logarithm that
$$\limsup_{n\to\infty}\frac{n\log(\bar\sigma_{k^*}^2/\bar\sigma_k^2)}{2\log\log n}\le k-k^*$$
with probability one, and that for $d_n = 2c\log\log n$ ($c > 1$), the order is overestimated at most finitely often, with probability one.

Linear Regression
Let $X_1,\dots,X_m$ be random variables among which there are no linear relations: no nontrivial linear combination of $X_1,\dots,X_m$ is zero with probability one. Let $\epsilon \sim N(0,\sigma^2)$ be a normal random variable with expectation zero and variance $\sigma^2 > 0$, and let
$$Y=\sum_{j=1}^{p}\alpha_jX_j+\epsilon\ ,$$
where $\alpha := [\alpha_1,\dots,\alpha_p]^T \in {\mathbb R}^p$ ($0 \le p \le m$). We assume that $\epsilon$ is independent of any linear combination of $X_1,\dots,X_m$. Suppose we do not know the values of the order $p$ and the coefficients $\alpha$, and that we are given $n$ independently emitted examples $\{(x_{i,1},\dots,x_{i,m},y_i)\}_{i=1}^n$ such that the column vectors $\{(x_{1,j},\dots,x_{n,j})^T\}_{j=1}^m$ are linearly independent. If we define
$$X_q:=\begin{bmatrix}x_{1,1}&\cdots&x_{1,q}\\ \vdots&\ddots&\vdots\\ x_{n,1}&\cdots&x_{n,q}\end{bmatrix},\qquad y:=\begin{bmatrix}y_1\\ \vdots\\ y_n\end{bmatrix},\qquad \epsilon:=\begin{bmatrix}\epsilon_1\\ \vdots\\ \epsilon_n\end{bmatrix},$$
we can write $y = X_p\alpha + \epsilon$. Suppose that we estimate $p$ by $q$ ($0 \le q \le m$). If we wish to minimize the quantity $\sum_{i=1}^n(y_i-\sum_{j=1}^{q}\hat\alpha_{j,q}x_{i,j})^2$ given the $n$ examples, then
$$\hat\alpha_q=[\hat\alpha_{1,q},\dots,\hat\alpha_{q,q}]^T:=(X_q^TX_q)^{-1}X_q^Ty$$
is the exact solution (minimum square error estimation).
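The least squares formula can be checked directly (a minimal sketch; the small design matrix and coefficients are made up for illustration):

```python
import numpy as np

def least_squares(Xq, y):
    # \hat{alpha}_q = (X_q^T X_q)^{-1} X_q^T y, via the normal equations
    return np.linalg.solve(Xq.T @ Xq, Xq.T @ y)

# noiseless check: the estimator recovers the coefficients exactly
Xq = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
alpha = np.array([2.0, -1.0])
y = Xq @ alpha
alpha_hat = least_squares(Xq, y)
```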

Idempotent Matrices
Suppose $p \le q$. If we define $P_q := X_q(X_q^TX_q)^{-1}X_q^T$, we have $P_q^2 = P_q$ and $(I-P_q)^2 = I-P_q$, so that the square error is expressed by
$$S_q:=\|y-X_q\hat\alpha_q\|^2=y^T(I-P_q)y=\|(I-P_q)y\|^2\ .$$
Similarly, if $q = p$, for $P_p := X_p(X_p^TX_p)^{-1}X_p^T$ and $\hat\alpha_p=[\hat\alpha_{1,p},\dots,\hat\alpha_{p,p}]^T:=(X_p^TX_p)^{-1}X_p^Ty$, the square error is expressed by
$$S_p:=\|y-X_p\hat\alpha_p\|^2=y^T(I-P_p)y=\|(I-P_p)y\|^2\ .$$
Thus, the difference between the square errors is
$$S_p-S_q=y^T(P_q-P_p)y\ .$$
On the other hand, we have $P_q^T = P_q$ and $P_p^T = P_p$. From $P_qX_p = X_p$ and $P_pX_p = X_p$, we obtain $P_pP_q = P_p^TP_q^T = (P_qP_p)^T = P_p^T = P_p$. Thus, not just for $P_p$ and $I-P_p$ but also for $P_q-P_p$, the property
$$(P_q-P_p)^2=P_q^2-P_qP_p-P_pP_q+P_p^2=P_q-P_p$$
holds. Square matrices satisfying this property are called idempotent matrices (Chatterjee-Hadi, 1987).
Since the eigenvalues of an idempotent matrix are one and zero, the multiplicity of eigenvalue one equals the trace. Notice that for $X_q^TX_q = [y_{jk}]$ and $(X_q^TX_q)^{-1} = [z_{jk}]$,
$$\mathrm{trace}(P_q)=\sum_{j,k}z_{jk}y_{kj}=\mathrm{trace}((X_q^TX_q)^{-1}X_q^TX_q)=\mathrm{trace}(I_q)=q\ ,$$
and $\mathrm{trace}(P_p) = p$, so that we have the following table:

matrix | trace (= multiplicity of eigenvalue one)
$P_p$ | $p$
$I-P_p$ | $n-p$
$P_q-P_p$ | $q-p$
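These algebraic facts are easy to verify numerically (a minimal sketch with a random Gaussian design; the dimensions $n=20$, $p=2$, $q=5$ are arbitrary choices):

```python
import numpy as np

def hat_matrix(X):
    # P = X (X^T X)^{-1} X^T
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(1)
n, p, q = 20, 2, 5
Xq = rng.standard_normal((n, q))  # columns linearly independent a.s.
Xp = Xq[:, :p]                    # nested model: first p columns

Pq, Pp = hat_matrix(Xq), hat_matrix(Xp)
D = Pq - Pp
# idempotency of P_q and of P_q - P_p, and the traces p, q, q - p
idempotent = np.allclose(Pq @ Pq, Pq) and np.allclose(D @ D, D)
traces = tuple(int(round(t)) for t in (np.trace(Pp), np.trace(Pq), np.trace(D)))
```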

Error probability in model selection
Proposition 1 If $p < q$, then $\dfrac{S_p-S_q}{S_p/n}$ asymptotically obeys the $\chi^2$ distribution with $q-p$ degrees of freedom.
Proof: Given $X_p$, we choose an orthogonal matrix $U = [u_1,\dots,u_n]$ diagonalizing $I-P_p$ so that $U_1 = \langle u_1,\dots,u_{n-p}\rangle$ and $U_0 = \langle u_{n-p+1},\dots,u_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that
$$(I-P_p)y=(I-P_p)(X_p\alpha+\epsilon)=(I-P_p)\epsilon\ .\qquad (6)$$
For $j = 1,\dots,n-p$, multiplying by $u_j^T$ from the left on both sides, we get a normal random variable $z_j := u_j^Ty = u_j^T\epsilon$. Since the expectations and variances of the independent $\epsilon_i$ are zero and $\sigma^2$, we have $E[z_j] = 0$, $V[z_j] = \sigma^2$, and
$$S_p=\|(I-P_p)y\|^2=\sum_{j=1}^{n-p}z_j^2\ .$$
Thus, from the strong law of large numbers, with probability one as $n \to \infty$,
$$\frac{S_p}{n}=\frac{1}{n}\sum_{j=1}^{n-p}z_j^2\to\sigma^2\ .\qquad (7)$$
On the other hand, given $X_q$, we choose an orthogonal matrix $V = [v_1,\dots,v_n]$ diagonalizing $P_q-P_p$ so that $V_1 = \langle v_1,\dots,v_{q-p}\rangle$ and $V_0 = \langle v_{q-p+1},\dots,v_n\rangle$ are the eigenspaces of eigenvalues one and zero, respectively. Notice that from (6), we have
$$(P_q-P_p)y=P_q(I-P_p)y=P_q(I-P_p)\epsilon=(P_q-P_p)\epsilon\ .$$
For $j = 1,\dots,q-p$, multiplying by $v_j^T$ from the left on both sides, we get a normal random variable $r_j := v_j^Ty = v_j^T\epsilon$. Since the expectations and variances of the independent $\epsilon_i$ are zero and $\sigma^2$, we have $E[r_j] = 0$, $V[r_j] = \sigma^2$, and
$$S_p-S_q=y^T(P_q-P_p)y=\sum_{j=1}^{q-p}r_j^2\ .$$
Hence, as $n \to \infty$,
$$\frac{S_p-S_q}{\sigma^2}=\sum_{j=1}^{q-p}\left(\frac{r_j}{\sigma}\right)^2\sim\chi^2_{q-p}\ ,\qquad (8)$$
where we have applied the fact that the sum of squares of $q-p$ independent random variables with the standard normal distribution obeys the $\chi^2$ distribution with $q-p$ degrees of freedom. Equations (7) and (8) imply Proposition 1.
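Proposition 1 can be checked by simulation (a minimal sketch; the fixed Gaussian design, the choices $n=200$, $p=1$, $q=4$, and the 2000 replications are arbitrary assumptions): the sample mean of $(S_p-S_q)/(S_p/n)$ should be close to $q-p$, the mean of $\chi^2_{q-p}$.

```python
import numpy as np

def rss(X, y):
    # S = || y - X (X^T X)^{-1} X^T y ||^2
    r = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    return float(r @ r)

rng = np.random.default_rng(2)
n, p, q = 200, 1, 4
Xq = rng.standard_normal((n, q))
Xp = Xq[:, :p]                   # the true model uses the first p columns

stats = []
for _ in range(2000):
    y = Xp @ np.ones(p) + rng.standard_normal(n)  # sigma^2 = 1
    Sp, Sq = rss(Xp, y), rss(Xq, y)
    stats.append((Sp - Sq) / (Sp / n))
mean_stat = float(np.mean(stats))  # expected: close to q - p = 3
```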
Hereafter, we do not assume that $\epsilon_i \sim N(0,\sigma^2)$, but only that the $\epsilon_i$ are independent and identically distributed random variables with expectation zero and variance $\sigma^2$.

Proof of the Hannan-Quinn Proposition
Proposition 2 If $q > p$, then with probability one,
$$\limsup_{n\to\infty}\frac{S_p-S_q}{(S_p/n)\,2\log\log n}\le q-p\ .\qquad (14)$$
Proof: The notation is the same as in Proposition 1. Fix $j$ and let $Z_i := \sqrt{n}\,v_{i,j}\epsilon_i/\sigma$, where $v_{i,j}$ is the $i$-th component of $v_j$; the $Z_i$ are independent with expectation zero, and $\sum_{i=1}^nV[Z_i]=n$ since $\sum_{i=1}^nv_{i,j}^2=1$. From the law of the iterated logarithm (Stout, 1974), we have
$$\limsup_{n\to\infty}\frac{\sum_{i=1}^nZ_i}{\sqrt{2n\log\log n}}=\limsup_{n\to\infty}\frac{\sqrt{n}\,v_j^T\epsilon/\sigma}{\sqrt{2n\log\log n}}\le 1\ ,$$
namely,
$$\limsup_{n\to\infty}\frac{r_j^2}{\sigma^2\,2\log\log n}\le 1$$
with probability one. Summing over $j = 1,\dots,q-p$ and using $S_p/n \to \sigma^2$ with probability one, this means
$$\limsup_{n\to\infty}\frac{S_p-S_q}{(S_p/n)\,2\log\log n}\le q-p$$
with probability one.
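To see how Proposition 2 yields strong consistency, the comparison of criterion values can be made explicit (a sketch reconstructing the standard argument, with $H = (n/2)\log(S/n)$ as in the criterion form (1)):

```latex
\begin{align*}
&\text{Overestimation (preferring } q > p \text{) requires}\\
&\qquad \frac{n}{2}\log\frac{S_p}{n} + \frac{p}{2}d_n
   \;>\; \frac{n}{2}\log\frac{S_q}{n} + \frac{q}{2}d_n
   \iff n\log\frac{S_p}{S_q} > (q-p)\,d_n .\\
&\text{Since } \log(1+x)\le x,\quad
   n\log\frac{S_p}{S_q}
   = n\log\Bigl(1+\frac{S_p-S_q}{S_q}\Bigr)
   \le \frac{S_p-S_q}{S_q/n},\\
&\text{and } S_q/n \to \sigma^2 \text{ a.s., so Proposition 2 gives }
   \limsup_{n\to\infty}\frac{n\log(S_p/S_q)}{2\log\log n}\le q-p
   \text{ a.s.}\\
&\text{Hence, for } d_n=(2+\epsilon)\log\log n \text{ with } \epsilon>0,\
   \text{overestimation occurs at most finitely often.}
\end{align*}
```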

Conclusion
We proved that the Hannan-Quinn proposition is true for linear regression as well as for autoregression (Hannan-Quinn, 1979) and for classification (Suzuki, 2006): the minimum rate of $d_n$ satisfying strong consistency is $(2+\epsilon)\log\log n$ for arbitrary $\epsilon > 0$. Future problems include finding strong consistency conditions that hold for all the cases, including linear regression, autoregression, and classification. Clarifying why the same rate $d_n = 2\log\log n$ is crucial for all of these problems would be the first step toward solving this problem.