Visualizing and Testing the Multivariate Linear Regression Model

Recent results make the multivariate linear regression model much easier to use. This model has m ≥ 2 response variables. Results by Kakizawa (2009) and Su and Cook (2012) can be used to explain the large sample theory of the least squares estimator and of the widely used Wilks’ Λ, Pillai’s trace, and Hotelling Lawley trace test statistics. Kakizawa (2009) shows that these statistics have the same limiting distribution. This paper reviews these results and gives two theorems to show that the Hotelling Lawley test generalizes the usual partial F test for m = 1 response variable to m ≥ 1 response variables. Plots for visualizing the model are also given, and can be used to check goodness and lack of fit, to check for outliers and influential cases, and to check whether the error distribution is multivariate normal or from some other elliptically contoured distribution.


Introduction
The multivariate linear regression model is y i = B T x i + i for i = 1, ..., n.This paper suggests some plots and reviews some recent results by Kakizawa (2009) and Su and Cook (2012) that make this model easier to use.
The model has m ≥ 2 response variables Y 1 , ..., Y m and p predictor variables x 1 , x 2 , ..., x p .The ith case is (x T i , y T i ) = (x i1 , x i2 , ..., x ip , Y i1 , ..., Y im ), where the constant x i1 = 1.The model is written in matrix form as Z = XB + E where the matrices are defined below.The model has E( k ) = 0 and Cov( k ) = Σ = (σ i j ) for k = 1, ..., n.Then the p × m coefficient matrix B = β 1 β 2 . . .β m and the m × m covariance matrix Σ are to be estimated, and E(Z) = XB while E(Y i j ) = x T i β j .Multiple linear regression corresponds to m = 1 response variable, and is written in matrix form as Y = Xβ + e. Subscripts are needed for the m multiple linear regression models Y j = Xβ j + e j for j = 1, ..., m where E(e j ) = 0.For the multivariate linear regression model, Cov(e i , e j ) = σ i j I n for i, j = 1, ..., m where I n is the n × n identity matrix.
The n × m matrix of response variables and n × m matrix of errors are . . .
, while the n × p design matrix of predictor variables is X.
Least squares is the classical method for fitting the multivariate linear model.The least squares estimators are B = (X T X) −1 X T Z = β1 β2 . . .βm .The matrix of predicted values or fitted values Ẑ = X B = Ŷ1 Ŷ2 . . .Ŷm .The matrix of residuals Ê = Z − Ẑ = Z − X B = r 1 r 2 . . .r m .These quantities can be found from the m multiple linear regressions of Y j on the predictors: βj = (X T X) −1 X T Y j , Ŷ j = X βj and r j = Y j − Ŷ j for j = 1, ..., m.Hence ˆ i, j = Y i, j − Ŷi,j where Ŷ j = ( Ŷ1,j , ..., Ŷn,j ) T .Finally, The i are assumed to be iid.Some important joint distributions for are completely specified by an m × 1 population location vector µ and an m×m symmetric positive definite population dispersion matrix Σ.An important model is the elliptically contoured EC m (µ, Σ, g) distribution with probability density function where k m > 0 is some constant and g is some known function.The multivariate normal (MVN) N m (µ, Σ) distribution is a special case.
Plots for checking the model are given in Section 2.1.Kakizawa (2009) examines testing for the multivariate linear regression model, showing that the Wilks, Pillai, and Hotelling Lawley test statistics perform well asymptotically for a large class of zero mean error distributions.Section 2.2 reviews these results and shows that the Hotelling Lawley test statistic is closely related to the partial F statistic for multiple linear regression.Section 3 gives an example and some simulations.

Plots for the multivariate linear regression model
This subsection suggests using residual plots, response plots, and the DD plot to examine the multivariate linear regression model.These plots will be described below since the response plot and DD plots are not as well known as the residual plot.The residual plots are often used to check for lack of fit of the model.The response plots are used to check linearity and to detect influential cases and outliers.The response and residual plots are used exactly as in the m = 1 case corresponding to multiple linear regression.See Olive and Hawkins (2005) and Cook and Weisberg (1999a, p. 432;1999b).Some notation is needed to describe the DD plot.Assume that x 1 , ..., x n are iid from a multivariate distribution.The classical estimator (x, S) of multivariate location and dispersion is the sample mean and sample covariance matrix where (1) Let the p × 1 column vector T be a multivariate location estimator, and let the p × p symmetric positive definite matrix C be a dispersion estimator.Then the ith squared sample Mahalanobis distance is the scalar for each observation x i .Notice that the Euclidean distance of x i from the estimate of center T is D i (T, I p ).The notation MD will be used to denote the classical Mahalanobis distances MD i = D i (x, S), and RD will denote distances RD i = D i (T, C) using the robust RMVN estimator (T, C) described in Zhang, Olive, and Ye (2012).The classical estimator (x, S) is a consistent estimator of the population mean and covariance matrix (µ x , Σ x ), and the RMVN estimator (T, C) is a consistent estimator of (µ x , cΣ x ) for a large class of elliptically contoured distributions where c > 0 depends on the distribution and c = 1 for the multivariate normal distribution.
The Rousseeuw and Van Driessen (1999) DD plot is a plot of classical Mahalanobis distances MD versus robust Mahalanobis distances RD, and is used to check the error distribution and to detect outliers.The DD plot suggests that the error distribution is elliptically contoured if the plotted points cluster tightly about a line through the origin as n → ∞.The plot suggests that the error distribution is multivariate normal if the line is the identity line RD=MD with unit slope and zero intercept.If n is large and the plotted points do not cluster tightly about a line through the origin, then the error distribution may not be elliptically contoured.Make a DD plot of the continuous predictor variables to check for x-outliers.These applications of the DD plot for iid multivariate data are discussed in Olive (2002Olive ( , 2013)).
Make the m response and residual plots for the multivariate linear model.A response plot for the jth response variable is a plot of the fitted values Y i j versus the response Y i j where i = 1, ..., n.The identity line is added to the plot as a visual aid.A residual plot corresponding to the jth response variable is a plot of Ŷij versus r i j .In a response plot, the vertical deviations from the identity line are the residuals r i j = Y i j − Ŷij .Suppose the model is good, the error distribution is not highly skewed, and n ≥ 10p.Then the plotted points should cluster about the identity line in each of the m response plots and about the r = 0 line in the m residual plots.If outliers are present or if the plots are not linear, then the current model or data needs to be transformed or corrected.See Example 1.

Testing hypotheses
This subsection reviews useful results from Kakizawa (2009) and Su and Cook (2012).These results will show that the Hotelling Lawley test statistic is an extension of the partial F test statistic.
Consider testing a linear hypothesis H 0 : LB = 0 versus H 1 : LB 0 where L is a full rank r × p matrix.Let Let the error or residual sum of squares and cross products matrix be Then there are four commonly used test statistics.The Roy's maximum root statistic is The Hotelling-Lawley trace statistic is Before proving Theorem 1 and showing that (n − p)U(L) D → χ 2 rm under mild conditions if H 0 is true, we first introduce some necessary notations.Following Henderson and Searle (1979) Then the vec operator stacks the columns of A on top of one another, and A ⊗ B is the Kronecker product of A and B. An important fact is that if A and B are nonsingular square matrices, then Theorem 1 The Hotelling-Lawley trace statistic Proof.Using the Searle (1982, p. 333) identity tr( , and C = I.Hence (3) holds.Kakizawa (2009) gives a result that can be shown to be equivalent to (3) using a commutation matrix K mn where The above proof avoids commutation matrix algebra, and equation (3) will be used to show that the Hotelling Lawley test generalizes the usual partial F test for m = 1 response variable to m ≥ 1 response variables.
The following assumption is important.Assumption D1: Let h i be the ith diagonal element of X(X T X) −1 X T .Assume max 1≤i≤n h i → 0 as n → ∞, assume that the zero mean iid error vectors have finite fourth moments, and assume that 1 Su and Cook (2012) give a central limit type theorem for the multivariate linear regression model: if assumption Their theorem also shows that for multiple linear regression (m = 1), σ2 = MS E is a √ n consistent estimator of σ 2 .Note that it is not assumed that the error vectors have an elliptically contoured distribution.
Theorem 2 If assumption D1 holds and if H 0 is true, then Proof.By Su and Cook (2012), This result also holds if W and Σ are replaced by Ŵ = n(X T X) −1 and Σ .Hence under H 0 and using the proof of Theorem 1, Kakizawa (2009) shows, under stronger assumptions than Theorem 2 (such as eighth moments instead of fourth moments) that for a large class of iid error distributions, the following test statistics have the same χ 2 rm limiting distribution when H 0 is true, and the same noncentral χ 2 rm (ω 2 ) limiting distribution with noncentrality parameter ω 2 when H 0 is false under a local alternative.Hence the three tests are robust to the assumption of normality.The limiting null distribution is well known when the zero mean errors are iid from a multivariate normal distribution.
Theorems 1 and 2 are useful for relating multivariate tests with the partial F test for multiple linear regression that tests whether a reduced model that omits some of the predictors can be used instead of the full model that uses all p predictors.The partial F test statistic is where the residual sums of squares S S E(F) and S S E(R) and degrees of freedom d f F and d f R are for the full and reduced model while the mean square error MS E(F) is for the full model.Let the null hypothesis for the partial F test be H 0 : Lβ = 0 where L sets the coefficients of the predictors in the full model but not in the reduced model to 0. Seber and Lee (2003, p. 100) shows that By Berndt and Savin (1977) and Anderson (1984, pp. 333, 371), Hence the Hotelling Lawley test will have the most power and Pillai's test will have the least power.
Assume H 0 is true.Thus U P → 0, V P → 0, and For large n and t > 0, − log(Λ) then it is possible that the approximate χ 2 rm distribution may be the limiting distribution for only a small class of iid error distributions.When the i are iid N m (0, Σ ), there are some exact results.For r = 1, Let s = min(r, m), m 1 = (|r−m|−1)/2 and m 2 = (n− p−m−1)/2.Note that s(|r−m|+s) = min(r, m) max(r, m) = rm.
This approximation is asymptotically correct by Slutsky's theorem since This approximation is asymptotically correct for a wide range of iid error distributions.
Multivariate analogs of tests for multiple linear regression can be derived with appropriate choice of L. Assume a constant x 1 = 1 is in the model.The analog of the ANOVA F test for multiple linear regression is the MANOVA F test that uses L = [0 I p−1 ] to test whether the nontrivial predictors are needed in the model.
The F j test of hypotheses uses L j = [0, ..., 0, 1, 0, ..., 0], where the 1 is in the jth position, to test whether the jth predictor is needed in the model given that the other p − 1 predictors are in the model.This test is an analog of the t test for multiple linear regression.The statistic Σ−1 B j where BT j is the jth row of B and d j = (X T X) −1 j j , the jth diagonal entry of (X T X) −1 .
The MANOVA partial F test is used to test whether a reduced model is good where the reduced model deletes r of the variables from the full model.For this test, the ith row of L has a 1 in the position corresponding to the ith variable to be deleted.Omitting the jth variable corresponds to the F j test while omitting variables x 2 , ..., x p corresponds to the MANOVA F test.Using L = [0 I k ] tests whether the last k predictors are needed in the multivariate linear regression model given that the remaining predictors are in the model.

Results
Example 1 Cook and Weisberg (1999a, p. 351, 433, 447) gives a data set on 82 mussels sampled off the coast of New Zealand.Let Y 1 = log(S ) and Y 2 = log(M) where S is the shell mass and M is the muscle mass.The predictors are X 2 = L, X 3 = log(W) and X 4 = H: the shell length, log(width) and height.Figures 1 and 2 give the response and residual plots for Y 1 and Y 2 .The response plots show strong linear relationships.For Y 1 , case 79 sticks out while for Y 2 , cases 8, 25 and 48 are not fit well.Highlighted cases had Cook's distance > min(0.5, 2p/n).See Cook (1977).Figure 3 shows the DD plot of the residual vectors.The plotted points are highly correlated but do not cover the identity line, suggesting an elliptically contoured error distribution that is not multivariate normal.The lines MD = 2.60, RD = 2.80 and RD = 2.448 in the DD plot correspond to the 95th percentiles of the MD i , RD i , and χ 2 2,0.95 .Cases 8, 48 and 79 have especially large distances.The response, residual, and DD plots are effective for finding influential cases, for checking linearity, and for checking whether the error distribution is multivariate normal or some other elliptically contoured distribution.Suppose the same model is used except Y 2 = M. Then the response and residual plots for Y 1 remain the same, but the plots shown in Figure 4 show curvature about the identity and r = 0 lines.Hence the linearity condition is violated.Figure 5 shows that the plotted points in the DD plot have correlation well less than one, suggesting that the error distribution is no longer elliptically contoured.Note that the plots can be used to quickly assess whether power transformations have resulted in a linear model.
The R function mltreg produces output for testing and makes the response and residual plots, and the function ddplot4 makes the DD plot.The R commands for making the plots and output are shown below, assuming the data is stored in mussels.The output is very similar to the output for multiple linear regression.Bhat shows B, Ftable shows the F j statistics and pvalues, while MANOVA shows the MANOVA F statistic and pvalue.The four Hotelling Lawley F j statistics were greater than 5.77 with pvalues less than 0.005, and the MANOVA F statistic was 337.8 with pvalue ≈ 0. A small simulation was used to study the Wilks' Λ test, the Pillai's trace test, the Hotelling Lawley trace test, and the Roy's largest root test for the F j tests and the MANOVA F test for multivariate linear regression.The first row of B was always 1 T and the last row of B was always 0 T .When the null hypothesis for the MANOVA F test is true, all but the first row corresponding to the constant are equal to 0 T .When p ≥ 3 and the null hypothesis for the MANOVA F test is false, then the second to last row of B is (1, 0, ..., 0), the third to last row is (1, 1, 0, ..., 0) et cetera as long as the first row is not changed from 1 T .First m × 1 error vectors w i were generated such that the m errors are iid with variance σ 2 .Let the m × m matrix A = (a i j ) with a ii = 1 and a i j = ψ where 0 ≤ ψ < 1 for i j.Then i = Aw i so that Σ = σ 2 A A T = (σ i j ) where the diagonal entries and the off diagonal entries σ i j = σ 2 [2ψ + (m − 2)ψ 2 ] where ψ = 0.10.Hence the correlations are (2ψ + (m − 2)ψ 2 )/(1 + (m − 1)ψ 2 ).As ψ gets close to 1, the error vectors cluster about the line in the direction of (1, ..., 1) T .Used w i ∼ N m (0, I), w i ∼ (1 − τ)N m (0, I) + τN m (0, 25I) with 0 < τ < 1 and τ = 0.25 in the simulation, w i ∼ multivariate t d with d = 7 degrees of freedom, or w i ∼ lognormal -E(lognormal): where the m components of w i were iid with distribution e z − E(e z ) where z ∼ N(0, 1).Only the lognormal distribution is not elliptically contoured.

Discussion
Multivariate linear regression is nearly as easy to use as multiple linear regression if m is small.The plots speed up the model building process for multivariate linear models since the success of power transformations achieving linearity can be quickly assessed and influential cases can often be quickly detected.The plots can also be used for competing methods such as the envelopes estimators of Su and Cook (2012).Variable selection for multivariate linear regression is discussed in Fujikoshi, Ulyanov, and Shimizu (2010).Often observations (x 2 , ..., x p , Y 1 , ..., Y m ) are collected on the same person or thing and hence are correlated.If transformations can be found such that the m response plots and residual plots look good, and n is large enough, then multivariate linear regression can be used to efficiently analyze the data.Examining m multiple linear regressions is an incorrect method for analyzing the data.From simulations, response and residual plots start to be informative for n ≥ 10p.Cramér (1946, pp. 414-415) shows that when the e i are iid N(0, σ 2 ) and none of the p − 1 nontrivial predictors are needed in the multiple linear regression model, then E(R 2 ) = (p − 1)/(n − 1) where R 2 is the coefficient of multiple determination.
For testing the multivariate linear regression model, we recommend n ≥ max((m + p) 2 , mp + 30) provided that the m response and residual plots look good.When m = 1 the model degrees of freedom = n − p.It is not clear what the model degrees of freedom is for m > 1.We used n − mp which is likely too small (conservative), but using k(n − p) for small integer k > 1 is likely too large.Based on a much larger simulation study Pelawa Watagoda (2013, pp. 111-112), using the four types of error distributions and m = p, the tests had approximately correct level if n > 0.83(m + p) 2 for the Hotelling Lawley test, if n > 2.80(m + p) 2 for the Wilks' test (agreeing with Kshirsagar (1972): n ≥ 3(m + p) 2 for multivariate normal data), and if n > 4.2(m + p) 2 for Pillai's test.
The tests are also large sample tests for a robust estimator that is asymptotically equivalent to least squares on a large class of elliptically contoured distributions.See Rupasinghe Arachchige Don (2013).For the robust estimator the elliptically contoured assumption is important since the robust estimator and least squares give different estimators of the constant when the assumption is violated.
The R software was used in the simulation.See R Development Core Team (2011) true and the errors are iid N(0, σ 2 ).Note that for multiple linear regression with m = 1, F R = (n − p)U(L)/r since Σ−1 = 1/ σ2 .Hence the scaled Hotelling Lawley test statistic is the partial F test statistic extended to m > 1 predictor variables by Theorem 1.By Theorem 2, for example, rF R D → χ 2 r for a large class of nonnormal error distribution.If Z n ∼ F k,dn , then Z n D → χ 2 k /k as d n → ∞.Hence using the F r,n−p approximation gives a large sample test with correct asymptotic level, and the partial F test is robust to nonnormality.Similarly, using an F rm,n−pm approximation for the following test statistics gives large sample tests with correct asymptotic level byKakizawa (2009) and similar power for large n.The large sample test will have correct asymptotic level as long as the denominator degrees of freedom d n → ∞ as n → ∞, and d n = n − pm reduces to the partial F test if m = 1 and U(L) is used.Then the three test statistics are −[n − p − 0.5(m − r + 3)] rm log(Λ(L)),

Figure 3 :
Figure 3: DD Plot of the Residual Vectors for the Mussels Data.

Table 1 :
Test Coverages: MANOVA F H 0 is True.

Table 2 :
Test Coverages: MANOVA F H 0 is False.=h= max(r, m) and d 2 = n − p − h + r for the test statistic n − p − h + r h λ max (L).Denote these statistics by W, P, HL and R. Let the coverage be the proportion of times that H 0 is rejected.Want coverage near 0.05 when H 0 is true and coverage close to 1 for good power when H 0 is false.With 5000 runs, coverage outside of (0.04,0.06) suggests that the true coverage is not 0.05.Coverages are tabled for the F 1 , F 2 , F p−1 , and F p test and for the MANOVA F test denoted by F M .The null hypothesis H 0 was always true for the F p test and always false for the F 1 test.When the MANOVA F test was true, H 0 was true for the F j tests with j 1.When the MANOVA F test was false, H 0 was false for the F j tests with j p, but the F p−1 test should be hardest to reject for j p by construction of B and the error vectors.When the null hypothesis H 0 was true, simulated values started to get close to nominal levels for n ≥ 0.8(m + p) 2 , and were fairly good for n ≥ 1.5(m + p) 2 .The exception was Roy's test which rejects H 0 far too often if r > 1. See Table 1 where want values for the F 1 test to be close to 1 since H 0 is false for the F 1 test and want values close to 0.05, otherwise.Roy's test was very good for the F j tests but very poor for the MANOVA F test. Results are shown for m = p = 10.As expected from Berndt and Savin (1977), Pillai's test rejected H 0 less often than Wilks' test which rejected H 0 less often than the Hotelling Lawley test.In Table 2, H 0 is only true for the F p test where p = m, and want values in the F p column near 0.05.Want values near 1 for high power otherwise.If H 0 is false, often H 0 will be rejected for small n.For example, if n ≥ 10p, then the m residual plots should start to look good, and the MANOVA F test should be rejected.For the simulated data, had fair power for n not much larger than mp.Results are shown for the lognormal distribution.
. Programs are in the collection of R functions lregpack.txtavailable from (http://lagrange.math.siu.edu/Olive/lregpack.txt).The mussels data set can be obtained from (http://lagrange.math.siu.edu/Olive/lregdata.txt).The function mregsim was used to simulate the tests of hypotheses.The function mltreg makes the residual and response plots, and computes the F j , MANOVA F, and MANOVA partial F test pvalues.The function mregddsim simulates DD plots of residuals for the multivariate linear regression model.The function MLRsim simulates response and residual plots for various error distributions.The function ddplot4 makes the DD plot of the residuals.See Example 1.