Principal Component Analysis and Its Derivation From Singular Value Decomposition

Data analysts and researchers today often face the daunting task of reducing high-dimensional datasets, since large volumes of data are easily generated by the explosive growth of internet activity. The most widely used tool for data reduction is principal component analysis (PCA); the singular value decomposition (SVD) is applied only in some cases. This study examined the application and theoretical framework of these methods in terms of their linear algebra foundations. The study found that SVD is the more robust and general method for a change of basis and for low-rank approximation, but that in terms of application PCA is easier to interpret, as illustrated in the work.


Introduction
Over time, new mathematical methods tend to spring up in response to complex empirical problems. As the saying goes, 'necessity is the mother of invention': when confronted with a mathematical impasse, the human mind is capable of discovering new and better ways of solving problems. One product of this process is the Singular Value Decomposition (SVD). The SVD has expanded the frontiers and uses of linear algebra, as it underpins several methodologies such as Principal Component Analysis, Orthogonal Function Analysis, Eigen Decomposition, Matrix Decomposition, Cholesky Decomposition, Hessenberg Decomposition, etc. Of these methods, this study focuses on Principal Component Analysis (PCA) and its derivation from SVD.
According to Jolliffe (2002), the SVD methodology is older than PCA, as it culminated from the works of five mathematicians: Eugenio Beltrami, Camille Jordan, James Joseph Sylvester, Erhard Schmidt and Hermann Weyl. These mathematicians did not just discover the SVD methodology; they also laid its theoretical foundation. Beltrami and Jordan are the progenitors of the decomposition. Beltrami gave a proof of the result for real, invertible matrices with distinct singular values in 1873. Subsequently, Jordan refined the concept and eliminated the unnecessary restrictions imposed by Beltrami. Sylvester, apparently unfamiliar with the work of Beltrami and Jordan, rediscovered the result in 1889 and pointed out its importance.
PCA is a dimension-reduction tool used to reduce a large set of variables to a small set of variables without much loss of information. As a mathematical procedure, it converts many correlated variables into a smaller number of uncorrelated variables termed principal components.
PCA is deeply connected to SVD in terms of dimensionality reduction and change of basis. However, SVD is a highly robust method owing to its ability to decompose any matrix $A$ with rank $r$ into $U \Sigma V^{T}$, with orthogonal matrices $U$ and $V$ and diagonal $\Sigma$. The non-zero values along the diagonal of $\Sigma$, termed singular values ($\sigma_1, \dots, \sigma_r$), are positive, and $\sigma_i \geq \sigma_{i+1}$. In the same vein, PCA applies the eigenvalue decomposition (EVD) to square symmetric matrices, searching for new abstract components (eigenvectors) that explain most of the variation in the data in a new coordinate system. Compared with the EVD used in PCA, however, SVD is a more precise, robust and reliable method, with no need to compute the input correlation/covariance matrix, according to Will (1999). Classical PCA is not useful when the number of variables is bigger than the number of observations. In microarray data analysis, Lim (2013) encountered this situation commonly when genes are taken as the variables. In such instances, it is pertinent to perform PCA using SVD (Deshunk and Purohit, 2007). For applying PCA using SVD, the DNA microarray data of Khan et al. (2001) on the small round blue cell tumors (SRBCT) of childhood was used.
Nonetheless, in terms of linear algebra, the common underlying idea that connects these two methods is the change of basis. Succinctly, changing the basis of a vector means presenting the vector in a different coordinate system that best depicts the underlying structure of its components.
In PCA, the original data set is transformed so that the eigenvectors are the basis vectors, and the new coordinates of the data points are then formed with respect to this new basis. This is the change of basis transformation. It is possible because the covariance matrix of the data set is real and symmetric. Hence, by its eigendecomposition, its eigenvector matrix is an orthogonal matrix.
However, a more general transformation occurs in SVD, which is possible even for non-square matrices. This method is powerful for several reasons, one of which is its ability to produce orthonormal bases for the four fundamental subspaces. Owing to this robust nature, SVD is considered computationally efficient and numerically stable, so that it has been used to solve least squares problems and related problems such as the pseudo-inverse (Lyche, 2018).
There are three basic methods for solving least squares problems. These are: normal equations, QR decomposition, and SVD.
The first method provides a very quick and easy least squares solution, though the solution is not that accurate. The second method is more accurate than the first but requires about twice as much time. Of the three, SVD is the most widely used for its capability to deliver quick, accurate and stable results even for unconventional cases like the overdetermined system (Least squares problems, 2013).
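The three approaches listed above can be compared on a small example. The following is a minimal sketch, assuming NumPy is available; the matrix `A`, the right-hand side `b` and the random seed are hypothetical illustration data, not the data used in this study.

```python
import numpy as np

# Hypothetical overdetermined system: 6 equations, 2 unknowns (illustrative only).
rng = np.random.default_rng(0)
t = np.arange(6.0)
A = np.column_stack([np.ones(6), t])
b = 2.0 + 3.0 * t + 0.01 * rng.standard_normal(6)

# 1. Normal equations: solve (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# 2. QR decomposition: A = QR (reduced form), then solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# 3. SVD: A = U diag(s) V^T, so x = V diag(1/s) U^T b (the pseudo-inverse solution).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

print(x_normal, x_qr, x_svd)
```

On a well-conditioned problem such as this one the three solutions should agree to rounding error; differences in accuracy tend to appear only when `A` is ill-conditioned or rank-deficient.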
There is a widespread misconception about how the SVD and PCA methods are applied to practical problems and how they are theoretically connected. This study aims to provide a theoretical framework for the relationship that exists between these methods. In addition, the study demonstrates the interpretative power of PCA over SVD by carrying out an exploratory data analysis on financial data.
Understanding the theoretical foundation that these methods rely on to resolve complex structures has become invaluable, if not critical. Unraveling the mathematics behind these methods makes it easier to navigate the intricacies of multidimensional data and to increase the precision of data analysis.

Methodology and Data Collection
The research adopted financial time series data consisting of the Domestic Debt of State Governments for the thirty-six (36) states of Nigeria and its capital (Abuja). As secondary data, it was sourced from the Debt Management Office (DMO) via the Nigeria Statistical Bulletin. A 6-year period (2011 to 2016) was considered, with the years as the variables (features) to be studied against the states (observations). This yields a 36×6 matrix denoted as X, with 36 observations and 6 features to be studied.

Principal Component Analysis
Mathematically, the PCA method obtains the most meaningful basis in which to re-express a noisy dataset in terms of its principal components. That is, it involves a linear transformation of a high-dimensional dataset into a simpler representation that uncovers the hidden structure of garbled data.
At this point, the questions that readily come to mind are: how are these new bases obtained, and why are they important? To answer these questions, an arbitrary dataset is assumed and the underlying steps of the PCA methodology are worked through.
Given an $m \times n$ matrix $X$ where the $n$ columns are the samples (e.g. observations) and the $m$ rows are the variables, the objective is to transform the matrix $X$ linearly into another matrix $Y$ of the same dimension $m \times n$. Hence, for some $m \times m$ matrix $P$,

$$Y = PX \qquad (1)$$

The matrix $P$ is a transformation matrix that changes the basis from $X$ to $Y$. Geometrically, $P$ is a rotation and a stretch which moves $X$ to $Y$.
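Taking the rows of $P$ to be the row vectors $p_1, p_2, \dots, p_m$ and the columns of $X$ to be the column vectors $x_1, x_2, \dots, x_n$, equation (1) can be written as

$$PX = \begin{pmatrix} p_1 \cdot x_1 & \cdots & p_1 \cdot x_n \\ \vdots & \ddots & \vdots \\ p_m \cdot x_1 & \cdots & p_m \cdot x_n \end{pmatrix} = Y$$

such that $p_i \cdot x_j \in \mathbb{R}$.
In essence, $p_i \cdot x_j$ is just the standard Euclidean inner (dot) product. This means that the columns of the original data $X$ are being projected onto the rows of $P$. Hence, the new basis for representing the columns of $X$ is given by the rows of $P$, ($p_1, p_2, \dots, p_m$).
By assuming linearity, the problem reduces to finding the appropriate change of basis. The row vectors ($p_1, p_2, \dots, p_m$) in the transformation become the principal components of $X$.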
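The entrywise reading of the change of basis can be checked numerically. The sketch below assumes NumPy; the matrix sizes, the random data and the way the orthonormal `P` is generated (from a QR factorisation of a random matrix) are illustrative choices, not part of the study.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5                      # hypothetical sizes: 3 variables, 5 samples
X = rng.standard_normal((m, n))

# Any square matrix with orthonormal rows serves as an example change of basis.
P, _ = np.linalg.qr(rng.standard_normal((m, m)))

# Matrix form of the change of basis.
Y = P @ X

# Entrywise form: Y[i, j] is the dot product of row i of P with column j of X.
Y_entrywise = np.array([[P[i] @ X[:, j] for j in range(n)] for i in range(m)])

print(np.allclose(Y, Y_entrywise))
```

Each column of `Y` holds the coordinates of the corresponding sample of `X` with respect to the rows of `P`.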
Further questions arise at this stage:
- What is the best way to re-express $X$? That is, how can independence between the new basis vectors (the principal components) be defined?
- What is a good choice of basis $P$?

Principal component analysis defines independence by considering the variance of the data in the original basis. It searches for new directions that maximize the variance, and these directions define the new basis. Recall that the variance of a random variable $z$ with mean $\mu$ is given as

$$\sigma_z^2 = E\left[(z - \mu)^2\right]$$

For instance, given a vector of $n$ discrete measurements, $\tilde{r} = (\tilde{r}_1, \tilde{r}_2, \dots, \tilde{r}_n)$, with mean $\mu_r$, subtracting the mean from each measurement gives a translated set of measurements $r = (r_1, r_2, \dots, r_n)$ with zero mean. The variance of these measurements is then given by the relation

$$\sigma_r^2 = \frac{1}{n-1}\, r r^{T} \qquad (5)$$

where $\frac{1}{n-1}$ is the normalization constant.

Additionally, given another vector $s = (s_1, s_2, \dots, s_n)$, again with zero mean, the idea can be generalized to derive the covariance of $r$ and $s$. Covariance measures how much two variables vary together, while variance is the special case of covariance in which the two variables are identical. The covariance of $r$ and $s$ is given as

$$\sigma_{rs}^2 = \frac{1}{n-1}\, r s^{T}$$

Now consider the $m \times n$ data matrix $X$ as $m$ row vectors $x^{(1)}, \dots, x^{(m)}$, each of length $n$, that is

$$X = \begin{pmatrix} x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

Since there is one row vector per variable, each vector contains all the samples of one particular variable. The resulting matrix product can then be written as

$$S_X = \frac{1}{n-1} X X^{T} \in \mathbb{R}^{m \times m}, \qquad (S_X)_{ij} = \frac{1}{n-1}\, x^{(i)} \left(x^{(j)}\right)^{T}$$

A closer look at the entries of this matrix shows that all possible covariance pairs among the $m$ variables have been computed: the diagonal entries are the variances and the off-diagonal entries are the covariances. This matrix is hence termed the covariance matrix or the dispersion matrix, denoted by $S_X$.
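The covariance matrix construction above can be sketched in a few lines. This example assumes NumPy, random illustrative data, and the $1/(n-1)$ normalisation; it compares the direct matrix product with NumPy's own `np.cov`, which by default also treats rows as variables and divides by $n-1$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 50                         # hypothetical: 3 variables, 50 samples
X_raw = rng.standard_normal((m, n))

# Subtract each row's mean so that every variable has zero mean.
X = X_raw - X_raw.mean(axis=1, keepdims=True)

# Covariance (dispersion) matrix as a single matrix product.
S_X = (X @ X.T) / (n - 1)

# Diagonal entries are variances, off-diagonal entries are covariances.
print(S_X)
print(np.allclose(S_X, np.cov(X_raw)))
```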

𝒀 = 𝑷𝑿
where $X$ is the original data matrix, $P$ is the transformation matrix and $Y$ is the new matrix. A decision has to be made on the best way to re-express $S_X$ as $S_Y$ and on what characteristics to look for in $S_Y$.
The covariance measures how strongly two variables are correlated. A key assumption of the PCA method is independence. This means that the variables described by the desired matrix $S_Y$ should be as uncorrelated as possible; that is, the covariances between different variables in $S_Y$ should be close to zero. Large variances, on the other hand, are an interesting feature because they reveal much about the structure of the data. It therefore becomes imperative to maximize the information retained (measured by the variance) while minimizing the covariance between variables.
Reducing the covariances in $S_Y$ to their minimum (zero) while each component retains as much variance as possible results in a diagonal matrix. In essence, the transformation matrix $P$ is chosen so that $S_Y$ is diagonal.
The implication of the foregoing is that the new basis vectors $p_1, p_2, \dots, p_m$ should be orthonormal. From linear algebra, the row vectors of an orthogonal matrix are orthonormal vectors. That is, each row has unit length and the rows are mutually perpendicular.
Under this assumption, PCA seeks a normalized direction in $m$-dimensional space along which the variance is maximized. By the orthogonality condition, it then restricts the search to directions perpendicular to all previously selected directions; this makes intuitive sense since, geometrically, the axes of Cartesian systems are perpendicular. This continues until $m$ directions are selected. The resulting directions $p_i$ are the principal components.
The orthonormality condition reduces the problem to one that is solvable by linear algebra decomposition tools.
As is well known, certain decompositions in linear algebra yield direct and efficient solutions.
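For the dispersion matrix $S_Y$, re-expressing $Y$ in terms of $P$ and $X$ gives

$$S_Y = \frac{1}{n-1} Y Y^{T} = \frac{1}{n-1} (PX)(PX)^{T} = \frac{1}{n-1} P (X X^{T}) P^{T} = \frac{1}{n-1} P S P^{T} \qquad (9)$$

where $S = X X^{T}$. Clearly, $S$ is symmetric since $(X X^{T})^{T} = (X^{T})^{T} X^{T} = X X^{T}$. At this point, a common theorem in linear algebra will be introduced.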

Theorem 1.0
If a matrix is orthogonally diagonalizable, then it is symmetric.

Proof
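Let $A$ be orthogonally diagonalizable. Then there is an orthogonal matrix $E$ and a diagonal matrix $D$ for which we have $E^{T} A E = D$.

Multiplying on the left by $E$ and on the right by $E^{T}$ gives $A = E D E^{T}$. Taking the transpose, $A^{T} = (E D E^{T})^{T} = E D^{T} E^{T} = E D E^{T} = A$.
So $A$ is symmetric. Given that the converse is also true (every real symmetric matrix is orthogonally diagonalizable), it can be adopted to support equation (2.8).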
Therefore, the symmetric matrix $S = X X^{T}$ is diagonalized by an orthonormal matrix of its eigenvectors, such that

$$S = E D E^{T}$$

where $D$ is a diagonal matrix containing the eigenvalues of $S$ and $E$ is an orthonormal matrix whose columns are the orthonormal eigenvectors of $S$.
The rank $r$ of $S$ is the number of its orthonormal eigenvectors that have non-zero eigenvalues. If $r$ is less than $m$, the orthogonality constraint is maintained by generating $(m - r)$ additional orthonormal vectors to fill the remaining columns of $E$.
The foregoing motivates the choice of the transformation matrix $P$ whose rows are the eigenvectors of $S$, that is, $P = E^{T}$. Substituting into equation (9), $S_Y$ becomes

$$S_Y = \frac{1}{n-1} P S P^{T} = \frac{1}{n-1} E^{T} (E D E^{T}) E$$

Since $E$ is orthonormal, $E^{T} = E^{-1}$, hence

$$S_Y = \frac{1}{n-1} D$$

In conclusion, the objective of diagonalising the covariance matrix $S_Y$ has been achieved. The principal components (the rows of $P$) are the eigenvectors of the covariance matrix $S_X$, and, when the eigenvalues in $D$ are arranged in decreasing order, the rows are in order of importance.
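The diagonalisation argument above can be verified numerically. The following is a minimal sketch, assuming NumPy and randomly generated correlated data; it eigendecomposes the covariance matrix $S_X$ directly (whose eigenvectors coincide with those of $S = XX^{T}$, with eigenvalues scaled by $1/(n-1)$) using `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 200                                  # hypothetical sizes
mixing = rng.standard_normal((m, m))           # introduces correlation between rows
X_raw = mixing @ rng.standard_normal((m, n))
X = X_raw - X_raw.mean(axis=1, keepdims=True)  # zero-mean rows

S_X = (X @ X.T) / (n - 1)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, E = np.linalg.eigh(S_X)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# Rows of P are the eigenvectors of the covariance matrix.
P = E.T
Y = P @ X

# Covariance of the transformed data: diagonal, with the eigenvalues on the diagonal.
S_Y = (Y @ Y.T) / (n - 1)
print(np.round(S_Y, 6))
```

The off-diagonal entries of `S_Y` vanish up to rounding error, which is the sense in which the new variables are uncorrelated.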

Singular Value Decomposition
The SVD is intimately related to the familiar theory of diagonalising a symmetric matrix. Recall from Theorem 1.0 and its converse that, if $A$ is a symmetric real $n \times n$ matrix, there is an orthogonal matrix $V$ and a diagonal matrix $D$ such that $A = V D V^{T}$. Here the columns of $V$ are eigenvectors for $A$ and form an orthonormal basis for $\mathbb{R}^{n}$; the diagonal entries of $D$ are the eigenvalues of $A$. To emphasize the connection with the SVD, we will refer to $V D V^{T}$ as the eigenvalue decomposition, or EVD, for $A$.
For the SVD we begin with an arbitrary real $m \times n$ matrix $A$. As we shall see, there are orthogonal matrices $U$ and $V$ and a diagonal matrix, this time denoted $\Sigma$, such that $A = U \Sigma V^{T}$. In this case, $U$ is $m \times m$ and $V$ is $n \times n$, so that $\Sigma$ is rectangular with the same dimensions as $A$. The diagonal entries of $\Sigma$, that is the $\Sigma_{ii} = \sigma_i$, can be arranged to be nonnegative and in order of decreasing magnitude. The positive ones are called the singular values of $A$. The columns of $U$ and $V$ are called the left and right singular vectors of $A$.
The analogy between the EVD for a symmetric matrix and the SVD for an arbitrary matrix can be extended a little by thinking of matrices as linear transformations. For a symmetric matrix $A$, the transformation takes $\mathbb{R}^{n}$ to itself and the columns of $V$ define an especially nice basis. When vectors are expressed relative to this basis, we see that the transformation simply dilates some components and contracts others, according to the magnitudes of the eigenvalues (with a reflection through the origin tossed in for negative eigenvalues). Moreover, the basis is orthonormal, which is the best kind of basis to have. Now let us look at the SVD for an $m \times n$ matrix $A$. Here the transformation takes $\mathbb{R}^{n}$ to a different space, $\mathbb{R}^{m}$, so it is reasonable to ask for a natural basis for each of the domain and range. The columns of $V$ and $U$ provide these bases. When they are used to represent vectors in the domain and range of the transformation, the nature of the transformation again becomes transparent: it simply dilates some components and contracts others, according to the magnitudes of the singular values, and possibly discards components or appends zeros as needed to account for a change in dimension.
From this perspective, the SVD tells us how to choose orthonormal bases so that the transformation is represented by a matrix with the simplest possible form, that is, diagonal.
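The shapes and properties of the factors described in this section can be inspected directly. The sketch below assumes NumPy and a random $5 \times 3$ matrix chosen purely for illustration. Note that `np.linalg.svd` returns the singular values as a one-dimensional array and returns $V^{T}$ rather than $V$, so the rectangular $\Sigma$ is rebuilt by hand.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3                              # hypothetical non-square example
A = rng.standard_normal((m, n))

# Full SVD: U is m x m, Vt is n x n, s holds min(m, n) singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular diagonal matrix Sigma with the same shape as A.
k = len(s)
Sigma = np.zeros((m, n))
Sigma[:k, :k] = np.diag(s)

A_rebuilt = U @ Sigma @ Vt
print(U.shape, Sigma.shape, Vt.shape)
print(s)                                  # nonnegative, in decreasing order
```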

The Derivation of PCA From SVD
Consider the $m \times n$ matrix $A$ with singular value decomposition $A = U \Sigma V^{T}$. A theorem from linear algebra states that the non-zero singular values of $A$ are the square roots of the non-zero eigenvalues of $A A^{T}$ or $A^{T} A$. The assertion for the $A^{T} A$ case is proven as follows:

$$A^{T} A = (U \Sigma V^{T})^{T} (U \Sigma V^{T}) = V \Sigma^{T} U^{T} U \Sigma V^{T} = V (\Sigma^{T} \Sigma) V^{T}$$

We note that $A^{T} A$ is similar to $\Sigma^{T} \Sigma$, and hence it has the same eigenvalues. Since $\Sigma^{T} \Sigma$ is a square ($n \times n$) diagonal matrix, its eigenvalues are its diagonal entries, which are the squares of the singular values. Note that the non-zero eigenvalues of each of the covariance matrices, $A A^{T}$ and $A^{T} A$, are identical.
It is important to note that an eigenvalue decomposition of the matrix $A^{T} A$ has effectively been carried out. Indeed, since $A^{T} A$ is symmetric, this is an orthogonal diagonalisation, and hence the eigenvectors of $A^{T} A$ are the columns of $V$. This is central to the practical link between PCA and the SVD of the data matrix $X$, which comes next.
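The relationship between singular values and eigenvalues stated above can be checked on a small example. This is a sketch under the assumption that NumPy is available; the $6 \times 4$ random matrix is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))          # hypothetical 6 x 4 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# Eigenvalues of the two symmetric products, sorted in decreasing order.
eig_AtA = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # 4 eigenvalues
eig_AAt = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]   # 6 eigenvalues, 2 of them ~0

# Columns of V are eigenvectors of A^T A: (A^T A) v_i = sigma_i^2 v_i.
residual = (A.T @ A) @ V - V * s**2

print(s**2)
print(eig_AtA)
```

The extra eigenvalues of $AA^{T}$ are zero up to rounding error, which is why only the non-zero eigenvalues of the two products coincide.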
Recalling the initial $m \times n$ data matrix $X$, let us define a new $n \times m$ matrix $Z$:

$$Z = \frac{1}{\sqrt{n-1}} X^{T}$$

Recall that, since the $m$ rows of $X$ contained the $n$ data samples, the row average was subtracted from each entry so that every row has zero mean. Hence, the new matrix $Z$ has columns with zero mean. Consider forming the $m \times m$ matrix $Z^{T} Z$:

$$Z^{T} Z = \left(\frac{1}{\sqrt{n-1}} X^{T}\right)^{T} \left(\frac{1}{\sqrt{n-1}} X^{T}\right) = \frac{1}{n-1} X X^{T} = S_X$$

We observe that defining $Z$ in this way ensures that $Z^{T} Z$ is equal to the covariance matrix of $X$, $S_X$. From the discussion in the previous section, the principal components of $X$ (which are to be identified) are the eigenvectors of $S_X$. Thus, if a singular value decomposition of $Z^{T} Z$ is carried out, the principal components will be the columns of the orthogonal matrix $V$.
The final step is to link the SVD of $Z^{T} Z$ back to the change of basis represented by equation (1):

$$Y = PX$$

We wish to project the original data onto the directions described by the principal components. Since we have the relation $V = P^{T}$, this is simply

$$Y = V^{T} X$$

To recover the original data, we compute (using the orthogonality of $V$)

$$X = V Y$$
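The derivation in this section can be traced end to end on synthetic data. The sketch below assumes NumPy and random illustrative data. As a design choice, it takes the SVD of $Z$ itself rather than of $Z^{T}Z$: the right singular vectors of $Z$ are the eigenvectors of $Z^{T}Z$, so the same $V$ is obtained without forming the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 100                                  # hypothetical: 4 variables, 100 samples
X_raw = rng.standard_normal((m, m)) @ rng.standard_normal((m, n))
X = X_raw - X_raw.mean(axis=1, keepdims=True)  # zero-mean rows

# Z is n x m, scaled so that Z^T Z equals the covariance matrix of X.
Z = X.T / np.sqrt(n - 1)
S_X = (X @ X.T) / (n - 1)

# SVD of Z: columns of V are the principal components.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
V = Vt.T

# Project the data onto the principal components, then recover it.
Y = V.T @ X
X_back = V @ Y

# Squared singular values of Z are the eigenvalues of the covariance matrix.
eigvals = np.sort(np.linalg.eigvalsh(S_X))[::-1]
print(s**2)
print(eigvals)
```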

Analysis, Results and Discussion
Principal Component Analysis
The proportion of each eigenvalue is given in the second row of Table 1.1, and it shows how much information is captured by each eigenvalue. For example, the first eigenvalue has a relatively high proportion of 0.860, which is about 86% of the variability explained, while the following eigenvalue contributes 8.4%, and so on. These proportions are obtained by dividing each eigenvalue by the sum of all the eigenvalues.
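The proportion calculation, together with its running total, can be sketched as follows. The eigenvalues below are hypothetical placeholders chosen for easy arithmetic, not the eigenvalues obtained in this study, and the 90% threshold is likewise only an illustrative selection rule.

```python
import numpy as np

# Hypothetical eigenvalues in decreasing order (NOT the study's values).
eigenvalues = np.array([4.0, 2.0, 1.0, 0.5, 0.3, 0.2])

# Proportion of variation explained by each component.
proportion = eigenvalues / eigenvalues.sum()

# Cumulative proportion: running total of the proportions.
cumulative = np.cumsum(proportion)

# Smallest number of components whose cumulative proportion reaches 90%.
k = int(np.searchsorted(cumulative, 0.90) + 1)

print(proportion)
print(cumulative)
print(k)
```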
The cumulative proportion, which occupies the third column, is obtained by adding the successive proportions of variation explained to obtain a running total. It provides an easy way of selecting principal components based on the percentage of variation they explain. It suffices to choose the first and second components for this study since together they explain 94.3% of the information in the dataset. The relatively high PC1 values for 2014, 2015 and 2016 in Table 1.2 imply the presence of correlations between these variables and the first principal component. That is, an increase in the first principal component corresponds to an increase in these variables. On the other hand, the second principal component has a negative value for 2015 (-0.578), which indicates a negative relationship between that year and the second principal component.
Finally, there is a high correlation between 2012 (0.622) and the second principal component, indicating that an increase in the second principal component corresponds to an increase in debt values in 2012 but a decrease in 2015. Scores are obtained for each observation as a linear combination of the data with the coefficients of each principal component. That is, the eigenvectors, whose coefficients correspond to the variables, are used to calculate the principal component scores for each observation. For example, the score for the first observation (Abia state) on PC1 is the sum of its six yearly debt values, each weighted by the corresponding PC1 coefficient. From Table 1.3, the PC1 score for Abia state is approximately equal to the calculated score (i.e. 74579.5 ≈ 74607.03). The values are not exactly equal owing to the accumulation of approximations made by the software and the large values in the data. Nonetheless, these disparities do not affect the interpretations. Scores can be loosely defined as a measure of the performance of each state on a particular principal component. From Table 1.4, examining the magnitude of these values shows that Lagos state had the highest score (605694). This is plausible considering its robust economic activity, with the capacity to attract high debts just like the United States of America. On PC2, Bayelsa and Delta states had high magnitudes but opposite signs, which indicates a negative relationship between their values.
The minimum scores for PC1 and PC2 were associated with Niger and Yobe states. However, the overall high and low values are read from the first component since it accounts for the most variation in the data. So even though the least score (in terms of magnitude and regardless of sign) is associated with Niger state (881), Yobe state, with a score of 11613.4 on the first principal component, is the state with the least outstanding debt. To corroborate the score table, the score plot illustrates the results by associating observations located at the extremes of both axes with the high and low outstanding debt values of each state.
That is, the outstanding debt of states located closer to these variables on the PC2 axis of the score plot will be very high. For example, Bayelsa and Delta states can be seen at the extremes of the second component axis, while Niger state is the nearest to its origin, though this is not so apparent owing to the overlapping of points.
Similarly, the first component axis has Lagos state at its extreme; that is, Lagos state has the biggest outstanding debt value on this component. This can be generalized to the whole dataset in the same way as the interpretation above. Yobe state, being nearest to zero, can be identified as the state with the least outstanding debt value.
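The score computation discussed in this section, a linear combination of each observation with the coefficients of a principal component, can be sketched as follows. The data here are random placeholders standing in for the debt table (observations in rows, years in columns), and the sketch assumes scores are computed from mean-centred data; the statistical package used in the study may follow a different convention.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical data: 8 observations (states) by 6 variables (years).
data = rng.uniform(10_000, 100_000, size=(8, 6))

# Covariance matrix of the variables (columns) and its eigenvectors.
S = np.cov(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores on PC1 and PC2 for every observation.
centered = data - data.mean(axis=0)
scores = centered @ eigvecs[:, :2]

# The same PC1 score for the first observation, written out as a weighted sum.
score_first_pc1 = sum(x * w for x, w in zip(centered[0], eigvecs[:, 0]))

print(scores)
print(score_first_pc1)
```

Plotting the two columns of `scores` against each other gives a score plot of the kind shown in Figure 1.0.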

Conclusion
One of the major conclusions that can be drawn from this work is that the PCA method is an offshoot of the SVD method. Both methods can be used to achieve the same purpose, though the latter has more robust applications. PCA is applied by obtaining the eigenvectors and eigenvalues of the covariance matrix, whereas SVD is applied to the matrix of observations, decomposing the data matrix into three factors that contain the eigenvectors and eigenvalues used in the PCA method. The study demonstrated the power of the PCA tool by carrying out an exploratory data analysis on financial data. Even when the correlation matrix is difficult to analyze for such a multidimensional dataset, the PCA method was able to decompose the data into simpler components that could be easily interpreted. Moreover, deep inherent structures were uncovered in terms of the relationships that exist between and within the variables of the dataset.

Figure 1.0. A score plot for the individual observations

Table 1.0. Data representing Domestic Debt of State Governments for the thirty-six (36) states of Nigeria and its capital (Abuja), as shown in Appendix A. The table is analysed using the Minitab computer package to obtain the eigen analysis shown in Table 1.1 below.

Table 1.1. An eigen analysis showing the eigenvalues and the proportion of variation explained by each component

Table 1.2. A table of eigenvectors and corresponding variables. The principal components are the linear combinations of the original variables that account for the variance in the data. The maximum number of components extracted always equals the number of variables. In essence, the coefficients indicate the relative weight of each variable in each component.

From Table 1.2, PC1 has relatively high values for 2014, 2015 and 2016 (0.433, 0.459 and 0.490 respectively). High correlation values are in bold numbers in the table above.

Table 1.3. A table of scores for each observation with respect to PC1 and PC2

Table 1.4. Summary of Scores from PC1 and PC2