Basic concepts of multivariate statistics
As a starting point for descriptive statistics, we consider \(n\) observations of \(p\) variables. The data matrix \(X\) is an \(n \times p\) matrix with \(n\) rows and \(p\) columns: each row represents an observation and each column represents a variable. The \(i\)-th row of \(X\) is denoted by \(x_i^T\), where \(x_i\) is a \(p \times 1\) vector, and the \(j\)-th column collects the \(n\) observations of the \(j\)-th variable. The entry in row \(i\) and column \(j\) is denoted by \(x_{ij}\).
\(X\) = \(\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}\)
The sample mean vector \(\overline{\mathbf{x}}\) is a \(p \times 1\) vector whose \(j\)-th element \(\bar{x}_j\) is the mean of the \(j\)-th variable: \(\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}\).
If we stack up all these means, we get the mean vector: \(\overline{\mathbf{x}}=\frac{1}{n}\left(\begin{array}{c} \sum_{i=1}^n x_{i 1} \\ \sum_{i=1}^n x_{i 2} \\ \vdots \\ \sum_{i=1}^n x_{i p} \end{array}\right)=\left(\begin{array}{c} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{array}\right)\)
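As a small illustration, here is a minimal NumPy sketch (the data matrix and its values are made up for the example) that stores \(n = 5\) observations of \(p = 3\) variables and computes the sample mean vector by averaging each column:

```python
import numpy as np

# Hypothetical data matrix: n = 5 observations (rows) of p = 3 variables (columns)
X = np.array([
    [4.0, 2.0, 0.60],
    [4.2, 2.1, 0.59],
    [3.9, 2.0, 0.58],
    [4.3, 2.1, 0.62],
    [4.1, 2.2, 0.63],
])

# Sample mean vector: average each variable over the n observations
x_bar = X.mean(axis=0)   # shape (p,)
print(x_bar)             # -> 4.1, 2.08, 0.604
```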
The covariance matrix \(S\) is a \(p \times p\) matrix whose element in row \(j\) and column \(k\), \(S_{jk}\), is the covariance between the \(j\)-th and \(k\)-th variables:
\[S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)\]If we stack up all these covariances, we get the covariance matrix:
\[S=\frac{1}{n-1}\left(\begin{array}{cccc}\sum_{i=1}^n\left(x_{i 1}-\bar{x}_{1}\right)^{2} & \sum_{i=1}^n\left(x_{i 1}-\bar{x}_{1}\right)\left(x_{i 2}-\bar{x}_{2}\right) & \cdots & \sum_{i=1}^n\left(x_{i 1}-\bar{x}_{1}\right)\left(x_{i p}-\bar{x}_{p}\right) \\ \sum_{i=1}^n\left(x_{i 2}-\bar{x}_{2}\right)\left(x_{i 1}-\bar{x}_{1}\right) & \sum_{i=1}^n\left(x_{i 2}-\bar{x}_{2}\right)^{2} & \cdots & \sum_{i=1}^n\left(x_{i 2}-\bar{x}_{2}\right)\left(x_{i p}-\bar{x}_{p}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^n\left(x_{i p}-\bar{x}_{p}\right)\left(x_{i 1}-\bar{x}_{1}\right) & \sum_{i=1}^n\left(x_{i p}-\bar{x}_{p}\right)\left(x_{i 2}-\bar{x}_{2}\right) & \cdots & \sum_{i=1}^n\left(x_{i p}-\bar{x}_{p}\right)^{2} \end{array}\right)\]
Question
Why do we divide by \(n-1\) instead of \(n\) in the covariance matrix calculation? We divide by \(n-1\) because of Bessel's correction. This adjustment gives an unbiased estimate of the population covariance when working with sample data: since the deviations are measured from the sample mean rather than the unknown population mean, only \(n-1\) of them are free to vary, and dividing by \(n\) would systematically underestimate the covariance.
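The effect can be checked with a small simulation, sketched below with made-up parameters: many small samples are drawn from a normal distribution with known variance, and the two denominators are compared. NumPy's `ddof` argument controls the denominator \(n - \text{ddof}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0                 # small samples from N(0, 4)

# Average the two variance estimators over many repetitions
reps = 20_000
samples = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
biased   = samples.var(axis=1, ddof=0).mean()   # divide by n
unbiased = samples.var(axis=1, ddof=1).mean()   # divide by n - 1 (Bessel's correction)

print(f"divide by n:   {biased:.3f}")    # systematically too small (expected value 3.2)
print(f"divide by n-1: {unbiased:.3f}")  # close to the true value 4.0
```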
The sample covariance matrix is then given by:
\[S=\frac{1}{n-1} \sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)\left(x_{i}-\overline{x}\right)^{T}\]
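A minimal NumPy sketch of this formula on made-up data; `np.cov` with `rowvar=False` uses the same \(n-1\) denominator, so both routes agree:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # hypothetical data: n = 100, p = 3
n, p = X.shape

x_bar = X.mean(axis=0)

# Outer-product formula: S = 1/(n-1) * sum_i (x_i - x_bar)(x_i - x_bar)^T
Xc = X - x_bar                          # centered data, shape (n, p)
S = Xc.T @ Xc / (n - 1)

# np.cov treats rows as variables by default, hence rowvar=False
assert np.allclose(S, np.cov(X, rowvar=False))
```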
The centered vector \(x_i - \overline{x}\) that appears in this formula also shows up in the Mahalanobis distance. The Mahalanobis distance is a measure of the distance between a point \(x\) and a distribution \(D\), introduced by P. C. Mahalanobis in 1936. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away \(x\) is from the mean of \(D\). This distance is zero if \(x\) is at the mean of \(D\), and grows as \(x\) moves away from the mean along each principal component axis. The Mahalanobis distance is thus unitless and scale-invariant, and it takes the correlations of the data set into account.
The Mahalanobis distance between two vectors \(x\) and \(y\) with covariance matrix \(S\) is defined as:
\[d(x, y)=\sqrt{(x-y)^{T} S^{-1}(x-y)}\]And the Mahalanobis distance between a vector \(x\) and the mean vector \(\overline{x}\) with covariance matrix \(S\) is defined as:
\[d(x, \overline{x})=\sqrt{(x-\overline{x})^{T} S^{-1}(x-\overline{x})}\]Geometrically speaking, the points at a constant Mahalanobis distance form an ellipsoid around the mean vector \(\overline{x}\), and the distance from a point \(x\) to \(\overline{x}\) is measured in terms of the standard deviation of the ellipsoid in the direction of \(x\).
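Both variants can be sketched in a few lines of NumPy (the data here are made up, and for numerical stability a linear system is solved instead of forming \(S^{-1}\) explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # hypothetical data set
x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

def mahalanobis(x, y, S):
    """Mahalanobis distance between vectors x and y under covariance S."""
    d = x - y
    # Solve S z = d instead of explicitly inverting S
    return float(np.sqrt(d @ np.linalg.solve(S, d)))

x = X[0]
print(mahalanobis(x, x_bar, S))         # distance of the first observation to the mean
```

SciPy also provides `scipy.spatial.distance.mahalanobis`, which expects the inverse covariance matrix as its third argument.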
In the two-dimensional case we obtain an ellipse characterized by the position of its centroid and by the length and orientation of its orthogonal principal axes. The direction and length of the principal axes are determined by the covariance matrix \(S\).
Determining the directions and lengths of these principal axes leads to a so-called eigenvalue problem.
Let us first transform the covariance matrix \(S\) into a diagonal matrix using a linear transformation that preserves its relevant characteristics (trace, determinant, eigenvalues, etc.). This means we are looking for a matrix \(G\) such that:
\[G ^ {T} S G = \Lambda\]where \(\Lambda\) is a diagonal matrix of eigenvalues of \(S\) and the column vectors of \(G\) are their corresponding eigenvectors.
\[\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix}\] \[G = \begin{bmatrix} g_1 & g_2 & \cdots & g_p \end{bmatrix}\]\(\lambda_1, \lambda_2, \cdots, \lambda_p\) are the eigenvalues of \(S\) and \(g_1, g_2, \cdots, g_p\) are the corresponding eigenvectors.
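Since \(S\) is symmetric, `np.linalg.eigh` returns its eigenvalues together with an orthogonal matrix of eigenvectors. The sketch below (on made-up data) verifies numerically that \(G^T G = I\) and \(G^T S G = \Lambda\):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order
eigvals, G = np.linalg.eigh(S)
Lambda = np.diag(eigvals)

# G is orthogonal, and G^T S G is (numerically) the diagonal matrix of eigenvalues
assert np.allclose(G.T @ G, np.eye(3))
assert np.allclose(G.T @ S @ G, Lambda)
```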
Question
What is the difference between eigenvectors and eigenvalues? Eigenvectors are the non-zero vectors whose direction does not change under a given linear transformation; they are only stretched or shrunk. The corresponding eigenvalue is the scalar factor by which an eigenvector is stretched, i.e. \(S g = \lambda g\).
The eigenvalues of \(S\) are the variances of the data along the principal axes, and the eigenvectors are the directions of the principal axes.
The objective of this transformation is that the transformed covariance matrix \(\Lambda = G^T S G\) is diagonal, which means that the covariances between the transformed variables are zero. This is called decorrelation.
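The decorrelation can be observed directly on data: projecting the centered observations onto the eigenvectors yields transformed variables whose sample covariance matrix is (numerically) diagonal, with the eigenvalues on the diagonal. A sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X, rowvar=False)
eigvals, G = np.linalg.eigh(S)

# Project the centered data onto the principal axes
Y = (X - X.mean(axis=0)) @ G

# The transformed variables are uncorrelated: off-diagonal covariances vanish,
# and the variances along the axes are the eigenvalues of S
S_Y = np.cov(Y, rowvar=False)
assert np.allclose(S_Y, np.diag(eigvals))
```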
The matrix \(S\) provides a complete description of the variances and the covariances of the involved variables. But often we are interested in the total variance of the dataset.
The total variance of the dataset is the sum of the variances of the individual variables, which is exactly the trace of the covariance matrix \(S\), written \(\operatorname{tr}(S)\).
The trace of the covariance matrix, and thus the total sample variance, can be seen as a natural generalization of the variance of a single variable to multiple variables.
\[\operatorname{tr}(S) = \sum_{j=1}^p S_{jj} = \sum_{i=1}^p \lambda_i\]Since the trace is preserved by the diagonalizing transformation above, the total sample variance also equals the sum of the eigenvalues of the covariance matrix \(S\).
The generalized sample variance is the determinant of the covariance matrix \(S\), written \(\det(S)\), and it provides a measure of the volume of the ellipsoid described above:
\[\det(S) = \prod_{i=1}^p \lambda_i\]
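Both quantities can be computed either from \(S\) directly or from its eigenvalues; a quick numerical check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(S)         # eigenvalues of the symmetric matrix S

# Total sample variance: trace of S = sum of the eigenvalues
assert np.isclose(np.trace(S), eigvals.sum())

# Generalized sample variance: determinant of S = product of the eigenvalues
assert np.isclose(np.linalg.det(S), eigvals.prod())
```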
The correlation matrix \(R\) is a normalized version of the covariance matrix \(S\): it is obtained by dividing each covariance by the product of the standard deviations of the two variables involved. The resulting matrix is symmetric with ones on the diagonal:
\[R = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix}\]Element-wise, each entry is the covariance of the two variables divided by the product of their standard deviations:
\[\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}\]
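A sketch of this normalization in NumPy (again on made-up data), checked against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
S = np.cov(X, rowvar=False)

# Divide each covariance by the product of the two standard deviations
std = np.sqrt(np.diag(S))
R = S / np.outer(std, std)

assert np.allclose(np.diag(R), 1.0)
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```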
The sample mean vector \(\overline{x}\) is an estimator for the expected value vector \(\mu\), the sample covariance matrix \(S\) is an estimator for the population covariance matrix \(\Sigma\), and the empirical correlation matrix \(R\) is an estimator for the population correlation matrix \(P\). So we have:
\[\hat{\mu} = \overline{x}, \qquad \hat{\Sigma} = S, \qquad \hat{P} = R\]The expected value of a random vector \(x = [x_1, x_2, \cdots, x_p]^T\) is given by:
\[E(x) = \mu = \begin{bmatrix} E(x_1) \\ E(x_2) \\ \vdots \\ E(x_p) \end{bmatrix}\]and the covariance matrix is given by:
\[Cov(x) = E[(x - \mu)(x - \mu)^T] = \Sigma = \begin{bmatrix} E(x_1 - \mu_1)^2 & E((x_1 - \mu_1)(x_2 - \mu_2)) & \cdots & E((x_1 - \mu_1)(x_p - \mu_p)) \\ E((x_2 - \mu_2)(x_1 - \mu_1)) & E(x_2 - \mu_2)^2 & \cdots & E((x_2 - \mu_2)(x_p - \mu_p)) \\ \vdots & \vdots & \ddots & \vdots \\ E((x_p - \mu_p)(x_1 - \mu_1)) & E((x_p - \mu_p)(x_2 - \mu_2)) & \cdots & E(x_p - \mu_p)^2 \end{bmatrix}\] \[= \begin{bmatrix} \sigma^2(x_1) & \sigma(x_1, x_2) & \cdots & \sigma(x_1, x_p) \\ \sigma(x_2, x_1) & \sigma^2(x_2) & \cdots & \sigma(x_2, x_p) \\ \vdots & \vdots & \ddots & \vdots \\ \sigma(x_p, x_1) & \sigma(x_p, x_2) & \cdots & \sigma^2(x_p) \end{bmatrix}\]where \(\sigma^2(x_i)\) is the variance of \(x_i\) and \(\sigma(x_i, x_j)\) is the covariance between \(x_i\) and \(x_j\).
The covariance matrix is a symmetric matrix with the variances of the variables on the diagonal and the covariances between the variables on the off-diagonal.
The \(p\)-dimensional random vector \(x\) is said to have a multivariate normal distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\) if its probability density function is given by:
\[f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)\]
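The density can be evaluated directly from this formula or, more conveniently, with `scipy.stats.multivariate_normal`; the sketch below uses made-up parameters and checks that both routes agree:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

# Direct evaluation of the density formula
p = len(mu)
d = x - mu
quad = d @ np.linalg.solve(Sigma, d)            # (x - mu)^T Sigma^{-1} (x - mu)
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

# Same value via SciPy
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

assert np.isclose(pdf_manual, pdf_scipy)
```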