\section{Matrices as Arrays} The simplest way we think of matrices is as arrays of numbers, with an associated way to 'multiply' them, yeilding new such arrays: $$[\begin{array}\Square&\Square&\Square\\\Square&\Square&\Square\\\Square&\Square&\Square\end{array}]\CircleTimes[\begin{array}\Square&\Square&\Square\\\Square&\Square&\Square\\\Square&\Square&\Square\end{array}]=[\begin{array}\Square&\Square&\Square\\\Square&\Square&\Square\\\Square&\Square&\Square\end{array}]$$This 'multiplication' can also 'apply' a matrix $M$ to a vector $\vec{V}$:$$M \CircleTimes \vec{V} = [\begin{array}\Square&\Square&\Square\\\Square&\Square&\Square\\\Square&\Square&\Square\end{array}]\CircleTimes(\begin{array}\Square\\\Square\\\Square\end{array})=(\begin{array}\Square\\\Square\\\Square\end{array})$$Provide an 'inner product' mapping two vectors to a scalar:$$(\begin{array}\Square&\Square&\Square\end{array})\CircleTimes[\begin{array}\Square\\\Square\\\Square\end{array}]=[\Square]$$Or even an \it{Outer Product}, seldom discussed but very useful and interesting, which takes two vectors and returns a matrix:$$ [\begin{array}\Square\\\Square\\\Square\end{array}]\CircleTimes(\begin{array}\Square&\Square&\Square\end{array})=[\begin{array}\Square&\Square&\Square\\\Square&\Square&\Square\\\Square&\Square&\Square\end{array}]$$\section{Matrices as Functions} Another way to think of a matrix is as a 'mapping' from vectors to vectors. This can be thought of an actual transform (generally, a scaling or rotation) or as a change in coordinate system. For example, the following matrix 'projects' a two-dimensional plane onto the line $y = -x$: $$ [\begin{array}1&-1\\-1&1\end{array}] \CircleTimes [\begin{array}x\\y\end{array}] = [\begin{array}x-y\\y-x\end{array}] $$ A very important point is that information is being lost in this transform. For example, the point $(2,2)$ is the 'image' of both $(5,3)$ and $(12,10)$ in the original space. The fact that this mapping 'loses' a dimension is reflected in the fact that the two rows of the matrix are (trivially) linearly dependant. This matrix has 'rank' ONE, since it can only transform ONE dimension without losing information. Note that a \it{rectangular} matrix of size $m \times n, m > n$ can have at most rank n.\section{Matrices as Datasets} Another useful way to think of matrices is as observations of Random Variables. For example, if we have two random variables $x$ and $y$, with three observations each, we can tabulate our data as $\vec{X} = (\begin{array}x_1\\x_2\\x_3\end{array})$ or even put both variables together as $$D = [\begin{array}x_1&y_1\\x_2&y_2\\x_3&y_3\end{array}]$$. It is common and convenient to 'center' these data matrices by substracting off the \bf{mean} for each variable (the vector of means is easily added back in when appropriate. With data arranged this way, our familiar matrix operations become very useful; Our inner product yeilds the variance (a fact used to great effect in the geometry of statistics): $$X' \CircleTimes X = ( \begin{array} (x_1-\mu_x ) & (x_2 - \mu_x) & (x_3 - \mu_x) \end{array} ) \CircleTimes (\begin{array}x_1\\x_2\\x_3\end{array}) = \sum_i (x_i - \mu_i)^2 = \text{cov}(x)$$ Extending this to multiple variables brings us into the land of covariance matrices, the fundamental object in applied statistics: $$D' \CircleTimes D = (\begin{array}x_1&x_2&x_3\\y_1&y_2&y_3\end{array}) \CircleTimes [\begin{array}x_1&y_1\\x_2&y_2\\x_3&y_3\end{array}] = [\begin{array}\text{ var}(x) & \text{cov}(x,y)\\\text{cov}(y,x) & \text{ var}(y)\end{array}]$$ Note that these covariance matrices are always real, symmetric, and \it{positive semi-definate}. \section{Eigenvalues and Eigenvectors} \subsection{Definitions}An \bf{eigenvector} $V$ associated with a matrix $M$ is a vector whose image under $M$ is a simple scaling, thus:$$M \CircleTimes V = \lambda V$$Note that every eigenvalue is thus associated with a 1-dimensional linear subspace defined by it\'s eigen vector, since $\forall k \in \mathbb{R}$$$M \CircleTimes \vec{V} = \lambda \vec{V} \Rightarrow \vec{V} \CircleTimes (kV)=k \lambda \vec{V} $$\subsection{Examples}Another way to think of the line defined by the eigenvector is as an \bf{invariant subspace} - every 'point' along that line is mapped to that line by the action of $M$.In dynamical systems, the 'acceleration' or 'flow' is often represented as a field of vectors, each assigned to a position $\vec{X}=(x_1,y_1)$ by the equation $\vec{X} = M \CircleTimes \vec{X}$, usually approximated as $\vec{X_{i+1}} = \vec{X_i} + M \CircleTimes \vec{X}$. This means the dynamics along these lines is known - and by 'continuity', they must be 'similar nearby'. Thus, understanding the eigenstuff yields understanding about the overall behavior of the system: $M=[\begin{array}-2&1\\1&-2\end{array}]$ has eigenvectors $\vec{V_1}=(1,1)$ and $\vec{V_2}=(1,-1)$, with eigenvalues $\lambda_{1,2}=-1$ (why?). The phase portrait is 'fixed' by the dynamics of these subspaces: image1\section{Diagonalization} A fundamental result is that, for every n-dimensional \it{real}, \it{symmetric} matrix $M$, there exists an \it{orthogonal} matrix $U$ and a \it{diagonal} matrix $D$ such that $$M = U' \CircleTimes D \CircleTimes U$$ and the diagonal elements of $D$ are $\lambda_1\geq...\geq\lambda_n$ (the ordered eigenvalues of $M$) corresponding to $(\vec{U_1},...,\vec{U_n})$ the columns of $U$ (the eigenvectors of $M$). This is huge. One way to think of this is that every (full-rank) matrix has a coordinate system ($U$ is essentially a change of coordinates) in which the action of the matrix is a simple scaling along the coordinate axes. Another is as a way of 'decomposing' a matrix. Reflection and hand-waving will show that the above 'factoring' of $M$ allows us to write M as the sum of simple ($n \times n$)-matrices determined by the eigensystem: $$M = \lambda_1 \vec{U'_1}' + ... + \lambda_n \vec{U'_n}$$ Note further that since the $\{U_i\}_{i \in 1..n}$ are all the same 'length' (part of the orthonormal condition on $U$), the relative contributions of the matrices are determined by the size of the eigenvalues $\{\lambda_i\}_{i \in 1..n}$. Thus, an excellent approximation to $M$ may be achievable using only the first few vectors of the eigensystem. This is especially useful if the eigenvectors describe important behaviors of the system, as they always seem to.\section{Quadratic Forms}A common construction with a matrix $M$ is in a \it{Quadratic Form}, thusly:$$y = \vec{X}' \CircleTimes M \CircleTimes \vec{X}$$ note that $y$ is a scalar. Thus we are generating a real-valued function of the vector $\vec{X}$. This can be a wieghted norm or average, or arise as a 'squared-distance' function. An important and useful result is that the \it{maximum} and \it{minimum} of this function of $\vec{X}$ are $\lambda_1$ and $\lambda_2$, acheived at their corresponding eigenvectors $U_1$ and $U_n$, respectively. More specifically, $$\frac{\vec{X}}{2}$$$$\text{sup}_{\vec{X}} \frac{ \vec{X'} \CircleTimes M \CircleTimes \vec{X}}{\vec{X'} \CircleTimes \vec{X}} = U_1 \text{for some k \in \mathbb{R}}$$ Since the $\{U_i\}$ form a \it{basis}, we can write any $X$ as in terms of them: $$\vec{X} = C_1 \vec{U_1} + ... + C_N \vec{U_n}$$ which allows us to us to write the value of the (normalized) Quadratic Form as: $$ \frac{ \vec{X'} \CircleTimes M \CircleTimes \vec{X}}{\vec{X'} \CircleTimes \vec{X}} = \frac{C_1 \lambda^2_1 + ... + C_n \lambda^2_n }{\lambda^2_1 + ... + \lambda^2_n } $$ It is clear upon examination this is maximized as $\lambda_1$ at $C=(1,0,0,...)$; that is, when $\vec{X} = \vec{U_1}$. Similar treatments show the minimum is acieved at $\vec{U_n}$.\section{Least Squares Theory} Many classical and useful problems in statistics reduce to the system $$\vec{Y} = M \CircleTimes \vec{X}$$ $\vec{Y}$ may be a vector of observations, and we seek to estimate the value of $\vec{X}$ for known parameters $M$. Note that we may well have many more observations than Random Variables, so this system may be underdetermined; that is, the $M$ may not be a real symmetric matrix. The \it{Least Squares Estimate} of $\vec{X}$ will minimize $$( Y - M \CircleTimes \vec{X})' \CircleTimes ( Y - M \CircleTimes \vec{X} )$$ Taking derivatives w.r.t. $\vec{X}$ yields the \it{Normal Equation}: $$(M'\CircleTimes M)\CircleTimes \vec{X} = M' \CircleTimes \vec{Y}$$ Since $(M'\CircleTimes M)$ is (generally) a real, symmetric matrix, we can write its spectral decompostion $$(M'\CircleTimes M) = \lambda_1 U_1' U_1 + ... + \lambda_n U_n' U_n$$ which in turn provides us with a (psuedo) inverse:$$(M'\CircleTimes M)^{-} = \frac{1}{\lambda_1} U_1' U_1 + ... + \frac{1}{\lambda_n} U_n' U_n$$ so we can solve the system:$$\vec{X} = (M'\CircleTimes M)^- M' \CircleTimes \vec{Y}$$\section{Fisher Linear Discriminant} A classic problem in statistics involves finding a \it{linear dsicriminant}, that is, a linear combination of the features of a datapoint whose \it{sign} classifies it into one of two (or more) categories. For example, a datapoint $\vec{X}$ might represent a particular tissue sample, the features $(x_1,...,x_n)$ the expression levels of various genes, and the classes a clinical outcome. We seek a $\vec{w}$ such that $$\vec{w}' \CircleTimes \vec{X} \text{is} \{\begin{array}> 0 \forall \vec{X} \in \text{Class A}\\< 0 \forall \vec{X} \in \text{Class B}\end{array}$$or as close as can be found. Fisher's measure of the discriminating ability of such a $\vec{w}$ is given by$$J(\vec{w}) = \frac{(\mu_\text{Class A} - \mu_\text{Class B})^2}{\sigma^2_A+ \sigma^2_B }$$ that is, the ratio of the (squared) distance between class means to the projected variance for the classes determined by choice of $\vec{w}$. If $\vec{X_A}$ is the \it{centered} datapoints in class A, define the \it{Scatter Matrix} of A as$$S_A := \vec{X_A}' \CircleTimes \vec{X_A} = \under{\sum}{\vec{x} \in A} (\vec{x_i} - \mu_A)' \CircleTimes (\vec{x_i} - \mu_A) = n_A \Sigma_A$$IE, simple the covariance matrix for the class weighted by the number of elements in the class $n_A$. If we then define the \it{Scatter-Within} matrix as $S_W = S_A + S+A$ to represent the scatter within the classes, and the \it{Scatter Between} matrix relating the two classes as $S_B = (\vec{\mu_A}-\vec{\mu_B})' \CircleTimes (\vec{\mu_A}-\vec{\mu_B})$then we can wewrite the discriminant function as $$J(\vec{w}) = \vec{w}' ( S_B\CircleTimes{S_W}^-) \vec{w}$$ but this is just a quadratic form. So we know the best linear discriminant $\vec{w}$ is given by the first eigenvector of $S_B \CircleTimes S^{-}_W$. This is given by $$\vec{w} = {S_W}^{-} (\mu_A - \mu_B)$$which can be seen as intuitive.\section{Singular Value Decomposition} So, eigensystems tell you everything you need to go. What else could be needed? Well, nothing. Unfortunately, most matrices encountered in the real world are not real and symmetric. For example, in statistical applications, we often have many more observations than variables, so the data matrices are not square. So, we cannot generally diagonalize our matrix and work with the Spectral Decompostion. The \it{Singular Value Decomposition} should be seen as a way to come as close as possible to this enviaable state of things. In short, for \bf{any} matrix $M$, we can find \it{orthogonal} matrices $U$ and $V$, and diagonal matrix $D$, such that: $$M = U \CircleTimes D \CircleTimes V$$ where the rows of $V$ are an orthonormal basis for the rows of $M$, the columns of $U$ are an orthonormal basis for the coloumns of $M$, and the diagonal elements of $D$ are the \it{Singular Values} of $M$. The singular values are the closest to eigenvalues we can get for a singular matrix $M$; they are the eigenvalues of $\sqrt{ M' \CircleTimes M}$, and correspond to the \it{Left and Right Singular Vectors} making up $U$ and $V$ respectively, just as do eigenvalues and eigenvectors. So the situation is akin to a diagonalized matrix, but we are unable to use a single coordinate tranform to get into the 'optimized' space where the action of the matrix is easy to identify. Instead, we have two coordinate transforms - one for the rows, and one for the columns. It's an elegant solution to the Curse of Dimensionality, and a powerful one. There is an interesting and useful duality between $U$ and $V$. Suppose we wish to find out what linear combination of solumns from $M$ gave rise to one of the right singular vectors (the columns of $U$). We'd like to get $U$ in terms of $M$: $$M \CircleTimes (D V)^{-} = U $$ In other words, $(D V)^{-}$ will tell us how to build the columns of $U$ from those of $M$. This is useful when we find something interesting and want to relate it back to the raw data. Note that $(D V)$ is very easy to invert. For any matrices $A$ and $B$, $(AB)^- = B^- A^-$ - multiply them out and see. Now, $D^-$ is easy: for any diagonal matrix $D=\text{diag}(d_1,...,d_n) \Rightarrow D^-=\text{diag}(1/d_{1},...,1/d_{n})$. The inverse of $V$ brings out the beauty of the SVD. We know that $V$ is orthogonal. This \it{implies} lots of nice things, but the \it{definition} is that $V^- = V'$. In other words, the linear combination of columns from $M$ used to make the $i^{th}$ left singular vector (column of $U$) is given by the i^{th} row of $V$. \section{Image Processing Example} A color-indexed or RGB digital image is essentially a big matrix (or three), and as such provides an excellent example of some of the properties of the SVD. We can make use of the ordering of the singular values to reduce the data needed to describe the most 'important' aspects of the system. Specifically, we can write the following SVD version of the Spectral Decomposition: $$M= d_1 \vec{U_1} \CircleTimes \vec{V_1} + d_2 \vec{U_2} \CircleTimes \vec{V_2}+ ... + d_n \vec{U_n} \CircleTimes \vec{V_n} $$ where $d_i$, $\vec{U_i}$, and $\vec{V_i}$ are the i^{th} Singular Value, Left Singular Vector (column of $U$), and Right Singular Vector (row of $V$), respectively. Just as with the eigenvalues in a Spectral Decomposition, the first few Singular Values will be the largest, and thus these terms will make the largest 'contributions' to $M$. In fact, the matrix constructed by adding the first $k$ such terms is the rank-k matrix closest to $M$, in terms of least-squares distance between elements. This provides a simple compression or feature extraction technique. The following image is the blue channel from a 650 \times 600 RGB image. We convert the elements to floats, contruct the SVD, and then recreate the image using the approximation above for various values of $k$. Note that for, say, $k=10$, we are using (at most) $k * 600 *2 + 1 = 1201$ numbers, rather then the original $600 * 650 = 390,000$, a significant compression.\section{SVD for Microarray Data} Microarray data, with its massive datasets and high dimensionality, can be a fruitful domain for the application of SVD techniques. We demonstrate here one such approach. This example uses a synthetic dataset for clarity of exposition. We simulate 100 hybridizations, the last 10 of which are associated with some phenotype of interest. We have 1000 genes, the last 900 of which are slightly upregulated in the above 10 samples. Note that the increase in mean expression in this set is less than the general variances of the datapoints, so the upregulation is very difficult to detect naievely. We then calculate the SVD. Note that the columns of $M$ are arrays (chips, hybridizations). We would like to find an eigen-array (Left Singular Value) that focuses on the behavior we are interested in. Recall that the 'recipe' for each LSV is given by the columns of $V$. So we calculate the correlation of each of these columns to our 'phenotype vector'. This correlation is maximal for the second eigen-array, and examination of its 'recipe' shows that it is indeed capturing the overall differences between the two pheontype classes. So we know that the second eigenarray is a coordinate axis for the information we're looking for. We now need only find eigen-genes whose component along this axis is large, indicating that they may be differentially expressed between the classes. Having found these (by looking at the second element of the rows of $U$), we need to get back to genes (remember, these are 'eigen-genes'. Exploiting the relationships withing the SVD, we