Matrix Algebra

Rafael A. Irizarry
October 9, 2014

Motivation

Our data is usually saved in matrices.
In the gene expression example data, \( X \), was \( 8793 \times 208 \)
Say you want to compute the average of each column

plot of chunk unnamed-chunk-3

Matrix Notation

Let \( X \) be a data matrix with \( N \) rows and \( p \) columns
The vector of column averages:

\[ \left( \frac{1}{N}\sum_{i=1}^N X_{i,1}, \dots, \frac{1}{N}\sum_{i=1}^N X_{i,p}\right) \]

can be written like this \( AX \) with \( A=(\frac{1}{N},\dots,\frac{1}{N}) \)

Matrix Notation

Here are the matrix operations \[ AX = \begin{pmatrix} \frac{1}{N} & \frac{1}{N} & \dots & \frac{1}{N} \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \bar{X}_1 & \bar{X}_2 & \dots &\bar{X}_N \end{pmatrix} \]

Matrix Notation

Here are the row averages:

\[ XB = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p\\ 1/p\\ \vdots\\ 1/p\\ \end{pmatrix} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots\\ \bar{X}_q\\ \end{pmatrix} \]

Matrix Notation (more general)

\[ AX = \begin{pmatrix} a_1 & a_2 & \dots & a_N \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \sum_{i=1}^N a_i X_{i,1} & \dots & \sum_{i=1}^N a_i X_{i,p} \end{pmatrix} \] Convenient and fast for coding

Matrix Notation

Say for each gene you want to compute
1. the average across all samples
2. the difference between males and females

plot of chunk unnamed-chunk-4

Matrix Notation

\[ \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p&a_1\\ 1/p&a_2\\ \vdots\\ 1/p&a_p\\ \end{pmatrix} \] with \[ \begin{align*} a_j &= \begin{cases} -1/N_F & \text{if female } \\ 1/N_M & \text{if male} \end{cases} \end{align*} \]

with \( N_F \)=number of females and \( N_M \)=number of males.

Matrix Notation (even more general)

\[ AX = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N}\\ \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} X_{i,1} & \dots & \sum_{i=1}^N a_{1,i} X_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} X_{i,1} & \dots & \sum_{i=1}^N a_{M,i} X_{i,p} \end{pmatrix} \]

Regression Models

We used regression with our Baseball data. For example

\[ Y_i= \beta_0 + X_{i}\beta_1 + \varepsilon_i, i=1,\dots,N \]

with \( Y_i \) = runs for team \( i \) and \( X_{i} \) = BB for team \( i \).

We need to find the \( \beta_0,\beta_1 \) pair that minimizes

\[ \sum_{i=1}^N (Y_i- \beta_0 - X_i\beta_1)^2 \]

How do we do this?

Linear Algebra Representation

We can write it like this:

\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \beta_1+ \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]

The \( Y \) column is a point in \( \mathbb{R}^N \)

The \( 1 \) and \( X \) vectors define a subspace, each point defined by the pair \( (\beta_0,\beta_1)\in \mathbb{R}^2 \).

Projections

Recall: The projection of the point \( Y \) to the smaller space minimizes the distance.

Simplest example

\( \vec{y} \in \mathbb{R}^N \) and \( L \subset \mathbb{R}^N \) is the space spanned by

\[ \vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\} \]

What \( c \) minimizes distance between \( c\vec{v} \) and \( \vec{y} \)?

Answer: The projection of \( \vec{y} \) onto \( L \): \[ \mbox{Proj}_L(\vec{y}) = \hat{c} \vec{v} \]

So how do we find \( \hat{c} \)?

Simplest example

The projection is orthogonal to the space so:

\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]

\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]

\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

Simplest example

\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

Going back to our original notation this means \( \hat{\beta}_0 = \bar{Y} \)

In this case calculus would have been just as easy:

\[ \frac{\partial}{\partial\beta_0}\sum_{i=1}^N (Y_i - \beta_0)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \beta_0) = 0 \implies \]

\[ N \beta_0 = \sum_{i=1}^N Y_i \implies \hat{\beta_0}=\bar{Y} \]

Regression Model

Make it slightly more complicated.

The space is now

\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \} \]

with

\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]

The Transpose

The columns become the rows

\[ X = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} , \, X^\top = \begin{pmatrix} X_{1,1}&\dots & X_{N,1} \\ X_{1,2}&\dots & X_{N,2} \\ & \vdots & \\ X_{1,p}&\dots & X_{N,p} \\ \end{pmatrix} \]

Variance

This helps us define projections but also:

If \( X \) is \( N\times p \) matrix of features for which the columns have average 0,

\[ S = \frac{1}{N} X^\top X \]

is called the sample covariance matrix
\( S_{i,i} \) is the variance of column \( i \)
\( S_{i,j} \) entry is the covariance between columns \( i \) and \( j \).
\[ \frac{S_{i,j}}{\sqrt{S_{i,i}S_{j,j}}} \] is the correlation

Regression Model

Let \( X = [ \vec{v}_0 \,\, \vec{v}_1] \). It's a \( N\times 2 \) matrix

Any point \( L \) can be writen as \( X\beta \).

The orthogonal projection must have

\( X^\top (\vec{y}-X\vec{\beta}) = 0 \)

\[ X^\top X \vec{\beta}= X^\top y \]

\[ \vec{\beta}= (X^\top X)^{-1}X^\top y \]

Regression Models

In general, we can write this

\[ Y_i= \beta_0 + X_{i,1}\beta_1 + \dots X_{i,p} \beta_p + \varepsilon_i, i=1,\dots,N \] as

\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1,1}\\ X_{2,1}\\ \vdots \\ X_{N,1}\\ \end{pmatrix} \beta_1+ \dots + \begin{pmatrix} X_{1,p}\\ X_{2,p}\\ \vdots \\ X_{N,p}\\ \end{pmatrix}\beta_p + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]

Regression Models

We can simplify to:

\[ Y = X\beta + \varepsilon \]

with \[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_N\\ \end{pmatrix}, \, X=\begin{pmatrix} 1&X_{1,1}&\dots&X_{1,p}\\ 1&X_{2,1}&\dots&X_{2,p}\\ & \vdots & \\ 1&X_{N,1}&\dots&X_{N,p} \end{pmatrix}, \, \beta = \begin{pmatrix} \beta_1\\ \vdots\\ \beta_p \end{pmatrix}, \, \varepsilon = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_N \end{pmatrix}\\ \\ \]

Regression Model

Solution is still

\[ \hat{\beta} = (X^\top X)^{-1} X^\top Y \]

Singular Value Decomposition (SVD)

We have a data matrix with \( N \) observations (rows) and \( p \) columns.

We can write our data matrix as

\[ X = U D V^\top , \mbox{ with } \]

\( U \) is \( N\times p \) with \( U^\top U=I \)
\( V \) is \( p\times p \) with \( V^\top V=I \) and
\( D \) is \( p \times p \) diagonal matrix

We are changing basis: \( XV = UD \)

And can change back \( XVV^\top = X \)

SVD

Say the columns of \( X \) are perfectly correlated
Then we really have just one column, the SVD catches this:

X = U[:,0] d[0,0] Vh[0,:]

Motivating SVD

Consider this simple linear combination:

\[ \Delta_j = (Y_{1,j}, \dots, Y_{N,j})'(1/N_1,\dots,1/N_1,-1/N_2,\dots,-1/N_2) = \]

\[ \frac{1}{N_1}\sum_{i=1}^{N_1} Y_{i,j} - \frac{1}{N_2}\sum_{i=N_1+1}^{N} Y_{i,j}, N_2=N-N_1 \]

Try to find the \( N_1 \) that maximizes the variance

\[ \frac{1}{p}\sum_{j=1}^p \Delta_j^2 \]

First PC correlates with batch

plot of chunk unnamed-chunk-9

Applications of SVD

Plot of standardized petal width and petal length columns plot of chunk unnamed-chunk-11

What transformation has more varinace?

This is the projection of \( (0,1) \) and \( (1,0) \) plot of chunk unnamed-chunk-12

How about this direction?

Line represents the space spaned by \[ \vec{v}=\begin{pmatrix}1/\sqrt(2)&-1/\sqrt(2)\end{pmatrix}' \]

plot of chunk unnamed-chunk-13

Here is the projection

\[ d\vec{u} = X \vec{v} \]

plot of chunk f

How about this one?

plot of chunk unnamed-chunk-14

What is \( \vec{v} \) here?

Transformation

Note these two are orthogonal so let

\[ V=\frac{1}{\sqrt{2}}\begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]

\( XV \) reports something proportional to the average and the difference.

The orthogonality is convenient because we can change back and forth:

\[ XVV^\top = X \]

SVD gives us that transformation

plot of chunk unnamed-chunk-15

Compression based on SVD

Get rid of one dimension and recuperate data XV[:,1]V.T

plot of chunk unnamed-chunk-16

Iris example

We take all four features now and apply SVD: plot of chunk unnamed-chunk-17

Iris example

plot of chunk unnamed-chunk-18

Iris example

plot of chunk unnamed-chunk-19

Iris example

plot of chunk unnamed-chunk-20

Obtain \( V \) from \( X_{train}=UDV^\top \)

Apply to \( X_{test} V \)

Iris example

plot of chunk unnamed-chunk-21