Matrix Algebra

Rafael A. Irizarry
October 9, 2014

Motivation

  • Our data is usually saved in matrices.

  • In the gene expression example, the data matrix \( X \) was \( 8793 \times 208 \)

  • Say you want to compute the average of each column


Matrix Notation

  • Let \( X \) be a data matrix with \( N \) rows and \( p \) columns

  • The vector of column averages:

\[ \left( \frac{1}{N}\sum_{i=1}^N X_{i,1}, \dots, \frac{1}{N}\sum_{i=1}^N X_{i,p}\right) \]

can be written as \( AX \) with \( A=(\frac{1}{N},\dots,\frac{1}{N}) \)

Matrix Notation

Here are the matrix operations \[ AX = \begin{pmatrix} \frac{1}{N} & \frac{1}{N} & \dots & \frac{1}{N} \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \bar{X}_1 & \bar{X}_2 & \dots &\bar{X}_p \end{pmatrix} \]

Matrix Notation

Here are the row averages:

\[ XB = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p\\ 1/p\\ \vdots\\ 1/p\\ \end{pmatrix} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots\\ \bar{X}_N\\ \end{pmatrix} \]

Matrix Notation (more general)

\[ AX = \begin{pmatrix} a_1 & a_2 & \dots & a_N \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \sum_{i=1}^N a_i X_{i,1} & \dots & \sum_{i=1}^N a_i X_{i,p} \end{pmatrix} \]

Convenient and fast for coding.
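
For example, here is a minimal NumPy sketch (with a small simulated matrix rather than the gene expression data) checking that these products reproduce the column and row averages:

    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 6, 4                       # small toy dimensions
    X = rng.normal(size=(N, p))       # stand-in for a data matrix

    A = np.full((1, N), 1 / N)        # row vector (1/N, ..., 1/N)
    B = np.full((p, 1), 1 / p)        # column vector (1/p, ..., 1/p)

    col_means = A @ X                 # 1 x p: average of each column
    row_means = X @ B                 # N x 1: average of each row

    # agree with the built-in means
    assert np.allclose(col_means.ravel(), X.mean(axis=0))
    assert np.allclose(row_means.ravel(), X.mean(axis=1))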

Matrix Notation

  • Say for each gene you want to compute
    1. the average across all samples
    2. the difference between males and females


Matrix Notation

\[ \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p&a_1\\ 1/p&a_2\\ \vdots&\vdots\\ 1/p&a_p\\ \end{pmatrix} \] with \[ a_j = \begin{cases} -1/N_F & \text{if sample } j \text{ is female} \\ 1/N_M & \text{if sample } j \text{ is male} \end{cases} \]

with \( N_F \)=number of females and \( N_M \)=number of males.
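
A minimal sketch with simulated expression values and a made-up sex indicator for the columns:

    import numpy as np

    rng = np.random.default_rng(1)
    N, p = 100, 10                                     # toy: 100 genes, 10 samples
    X = rng.normal(size=(N, p))
    sex = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])     # 0 = female, 1 = male (made up)

    N_F = np.sum(sex == 0)
    N_M = np.sum(sex == 1)
    a = np.where(sex == 1, 1 / N_M, -1 / N_F)          # the contrast weights a_j

    # p x 2 matrix: first column gives per-gene averages, second the M - F difference
    B = np.column_stack([np.full(p, 1 / p), a])
    out = X @ B                                        # N x 2

    assert np.allclose(out[:, 0], X.mean(axis=1))
    assert np.allclose(out[:, 1],
                       X[:, sex == 1].mean(axis=1) - X[:, sex == 0].mean(axis=1))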

Matrix Notation (even more general)

\[ AX = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N}\\ \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]

\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} X_{i,1} & \dots & \sum_{i=1}^N a_{1,i} X_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} X_{i,1} & \dots & \sum_{i=1}^N a_{M,i} X_{i,p} \end{pmatrix} \]

Regression Models

We used regression with our Baseball data. For example

\[ Y_i= \beta_0 + X_{i}\beta_1 + \varepsilon_i, i=1,\dots,N \]

with \( Y_i \) = runs for team \( i \) and \( X_{i} \) = BB for team \( i \).

We need to find the \( \beta_0,\beta_1 \) pair that minimizes

\[ \sum_{i=1}^N (Y_i- \beta_0 - X_i\beta_1)^2 \]

How do we do this?

Linear Algebra Representation

We can write it like this:

\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \beta_1+ \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]

The vector \( Y \) is a point in \( \mathbb{R}^N \)

The \( 1 \) and \( X \) vectors span a subspace; each point in it is determined by a pair \( (\beta_0,\beta_1)\in \mathbb{R}^2 \).

Projections

Recall: the projection of the point \( Y \) onto the smaller space is the point in that space closest to \( Y \); it minimizes the distance.

Simplest example

\( \vec{y} \in \mathbb{R}^N \) and \( L \subset \mathbb{R}^N \) is the space spanned by

\[ \vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\} \]

What \( c \) minimizes distance between \( c\vec{v} \) and \( \vec{y} \)?

Answer: The projection of \( \vec{y} \) onto \( L \): \[ \mbox{Proj}_L(\vec{y}) = \hat{c} \vec{v} \]

So how do we find \( \hat{c} \)?

Simplest example

The projection is orthogonal to the space so:

\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]

\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]

\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

Simplest example

\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

Going back to our original notation, \( \vec{y}\cdot\vec{v} = \sum_{i=1}^N Y_i \) and \( \vec{v}\cdot\vec{v} = N \), so \( \hat{\beta}_0 = \hat{c} = \bar{Y} \)

In this case calculus would have been just as easy:

\[ \frac{\partial}{\partial\beta_0}\sum_{i=1}^N (Y_i - \beta_0)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \beta_0) = 0 \implies \]

\[ N \beta_0 = \sum_{i=1}^N Y_i \implies \hat{\beta_0}=\bar{Y} \]
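
A quick numerical check with a simulated \( \vec{y} \):

    import numpy as np

    rng = np.random.default_rng(2)
    y = rng.normal(loc=5, size=20)       # toy response vector
    v = np.ones_like(y)                  # the vector of ones

    c_hat = np.dot(y, v) / np.dot(v, v)  # projection coefficient
    assert np.isclose(c_hat, y.mean())   # equals beta0_hat = Ybar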

Regression Model

Make it slightly more complicated.

The space is now

\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \} \]

with

\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]

The Transpose

The columns become the rows

\[ X = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} , \, X^\top = \begin{pmatrix} X_{1,1}&\dots & X_{N,1} \\ X_{1,2}&\dots & X_{N,2} \\ & \vdots & \\ X_{1,p}&\dots & X_{N,p} \\ \end{pmatrix} \]

Variance

This helps us define projections but also:

If \( X \) is an \( N\times p \) matrix of features whose columns have average 0,

\[ S = \frac{1}{N} X^\top X \]

  • is called the sample covariance matrix
  • \( S_{i,i} \) is the variance of column \( i \)
  • \( S_{i,j} \) is the covariance between columns \( i \) and \( j \)
  • \( S_{i,j}/\sqrt{S_{i,i}S_{j,j}} \) is the correlation between columns \( i \) and \( j \) (checked numerically below)
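
A quick numerical check with simulated centered columns; note that np.cov and np.corrcoef default to the \( 1/(N-1) \) convention, so bias=True is passed to match the \( 1/N \) definition above:

    import numpy as np

    rng = np.random.default_rng(3)
    N, p = 200, 3
    X = rng.normal(size=(N, p))
    X = X - X.mean(axis=0)              # make every column average 0

    S = (X.T @ X) / N                   # sample covariance matrix (1/N convention)
    R = S / np.sqrt(np.outer(np.diag(S), np.diag(S)))   # correlations

    assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
    assert np.allclose(R, np.corrcoef(X, rowvar=False))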

Regression Model

Let \( X = [ \vec{v}_0 \,\, \vec{v}_1] \). It's an \( N\times 2 \) matrix.

Any point in \( L \) can be written as \( X\vec{\beta} \).

The orthogonal projection must have

\( X^\top (\vec{y}-X\hat{\beta}) = 0 \)

\[ X^\top X \hat{\beta}= X^\top \vec{y} \]

\[ \hat{\beta}= (X^\top X)^{-1}X^\top \vec{y} \]
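
A minimal numerical check with simulated data standing in for the Baseball example (the intercept and slope used to simulate are arbitrary); the residuals of the fitted projection are orthogonal to both columns of \( X \):

    import numpy as np

    rng = np.random.default_rng(4)
    N = 30
    x = rng.normal(size=N)                      # stand-in for BB per team
    y = 2.0 + 3.0 * x + rng.normal(size=N)      # simulated runs

    X = np.column_stack([np.ones(N), x])        # N x 2 design matrix [v0 v1]
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    residuals = y - X @ beta_hat
    print(X.T @ residuals)                      # both entries are ~0 (orthogonality)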

Regression Models

In general, we can write this

\[ Y_i= \beta_0 + X_{i,1}\beta_1 + \dots + X_{i,p} \beta_p + \varepsilon_i, \quad i=1,\dots,N \] as

\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1,1}\\ X_{2,1}\\ \vdots \\ X_{N,1}\\ \end{pmatrix} \beta_1+ \dots + \begin{pmatrix} X_{1,p}\\ X_{2,p}\\ \vdots \\ X_{N,p}\\ \end{pmatrix}\beta_p + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]

Regression Models

We can simplify to:

\[ Y = X\beta + \varepsilon \]

with \[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_N\\ \end{pmatrix}, \, X=\begin{pmatrix} 1&X_{1,1}&\dots&X_{1,p}\\ 1&X_{2,1}&\dots&X_{2,p}\\ \vdots & & \vdots & \\ 1&X_{N,1}&\dots&X_{N,p} \end{pmatrix}, \, \beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}, \, \varepsilon = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_N \end{pmatrix} \]

Regression Model

Solution is still

\[ \hat{\beta} = (X^\top X)^{-1} X^\top Y \]
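
A sketch of the general case with simulated data; in practice the system \( X^\top X \hat{\beta} = X^\top Y \) is solved directly (for example with np.linalg.solve or np.linalg.lstsq) rather than by explicitly inverting \( X^\top X \):

    import numpy as np

    rng = np.random.default_rng(5)
    N, p = 50, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # intercept + p predictors
    beta = np.array([1.0, 0.5, -2.0, 3.0])
    y = X @ beta + rng.normal(scale=0.1, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
    beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

    assert np.allclose(beta_hat, beta_lstsq)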

Singular Value Decomposition (SVD)

We have a data matrix with \( N \) observations (rows) and \( p \) columns.

We can write our data matrix as

\[ X = U D V^\top , \mbox{ with } \]

  • \( U \) is \( N\times p \) with \( U^\top U=I \)
  • \( V \) is \( p\times p \) with \( V^\top V=I \) and
  • \( D \) is a \( p \times p \) diagonal matrix with non-negative entries in decreasing order

We are changing basis: \( XV = UD \)

And we can change back: \( XVV^\top = X \) (since \( VV^\top = I \))
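
A minimal check of these properties on a random matrix; note that np.linalg.svd returns \( V^\top \) (named Vh):

    import numpy as np

    rng = np.random.default_rng(6)
    N, p = 10, 4
    X = rng.normal(size=(N, p))

    U, d, Vh = np.linalg.svd(X, full_matrices=False)   # thin SVD
    V = Vh.T

    assert U.shape == (N, p) and V.shape == (p, p) and d.shape == (p,)
    assert np.allclose(U.T @ U, np.eye(p))             # U'U = I
    assert np.allclose(V.T @ V, np.eye(p))             # V'V = I
    assert np.allclose(X, U @ np.diag(d) @ Vh)         # X = U D V'
    assert np.allclose(X @ V, U @ np.diag(d))          # change of basis: XV = UD
    assert np.allclose((X @ V) @ V.T, X)               # change back: XVV' = X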

SVD

  • Say the columns of \( X \) are perfectly correlated

  • Then we really have just one column, and the SVD catches this: only the first singular value is nonzero, so

\[ X = d_1 \, \vec{u}_1 \vec{v}_1^\top \]
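
A sketch where the columns are built as multiples of a single vector, so the matrix has rank one:

    import numpy as np

    rng = np.random.default_rng(7)
    z = rng.normal(size=20)
    X = np.column_stack([z, 2 * z, -0.5 * z])     # columns are multiples of one vector

    U, d, Vh = np.linalg.svd(X, full_matrices=False)
    print(d)                                      # only d[0] is (numerically) nonzero

    X1 = d[0] * np.outer(U[:, 0], Vh[0, :])       # rank-1 piece: d1 * u1 * v1'
    assert np.allclose(X, X1)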

Motivating SVD

Consider this simple linear combination:

\[ \Delta_j = (Y_{1,j}, \dots, Y_{N,j}) \, (1/N_1,\dots,1/N_1,-1/N_2,\dots,-1/N_2)' \]

\[ = \frac{1}{N_1}\sum_{i=1}^{N_1} Y_{i,j} - \frac{1}{N_2}\sum_{i=N_1+1}^{N} Y_{i,j}, \quad \text{with } N_2=N-N_1 \]

Try to find the \( N_1 \) that maximizes the variance

\[ \frac{1}{p}\sum_{j=1}^p \Delta_j^2 \]
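
A rough sketch with simulated data in which the first 25 rows carry a hypothetical batch shift; we scan over every split \( N_1 \) and keep the one with the largest variance:

    import numpy as np

    rng = np.random.default_rng(8)
    N, p = 40, 100
    Y = rng.normal(size=(N, p))
    Y[:25] += 2.0                        # hypothetical batch effect in the first 25 rows

    def delta_variance(Y, N1):
        # (1/p) sum_j Delta_j^2 for the split after row N1
        delta = Y[:N1].mean(axis=0) - Y[N1:].mean(axis=0)
        return np.mean(delta ** 2)

    variances = [delta_variance(Y, N1) for N1 in range(1, N)]
    best_N1 = 1 + int(np.argmax(variances))
    print(best_N1)                       # typically recovers the simulated split at 25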

First PC correlates with batch


Applications of SVD

Plot of the standardized petal width and petal length columns.

Which transformation has more variance?

This is the projection onto \( (0,1) \) and \( (1,0) \).

How about this direction?

The line represents the space spanned by \[ \vec{v}=\begin{pmatrix}1/\sqrt{2}&-1/\sqrt{2}\end{pmatrix}' \]


Here is the projection

\[ d\vec{u} = X \vec{v} \]


How about this one?


What is \( \vec{v} \) here?

Transformation

Note these two directions are orthogonal, so let

\[ V=\frac{1}{\sqrt{2}}\begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]

The two columns of \( XV \) are proportional to the average and the difference of the original columns.

The orthogonality is convenient because we can change back and forth:

\[ XVV^\top = X \]
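
A minimal check, assuming scikit-learn is available and that the two columns are the standardized petal length and width used in the plots:

    import numpy as np
    from sklearn.datasets import load_iris      # assumes scikit-learn is installed

    X = load_iris().data[:, 2:4]                # petal length, petal width
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the two columns

    V = np.array([[1, 1],
                  [1, -1]]) / np.sqrt(2)

    Y = X @ V                                   # col 0 ~ average, col 1 ~ difference
    assert np.allclose(V.T @ V, np.eye(2))      # V is orthogonal
    assert np.allclose(Y @ V.T, X)              # change back: X V V' = X
    print(np.var(Y, axis=0))                    # most variance is along the first direction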

SVD gives us that transformation


Compression based on SVD

Get rid of one dimension and then recover the data: keep only the first column of \( V \) and use \( (X\vec{v}_1)\vec{v}_1^\top \).
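
A sketch of the compression, again assuming scikit-learn for the two standardized petal columns:

    import numpy as np
    from sklearn.datasets import load_iris      # assumes scikit-learn is installed

    X = load_iris().data[:, 2:4]
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized petal length and width

    U, d, Vh = np.linalg.svd(X, full_matrices=False)
    v1 = Vh[0, :]                               # first right singular vector (first column of V)

    compressed = X @ v1                         # one number per flower
    recovered = np.outer(compressed, v1)        # (X v1) v1'

    print(d)                                    # second singular value is much smaller
    print(np.max(np.abs(X - recovered)))        # small reconstruction error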


Iris example

We now take all four features and apply SVD.

Iris example

Obtain \( V \) from the training set: \( X_{train}=UDV^\top \)

Apply it to the test set: \( X_{test} V \)
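
A sketch of this train/test use of the SVD, assuming scikit-learn for the iris features and an arbitrary random split; centering with the training means is also an assumption, not something stated on the slide:

    import numpy as np
    from sklearn.datasets import load_iris      # assumes scikit-learn is installed

    X = load_iris().data                        # 150 x 4 feature matrix
    rng = np.random.default_rng(11)
    idx = rng.permutation(len(X))
    train, test = idx[:100], idx[100:]

    X_train = X[train] - X[train].mean(axis=0)  # center using training means
    X_test = X[test] - X[train].mean(axis=0)

    U, d, Vh = np.linalg.svd(X_train, full_matrices=False)
    V = Vh.T

    train_proj = X_train @ V                    # same as U @ diag(d)
    test_proj = X_test @ V                      # apply the training V to the test set

    print(train_proj[:, :2].shape, test_proj[:, :2].shape)   # first two components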