Rafael A. Irizarry
October 9, 2014
Our data is usually saved in matrices.
In the gene expression example, the data matrix \( X \) was \( 8793 \times 208 \)
Say you want to compute the average of each column
Let \( X \) be a data matrix with \( N \) rows and \( p \) columns
The vector of column averages:
\[ \left( \frac{1}{N}\sum_{i=1}^N X_{i,1}, \dots, \frac{1}{N}\sum_{i=1}^N X_{i,p}\right) \]
can be written as \( AX \) with \( A=\left(\frac{1}{N},\dots,\frac{1}{N}\right) \)
Here are the matrix operations \[ AX = \begin{pmatrix} \frac{1}{N} & \frac{1}{N} & \dots & \frac{1}{N} \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]
\[ = \begin{pmatrix} \bar{X}_1 & \bar{X}_2 & \dots &\bar{X}_p \end{pmatrix} \]
Here are the row averages:
\[ XB = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p\\ 1/p\\ \vdots\\ 1/p\\ \end{pmatrix} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots\\ \bar{X}_N\\ \end{pmatrix} \]
More generally, any row vector of weights \( A = (a_1,\dots,a_N) \) produces a linear combination of the rows of \( X \): \[ AX = \begin{pmatrix} a_1 & a_2 & \dots & a_N \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]
\[ = \begin{pmatrix} \sum_{i=1}^N a_i X_{i,1} & \dots & \sum_{i=1}^N a_i X_{i,p} \end{pmatrix} \] This is convenient and fast to code.
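Here is a minimal numpy sketch of these computations, using simulated data in place of the \( 8793\times 208 \) expression matrix (the dimensions and variable names below are made up for illustration):

import numpy as np

# Simulated stand-in for the data matrix X (N rows, p columns)
N, p = 100, 5
X = np.random.randn(N, p)

A = np.full((1, N), 1 / N)                   # the row vector (1/N, ..., 1/N)
print(np.allclose(A @ X, X.mean(axis=0)))    # AX reproduces the column averages

a = np.random.randn(1, N)                    # arbitrary weights: a linear combination of the rows
print(np.allclose(a @ X, (a.T * X).sum(axis=0)))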
We can compute the row averages and a female/male contrast in a single multiplication: \[ XB = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \begin{pmatrix} 1/p&a_1\\ 1/p&a_2\\ \vdots&\vdots\\ 1/p&a_p\\ \end{pmatrix} \] where \[ a_j = \begin{cases} -1/N_F & \text{if column } j \text{ is from a female} \\ 1/N_M & \text{if column } j \text{ is from a male} \end{cases} \]
with \( N_F \)=number of females and \( N_M \)=number of males.
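A sketch of this trick in numpy, assuming (hypothetically) that the first \( N_F \) columns are the female samples and the remaining \( N_M \) are the male samples:

import numpy as np

N, p = 50, 8
N_F, N_M = 3, 5                        # hypothetical split; N_F + N_M == p
X = np.random.randn(N, p)

a = np.r_[np.full(N_F, -1 / N_F), np.full(N_M, 1 / N_M)]
B = np.column_stack([np.full(p, 1 / p), a])    # p x 2

XB = X @ B
# Column 0: row averages; column 1: male average minus female average, per row.
print(np.allclose(XB[:, 0], X.mean(axis=1)))
print(np.allclose(XB[:, 1], X[:, N_F:].mean(axis=1) - X[:, :N_F].mean(axis=1)))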
More generally, multiplying on the left by an \( M\times N \) matrix \( A \) computes \( M \) such linear combinations at once: \[ AX = \begin{pmatrix} a_{1,1} & a_{1,2} & \dots & a_{1,N}\\ a_{2,1} & a_{2,2} & \dots & a_{2,N}\\ & & \vdots & \\ a_{M,1} & a_{M,2} & \dots & a_{M,N}\\ \end{pmatrix} \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} \]
\[ = \begin{pmatrix} \sum_{i=1}^N a_{1,i} X_{i,1} & \dots & \sum_{i=1}^N a_{1,i} X_{i,p}\\ & \vdots & \\ \sum_{i=1}^N a_{M,i} X_{i,1} & \dots & \sum_{i=1}^N a_{M,i} X_{i,p} \end{pmatrix} \]
We used regression with our Baseball data. For example
\[ Y_i= \beta_0 + X_{i}\beta_1 + \varepsilon_i, \quad i=1,\dots,N \]
with \( Y_i \) = runs for team \( i \) and \( X_{i} \) = BB for team \( i \).
We need to find the \( \beta_0,\beta_1 \) pair that minimizes
\[ \sum_{i=1}^N (Y_i- \beta_0 - X_i\beta_1)^2 \]
How do we do this?
We can write it like this:
\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \beta_1+ \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]
The \( Y \) column is a point in \( \mathbb{R}^N \)
The \( 1 \) and \( X \) vectors define a subspace, with each point in it determined by a pair \( (\beta_0,\beta_1)\in \mathbb{R}^2 \).
Recall: the projection of the point \( Y \) onto this smaller space is the point in the subspace that minimizes the distance to \( Y \).
\( \vec{y} \in \mathbb{R}^N \) and \( L \subset \mathbb{R}^N \) is the space spanned by
\[ \vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\} \]
What \( c \) minimizes distance between \( c\vec{v} \) and \( \vec{y} \)?
Answer: The projection of \( \vec{y} \) onto \( L \): \[ \mbox{Proj}_L(\vec{y}) = \hat{c} \vec{v} \]
So how do we find \( \hat{c} \)?
The projection is orthogonal to the space so:
\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]
\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]
\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]
Since \( \vec{v} \) is a vector of ones, \( \vec{y}\cdot\vec{v}=\sum_{i=1}^N Y_i \) and \( \vec{v}\cdot\vec{v}=N \). Going back to our original notation, this means \( \hat{\beta}_0 = \hat{c} = \bar{Y} \)
In this case calculus would have been just as easy:
\[ \frac{\partial}{\partial\beta_0}\sum_{i=1}^N (Y_i - \beta_0)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \beta_0) = 0 \implies \]
\[ N \beta_0 = \sum_{i=1}^N Y_i \implies \hat{\beta}_0=\bar{Y} \]
Make it slightly more complicated.
The space is now
\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \} \]
with
\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]
The transpose: the columns become the rows
\[ X = \begin{pmatrix} X_{1,1}&\dots & X_{1,p} \\ X_{2,1}&\dots & X_{2,p} \\ & \vdots & \\ X_{N,1}&\dots & X_{N,p} \\ \end{pmatrix} , \, X^\top = \begin{pmatrix} X_{1,1}&\dots & X_{N,1} \\ X_{1,2}&\dots & X_{N,2} \\ & \vdots & \\ X_{1,p}&\dots & X_{N,p} \\ \end{pmatrix} \]
This helps us define projections, but also: if \( X \) is an \( N\times p \) matrix of features for which the columns have average 0, then
\[ S = \frac{1}{N} X^\top X \]
is the sample covariance matrix of the features.
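A quick numpy check of this identity on simulated, column-centered data; note that np.cov divides by \( N-1 \) unless bias=True is passed:

import numpy as np

N, p = 200, 4
X = np.random.randn(N, p) @ np.random.randn(p, p)   # correlated columns
Xc = X - X.mean(axis=0)                              # columns now have average 0

S = (Xc.T @ Xc) / N
# bias=True makes np.cov divide by N, matching the formula above
print(np.allclose(S, np.cov(Xc, rowvar=False, bias=True)))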
Let \( X = [ \vec{v}_0 \,\, \vec{v}_1] \). It is an \( N\times 2 \) matrix.
Any point in \( L \) can be written as \( X\vec{\beta} \).
The orthogonal projection must satisfy
\( X^\top (\vec{y}-X\hat{\beta}) = 0 \)
\[ X^\top X \hat{\beta}= X^\top \vec{y} \]
\[ \hat{\beta}= (X^\top X)^{-1}X^\top \vec{y} \]
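A sketch of these normal equations for the two-column design, with simulated stand-ins for the runs and BB columns (the actual Baseball data set is not loaded here):

import numpy as np

N = 30
bb = np.random.uniform(300, 600, size=N)             # stand-in for BB per team
runs = 2 + 0.7 * bb + np.random.randn(N) * 20        # stand-in for runs per team

X = np.column_stack([np.ones(N), bb])                # columns v0 (all ones) and v1 (BB)
beta_hat = np.linalg.solve(X.T @ X, X.T @ runs)      # solves X^T X beta = X^T y
print(beta_hat)                                      # (intercept, slope) minimizing the sum of squares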
In general, we can write this
\[ Y_i= \beta_0 + X_{i,1}\beta_1 + \dots + X_{i,p} \beta_p + \varepsilon_i, \quad i=1,\dots,N \] as
\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_N\\ \end{pmatrix} = \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \beta_0 + \begin{pmatrix} X_{1,1}\\ X_{2,1}\\ \vdots \\ X_{N,1}\\ \end{pmatrix} \beta_1+ \dots + \begin{pmatrix} X_{1,p}\\ X_{2,p}\\ \vdots \\ X_{N,p}\\ \end{pmatrix}\beta_p + \begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_N\\ \end{pmatrix}\\ \]
We can simplify to:
\[ Y = X\beta + \varepsilon \]
with \[ Y = \begin{pmatrix} Y_1\\ \vdots\\ Y_N\\ \end{pmatrix}, \, X=\begin{pmatrix} 1&X_{1,1}&\dots&X_{1,p}\\ 1&X_{2,1}&\dots&X_{2,p}\\ & \vdots & \\ 1&X_{N,1}&\dots&X_{N,p} \end{pmatrix}, \, \beta = \begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}, \, \varepsilon = \begin{pmatrix} \varepsilon_1\\ \vdots\\ \varepsilon_N \end{pmatrix} \]
The solution is still
\[ \hat{\beta} = (X^\top X)^{-1} X^\top Y \]
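A numpy sketch comparing this closed-form solution with np.linalg.lstsq, which solves the same least-squares problem but is numerically preferable to forming \( (X^\top X)^{-1} \) explicitly; the data below are simulated:

import numpy as np

N, p = 100, 3
Z = np.random.randn(N, p)
X = np.column_stack([np.ones(N), Z])                   # intercept column plus p covariates
beta = np.array([1.0, 0.5, -2.0, 3.0])
Y = X @ beta + np.random.randn(N)

beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)     # SVD-based solver
print(np.allclose(beta_normal, beta_lstsq))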
We have a data matrix with \( N \) observations (rows) and \( p \) columns.
We can write our data matrix as
\[ X = U D V^\top, \mbox{ with } U^\top U = I, \; V^\top V = I, \mbox{ and } D \mbox{ diagonal} \]
We are changing basis: \( XV = UD \)
and we can change back: \( XVV^\top = X \)
Say the columns of \( X \) are perfectly correlated
Then we really have just one column of information, and the SVD catches this:
X = d[0] * np.outer(U[:, 0], Vh[0, :])
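A small numpy demonstration with an artificially rank-one matrix (every column a multiple of one vector):

import numpy as np

N = 50
u = np.random.randn(N, 1)
X = u @ np.array([[1.0, 2.0, -0.5, 3.0]])    # every column is a multiple of u: rank 1

U, d, Vh = np.linalg.svd(X, full_matrices=False)
print(d)                                     # only d[0] is (essentially) nonzero
X1 = d[0] * np.outer(U[:, 0], Vh[0, :])      # rank-one reconstruction
print(np.allclose(X, X1))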
Consider this simple linear combination:
\[ \Delta_j = (Y_{1,j}, \dots, Y_{N,j})\,(1/N_1,\dots,1/N_1,-1/N_2,\dots,-1/N_2)' \]
\[ = \frac{1}{N_1}\sum_{i=1}^{N_1} Y_{i,j} - \frac{1}{N_2}\sum_{i=N_1+1}^{N} Y_{i,j}, \quad \mbox{with } N_2=N-N_1 \]
Try to find the \( N_1 \) that maximizes the variance
\[ \frac{1}{p}\sum_{j=1}^p \Delta_j^2 \]
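One way to carry out this search is brute force over the candidate split points; the sketch below uses simulated data with an (assumed) shift in the first 15 rows, so the answer is known in advance:

import numpy as np

N, p = 40, 100
Y = np.random.randn(N, p)
Y[:15] += 2.0                          # rows 0..14 shifted up, so the best split should be near 15

def delta_var(Y, N1):
    # (1/p) * sum_j Delta_j^2 for the split into the first N1 rows vs the rest
    delta = Y[:N1].mean(axis=0) - Y[N1:].mean(axis=0)
    return (delta ** 2).mean()

scores = [delta_var(Y, N1) for N1 in range(1, N)]
print(1 + int(np.argmax(scores)))      # estimated N1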
Plot of standardized petal width and petal length columns
This is the projection onto \( (0,1) \) and \( (1,0) \)
The line represents the space spanned by \[ \vec{v}=\begin{pmatrix}1/\sqrt{2}&-1/\sqrt{2}\end{pmatrix}' \]
\[ d\vec{u} = X \vec{v} \]
What is \( \vec{v} \) here?
Note that these two directions are orthogonal, so let
\[ V=\frac{1}{\sqrt{2}}\begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]
\( XV \) reports something proportional to the average and the difference.
The orthogonality is convenient because we can change back and forth:
\[ XVV^\top = X \]
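A numpy sketch of this rotation, with two simulated, highly correlated columns standing in for the standardized petal measurements:

import numpy as np

N = 100
x1 = np.random.randn(N)
X = np.column_stack([x1, x1 + 0.1 * np.random.randn(N)])   # two highly correlated columns

V = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # orthogonal: V @ V.T = I
XV = X @ V
# Column 0 is proportional to the row average, column 1 to the row difference.
print(np.allclose(XV[:, 0], np.sqrt(2) * X.mean(axis=1)))
print(np.allclose(XV[:, 1], (X[:, 0] - X[:, 1]) / np.sqrt(2)))
print(np.allclose(X @ V @ V.T, X))             # change back and forth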
Drop one dimension and recover an approximation to the data: (X @ V)[:, :1] @ V[:, :1].T
We take all four features now and apply SVD:
Obtain \( V \) from \( X_{train}=UDV^\top \)
Apply the same \( V \) to the test set: \( X_{test} V \)
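A sketch of this train/test workflow, assuming the four-feature data set is the iris measurements (loaded here via scikit-learn, which the original slides may not have used) and an arbitrary even/odd split:

import numpy as np
from sklearn import datasets            # assumes scikit-learn is available for the iris data

X = datasets.load_iris().data           # 150 x 4: all four iris features
X_train, X_test = X[::2], X[1::2]       # hypothetical even/odd train/test split

U, d, Vh = np.linalg.svd(X_train, full_matrices=False)
V = Vh.T                                # columns of V from X_train = U D V^T

train_coords = X_train @ V              # equals U D
test_coords = X_test @ V                # project the test set onto the same directions
print(np.allclose(train_coords, U * d))
print(train_coords.shape, test_coords.shape)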