Distance, Clustering and Dimensionality Reduction

Rafael A Irizarry
September 18, 2014

Clustering

[Figure: clustering illustration]

Distance

Review:

[Figure: two points, A and B, in the plane]

Distance between A and B:

\[ \sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2} \]
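
As a quick check of the formula, here is a minimal Python sketch. The function name `distance_2d` and the coordinates of A and B are made up for illustration; they are not taken from the slides.

```python
import math

def distance_2d(a, b):
    """Euclidean distance between two points given as (x, y) pairs."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Hypothetical coordinates chosen for illustration (not from the slides).
A = (1.0, 5.0)
B = (4.0, 1.0)

# The legs are 3 and 4, so the distance is 5.
print(distance_2d(A, B))  # 5.0
```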

Gene Expression Data

Here are the first 10 rows and two columns of the expression matrix:

          GSM25349.CEL.gz GSM25350.CEL.gz
1007_s_at           6.627           6.250
1053_at             6.939           6.818
117_at              5.114           5.074
121_at              7.834           7.781
1255_g_at           3.152           3.112
1294_at             7.411           7.558
1316_at             4.298           4.183
1320_at             3.628           3.633
1431_at             2.770           2.793
1438_at             5.421           5.153
Dimensions: 8793 rows (genes) by 208 columns (samples)

Gene Expression

Sample annotation for 10 of the samples:

    ethnicity sex       date
58        CEU   M 2003-01-03
154       ASN   F 2005-10-07
204       HAN   M 2006-04-28
104       ASN   F 2005-06-10
56        CEU   F 2002-11-15
137       ASN   M 2005-08-18
11        CEU   M 2002-12-17
187       HAN   F 2006-04-28
22        CEU   M 2002-11-21
150       ASN   M 2005-10-07

Distance

  • What is a point?
  • What is the distance between two samples?

A point is the expression profile of sample \( i \) across all \( N = 8793 \) genes: \( (Y_{i,1},\dots,Y_{i,N})' \)

Distance between two samples \( h \) and \( i \):

\[ d(h,i) = \sqrt{ \sum_{j=1}^{N} (Y_{h,j}-Y_{i,j})^2 } \]
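
To make the sum concrete, the sketch below applies the same formula to the two samples from the table shown earlier, using only the 10 genes listed there rather than all 8793. The function name `euclidean` is an illustrative choice, and the result is therefore only a partial distance.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length expression profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The 10 values shown in the table for each of the two samples.
gsm25349 = [6.627, 6.939, 5.114, 7.834, 3.152, 7.411, 4.298, 3.628, 2.770, 5.421]
gsm25350 = [6.250, 6.818, 5.074, 7.781, 3.112, 7.558, 4.183, 3.633, 2.793, 5.153]

d = euclidean(gsm25349, gsm25350)
print(round(d, 2))  # 0.52, based on these 10 genes only
```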

Expression data hierarchical clustering

[Figure: three hierarchical clustering plots of the expression data]
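
A rough sketch of how hierarchical (agglomerative) clustering builds its tree from pairwise distances. The slides do not say which linkage was used, so complete linkage is an assumption here, and the four toy samples are invented for illustration. A real analysis would use a statistics library instead of this brute-force loop.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hierarchical_clustering(points):
    """Agglomerative clustering with complete linkage (an assumed choice).

    Returns the merges as (members_a, members_b, height) tuples,
    in the order they were performed.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                h = max(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy profiles (4 samples, 3 genes), not the real data.
samples = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [8.0, 8.0, 8.0],
    [8.0, 9.0, 8.0],
]
for left, right, height in hierarchical_clustering(samples):
    print(left, right, round(height, 2))
```

The two close pairs merge first at height 1, and the final merge joins them at a much larger height, which is what a dendrogram displays.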

K-means clustering
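
A minimal sketch of one common way to run K-means (Lloyd's algorithm), assuming Euclidean distance and user-supplied starting centers. The slides do not specify an algorithm or initialization, so those choices, the function name `kmeans`, and the toy samples are all assumptions made for illustration.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and center updates."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(len(centers)),
                          key=lambda k: euclidean(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its points.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical toy profiles (4 samples, 3 genes), not the real data.
samples = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [8.0, 8.0, 8.0],
    [8.0, 9.0, 8.0],
]
labels, centers = kmeans(samples, centers=[samples[0], samples[2]])
print(labels)   # [0, 0, 1, 1]
print(centers)  # [[1.0, 1.0, 1.5], [8.0, 8.5, 8.0]]
```

Unlike hierarchical clustering, the number of clusters must be chosen up front, and the result can depend on the starting centers.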

Singular Value Decomposition

[Figure: SVD1]

Dimension Reduction

[Figure: SVD2]
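
A hedged sketch of why the SVD helps with dimension reduction. It assumes \( Y \) is arranged with one row per sample, matching the indices in the distance formula, and the symbols \( U \), \( D \), \( V \) are the usual SVD factors (notation not in the original slides):

\[ Y = U D V^\top, \qquad U^\top U = I, \quad V^\top V = I, \quad D_{1,1} \ge D_{2,2} \ge \dots \ge 0 \]

Because the columns of \( V \) are orthonormal, distances between rows of \( Y \) equal distances between rows of \( UD \):

\[ d(h,i)^2 = \sum_{j=1}^{N} (Y_{h,j}-Y_{i,j})^2 = \sum_{k} D_{k,k}^2 \, (U_{h,k}-U_{i,k})^2 \approx \sum_{k=1}^{2} D_{k,k}^2 \, (U_{h,k}-U_{i,k})^2 \]

The approximation is good when the remaining singular values are small. In that case, plotting \( D_{1,1} U_{i,1} \) against \( D_{2,2} U_{i,2} \) gives a two-dimensional picture that roughly preserves the distances between samples.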

Multidimensional Scaling Plot

[Figure: three multidimensional scaling plots of the expression data]
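
A rough sketch of classical multidimensional scaling, which places each sample in two dimensions so that pairwise distances are approximately preserved. The slides do not show how their plots were computed, so this is an assumed approach: it double-centers the squared distances and finds the top two eigenvectors by power iteration, which keeps the example free of dependencies. The function names and toy samples are illustrative, and a real analysis would use a linear algebra library.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def top_eigenpair(B, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(B)
    v = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-constant start
    for _ in range(iters):
        w = [sum(B[r][c] * v[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    lam = sum(v[r] * sum(B[r][c] * v[c] for c in range(n)) for r in range(n))
    return lam, v

def classical_mds(points, dims=2):
    n = len(points)
    # Squared pairwise distances.
    D2 = [[euclidean(p, q) ** 2 for q in points] for p in points]
    # Double centering: B = -1/2 * J D2 J, with J = I - 11'/n.
    row_mean = [sum(row) / n for row in D2]
    total_mean = sum(row_mean) / n
    B = [[-0.5 * (D2[r][c] - row_mean[r] - row_mean[c] + total_mean)
          for c in range(n)] for r in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        lam, v = top_eigenpair(B)
        lam = max(lam, 0.0)
        for r in range(n):
            coords[r][k] = math.sqrt(lam) * v[r]
        # Deflate so the next pass finds the next eigenvector.
        B = [[B[r][c] - lam * v[r] * v[c] for c in range(n)] for r in range(n)]
    return coords

# Hypothetical toy profiles (4 samples, 3 genes), not the real data.
samples = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 2.0],
    [8.0, 8.0, 8.0],
    [8.0, 9.0, 8.0],
]
coords = classical_mds(samples)
for i, (x, y) in enumerate(coords):
    print(i, round(x, 2), round(y, 2))
```

The signs of the coordinates are arbitrary, so only the relative positions of the points are meaningful.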