Expressions
Briton Park, Taylor Rhoads, Alexander Strzalkowski, Alborz Zibaii
MAPS REU
Department of Mathematics
University of Maryland, College Park
Abstract
Visualizations of Results
$\|x_i - x_j\|_2^2$
NCI-60 Data
60 Cancer Cell Lines
25041 Gene Expressions
A cancer cell line consists of multiple cancer cells of the same type, such as
lymphoma or melanoma. A gene expression value is a relative measure of how
much protein is produced from a specific protein-coding gene. Both of these
data sets exist in very high dimensional space, with the NCI-60 data set
living in $\mathbb{R}^{60}$ and the Broad Institute data set living in
$\mathbb{R}^{823}$. The number of observations, which we took to be the gene
expressions, is also large for both data sets: 25041 for NCI-60 and 19851 for
the Broad Institute data.
Figure 1: Laplacian Eigenmaps applied to Broad Institute gene expression data (left) and NCI-60 gene
expression data (right) using the first three eigenvectors. It is apparent that these two similar datasets have
vastly different representations.
Figure 2: Laplacian Eigenmaps applied to Broad Institute gene expression data (left) and NCI-60 gene data
(right) using the 18th, 19th, and 20th eigenvectors. Interestingly, using a different combination of eigenvectors
produces similar representations.
Methodology
Given the 823 cell lines and 19851 gene expressions, we used the first three
eigenvectors generated by Laplacian Eigenmaps to project the Broad Institute data onto
$\mathbb{R}^3$. Similarly, using the 60 cancer cell lines and 25041 gene expressions from the
NCI-60 data set, we projected the data onto $\mathbb{R}^3$ using the first three eigenvectors
(see Figure 1). We investigated further by embedding both datasets using various combinations
of the first thirty eigenvectors over a range of 20 to 1000 nearest neighbors and $\sigma$ from 0 to 20.
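The embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used for the poster: the function name `laplacian_eigenmaps`, the parameter names, the heat kernel form exp(-||x_i - x_j||^2 / sigma), the symmetrization rule, and the choice to skip the trivial constant eigenvector when counting are all assumptions. Rows of `X` are observations (gene expressions) and columns are cell lines. The sketch builds dense matrices, so it is only practical for a small subsample of the data.

```python
import numpy as np


def laplacian_eigenmaps(X, k=50, sigma=1.0, components=(1, 2, 3)):
    """Embed the rows of X using selected Laplacian Eigenmaps eigenvectors.

    `components` holds 1-based eigenvector indices, counted after skipping
    the trivial constant eigenvector (an assumption of this sketch).
    """
    n = X.shape[0]

    # Pairwise squared Euclidean distances ||x_i - x_j||_2^2.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)       # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbor

    # k nearest neighbors of every point.
    nbrs = np.argsort(d2, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), nbrs.shape[1])
    cols = nbrs.ravel()

    # Heat kernel weights on the neighbor graph (assumed form).
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-d2[rows, cols] / sigma)
    W = np.maximum(W, W.T)            # connect i and j if either is a neighbor

    deg = W.sum(axis=1)
    if np.any(deg == 0.0):
        raise ValueError("isolated point: weights underflowed, increase sigma")

    # Solve L y = lambda D y through the symmetric normalized Laplacian.
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    Y = d_inv_sqrt[:, None] * vecs    # generalized eigenvectors

    # Column 0 is the trivial eigenvector, so 1-based index c maps to column c.
    return Y[:, list(components)]
```

With these conventions, `laplacian_eigenmaps(X, k=50, sigma=1.0)` gives a three-dimensional embedding from the first three nontrivial eigenvectors, and passing `components=(18, 19, 20)` selects a later triple. Sweeping `k`, `sigma`, and `components` in a loop would cover the parameter ranges described above.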
Results
Using $k = 50$ and $\sigma = 1$ as representative parameters and the first three eigenvectors
of the data, applying Laplacian Eigenmaps to the Broad Institute data yielded a smooth
representation of the data points. This output did not appear to share the more
interesting structures present in the NCI-60 gene expression data.
We found that combinations of the first fifteen eigenvectors produced smooth
representations of the Broad Institute data similar to those in Figure 1, whereas
combinations of eigenvectors past the fifteenth behaved more like the NCI-60 embeddings
using the same eigenvectors.
Because of this difference in behavior, combinations past the fifteenth eigenvector were
explored. The resulting representation (see Figure 2) better portrayed aspects of the data's
overall structure that were not captured using any combination of the first fifteen
eigenvectors. In fact, the representation of the Broad Institute data using the eighteenth,
nineteenth, and twentieth eigenvectors seemed to better capture the structure it shares with
the NCI-60 embedding produced using Laplacian Eigenmaps.
Mathematics, Applied Mathematics, and Statistics Research Experience for Undergraduates
https://www-math.umd.edu/maps-reu.html