
Data Representation of Cancer Cell Line Gene Expressions
Briton Park, Taylor Rhoads, Alexander Strzalkowski, Alborz Zibaii
MAPS REU
Department of Mathematics
University of Maryland, College Park

Abstract


With the large quantity and high dimensionality that characterize today's data, finding meaningful representations of data is a significant challenge. Applying Laplacian Eigenmaps, a popular dimensionality reduction technique, to high-dimensional cancer cell line data produces interesting results. Specifically, we identify which eigenvectors of the Laplacian are most meaningful in bringing out the intrinsic structure of the data. Additionally, we compare the resulting representations of the two data sets.
Background
Data Representation
As the number of dimensions increases, the volume of the space grows rapidly, and exponentially more data points are required to preserve the same informational density. In practice, high-dimensional data is therefore sparse; this is the curse of dimensionality. Detecting and organizing areas of interest becomes problematic, as the sparsity causes objects to appear highly dissimilar.
Laplacian Eigenmaps
This is a nonlinear technique for dimensionality reduction.
Given $n$ points $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$, create a graph by letting each point correspond to a vertex and placing an edge between vertices $i$ and $j$ if $x_i$ is among the $k$ nearest neighbors of $x_j$, or vice versa.
Weight the edge between two adjacent vertices $i$ and $j$ using the heat kernel with a choice of the parameter $\sigma$:
$$W_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\sigma^2}}$$
Create a diagonal matrix $D$ whose diagonal entries are defined by:
$$D_{ii} = \sum_{j=1}^{n} W_{ij}$$

Calculate the Laplacian matrix $L = D - W$ and run an eigenvalue decomposition.
To embed the data into $\mathbb{R}^f$, project the data onto the first $f$ eigenvectors of the Laplacian matrix of the data.
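The steps above can be sketched in a few lines of Python. This is a minimal illustration of the construction described in this section (k-nearest-neighbor graph, heat-kernel weights, $L = D - W$, eigendecomposition), not the poster's actual code; the function name and defaults are our own, and we drop the constant eigenvector of $L$ before projecting, as is common practice.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh

def laplacian_eigenmap(X, k=50, sigma=1.0, f=3):
    """Embed the rows of X (n points in R^d) into R^f via Laplacian Eigenmaps."""
    n = X.shape[0]
    dists = squareform(pdist(X))                   # pairwise Euclidean distances

    # Symmetric k-nearest-neighbor adjacency: edge if i is among j's neighbors or vice versa.
    idx = np.argsort(dists, axis=1)[:, 1:k + 1]    # column 0 is the point itself
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), idx.ravel()] = True
    A = A | A.T

    # Heat-kernel weights on the edges of the neighborhood graph.
    W = np.where(A, np.exp(-dists**2 / sigma**2), 0.0)

    # Graph Laplacian L = D - W, where D holds the row sums of W.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Eigenvectors for the smallest eigenvalues; skip the trivial constant one.
    _, vecs = eigh(L)
    return vecs[:, 1:f + 1]

# Example: embed 200 random 50-dimensional points into R^3.
embedding = laplacian_eigenmap(np.random.rand(200, 50), k=10, sigma=1.0, f=3)
print(embedding.shape)  # (200, 3)
```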
Cancer Cell Line Data
Broad Institute Data
823 Cancer Cell Lines
19851 Gene Expressions

NCI-60 Data
60 Cancer Cell Lines
25041 Gene Expressions
A cancer cell line consists of multiple cancer cells of the same type, such as lymphoma or glioma. A gene expression value is a relative measure of how much of the protein encoded by a specific protein-coding gene is produced. Both data sets live in very high dimensional space: treating each gene expression as a data point, the NCI-60 points lie in $\mathbb{R}^{60}$ and the Broad Institute points lie in $\mathbb{R}^{823}$. The number of observations, the gene expressions themselves, is also staggering for both data sets.

Visualizations of Results
Figure 1: Laplacian Eigenmaps applied to Broad Institute gene expression data (left) and NCI-60 gene expression data (right) using the first three eigenvectors. It is apparent that these two similar data sets have vastly different representations.

Figure 2: Laplacian Eigenmaps applied to Broad Institute gene expression data (left) and NCI-60 gene expression data (right) using the 18th, 19th, and 20th eigenvectors. Interestingly, this different combination of eigenvectors produces similar representations.

Methodology
Given the 823 cell lines and 19851 gene expressions, we used the first three eigenvectors generated by Laplacian Eigenmaps to project the Broad Institute data onto $\mathbb{R}^3$. Similarly, using the 60 cancer cell lines and 25041 gene expressions from the NCI-60 data set, we projected the data onto $\mathbb{R}^3$ using the first three eigenvectors (see Figure 1). We investigated further by embedding both data sets using various combinations of the first thirty eigenvectors, over a range of 20–1000 nearest neighbors and $\sigma$ from 0–20; a sketch of this sweep follows.
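As a rough illustration of this sweep (not the original analysis code), the following sketch reuses the laplacian_eigenmap() function from the Laplacian Eigenmaps section above; the stand-in data matrix, parameter grid, and slice indices are our own placeholder choices.

```python
import numpy as np

# Placeholder data: rows are gene expressions, columns are cell lines.
X = np.random.rand(500, 60)

# Sweep nearest-neighbor counts and heat-kernel widths, keeping 30 eigenvectors
# per setting so that different eigenvector combinations can be inspected later.
embeddings = {}
for k in (20, 50, 200, 1000):
    for sigma in (0.5, 1, 5, 20):
        if k >= X.shape[0]:
            continue                       # cannot ask for more neighbors than points
        embeddings[(k, sigma)] = laplacian_eigenmap(X, k=k, sigma=sigma, f=30)

# Three-dimensional views analogous to the figures:
first_three = embeddings[(50, 1)][:, 0:3]      # Figure 1 style (first three eigenvectors)
eig_18_to_20 = embeddings[(50, 1)][:, 17:20]   # Figure 2 style (18th-20th eigenvectors)
```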
Results
Using $k = 50$ and $\sigma = 1$ as representative parameters and the first three eigenvectors of the data, applying Laplacian Eigenmaps to the Broad Institute data yielded a smooth representation of the data points. The output did not appear to share the more interesting structures present in the NCI-60 gene expression data.
We found that combinations of the first fifteen eigenvectors displayed similarly smooth representations of the Broad Institute data, as seen in Figure 1, whereas combinations of eigenvectors past the fifteenth behaved more like the NCI-60 embeddings built from the same eigenvectors.
Because of this, combinations past the fifteenth eigenvector were explored further. These representations (see Figure 2) better portrayed aspects of the data's overall structure that were not captured using any combination of the first fifteen eigenvectors. In fact, the representation of the Broad Institute data using the eighteenth, nineteenth, and twentieth eigenvectors seemed to best capture the structure it shares with the NCI-60 embedding produced using Laplacian Eigenmaps.

Partial Graph Isometries to Match Gene Expression Data
A large subset of the data is common to both the Broad Institute and the National Cancer Institute datasets: about 17,000 gene expressions and 40 cancer cell lines are shared between the two sources. Overlapping structures between the two embeddings of the data points are therefore expected, despite institutional differences in preprocessing the data. To identify and match these overlapping substructures, we propose novel approaches that utilize the Laplacian Eigenmaps representation of the data points.
Vector Quantization: A technique that locally approximates clouds of data points by representative vectors. This drastically reduces the number of points and the redundancy among them, which can lead to a sharper representation of the structures within a data set. The reduced computation time and less redundant vector representation make the subgraph matching problem more manageable; a sketch of this step appears below.
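A minimal sketch of this step, assuming k-means is used to learn the codebook of representative vectors (the codebook size below is an arbitrary placeholder, not a value from the poster):

```python
import numpy as np
from sklearn.cluster import KMeans

def quantize(points, n_codewords=100, random_state=0):
    """Replace a cloud of points with a small codebook of representative vectors."""
    km = KMeans(n_clusters=n_codewords, random_state=random_state, n_init=10)
    labels = km.fit_predict(points)
    codebook = km.cluster_centers_       # one representative vector per local region
    return codebook, labels

# Example: compress 10,000 embedded points down to 100 representatives.
codebook, labels = quantize(np.random.rand(10000, 3), n_codewords=100)
print(codebook.shape)  # (100, 3)
```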
Clustering: Applying a clustering algorithm to the embedding identifies inherent clusters in each dataset. Pairwise comparisons of clusters between the datasets can then be made using a similarity metric, and the most similar clusters can be matched. Hierarchical clustering, k-medoids, or Markov clustering can be used, along with an eigenvalue similarity metric or graph similarity via neighbor matching, to achieve this goal; a sketch appears below.
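One possible realization, assuming hierarchical (Ward) clustering on the embeddings and a simple eigenvalue-based similarity between clusters; the heat-kernel weighting and the choice of comparing the ten smallest Laplacian eigenvalues are our own placeholder choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def cluster_embedding(Y, n_clusters=8):
    """Hierarchical (Ward) clustering of an embedded point cloud Y."""
    Z = linkage(Y, method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust')

def eigenvalue_similarity(Y_a, Y_b, top=10):
    """Compare two clusters via the spectra of their heat-kernel graph Laplacians."""
    def spectrum(Y):
        W = np.exp(-squareform(pdist(Y))**2)
        L = np.diag(W.sum(axis=1)) - W
        return eigh(L, eigvals_only=True)[:top]
    sa, sb = spectrum(Y_a), spectrum(Y_b)
    m = min(len(sa), len(sb))
    return -np.sum((sa[:m] - sb[:m])**2)   # higher value means more similar spectra

# Example: match each cluster of embedding A to its most similar cluster in B.
A, B = np.random.rand(300, 3), np.random.rand(250, 3)
la, lb = cluster_embedding(A), cluster_embedding(B)
matches = {
    ca: max(set(lb), key=lambda cb: eigenvalue_similarity(A[la == ca], B[lb == cb]))
    for ca in set(la)
}
print(matches)
```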
Substructure Feature Selection: By identifying the most frequently occurring substructures within a pair of graphs, one can use these substructures as features. After computing the relative distances between pairs of features, a consistency graph is created, in which two features are consistent, and share an edge, if they are equidistant in both graphs. Highly connected components of this consistency graph can be used as candidates for graph matching; a sketch appears below.
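A hedged sketch of the consistency-graph construction, assuming each candidate feature already has a representative coordinate in both embeddings and using an arbitrary distance tolerance:

```python
import numpy as np
import networkx as nx

def consistency_graph(feats_a, feats_b, tol=0.05):
    """Graph on feature indices: i and j are connected ("consistent") when their
    pairwise distance is nearly the same in both embeddings."""
    G = nx.Graph()
    n = len(feats_a)
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            d_a = np.linalg.norm(feats_a[i] - feats_a[j])
            d_b = np.linalg.norm(feats_b[i] - feats_b[j])
            if abs(d_a - d_b) < tol:        # equidistant up to the tolerance
                G.add_edge(i, j)
    return G

# Candidate matches = highly connected components of the consistency graph.
feats_a, feats_b = np.random.rand(40, 3), np.random.rand(40, 3)
G = consistency_graph(feats_a, feats_b)
candidates = [c for c in nx.connected_components(G) if len(c) >= 5]
```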
Subgraph Matching: One can create a summarized representation of the embeddings of the two datasets. Given the two weight matrices $W_1$ and $W_2$ obtained from the two data sets, construct a combined weight matrix $W$. Extract subgraphs from this combined representation using the Highly Connected Subgraph (HCS) clustering algorithm. These mined subgraphs are the proposed overlapping structures between the two data sets; a sketch follows.
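Following the cited Hartuv and Shamir reference, a minimal sketch of the HCS step is shown below. How $W_1$ and $W_2$ are combined into $W$ is not specified on the poster, so a toy graph stands in for the combined graph, and the min_size cutoff is our own choice.

```python
import networkx as nx

def hcs(G, min_size=3):
    """Highly Connected Subgraphs (Hartuv & Shamir, 2000): recursively split G
    along minimum edge cuts until every remaining piece has edge connectivity
    greater than half its number of vertices."""
    n = G.number_of_nodes()
    if n < min_size:
        return []
    if not nx.is_connected(G):
        return [s for c in nx.connected_components(G)
                for s in hcs(G.subgraph(c).copy(), min_size)]
    if nx.edge_connectivity(G) > n / 2:
        return [set(G.nodes())]
    # Split along a minimum-weight edge cut and recurse on both sides.
    _, (side_a, side_b) = nx.stoer_wagner(G)
    return hcs(G.subgraph(side_a).copy(), min_size) + \
           hcs(G.subgraph(side_b).copy(), min_size)

# Toy example standing in for the combined graph: two dense blocks joined by a
# single edge. HCS recovers one highly connected subgraph per block.
G = nx.disjoint_union(nx.complete_graph(6), nx.complete_graph(5))
G.add_edge(0, 8)
print(hcs(G))
```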
Acknowledgments
We are thankful for the help and advice of Dr. Wojciech Czaja, Dr. Alex Cloninger, and Dr. Vinodh Rajapakse, as well as that of Jeremiah Emidih and Dr. Kasso Okoudjou.
The data was supplied to us by the National Institutes of Health.
This work was completed during the 2016 Mathematics, Applied Mathematics, and Statistics Research Experience for Undergraduates (MAPS REU, https://www-math.umd.edu/maps-reu.html) at the University of Maryland, which is supported by the National Science Foundation (grant No. 1359307).
References
M. Belkin and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15.6 (2003): 1373–1396.
E. Hartuv and R. Shamir. A Clustering Algorithm Based on Graph Connectivity. Information Processing Letters 76.4-6 (2000): 175–181.
