Dual Geometry of High-Dimensional Data Sets

Introduction
Algorithm
Smoothness, approximation, and decay
Applications
Geometry
Conclusion
Dual geometries of high-dimensional datasets

Sarah Constantin Yale University
August 16, 2012
Introduction
  Organizing high-dimensional data
  Motivating example: text database
Algorithm
  Hierarchical clustering and trees
  Iterative principle
  Multiscale affinity
Smoothness, approximation, and decay
  How good is a tree?
Applications
  Genomic data
  Compression
  Matrix completion
Geometry
  Example: trigonometric functions
Conclusion
Organizing high-dimensional data

We often need to find clusters or patterns in high-dimensional data: many observations, each with many features. What if we could simultaneously organize the features and the observations, putting similar variables together and similar observations together? The idea is to take advantage of the dual relationship, the symmetry between features and observations. The algorithm presented here is the work of my advisor Ronald Coifman in a paper with Matan Gavish; I'll also be introducing some of the open problems I'm currently working on.
Motivating example: text database

Consider a document database represented as a matrix. Every row is a document; every column is a word. The (i, j)th entry is the number of times the jth word appears in the ith document. The dual clustering algorithm reshuffles the rows and columns of the matrix, so that similar documents go together and similar words go together. This builds up a tree hierarchy of words and a tree hierarchy of documents, grouped by context. Even very similar documents share only a small set of highly correlated words, and this relationship may drown in the overall noise; contextual similarity is more powerful than TF-IDF.
Science News database
Hierarchical clustering and trees
Given a measure of similarity, you can always build a tree. A K-means-style approach: choose a set of centroids more than a fixed distance apart; assign each data point to its nearest centroid; the resulting clusters form the nodes of the tree. To go to a higher level, take the centroid of each cluster and cluster those centroids in the same way.
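The tree-building steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation from the Coifman and Gavish paper: the function names `pick_centroids` and `build_tree`, the greedy centroid selection, and the `growth` factor that doubles the separation distance at each level are all hypothetical choices made here for the example.

```python
import numpy as np

def pick_centroids(points, min_dist):
    """Greedily pick indices of points that are pairwise more than min_dist apart."""
    chosen = []
    for i, p in enumerate(points):
        if all(np.linalg.norm(p - points[j]) > min_dist for j in chosen):
            chosen.append(i)
    return chosen

def build_tree(points, min_dist, growth=2.0):
    """Return a list of label arrays, one per tree level.

    levels[l][i] is the folder (cluster) containing original point i at level l.
    Level 0 puts every point in its own folder; the last level has one folder.
    The growth factor (an assumption of this sketch) coarsens the scale per level.
    """
    if min_dist <= 0 or growth <= 1:
        raise ValueError("min_dist must be > 0 and growth must be > 1")
    current = np.asarray(points, dtype=float)   # nodes of the current level
    labels = np.arange(len(current))            # original point -> current node
    levels = [labels.copy()]
    while len(current) > 1:
        idx = pick_centroids(current, min_dist)
        min_dist *= growth
        if len(idx) == len(current):
            continue                            # nothing merged; try a coarser scale
        centers = current[idx]
        # Assign every current node to its nearest centroid.
        d = np.linalg.norm(current[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        labels = assign[labels]
        levels.append(labels.copy())
        # The mean of each new cluster becomes a node of the next level.
        current = np.array([current[assign == c].mean(axis=0)
                            for c in range(len(idx))])
    return levels

levels = build_tree([[0, 0], [0.1, 0], [5, 5], [5.1, 5]], min_dist=1.0)
```

Each entry of `levels` is one generation of the tree, so the folders at any scale can be read off directly from the corresponding label array.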
Hierarchical clustering example
Genetic microarray data.
Iterative principle
First build K-means trees on the observations and the features. The initial similarity is based on Euclidean distance: the squared distance between two observations is the sum of squared differences between their features. Now build a new similarity, which we'll define based on the tree structure. Update the trees iteratively: build a new tree of features based on the similarity defined using the observation tree, then a new tree of observations based on the similarity defined using the feature tree. Reshuffle rows and columns so that nearby rows are in the same cluster on the row tree and nearby columns are in the same cluster on the column tree.
Tree
Multiscale affinity
Let $X_k^l$ be the $k$th node of a tree, at generation $l$. Let the affinity between $f$ and $g$ be given by
$$\rho(f, g) = \exp\Big(-\frac{1}{\varepsilon} \sum_x \big(f(x) - g(x)\big)^2\Big)$$
where $\varepsilon > 0$ is a scale parameter. (Another possible choice is $\rho = \operatorname{cov}(f, g)/(\sigma(f)\,\sigma(g))$, the correlation/inner product between two functions.) $\rho$ is an affinity between two functions; it's equal to 1 if they're similar, and decreases to 0 as they become more different.
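A small sketch of the two affinity choices may help make them concrete. It assumes the Gaussian form $\exp(-\frac{1}{\varepsilon}\sum_x (f(x)-g(x))^2)$ with a scale parameter called `eps` here; the function names and the default `eps=1.0` are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def gaussian_affinity(f, g, eps=1.0):
    """exp(-sum((f - g)^2) / eps): 1 for identical inputs, near 0 for distant ones."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.exp(-np.sum((f - g) ** 2) / eps))

def correlation_affinity(f, g):
    """Pearson correlation cov(f, g) / (sigma(f) * sigma(g))."""
    return float(np.corrcoef(f, g)[0, 1])

same = gaussian_affinity([1, 2, 3], [1, 2, 3])        # identical functions
far = gaussian_affinity([0, 0], [3, 4])               # squared distance 25
corr = correlation_affinity([1, 2, 3], [2, 4, 6])     # perfectly correlated
```

Note that the correlation choice is scale-invariant, while the Gaussian choice depends on `eps`, so the two are not interchangeable without retuning.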
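Define the multiscale affinity of two columns $M_{y_1}$, $M_{y_2}$ with respect to a row tree $T_x$ by
$$\rho_{T_x}(M_{y_1}, M_{y_2}) = \sum_{k,l} \frac{1}{|X_k^l|}\, \rho_{X_k^l}(M_{y_1}, M_{y_2})$$
Here $\rho_{X_k^l}$ is the affinity restricted to the folder $X_k^l$: on each folder in the partition tree, we take the affinity between the columns, but just restricted to that folder.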
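As a rough sketch, the multiscale affinity can be computed by looping over the folders of every tree level. The representation of the row tree as a list of label arrays (one per level), the Gaussian form of the per-folder affinity, and the names `multiscale_affinity` and `eps` are assumptions made for this example.

```python
import numpy as np

def multiscale_affinity(col_a, col_b, levels, eps=1.0):
    """Sum over all folders X of (1 / |X|) * affinity of the columns restricted to X.

    levels is a list of label arrays over the rows; levels[l][i] is the folder
    containing row i at generation l (an assumed tree representation).
    """
    col_a = np.asarray(col_a, dtype=float)
    col_b = np.asarray(col_b, dtype=float)
    total = 0.0
    for labels in levels:
        labels = np.asarray(labels)
        for folder in np.unique(labels):
            mask = labels == folder
            diff = col_a[mask] - col_b[mask]
            rho = np.exp(-np.sum(diff ** 2) / eps)   # affinity restricted to folder
            total += rho / mask.sum()                # weight 1 / |X|
    return total

# A three-level row tree on four rows: singletons, two pairs, one root folder.
levels = [np.array([0, 1, 2, 3]), np.array([0, 0, 1, 1]), np.array([0, 0, 0, 0])]
a = [1.0, 1.0, 5.0, 5.0]
b = [1.0, 1.0, 5.0, 9.0]
self_affinity = multiscale_affinity(a, a, levels)
cross_affinity = multiscale_affinity(a, b, levels)
```

For identical columns every folder contributes exactly `1 / |X|`, so the value depends only on the tree; a mismatch in one row lowers the contribution of every folder containing that row, at every scale.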
Define the similarity between the rows with the multiscale affinity defined with respect to the existing tree on the columns. In the database example, this means that more weight is given to the similarity of two documents in clusters of related words; it incorporates contextual information. Likewise, the similarity of two words is given more weight if they occur in similar documents.
How good is a tree?
After we've iterated this algorithm, how do we know if we have a good pair of trees and reorganization? For similar rows and similar columns to be arranged together, we want to be able to say that the values of the matrix vary smoothly. As a general principle, when you expand a function in a basis (like Fourier coefficients, wavelets, etc.), faster coefficient decay corresponds to greater smoothness of the function. Cluster trees induce a wavelet-like basis called a tensor Haar-like basis that consists of functions on the matrix that take constant values on the rectangles defined by pairs of clusters in the rows and columns.
Haar-like function
Tensor Haar-like function
Smoothness and reshuffling

It can be shown that:
  Every function on the matrix can be expressed in an orthonormal basis of tensor Haar-like functions.
  Functions can be approximated by taking only those tensor Haar functions which have large coefficients and are supported on large folders.
  If a function on the matrix has rapidly decaying coefficients in the tensor Haar expansion, then it satisfies a Hölder smoothness condition, and vice versa.
  If a function can be described efficiently in a tensor Haar-like basis, then it decomposes into a typical matrix, which varies smoothly in rows and columns, and a small outlier matrix with irregular behavior but small support; this is a Calderón-Zygmund argument.
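To illustrate the first two statements, here is a minimal sketch of a Haar-like basis and the corresponding tensor expansion of a matrix. It assumes, purely for simplicity, a balanced binary partition tree that splits contiguous index ranges in half; trees produced by the clustering algorithm would generally be neither binary nor contiguous. The name `haar_like_basis` is hypothetical.

```python
import numpy as np

def haar_like_basis(n):
    """Orthonormal Haar-like basis for a balanced binary partition tree on range(n).

    Returns an (n, n) array whose rows are the basis vectors: one constant
    vector, plus one vector per split folder that is constant on each child
    and has mean zero on the folder.
    """
    basis = [np.full(n, 1.0 / np.sqrt(n))]

    def split(idx):
        if len(idx) < 2:
            return
        a, b = idx[: len(idx) // 2], idx[len(idx) // 2:]
        c = np.sqrt(len(a) * len(b) / (len(a) + len(b)))  # normalizing constant
        h = np.zeros(n)
        h[a] = c / len(a)
        h[b] = -c / len(b)
        basis.append(h)
        split(a)
        split(b)

    split(np.arange(n))
    return np.array(basis)

# Tensor Haar-like expansion of a matrix that is constant on 2x2 blocks.
M = np.add.outer([1.0, 1.0, 3.0, 3.0], [0.0, 0.0, 2.0, 2.0])
Hr = haar_like_basis(M.shape[0])      # basis for the row tree
Hc = haar_like_basis(M.shape[1])      # basis for the column tree
coeffs = Hr @ M @ Hc.T                # coefficients of the tensor basis functions
reconstructed = Hr.T @ coeffs @ Hc    # exact, since both bases are orthonormal
num_nonzero = int(np.sum(np.abs(coeffs) > 1e-9))
```

Because this matrix is constant on the rectangles defined by the coarse folders, only a few coefficients are nonzero, which is the sense in which a well-organized matrix is described efficiently in this basis.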
Genomic data
Warning: rampant speculation! What if we put individuals on the rows and genes on the columns, and applied the algorithm? The row-tree approximates a family tree. The column-tree gives similarities between genes, which should approximate their location on the sequence. Possible new technique in computational phylogeny; doesn't require global optimizations.
Genomic data and disease risk

A trait, or disease, is a function on the rows. Rectangles that light up correspond to clusters of genes inherited together and related individuals who have the disease. Predict disease risk via tensor Haar basis expansion. Unlike genome-wide association studies (GWAS) plus logistic regression, this makes no assumption that genes are uncorrelated! The limited number of tensor Haar coefficients imposes a sparsity condition, which prevents overfitting. The cost of sequencing is dropping, so we need more efficient statistical tools! GWAS often find spurious correlations and fail to predict risk of disease.
Compression
To compress a matrix, store both partition trees, and only the coefficients of the tensor Haar functions defined on rectangles bigger than $c\,\varepsilon^2$, where $\varepsilon$ is the desired error. This allows storage of matrices in much less space.
Matrix completion
In a matrix with missing values, if we have the trees, we can estimate the values by averaging on local clusters. For example: if you have a matrix of users and clicks/purchases/preferences, reorganizing by the dual clustering algorithm and extrapolating new values can give a new algorithm for collaborative filtering.
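A minimal sketch of the averaging step, assuming the trees have already been computed and that a single level of each is used: missing entries are marked with NaN and filled with the mean of the known entries in the same (row folder, column folder) rectangle. The function name and the use of one tree level only are assumptions of this example; a fuller version might fall back to coarser folders when a rectangle has no known entries.

```python
import numpy as np

def complete_by_folder_means(M, row_labels, col_labels):
    """Fill NaN entries with the mean of known entries in the same rectangle."""
    M = np.asarray(M, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    filled = M.copy()
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            block = np.ix_(row_labels == r, col_labels == c)
            vals = M[block]
            known = ~np.isnan(vals)
            if known.any() and not known.all():
                sub = vals.copy()
                sub[~known] = vals[known].mean()   # local cluster average
                filled[block] = sub
            # Rectangles with no known entries are left as NaN in this sketch.
    return filled

nan = np.nan
M = [[1.0, 1.0, 5.0, 5.0],
     [1.0, nan, 5.0, 5.0],
     [9.0, 9.0, 2.0, nan],
     [9.0, 9.0, 2.0, 2.0]]
completed = complete_by_folder_means(M, [0, 0, 1, 1], [0, 0, 1, 1])
```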
Example: trigonometric functions
If the data matrix you use is $A(m, n) = \sin(2\pi m n / T)$, then the rows are different frequencies of sines, and the columns are different points where the functions are sampled. Shuffling the rows and columns at random, and then reorganizing them by the dual clustering algorithm, recovers the sines in order.
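The setup of this experiment (not the recovery step, which is the dual clustering algorithm itself) can be sketched as follows. The matrix sizes, the seed, and the function names are arbitrary choices made for illustration.

```python
import numpy as np

def sine_matrix(num_freqs, T):
    """A[m, n] = sin(2*pi*m*n/T) for frequencies m = 1..num_freqs, samples n = 0..T-1."""
    m = np.arange(1, num_freqs + 1)[:, None]
    n = np.arange(T)[None, :]
    return np.sin(2 * np.pi * m * n / T)

def shuffle_rows_cols(A, seed=0):
    """Apply independent random permutations to the rows and the columns."""
    rng = np.random.default_rng(seed)
    row_perm = rng.permutation(A.shape[0])
    col_perm = rng.permutation(A.shape[1])
    return A[np.ix_(row_perm, col_perm)], row_perm, col_perm

A = sine_matrix(8, 64)
S, row_perm, col_perm = shuffle_rows_cols(A)

# Knowing the permutations, the shuffle can be undone exactly; the algorithm's
# task is to find an equivalent ordering without being told the permutations.
restored = np.empty_like(A)
restored[np.ix_(row_perm, col_perm)] = S
```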
Trig functions and dual geometry
One can actually prove that the multiscale affinity between two trigonometric functions is an increasing function in the difference between their frequencies. The geometry of trigonometric functions in the multiscale affinity norm gives us back the circle. There's a relationship between the geometry of the surface the functions are defined on, and the geometry of relationships between the functions. We might ask: is this true for other functions on other surfaces?
Eigenfunctions of the Laplacian

The Laplace operator is an anti-averaging function; it's the edge sharpening filter in image processing. On the plane it's simply
$$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}.$$
On arbitrary surfaces, it's defined as $\Delta f = \operatorname{div}(\operatorname{grad}(f))$. An eigenfunction of the Laplacian is a function having the property that $\Delta f = \lambda f$. On the circle, trigonometric functions are the eigenfunctions of the Laplacian.
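The statement about the circle can be checked numerically on a discretized circle. The sketch below assumes the standard second-difference Laplacian on N equally spaced points with periodic wraparound (unit grid spacing); for this discrete operator a sampled sine of integer frequency k is an eigenvector with eigenvalue 2 cos(2πk/N) − 2. The function name and the values of N and k are illustrative.

```python
import numpy as np

def circle_laplacian(N):
    """Second-difference Laplacian on N points arranged on a circle (periodic)."""
    L = -2.0 * np.eye(N)
    idx = np.arange(N)
    L[idx, (idx + 1) % N] += 1.0   # right neighbor, wrapping around
    L[idx, (idx - 1) % N] += 1.0   # left neighbor, wrapping around
    return L

N, k = 32, 3
theta = 2 * np.pi * k / N
f = np.sin(theta * np.arange(N))        # sampled sine of frequency k
L = circle_laplacian(N)
eigenvalue = 2 * np.cos(theta) - 2      # discrete analogue of -k^2 (up to scaling)
residual = np.max(np.abs(L @ f - eigenvalue * f))
```

The eigenvalue is nonpositive and grows in magnitude with the frequency, mirroring the continuous case where the sine of frequency k has eigenvalue −k².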
Eigenfunctions of the Laplacian on surfaces
Every function on a smooth manifold can be expanded in terms of eigenfunctions of the Laplacian, just as every periodic function can be expanded in a trigonometric series (the Fourier expansion). For some surfaces the Laplacian eigenfunctions have closed-form solutions: on the disc, the Bessel functions, and on the sphere, the spherical harmonics. Efficiently or sparsely expanding a function on a surface in Laplacian eigenfunctions is valuable for the same reason that fast or sparse Fourier expansion is useful for functions defined on an interval.
Eigenfunctions, not just eigenvalues
Can you hear the shape of a drum? No!
Eigenvalues alone don't determine geometry. But perhaps including relationships between eigenfunctions will.
Counterexample: incoherence
It turns out that you can't always reconstruct the geometry of a surface from the multiscale affinities between its Laplacian eigenfunctions. Indeed, on the sphere, if you choose a basis of Laplacian eigenfunctions at random, one can prove that with high probability their multiscale affinity is close to zero. In other words, instead of getting a geometric relationship back, we get a clump. We can't distinguish eigenfunctions by looking at their multiscale affinity.
Incoherence and compressed sensing
The notion of incoherence comes from the noiselet basis, an orthonormal basis for $L^2$ functions on the line such that the inner product of a noiselet with a Haar function is independent of the choice of noiselet. In other words, if you try to measure a Haar function with noiselets, all the measurements look the same; it's totally uncorrelated. This property of incoherence means that noiselets form a good basis for compressed sensing; if you measure a function with many different noiselets, you can reconstruct the function accurately.
Compressed sensing
Incoherence on surfaces: conjecture
If taking a small number of measurements against an incoherent basis allows us to reconstruct the original function, perhaps taking measurements against a random spherical eigenfunction basis will allow us to reconstruct functions on the sphere. (Applications: 3D photographic views, protein reconstruction, etc.) Theoretical question: which bases of Laplacian eigenfunctions in general are coherent with respect to the geometry of the surface? Which are incoherent?
Conclusion
We can organize large, high-dimensional datasets by creating a multiscale folder structure on observations and features. Iteratively improving observation and feature hierarchical trees gives us an organization that simultaneously clusters similar observations and similar features. This procedure has broad applications for machine learning and data science, and makes use of richer information than conventional statistical techniques.
