K Means Spectral

Spectral Comparison Using k-Means Clustering
Vignesh R. Ramachandran
Johns Hopkins Applied Physics Laboratory
Laurel, MD, USA
Vinny.Ramachandran@jhuapl.edu
Herbert J. Mitchell
Naval Postgraduate School
Monterey, CA, USA
herbert.mitchell@jieddo.mil
Samantha K. Jacobs
Laurel, MD, USA
Samantha.Jacobs@jhuapl.edu
Nigel H. Tzeng
Laurel, MD, USA
Nigel.Tzeng@jhuapl.edu
Alexer H. Firpi
Laurel, MD, USA
Alexer.Firpi@jhuapl.edu
Benjamin M. Rodriguez
Laurel, MD, USA
Benjamin.Rodriguez@jhuapl.edu
Abstract: A growing volume of infrared (IR) spectral signature
data is being gathered in the scientific community from a
variety of sensors using a variety of collection techniques. As
the quantity of collected data grows, automated solutions for
searching and matching signatures need to be developed. When
searching and matching signatures, reducing computational
complexity and increasing matching accuracy are essential. We
present a signature classification method via k-means clustering
using a novel application of spectral angle mapping to efficiently
determine spectral similarity. We evaluate the method against
spectral data in the SigDB spectral analysis software application
developed by the Johns Hopkins University Applied Physics
Laboratory (JHU/APL). The key component of this approach
is the set of characteristic functions used to map signature
similarity into a spatial representation. Existing methods used
to autonomously identify and classify IR spectral data include
spectral angle mapping and key feature detection. Spectral
angle mapping is computationally slow due to the need for direct
individual comparison, and key feature detection improves
computation time but is limited by the specific features selected for
comparison. The accuracy and computation time of the spectral
cluster classification method are evaluated against spectral angle
mapping and visual analyses on the ASTER NASA spectral
library. The goal of this method is to improve both the accuracy
and speed of classifying newly collected unlabeled spectra. We
find that the proposed method of scoring signatures offers a
speed increase of three orders of magnitude in comparing spectra,
at the expense of a high false positive rate, making it suitable for
use as a first-pass filter. We further find that the k-means
cluster-based classification is highly sensitive to the selection of
initial cluster centroids, and offer alternative solutions to use with
our scoring method.

1. INTRODUCTION
Growing use of infrared spectral signature data in scientific
and forensic analysis requires collecting large quantities of
data from a variety of spectrometers using a variety of
techniques in diverse environmental conditions. Variations
in observed spectral features, regardless of the quality of
the data, make signature classification and comparison both
challenging for spectral analysts and often impossible for
automated systems. As the quantity of collected data continues to grow, automated solutions are increasingly critical.
For example, the national Integrated Signatures Program
(ISP) has collected approximately one million infrared (IR)
spectra, which largely rely on manually produced metadata
for identification and classification. The manual production
of metadata, at this scale, requires significant and rising cost
and time investment to reduce errors and inconsistencies.
The primary methods currently used to autonomously identify and classify IR spectral data are spectral angle mapping
and key feature detection [1]. Spectral angle mapping cannot
compare spectra with differing domains (e.g., spectral range,
spectral resolution or inconsistently removed bands) without
significant preprocessing. Thus, spectral angle mapping is a
computationally slow process, running in linear time against
an entire reference library to identify a single new or unknown signature. Key feature detection improves computation time by comparing only predetermined feature locations
in the reference spectra, but this method requires the user to
specifically identify the spectral features of interest.
A novel approach to signature classification via scoring and
clustering is presented. A set of characteristic mathematical
functions is used as artificial reference spectra to score
library signatures, and a k-means clustering algorithm determines
classification clusters in the score space. New
signatures are scored against the same characteristic functions
to determine their location in score space, and thus
determine their likely classification. Since new signatures
need only be compared against cluster centroids to determine
classification, the algorithm performs in O(k) time, where
$k \approx \sqrt{n/2}$ and n is the number of library signatures;
i.e., the computation time increases linearly with
respect to the predetermined number of clusters k. We apply
this approach to a reference sampling of signatures from
the ASTER spectral library [2] and evaluate accuracy and
computation time versus direct spectral angle comparison.
TABLE OF CONTENTS
1. INTRODUCTION
2. RELATED WORK
3. METHODOLOGY
4. FINDINGS AND ANALYSIS
5. CONCLUSION AND FUTURE WORK
REFERENCES
BIOGRAPHY
978-1-4799-1622-1/14/$31.00 ©2014 IEEE.
IEEEAC Paper #2635, Version 1, Updated 11/15/2013.
This methodology identifies close matches in a spectral library
to newly collected spectra three orders of magnitude
faster than direct spectral angle mapping. Used in tandem
with traditional signature analysis, this method provides
a first-pass coarse screening of spectral classification to
reduce the size of the identification pool. Reducing the
workload on a more intensive secondary analysis allows
much larger reference libraries to be used in the near-real-time
classification of field-collected spectra. Greater access
to spectral data, in addition to the ability to provide a
preliminary classification of newly collected spectra, provides
forensic analysts and first responders enhanced chemical
detection capabilities when they need them most.
values. The vector's dimensionality is equal to the number
of data points in the signature. Two spectral vectors are
compared by simply computing the angle between them via
their vector dot product. This method inherently assumes
that the two spectra being compared share precisely the same
domain: not only the same spectral band, but also the same
sampling resolution and specific domain values. Thus a
signature sampled at 5, 10, 15, ..., 100 microns cannot be
immediately compared with another signature sampled at 7,
12, 17, ..., 102 microns, though the domains almost entirely
overlap. Any mismatch in domain must be resolved by resampling
through interpolation and extrapolation. Resolving the mismatch is
computationally inefficient since, in the pathological case, every
possible pair of spectra may require resampling.
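The direct comparison described above can be sketched in a few lines. This is a minimal illustration, not the SigDB implementation: the function names `spectral_angle` and `sam_compare` and the sample values are assumptions made for the example, and a domain mismatch is simply reported as NaN instead of being resampled.

```python
import math

def spectral_angle(a, b):
    """Angle in degrees between two non-zero spectral vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("spectral vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))

def sam_compare(sig_a, sig_b):
    """Compare two signatures given as (wavelengths, values) pairs.

    Returns NaN when the domains differ, since the spectra would first
    have to be resampled onto a common domain.
    """
    xs_a, ys_a = sig_a
    xs_b, ys_b = sig_b
    if list(xs_a) != list(xs_b):
        return float("nan")
    return spectral_angle(ys_a, ys_b)

# Same domain: scaling every value (illumination) leaves the angle near zero.
domain = [5.0, 10.0, 15.0, 20.0]
dim = (domain, [0.1, 0.4, 0.3, 0.2])
bright = (domain, [0.2, 0.8, 0.6, 0.4])
same_material_angle = sam_compare(dim, bright)

# Shifted domain (7, 12, 17, 22): not directly comparable without resampling.
shifted = ([7.0, 12.0, 17.0, 22.0], [0.1, 0.4, 0.3, 0.2])
mismatch_angle = sam_compare(dim, shifted)
```

The first comparison shows the illumination invariance that makes SAM attractive; the second shows why mismatched domains force a resampling step before any angle can be computed.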
The remainder of this paper is structured as follows. In Section 2, related work in the area of spectral signature matching
is presented. Our proposed method, signature classification
via scoring and clustering, is described in Section 3. Findings
and analyses are provided in Section 4. Finally, we conclude
and describe opportunities for future work in Section 5.
3. METHODOLOGY
We formulate a methodology to rapidly classify a new,
unknown signature by identifying signatures in a spectral
library with similar spectral features. Instead of individually comparing the unknown signature against each member
of the library, the proposed method precomputes a score
representation of the library against a small number (N) of
artificial reference spectra. Using a derivative of the SAM
method, the scalar spectral angle values between each library
signature and the reference spectra are treated as coordinates
of a point in N-dimensional space, and cached within the
library; when a new signature is introduced, it is compared
against the reference spectra to produce a corresponding set
of coordinates. Sets of spatial coordinates with smaller
Euclidean distances to the new signature's coordinates then
correspond to library spectra with greater similarity to the new
signature. This method is intended to filter the library down to
a small subset of candidate
matches.
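The filtering step described above amounts to a nearest-neighbor search over the cached score coordinates. The sketch below assumes the scores are already computed; the function name `nearest_candidates`, the signature IDs, and the score values are hypothetical.

```python
import math

def nearest_candidates(unknown_score, library_scores, top_n=3):
    """Return the IDs of the top_n library signatures closest in score space.

    library_scores maps a signature ID to its cached N-dimensional score tuple.
    """
    ranked = sorted(
        library_scores,
        key=lambda sig_id: math.dist(unknown_score, library_scores[sig_id]),
    )
    return ranked[:top_n]

# Hypothetical cached scores (degrees) against N = 2 reference spectra.
library_scores = {
    "sig-a": (120.0, 30.0),
    "sig-b": (121.5, 28.0),
    "sig-c": (75.0, 9.0),
    "sig-d": (118.0, 33.0),
}
candidates = nearest_candidates((120.5, 29.0), library_scores, top_n=2)
```

Only the N cached scores per signature are touched, so the cost per library entry is independent of the number of data points in each spectrum.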
2. RELATED WORK
The need for spectral signature comparison and identification
has driven substantial work in the application of pattern
recognition, unsupervised and semi-supervised learning, and
data clustering [3] [4] [5]. Further, the need to quantify and
analyze enormous quantities of spectral data has spawned
many attempts at spectral collections or databases, with
mixed results [6]. The inability to reduce various methods
of collection and phenomenologies of spectra into a least
common denominator representation has made the problem
computationally challenging, especially without the use of
copious metadata to explain the exact conditions and context
in which the spectra were collected.
The classification and identification of unlabeled data has
been studied in great detail in the spatial domain, and a
wealth of effective solutions has been developed to address
the problem [3] [7]. Spatial clustering algorithms
in general attempt to determine natural boundaries between
non-uniformly distributed spatial data points. Of these,
k-means is a relatively simple approach: given some known
number of clusters k, cluster center points are randomly
distributed among the sample data, then iteratively updated
to reflect the average of their nearby constituents. A key
assumption is advance knowledge of k, as this algorithm has
no ability to merge or split existing clusters. However, it
is the simplest in a large field of clustering solutions that
includes hierarchical clustering, fuzzy k-means, DBSCAN,
expectation-maximization, and many others [3][8]. If the
spectral classification problem can be effectively adapted into
the spatial domain, any of these existing methods can be
applied.
Though the primary goal is to optimize spectral comparison,
this paper also investigates the opportunity to classify and
identify unlabeled signatures by characterizing the generated
N-dimensional map as a spatial clustering problem. k-Means,
an elementary but very popular [3] clustering algorithm,
generates cluster associations among spatial coordinates. A
new, unlabeled signature's spectral scores will place it within
a defined cluster; then, the labels of the library spectra sharing
the same cluster become preliminary guesses at the unknown
signature's classification. Since the labels of library signatures
are predetermined, and signatures with the same label
are expected to have very similar spectral characteristics, the
score clustering process also serves to validate the choice of
reference spectra.
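The cluster-based guess described above can be sketched as follows, assuming centroids and cluster assignments have already been produced by a clustering pass. The names `nearest_centroid` and `guess_labels`, and all data values, are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def nearest_centroid(score, centroids):
    """Index of the centroid closest to score by Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: math.dist(score, centroids[j]))

def guess_labels(unknown_score, centroids, assignments, labels):
    """Return the unknown's cluster and the labels found in it, most common first."""
    cluster = nearest_centroid(unknown_score, centroids)
    members = [sig for sig, c in assignments.items() if c == cluster]
    return cluster, Counter(labels[sig] for sig in members).most_common()

# Hypothetical clustering result for four library signatures.
centroids = [(120.0, 30.0), (75.0, 9.0)]
assignments = {"s1": 0, "s2": 0, "s3": 1, "s4": 0}
labels = {"s1": "lunar dust", "s2": "lunar dust", "s3": "sea water", "s4": "basalt"}

cluster, guesses = guess_labels((119.0, 31.0), centroids, assignments, labels)
```

A cluster containing several different labels, as in this toy example, is also a signal that the chosen reference spectra do not separate those materials well.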
Several direct and indirect methods exist to compare signatures
against each other, such as Spectral Angle Mapping
(SAM) [4], multiple endmember spectral mixture analysis
(MESMA) [9], peak detection, and similarity indices such as
the Pearson correlation [1]. In general practice, the sensitivity
and utility of each method is inversely correlated with its
runtime computational complexity [1]. The SAM algorithm
is of specific interest due to its ability to precisely describe
the difference between two signatures without regard for the
relative illumination within the spectra (which is irrelevant
to the spectral features of the observed material) [4]. SAM
measures similarity by taking a signature in two dimensions
(X and Y) and creating a spectral vector consisting of its Y

Characteristic Spectral Angle Mapping (cSAM)
Our proposed method adapts SAM spectral comparison as
a measure of indirect spectral similarity, rather than direct.
Traditionally, SAM operates on the principle that if signature
A has spectral angle $\theta_{AC}$ to signature C, then small
$\theta_{AC}$ implies spectral similarity between A and C. Our modified
application instead proposes that a characteristic function,
such as y = cos x, can serve as an artificial reference
signature B. If signatures A and C have spectral angles
$\theta_{AB}$ and $\theta_{BC}$ to B respectively, then
$\theta_{AB} \approx \theta_{BC}$ implies a degree
of similarity between A and C. This relationship is not as
precise as SAM: the set of spectral vectors satisfying a given
spectral angle to B traces a surface around the reference
signature's vector, as shown in Figure 1: here both A and
C present the same spectral angle to B, but $A \neq C$. In
this three-dimensional representation, the area of ambiguity
presents as the surface of a cone; the spectral vector of a
500-point signature has 500 dimensions, and thus the equal-angle
ambiguity surface cannot be represented graphically.

Figure 1. Conical Surface of 3D Vectors Satisfying Spectral
Angle $\theta$ to Reference Vector $\vec{B}$
Figure 2. Intersection of 3D Solutions with Spectral Angles
$\theta_1$, $\theta_2$ to Reference Vectors $\vec{B}_1$, $\vec{B}_2$ respectively

The solution set of spectral vectors represents many spectra
that are equally similar to the reference signature, at any
magnitude. The magnitude of the spectral vector represents
the illumination of the signature, which is an artifact of the
collection environment and irrelevant for the purposes of
determining the similarity of spectral characteristics [4]. This
leaves a cross-section of spectral vectors that all exhibit the
same degree of similarity to B; for vectors in three dimensions, this presents as a circle orthogonal to B. This level of
ambiguity is irreducible using only one reference signature.
Using multiple reference signatures further constrains the
solution to the intersection of the corresponding vector sets,
as shown in Figure 2. Thus if two signatures A and C exhibit
spectral angles $\theta_{AB_1} \approx \theta_{CB_1}$ to $B_1$ and
$\theta_{AB_2} \approx \theta_{CB_2}$ to $B_2$, it becomes
increasingly likely that $A \approx C$. Introducing
additional reference signatures $B_i$ can constrain the solutions
still further, at the expense of additional calculations.
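A small three-dimensional example makes the ambiguity concrete. The vectors below are hypothetical and chosen only so the angles are easy to verify by hand: A and C lie on the same cone around the first reference vector, and a second reference vector separates them.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    cos_theta = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

# Hypothetical reference vectors and signatures in three dimensions.
B1 = (0.0, 0.0, 1.0)
B2 = (1.0, 0.0, 0.0)
A = (1.0, 0.0, 1.0)
C = (0.0, 1.0, 1.0)

# Score of each signature against (B1, B2).
score_A = (angle_deg(A, B1), angle_deg(A, B2))
score_C = (angle_deg(C, B1), angle_deg(C, B2))
```

Both signatures sit 45 degrees from B1, so one reference cannot tell them apart; against B2 the angles are 45 and 90 degrees, which resolves the ambiguity in this case.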
bands within the library. For the purposes of this paper, a

small set of simplistic functions has been chosen to illustrate
the general approach. Selection and evaluation of more appropriate characteristic functions will be the subject of further
work, as are methods of dynamically generating appropriate
characteristic functions for a given spectral library.
The k-means algorithm requires, as initialization parameters,
the expected number of classification clusters (k); some
initial selection of centroid locations for the clusters; and
an error threshold, to limit the number of iterations. The
number of expected classifications was estimated using a
common rule-of-thumb value, $k \approx \sqrt{n/2}$ [10], and the
initialization centroids were randomly chosen from the sample
set. However, given that the library spectra are generally
well-labeled, the clustering problem can instead be tackled
with a semi-supervised solution: that is, the known material
and chemical composition of the samples can be used to
intelligently select a diverse set of signatures to serve as
the initial centroid locations. Guided initialization has the
potential to significantly affect the determination
of cluster associations, as starting-point selection is known
to strongly affect the result of the k-means algorithm [11].
Initialization is also a focus of ongoing work.
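The two initialization choices discussed above can be sketched side by side. The rule-of-thumb function follows the formula in the text; `label_guided_seeds` is only one hypothetical way to realize guided initialization (one seed per distinct label, which also fixes k to the number of labels) and is not a method described in this paper.

```python
import math
import random

def rule_of_thumb_k(n):
    """Rule-of-thumb cluster count k ~ sqrt(n / 2) for n data points."""
    return max(1, round(math.sqrt(n / 2)))

def random_seeds(scores, k, rng):
    """Unguided initialization: k distinct points sampled from the score set."""
    return rng.sample(list(scores.values()), k)

def label_guided_seeds(scores, labels):
    """Hypothetical guided initialization: one seed per distinct material label."""
    seeds = {}
    for sig_id, label in labels.items():
        seeds.setdefault(label, scores[sig_id])  # first signature seen per label
    return list(seeds.values())

# Hypothetical score data for three labeled library signatures.
scores = {"s1": (120.0, 30.0), "s2": (121.0, 29.0), "s3": (75.0, 9.0)}
labels = {"s1": "lunar dust", "s2": "lunar dust", "s3": "sea water"}

k = rule_of_thumb_k(1800)
unguided = random_seeds(scores, 2, random.Random(0))
guided = label_guided_seeds(scores, labels)
```

For a library of 1,800 signatures the rule of thumb gives k = 30.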
The application of cSAM allows every signature in a spectral
library to be reduced into a score vector:

\[ \vec{\theta} = \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_N \end{bmatrix} \tag{1} \]

where N is the number of characteristic functions $\vec{B}$. The
$\theta$ values can then be considered the coordinates of a point in
N-dimensional space, where $B_1 \ldots B_N$ serve as axes
(orthogonality is not required, but is desirable). In this new spatial
representation, score similarity can be characterized as the
Euclidean distance between two points; this enables the
use of existing spatial clustering algorithms, such as k-means,
to perform classification of spectra.
Detailed Approach
Each signature in the spectral database consists of columns
of spectral data accompanied by various optional metadata
properties, such as sensor identification and calibration, environmental conditions, sample identification and description,
axis units and labels, and any known observable associations.
The data is in the wavelength domain with value columns
representing either reflectivity or emissivity, as indicated by
axis properties (see Figure 3, an example signature from
the ASTER library [12]). Note that NaN float values are
used to represent invalid or removed data points within the
spectra, such as deliberately suppressed water bands. A
hash of the spectral data uniquely identifies the signatures;
therefore, two signatures having the same identifier are assumed to be identical, and cannot both exist in the database.
The phenomenology of the signature (LWIR, MWIR, SWIR,
VIS/NIR) is also indicated by metadata properties.
Preconditions
The choice of mathematical functions used to produce reference spectra, and thus the spectral angle scores used for
comparison, is a critical factor in the utility of this approach.
Poorly chosen functions result in spectral angles that are
highly similar for many or all library spectra. Functions that
perform well in one spectral band may perform poorly in
others, thus requiring different sets of functions for different
Figure 3. Sulphur Signature from the ASTER Spectral Library

each signature against the characteristic functions at the exact
domain locations in which the former is defined. Because
the characteristic functions are real and continuous, real Y
values are always returned at any signature-specified domain
location. NaN values are ignored in the comparison (lines 9-10),
and thus removed or suppressed bands have no impact
on the score. For each real-valued datapoint, the product
is accumulated through the algebraic definition of the dot
product (line 12 of the algorithm):

\[ \vec{A} \cdot \vec{B} = \sum_{i=1}^{M} a_i b_i = a_1 b_1 + \ldots + a_M b_M \tag{2} \]

M is the dimensionality of the spectral vector and each $a_i$ is
the Y value of one data point in the signature. Here, $\vec{B}$ is
an artificial vector that is automatically generated based on a
function $f_j$. Correspondingly, $b_i = f_j(x_i)$, where $x_i$ is the X
value corresponding to $a_i$'s Y value. The geometric definition
of the dot product then determines the angle between the
spectral vectors (line 17):

\[ \vec{A} \cdot \vec{B} = |A||B| \cos\theta_{AB} = \sum_{i=1}^{M} a_i b_i \tag{3} \]

\[ \Rightarrow \theta_{AB} = \cos^{-1}\left( \frac{\sum_{i=1}^{M} a_i b_i}{|A||B|} \right) \tag{4} \]

Thus the magnitudes of the vectors are also accumulated in
the loop (lines 13-14). Each resulting $\theta_{AB}$ is that signature's
score against a characteristic function $f_j$; together they result
in a score vector:

\[ \vec{S}_i = \begin{bmatrix} S_{i f_1} \\ S_{i f_2} \\ \vdots \\ S_{i f_N} \end{bmatrix} \tag{5} \]

A set of characteristic mathematical functions is selected
based on a set of desirable properties:

• Each function is two-dimensional, because the comparison
spectra are two-dimensional.
• Each function is real and differentiable within the domain
of interest, so that a spectral angle mapping against any real
spectra will yield a real-valued solution.
• Each function generates a broad range of score values
against available library spectra within the domain of interest.
• Each function is linearly independent of, and preferably
orthogonal to, the other selected characteristic functions
within the domain of interest, so that each presents a unique
spectral vector.

With each of the desirable properties, we attempt to guide

and constrain the selection of characteristic functions to those
that generate a wide distribution of real-valued spectral angle
scores within the database of spectra. Thus, the selection
of appropriate specific characteristic functions is contingent
on the nature of the library against which it is applied. The
number of functions to be used is likewise flexible; more axes
for comparison lead to less ambiguity in the set of solutions,
at the expense of computation time and volume of score data.
The choice of characteristic functions can be validated by a
set of empirical desirability tests against the database:
1. When the function is calculated against the library spectra,
are any of the produced scores NaN (undefined, infinite,
or otherwise non-real)? If so, this indicates the function is
not real and/or differentiable throughout the entire domain of
interest.
2. Are the produced spectral scores broadly distributed? If
not, the function is not a good discriminator for the signatures
of interest.
3. Does the function produce scores that are highly correlated
with scores produced by another function? If so, the functions
may not be linearly independent, or they may measure
highly correlated spectral features. One of the functions may
be used, but not both.

The scoring procedure is performed against every signature in the
library to create a baseline set of score values. The two-dimensional
array of scores is indexed by each signature's
unique identifier. If additional signatures are added to the
library, sets of scores are calculated for the new additions and
stored.

When an unknown signature S is introduced, the same
method is used to calculate a set of scores. A Euclidean
distance value is then computed against each existing signature's
set of scores. The scores of each database signature
serve as its coordinates in the N-dimensional function space.
The unknown signature's location, as determined by its own
scores, should then be located closest to other signatures that
share similar spectral characteristics. These nearest neighbors
are then selected for further automated or visual analysis, at
the user's discretion.

The k-means clustering algorithm (Algorithm 2) is applied
any time the spectral library is updated. The set of k cluster
centroids C is initialized as a random sampling of locations
within the dataset (line 5). Each signature $S_i$ is associated
with the centroid closest to it by Euclidean distance (lines 7-9),
then new centroid locations are calculated for each cluster
representing the average of the cluster constituents' locations
(lines 10-12). Then the sum change in centroid positions
between the current iteration and the previous is calculated
The cSAM algorithm is used to determine score values for the
database, as shown in Algorithm 1. The procedure compares
Table 1. Selected Characteristic Functions (x in microns)

Name             Equation
10-nm Cosine     y = cos(100x)
1-μm Cosine      y = cos(x)
100-μm Cosine    y = cos(x/100)

Algorithm 1 cSAM Signature Scoring
 1: procedure GENERATESCORES(S, f)
 2:     scores := 2D array of [signature IDs][score values]
 3:     for each S_i in S do
 4:         for each f_j(x) in f do
 5:             product ← 0
 6:             sMag ← 0
 7:             fMag ← 0
 8:             for each datapoint (X, Y) in S_i do
 9:                 if X or Y is NaN then
10:                     skip datapoint
11:                 else
12:                     product = product + (Y · f_j(X))
13:                     sMag = sMag + Y²
14:                     fMag = fMag + f_j(X)²
15:                 end if
16:             end for
17:             scores[S_i][f_j] = cos⁻¹(product / (√sMag · √fMag))
18:         end for
19:     end for
20:     return scores
21: end procedure
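Algorithm 1 translates almost line for line into Python. The sketch below is an illustration, not the SigDB plug-in (which is written in Java): the names `csam_score` and `generate_scores`, the dictionary-based data layout, and the NaN returned for an all-zero vector are assumptions. Scores are returned in degrees.

```python
import math

# Characteristic functions from Table 1 (x in microns).
CHARACTERISTIC_FUNCTIONS = {
    "10-nm cosine": lambda x: math.cos(100.0 * x),
    "1-um cosine": lambda x: math.cos(x),
    "100-um cosine": lambda x: math.cos(x / 100.0),
}

def csam_score(signature, f):
    """Spectral angle (degrees) between a signature and characteristic function f.

    signature is an iterable of (x, y) data points; NaN points are skipped,
    so removed or suppressed bands do not affect the score.
    """
    product = s_mag = f_mag = 0.0
    for x, y in signature:
        if math.isnan(x) or math.isnan(y):
            continue
        fx = f(x)
        product += y * fx   # algebraic dot product, equation (2)
        s_mag += y * y      # squared magnitude of the signature vector
        f_mag += fx * fx    # squared magnitude of the reference vector
    if s_mag == 0.0 or f_mag == 0.0:
        return float("nan")  # assumption: angle is undefined for a zero vector
    cos_theta = product / math.sqrt(s_mag * f_mag)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def generate_scores(signatures, functions=CHARACTERISTIC_FUNCTIONS):
    """Map each signature ID to its tuple of scores, one per function."""
    return {
        sig_id: tuple(csam_score(points, f) for f in functions.values())
        for sig_id, points in signatures.items()
    }

# Hypothetical signature: a scaled copy of cos(x), with and without a NaN band.
demo = [(0.4 + 0.2 * i, 3.0 * math.cos(0.4 + 0.2 * i)) for i in range(50)]
with_gap = demo + [(11.0, float("nan"))]
scores = generate_scores({"demo": demo, "with-gap": with_gap})
```

Because the demo signature is a scaled copy of cos(x), its score against the 1-μm cosine is effectively zero, and the NaN datapoint leaves all three scores unchanged.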
point values. Signature, sample, environment, sensor, and
observable metadata are all stored in various database tables
and referenced by the signature's unique hash identifier.
The data used for comparison was selected from the Advanced
Spaceborne Thermal Emission Reflection Radiometer
(ASTER) Spectral Library 2.0, a collection of spectra of natural
and man-made materials produced by a collaboration of
the Jet Propulsion Laboratory, the Johns Hopkins University,
and the United States Geological Survey [2]. The data spans
the 0.4 to 15.4 μm wavelength range, which includes the visual
and near-infrared (VIS/NIR), shortwave (SWIR), and thermal
infrared (TIR) electromagnetic bands. All of the selected
signatures describe directional hemispherical reflectance as
collected by the NASA Terra spaceborne hyperspectral imaging
platform, are represented in percent reflectivity, and consist
of approximately 400-600 data points each. The data are not
uniformly sampled; for example, some begin at 0.43 μm and
others at 0.3 μm. However, all of the data do exhibit the
same sampling resolution: 2 nm up to 0.8 μm, 20 nm between
0.8 μm and 5 μm, and 100 nm between 5 μm and 14 μm.
(lines 13-16). The associate / update loop continues until
the change between iterations, Δ, falls below the threshold
parameter ε (line 6), indicating that
the cluster associations have stabilized. The algorithm returns
the final cluster centroid locations, along with the mapping of
signatures to assigned cluster (line 19).
Algorithm 2 k-Means Clustering
 1: procedure KMEANS(dbScores[S(f)], Number of classes k, Error Threshold ε)
 2:     C, C′ := size k arrays of centroid locations
 3:     A := mapping of signatures to assigned cluster #
 4:     Δ ← ∞                              ▷ Change between iterations
 5:     C ← k points randomly selected from dbScores
 6:     while Δ > ε do
 7:         for each S_i in S do
 8:             A ← (S_i → nearest cluster in C)
 9:         end for
10:         for each C_j in C do
11:             C′_j = average of all points mapped to C_j in A
12:         end for
13:         Δ = 0
14:         for each C′_j in C′ do
15:             Δ = Δ + |C_j − C′_j|
16:         end for
17:         C ← C′
18:     end while
19:     return C, A
20: end procedure
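Algorithm 2 can be sketched in Python as follows. This is a plain Lloyd-style loop under a few assumptions that are not in the listing: the optional `init` parameter (added so the example is reproducible), the `max_iter` safety cap, and the choice to leave an empty cluster's centroid in place are all illustrative.

```python
import math
import random

def kmeans(points, k, epsilon=1e-5, init=None, rng=None, max_iter=1000):
    """Cluster score tuples; returns (centroids, assignments).

    init optionally supplies the k starting centroids; otherwise they are
    sampled at random from the points, as in line 5 of Algorithm 2.
    """
    rng = rng or random.Random()
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step (lines 7-9): nearest centroid by Euclidean distance.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Update step (lines 10-12): each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # assumption: keep empty cluster
        # Total centroid movement between iterations (lines 13-16).
        delta = sum(math.dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if delta <= epsilon:
            break
    return centroids, assignments

# Hypothetical two-dimensional scores forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, k=2, init=[points[0], points[3]])
```

Passing explicit starting centroids is a convenient way to experiment with how strongly the final clusters depend on initialization.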
Table 1 lists the preliminary characteristic functions selected
as reference spectra for this paper. The selections, all cosine
functions of varying frequency, are intended to mirror the
desirable properties identified above while remaining
computationally trivial to execute. Under the assumption that
all signature wavelength values are represented in microns,
the selected functions capture spectral features at the
10-nanometer, 1-micrometer and 100-micrometer resolutions.
The computation of discrete values for these functions with
respect to the described Signature Scoring algorithm was
hard-coded into the software implementation.
1,800 samples of various minerals, soils, vegetation, and
manmade materials were selected from the ASTER library
for comparison. Spectral angle scores were generated against
the three characteristic functions above and stored in the
database. The k-means algorithm was performed against
the score dataset using $k = \sqrt{N/2} = 30$ randomly selected
signatures as initial centroid locations, where N = 1,800 here
denotes the number of library signatures. The selected error
threshold was $\epsilon = 1 \times 10^{-5}$. The centroid locations converged
on a stable solution at this error threshold after 29 iterations.
One signature, 14259.61 (a sample of lunar dust collected
from the Apollo 11 mare site), was chosen to represent the
unknown signature (Figure 4). Its score data was manually
removed from the database, then the cSAM plugin was run
to recalculate the score and determine its cluster association.
The runtimes of both the database scoring / clustering process
and the signature classification process were recorded. Also,
to evaluate the accuracy and run-time of the scoring process
without clustering, the unknown signature's score values
were recomputed and spatially compared against all 1,799
database spectra via Euclidean distance computation.
Implementation
In support of ISP spectral analysis, JHU/APL has developed
a standardized database schema representation of spectral
signatures, and an associated Java-language software application SigDB to aid in exchange, preliminary analysis, comparison and classification of collected spectra. The scoring
and clustering methodology described herein was developed
as a plug-in capability for the SigDB application, which
enabled immediate access to a large quantity of spectral data
and a framework for analysis. SigDB stores signature data
in an SQL relational database as IEEE754 64-bit floating
Figure 4. Selected Unknown Signature for Search Comparison
Figure 5. Scores of Selected ASTER Data against 1-μm and
100-μm Cosine

In addition, a traditional SAM algorithm was also implemented
as a SigDB plug-in capability and run against the
same dataset. The same signature was selected as the unknown and compared against the 1,799 other signatures. This
SAM implementation does not perform any interpolation or
extrapolation to align signature domains; as a result, database
signatures with mismatched minimum/maximum domain values were
automatically pre-filtered. Further, if the code detects any
mismatch in domain values within the two compared spectra
during comparison (such as a missing/suppressed datapoint
in one but not the other), it immediately terminates that
comparison, reports a NaN spectral score, and moves on to
the next comparison; however, these still impact the run-time
of the algorithm. These are known and accepted limitations
of the traditional SAM algorithm, and are usually worked
around via data interpolation and extrapolation. The reported
matches and run-times of each process were recorded and
compared.
Figure 6. Signatures Scored by SAM (Lunar Dust and Sea Water)
Results
Table 2 shows the number of computations and runtimes
for each of the algorithm processes performed. The first
row is a traditional spectral angle mapping comparison of
the unknown signature against the database spectra. Although
the database contains 1,800 signatures, only 477
match the same minimum/maximum domain, so only these
were selected for comparison; of these, only 17 signatures
matched every datapoint's domain precisely. Thus, the other
460 calculations were terminated before completion. This
process took approximately forty seconds; the generated
scores, along with the name of each signature, are shown
in Table 3. The second row includes computation of
scores against each of the three characteristic functions,
k-means cluster classification (which included 29 iterations of
the clustering algorithm), and storage in the database. All
database spectra were scored, but no actual comparisons were
performed in this step. This process took approximately four
minutes. The third row includes calculation of the unknown
signature's scores against the characteristic functions, then
Euclidean distance comparison of those scores against each
of the 30 cluster centroids to determine classification. This
process took ten milliseconds. The final row, which is not
performed in the normal course of k-means analysis, was a
recalculation of the unknown signature's scores (in order to
fairly compare run-time) and Euclidean distance comparison
against all 1,799 other score sets in the database. This process
took fifteen milliseconds, and the cSAM Euclidean distances
to the signatures scored by traditional SAM are also shown in
Table 3.
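As a check on these figures, the per-comparison averages and the overall speed-up follow directly from the run-times reported in Table 2. All inputs below are taken from Table 2; the rounding is ours.

```latex
\begin{align*}
\text{SAM:} \quad & \frac{39{,}794\ \text{ms}}{477} \approx 83.4\ \text{ms per comparison} \\
\text{cSAM cluster comparison:} \quad & \frac{10\ \text{ms}}{30} \approx 0.33\ \text{ms per comparison} \\
\text{cSAM Euclidean comparison:} \quad & \frac{15\ \text{ms}}{1800} \approx 0.008\ \text{ms per comparison} \\
\text{Speed-up (cluster):} \quad & \frac{39{,}794}{10} \approx 3.98 \times 10^{3} \\
\text{Speed-up (Euclidean):} \quad & \frac{39{,}794}{15} \approx 2.65 \times 10^{3}
\end{align*}
```

Both ratios are on the order of $10^3$, consistent with the three-orders-of-magnitude improvement stated in the abstract.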
Figure 5 illustrates the overall spread of spectral angle scores
of the 1,800 ASTER signatures against two of the three
selected characteristic functions. The third function, 10-nm
cosine, results in minimal differentiation between signatures;
all fall in the range [88.2, 92.1] degrees, so this axis is omitted
in the figure. The narrow differentiation of scores by the
10-nm cosine and the relatively broad distribution of scores
against the other two functions illustrate both implications of
the second empirical test of desirability: the former indicates
that the 10-nm cosine is undesirable for the dataset at hand,
while the latter indicates that the 1-μm and 100-μm cosines
do perform well as discriminators. Table 4 describes the
location of the cluster centroids in the score space produced
by the three functions.
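The three empirical desirability tests can be automated along the following lines. This is a sketch under stated assumptions: the function names and the sample score lists are hypothetical, and any pass/fail thresholds would have to be chosen for the library at hand.

```python
import math
import statistics

def has_invalid_scores(scores):
    """Test 1: is any score NaN or infinite?"""
    return any(math.isnan(s) or math.isinf(s) for s in scores)

def score_spread(scores):
    """Test 2: range (max minus min) of the scores, in degrees."""
    return max(scores) - min(scores)

def pearson(xs, ys):
    """Test 3: Pearson correlation between the scores of two functions."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores (degrees) of four signatures against two functions.
narrow = [88.2, 90.0, 92.1, 90.5]     # spans only the [88.2, 92.1] range
broad = [57.1, 140.8, 93.5, 120.0]    # widely distributed

narrow_spread = score_spread(narrow)
broad_spread = score_spread(broad)
correlation = pearson(narrow, broad)
```

A function whose scores span only a few degrees, like the 10-nm cosine here, contributes little to separating signatures, while a correlation near plus or minus one between two functions suggests that one of them is redundant.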
Table 2. Run-time Comparison

Process                           # Score Calculations   # Comparisons   Run-Time (ms)   Avg ms per Comparison
SAM Algorithm                     477 (17)               477 (17)        39,794          83
cSAM Score Computation            1800                   0               223,340         N/A
cSAM Cluster Comparison           1                      30              10              0.33
cSAM Euclidean Score Comparison   1                      1800            15              0.01

Table 3. SAM Direct Comparison Results ($\theta$ Scores in
Degrees) and cSAM Euclidean Distances

Signature Name   Spectral Angle   cSAM Distance
14148.183        1.333            1.063
12024.69         1.656            0.695
12023.139        2.651            1.901
14149.18         2.670            2.032
12070.405        2.760            2.094
12030.135        3.041            2.046
61241.98         3.369            3.017
64801.34         4.159            3.642
68501.609        4.561            4.303
10084.1939       4.643            4.810
62231.15         4.776            4.329
60051.19         5.550            5.080
14141.146        7.066            5.637
67941.72         8.133            7.648
67701.36         8.356            7.972
61221.79         8.855            7.852
Sea water        24.929           8.786

Table 4. Cluster Centroid Locations (Scores in Degrees)
4. FINDINGS AND ANALYSIS

We evaluated the method by comparing the results of the
traditional SAM approach to the score values produced by
cSAM, as well as to the cluster classifications produced by
k-means. The SAM algorithm was only able to compare the
unknown signature with those that were in precisely the same
domain, which coincided with data produced by the same
sensor, and thus largely correlated with the most probable
association: as shown in the second column of Table 3 and in
Figure 6, the lunar dust signatures all scored below 12.9 degrees,
and the one non-lunar signature compared, sea water, scored
23.003 degrees. For the same signatures, cSAM produced
Euclidean distances as shown in the third column of Table 3. The
full spread of Euclidean distances to the unknown signature
is shown in Figure 7, which illustrates the spatial distribution
of spectral scores with respect to the unknown. All of the
lunar dust signatures, which we consider true matches, ranked
within the closest 11% of spectra in the database.
The k-means clustering placed the unknown signature in
Cluster 3. The other lunar dust signatures were placed
in Clusters 3 (7 signatures), 11 (8 signatures) and 27 (1
signature). The sea water signature was also placed in Cluster
27.
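The cluster assignment step amounts to a nearest-centroid lookup in the score space. The sketch below is illustrative only: the centroid coordinates, cluster labels, and the helper nearest_clusters are hypothetical and are not taken from Table 4. Returning more than one cluster is one possible way to account for true matches that fall in neighboring clusters, as the lunar dust signatures did here; that variation was not evaluated in this work.

```python
import math

def nearest_clusters(score, centroids, count=1):
    """Return the labels of the `count` centroids closest to `score`."""
    ranked = sorted(centroids,
                    key=lambda label: math.dist(score, centroids[label]))
    return ranked[:count]

# Hypothetical centroids in a three-function score space (degrees).
centroids = {
    "A": (90.0, 120.0, 30.0),
    "B": (90.0, 105.0, 38.0),
    "C": (90.0, 125.0, 12.0),
}
unknown = (90.0, 118.0, 32.0)

assigned = nearest_clusters(unknown, centroids)            # single best cluster
candidates = nearest_clusters(unknown, centroids, count=2)  # widen the search
```

With 30 centroids, as in the clustering described here, this lookup requires 30 distance computations regardless of library size.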
| Cluster | 10-nm | 1-µm | 100-µm | Signatures |
|---|---|---|---|---|
| 0 | 90.443 | 90.380 | 16.014 | 61 |
| 1 | 89.401 | 123.94 | 42.293 | 62 |
| 2 | 90.636 | 93.475 | 38.517 | 57 |
| 3 | 89.874 | 124.72 | 33.820 | 70 |
| 4 | 90.451 | 80.886 | 7.3951 | 59 |
| 5 | 89.293 | 119.21 | 48.691 | 44 |
| 6 | 90.587 | 116.74 | 22.568 | 40 |
| 7 | 90.633 | 114.72 | 29.288 | 68 |
| 8 | 90.104 | 126.14 | 20.840 | 89 |
| 9 | 90.366 | 118.52 | 37.312 | 43 |
| 10 | 90.575 | 106.13 | 27.278 | 53 |
| 11 | 90.630 | 107.24 | 37.788 | 72 |
| 12 | 90.175 | 131.17 | 24.880 | 122 |
| 13 | 90.522 | 86.036 | 9.4432 | 49 |
| 14 | 90.289 | 125.50 | 27.491 | 61 |
| 15 | 90.282 | 119.15 | 9.6618 | 89 |
| 16 | 90.859 | 75.508 | 58.144 | 13 |
| 17 | 90.380 | 117.53 | 14.352 | 72 |
| 18 | 89.745 | 140.77 | 30.669 | 48 |
| 19 | 90.624 | 97.332 | 24.690 | 64 |
| 20 | 90.476 | 74.903 | 7.8526 | 77 |
| 21 | 90.355 | 68.062 | 15.130 | 48 |
| 22 | 89.594 | 122.00 | 1.4625 | 39 |
| 23 | 89.747 | 134.34 | 29.652 | 78 |
| 24 | 90.483 | 57.068 | 30.304 | 47 |
| 25 | 89.429 | 130.32 | 36.741 | 56 |
| 26 | 90.591 | 78.987 | 2.5388 | 120 |
| 27 | 90.121 | 124.53 | 13.485 | 38 |
| 28 | 90.413 | 76.509 | 24.125 | 18 |
| 29 | 90.546 | 105.788 | 19.860 | 43 |
Figure 7. Sorted Euclidean Distances of 1,799 Database Spectral Scores to the Unknown Signature (sorted ascending from left)

Based on these results and run-time comparisons, we find:

1. Scoring against characteristic functions via the cSAM
algorithm generally approximates the spectral similarity between signatures, as appropriate for a first-pass filter.
2. Though the cSAM method requires a greater up-front time
investment to perform characteristic computations, the time
to compare a newly captured signature against a large library
is reduced from a linear-scale operation to near-constant time.
3. The unsupervised k-means clustering algorithm does not
effectively partition the cSAM score space into usable classifications. However, the use of semi-supervised approaches
(using existing classification information stored in the spectral library), better heuristic selection of the number of likely
classes k, and informed selection of the cluster-initialization
centroids are all likely to dramatically improve classification
accuracy.

The importance of careful selection of characteristic functions was clearly illustrated by the 10-nm cosine function's
inability to discriminate among the library spectra. Intuitively, a 10-nm-scale curvature is negligible when compared
against spectra on the micron scale; therefore, the range of
spectral resolutions of the library spectra is a significant
factor in the efficacy of the functions.

5. CONCLUSION AND FUTURE WORK

Characteristic spectral angle mapping is a potentially powerful approach to reducing the run-time cost of autonomous
spectral classification and identification against large signature data sets. By converting the spectral classification
problem into a spatial problem, cSAM enables the application
of many existing well-developed classification approaches.
Our preliminary results indicate a good correlation between
the chosen characteristic functions' spatial scores and brute-force
SAM comparison values, while offering a significant
decrease in the time to identify potential matches. This could
vastly improve the performance and accuracy of existing
spectral systems in use by scientific, defense and emergency
response stakeholders.
Two primary areas have been identified for further investigation. First, characteristic functions appropriate for use with
libraries consisting of different spectral bands and spectral
resolutions should be considered and evaluated, based on
the desirable properties and empirical tests described above.
Automated polynomial-based approaches may also allow the
characteristic functions to be generated dynamically based
on the actual content of the spectral library. Second, other methods
besides k-means should also be considered and evaluated.
These include fuzzy k-means, which would reduce the partitioning of dense areas of the score space; semi-supervised
approaches, which can take advantage of the copious label
data within the library; and dynamic determination of the
number of classes/clusters, based on the known material
content of the library.

ACKNOWLEDGMENT
The authors would like to thank the Integrated Signatures
Program for their support, Thomas Spisz (JHU/APL) for
information on the Spectral Angle Mapper algorithm, and
Edward Birrane (JHU/APL) and Jason Oxenrider (JHU/APL)
for editing and review.

REFERENCES
[1] J. Li, D. B. Hibbert, S. Fuller, and G. Vaughn, "A comparative study of point-to-point algorithms for matching spectra," Chemometrics and Intelligent Laboratory Systems, vol. 82, no. 1-2, pp. 50-58, May 2006.
[2] A. M. Baldridge, S. J. Hook, C. I. Grove, and G. Rivera, "The ASTER spectral library version 2.0," Remote Sensing of Environment, vol. 113, pp. 711-715, 2009.
[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. John Wiley and Sons, 2001.
[4] Y. Sohn and N. S. Rebello, "Supervised and unsupervised spectral angle classifiers," Photogrammetric Engineering and Remote Sensing, vol. 68, no. 12, pp. 1271-1280, December 2002.
[5] F. A. Kruse, J. W. Boardman, and J. F. Huntington, "Comparison of airborne hyperspectral data and EO-1 Hyperion for mineral mapping," IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 6, pp. 1388-1400, June 2003.
[6] C. Salvaggio, L. E. Smith, and E. J. Antoine, "Spectral signature databases and their application/misapplication to modeling and exploitation of multispectral/hyperspectral data," in Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XI, S. S. Shen and P. E. Lewis, Eds., vol. 5806. SPIE, 2005.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., W. Rheinboldt, Ed. New York: Academic Press, October 1990.
[8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, E. Simoudis, J. Han, and U. Fayyad, Eds., American Association for Artificial Intelligence. Menlo Park, California: The AAAI Press, 1996, pp. 226-231.
[9] P. E. Dennison, K. Q. Halligan, and D. A. Roberts, "A comparison of error metrics and constraints for multiple endmember spectral mixture analysis and spectral angle mapper," Remote Sensing of Environment, vol. 93, no. 3, pp. 359-367, November 2004.
[10] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis. London: Academic Press, 1979, pp. 360-384.
[11] F. Robinson, A. Apon, D. Brewer, L. Dowdy, D. Hoffman, and B. Lu, "Initial starting point analysis for k-means clustering: a case study," in Proceedings of ALAR 2006 Conference on Applied Research in Information Technology, 2006.
[12] (2008, December) ASTER spectral library. [Online]. Available: http://speclib.jpl.nasa.gov/

BIOGRAPHY
Vignesh Ramachandran received a
B.S. in Computer Science from the Georgia Institute of Technology in 2007
and an M.S. in Aerospace Engineering
from the University of Maryland, College Park in 2013. He has worked
at the Johns Hopkins University Applied Physics Laboratory since 2008 as
a Ground Software Engineer, designing command, telemetry, data processing and network engineering solutions for NASA missions
(such as the Van Allen Probes and MESSENGER spacecraft)
as well as a variety of other civil and defense applications.
Mr. Ramachandran currently serves as the Vice-Chair of the
American Institute of Aeronautics and Astronautics (AIAA)
Mid-Atlantic Section, and has twice served as the General
Conference Chair of the AIAA Young Professionals, Students
and Education Conference (YPSE).
Herbert Mitchell received a B.S. in
Chemistry from Washington and Lee
University and an M.S. in Analytical
Chemistry from the University of Virginia.
He entered the U.S. Navy after graduation and served as a scientist dealing
with the effects of nuclear weapons on humans and on the chemistry
of the atmosphere. In his government
roles and afterwards as a contractor
supporting the defense department and other agencies, he
authored several reports, worked on several special projects,
and served on several committees investigating scientific
phenomena. He has a record of leading them to successful
conclusions. Often these projects were of interest to high
levels of government. His interests have generally been in
developing novel ways to use a wide range of sensors to better
acquire data needed by the Defense Department.
For the last decade he has been working for the Physics
Department of the Naval Postgraduate School and has been
working at several agencies in the Washington, DC area, most
recently at the Joint IED Defeat agency (JIEDDO).
Samantha Jacobs received a B.S. in
Physics from Georgia Southern University in 2012. In 2013 she joined
the Johns Hopkins University Applied
Physics Laboratory as an associate
Ground Software Engineer in the Space
Department. Her work in the Space
Department includes automated testing,
network engineering solutions, and data
processing.
Nigel Tzeng received a B.S. in Computer Science and an M.S. in Software Engineering from the University of Maryland, College Park. Mr. Tzeng has
over 20 years of experience in spacecraft
ground systems, command and control (C2) systems, data visualization
and software engineering. He joined
the Johns Hopkins University Applied
Physics Laboratory (JHUAPL) in 2003
and is currently a senior member of the Space Department
technical staff. Mr. Tzeng leads the development of signature
and geospatial analysis/exploitation software systems and
served as the Group Chief Scientist for the C2 Systems Engineering Group from 2007 to 2009, and has been the Principal
Investigator of several C2 research initiatives. His primary
areas of research are command and control, geospatial visualization and collaboration. Prior to joining JHUAPL, Mr.
Tzeng worked in telecommunications, e-commerce, advanced
traffic management systems, spacecraft simulation (Landsat,
SOHO), spacecraft command and control (SAMPEX, TRMM,
FUSE), and science data processing/visualization (COBE).
He was the lead software architect and designer of the City of
Louisville Advanced Traffic Management System and developer of the DIRBE, FIRAS and DMR sky map visualization
software on COBE.
Alexer Firpi received a B.S. in electrical
engineering from Polytechnic University
(San Juan, Puerto Rico), an M.S. in
electrical engineering from the University of Puerto Rico (Mayaguez, Puerto
Rico), and a Ph.D. in electrical engineering from Michigan State University
(East Lansing, MI). After concluding his
doctoral studies, Dr. Firpi did postdoctoral work at different institutions in
diverse research areas such as intelligent control, biomedical
engineering, imaging genetics, and bioinformatics. He is
currently a senior staff member at the Johns Hopkins University
Applied Physics Laboratory. Dr. Firpi's research focuses on
machine learning, brain-computer interfaces, computational
intelligence, and any other research problem that can be
automated using machine-learning approaches. He is the
author of more than 20 peer-reviewed publications and two
book chapters.
Benjamin Rodriguez received a Bachelor of Science (B.S.) and a Master of
Science (M.S.) in Electrical Engineering from the University of Texas, and
received a Doctor of Philosophy (Ph.D.)
in Electrical and Computer Engineering
from the Air Force Institute of Technology, Graduate School of Engineering
and Management, Electrical and Computer Engineering Department, Wright-Patterson Air Force Base, OH. He is the Section Supervisor
for Space Systems and Architectures in the Space Department with The Johns Hopkins University Applied Physics
Laboratory. He is also an instructor at The Johns Hopkins
University, Whiting School of Engineering for the Department of Electrical and Computer Engineering as well as the
Department of Computer Science.