brid fixed- and floating-point arithmetic architecture. Contrary to our work, these approaches are based on the conventional K-means algorithm and aim to gain speed-up by an increased amount of parallel hardware resources for distance computations and nearest centre search.

An example of a tree-based algorithm for data-mining applications other than clustering, implemented on a parallel architecture, is presented in [10] for regression tree processing on a multi-core processor. Chen et al. [11] present a VLSI implementation for K-means clustering which is notable in that, in the spirit of our approach, it recursively splits the data point set into two subspaces using conventional 2-means clustering. Logically, this creates a binary tree which is traversed in a breadth-first fashion and results in computational complexity proportional to log2 K. This approach, however, does not allow any pruning of candidate centres.

Saegusa et al. [3] present a simplified kd-tree-based FPGA implementation for K-means image clustering. The data structure stores the best candidate centre(s) at its leaf nodes and is looked up for each data point. The tree is built independently of the data points, i.e. the pixel space is subdivided into regular partitions, which leads to empty pixels being recursively processed. The tree also needs to be rebuilt at the beginning of each iteration, and centre sets are not pruned during tree traversal; a one-off build phase and candidate pruning, in contrast, are essential features of the original filtering algorithm.

3. ANALYSIS OF THE FILTERING ALGORITHM

The filtering algorithm can be divided into two phases: building the tree from the point set (pre-processing), and the iterated tree traversal and centre update. In order to obtain information about the computational complexity of both parts, we profiled the software implementation of the algorithm using synthetic input data. Here, we chose the number of Euclidean distance computations performed as our metric for computational complexity. Since the tree creation phase does not compute any distances but performs mainly dot product computations and comparisons, we introduce distance computation equivalents (DCEs) to obtain a unified metric for both parts which combines several operations that are computationally equivalent.

For the test case, we generated point sets of N = 10000 three-dimensional real-valued samples. The data points are distributed among 100 centres following a normal distribution with varying standard deviation σ, whereas the centre coordinates are uniformly distributed over the interval [−1, 1]. Finally, the data points are converted to 16-bit fixed-point numbers. We start off with K = 100 initial centres sampled randomly from the data set and run the algorithm either until convergence of the objective function or until 30 iterations are reached. We note that the clustering output is exactly the same as the one produced by the conventional algorithm which we implemented for comparison.

[Fig. 1: Left: Contribution of the pre-processing to the overall computation performed (x-axis: standard deviation). Right: Frequency of candidate set sizes (x-axis: candidate list length; y-axis: mean occurrences).]

Fig. 1 (left) shows the profiling results obtained from this test case for several standard deviations σ (ranging from well-distinguished clusters to a nearly unclustered point set). For all cases, the number of DCEs performed during tree creation is only a fraction of the total number of DCEs. We perform the pre-processing in software, and the FPGA implementation described in the following section focuses on the tree traversal phase only. Fig. 1 (right) shows the frequency of candidate centre set sizes averaged over all cases above. During tree processing, most sets contain only 2 or 3 centres, which shows the effectiveness of the search space pruning.

4. FPGA IMPLEMENTATION

Three out of the four algorithm kernels described in Section 2 are data-flow centric: 1) The closest centre search computes Euclidean distances to either C's mid point or the node's weighted centroid, followed by a min-search. 2) The pruning kernel performs two slightly modified distance computations to decide whether any part of the bounding box C crosses the hyperplane bisecting the line between two centres z* and z (see [7] for a more detailed description). Those centres z for which this is not the case are flagged and no longer considered by subsequent processing units. 3) When the centre update is performed on a leaf node or a one-element candidate set, it is a simple accumulation operation which updates a centre buffer. After a tree traversal is completed, the information contained in this buffer is used to update the centre positions memory before the next iteration starts. All three kernels are implemented in a pipelined, stream-based processing core. This core has a hardware latency of 31 clock cycles and can accept a node-candidate-centre pair on every other clock cycle.

The heart of the filtering algorithm is the traversal of the kd-tree which is built from the data points. During these traversals, the algorithm maintains and propagates a set of potential candidates for the nearest centre to a data point (or, generally, a subset of points). Ineligible candidates are discarded so that this set monotonically shrinks during traversal.
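The pruning test in kernel 2 follows the bounding-box criterion of the filtering algorithm in [7]. A minimal software sketch, assuming axis-aligned boxes; the function and variable names are ours, and the hardware version uses two slightly modified distance computations rather than this formulation:

```python
def prune_candidate(z_star, z, c_min, c_max):
    """Return True if candidate centre z can be discarded: no part of the
    bounding box [c_min, c_max] lies on z's side of the hyperplane
    bisecting the segment between z_star (current closest centre) and z."""
    # corner of the box that is extremal in the direction z - z_star
    v = [hi if (zi - si) > 0 else lo
         for si, zi, lo, hi in zip(z_star, z, c_min, c_max)]
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    # if even this corner is at least as close to z_star, z is ineligible
    return dist2(v, z) >= dist2(v, z_star)
```

For example, with z_star = (0, 0) and the box [0, 1]^2, a candidate at (10, 0) is pruned, while one at (1.5, 0) survives because the bisecting hyperplane crosses the box.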
[Fig. 2: Top: processing core. Tree node, centre index (look-up) and centre positions memories (addresses for tree node and candidate set; read and write/read ports) feed a 31-stage pipeline comprising (1) closest centre search, (2) candidate pruning and distortion calculation, and (3) traversal decision and centre update, with resynchronisation (one node-centre pair per clock cycle), a centre buffer, and a node & candidate set address stack with dynamic memory allocation. Bottom: candidate sets in the centre index memory (scratchpad); r(n)/w(n) denote a read/write in step n.]

Node-centre set pairs must be scheduled for processing. Two node-centre pairs are independent if they do not exhibit any read-write dependencies. Fig. 2 (bottom) illustrates the dependency between the centre sets of a node and its immediate children: processing the parent node writes the new (thinned-out) candidate set into memory, and it is read when processing the two child nodes. Thus, all nodes (and their associated centre sets) which do not have any direct parent-child relation can be processed independently of each other in the pipeline.

We implement this scheduling task using the stack: every pointer to a node-centre set pair pushed onto the stack naturally points to children whose parent has previously been processed, and thus all items on the stack point to disjoint subtrees and are independent. The scheduler in the stack management fetches a new node and centre set pointer from the stack as soon as the pipeline is ready to accept new data. Independent centre sets are read and written simultaneously.
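The stack-driven scheduling can be modelled in software as follows. This is a sketch of the issue order only, not of the pipelining itself; the Node record and function names are our assumptions:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    # hypothetical kd-tree node: an identifier and up to two children
    ident: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def schedule(root: Node) -> List[int]:
    """Issue order of node-centre set pairs: a pair is fetched from the
    stack as soon as the pipeline can accept new data; children are
    pushed only after their parent was processed, so all stacked items
    refer to disjoint subtrees and are mutually independent."""
    order = []
    stack = [root]                  # node & cand. set address stack
    while stack:
        node = stack.pop()          # pipeline ready -> fetch new work
        order.append(node.ident)    # kernels (1)-(3) process the pair;
        # the thinned-out candidate set would be written to the
        # scratchpad here and read back twice, once per child
        for child in (node.right, node.left):
            if child is not None:
                stack.append(child)
    return order
```

Pushing the right child first makes the left subtree drain before the right one, matching a depth-first traversal; any pop order would be valid, since stacked items are independent by construction.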
1:  if scratchpad write requested then
2:      if heap utilisation < bound B then
3:          newAddress ← allocateNextFreeAddress()
4:      else
5:          newAddress ← 0 (default full-size set)
6:      end if
7:  end if
8:  if scratchpad read requested then
9:      if readSecondTime(readAddress) then
10:         free(readAddress)
11:     end if
12: end if

[Fig. 4: Left: Average cycle count per iteration for the FPGA implementation of the filtering algorithm (P = 1). Right: Speed-up over an FPGA implementation of the conventional algorithm (curves: N = 16384/K = 128, N = 10000/K = 100, Lena (16384, 128); x-axis: standard deviation).]
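The allocation policy in the pseudocode above can be modelled in software. The class structure, the linear next-free counter and the read-count bookkeeping are our simplifications; a hardware allocator would recycle freed addresses from a free list:

```python
class Scratchpad:
    """Model of the dynamic candidate-set allocation: a write obtains a
    fresh heap address while heap utilisation is below the bound B,
    otherwise address 0 (the default full-size set); an address is freed
    on its second read, i.e. after both children have consumed it."""
    def __init__(self, bound_b: int):
        self.bound_b = bound_b
        self.next_free = 1      # address 0 is reserved for the default set
        self.in_use = set()
        self.reads = {}         # address -> number of reads so far

    def write(self) -> int:
        if len(self.in_use) < self.bound_b:   # heap utilisation < bound B
            addr = self.next_free             # allocateNextFreeAddress()
            self.next_free += 1
            self.in_use.add(addr)
        else:
            addr = 0                          # default full-size set
        self.reads[addr] = 0
        return addr

    def read(self, addr: int) -> None:
        self.reads[addr] = self.reads.get(addr, 0) + 1
        if addr != 0 and self.reads[addr] == 2:   # readSecondTime()
            self.in_use.discard(addr)             # free(readAddress)
```

With bound_b = 2, the third concurrent write falls back to address 0; once an earlier set has been read twice and freed, new writes again receive dedicated addresses.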
[Fig. 5: Mean execution time per iteration over FPGA resources for N = 16384, K = 128, D = 3, σ = 0.2 (panels: execution time (µs) versus LUT (×10k), REG (×10k), DSP48 and BRAM utilisation; curves: Conv. p = 23/42/82 and Filt. p = 1/2/4).]

required to meet a target throughput, as well as the Pareto-optimal frontiers (AT = const.) of the smallest AT-product. The runtime advantage of the filtering algorithm needs to be countered by significantly increased parallelism of computational units in the conventional implementation (23×–82×). Table 1 shows a resource comparison as well as the absolute and relative utilisation for a throughput constraint of 128 µs per iteration. For DSP48, LUT and REG resources, the efficiency advantage of the filtering algorithm in hardware is obvious. However, the implementation of the filtering algorithm requires more memory compared to the conventional implementation. This is mainly due to the increased on-chip memory space required to store the data points in the kd-tree structure. The availability of BRAM resources is thus the limiting factor in scaling the current implementation to support larger data sets. For the configuration above and a maximum value of K = 2048, the 7vx485t-1 FPGA can support a maximum point set size of N = 65536.

6. CONCLUSION

This paper presents an FPGA implementation of the computationally efficient filtering algorithm for K-means clustering in order to investigate the algorithm's practicality in hardware. We evaluate […] times fewer LUTs, the types of resources which are exhausted first in the conventional implementation. In future work, we will store the tree structure containing larger data sets in off-chip memory and design an application-specific prefetching and cache interface to alleviate the performance degradation due to the drop of memory bandwidth.

7. REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognition Lett., vol. 31, no. 8, pp. 651–666, June 2010.

[2] D. Clark and J. Bell, "Multi-target state estimation and track continuity for the particle PHD filter," IEEE Trans. Aerosp. Electron. Syst., vol. 43, no. 4, pp. 1441–1453, Oct. 2007.

[3] T. Saegusa and T. Maruyama, "An FPGA implementation of real-time K-means clustering for color images," Real-Time Image Processing, vol. 2, no. 4, pp. 309–318, Nov. 2007.

[4] J. P. Theiler and G. Gisler, "Contiguity-enhanced k-means clustering algorithm for unsupervised multispectral image segmentation," in Proc. SPIE 3159, 1997, pp. 108–118.

[5] M. Estlick, M. Leeser, J. Theiler, and J. J. Szymanski, "Algorithmic transformations in the implementation of k-means clustering on reconfigurable hardware," in Proc. Int. Symp. on Field Programmable Gate Arrays, 2001, pp. 103–110.

[6] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, "Clustering large graphs via the singular value decomposition," Mach. Learn., vol. 56, no. 1-3, pp. 9–33, June 2004.

[7] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu, "An efficient k-means clustering algorithm: Analysis and implementation," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 7, pp. 881–892, July 2002.

[8] M. Leeser, J. Theiler, M. Estlick, and J. Szymanski, "Design tradeoffs in a hardware implementation of the k-means clustering algorithm," in Proc. IEEE Sensor Array and Multichannel Signal Process. Workshop, 2000, pp. 520–524.

[9] X. Wang and M. Leeser, "K-means clustering for multispectral images using floating-point divide," in Symp. on Field-Programmable Custom Comput. Mach., 2007, pp. 151–162.

[10] S. Tyree, K. Q. Weinberger, K. Agrawal, and J. Paykin, "Parallel boosted regression trees for web search ranking," in Proc. Int. Conf. on World Wide Web, 2011, pp. 387–396.

[11] T.-W. Chen and S.-Y. Chien, "Flexible hardware architecture of hierarchical K-means clustering for large cluster number," IEEE Trans. VLSI Syst., vol. 19, no. 8, pp. 1336–1345, Aug. 2011.