
Application Notes

Paulo J.G. Lisboa, Liverpool John Moores University, UK
Alfredo Vellido, Technical University of Catalonia, SPAIN
Roberto Tagliaferri and Francesco Napolitano, University of Salerno, ITALY
Michele Ceccarelli, University of Sannio, ITALY
José D. Martín-Guerrero, University of Valencia, SPAIN
Elia Biganzoli, University of Milano, ITALY

Data Mining in Cancer Research

I. Introduction

Advances in cancer medicine have traditionally come from detailed understanding of biological processes, later translated into therapeutic interventions whose effectiveness is established by rigorous analysis of clinical trials. Over the last two decades the increasing throughput of data from microarray screening, spectral imaging and longitudinal studies has been turning the understanding of cancer pathology into as much a data-based as a biologically and clinically driven science, with the potential to impact more strongly on evidence-based decision support, moving towards personalized medicine [1]. This article is not intended as a comprehensive survey of data mining applications in cancer. Rather, it provides starting points for further, more targeted, literature searches by embarking on a guided tour of computational intelligence applications in cancer medicine, structured in increasing order of the physical scales of biological processes, as outlined in Table 1.

TABLE 1 Overview of healthcare interventions across physiological scales.

PHYSIOLOGICAL SCALE: GENOME AND PHYSIOME; CELL AND TISSUE LEVELS; ORGAN LEVEL; SYSTEM LEVEL; LEVEL OF THE INDIVIDUAL; POPULATION LEVEL
BIOLOGICAL PROCESSES: MOLECULAR BIOLOGY; METABOLIC PATHWAYS; CYTOLOGY AND HISTOLOGY; MEDICAL IMAGING; ELECTROPHYSIOLOGICAL MEASUREMENT; CLINICAL SIGNS; OUT-OF-HOSPITAL CARE; LIFESTYLE FACTORS
INDUSTRIAL SECTOR: BIOINFORMATICS; PHARMACEUTICS; SYSTEMS BIOLOGY; LABORATORY TESTS; NEUROINFORMATICS; MEDICAL INFORMATICS; MEDICAL EQUIPMENT & INSTRUMENTATION; PHARMACEUTICS; PUBLIC HEALTH INFORMATICS
APPLICATION DOMAIN: SUSCEPTIBILITY TO DISEASE; TARGETS FOR INTERVENTION; DISEASE DIAGNOSIS; ANATOMICAL AND PHYSIOLOGICAL MONITORING; DIAGNOSIS, PROGNOSIS AND SCREENING; AMBULATORY MONITORING FOR EARLY DIAGNOSIS; DISEASE PROTECTION AND PREVENTION

Digital Object Identifier 10.1109/MCI.2009.935311
IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE | FEBRUARY 2010 | 1556-603X/10/$26.00 © 2010 IEEE

II. Robust Clustering and Data Visualization

Over the last decade, cancer has become a data-intensive area of research, with accelerating developments in data-acquisition technologies and specific methodologies, for example next-generation sequencing for monitoring genetic changes in tumor cells as they progress from normal to invasive [2]. The first step in unravelling these processes is the study of co-occurrences, leading to the identification of disease sub-types. This already has profound implications for the clinical management of patients, impacting on one of the most fundamental precepts of cancer medicine: the histological characterisation of malignancies. Currently this is overwhelmingly done by measuring the differentiation between cancerous cells and the original normal cells from which they have evolved, marking them as well to poorly differentiated, meaning that the changes are linked with good to poor prognosis. Yet it is becoming apparent that the metabolism of cells is a key factor in uncontrolled cell proliferation, with a potentially greater influence on disease progression than cell differentiation alone.

Clustering is the first-line analysis methodology to mine databases for useful research hypotheses to take on to further, more directed, study [3], [4]. From a mathematical perspective the objects of study, be they genes, patients, drug modes of action or disease sub-types, are points in a multi-dimensional space where similarity is defined, typically through a measure of the distance between object pairs. The main classes of clustering approaches used in data mining for cancer are partitional, which create a simple partition of the data, and hierarchical, which create a tree-structured hierarchy of the data. Partitional approaches form a very heterogeneous class, including techniques based on Global Optimization, Vector Quantization, Kernel functions, Fuzzy sets, and Graph theory [5]. Hierarchical approaches can be agglomerative or divisive and are often used for first-line clustering of high-throughput data. However, high-dimensional data spaces are by nature ultra-metric, each data point appearing to be at the centre of a hollow sphere, in contrast with the usual intuitive picture in low dimensions [6]. This effect needs to be taken into account, especially with agglomerative methods, to avoid early-stage errors that result in sub-optimal solutions compared with partition algorithms. A third main class of algorithms is biclustering, that is to say, clustering of both objects and features within a single model [7].

The range of clustering techniques illustrates the complexity of the problem. The fundamental concept underlying the definition of clusters, similarity, is complex in itself because of its ambiguity. This difficulty is compounded when the most relevant features are not known a priori, yet the choice of different features usually produces different solutions. An open research question is feature selection operating with unsupervised training. This will normally involve parametric modeling and the application of Bayesian methods to derive the discriminative features which best separate the clusters [8]. In addition to such intrinsic complexity, purely algorithmic issues may give rise to different solutions for the same data set, even for the same algorithm. This raises the possibility that cluster stabilization methods, including global stability assessments [10], [11] and consensus clustering [12], [13], may miss the real complexity of the problem. An alternative approach to clustering assessment shifts the focus from the search for the best solution to the search for a set of equally plausible, though different, solutions [14].
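The dependence of a clustering solution on purely algorithmic choices can be made concrete with a small experiment. The following is a minimal sketch, not a method taken from the cited references: it assumes a plain k-means (Lloyd's algorithm) as the partitional method, synthetic two-dimensional data, and the Rand index as the measure of agreement between two partitions. The function names `kmeans` and `rand_index` and all data values are illustrative assumptions.

```python
import random
from itertools import combinations


def kmeans(points, k, seed, iters=50):
    """Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Synthetic data (an assumption for illustration): three overlapping blobs.
data_rng = random.Random(0)
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in [(0, 0), (4, 0), (2, 3)] for _ in range(30)]

# Same data, same algorithm, different random initializations.
runs = [kmeans(points, k=3, seed=s) for s in range(5)]
for s in range(1, 5):
    print(f"seed 0 vs seed {s}: Rand index = {rand_index(runs[0], runs[s]):.3f}")
```

A Rand index of 1 means two runs produced the same partition up to a relabelling of the clusters; values below 1 indicate that the initialization alone changed the solution. The matrix of pairwise agreements between many such runs is the kind of input that stability assessment and consensus methods work from.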
Recently, metaclustering [15] has been proposed as a clustering assessment tool. It is the process of clustering the solutions, i.e., the cluster partitions themselves, generalizing the concept of global stability to local stability (see Fig. 1), that is, the stability of a solution with respect to a homogeneous subset of the candidates.

The cluster composition then needs to be explored in detail, for which dendrograms provide a useful representation, usually through sequential univariate splitting of the data space. However, the interpretation of geometrical relationships in the data is often best derived from direct spatial representations. The most commonly used methods for data visualization are Multi-Dimensional Scaling, Principal Component Analysis (PCA) and Self-Organizing Maps [16], [17]. Linear methods are particularly interesting since they enable axes projections to be overlaid on the data. An efficient methodology for cluster-based linear visualization is to use a decomposition of the covariance matrix which explicitly takes into account cluster membership. This is the natural tailoring of PCA to the case where the data are partitioned into labelled cohorts; an example from breast cancer proteomics is shown in Fig. 2 [18].
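One way to read "a decomposition of the covariance matrix which explicitly takes into account cluster membership" is to split the total scatter into within-cluster and between-cluster parts and to project the data onto the leading directions of the between-cluster part. The sketch below illustrates only that general idea and should not be taken as the algorithm of [18]: the function names, the use of power iteration to obtain a single leading direction, and the toy data are all assumptions.

```python
import math


def between_cluster_scatter(points, labels):
    """S_B = sum over clusters k of n_k * (m_k - m)(m_k - m)^T."""
    dim = len(points[0])
    overall = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    scatter = [[0.0] * dim for _ in range(dim)]
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        # Offset of the cluster mean from the overall mean.
        diff = [sum(p[d] for p in members) / len(members) - overall[d]
                for d in range(dim)]
        for r in range(dim):
            for c in range(dim):
                scatter[r][c] += len(members) * diff[r] * diff[c]
    return scatter


def leading_direction(matrix, iters=200):
    """Leading eigenvector of a symmetric positive semi-definite matrix,
    found by power iteration from a uniform start vector."""
    dim = len(matrix)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        w = [sum(matrix[r][c] * v[c] for c in range(dim)) for r in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # start vector lies in the null space; stop
            break
        v = [x / norm for x in w]
    return v


def project(points, axis):
    """Coordinate of each point along the given axis."""
    return [sum(p[d] * axis[d] for d in range(len(axis))) for p in points]


# Toy data (an assumption): two labelled cohorts separated along the first axis.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1), (11, 0), (11, 1)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

axis = leading_direction(between_cluster_scatter(points, labels))
print("projection axis:", axis)
print("projected data:", project(points, axis))
```

Because the projection is linear, the original variables can be drawn as axes on the same plot, which is the property the text highlights for linear methods. With more than two clusters, further directions would be obtained from the remaining eigenvectors of the scatter matrix.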
III. Pathway Modeling

Cell functions are modelled from experimental data sets which are both complex and dynamical [19], [20]. This has spawned the field of Systems Biology, which seeks to understand and describe how molecular components interact to manifest the phenotypic behaviour which is the target of disease intervention [21], as illustrated in Fig. 3. In the systems biology literature there are two main classes of approaches: inverse problems, which involve mining data from many perturbation experiments to identify a compatible structure of the signaling pathway; and direct problems, which forecast and simulate the forward evolution of dynamical systems using differential equations. Several methods make the assumption that the expression level of a given gene can be considered as a random variable and that the mutual relationships between genes can be derived from statistical dependencies. The resulting system is embedded in a probabilistic Graphical Model [22], described for example by a Bayesian Network [23], a Factor Graph [24], a Gaussian Graphical Model [25] or a Mutual Information derived Influence Network [26].
IV. Cancer Imaging

FIGURE 1 An illustration of meta-clustering. Each point on the map represents a clustering solution and close points represent similar solutions. Colors indicate a measure of quality for the solutions (legend label: Distortion) and red circles highlight local stability [15].

FIGURE 2 Two views of cluster-based visualization of 25-d protein expression data from breast cancer histology, plotted as projections onto Axes 1, 2 and 3, showing (a) the HER-2 positive cohort distinct from the rest and (b) a rotation of the 3-d view to illustrate the separation between the projections of the remaining seven clusters [18].

Cancer imaging is a vast field focused on non-invasive measurement for cancer detection, diagnosis and grading, and on monitoring response to therapy by measuring tumor size and detecting regrowth. Having started out mainly as anatomical measurement for anomaly detection as an indicator of malignancy, for instance using ultrasound and Magnetic Resonance Imaging (MRI), cancer imaging has developed into multiple modalities of physiological imaging. These include spectroscopy, with Magnetic Resonance Spectroscopy (MRS) and spatially localized Chemical Shift Images (CSI), as well as Positron-Emission Tomography (PET) and Single-Photon Emission CT (SPECT), which are widely used clinically to measure glucose metabolism and therefore to find regions with exceptionally high energetic rates, which are characteristic of tumor growth. An accessible review of machine learning applications to detection and diagnosis of disease may be found in [27].

A recent development is the use of specific functional imaging for tumor delineation, that is to say, to measure the contour during surgical planning and to identify tumor striations that may not be apparent clinically even during resection. Initial efforts in this direction applied blind signal separation to single-voxel spectra to separate infiltrating tumor mass from necrotic tissue [28]. However, tumors are not homogeneous, often mixing different types of tissue, sometimes with different grades of malignancy. One way to resolve this mixture is to segment the tumor composition, which is called nosological imaging; examples of diagnostic applications in the brain are reported in [29], [30]. A related and very important clinical area is follow-up imaging to determine tumor progression and to differentiate recurrent tumor growth from treatment-induced tissue changes [31], [32].
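The idea of segmenting tumor composition voxel by voxel can be illustrated with a deliberately simple stand-in. The sketch below labels each voxel spectrum with the nearest of a few reference spectra; it is a nearest-prototype classifier, not blind signal separation and not the methods of [29], [30]. The tissue names, the three-bin "spectra" and the function names are hypothetical and chosen only to keep the example readable.

```python
def nearest_prototype(spectrum, prototypes):
    """Return the tissue name whose prototype spectrum is closest (Euclidean)."""
    return min(
        prototypes,
        key=lambda name: sum((s - q) ** 2 for s, q in zip(spectrum, prototypes[name])),
    )


def nosologic_map(voxel_grid, prototypes):
    """Label every voxel of a 2-D grid of spectra with its nearest tissue type."""
    return [[nearest_prototype(spectrum, prototypes) for spectrum in row]
            for row in voxel_grid]


# Hypothetical three-bin reference spectra; real MRS spectra have far more points.
prototypes = {
    "normal": (1.0, 0.2, 0.1),
    "tumor": (0.3, 1.0, 0.2),
    "necrosis": (0.1, 0.2, 1.0),
}

# Hypothetical 2 x 2 grid of voxel spectra.
grid = [
    [(0.9, 0.3, 0.1), (0.4, 0.9, 0.2)],
    [(0.3, 0.8, 0.4), (0.1, 0.3, 0.9)],
]

for row in nosologic_map(grid, prototypes):
    print(row)
```

The resulting grid of labels plays the role of a nosologic image: a map in which each location is coloured by tissue type rather than by signal intensity. In practice the reference spectra would themselves have to be estimated, and mixed voxels would call for fractional rather than hard labels.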
V. Prognostic Outcome

Outcome modeling usually refers to the analysis of patterns of mortality and recurrence in longitudinal cohort studies, where a specific group of patients is followed up for up to 20 years, recording the occurrence of events of interest, which may be cancer-specific mortality, overall mortality, or local and distal recurrence. The value of these studies is closely linked to the design of the follow-up process, highlighting the need to involve biostatistical expertise from the very earliest stages in study design. It is required to model the likelihood of the event of interest, conditional on the event not having taken place by the start of the time interval, and taking into account the loss of patients to follow-up for reasons other than the event of interest, termed censorship.

The application in medical statistics of linear-in-the-parameters models to nonlinear effects in outcome data generally involves discretising the data space to generate piecewise linear approximations to the desired response surfaces. However, this heuristic approach can have important consequences in cancer research, making it difficult to mine the hazard function for insights into the dynamics of metastatic spread [33] and inducing a categorisation of the histological grade of the tumor, which can give rise to important subjective effects that impact on the choice of therapy given to the patient [34]. Neural network models fit non-linear, time-dependent estimates of the hazard function without the need to make proportionality assumptions or to discretize the state-space [35], [36]. Their accuracy has been established by multi-centre, double-blind evaluation [37], and this framework has recently been extended to estimate the joint model for more than one risk, e.g. local and distal recurrence [38].
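The conditional likelihood described at the start of this section, the probability of the event in a time interval given that it has not occurred before, with censored patients leaving the risk set, is the discrete-time hazard. A minimal life-table sketch is given below. It is not the neural network model of [35], [36]; it only shows the quantity that such models estimate as a function of covariates. The data, the function names, and the convention that a censored patient counts as at risk throughout the interval in which they were last seen are assumptions.

```python
def discrete_hazard(follow_up, n_intervals):
    """Life-table estimate of the hazard in intervals 1..n_intervals.

    follow_up: list of (last_interval, event) pairs, where event is True if
    the event of interest occurred in last_interval and False if the patient
    was censored there (assumed censored at the end of that interval).
    """
    hazards = []
    for t in range(1, n_intervals + 1):
        # Risk set: patients still under observation at the start of interval t.
        at_risk = sum(1 for last, _ in follow_up if last >= t)
        events = sum(1 for last, event in follow_up if last == t and event)
        hazards.append(events / at_risk if at_risk else float("nan"))
    return hazards


def survival_curve(hazards):
    """S(t) = product over intervals up to t of (1 - hazard)."""
    surv, curve = 1.0, []
    for h in hazards:
        surv *= 1.0 - h
        curve.append(surv)
    return curve


# Hypothetical cohort of five patients followed over three intervals.
follow_up = [(1, True), (2, False), (2, True), (3, True), (3, False)]

hazards = discrete_hazard(follow_up, n_intervals=3)
print("hazard per interval:", hazards)
print("survival estimate:  ", survival_curve(hazards))
```

In this toy cohort the risk sets contain 5, 4 and 2 patients, with one event in each interval, so the hazard estimates are 1/5, 1/4 and 1/2. The censored patient in interval 2 contributes to the denominator there but to no later risk set, which is how censorship enters the estimate without being counted as an event.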
VI. Conclusion

FIGURE 3 An excerpt of the papillary thyroid carcinoma pathway from the KEGG database (Thyroid Cancer map 05216, 3/31/09, Kanehisa Laboratories). The diagram links genetic alterations in thyroid follicular epithelial cells (oncogenes including RET/PTC, TRK, BRAF, Ras, PPFP and β-catenin; tumor suppressor p53) to the MAPK, PPAR, p53 and Wnt signaling pathways and to follicular adenoma and papillary, follicular and undifferentiated carcinoma.

Computational intelligence methodology has been widely applied to data mining in cancer, but open questions remain before we can fully exploit the potential of molecular biology, integrate it into routine imaging and other non-invasive measurements for screening and early diagnosis, deliver consistency in laboratory measurements, especially in histopathology, and model more accurately the balance between benefit and risk of therapy for niche groups of patients, down to the level of the individual. New data mining methods need to be developed to handle high dimensionality and large data volumes, to model more efficiently the dynamics of deep molecular pathways with high levels of noise and small experimental sample sizes, and to estimate hazard functions robustly from censored data with multiple competing risks. Analytical models also need to be integrated with clinical expert knowledge and processes, potentially by mapping into Boolean filters [39]. The road ahead is long and bumpy. This article is provided as a sign-post to the nature of the key methodological and application challenges in this growing and exciting field.
Acknowledgments

This work was supported in part by the Biopattern NoE under FP6/2002/IST/1 and IST-2002-508803. Further support was provided by MICINN research project TIN2009-13895-C02-01.
References
[1] P. J. G. Lisboa and A. F. G. Taktak, "The use of artificial neural networks in decision support in cancer: A systematic review," Neural Netw., vol. 19, pp. 408–415, 2006.
[2] P. J. Campbell, et al., "Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing," Nature Genetics, vol. 40, pp. 722–729, 2008.
[3] N. Belacel, Q. C. Wang, and M. Cuperlovic-Culf, "Clustering methods for microarray gene expression data," Omics, vol. 10, no. 4, pp. 507–531, 2006.
[4] D. Jiang, C. Tang, and A. Zhang, "Cluster analysis for gene expression data: A survey," IEEE Trans. Knowledge Data Eng., vol. 16, no. 11, pp. 1370–1386, 2004.
[5] D. Wunsch and R. Xu, Clustering (IEEE Press Series on Computational Intelligence). Washington, DC: IEEE Computer Society Press, 2008.
[6] F. Murtagh, G. Downs, and P. Contreras, "Hierarchical clustering of massive, high dimensional data sets by exploiting ultrametric embedding," SIAM J. Sci. Comput., vol. 30, pp. 707–730, 2008.
[7] S. C. Madeira and A. L. Oliveira, "Biclustering algorithms for biological data analysis: A survey," IEEE/ACM Trans. Computat. Biol. Bioinform., vol. 1, no. 1, pp. 24–45, 2004.
[8] L. Carrivick, et al., "Identification of prognostic signatures in breast cancer microarray data using Bayesian techniques," J. R. Soc. Interface, vol. 3, no. 8, pp. 367–381, 2006.
[9] A. Ben-Hur, A. Elisseeff, and I. Guyon, "A stability based method for discovering structure in clustered data," in Biocomputing (Proc. Pacific Symp.), vol. 7, R. B. Altman and K. Lauderdale, Eds. Kauai, Hawaii, USA, 2002, pp. 6–17.
[10] M. K. Kerr and G. A. Churchill, "Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments," Proc. Natl. Acad. Sci., vol. 98, pp. 8961–8965, 2001.
[11] L. I. Kuncheva and D. P. Vetrov, "Evaluation of stability of k-means cluster ensembles with respect to random initialization," IEEE Trans. Pattern Anal. Machine Intell., vol. 28, no. 11, pp. 1798–1808, 2006.
[12] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Trans. Knowl. Discov. Data, vol. 1, no. 1, 2007.
[13] N. Nguyen and R. Caruana, "Consensus clustering," in Proc. 7th IEEE Int. Conf. Data Mining (ICDM), 2007, pp. 607–612.
[14] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith, "Meta clustering," in Proc. 6th IEEE Int. Conf. Data Mining (ICDM), 2006, pp. 107–118.
[15] A. Ciaramella, et al., "Interactive data analysis and clustering of genomic data," Neural Netw., vol. 21, pp. 368–378, 2008.
[16] J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. New York: Springer-Verlag, 2007.
[17] J. Vesanto, "SOM-based data visualization methods," Intell. Data Anal., vol. 3, pp. 111–126, 1999.
[18] P. J. G. Lisboa, I. O. Ellis, A. R. Green, F. Ambrogi, and M. B. Dias, "Cluster-based visualisation with scatter matrices," Pattern Recogn. Lett., vol. 29, no. 13, pp. 1814–1823, 2008.
[19] J. Hasty, D. McMillen, F. Isaacs, and J. J. Collins, "Computational studies of gene regulatory networks: In numero molecular biology," Nature Rev. Genet., vol. 2, pp. 268–279, 2001.
[20] H. de Jong, "Modeling and simulation of genetic regulatory systems: A literature review," J. Computat. Biol., vol. 9, no. 1, pp. 67–103, 2002.
[21] E. J. Borowski and J. M. Borwein, Computational Modeling of Genetic and Biochemical Networks. Cambridge, MA: MIT Press, 2001.
[22] N. Friedman, "Inferring cellular networks using probabilistic graphical models," Science, vol. 303, p. 799, 2004.
[23] C. Needham, J. Bradford, A. Bulpitt, and D. Westhead, "A primer on learning in Bayesian networks for computational biology," PLOS Computat. Biol., vol. 3, no. 8, pp. 1409–1416, 2007.
[24] I. Gat-Viks, A. Tanay, D. Raijaman, and R. Shamir, "A probabilistic methodology for integrating knowledge and experiments on biological networks," J. Computat. Biol., vol. 13, no. 2, pp. 165–181, 2006.
[25] J. Schäfer and K. Strimmer, "An empirical Bayes approach to inferring large-scale gene association networks," Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.
[26] K. Basso, et al., "Reverse engineering of regulatory networks in human B cells," Nature Genet., vol. 37, no. 4, pp. 382–390, 2005.
[27] P. Sajda, "Machine learning for detection and diagnosis of disease," Annu. Rev. Biomed. Eng., vol. 8, pp. 537–565, 2006.
[28] Y. Huang, P. J. G. Lisboa, and W. El-Deredy, "Tumour grading from magnetic resonance spectroscopy: A comparison of feature extraction with variable selection," Stat. Med., vol. 22, no. 1, pp. 147–164, 2003.
[29] F. Szabo De Edelenyi, et al., "A new approach for analyzing proton magnetic resonance spectroscopic images of brain tumors: Nosologic images," Nature Med., vol. 6, pp. 1287–1289, 2000.
[30] M. De Vos, T. Laudadio, A. W. Simonetti, A. Heerschap, and S. Van Huffel, "Fast nosologic imaging of the brain," J. Magnet. Reson., vol. 184, no. 2, pp. 292–301, 2007.
[31] A. H. Jacobs, et al., "Imaging in neurooncology," J. Amer. Soc. Exp. NeuroTherapeu., vol. 2, pp. 333–347, 2005.
[32] S. Cha, "Neuroimaging in neuro-oncology," J. Amer. Soc. Exp. NeuroTherapeu., vol. 6, pp. 465–477, 2009.
[33] R. Demicheli, P. Valagussa, and G. Bonadonna, "Double-peaked time distribution of mortality for breast cancer patients undergoing mastectomy," Breast Cancer Res. Treat., vol. 75, pp. 127–134, 2002.
[34] J. A. Woolgar, "The predictive value of detailed histological staging of surgical resection specimens in oral cancer," in Outcome Prediction in Cancer, A. F. G. Taktak and A. C. Fisher, Eds. U.K.: Elsevier, 2007, pp. 3–26.
[35] E. Biganzoli, P. Boracchi, L. Mariani, and E. Marubini, "Feed forward neural networks for the analysis of censored survival data: A partial logistic regression approach," Stat. Med., vol. 17, pp. 1269–1206, 1998.
[36] A. Eleuteri, R. Tagliaferri, L. Milano, S. De Placido, and M. De Laurentiis, "A novel neural network-based survival analysis model," Neural Netw., vol. 16, no. 5–6, pp. 855–864, 2003.
[37] A. Taktak, et al., "Double-blind evaluation and benchmarking of survival models in a multi-centre study," Comput. Biol. Med., vol. 37, no. 8, pp. 1108–1120, 2007.
[38] P. J. G. Lisboa, et al., "Partial logistic artificial neural network for competing risks regularised with automatic relevance determination," IEEE Trans. Neural Netw., vol. 20, no. 9, pp. 1403–1416, 2009.
[39] T. Rögnvaldsson, T. A. Etchells, L. You, D. Garwicz, I. H. Jarman, and P. J. G. Lisboa, "How to find simple and accurate rules for viral protease cleavage specificities," BMC Bioinform., vol. 10, p. 149, 2009.
