
Elysium Technologies Private Limited

ISO 9001:2008 A leading Research and Development Division


Madurai | Chennai | Kollam | Ramnad | Tuticorin | Singapore

Abstracts: Bio-Informatics 2010 - 2011

01 Sparse Support Vector Machines with Lp Penalty for Biomarker Identification

The development of high-throughput technology has generated a massive amount of high-dimensional data, and many
of them are of discrete type. Robust and efficient learning algorithms such as LASSO [1] are required for feature
selection and overfitting control. However, most feature selection algorithms are applicable only to continuous data types. In this paper, we propose a novel method for sparse support vector machines (SVMs) with Lp (p < 1)
regularization. Efficient algorithms (LpSVM) are developed for learning a classifier that is applicable to high-
dimensional data sets with both discrete and continuous data types. The regularization parameters are estimated
through maximizing the area under the ROC curve (AUC) of the cross-validation data. Experimental results on protein
sequence and SNP data attest to the accuracy, sparsity, and efficiency of the proposed algorithm. Biomarkers
identified with our methods are compared with those from other methods in the literature. The software package in
Matlab is available upon request.
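
A minimal sketch of the regularization-parameter search described above, using scikit-learn's L1-penalized linear SVM as a stand-in for the paper's Lp (p < 1) penalty (the actual LpSVM solver is not reproduced here); the candidate C values, the synthetic data, and the 5-fold setup are illustrative assumptions.

```python
# Sketch: pick a sparsity-inducing regularization strength by maximizing
# cross-validated AUC, in the spirit of the LpSVM abstract above.
# NOTE: scikit-learn's L1 penalty stands in for the paper's Lp (p < 1) penalty.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional data: 100 samples, 2,000 features, few informative.
X, y = make_classification(n_samples=100, n_features=2000,
                           n_informative=20, random_state=0)

best_C, best_auc = None, -np.inf
for C in [0.01, 0.03, 0.1, 0.3, 1.0]:           # candidate regularization strengths
    clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_C, best_auc = C, auc

final = LinearSVC(penalty="l1", dual=False, C=best_C, max_iter=5000).fit(X, y)
n_selected = int(np.sum(final.coef_ != 0))      # nonzero weights = selected biomarkers
print(f"best C={best_C}, CV AUC={best_auc:.3f}, features kept={n_selected}")
```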

02 Sorting Genomes by Reciprocal Translocations, Insertions, and Deletions

The problem of sorting by reciprocal translocations (abbreviated as SBT) arises from the field of comparative
genomics: find a shortest sequence of reciprocal translocations that transforms one genome into another, under the restriction that the two genomes contain the same genes. SBT has been proved to be polynomial-time solvable, and several polynomial algorithms have been developed. In this paper, we show how to extend Bergeron's SBT algorithm to include insertions and deletions, allowing the comparison of genomes containing different genes. In particular, if the gene set of the source genome is a subset (or superset, respectively) of the gene set of the target genome, we present an approximation algorithm for transforming the source into the target by reciprocal translocations and deletions (insertions, respectively), providing a sorting sequence of length at most OPT + 2, where OPT is the minimum number of translocations and deletions (insertions, respectively) needed for the transformation; if the two genomes have different genes and neither gene set contains the other, we give a heuristic that transforms one into the other by a shortest sequence of reciprocal translocations, insertions, and deletions,
with bounds for the length of the sorting sequence it outputs. At a conceptual level, there is some similarity between
our algorithm and the algorithm developed by El Mabrouk which is used to sort two chromosomes with different gene
contents by reversals, insertions, and deletions.

03 Signal Quality Measurements for cDNA Microarray Data

Concerns about the reliability of expression data from microarrays inspire ongoing research into measurement error in
these experiments. Error arises at both the technical level within the laboratory and the experimental level. In this paper,
we will focus on estimating the spot-specific error, as there are few currently available models. This paper outlines two
different approaches to quantify the reliability of spot-specific intensity estimates. In both cases, the spatial correlation
between pixels and its impact on spot quality is accounted for. The first method is a straightforward parametric estimate of
within-spot variance that assumes a Gaussian distribution and accounts for spatial correlation via an overdispersion factor. The second method employs a nonparametric quality estimate referred to throughout as the mean square prediction error (MSPE). The MSPE first smooths a pixel region and then measures the difference between the actual pixel values and the smoothed values. Both methods are compared on real and simulated data to assess numerical
characteristics and the ability to describe poor spot quality. We conclude that both approaches capture noise in the
microarray platform and highlight situations where one method or the other is superior.
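
A toy numpy illustration of the two spot-quality ideas sketched above: a within-spot variance estimate and a mean square prediction error (MSPE) computed against a simple local smoother; the 3x3 mean smoother and the simulated spot are assumptions, not the authors' exact estimators.

```python
# Toy illustration of spot-level quality scores for a microarray spot image:
# (1) within-spot variance, (2) MSPE against a simple local smoother.
# The 3x3 mean smoother and simulated pixel grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
spot = 1000 + 50 * rng.standard_normal((15, 15))    # simulated pixel intensities

def mean_smooth(img, k=3):
    """Naive k x k moving-average smoother (edges handled by padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

within_spot_var = spot.var(ddof=1)                  # parametric-style variability
mspe = np.mean((spot - mean_smooth(spot)) ** 2)     # nonparametric MSPE-style score
print(f"within-spot variance: {within_spot_var:.1f}, MSPE: {mspe:.1f}")
```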

04 Reassortment Networks for Investigating the Evolution of Segmented Viruses

Many viruses of interest, such as influenza A, have distinct segments in their genome. The evolution of these viruses
involves mutation and reassortment, where segments are interchanged between viruses that coinfect a host.
Phylogenetic trees can be constructed to investigate the mutation-driven evolution of individual viral segments.
However, reassortment events among viral genomes are not well depicted in such bifurcating trees. We propose the
concept of reassortment networks to analyze the evolution of segmented viruses. These are layered graphs in which
the layers represent evolutionary stages such as a temporal series of seasons in which influenza viruses are isolated.
Nodes represent viral isolates and reassortment events between pairs of isolates. Edges represent evolutionary steps,
while weights on edges represent edit costs of reassortment and mutation events. Paths represent possible
transformation series among viruses. The length of each path is the total edit cost of the events required to transform one virus into another. In order to analyze t stages of evolution of n viruses with segments of maximum length m, we first compute the pairwise distances between all corresponding segments of all viruses in O(m^2 n^2) time using dynamic programming. The reassortment network, with O(tn^2) nodes, is then constructed using these distances. The ancestors and descendants of a specific virus can be traced via shortest paths in this network, which can be found in O(tn^3) time.
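
A small sketch of the layered-graph idea: isolates and a hypothetical reassortment node arranged by season, with edge weights standing in for mutation/reassortment edit costs, and ancestry traced by shortest paths; the toy network and costs are invented for illustration and are not the paper's data.

```python
# Sketch of a layered "reassortment network": nodes are viral isolates (and a
# hypothetical reassortant) arranged by season; edge weights stand in for
# mutation/reassortment edit costs. Shortest paths trace possible ancestries.
# The toy graph and costs below are invented for illustration only.
import heapq

graph = {                                   # node -> list of (neighbor, edit cost)
    "A/2004": [("A/2005", 3.0), ("R1", 1.0)],
    "B/2004": [("R1", 1.5)],
    "R1":     [("A/2005", 0.5)],            # reassortment node between A/2004 and B/2004
    "A/2005": [],
}

def dijkstra(graph, source):
    """Single-source shortest paths over the layered network."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

print(dijkstra(graph, "A/2004"))   # A/2005 reachable at cost 1.5 via the reassortant R1
```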

05 Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis

The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and
consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological
knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we
describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit
semantic relationships between genes and functions. In this work, we use a vector space model and a number of
weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is
able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can assign different weights to GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and
evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this
paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using
this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions
were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to
confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in
our previous approach.

06 On the Importance of Comprehensible Classification Models for Protein Function Prediction

The literature on protein function prediction is currently dominated by works aimed at maximizing predictive accuracy, ignoring the important issues of validation and interpretation of the discovered knowledge, even though such interpretation can yield biologically meaningful insights and hypotheses and advance biologists' understanding of protein functions. The overall goal of this paper is to critically evaluate this approach and to offer a fresh perspective on the issue, focusing not only on predictive accuracy but also on the comprehensibility of the induced protein function prediction models. More
specifically, this paper aims to offer two main contributions to the area of protein function prediction. First, it presents the
case for discovering comprehensible protein function prediction models from data, discussing in detail the advantages of
such models, namely, increasing the confidence of the biologist in the system’s predictions, leading to new insights about
the data and the formulation of new biological hypotheses, and detecting errors in the data. Second, it presents a critical
review of the pros and cons of several different knowledge representations that can be used in order to support the
discovery of comprehensible protein function prediction models.

07 Multidimensional Profiling of Cell Surface Proteins and Nuclear Markers

Cell membrane proteins play an important role in tissue architecture and cell-cell communication. We hypothesize that
segmentation and multidimensional characterization of the distribution of cell membrane proteins, on a cell-by-cell
basis, enable improved classification of treatment groups and identify important characteristics that can otherwise be
hidden. We have developed a series of computational steps to 1) delineate cell membrane protein signals and
associate them with a specific nucleus; 2) compute a coupled representation of the multiplexed DNA content with
membrane proteins; 3) rank computed features associated with such a multidimensional representation; 4) visualize
selected features for comparative evaluation through heat maps; and 5) discriminate between treatment groups in an
optimal fashion. The novelty of our method is in the segmentation of the membrane signal and the multidimensional
representation of phenotypic signature on a cell-by-cell basis. To test the utility of this method, the proposed
computational steps were applied to images of cells that have been irradiated with different radiation qualities in the
presence and absence of other small molecules. These samples are labeled for their DNA content and E-cadherin
membrane proteins. We demonstrate that multidimensional representations of cell-by-cell phenotypes improve
predictive and visualization capabilities among different treatment groups, and identify hidden variables.

08 Molecular Function Prediction Using Neighborhood Features

The recent advent of high-throughput methods has generated large amounts of gene interaction data. This has allowed
the construction of genome-wide networks. A significant number of genes in such networks remain uncharacterized
and predicting the molecular function of these genes remains a major challenge. A number of existing techniques
assume that genes with similar functions are topologically close in the network. Our hypothesis is that genes with similar functions exhibit similar annotation patterns in their neighborhood, regardless of the distance between them
in the interaction network. We thus predict molecular functions of uncharacterized genes by comparing their functional
neighborhoods to genes of known function. We propose a two-phase approach. First, we extract functional
neighborhood features of a gene using Random Walks with Restarts. We then employ a KNN classifier to predict the
function of uncharacterized genes based on the computed neighborhood features. We perform leave-one-out
validation experiments on two S. cerevisiae interaction networks and show significant improvements over previous
techniques. Our technique provides a natural control of the trade-off between accuracy and coverage of prediction. We
further propose and evaluate prediction in sparse genomes by exploiting features from well-annotated genomes.
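
A sketch of the two-phase idea named above: Random Walk with Restart (RWR) profiles computed by power iteration serve as neighborhood features, and a simple nearest-neighbor vote predicts the function of an unannotated gene; the toy network, restart probability, and labels are illustrative assumptions.

```python
# Sketch of the two-phase idea: (1) Random Walk with Restart (RWR) profiles as
# neighborhood features, (2) nearest-neighbor prediction of function labels.
# The toy network, restart probability, and labels are illustrative assumptions.
import numpy as np

A = np.array([[0, 1, 1, 0, 0],       # toy undirected interaction network
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
P = A / A.sum(axis=0, keepdims=True)           # column-normalized transition matrix

def rwr(P, seed, restart=0.3, iters=100):
    """RWR distribution started from one gene (the seed), via power iteration."""
    e = np.zeros(P.shape[0]); e[seed] = 1.0
    p = e.copy()
    for _ in range(iters):
        p = (1 - restart) * P @ p + restart * e
    return p

features = np.vstack([rwr(P, i) for i in range(A.shape[0])])
labels = {0: "ribosome", 1: "ribosome", 3: "kinase"}   # known annotations

# Predict gene 2 from its single nearest annotated neighbor in feature space.
query = 2
dists = {g: np.linalg.norm(features[query] - features[g]) for g in labels}
print("predicted function for gene 2:", labels[min(dists, key=dists.get)])
```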

09 Modeling Protein Interacting Groups by Quasi-Bicliques: Complexity, Algorithm, and Application

Protein-protein interactions (PPIs) are one of the most important mechanisms in cellular processes. To model protein
interaction sites, recent studies have suggested first finding interacting protein group pairs from large PPI networks and then searching for conserved motifs within the protein groups to form interacting motif pairs. To account for noise and the incompleteness of biological data, we propose using quasi-bicliques to find interacting
protein group pairs. We investigate two new problems that arise from finding interacting protein group pairs: the
maximum vertex quasi-biclique problem and the maximum balanced quasi-biclique problem. We prove that both
problems are NP-hard. This is a surprising result, as the widely known maximum vertex biclique problem is polynomial-time solvable [1]. We then propose a heuristic algorithm that uses a greedy method to find quasi-bicliques in PPI networks. Our experimental results on real data show that this algorithm performs better than a benchmark
algorithm for identifying highly matched BLOCKS and PRINTS motifs. We also report results of two case studies on
interacting motif pairs that map well with two interacting domain pairs in iPfam. Availability: The software and
supplementary information are available at http://www.cs.cityu.edu.hk/~lwang/software/ppi/index.html

10 Model Composition for Macromolecular Regulatory Networks

Models of regulatory networks become more difficult to construct and understand as they grow in size and complexity.
Large models are usually built up from smaller models, representing subsets of reactions within the larger network. To
assist modelers in this composition process, we present a formal approach for model composition, a wizard-style
program for implementing the approach, and suggested language extensions to the Systems Biology Markup
Language to support model composition. To illustrate the features of our approach and how to use the JigCell Composition Wizard, we build up a model of the eukaryotic cell cycle “engine” from smaller pieces.

11 Linear Separability of Gene Expression Data Sets

We study simple geometric properties of gene expression data sets, where samples are taken from two distinct
classes (e.g., two types of cancer). Specifically, the problem of linear separability for pairs of genes is investigated. If a
pair of genes exhibits linear separation with respect to the two classes, then the joint expression level of the two genes
is strongly correlated with the phenomenon of the sample being taken from one class or the other. This may indicate an underlying molecular mechanism relating the two genes and the phenomenon (e.g., a specific cancer). We developed
and implemented novel efficient algorithmic tools for finding all pairs of genes that induce a linear separation of the
two sample classes. These tools are based on computational geometric properties and were applied to 10 publicly
available cancer data sets. For each data set, we computed the number of actual separating pairs and compared it to
an upper bound on the number expected by chance and, empirically, to the numbers obtained by randomly shuffling the labels of the data. Seven out of these 10 data sets are highly separable. Statistically, this phenomenon is highly significant and very unlikely to occur at random. It is therefore reasonable to expect that it manifests a functional
association between separating genes and the underlying phenotypic classes.
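
A sketch of an exact separability test for one gene pair, phrased as a linear-programming feasibility problem (find w, b with w·x + b >= 1 on one class and <= -1 on the other); the toy expression values are invented, and scipy's LP solver stands in for the authors' computational-geometry tools.

```python
# Sketch: test whether one pair of genes linearly separates two sample classes,
# via an LP feasibility problem (w.x + b >= 1 for class +1, <= -1 for class -1).
# scipy's LP solver stands in for the authors' computational-geometry approach;
# the toy expression values below are invented.
import numpy as np
from scipy.optimize import linprog

def pair_is_separable(X2, y):
    """X2: (n_samples, 2) expression of a gene pair; y: labels in {+1, -1}."""
    # Variables: w1, w2, b. Constraint per sample: -y_i * (w.x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X2, np.ones((len(y), 1))])
    b_ub = -np.ones(len(y))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0            # feasible => linearly separable

X2 = np.array([[1.0, 2.0], [1.5, 2.2], [3.0, 0.5], [3.2, 0.8]])
y = np.array([1, 1, -1, -1])
print("separable:", pair_is_separable(X2, y))   # True for this toy pair
```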

12 Integrating Data Clustering and Visualization for the Analysis of 3D Gene Expression Data

The recent development of methods for extracting precise measurements of spatial gene expression patterns from
three-dimensional (3D) image data opens the way for new analyses of the complex gene regulatory networks
controlling animal development. We present an integrated visualization and analysis framework that supports user-
guided data clustering to aid exploration of these new complex data sets. The interplay of data visualization and
clustering-based data classification leads to improved visualization and enables a more detailed analysis than
previously possible. We discuss 1) the integration of data clustering and visualization into one framework, 2) the
application of data clustering to 3D gene expression data, 3) the evaluation of the number of clusters k in the context
of 3D gene expression clustering, and 4) the improvement of overall analysis quality via dedicated postprocessing of
clustering results based on visualization. We discuss the use of this framework to objectively define spatial pattern
boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory
transcription factors.

13 Identification of Regulatory Modules in Time Series Gene Expression Data Using a Linear Time Biclustering Algorithm

Although most biclustering formulations are NP-hard, in time series expression data analysis it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose CCC-Biclustering, an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. Its linear time complexity relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for
ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and,
therefore, redundant biclusters. We report results on synthetic and real data showing the effectiveness of the approach
and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression
patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed
methodology to extract relevant information compatible with documented biological knowledge but also the utility of
using this algorithm in the study of other environmental stresses and of regulatory modules in general.

14 Identification of Full and Partial Class Relevant Genes

Multiclass cancer classification on microarray data has made it feasible to diagnose cancer across all of the common malignancies in parallel. Using multiclass cancer feature selection approaches, it is now possible to identify
genes relevant to a set of cancer types. However, besides identifying the relevant genes for the set of all cancer types,
it would be more informative to biologists if the relevance of each gene to a specific cancer type or subset of cancer types could be revealed or pinpointed. In this paper, we introduce two new definitions of multiclass relevancy features, namely, full class relevant (FCR) and partial class relevant (PCR) features. In particular, FCR genes serve as candidate biomarkers for discriminating all cancer types, whereas PCR genes distinguish subsets of cancer types. Subsequently, a Markov blanket embedded memetic algorithm is proposed for the simultaneous
identification of both FCR and PCR genes. Results obtained on commonly used synthetic and real-world microarray
data sets show that the proposed approach converges to valid FCR and PCR genes that would assist biologists in their
research work. The identification of both FCR and PCR genes is found to generate improvement in classification
accuracy on many microarray data sets. Further comparison with existing state-of-the-art feature selection algorithms also reveals the effectiveness and efficiency of the proposed approach.

15 Heuristic Bayesian Segmentation for Discovery of Co-expressed Genes within Genomic Regions

Segmentation aims to separate homogeneous areas from sequential data and plays a central role in data mining. It has applications ranging from finance to molecular biology, where bioinformatics tasks such as genome data analysis are active application fields. In this paper, we present a novel application of segmentation to locating genomic regions with co-expressed genes. We aim at automated discovery of such regions without requiring user-given parameters. In order to perform the segmentation within a reasonable time, we use heuristics. Most heuristic
segmentation algorithms require some decision on the number of segments. This is usually accomplished by using
asymptotic model selection methods like the Bayesian information criterion. Such methods are based on some
simplifications, which can limit their usage. In this paper, we propose a Bayesian model selection criterion to choose the most appropriate result from heuristic segmentation. Our Bayesian model uses a simple prior over segmentation solutions with various numbers of segments and a modified Dirichlet prior for modeling multinomial data. We show with various
artificial data sets in our benchmark system that our model selection criterion has the best overall performance. The
application of our method to yeast cell-cycle gene expression data reveals potentially active and passive regions of the genome.

16 GPD: A Graph Pattern Diffusion Kernel for Accurate Graph Classification with Applications in Cheminformatics

Graph data mining is an active research area. Graphs are general modeling tools to organize information from
heterogeneous sources and have been applied in many scientific, engineering, and business fields. With the fast
accumulation of graph data, building highly accurate predictive models for graph data emerges as a new challenge
that has not been fully explored in the data mining community. In this paper, we demonstrate a novel technique called the graph pattern diffusion (GPD) kernel. Our idea is to leverage existing frequent pattern discovery methods and to explore the application of kernel classifiers (e.g., support vector machines) to building highly accurate graph classifiers. In our method, we first identify all frequent patterns in a graph database. We then map subgraphs to graphs in the graph database and use a process we call “pattern diffusion” to label nodes in the graphs. Finally, we design a graph alignment algorithm to compute the inner product of two graphs. We have tested our algorithm on a number of chemical structure data sets. The experimental results demonstrate that our method is significantly better than competing methods such as kernel functions based on paths, cycles, and subgraphs.

17 Gene Association Networks from Microarray Data Using a Regularized Estimation of Partial Correlation Based on PLS Regression

Reconstruction of gene-gene interactions from large-scale data such as microarrays is a first step toward better
understanding the mechanisms at work in the cell. Two main issues have to be managed in such a context: 1)
choosing which measures have to be used to distinguish between direct and indirect interactions from high-
dimensional microarray data and 2) constructing networks with a low proportion of false-positive edges. We present an
efficient methodology for the reconstruction of gene interaction networks in a small-sample-size setting. The strength of the dependence between any two genes is measured, in such a high-dimensional network, by a regularized estimation of partial correlation based on Partial Least Squares regression. We also emphasize specific properties of the proposed method. To assess its sensitivity and specificity, we carried out the reconstruction of networks from simulated data. We also tested the PLS-based partial correlation network on static and dynamic real microarray data. An R implementation of the proposed algorithm is available at
http://biodev.extra.cea.fr/plspcnetwork/
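
A rough sketch of how a regularized partial-correlation matrix can be derived from per-gene PLS regressions: each gene is regressed on all the others and the coefficients are symmetrized as sign(b_ij)·sqrt(b_ij·b_ji). scikit-learn's PLSRegression, the simulated data, and the edge threshold are stand-ins; the paper's exact estimator may differ in detail.

```python
# Rough sketch of PLS-based partial correlation: regress each gene on all the
# others with PLS, then symmetrize the coefficients as sign * sqrt(b_ij * b_ji).
# scikit-learn's PLSRegression and the simulated data are stand-ins; the paper's
# exact regularized estimator may differ in detail.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n_samples, n_genes, n_comp = 20, 8, 2            # small-sample-size setting
X = rng.standard_normal((n_samples, n_genes))

B = np.zeros((n_genes, n_genes))                 # B[j, i]: coefficient of gene i
for j in range(n_genes):                         # in the PLS regression of gene j
    others = [i for i in range(n_genes) if i != j]
    pls = PLSRegression(n_components=n_comp).fit(X[:, others], X[:, j])
    B[j, others] = pls.coef_.ravel()

pcor = np.sign(B) * np.sqrt(np.abs(B * B.T))     # symmetrized partial correlations
pcor = np.where(np.sign(B) == np.sign(B.T), pcor, 0.0)   # keep sign-consistent pairs
edges = np.argwhere(np.triu(np.abs(pcor) > 0.3, k=1))    # threshold into network edges
print("candidate edges (gene index pairs):", edges.tolist())
```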

18 Fixed-Parameter Tractability of the Maximum Agreement Supertree Problem

Given a set L of labels and a collection of rooted trees whose leaves are bijectively labeled by some elements of L, the Maximum Agreement Supertree (SMAST) problem is as follows: find a tree T on a largest label set L' ⊆ L that homeomorphically contains every input tree restricted to L'. The problem has phylogenetic applications for inferring supertrees and performing tree congruence analyses. In this paper, we focus on the parameterized complexity of this NP-hard problem, considering different combinations of parameters as well as particular cases. We show that SMAST on k rooted binary trees on a label set of size n can be solved in O((8n)^k) time, which is an improvement with respect to the previously known O(n^(3k^2)) time algorithm. In this case, we also give an O((2k)^(pk) n^2) time algorithm, where p is an upper bound on the number of leaves of L missing in a SMAST solution. This shows that SMAST can be solved efficiently when the input trees are mostly congruent. Then, for the particular case where any triple of leaves is contained in at least one input tree, we give O(4^p n^3) and O(3.12^p + n^4) time algorithms, obtaining the first fixed-parameter tractable algorithms parameterized by a single parameter for this problem. We also obtain intractability results for several
combinations of parameters, thus indicating that it is unlikely that fixed-parameter tractable algorithms can be found in
these particular cases.

19 Feature Selection for Gene Expression Using Model-Based Entropy

Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for
gene expression data aims at finding a set of genes that best discriminates biological samples of different types. In machine learning approaches, traditional gene selection based on empirical mutual information suffers from a data sparseness issue due to the small number of samples. To overcome this issue, we propose a model-based
approach to estimate the entropy of class variables on the model, instead of on the data themselves. Here, we use
multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate
various distributions. Given that the data follow a multivariate normal distribution, since the conditional distribution of
class variables given the selected features is a normal distribution, its entropy can be computed with the log-
determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient. We propose several algorithms to greatly reduce the computational cost. Experiments on seven gene data sets and comparisons with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
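
A worked numpy snippet of the key quantity the abstract relies on: the differential entropy of a multivariate Gaussian, H = (1/2) log((2πe)^d det Σ), computed stably through a log-determinant; the fitted covariance here comes from toy data rather than the paper's gene sets.

```python
# Worked example of the quantity the abstract relies on: the entropy of a
# multivariate Gaussian, H = 0.5 * log((2*pi*e)^d * det(Sigma)), computed
# stably via a log-determinant. The toy data replace real gene expression.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 5))  # 50 samples, 5 "genes"

def gaussian_entropy(sample):
    d = sample.shape[1]
    cov = np.cov(sample, rowvar=False)
    sign, logdet = np.linalg.slogdet(cov)         # numerically stable log|Sigma|
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

print(f"estimated entropy of the 5-gene model: {gaussian_entropy(X):.3f} nats")
```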

20 Fast Hinge Detection Algorithms for Flexible Protein Structures

Analysis of conformational changes is one of the keys to the understanding of protein functions and interactions. For
the analysis, we often compare two protein structures, taking flexible regions like hinge regions into consideration.
The Root Mean Square Deviation (RMSD) is the most popular measure for comparing two protein structures, but it is
only for rigid structures without hinge regions. In this paper, we propose a new measure called RMSD considering hinges (RMSDh) and its variant RMSDh(k) for comparing two flexible proteins with hinge regions. We also propose novel efficient algorithms for computing them, which can detect the hinge positions at the same time. The RMSDh is suitable for cases where there is one small hinge region in each of the two target structures. The new algorithm for computing the RMSDh runs in linear time, which matches the time complexity for computing the RMSD and is faster than any previous algorithm for hinge detection. The RMSDh(k) is designed for comparing structures with more than one hinge region. The RMSDh(k) measure allows at most k small hinge regions, i.e., the RMSDh(k) value should be small if the two structures are similar except for at most k hinge regions. To compute the value, we propose an O(kn^2)-time and O(n)-space algorithm based on a new dynamic programming technique. With the same computational time and space, we can enumerate the predicted hinge positions. We also test our algorithms against
actual flexible protein structures, and show that the hinge positions can be correctly detected by our algorithms.
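
A short sketch of the rigid-body RMSD that RMSDh generalizes, computed with the standard Kabsch superposition (centering plus an SVD-based optimal rotation); the coordinates are toy values and the hinge handling described in the abstract is not implemented here.

```python
# Sketch of the rigid RMSD that RMSDh generalizes: optimal superposition via the
# standard Kabsch algorithm (centering + SVD rotation). Hinge handling from the
# abstract is NOT implemented here; coordinates are toy values.
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))            # avoid improper rotation (reflection)
    R = U @ np.diag([1.0, 1.0, d]) @ Vt           # rotation applied to P (row vectors)
    diff = P @ R - Q
    return np.sqrt((diff ** 2).sum() / len(P))

rng = np.random.default_rng(3)
P = rng.standard_normal((10, 3))
theta = np.deg2rad(30)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
Q = P @ Rz + 0.05 * rng.standard_normal((10, 3))  # rotated + slightly perturbed copy
print(f"RMSD after superposition: {kabsch_rmsd(P, Q):.3f}")
```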

21 Exploratory Consensus of Hierarchical Clustering for Melanoma and Breast Cancer

Finding subtypes of heterogeneous diseases is a major challenge in biology. Often, clustering is used
to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies
between the clusterings produced by different algorithms. This work introduces a simple method which provides the
most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The
method is validated by showing that the Silhouette, Dunn’s, and Davies-Bouldin cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The
hypotheses of the consensus clusters on both the data sets are corroborated by clear genetic markers and 100 percent
classification accuracy. In Bittner et al.’s melanoma data set, a previously hypothesized primary cluster is recognized
as the largest consensus cluster, and a new partition of this cluster into two subclusters is proposed. In van’t Veer et
al.’s breast cancer data set, previously proposed “basal” and “luminal A” subtypes are clearly recognized as the two
predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the
“basal” subtype in this data set. The clusters of van’t Veer’s data set are also validated by the high classification accuracy obtained on the data set of van de Vijver et al.

22 Efficient Peak-Labeling Algorithms for Whole-Sample Mass Spectrometry Proteomics

Whole-sample mass spectrometry (MS) proteomics allows for a parallel measurement of hundreds of proteins present
in a variety of biospecimens. Unfortunately, the association between MS signals and these proteins is not
straightforward. The need to interpret mass spectra demands the development of methods for accurate labeling of ion
species in such profiles. To aid this process, we have developed a new peak-labeling procedure for associating protein
and peptide labels with peaks. This computational method builds upon characteristics of proteins expected to be in the sample, such as the amino acid sequence, molecular weight, and expected concentration within the sample. A new probabilistic
score that incorporates this information is proposed. We evaluate and demonstrate our method’s ability to label peaks
first on simulated MS spectra and then on MS spectra from human serum with a spiked-in calibration mixture.

23 Data-Fusion in Clustering Microarray Data: Balancing Discovery and Interpretability

While clustering genes remains one of the most popular exploratory tools for expression data, it often results in highly
variable and biologically uninformative clusters. This paper explores a data fusion approach to clustering microarray data. Our method, which combines expression data and Gene Ontology (GO)-derived information, is applied to a real data set to perform genome-wide clustering. A set of novel tools is proposed to validate the clustering results and pick a fair value of the infusion coefficient. These tools measure stability, biological relevance, and distance from the expression-only clustering solution. Our results indicate that data fusion clustering leads to more stable, biologically relevant clusters that are still representative of the experimental data.
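
A compact sketch of the fusion idea: an expression-based distance and a GO-derived distance are blended with a coefficient alpha and fed to hierarchical clustering; the two toy distance matrices, the alpha value, and average linkage are assumptions, not the paper's exact formulation.

```python
# Compact sketch of the fusion idea: blend an expression-based distance with a
# GO-derived distance using a coefficient alpha, then cluster the fused matrix.
# The toy matrices, alpha, and average linkage are assumptions for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(4)
expr = rng.standard_normal((6, 10))                  # 6 genes x 10 conditions
d_expr = squareform(pdist(expr, metric="correlation"))

# Pretend GO-derived dissimilarity (e.g., 1 - semantic similarity), symmetric.
d_go = squareform(pdist(rng.random((6, 3))))
d_go /= d_go.max()
d_expr /= d_expr.max()

alpha = 0.4                                          # weight placed on GO information
d_fused = (1 - alpha) * d_expr + alpha * d_go

Z = linkage(squareform(d_fused, checks=False), method="average")
print("cluster labels:", fcluster(Z, t=2, criterion="maxclust"))
```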

24 BioExtract Server: An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data

Many in silico investigations in bioinformatics require access to multiple, distributed data sources and analytic tools.
The requisite data sources may include large public data repositories, community databases, and project databases
for use in domain-specific research. Different data sources frequently utilize distinct query languages and return
results in unique formats, and therefore researchers must either rely upon a small number of primary data sources or
become familiar with multiple query languages and formats. Similarly, the associated analytic tools often require
specific input formats and produce unique outputs, which makes it difficult to utilize the output from one tool as input to
another. The BioExtract Server (http://bioextract.org) is a Web-based data integration application designed to
consolidate, analyze, and serve data from heterogeneous biomolecular databases in the form of a mash-up. The basic
operations of the BioExtract Server allow researchers, via their Web browsers, to specify data sources, flexibly query
data sources, apply analytic tools, download result sets, and store query results for later reuse. As a researcher works
with the system, their “steps” are saved in the background. At any time, these steps can be preserved long-term as a
workflow simply by providing a workflow name and description.

25 Automatic Detection of Large Dense-Core Vesicles in Secretory Cells and Statistical Analysis of Their Intracellular Distribution

Analyzing the morphological appearance and the spatial distribution of large dense-core vesicles (granules) in the cell
cytoplasm is central to the understanding of regulated exocytosis. This paper is concerned with the automatic
detection of granules and the statistical analysis of their spatial locations in different cell groups. We model the
locations of granules of a given cell as a realization of a finite spatial point process and the point patterns associated
with the cell groups as replicated point patterns of different spatial point processes. First, an algorithm to segment the
granules using electron microscopy images is proposed. Second, the relative locations of the granules with respect to
the plasma membrane are characterized by two functional descriptors: the empirical cumulative distribution function
of the distances from the granules to the plasma membrane and the density of granules within a given distance to the
plasma membrane. The descriptors of the different cells for each group are compared using bootstrap procedures. Our
results show that these descriptors and the testing procedure allow discrimination between control and treated cells.
The application of these novel tools to studies of secretion should help in the analysis of diseases associated with
dysfunctional secretion, such as diabetes.
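
A small sketch of the descriptor-and-resampling idea: empirical CDFs of granule-to-membrane distances for two cell groups are compared through a resampled maximum-gap statistic; the distances are simulated and the statistic is a simple stand-in, not necessarily the authors' exact testing procedure.

```python
# Sketch of the descriptor/resampling idea: compare the empirical CDFs of
# granule-to-membrane distances between two cell groups using a resampled
# maximum-gap statistic. Distances are simulated; the statistic is a simple
# stand-in for the authors' exact bootstrap procedure.
import numpy as np

rng = np.random.default_rng(5)
control = rng.gamma(shape=2.0, scale=100.0, size=200)    # toy distances (nm)
treated = rng.gamma(shape=2.0, scale=140.0, size=200)

def ecdf_gap(a, b):
    """Maximum gap between the two empirical CDFs on a common grid."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(Fa - Fb))

observed = ecdf_gap(control, treated)
pooled = np.concatenate([control, treated])
null = []
for _ in range(1000):                  # resample under "no group difference"
    resample = rng.permutation(pooled)
    null.append(ecdf_gap(resample[:200], resample[200:]))
p_value = np.mean(np.array(null) >= observed)
print(f"observed gap={observed:.3f}, resampled p-value={p_value:.3f}")
```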

26 Automated Isolation of Translational Efficiency Bias That Resists the Confounding Effect of GC(AT)-Content

Genomic sequencing projects are an abundant source of information for biological studies ranging from the molecular
to the ecological in scale; however, much of the information present may yet be hidden from casual analysis. One such
information domain, trends in codon usage, can provide a wealth of information about an organism’s genes and their
expression. Degeneracy in the genetic code allows more than one triplet codon to code for the same amino acid, and
usage of these codons is often biased such that one or more of these synonymous codons are preferred. Detection of
this bias is an important tool in the analysis of genomic data, particularly as a predictor of gene expressivity. Methods
for identifying codon usage bias in genomic data that rely solely on genomic sequence data are susceptible to being
confounded by the presence of several factors simultaneously influencing codon selection. Presented here is a new
technique for removing the effects of one of the more common confounding factors, GC(AT)-content, and of
visualizing the search-space for codon usage bias through the use of a solution landscape. This technique
successfully isolates expressivity-related codon usage trends, using only genomic sequence information, where other
techniques fail due to the presence of GC(AT)-content confounding influences.

27 Automated Hierarchical Density Shaving: A Robust Automated Clustering and Visualization Framework for Large Biological Data Sets

A key application of clustering data obtained from sources such as microarrays, protein mass spectroscopy, and
phylogenetic profiles is the detection of functionally related genes. Typically, only a small number of functionally
related genes cluster into one or more groups and the rest need to be ignored. For such situations, we present
Automated Hierarchical Density Shaving (Auto-HDS), a framework that consists of a fast hierarchical density-based
clustering algorithm and an unsupervised model selection strategy. Auto-HDS can automatically select clusters of
different densities, present them in a compact hierarchy, and rank individual clusters using an innovative stability criterion. Our framework also provides a simple yet powerful 2D visualization of the hierarchy of clusters that is useful
for further interactive exploration. We present results on Gasch and Lee microarray data sets to show the effectiveness
of our methods. Additional results on other biological data are included in the supplemental material.

28 Approximation Algorithms for Predicting RNA Secondary Structures with Arbitrary Pseudoknots

We study three closely related problems motivated by the prediction of RNA secondary structures with arbitrary
pseudoknots: the problem 2-Interval Pattern proposed by Vialette [36], the problem Maximum Base Pair Stackings
proposed by Leong et al. [16], and the problem Maximum Stacking Base Pairs proposed by Lyngsø [21]. For the 2-
Interval Pattern, we present polynomial-time approximation algorithms for the problem over the preceding-and-
crossing model and on input with the unitary restriction. For Maximum Base Pair Stackings and Maximum Stacking
Base Pairs, we present polynomial-time approximation algorithms for the two problems on explicit input of candidate
base pairs. We also propose a new problem called Length-Weighted Balanced 2-Interval Pattern, which is natural in the
context of RNA secondary structure prediction.

29 Approximate Maximum Parsimony and Ancestral Maximum Likelihood

We explore the maximum parsimony (MP) and ancestral maximum likelihood (AML) criteria in phylogenetic tree
reconstruction. Both problems are NP-hard, so we seek approximate solutions. We formulate the two problems as
Steiner tree problems under appropriate distances. The gist of our approach is the succinct characterization of Steiner
trees for a small number of leaves for the two distances. This enables the use of known Steiner tree approximation
algorithms. The approach leads to a 16/9 approximation ratio for AML and asymptotically to a 1.55 approximation ratio
for MP.

30 Alignments of RNA Structures

We describe a theoretical unifying framework to express the comparison of RNA structures, which we call alignment
hierarchy. This framework relies on the definition of common supersequences for arc-annotated sequences and
encompasses the main existing models for RNA structure comparison based on trees and arc-annotated sequences
with a variety of edit operations. It also gives rise to edit models that have not been studied yet. We provide a thorough
analysis of the alignment hierarchy, including a new polynomial-time algorithm and an NP-completeness proof. The
polynomial-time algorithm involves biologically relevant edit operations such as pairing or unpairing nucleotides. It
has been implemented in a software tool, called gardenia, which is available at the Web server
http://bioinfo.lifl.fr/RNA/gardenia.

31 A Trade-Off between Sample Complexity and Computational Complexity in Learning Boolean Networks from Time-Series Data

A key problem in molecular biology is to infer regulatory relationships between genes from expression data. This
paper studies a simplified model of such inference problems in which one or more Boolean variables, modeling, for
example, the expression levels of genes, each depend deterministically on a small but unknown subset of a large
number of Boolean input variables. Our model assumes that the expression data comprises a time series, in which
successive samples may be correlated. We provide bounds on the expected amount of data needed to infer the correct
relationships between output and input variables. These bounds improve and generalize previous results for Boolean
network inference and continuous-time switching network inference. Although the computational problem is
intractable in general, we describe a fixed-parameter tractable algorithm that is guaranteed to provide at least a partial
solution to the problem. Most interestingly, both the sample complexity and computational complexity of the problem
depend on the strength of correlations between successive samples in the time series but in opposing ways.
Uncorrelated samples minimize the total number of samples needed while maximizing computational complexity; a
strong correlation between successive samples has the opposite effect. This observation has implications for the
design of experiments for measuring gene expression.
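
A brute-force sketch of the inference task described above: given a Boolean time series, find a smallest subset of input variables on which the next-step output depends deterministically. The exhaustive search over bounded-size subsets mirrors the fixed-parameter flavor of the problem but is not the paper's algorithm, and the toy trajectory is invented.

```python
# Brute-force sketch of the inference task: find a smallest set of Boolean input
# variables on which the next-step output depends deterministically, given a
# time series. This mirrors the fixed-parameter flavor of the problem but is
# NOT the paper's algorithm; the toy trajectory below is invented.
from itertools import combinations

# Columns: x0, x1, x2, x3 at time t; target: value of x0 at time t+1.
# In this toy trajectory, x0(t+1) = XOR(x1(t), x2(t)).
series = [
    (0, 0, 1, 0), (1, 1, 1, 1), (0, 1, 0, 0),
    (1, 0, 0, 1), (0, 1, 1, 0), (0, 0, 1, 1),
]
inputs = series[:-1]
target = [row[0] for row in series[1:]]

def consistent(subset):
    """True if the target is a deterministic function of the chosen inputs."""
    table = {}
    for x, y in zip(inputs, target):
        key = tuple(x[i] for i in subset)
        if table.setdefault(key, y) != y:          # same key, different output
            return False
    return True

for k in range(1, 5):                              # smallest subsets first
    hits = [s for s in combinations(range(4), k) if consistent(s)]
    if hits:
        print(f"minimal consistent input sets (size {k}):", hits)   # -> [(1, 2)]
        break
```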

32 A Multiple-Filter-Multiple-Wrapper Approach to Gene Selection and Microarray Data Classification

Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of
statistical properties of each gene to represent its discriminating power between different classes. The computation is
fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing
classification accuracy, but the computational burden is formidable. Filters and wrappers have been combined in
previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes.
The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy is dependent on
the choice of a specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed
that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification,
and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW
approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding
MFMW model) in all cases and for all six data sets. Some of the MFMW-selected genes have been confirmed by other studies to be biomarkers or to contribute to the development of particular cancers.
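
A condensed sketch of the search space a multiple-filter-multiple-wrapper scheme works over: two filters (F-score and mutual information) each propose a candidate gene subset, and two wrapper classifiers score every filter subset by cross-validated accuracy. The specific filters, wrappers, subset size, and synthetic data are illustrative choices; the paper combines the hybrids rather than merely picking the best pair.

```python
# Condensed sketch of the multiple-filter-multiple-wrapper (MFMW) idea: several
# filters each propose a candidate gene subset, and several wrapper classifiers
# score each subset by cross-validated accuracy. The specific filters, wrappers,
# subset size, and synthetic data are illustrative choices only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=500,
                           n_informative=15, random_state=0)
k = 20                                               # genes kept by each filter

filters = {
    "F-score": f_classif(X, y)[0],
    "mutual_info": mutual_info_classif(X, y, random_state=0),
}
wrappers = {"svm": SVC(kernel="linear"), "knn": KNeighborsClassifier(5)}

results = {}
for fname, scores in filters.items():
    subset = np.argsort(scores)[::-1][:k]            # top-k genes for this filter
    for wname, clf in wrappers.items():
        acc = cross_val_score(clf, X[:, subset], y, cv=5).mean()
        results[(fname, wname)] = acc

best = max(results, key=results.get)
print("best filter/wrapper pair:", best, f"accuracy={results[best]:.3f}")

# Simple flavor of combining filters: genes ranked highly by both of them.
consensus = set(np.argsort(filters["F-score"])[::-1][:k]) & \
            set(np.argsort(filters["mutual_info"])[::-1][:k])
print("genes ranked highly by both filters:", sorted(consensus))
```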

#230, Church Road, Anna Nagar, Madurai 625 020, Tamil Nadu, India
Phone: +91 452-4390702, 4392702, 4390651
Website: www.elysiumtechnologies.com, www.elysiumtechnologies.info
Email: info@elysiumtechnologies.com
