Professional Documents
Culture Documents
Commercial applications
16.1 Introduction
This book was intended, designed and written as an unbiased technical text, inde-
pendent of any particular software package or vendor. The methods discussed are
very general and they can be implemented in many ways. The main goal of the
book was to describe problems, algorithms, techniques and approaches as opposed
to software packages. However, most readers would not want to re-invent the wheel
and will probably use a commercial package. It is important to make a connection
between the abstract algorithms and techniques discussed and their implementation
in the mainstream software currently available. The aim of this chapter is to showcase
some illustrative commercial packages.
The commercial packages can be roughly divided into several categories as follows:
• Enterprise level packages. These are typically very expensive software suites
that are designed to cover many if not all the processes of microarray design,
production, hybridization, data analysis and data storage. Examples include
Rosetta Resolver (Rosetta Biosoftware), Gene Director (BioDiscovery), Gene
Traffic (Iobion), Silicon Genetics Enterprise Solution (Silicon Genetics), Array
Scout (Lion Bioscience AG), etc. These packages are usually very large, very
often include a relational database, and have to accomplish a large number of
very heterogeneous tasks. As such, they go well beyond the scope of this book
which is focused on data analysis.
• “Point and click” analysis tools. These tend to be rather sophisticated prod-
ucts, highly tailored and customized for the specific use in the analysis of
microarray data. In most cases, such packages will include several tools de-
signed to help with the analysis of the biological aspects of the data (e.g. chro-
mosome and pathways viewers, analysis of biological function, etc.). Exam-
ples might include GeneSight (BioDiscovery), GeneSpring (Silicon Genetics),
Array Analyzer (Insightful), dChip (Harvard Univ.), ArrayStat (Imaging Re-
search), AMIADA (Hong Kong University), Expression Profiler (EBI), TIGR
Array Viewer and Multiple Experiment Viewer (TIGR), XCluster (Stanford
Univ.), etc. In many cases, such packages are included as the data analysis
engine in the enterprise solutions offered by their respective companies (e.g.
this was not a problem. We specify which column contains the gene IDs (the IMAGE
Clone ID column), that contains annotation information (column 3) which contain
the data (columns 4 through 25). Fig. 16.1 shows the Data set builder window in
which this information is specified.
such a way that the loss of information due to the dimensionality reduction is mini-
mal. As discussed in Chap. 10, PCA can be used to achieve this optimal projection.
The result of PCA using all genes is shown in Fig. 16.13. As before, the points are
color-coded according to the cancer category. Clearly the three types of cancer are
intermingled when all genes are used.
Fig. 16.14 shows the results after sub-selecting just the gene list from the origi-
nal paper [135]. Clearly the categories are now cleanly separated. Finally, PCA
is performed after sub-selecting the list of 48 genes from the significance analysis.
Fig. 16.15 shows the samples projected in the space of the first two principal compo-
nents. Once more, the delineation of the cancer types is clear. The two-dimensional
scatter plot is quite similar to that generated using the list of genes from the original
paper.
Again, if microarray expression measurements for an unclassified patient were intro-
duced, adding the additional point to the PCA scatter plot would probably allow us
to classify that patient into one of the existing categories.
16.2.4 Conclusion
Significance analysis is a powerful tool for selecting genetic markers for cancer
types. The clustering and PCA algorithms and visualization techniques shown here
are powerful tools allowing the researcher to quickly and intuitively examine the
results of said marker identification. The comparison between GeneSight’s signif-
FIGURE 16.15: Principal component analysis using gene list from significance
analysis.
S-PLUS R
includes out-of-the-box functionality to address most of the many data
management and analysis issues that arise in the analysis of microarray data. Some of
these statistical issues include: experimental design, pre-processing e.g. normaliza-
tion, differential expression testing, clustering and prediction, and annotation. In ad-
dition, Insightful, the makers of S-PLUS, offers the S+ArrayAnalyzerTM solution for
microarray data analysis. S+ArrayAnalyzer is available as a desktop analysis work-
bench and an enterprise client-server and web-based application. S+ArrayAnalyzer
provides an end-to-end solution for microarray data analysis, as well as a toolkit
and development environment for extending and customizing microarray analysis
implementations. S+ArrayAnalyzer includes S-PLUS libraries and scripts, including
libraries based on packages from the Bioconductor project, www.bioconductor.org;
web scripts; data connections and S-PLUS GraphletsTM for interactive annotation.
S-PLUS is the premier software solution for exploratory data analysis and statistical
modeling. S-PLUS offers point-and-click capabilities to import various data formats,
select statistical functions and display results. S-PLUS includes more than 4,200
data analysis functions, including the most comprehensive set of robust and modern
methods. In addition to out-of-the-box functionality, users may modify existing
methods or develop new ones using the S programming language [61].
Graphics are a strong point of S-PLUS. Graphics are highly interactive and the user
has control over every detail of all graphs. TrellisTM graphics, unique to S-PLUS, can
display many variables informatively. Further, all S-PLUS graphs can be displayed as
Graphlets, lightweight applets that may be simply created with java and XML-based
graphics classes using the java.graph graphics device that is new to S-PLUS 6. In ad-
dition to providing interactive graphs in a web browser, Graphlets enable connection
to external databases such as NCBI Genbank[32, 33], Onto-Express [90, 185] and
Onto-Compare [91] databases. Such connectivity facilitates incorporation of anno-
tation information into graphical and tabular summaries, captured as Graphlets, via
database querying on the URL.
Signal extraction is a little more involved, but there are segmentation algorithms
(e.g. seeded region growing) that are available in S-PLUS. Probe-level analysis of
oligonucleotide chip data, for example as described by [165, 194], is also available
in S+ArrayAnalyzer.
(1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4) (3,1) (3,2) (3,3) (3,4) (4,1) (4,2) (4,3) (4,4)
PrintTip
(1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4) (3,1) (3,2) (3,3) (3,4) (4,1) (4,2) (4,3) (4,4)
PrintTip
FIGURE 16.18: Box plots of the differences M = log(R/G) before and after
normalization.
There are a number of S-PLUS functions available for differential expression testing.
These include two-sample t-test, Welch t-test, Wilcoxon test, F-test, paired t-test,
blocked F tests and local pooled error tests. These 2-color cDNA data are naturally
blocked via the chips (blocks of size 2) so that the paired t-test and blocked F test
are well suited to the design. Note however, that there are only 4 replicates, so
methods that leverage groups of genes such as the local pooled error approach [192]
may be useful in this case. Whatever the significance test chosen for assessing
differential expression, adjustment of the p-values obtained to take account of the
multiple (many genes) testing procedures is needed. Procedures in S+ArrayAnalyzer
for such adjustment include the Bonferroni, [145, 149] and Sidak procedures for strong
control of the family-wise Type I error rate (FWER), and procedures for (strong)
control of the false discovery rate (FDR) [30, 31].
One useful plot offered in S+ArrayAnalyzer for summarizing differentially expressed
genes is the volcano plot. This is a plot of the p-value (adjusted or unadjusted)
versus the fold change. Typically a horizontal line at a p-value corresponding to
control of the FWER is included on the plot as well as vertical lines indicating 2x
or 3x fold-change. Most graphical summaries in S+ArrayAnalyzer are presented as
Graphlets. In addition to providing interactive graphs in a web browser, Graphlets
enable connection to external databases such as NCBI Genbank, Onto-Express [90,
185] and Onto-Compare [91] databases. In the volcano plot shown in Fig. 16.19, the
user has clicked on a particular gene and chosen to obtain annotation information for
this gene from the Unigene database on Genbank.
16.3.7 Summary
Microarray technology is complex, and experiments using microarrays are resource-
intensive. As such, there is an urgent need for rigorous, statistical design and analysis
of microarray experiments. Some of the important statistical issues in microarray
experiments include:
5. Annotation
S-PLUS R
includes out-of-the-box functionality to address most of these many data
management and analysis issues that arise in the analysis of microarray data. In
addition, the S+ArrayAnalyzerTM solution for microarray data analysis provides an
end-to-end solution for microarray data analysis, as well as a toolkit and development
environment for extending and customizing microarray analysis implementations.
S+ArrayAnalyzer includes S-PLUS libraries and scripts, including libraries based on
packages from the Bioconductor project, www.bioconductor.org; web scripts; data
connections and S-PLUS GraphletsTM for interactive annotation. S+ArrayAnalyzer
is available as a desktop analysis workbench and an enterprise client-server and web-
based application.
S-PLUS code snippets for some of the methods described above are available from
the Insightful web site at www.insightful.com/ArrayAnalyzer. This site also includes
information regarding S+ArrayAnalyzer solutions and updates regarding analysis of
microarray data using S-PLUS and S+ArrayAnalyzer.
SAS has long been recognized as a premiere provider of data management and sta-
tistical analysis software. Dozens of various generic SAS software packages have
bearing in the life sciences, and SAS has also recently begun pursuing genomics-
specific offerings. For sake of space, we will just overview these new components
here.
16.4.1.1 Datamarts
The concept behind datamarts involves extracting selected data and reorganizing that
data in a form well suited to specific analysis and reporting task. As with larger
data warehouses, a key aspect of datamarts is metadata, or data about the primary
data. Examples of metadata include a data file name, location, storage format, and
structure. A datamart can capture metadata throughout the analysis process: where
the data came from, what transformations were performed, and information about
the context of the data. Managing the metadata and controlling the transformation
processes are the key benefits of datamarts.
Upon submission, an Analytical Process sends its macro code via SAS/Integration
Technologies to a preactivated SAS server. The code can execute any SAS routine
you wish, and typically includes calls to SAS data steps and procedures. You can also
use it to create a JMP Scripting Language (JSL) file. SAS MAS then executes this
file, producing predefined graphical and analytical displays in JMP (Fig. 16.22). JMP
is a dynamic interactive module that offers loads of features for scientific discovery.
The preceding examples represent a small fraction of the capabilities SAS offers
you for analyzing microarray and related genomics data. RDM and MAS provide a
framework that enables you to tap into the full power of the SAS System, including
the following capabilities:
• The SAS Data Step, SAS Macro Language, and JMP Scripting Language are
very flexible and extensive environments for manipulating and processing ge-
nomics data.
• Proc Optex, Proc Factex, Proc Plan, the ADX Menu System, and the DOE
features in JMP help you to create optimal designs.
• Over half of the 50+ procedures in SAS/Stat are applicable to array data. These
include procedures for analysis of variance, clustering, density estimation, dis-
criminant analysis, multidimensional scaling, multiple comparisons, nonpara-
metric statistics, partial least squares, power and sample size, principal compo-
nents and smoothing. New residual diagnostics will be available in Release 9.1
of Proc Mixed, enabling better quality control of your array data. Procedures
from SAS/QC, SAS/OR, SAS/ETS, and SAS/IML (including IML Workshop)
can also be useful. Scores of examples from other scientific disciplines are in
the SAS Sample Library and at www.sas.com/techsup/download/stat.
• You can use RDM and MAS to preprocess and create analysis-ready data sets
for SAS Enterprise Miner. Its incredibly powerful collection of data mining
methodologies and intuitive process-flow interface enable you to efficiently
generate a wide range of cross-validated predictions. Furthermore, you can
use SAS Text Miner to process collections of scientific abstracts you create
from important gene lists.
Details about these and other SAS offerings are available at www.sas.com/sds.
16.5.1 Introduction
Spotfire’s DecisionSite for Functional Genomics is a platform for gene expression
analysis encompassing each of the categories of the commercial applications previ-
ously mentioned in this chapter, including enterprise level package (extensive database
access and development platform), “Point and click” analysis (guided workflows),
and Analysis/Visualization environment (dynamic data analysis and visualizations).
This platform combines state-of-the-art data access, gene expression analysis meth-
ods, guided workflows, dynamic visualizations, and extensive development tools.
Tools and Guides are components of this platform. Tools are the analytic access
components added to the interface, while Guides connect Tools together to initiate
suggested analysis paths, allowing the user to deviate as the results determine.
With rapidly emerging microarray analysis methods, scientists are faced with the
challenge of incorporating new information into their analysis quickly and easily.
Spotfire DecisionSite Developer offers the ability to add new microarray data sources,
normalization methods, and algorithms within a guided workflow that can be rapidly
deployed to all users. The platform can be updated and extended to incorporate the
expertise of the user and the rapid advances in analysis methodology, and is therefore
appropriate and useful for a broad range of expertise from novice to expert.
point, and then add the previous time points to the analysis using the Add Columns
from File Tool (Fig. 16.24). Data can also be added from Spotfire files, the clipboard,
or database links, provided a common identifier exists between the data sets.
were determined to be present in every sample. The raw, summarized, and normalized
data are compared by using multiple visualizations (Fig. 16.28). Three profile charts
were created using the original cardiac samples, averaged sample data, and Z-score
normalized data at the five development time points. For the selected genes we
observe the overall profile trend for summarized samples is reflected in averaged
profiles which may have been influenced by variations within replicates. The Z-score
normalization has standardized the profiles for more accurate comparisons.
-1 to +1. By taking the absolute value of the correlation similarity column, a new
column and the corresponding query device can be filtered in one direction to select
genes that are both most similar (up-regulated) and most dissimilar (down-regulated)
to our edited profile. The profiles of the raw, averaged and Z-score normalized data
are all updated to reflect our query device setting. Up- and down-regulated genes are
easily identified in the Z-score normalized data set (Fig. 16.30).
branches of the tree represent the clusters and can be explored within the visualization.
Initialization parameters include clustering methods and similarity measures. 2-D
hierarchical clustering can be used to identify clusters within samples as well as
genes. This is useful when investigating the relationship between samples such as
tumor types or cell lines. Clustering can also be performed on keywords using a
guided work flow to aggregate text information, perform clustering, and split out
genes by common annotations.
Both clustering methods, as with all computation dialogs within DecisionSite, allow
the user to specify the samples to include in analysis and to work only on selected
records (those that have not been filtered out using query devices). Thus, data that
are determined to be out of detection range, do not satisfy certain statistical criteria,
or are not the focus of a specific question can be removed from the analysis.
Principal Components Analysis (PCA) can be used when a large number of genes
are analyzed to reduce dimensionality of the data while preserving variability. This
allows the user to confirm categorization of genes across diverse samples, such as
tumor types, in fewer dimensions. Results of PCA are typically plotted in a 3-D
scatter plot where a “cloud” of genes can be observed in the new PCA space and
categories of genes or experiments can be revealed as tightly clustered markers (see
Chap. 10 for a full discussion of PCA).
In our example data set, a Guide was used to group genes using hierarchical clustering.
In the Guide, visualizations are automatically generated, annotations added, and
number of clusters selected and added to the data set (Fig. 16.31). Genes were
grouped using all the sample values, Pearson’s correlation similarity measure and
complete linkage clustering method (see Chap. 11 for a full discussion of various
distances and linkages as well as criteria to select them). The Guide was used to add
annotations from text files downloaded from Affymetrix NetAffx website. These text
files contain annotations and matching Probe Set ID information for the corresponding
Affymetrix Murine GeneChip U74v2 used in the study. The results of hierarchical
clustering display roughly three large blocks of up-regulated genes (shown in red)
specific for the early embryonic stage, neonatal through 1 week, and 4 weeks through
1 year in the heat map visualization in Figure 16.31. These groups are also members of
the first three nodes of the dendrogram and can be categorically grouped by pruning
three clusters from the dendrogram using the Guide or a level selector within the
dendrogram window. The dendrogram branches can also be explored by double-
clicking on nodes to zoom in on clusters of leaves of interest in the dendrogram.
The Gene Ontology (GO)biological processes displayed in the corresponding table
view associated with down-regulated genes reflect processes that may decrease from
embryonic stages through adulthood such as intracellular signaling cascade, fatty acid
synthesis, regulation of cell growth, and cell growth.
The Portfolio Tool in DecisionSite for Functional Genomics stores gene lists. Gene
lists were created from the cardiac development data set by selecting 50 genes that
were observed to be increasing and decreasing in the Z-score normalized average
sample profile chart. These Portfolio lists were then used to make selections in
the visualizations, confirm groupings and identify possible subsets within the hier-
archical clustering results (Fig. 16.32). Using the similarity rank query device and
observing the status bar for the number of genes selected, 50 of the most similar
and dissimilar genes were selected and added to the Portfolio list. Portfolio lists are
persistent and can be exported, imported and shared. These lists can be added to the
data set as categories and can be applied to analysis by filtering or plotting axes to
quickly isolate the top 50 up- or down-regulated cardiac development genes in new
data sets and future experiments (Fig. 16.33).
16.5.10 Summary
Spotfire’s DecisionSite for Functional Genomics is a powerful platform for gene ex-
pression analysis encompassing each of the categories of the commercial applications
mentioned in this chapter. This is demonstrated in the analysis of a cardiac data set,
from data access and use of guided workflows deployed from a server, to isolation
of genes that are involved in cardiac development and gender-specific resistance to
cardiac disease using dynamic visualizations and analysis tools.
FIGURE 16.37: Identifying gender-specific genes. Genes are isolated using lists
of significant genes created using distinction calculation on male and female groups.
Reporting tools are used to export text file reports from the table view and visualiza-
tions are sent directly to MS Word, PowerPoint or displayed as a web page.