Guide to Human Genome Computing
Ebook, 557 pages, 5 hours

About this ebook

The Guide to Human Genome Computing is invaluable to scientists who wish to make use of the powerful computing tools now available to assist them in the field of human genome analysis. This book clearly explains how to access and use sequence databases, and presents the various computer packages used to analyze DNA sequences, perform linkage analysis, compare and align DNA sequences from different genes or organisms, and infer structural and functional information about proteins from sequence data.

This Second Edition contains completely updated material. Rather than a revision of the previous volume, the Second Edition is essentially a new book, based on the subjects which will be of interest over the coming years. This new book is international, both in scope and authorship.

Computing resources for the following are clearly explained:

  • Internet resources - databases etc.
  • Genetic analysis
  • Sib-pair studies
  • Comparative mapping
  • Radiation hybrids
  • Sequence ready clone maps
  • Human genome sequencing
  • ESTs
  • Gene prediction
  • Gene expression
Language: English
Release date: Mar 25, 1998
ISBN: 9780080532707

    Book preview

    Guide to Human Genome Computing - Martin J. Bishop

    Preface

    The first stage of the Human Genome Project, sequencing of representative DNA to the extent that the only gaps are repetitive sequences or regions which present major technical problems in cloning or sequencing, should be completed by the middle of the next decade. The methodology to do this is in place and is described in this book in the context of bioinformatics. The work is in progress at a number of centres worldwide. The expected major technical advances necessary to speed the process have so far not been forthcoming. It is business as usual in factory scale operations.

    Genome projects involve a systematic approach to the identification of all the genes of an organism. Comparative approaches involve yeast, nematode, fly, fish (fugu and zebra), frog, chicken and mouse (and others). Access to the physical DNA via overlapping clone maps has proved remarkably difficult to provide in man. Radiation hybrid mapping and BAC libraries have eased the situation.

    A high resolution genetic linkage map has been constructed for the mouse. CEPH families and sib-pairs are the major tools for genetic linkage studies in man but the resolution is lower than for mouse. The mouse is therefore crucial in relating phenotypes to genes.

    cDNA (EST) sequencing and mapping on radiation hybrid DNA promise to raise the number of sequenced, mapped and named human genes from 6000 to 50 000 in a short time. A further 50 000 genes encoding transcripts which are expressed in low amounts or are highly localized in time (development) or space (specialized cell types) are expected to be discovered by genomic sequencing. Expression profiles of cells can be studied by massively parallel methodologies. Pathways involving metabolic and developmental genes are being elucidated and need to be modeled. Understanding the function of 10 000 proteins will take longer but commercial rewards are available today.

    Bioinformatics is the key to the organization and retrieval of data on genome organization and macromolecular function. Databases, query and browsing tools and analysis tools are evolving world wide to meet the challenge. Many of these are highly graphical. The spectacular rise of the World Wide Web to be the major access method has occurred since the First Edition was published and we provide a wide ranging and clear introduction (Chapter 1). Many tasks are still too time-consuming and labour intensive. The aim for future development is to reduce human intervention in analysis and reporting without loss of validity of results. Reliability of predictions will be enhanced as knowledge of sequence, structure and function advances.

    Our subject has very major implications for mankind in terms of agriculture, biotechnology, pharmaceuticals and general healthcare. The ethical issues which are emerging will require very sensitive and intelligent treatment.

    MJ Bishop

    1

    Introduction to Human Genome Computing Via the World Wide Web

    Lincoln D. Stein, Whitehead Institute, MIT Center for Genome Research, 9 Cambridge Center, Cambridge, MA 02142-1497, USA. E-mail address: lstein@genome.wi.mit.edu

    1 INTRODUCTION

    Since the first edition of this book was published, the genome community has seen an explosion in the number and variety of resources available over the Internet. In large part this explosion is due to the invention of the World Wide Web, a system of document linking and integration that has gone from obscurity to commonplace in a mere five years.

    The genome community was an early adopter of the Web, finding in it a way to publish its vast accumulation of data, and to express the rich interconnectedness of biologic information. The Web is the home of primary data, of genome maps, of expression data, of DNA and protein sequences, of X-ray crystallographic structures, and of the genome project’s huge outpouring of publications. These data, spread out among thousands of individual laboratories and community databases, are hot-linked throughout. Researchers who wish to learn more about a particular gene can (with a bit of patience) move from physical map to clone, to sequence, to disease linkage, to literature references and back again, all without leaving the comfort of their Web browser application.

    However, the Web is much more than a static repository of information. The Web is increasingly being used as a front end for sophisticated analytic software. Sequence similarity search engines, protein structural motif finders, exon identifiers, and even mapping programs have all been integrated into the Web. Java applets are adding rapidly to Web browsers’ capabilities, enabling pages to be far more interactive than the original click–fetch–click interface. It may soon be possible for biologists to do all their computational work with no more than a browser on their desktop computers.

    This chapter is an illustrated tour of the World Wide Web from the genome biologist’s perspective. It does not pretend to be a technical discussion of Web protocols or to explain how things work. Nor does it attempt to be an exhaustive listing of the myriad Web resources available. Instead I have attempted to show the range of resources available and to give guidance on how to learn more. Other chapters in this book delve more deeply into selected topics of the Web and genome.

    URLs (Uniform Resource Locators) are necessary for interactive Web browsing but terribly ugly when they appear on the printed page. I have gathered all the URLs and placed them in a table at the end of this chapter. Within the body of the text I refer to them by descriptive names such as ‘Pedro’s Home Page’ rather than by their less friendly addresses.

    2 EQUIPMENT FOR THE TOUR

    2.1 Web Browser

    The Web was designed to run with any browser software, on any combination of hardware and operating system. However, the pace of change has outstripped many software developers. Although some Web sites can still be viewed with older browsers (such as the venerable National Center for Supercomputing Applications (NCSA) Mosaic or the Windows Cello browser), many require advanced features found only in recent browsers from the Microsoft and Netscape companies. For most effective genome browsing, I recommend one of the following browsers:

    1. Netscape Navigator, 3.02 or higher

    2. Netscape Communicator, 4.01 or higher

    3. Microsoft Internet Explorer, 3.01 or higher

    These browsers can be downloaded free of charge from the Netscape and Microsoft home pages. They are also available from most computer stores and mail order outfits.

    Although it is good to have a recent version of the browser software installed, it may not be such a good idea to use the most recent, as these versions often contain bugs that cause frustrating crashes. Be particularly wary of ‘pre-release’, ‘preview’, and ‘beta’ browser versions.

    2.2 Internet Connection

    A direct connection to the Internet is a necessity for Web browsing. Nearly all academic centers, government laboratories and private companies have a fast Internet connection of at least 56K bps (56 000 bits per second). This will be more than adequate for Web browsing purposes.

    Home users usually dial into the Internet via an Internet Service Provider using a PPP (point-to-point protocol) or SLIP (serial line Internet protocol) connection. For such users a modem of 28.8K bps or better is strongly recommended.
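    To put the two line speeds quoted above in perspective, the following minimal sketch estimates idealized transfer times. The 100 KB page size is a hypothetical figure chosen for illustration, and the calculation ignores protocol overhead, modem compression and line noise, so real transfers will differ.

```python
def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealized transfer time: 8 bits per byte, no protocol overhead."""
    return size_bytes * 8 / bits_per_second

# Hypothetical page size (text plus images); not a figure from the chapter.
page_bytes = 100_000

for label, bps in [("28.8K bps modem", 28_800), ("56K bps line", 56_000)]:
    print(f"{label}: {transfer_seconds(page_bytes, bps):.1f} s")
```

    Under these assumptions the same page takes roughly twice as long over the 28.8K bps modem as over the 56K bps line.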

    3 GENOME DATABASES

    3.1 The WWW Virtual Library: Genetics

    Although we could start our tour anywhere, a good place to begin is the genetics division of the WWW Virtual Library, a distributed topic-oriented collection of Web resources (Figure 1.1). This page contains links to several hundred sites around the world, organized by organism.

    Figure 1.1 The WWW Virtual Library, a good jumping-off point for genome resources on the Web.

    The list of organisms in the left-hand frame provides a quick way to jump to the relevant section. Click on the link labeled ‘Human’ to see sites under this heading. Subheadings direct you to a variety of US and international sites, as well as to chromosome-specific Web pages and search services.

    3.2 Entrez

    We select the link for GenBank, taking us to the home page of the National Center for Biotechnology Information (Figure 1.2). NCBI administers GenBank, the main repository for all published nucleotide sequencing information. Links from its home page will take you to GenBank, as well as to SwissProt, the protein sequence repository, OMIM, the Online Mendelian Inheritance in Man collection of genetic disorders, MMDB, a database of crystallographic structures, and several other important resources.

    Figure 1.2 The NCBI home page provides access to the huge GenBank sequence database.

    While there are several ways to access GenBank and the other databases, the most useful interface is the Entrez search engine, an integrated Web front end to many of the databases that NCBI supports. To access Entrez, we click on the labeled button in the navigation bar at the top of the window.

    This takes us to the Entrez welcome page shown in Figure 1.3. The links on this page point to Entrez’s six main divisions:

    Figure 1.3 The Entrez search engine provides access to GenBank’s bibliographic, nucleotide, protein, structural and genome divisions.

    1. PubMed Division. This is an interface to the MedLine bibliographic citation service. Some nine million citations of papers in the biologic and biomedical literature are available going back as far as 1966. Most citations are accompanied by full abstracts.

    2. Nucleotide Database. This is the GenBank collection of nucleotide sequences, now merged with the EMBL database.

    3. Protein Database. This database combines primary protein sequencing data from SwissProt and other protein database collections, together with protein sequences derived from translated GenBank entries.

    4. 3D Structures Database. This contains protein three-dimensional (3D) structural information derived from X-ray crystallography and nuclear magnetic resonance (NMR). The source of the structures is the MMDB (Molecular Modelling Database) maintained at Brookhaven National Laboratories.

    5. Genomes Database. This is a compilation of genetic and physical maps from a variety of species. Maps of similar regions are integrated to allow for comparisons among them.

    6. Taxonomy. This is the phylogenetic taxonomy used throughout GenBank. Its primary purpose is as a consultation guide to obscure species.

    A search on the nucleotide division will illustrate how Entrez works. For this example, we’ll say that we’re interested in information on the ‘sushi’ family of repeats found in many serine proteases. Selecting the link labeled ‘Search the NCBI nucleotide database’ displays a page similar to the one shown in Figure 1.4. There is a single large text field in which to type keyword search terms, as well as a pop-up menu that allows us to limit the search to certain database fields. Available fields depend on which database we are searching. In the case of the nucleotide database, there are nearly two dozen fields covering everything from the name of the author who submitted the sequence entry to the sequence length. In this case we accept the default, which is ‘All Fields’. We type ‘sushi’ into the text field and press the ‘Search’ button.

    Figure 1.4 Searching the nucleotide database for entries that refer to ‘sushi’.
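    The same keyword search can also be expressed as a URL rather than as a filled-in form. The sketch below only builds such a URL and does not contact any server. It assumes NCBI's present-day E-utilities `esearch` endpoint, which postdates this chapter and is not the form interface described here; the function name `build_search_url` and the `retmax` default are illustrative choices.

```python
from urllib.parse import urlencode

# Assumption: NCBI's current E-utilities search endpoint (not part of the
# 1998 Web interface described in the text).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_search_url(term, db="nucleotide", field=None, retmax=20):
    """Build a keyword-search URL, optionally limited to one database field."""
    # A field restriction is written as term[Field]; with no field the
    # search covers all fields, like the 'All Fields' default in the text.
    query = f"{term}[{field}]" if field else term
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

print(build_search_url("sushi"))
print(build_search_url("sushi", field="Title"))
```

    Leaving `field` unset mirrors accepting the default in the pop-up menu; passing a field name corresponds to limiting the search to that field.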

    Searches rarely take longer than a few seconds to complete. The page that appears now (Figure 1.5) indicates that eight entries matched our search. This is a small enough number that we can display the entire list. In cases where too many matches are found, Entrez allows us to add new search terms, progressively narrowing down the search until the number of hits is manageable. We press the button labeled ‘Retrieve 8 documents’.

    Figure 1.5 The ‘sushi’ search finds eight documents. We can either view them or refine the search further.
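    The idea of progressively narrowing a search by adding terms can be sketched with a toy in-memory index. This is an illustration of the principle only, not how Entrez is implemented; the entry identifiers and descriptions are made up.

```python
def search(index, terms):
    """Return IDs of entries whose text contains every term (case-insensitive AND)."""
    wanted = [t.lower() for t in terms]
    return sorted(
        entry_id
        for entry_id, text in index.items()
        if all(t in text.lower() for t in wanted)
    )

# Hypothetical entries, not real GenBank records.
index = {
    "X1": "Human sushi-repeat-containing protein precursor mRNA",
    "X2": "Mouse sushi domain protein, partial cds",
    "X3": "Human serine protease mRNA",
}

hits = search(index, ["sushi"])               # broad search
narrowed = search(index, ["sushi", "human"])  # one more term, fewer hits
print(len(hits), hits)
print(len(narrowed), narrowed)
```

    Each added term can only keep or shrink the hit list, which is why adding terms is a reliable way to bring the number of matches down to a manageable size.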

    A page listing a series of GenBank entries that match the search now appears (Figure 1.6). This is a complex page with multiple options. Each entry is associated with a checkbox to its left. You may select all or a subset of the entries on the list and click the ‘Display’ button at the top of the page. This will generate a summary page that reports on each of the selected entries. The pop-up menu at the top of the page allows you to choose the format of the report. Choices include the standard GenBank format, a list of bibliographic references for the selected entries, the list of protein ‘links’, and the list of nucleotide ‘neighbors’ (more on ‘links’ and ‘neighbors’ later).

    Figure 1.6 Entrez presents search results as a list of hotlinks to GenBank entries.

    You may also retrieve information about a single entry. Following the description there is a series of hypertext links, each linking to a page that gives more information about the entry. Depending on the entry, certain links may or may not be present. A brief description of these links is as follows:

    1. GenBank report. This shows the raw GenBank entry in the form that most biologists are familiar with.

    2. Sequence report. This is the GenBank entry in a friendlier text format.

    3. FASTA report. This is just the nucleotide sequence in the format accepted by the FASTA similarity searching program.

    4. ASN.1 format. A structured format used by the NCBI databases (and almost no one else).

    5. Graphical view. For sequences derived from cosmids, bacterial artificial chromosomes (BACs) and other contigs, this shows a graphic representation of the sequencing strategy.

    6. NN genome links. If the entry corresponds to a sequence that has been placed on one or more physical or genetic maps, this link appears. Selecting it will jump to the Entrez Genomes division (see later). The number of maps that the entry appears in will replace the ‘NN’.

    7. NN Medline links. If a published paper refers to the entry, this link appears. Selecting it will jump to the list of paper(s) in the Entrez bibliographic division.

    8. NN structural links. As above, but for 3D structures (see later).

    9. NN protein links. This corresponds to protein sequences related to the entry. If the entry contains an open reading frame (real or predicted) there will be at least one protein link.

    10. NN nucleotide neighbors. Each nucleotide entry added to GenBank is routinely BLASTed against all previous entries (see later) to create precalculated sets of ‘neighbors’ that share sequence similarity. If a nucleotide entry has any sequence-similarity neighbors, this link will appear.
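    Of the report formats listed above, FASTA is the simplest: a header line beginning with ‘>’ followed by the sequence broken into fixed-width lines. A minimal sketch of a formatter follows. The identifier, description and sequence are placeholders rather than real GenBank data, and the 60-character line width is a common convention, not a requirement stated in the text.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Format a sequence as FASTA text: a '>' header plus wrapped sequence lines."""
    header = f">{identifier} {description}".rstrip()
    # Drop any whitespace inside the sequence and normalize to upper case.
    seq = "".join(sequence.split()).upper()
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"

# Placeholder record: 80 bases of a repeated motif, not a real entry.
record = to_fasta("EXAMPLE1", "hypothetical sequence", "acgt" * 20)
print(record, end="")
```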

    To continue our example, we decide to investigate GenBank entry U78093, described as a Human ‘sushi-repeat-containing protein precursor’. Selecting ‘1 MEDLINE links’ takes us to a page that lists the one paper that refers to this entry (not shown), and prompts us to select its citation format. Selecting the default format displays the citation shown in Figure 1.7. This article indicates that the gene in question is deleted in some patients with retinitis pigmentosa, and offers us links (in the form of buttons) to related articles, other relevant DNA and protein sequences, and to entries in OMIM that deal with retinitis pigmentosa.

    Figure 1.7 The GenBank entry for accession number U78093.

    Returning to our original list of sushi sequences, we can now select the link labeled ‘9 nucleotide neighbors’. This takes us to a list of all the sequence entries in GenBank that have significant BLAST homologies to U78093. Here we find several EST (expressed sequence tags) entries produced by the Washington University/Merck cDNA sequencing project. It is possible that some of them represent previously undescribed members of the sushi family of serine proteases.
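    The neighbor lists appear instantly because they are computed ahead of time rather than at the moment of the query. The sketch below illustrates that precalculation step only. It substitutes a crude shared-k-mer count for BLAST, so it is a stand-in for the idea and not the method NCBI uses; the sequences, the k-mer length and the threshold are all arbitrary assumptions.

```python
from itertools import combinations

def kmers(seq, k=8):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def precompute_neighbors(entries, k=8, min_shared=3):
    """Map each entry ID to the IDs sharing at least min_shared k-mers with it."""
    profiles = {entry_id: kmers(seq, k) for entry_id, seq in entries.items()}
    neighbors = {entry_id: set() for entry_id in entries}
    for a, b in combinations(entries, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors

# Hypothetical sequences: A and B overlap heavily, C is unrelated.
entries = {
    "A": "ACGTACGGTTCAGGCTAAGC",
    "B": "TTACGTACGGTTCAGGCTAA",
    "C": "GGGGGGGGCCCCCCCCAAAA",
}
neighbors = precompute_neighbors(entries)
print(neighbors)
```

    Once such a table exists, following a ‘neighbors’ link is a simple lookup; the expensive all-against-all comparison has already been paid for when the entries were added.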

    The user interface for other Entrez divisions provides a similar search–link–follow interface. The exception is the genomes division, which, because it has fewer entries than the others, is entered through a straightforward listing of prominent organisms and the genome maps available for them. From the Entrez welcome page, we select ‘Search the NCBI genomes database’ and then ‘Homo sapiens’ from the list of prominent organisms (not shown). This leads us to a list of 26 maps (22 autosomes, one sex chromosome and three mitochondrial maps) from which we select human chromosome 14. This leads us to a page (Figure 1.8) that displays a single prominent image in the center. The image shows a series of genetic and physical maps published from a variety of sources, roughly aligned, with diagonal lines connecting common features. The image is ‘live’, meaning that we can click on it to magnify areas or to view information about individual maps. When the magnification is large enough to see individually mapped objects (sequences, genetic loci and sequence tagged sites (STSs)), clicking on them will take us to a page showing the object’s GenBank record, where we can learn more about it in the manner described above.

    Figure 1.8 The genomes division of Entrez has a graphic interface based on alignments among multiple maps.

    If you are interested in a known physical or genetic region and wish to view it directly, the genomes division interface allows you to type in the names of the two mapped loci that define the region. The map will be expanded and scrolled to the proper area. You can then examine the map for interesting candidate genes near the region of interest.
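    Selecting a region by naming its two flanking loci amounts to a range query on an ordered map. A minimal sketch follows, using made-up marker names and positions in arbitrary map units; it is not a description of how the genomes division is implemented.

```python
def region_between(markers, locus_a, locus_b):
    """Return marker names lying between two named loci (inclusive), in map order."""
    positions = {name: pos for name, pos in markers}
    lo, hi = sorted((positions[locus_a], positions[locus_b]))
    ordered = sorted(markers, key=lambda m: m[1])
    return [name for name, pos in ordered if lo <= pos <= hi]

# Hypothetical markers with positions in arbitrary map units.
markers = [("M1", 5.0), ("M2", 12.5), ("M3", 20.1), ("M4", 33.0), ("M5", 41.7)]

# The two loci may be given in either order.
print(region_between(markers, "M4", "M2"))
```

    Everything returned between the two flanking loci is a candidate for closer inspection, which is the step the text describes as examining the map for interesting genes near the region.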

    Entrez’s 3D structures division contains entries for several thousand proteins and other macromolecules whose structures have been determined by X-ray crystallography and/or NMR. The entries are fully linked to related entries in the nucleotide, protein, citation and genome divisions.

    To get the most out of the 3D structures, you will need to install a ‘helper application’ to view and explore the MMDB structure files. Entrez supports two different helpers, RasMol and Kinemage. Both are available in versions that run on Macintosh, Windows and Unix systems. You will need to obtain and install one of these software packages, then configure your browser to launch it automatically to view a structure file. Full instructions can be found at Entrez’s MMDB FAQ (frequently-asked questions) page.

    The search interface to the 3D structures division is nearly identical to that used for nucleotide and protein sequences. You enter one or more keywords into a text field and press the ‘Search’ button, optionally limiting the scope of the search to a particular field. However, the retrieved entries will contain two links that we have not seen before, Structure Summary and NN structure neighbors. The first link retrieves a page that describes the entry’s structure in a standardized format. The second link indicates the presence of one or more entries that are structurally ‘similar’ to the entry. ‘Similarity’, in the case of 3D structures, is determined by an algorithm that measures the volume of overlap between the two molecules.
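    To give a feel for what a ‘volume of overlap’ measure might look like, the sketch below counts the grid cells occupied by both of two sets of atom coordinates. It is a simplified illustration under strong assumptions: the two structures are taken to be already superimposed, atoms are treated as points, and the coordinates are invented. It is not the algorithm Entrez uses.

```python
def occupied_voxels(atoms, voxel=1.0):
    """Set of grid cells (integer triples) containing at least one atom."""
    return {
        (int(x // voxel), int(y // voxel), int(z // voxel))
        for x, y, z in atoms
    }

def overlap_volume(atoms_a, atoms_b, voxel=1.0):
    """Volume of the grid cells occupied by both structures."""
    shared = occupied_voxels(atoms_a, voxel) & occupied_voxels(atoms_b, voxel)
    return len(shared) * voxel ** 3

# Hypothetical coordinates (arbitrary units), assumed already superimposed.
structure_a = [(0.2, 0.2, 0.2), (1.5, 0.1, 0.3), (2.7, 0.4, 0.9)]
structure_b = [(0.9, 0.8, 0.1), (1.1, 0.9, 0.9), (5.0, 5.0, 5.0)]

print(overlap_volume(structure_a, structure_b))
```

    A larger shared volume relative to the size of each molecule would indicate greater structural similarity under this toy measure; a real method must also find the superposition that maximizes the overlap.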

    Searching for the term ‘sushi’ in this case was ineffective, but searching for ‘serine protease’ was more productive, recovering 136 entries with structural information. Selecting the ‘Structure Summary’ link for any of the matching entries retrieves a page that gives information on the structural determination method and its citation. A series of pop-up menus and push buttons allows you to retrieve the 3D structure in a variety of formats. Selecting ‘RasMol’ format (assuming that the RasMol viewer is installed) and pressing the ‘View’ button launches the helper application (Figure 1.9). You are now free to rotate the image with the mouse, magnify it, adjust various display options, and save the structure to local disk for further exploration.

    Figure 1.9 Entrez’s structural division uses external viewers to display and rotate 3D protein models.

    3.3 ‘Gene Map’ and UniGene Databases

    No tour of the NCBI’s Web site is complete without a side trip to the ‘Gene Map of the Human Genome’, a compendium of approximately 16 000 expressed sequences from the UniGene set that have been localized by radiation hybrid mapping (see Chapter 6). These maps were published in late 1996 by a consortium of research groups. Although the maps are already somewhat out of date, it is expected that these pages will be updated at regular intervals.

    From the NCBI home page, select the link labeled ‘Gene Map of the Human Genome’. This leads you to a pastel page that offers a series of ideograms of human chromosomes. There are several ways to search this database. If the region you are interested in is defined cytogenetically, just click on the ideogram in the desired region. A page like that shown in Figure 1.10 will appear showing a list of all mapped expressed sequences in the area. Selecting the GenBank accession numbers of the retrieved sequences will bring up pages with further information about the sequences and how they were mapped. If the region of interest is defined by markers on the Genethon genetic map, you can search for all expressed sequences located between any pair of
