
Hilary Term 04: The Human Genome

20.1 The Human Genome: evolutionary issues (Hein)
27.1 Non-Genic Selection in the Human Genome (Lunter)
3.2 Mammalian Genes I: Conservation and slow evolution (Ponting)
10.2 Mammalian Genes II: Functional innovation and rapid change (Ponting/Goodstadt)
17.2 RNAs in the Human Genome (Sam Griffiths-Jones)
24.2 Population Genetics of the Human Genome (Gil McVean)
2.3 Association Mapping and the Human Genome (Lon Cardon)
9.3 The Human Genome and Human Evolution (Chris Tyler-Smith)

The Human Genome: key issues

The Human Genome Project
Few basic facts of the human genome
Grammar of Genes
Basic events happening to a genome per mitosis/generation
Genealogical Structures: Phylogenies, Pedigrees and the ARG
Long term Dynamics of the Human Genome: The comparative aspect
(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping

History
Our interests.

History of the Human Genome Project


1956 Physical map. 24 types and total set of 46 chromosomes
1977 Sanger publishes dideoxy sequencing method
1980 Botstein proposes human genetic map using RFLPs
1987 US DOE publishes report discussing HGP
1988 HUGO is established
1990 Official start of HGP with $3 billion and a 15 year horizon
1991 Genome Database (GDB) is established
1992 Genethon publishes map based on microsatellites
1995 Lander et al. detailed map based on sequence tagged sites
1998 Comprehensive map based on gene markers
1999 Sanger Centre publishes chromosome 22
2001 Draft Genome published: Celera & Public
2003 Completion (almost) of Human Genome

Strachan and Read, HMG3 p213

Sequencing Strategies
Public effort- strategy:

Celera - strategy:

From Myers 99

Celera's view of the International Consortium (IC):
Unfair competition: IC delivering the same goods but with state funding.

International Consortium's view of Celera:
Unfair competition: Celera delivering the same goods but can use IC data, while IC cannot use Celera data.

Other Genome Projects


1976/79 First viral genome: MS2/φX174
1980 Mitochondrion
1982 First shotgun sequenced genome: Bacteriophage lambda
1995 First prokaryotic genome: H. influenzae
1996 First unicellular eukaryotic genome: Yeast
1998 The first multicellular eukaryotic genome: C. elegans
2000 Drosophila melanogaster
2000 Arabidopsis thaliana
2001 Human Genome
2002 Mouse Genome
The Genome OnLine Database knows of 958 genome sequencing projects, of which 169 are completed

Favourite and Model Organisms


Multicellular Animals
Mammals: Human, Mouse, Cow, Dog, Rat, Chimp, Pig (sizes as listed on the slide: 3.5, 3.2, 3.0, 2.8, 3.1, 3.5, 3.0 Gb)
Birds: Chicken 1.2 Gb
Frog: Xenopus laevis 1.7 Gb
Fish: Puffer Fish 0.4 Gb, Zebra Fish 1.9 Gb
Insects: Drosophila 165 Mb, Honey Bee 270 Mb, Yellow Fever Mosquito 780 Mb, Malaria Mosquito 278 Mb
Nematodes: Caenorhabditis elegans 100 Mb, Caenorhabditis briggsae 80 Mb
Sea Urchin: Strongylocentrotus purpuratus 800 Mb

Multicellular Plants
Arabidopsis thaliana 125 Mb, Rice 430 Mb

Strachan and Read (2004) Chapter 8


The Human Genome I


http://www.sanger.ac.uk/HGP/ & R.Harding & HMG (2004) p 245

[Figure: the nuclear chromosomes 1-22, X and Y, plus the mitochondrial genome (.016). Chromosome sizes as printed: 104 118 107 100 148 143 142 176 163 148 140 197 198 72 88 86 66 45 48 163 51 279 221 251. Gene labels: α-globin, Myoglobin, β-globin (chromosome 11). Zoom levels as printed: 3.2*10^9 bp; *5.000; 6*10^4 bp; *20; 3*10^3 bp (5' flanking, Exon 1, Exon 2, Exon 3, 3' flanking); *10^3; 30 bp.]

DNA:     ATTGCCATGTCGATAATTGGACTATTTGGA
Protein: aa aa aa aa aa aa aa aa aa aa
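
The 30 bp of DNA on this slide correspond to 10 amino acids, one per codon. A minimal sketch of that mapping, assuming the standard genetic code and a reading frame that starts at the first base (the slide does not state the frame); the names `CODON_TABLE` and `translate` are illustrative only:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA codon by codon from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

slide_dna = "ATTGCCATGTCGATAATTGGACTATTTGGA"  # the 30 bp shown on the slide
protein = translate(slide_dna)
print(len(slide_dna), "bp ->", len(protein), "aa:", protein)
```

In a real gene the reading frame is set by the start codon and the exon structure, so the frame used here is only a convenience.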

The Human Genome II


http://www.sanger.ac.uk/HGP/

Nuclear Genome
Highly conserved - coding: 1.5%
Highly conserved - other: 3.5%
Transposon based repeats: 45%
Heterochromatin: 6.6%
Other non-conserved: 44%
Mendelian inheritance; 1 (typically); Recombination 1/130 kb

Mitochondria
93%; 5%; 2%
Maternal inheritance; Possibly thousands; No recombination; 2 kb

Gene Density:

Pseudogenes:

20000

Processed Pseudogenes

Strachan and Read (2004) Chapter 9

The Human Genome III


http://www.sanger.ac.uk/HGP/

Gene families
Clustered: α-globins (7), growth hormone (5), Class I HLA heavy chain (20), ...
Dispersed: Pyruvate dehydrogenase (2), Aldolase (5), PAX (>12), ...
Clustered and Dispersed: HOX (38 genes, 4 clusters), Histones (61 genes, 2 clusters), Olfactory receptors (>900 genes, 25 clusters), ...

Transposons

Strachan and Read (2004) Chapter 9 + Lander et al.(2001)

Genes and Gene Structures I


Presently estimated Gene Number: 24.000 (reference: )
Average Gene Size: 27 kb
The largest gene: Dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe.
The shortest gene: tRNA(Tyr), 100% coding

Largest exon: ApoB exon 26 is 7.6 kb. Smallest: <10 bp
Average exon number: 9
Largest exon number: Titin 363. Smallest: 1
Largest intron: WWOX intron 8 is 800 kb. Smallest: 10s of bp
Largest polypeptide: Titin 38.138. Smallest: tens (small hormones).

Intronless Genes: mitochondrial genes, many RNA genes, Interferons, Histones, ...

Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9

Genes and Gene Structures II


Genes within Genes: Intron 26 of neurofibromatosis type I (NF1) contains 3 internal genes (2 exons each) transcribed in the opposite direction.
Overlapping Genes: Class III region of HLA

Simple Eukaryotic

Strachan and Read (2004) Chapter 9 p 258

Alternative Splicing

1. A challenge to automated annotation.
2. How widespread is it?
3. Is it always functional?
4. How does it evolve?

Cartegni, L. et al. (2002) Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nature Reviews Genetics 3.4.285 + HMG p291-294

RNAs in the Genome

RNA class            Number   Notes
snoRNA               ~200     small nucleolar, over 100 types; RNA modification and processing
snRNA                ~100     small nuclear; involved in splicing
miRNA                ~200     very small (~22 bp); regulation
28S, 5.8S, 5S rRNA   ~175     large cytosolic subunit
18S rRNA             ~175
5S rRNA              ~250
tRNA                 >500     transfer RNA
Antisense RNA        >1500    > 1500 types
Other labels on the figure: small mitochondrial subunit, large mitochondrial subunit

Strachan and Read (2004) p.247 F9.4

Genome Annotation
Proteins

Genomes ESTs

Ensembl http://www.ensembl.org
Santa Cruz Genome Browser http://genome.ucsc.edu/

Gene Finding and Protein (HMM) Descriptors


Burge & Karlin, JMB 96

A. Assign gene characteristics to each nucleotide. Extract a legal prediction by dynamic programming.
B. Use an HMM to describe biological knowledge of gene structure.
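
The two ideas above (an HMM that encodes gene structure, and dynamic programming to extract the best labelling) can be sketched with a toy Viterbi decoder. This is not the Burge & Karlin model: the two states, and every probability in `START`, `TRANS` and `EMIT`, are made-up illustrative values, chosen only so that the "coding" state prefers G and C.

```python
import math

# Toy two-state HMM. All numbers below are hypothetical, for illustration only.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

def viterbi(seq):
    """Return the most probable state label for each base of a non-empty sequence."""
    # score[s] = best log-probability of any state path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per transition
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("ATATATGCGCGCGCGCATATAT"))
```

A realistic gene finder uses many more states (exons in three frames, introns, splice sites) so that only legal gene structures can be decoded, but the dynamic programming step has the same shape.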

Mutations and Mutation Rates


1 mitosis or generation

Average Number of Mitoses
Male generation (age:mitoses): 15:35 .. 20:150
Female generation: ~24

Single nucleotide substitutions: ~10^-7
Microsatellites (~100.000): ~10^-2
Small insertion deletions: ~10^-8

Crow, JF (2000) The Origins, Patterns and Implications of Human Spontaneous Mutation. Nature Reviews Genetics 1.1.40-47 + Strachan and Read (2004) chapter 11 + Jobling, Hurles and Tyler-Smith (2004) chapter 2

Recombination
Recombination:
1 meiosis

Gene Conversion:

Total Haploid length males: 25.9 M - females: 44.6 M.


Gene conversions: 1-2 orders of magnitude higher. Length 300-2000 bp.

Lander et al. (2001) Initial sequencing and analysis of the human genome. Nature 409.860-912. + Kong, A. et al. (2002) A high resolution recombination map of the human genome. Nature Genetics
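
As a rough worked example, the male and female map lengths quoted on this slide can be turned into an average recombination rate per megabase. The sketch assumes that "M" means Morgans, that a simple unweighted mean of the two sexes is acceptable, and that the genome size is the 3.2*10^9 bp given earlier; the function name is illustrative.

```python
def sex_averaged_rate(male_morgans, female_morgans, genome_bp):
    """Return (sex-averaged map length in cM, average rate in cM per Mb)."""
    average_cM = 100.0 * (male_morgans + female_morgans) / 2.0  # 1 Morgan = 100 cM
    return average_cM, average_cM / (genome_bp / 1e6)

# Values from the slides: 25.9 M (males), 44.6 M (females), 3.2*10^9 bp.
average_cM, cM_per_Mb = sex_averaged_rate(25.9, 44.6, 3.2e9)
print(f"sex-averaged length: {average_cM:.0f} cM, rate: {cM_per_Mb:.2f} cM/Mb")
print(f"female/male ratio: {44.6 / 25.9:.2f}")
```

This is a genome-wide average only; local rates vary strongly along each chromosome.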

Selection: Positive & Negative


[Figure panels: "One sequence scenario" (A); "Population scenario" (A A A C C); "One sequence scenario again", with two-codon sequences labelled as printed: ThrSer ACGTCA, ThrPro ACGCCA, ArgSer AGGCCG, ThrSer ACGCCG, ThrSer ACTCTG, AlaSer GCTCTG, AlaSer GCACTG.]

The selection criterion could in principle be anything, but selection against amino acid changes is by far the most important.

Certain events have functional consequences and will be selected out. The strength and localization of this selection are of great interest.

The Genetic Code

Substitutions           Number   Percent
Total in all codons     549      100
Synonymous              134      25
Nonsynonymous           415      75
  Missense              392      71
  Nonsense              23       4
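
The counts in this table can be reproduced by enumerating every single-nucleotide substitution in the 61 sense codons (61 * 9 = 549). A minimal sketch, assuming the standard genetic code; the function name `classify_substitutions` is illustrative:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitutions():
    """Count single-base changes of sense codons by their effect on the protein."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODE.items():
        if aa == "*":
            continue  # stop codons are not counted as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts

counts = classify_substitutions()
total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind:11s} {n:4d} {100 * n / total:5.1f}%")
```

Each possible change is given equal weight here; real mutation spectra (for example the excess of transitions) would shift the percentages.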

Examples of rates (remade from Li, 1997)

Organism          Gene            Syno/Year     Non-Syno/Year
RNA Virus
Influenza A       Hemagglutinin   13.1*10^-3    3.6*10^-3
Hepatitis C       E               6.9*10^-3     0.3*10^-3
HIV 1             gag             2.8*10^-3     1.7*10^-3
DNA virus
Hepatitis B       P               4.6*10^-5     1.5*10^-5
Herpes Simplex    Genome          3.5*10^-8
Nuclear Genes
Mammals           c-mos           5.2*10^-9     0.9*10^-9
Mammals           α-globin        3.9*10^-9     0.6*10^-9
Mammals           histone 3       6.2*10^-9     0.0

Genealogical Structures
Homology:
The existence of a common ancestor (for instance for 2 sequences)

ccagtcg

cagtct

ccggtcg

Phylogeny
Only finding common ancestors. Only one ancestor.

Pedigree:

Ancestral Recombination Graph (the ARG)


i. Finding common ancestors.

ii. A sequence encounters Recombinations


iii. A point ARG is a phylogeny

Populations
Grand parents

Parents

Now

Genealogical approach to Population Variation Analysis

Africa

Non-Africa

Inter.SNP Consortium (2001): A map of human genome sequence variation containing 1.42 million SNPs. Nature 409.928-33

Pedigrees
Chinese
http://demography.anu.edu.au/People/Staff/zhongwei.html

Burke's British Peerage


http://www.burkes-peerage.net/sites/wars/sitepages/home.asp

Quebec French
Heyer and Tremblay, 1998 PNAS

Mormons
http://genealogy-mormons.com/

Icelandic
http://www.decode.com + Helgason, A. et al. (2003 June) A population-wide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: Evidence for a faster evolutionary rate of mtDNA lineages than Y-chromosomes American Journal Human Genetics.

Total Pedigree

[Figure (Helgason): Icelandic genealogies traced from the ancestor cohort (born 1848-1892) to the contemporary descendant cohort (born after 1972, up to 2002).
Matrilines: ancestral N = 31,817; 77.9% / 22.1%; 8.3% / 91.7%; g = 4.3; descendant N = 64,150.
Patrilines: ancestral N = 31,659; 73.9% / 26.1%; 13.8% / 86.2%; g = 3.8; descendant N = 66,910.]

Genealogical Questions
Pedigrees
Time back to first individual common ancestor to everyone

ARG questions:

The height of ARGs - correlation between local phylogenies

Gene Phylogeny Questions Total Branch Length - Height

Long Term Evolutionary History: Myr/Gyr


Origin of Life
Last Universal Common Ancestor (LUCA)
First Eukaryotes
First Chordates
First Vertebrates
First Mammals
First Primates
First Hominoids
Chimp-Human Split

Hedges, SB (2002) The Origin and Evolution of Model Organisms. Nature Reviews Genetics 3.11.838-848.

Brown (2003) Horizontal Genetic Transfers Nature Genetics

The Comparative Aspect.


MRCA-Most Recent Common Ancestor

3 Problems:

i. Test all possible relationships.
ii. Examine unknown internal states.
iii. Explore unknown paths between states at nodes.

Time Direction

ATTGCGTATATAT.CAG

ATTGCGTATATAT.CAG

ATTGCGTATATAT.CAG

observable

observable

observable

One Principle of Comparative Genomics


Observable
Unobservable

Protein Structure

Goldman, Thorne & Jones, 96

P(Sequence | Structure) P(Structure)


C
RNA Structure

C A

A C A U G U

P(Structure | Sequence) P(Sequence)

Gene Structure

Observable

Unobservable

Molecular Evolution and Gene Finding: Two HMMs

AGTGGTACCATTTAATGCG.....
AGTGGTACTATTTAGTGCG.....

P_coding{ATG --> GTG} or P_non-coding{ATG --> GTG}
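
Before a substitution can be scored under a coding or a non-coding model, the differences between the two aligned sequences have to be located. A minimal sketch of that first step on the two sequences shown on this slide (trailing dots dropped); the function name `substitutions` is illustrative, and no substitution probabilities are assumed here:

```python
def substitutions(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) for each mismatch in an ungapped alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

seq_1 = "AGTGGTACCATTTAATGCG"
seq_2 = "AGTGGTACTATTTAGTGCG"

for pos, a, b in substitutions(seq_1, seq_2):
    print(f"position {pos}: {a} -> {b}")

# The triplet starting at position 14 is the ATG --> GTG change named on the slide.
print(seq_1[14:17], "-->", seq_2[14:17])
```

A comparative gene finder would then ask whether each observed change is more probable under the coding or the non-coding evolutionary model, combining that ratio with the HMM's structural constraints.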

Simple Prokaryotic

Simple Eukaryotic

The Rise of Comparative Genomics

Lander et al(2001) Figure 25A

The Domain of Comparative Genomics


ACTGT
Cabbage
Renin 1 2 3 4 5

ACTCCT
HIV proteinase

8 2

Turnip
Gene Order/Orientation.

Sequences

RNA (Secondary) Structure

Protein Structure

General Theme.

Formal Model of Structure

Stochastic Model of Structure Evolution.

Interaction Networks Gene Structure


Any Graph.

Linkage Mapping
D r M

From McVean

Association/Fine scale mapping

Dominant/Recessive.

2Ne generations

Penetrance Spurious Occurrence Heterogeneity

A set of characters. Binary decision (0,1). Quantitative Character.

genotype

Genotype Phenotype

phenotype

BRCA2 example
1000 cases and 1000 controls typed at 8 microsatellite markers

Single marker association

Bayesian analysis

Causative SNPs.

Rafnar et al.(2004) Morris et al(2001) +

Short Term Evolutionary History: Kyr/Myr


Oldest Polymorphisms
Neutral Human Autosomal Polymorphisms
First Out-of-Africa
Anatomically Modern Man
Peopling of the Globe: genetic and fossil evidence.
Supposedly well behaved populations: Iceland, Finland, Sardinia

The globe & migrations:

Cavalli-Sforza,2001 + HEG (2004)

The International HapMap Project Nature 426, 789 - 796 (18 Dec 2003) http://www.hapmap.org/

Started October 27-29, 2002

HapMap


Ontologies
A Structured Vocabulary Consistent across species.

Purpose:
Facilitate communication among researchers
Facilitate communication among computer systems

2001: Three Ontologies: Molecular Function, Biological Process, Cellular Component

Source: NAR (2004) 32.D258-

http://www.geneontology.org

Gene Ontology Consortium (2001) Creating the Gene Ontology Resource: Design and Implementation. Genome Research 11.1425-33
Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research 32.D258-61.

Structural Genomics: Systematic Structure Determination


Examples:
Center for Eukaryotic Structural Genomics
Structural Genomics of Pathogenic Protozoa Consortium
Berkeley Structural Genomics Center: Mycoplasma genitalium and Mycoplasma pneumoniae

PDB Holdings List: 10-Feb-2004

                                   Exp. Tech.
Molecule Type                      X-ray Diffraction and other   NMR     Total
Proteins, Peptides, and Viruses    19014                         2934    21948
Protein/Nucleic Acid Complexes     898                           96      994
Nucleic Acids                      719                           569     1288
Carbohydrates                      14                            4       18
Total                              20645                         3603    24248

http://www.strgen.org/

http://www.nysgrc.org/

http://www.oppf.ox.ac.uk/

http://pdb.ccdc.cam.ac.uk/pdb/strucgen.html

John Westbrook, Zukang Feng, Li Chen, Huanwang Yang and Helen M. Berman. The Protein Data Bank and structural genomics. Nucleic Acids Research, 2003, Vol. 31, No. 1 489-491

Structural Genomics: Mycoplasma pneumoniae proteins

http://www.strgen.org/status/mpoverview.html

Proteomics
2D PAGE gels (polyacrylamide gel electrophoresis)

MALDI

Source: Hanash (2003)

Protein Micro-arrays

Source Gavin et al.(2002)

http://www.hupo.org
Hanash, S. (2003) Disease Proteomics. Nature 422.226-
Aebersold, R. and M. Mann (2003) Mass spectrometry-based proteomics. Nature 422.198-
Gavin et al. (2002) Functional Organisation of the Yeast Proteome by systematic analysis of protein complexes. Nature 415.141-

Summary
The Genome

Genomes: Variation and long term evolution.

Genealogical Structures: Phylogenies, Pedigrees and the ARG

Long term Dynamics of the Human Genome: The comparative aspect

(Genotype Phenotype) & (Population Genetics/History) => Gene Mapping

Our Genomically Motivated Projects


1. Comparative gene annotation (Meyer, Skou Pedersen)
2. Superimposed selective constraints (Forsberg, Meyer, Skou Pedersen) *
3. Haplotype Blocks (Song) *
4. Genome transformations (Miklos)
5. Ancestral Blocks *
6. Statistical Sequence Comparison (Drummond, Lunter, Miklos)
7. Substitutions and insertion-deletions at the Genome Level (Lunter): next week

Minimal ARGs and Haplotype Blocks (Song)
a: (3,4) b: (3,4) c: (15,16) d: (16,17) e: (35,36) f: (35,36) g: (36,37)

Combining Levels of Selection.


Forsberg, Meyer, Pedersen

Assume multiplicativity: f_{A,B} = f_A * f_B


Protein-Protein: Hein & Støvlbæk, 1995
Codon Nucleotide Independence Heuristic

Jensen & Pedersen, 2001


Contagious Dependence

Protein-RNA Singlet Doublets

Contagious Dependence

Applications to Human Genome


Parameters used Chromosome 1: 4Ne 20.000 Segments

(Wiuf and Hein,97)

Chromos. 1: 263 Mb. 52.000 Ancestors

263 cM 6.800

All chromosomes Ancestors 86.000 Physical Population. 1.3-5.0 Mill.

A randomly picked ancestor:


0 0 0 6890 *250

(ancestral material comes in batteries!)


260 Mb 52.000 *35 7.5 Mb 8360

30kb

References: Books & www-pages.


Books:
Strachan and Read (2004) Human Molecular Genetics (3rd Ed.). Bioscience
Jobling, Hurles and Tyler-Smith (2004) Human Evolutionary Genetics. Bioscience
Sulston, J. (2002) The Common Thread. Corgi Books
Ridley, Matt (2001) Genome
Encyclopedia of the Human Genome (2003) Nature Publishing Group
Cavalli-Sforza, L. (2001) Genes, Peoples and Languages. Penguin

Key articles:
Lander et al. (2001) Initial Sequencing and Analysis of the Human Genome. Nature 409.860-912
Venter et al. (2001) The Sequence of the Human Genome. Science 291.1304-1351

References: www-pages.
Major sequencing centers:
Baylor College of Medicine Genome Sequencing Center: hgsc.bcm.tcm.edu/
Celera: www.celera.com
DoE Joint Genome Institute: www.jgi.doe.gov
Genoscope: www.genoscope.cns.fr
TIGR: www.tigr.org
Washington University Genome Sequencing Center: www.genome.wustl.edu
Wellcome Trust Sanger Institute: www.sanger.ac.uk
Whitehead Institute/MIT Center for Genome Research: www.-genome.wi.mit.edu

Ensembl genome annotator: www.ensembl.org
European Bioinformatics Institute: www.ebi.ac.uk
NCBI: www.ncbi.nlm.nih.gov
Nature Genome Gateway: http://www.nature.com/genomics/human/
Integrated Genomics: http://wit.integratedgenomics.com/GOLD/
Ebi genome databases: http://www2.ebi.ac.uk/genomes/
Primate Sequencing Projects: http://sayer.lab.nig.jp/~silver/index.html
European Bioinformatics Institute Proteomics: http://www.ebi.ac.uk/proteome/
National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/
HapMap Project Homepage: http://www.hapmap.org/
Online Inheritance in Man: http://www.ncbi.nlm.nih.gov/omim/
