Major findings
- 20,000 to 25,000 protein-coding genes (IHGSC, 2004)
- Roughly the same number of genes as much simpler organisms such as Arabidopsis thaliana (26,000 genes) and puffer fish (21,000 genes)
- The human proteome is far more complex than the set of proteins encoded by invertebrate genomes
- About 40 genes may have undergone horizontal transfer from bacteria (Salzberg et al., 2001)
Objectives
- High-throughput sequencing of chromatin regulatory elements, including transcription factor binding sites, using chromatin immunoprecipitation followed by high-throughput DNA sequencing
- Comprehensively identifying active functional elements in human chromatin (in part using DNase I hypersensitivity assays)
- Characterizing the human transcriptome
- Developing a reference gene set for protein-coding genes, non-coding genes, and pseudogenes
Evidence viewer
- Displays evidence supporting the proposed structure of a gene
- Highlights possible discrepancies in the nucleotide sequence, exon-intron boundaries, or other aspects of an annotated gene
Ensembl
- A comprehensive resource for information about the human genome as well as many other genomes (Flicek et al., 2008)
- Effectively interconnects a wide range of genomics tools, with a focus on annotation of known and newly predicted genes
Ensembl (contd.)
- Contig view allows you to search across an entire chromosome
- Gene view includes the transcript DNA sequence and information on exon-intron boundaries (splice sites)
- Anchor view allows you to select two features from a chromosome as anchor points and to display the intervening region
- Disease view links to disease entries in OMIM
- Map view shows an ideogram of each chromosome, including the known genes, GC content, and SNPs
- Cyto view displays genes, BAC end clones, repetitive elements, and the tiling path across genomic DNA regions
University of California at Santa Cruz Human Genome Browser
- The Golden Path is the human genome sequence annotated at UCSC
- Along with the Ensembl and NCBI sites, the human genome browser at UCSC is one of the three main web-based sources of information for both the human genome and other genomes
Sequencing Strategy
- How much of the genome has been sequenced and assembled? One measure is the average length of a clone or a contig
- The N50 length is the largest length L such that 50% of all nucleotides are contained in contigs or scaffolds of at least size L
- Half of all nucleotides were present in a fingerprint clone contig of at least 8.4 megabases (2001)
- The N50 length rose to 38.5 megabases with the most recent freeze of the genome assembly (2008)
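The N50 definition can be made concrete with a small calculation. The sketch below is illustrative only: the function name n50 and the contig lengths are made up for the example and do not come from any real assembly.

```python
def n50(lengths):
    """Return the largest L such that contigs of length >= L
    together contain at least half of all nucleotides."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (arbitrary units), for illustration only.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 70: the 80 and 70 contigs hold 150 of 290 bases
```

Note that N50 is a weighted statistic: a few very long contigs can raise it sharply even when most contigs are short, which is why it is reported rather than the plain average length.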
IHGSC, 2001
- A histogram of the overall GC content (in 20 kb windows) shows a broad profile with skewing to the right
- Fifty-eight percent of the GC content bins are below the average, while 42% are above the average, including a long tail of highly GC-rich regions
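A windowed GC profile like the one described can be sketched in a few lines. This is a minimal illustration, assuming non-overlapping windows and a plain DNA string as input; the function names are hypothetical and the toy sequence is far shorter than a real 20 kb window.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq.
    Returns None if the window holds no unambiguous bases (for example, all N)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_by_window(seq, window=20_000):
    """GC fraction for consecutive non-overlapping windows of the given size.
    The final window may be shorter than the others."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy example with a window of 4 bases instead of 20 kb.
print(gc_by_window("GGCCAATT", window=4))
```

Binning the values returned by gc_by_window and counting how many fall below or above the genome-wide mean would give the kind of skewed histogram summarized above.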
CpG Islands
- The dinucleotide CpG is greatly underrepresented in genomic DNA, occurring at about one-fifth its expected frequency
- Most CpG dinucleotides are methylated on the cytosine and are subsequently deaminated to thymine bases
- However, the genome contains many CpG islands, which are typically associated with the promoter and exonic regions of housekeeping genes (Gardiner-Garden and Frommer, 1987)
- CpG islands have roles in processes such as gene silencing, genomic imprinting (Tycko and Morison, 2002), and X chromosome inactivation (Avner and Heard, 2001)
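The "expected frequency" of CpG can be illustrated with the observed/expected ratio, computed as (number of CpG x sequence length) / (number of C x number of G). The sketch below is illustrative only: the function names are hypothetical, and the default thresholds (at least 200 bp, GC content of at least 50%, observed/expected of at least 0.6) are the commonly cited criteria attributed to Gardiner-Garden and Frommer, exposed here as adjustable parameters rather than a definitive island caller.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Check one candidate region against the assumed thresholds."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc >= min_gc and cpg_obs_exp(seq) >= min_obs_exp


print(cpg_obs_exp("CCGG"))                # one CpG among 2 C and 2 G in 4 bases
print(looks_like_cpg_island("CG" * 100))  # an extreme, artificial CpG-rich region
```

A ratio near 0.2 across bulk genomic DNA corresponds to the one-fifth figure above, while CpG islands stand out as regions where the ratio stays high.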
Transposon-Derived Repeats
- Incredibly, 45% of the human genome or more consists of repeats derived from transposons
- Also called interspersed repeats
- Transposon-derived repeats can be classified into four categories
- The centromeres contain large amounts of interchromosomal duplicated segments, with almost 90% of a 1.5 Mb region containing these repeats
- Smaller regions of these repeats also occur near the telomeres
In the extreme case, the human dystrophin gene extends over 2.4 Mb, the size of an entire genome of a typical prokaryote!
- Use of cDNAs continues to provide an essential approach to gene identification
- There are many pseudogenes that may be difficult to distinguish from functional protein-coding genes
- The nature of non-coding genes is poorly understood
- The average coding sequence is 1,344 bp
- Internal exons are about 50 to 200 bp
- The size of introns is far more variable
- Protein-coding genes are associated with a high GC content
- Gene density increases 10-fold as GC content rises from 30% to 50%
Non-coding RNAs
Classes of genes that do not encode proteins. Noncoding RNAs can be difficult to identify in genomic DNA because they:
- lack open reading frames
- may be small, and not polyadenylated
- are difficult to detect by gene-finding algorithms, and are often not present in cDNA libraries
- 74% of the proteins were significantly related to other known proteins
- The number of protein-coding genes in humans is comparable to the number of genes in other metazoans and plants, and only five-fold greater than the number in unicellular fungi
- The human proteome may nonetheless be far more complex (compared to other metazoans and plants):
- relatively more domains and protein families
- relatively more paralogs, potentially yielding more functional diversity
- relatively more multi-domain proteins having multiple functions
- domain architectures tend to be more complex
- alternative RNA splicing may be more extensive
Human Chromosomes
Genetic Variation
- SNPs represent a fundamental form of variation in the human population (copy number variation being the other)
- The International HapMap Project brought the number of characterized SNPs to 3.1 million (International HapMap Consortium, 2007)
- Most SNPs are biallelic
- SNPs are spaced on average about 875 base pairs apart across the genome
- SNPs have varying extents of linkage disequilibrium (LD)
Why study SNPs?
- SNP microarray analyses are used for genome-wide studies of disease association
- SNPs reveal patterns of variation, such as shared ancestry, in human populations
- SNP analyses can reveal regions of the genome under strong positive selection
- Another use of SNPs is to identify chromosomal deletions, duplications, inversions and other abnormalities
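Linkage disequilibrium between two biallelic SNPs is commonly summarized by r^2 = D^2 / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. The sketch below is a minimal illustration, assuming phased haplotypes with alleles coded as 0 or 1; the function name and the toy haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs,
    one per phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Alleles always travel together: complete LD.
print(ld_r_squared([(0, 0), (0, 0), (1, 1), (1, 1)]))
# All four haplotypes equally frequent: no LD.
print(ld_r_squared([(0, 0), (0, 1), (1, 0), (1, 1)]))
```

High r^2 between neighboring SNPs is what allows a genotyped SNP on a microarray to stand in for nearby variants that were not assayed directly.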
Outstanding problems
- How can we accurately determine the number of protein-coding genes?
- How can we determine the number of noncoding genes?
- How can we determine the function of genes and proteins?
- What is the evolutionary history of our species?
- What is the degree of heterogeneity between individuals at the nucleotide level?
Private Venture
Celera Genomics: DNA from five different individuals was used for sequencing.
The lead scientist of Celera Genomics at that time, Craig Venter, later acknowledged (in a public letter to the journal Science) that his DNA was one of 21 samples in the pool, five of which were selected for use.
On September 4, 2007, a team led by Craig Venter published his complete DNA sequence, unveiling the six-billion-nucleotide genome of a single individual for the first time.
Benefits? ELSI?
- The UCEs exhibit almost no natural variation in the human population
- Widely distributed in the genome
- On all chromosomes except chromosomes 21 and Y
- Often found in clusters
- About 1.2% of the human genome appears to code for protein
- As much as 5% is more conserved than expected from neutral evolution and hence may be under negative or purifying selection
Specific non-coding segments in the human genome appear to be under selection, using a conservation threshold of 70% or 80% identity with mouse over more than 100 bp
Evolutionary issues
Evolutionary importance
Perfect conservation of these long stretches of DNA
- UCEs appear to have experienced strong negative selection for 300-400 million years
- The probability of finding UCEs by chance (under neutral evolution) has been estimated at less than 10^-22 in 2.9 billion bases
Ultra-Conserved Elements
Along with more than 5,000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals
Most often located either overlapping exons in genes involved in RNA processing, or in introns of or near genes involved in regulation of transcription and development.
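The idea of "absolutely conserved" stretches can be sketched as a scan over a multiple alignment for runs of columns where every species carries the same base. This is a simplified illustration, not the published procedure: it assumes pre-aligned sequences of equal length with "-" marking gaps, and the function name, the min_len parameter, and the toy sequences are all hypothetical.

```python
def conserved_runs(aligned, min_len=100):
    """Return (start, end) half-open column intervals where all aligned
    sequences are identical and ungapped for at least min_len columns."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    # Iterate one past the end so a run reaching the last column is closed.
    for i in range(length + 1):
        identical = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if identical:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment of three species; a single mismatch at column 4 splits the run.
human = "ACGTACGTAA"
mouse = "ACGTTCGTAA"
rat   = "ACGTACGTAA"
print(conserved_runs([human, mouse, rat], min_len=4))  # [(0, 4), (5, 10)]
```

Raising min_len shows why such elements are rare: a single substitution in any one species breaks the run, so long perfect matches across several species are very unlikely under neutral evolution.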
256 show no evidence of transcription from any matching EST or mRNA from any species (non-exonic)
Type II genes: 255 known genes; DNA binding and regulation of expression
For the remaining 114, the evidence for transcription is inconclusive (possibly exonic)
Distribution
- A hundred non-exonic elements are located in introns of known genes and the rest are intergenic
- The non-exonic elements tend to congregate in clusters near transcription factors and developmental genes
- The exonic and possibly exonic elements are more randomly distributed along the chromosomes
Functions?
- 493 ultra-conserved elements have been identified in the human genome
- A small number of those which are transcribed have been connected with human carcinomas and leukemias
- For example, TUC338 is strongly upregulated in human hepatocellular carcinoma cells
- A study comparing ultra-conserved elements between humans and Takifugu rubripes proposed an importance in vertebrate development
- Several ultra-conserved elements are located near transcriptional regulators or developmental genes
- Other functions include enhancing and splicing regulation?
Importance of UCEs
- Evolutionary importance: under negative selection
- Possible biological role?
- The longest elements (779 bp, 770 bp, and 731 bp) all lie in the last three introns in the 3' portion of POLA, the DNA polymerase alpha catalytic subunit on chromosome X, along with other shorter UCEs
- A similar-sized conserved region of 711 bp, formed by concatenation of uc.468 and uc.469 (separated by a single base), lies in the ~7 kb intergenic region between the 3' end of POLA and its downstream neighbor, the ARX gene
- ARX is involved in CNS development and is associated with a host of X-linked Mendelian diseases, including epilepsy, mental retardation, autism and cerebral malformations
- Do they instead form a cluster of enhancers of ARX?
Chromosome X