Topics
What is Bioinformatics?
Introduction to molecular genetics
Some challenging problems
Review of the current computational techniques
Future approaches
Conclusion
What is Bioinformatics?
Bioinformatics is a management information system for molecular biology.
Organization of the huge amount of information in gene banks and protein banks.
Data mining and analysis tools.
Modeling, interpreting, and predicting biological activities.
March 16, 2004 3
Chromosomes
Chromosomes are the cellular components that contain genes; in animals and plants they are located in the nucleus.
Genes are the functional units of inheritance: specific segments of DNA that code for specific proteins, which control cell structure and function.
The number of chromosomes varies from one organism to another: human 46, chicken 78, mouse 40, wheat 42, corn 20, fruit fly 8, scorpion 4.
TCTCGGCATTAGGGCCT
AGAGCCGTAATCCCGGA
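The two strands shown above are base-paired complements of each other (A pairs with T, C pairs with G). A minimal Python sketch that checks this follows; the helper name `complement` is illustrative and not taken from the slides.

```python
# Watson-Crick pairing for DNA: A<->T and C<->G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand, in the same left-to-right order."""
    return strand.upper().translate(PAIRS)

top = "TCTCGGCATTAGGGCCT"
bottom = "AGAGCCGTAATCCCGGA"

# Each base of the top strand should pair with the base directly below it.
print(complement(top) == bottom)
```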
Protein Mapping
A protein consists of a chain of amino acids.
There are 20 amino acids.
Each amino acid is coded by three bases (a codon).
During protein synthesis, DNA is transcribed into mRNA, in which T is replaced by U.
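The mapping from bases to amino acids can be sketched in a few lines of Python. This is a minimal illustration, assuming the input is the coding strand (so transcription reduces to replacing T with U) and using the standard genetic code; the function names and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(coding_dna: str) -> str:
    """DNA -> mRNA, assuming the coding strand is given: T becomes U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTAA")   # hypothetical example sequence
print(mrna, translate(mrna))
```

The table has 4 x 4 x 4 = 64 codons for 20 amino acids plus stop signals, which is why several codons map to the same amino acid.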
Protein Expression
Gene lengths range between 30k and 250k bp; exon regions 69-3106 bp.
Introns can be as large as 32k bp.
Mean internal coding exon: 150 bp.
Eukaryotes have only 10% of their DNA coding for proteins; humans may have as little as 1%.
Viruses and prokaryotes use a great deal more of their DNA.
The Human Genome Project was completed in 2003: 3 billion bp and about 30,000 genes, compared to 13,600 for the fruit fly, over 14,000 for mosquitoes, and 50,000 for rice.
If the number of genes really turns out to be about 30,000, then this can be a testament to the marvellous design of life. Only a genius could create us with so few genes performing so many functions
A famous scientist in genetics.
An RNA gene is any gene that is not translated into a protein.
A commonly used synonym for "RNA gene" is noncoding RNA (ncRNA).
RNA genes carry out certain regulatory functions.
RNA genes are not predictable by current algorithms.
It is not clear how many of them are hidden in the human genome.
Gene Banks
Challenges
1- Gene finding: identify potential gene regions in DNA; however, only 1-3% of the human genome is translated into proteins.
2- Finding a region of interest: raw sequencing is performed on pieces of random lengths between 500 and 5000 bp, with possibly large overlapping parts at both ends and six possible reading frames for each fragment (three per strand). Algorithms are needed to align the fragments.
3- Multiple alignment of a set of genes to reveal regions of similarity and cross-species changes.
4- Local alignment and similarity search: statistical grouping, clustering, and statistical similarity measures for coarse classification.
5- Protein structure prediction: given a protein sequence, how would it fold into a specific complex 3D shape?
6- Locating the non-coding (RNA) genes.
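The fragment alignment problem in challenge 2 can be illustrated with a small Python sketch that finds the longest exact overlap between the end of one fragment and the start of another, then merges them. This is a simplification under stated assumptions: real fragments contain sequencing errors and may come from either strand, which this sketch ignores. The function names, the `min_len` parameter, and the two example fragments are hypothetical.

```python
from typing import Optional

def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a: str, b: str, min_len: int = 3) -> Optional[str]:
    """Join two fragments on their overlap, or return None if they do not overlap enough."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

# Two illustrative fragments that share the six bases "CATTAG".
left = "TCTCGGCATTAG"
right = "CATTAGGGCCT"
print(overlap(left, right), merge(left, right))
```

Exact matching keeps the idea visible; a practical assembler would score approximate overlaps with an alignment algorithm instead.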
Methods
Similarity search
Content search
Signal search
GRAIL
Gene Recognition and Analysis Internet Link. There are multiple versions:
Grail 1, Grail 1a, Grail 2, GRAIL III, etc.
GRAIL II uses neural networks to classify introns and exons.
GRAIL III uses dynamic programming to find the optimal combination of introns and exons.
Refinements: consideration of contextual information and linguistic methods.
GenScan
Predicts complete gene structures.
The input sequence may represent more than one gene.
It follows a probabilistic model.
Uses Markov models, in particular a generalized hidden Markov model.
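To give a feel for how a hidden Markov model labels a sequence, here is a toy Python sketch of Viterbi decoding with two states. It is not GenScan's model: GenScan uses a generalized HMM with many more states and explicit length distributions, whereas this is a plain two-state HMM. All probabilities below are invented for illustration only, and the sequence is assumed to be non-empty and to contain only A, C, G, and T.

```python
import math

STATES = ("exon", "intron")

# Hypothetical parameters, chosen only so that the example is easy to follow.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},    # GC-rich
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}

def viterbi(seq: str) -> list:
    """Return the most probable state path for a non-empty DNA sequence."""
    # score[s] = log-probability of the best path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][base])
            pointers[s] = best
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCATATATAT"))
```

Working in log space avoids numerical underflow when the sequence is long.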
Alignment editors
BioEdit
ClustalW
Finds the best global alignment for a set of input sequences (nucleic acid or protein).
A global alignment is the best match over the total length of the sequences.
Produces a similarity tree with scores.
CLUSTALW
Step 1: Pairwise alignment, distance matrix
Calculates distance scores between pairs of sequences.
Cost: O(q^2 l^2), where q is the number of sequences and l is their mean length.
Other programs (MAP) use dynamic programming (DP) to find the most likely evolutionary sequence.
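Step 1 above can be sketched in Python as follows. This is a simplified illustration, not ClustalW's actual implementation: it assumes a basic Needleman-Wunsch global alignment with a linear gap penalty and defines distance as one minus the fraction of identical aligned columns. The scoring values and function names are assumptions. Each pairwise alignment fills a table of about l x l cells, and there are q(q-1)/2 pairs, which gives the O(q^2 l^2) cost.

```python
def global_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> float:
    """Globally align a and b (Needleman-Wunsch) and return the fraction of identical columns."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting identical columns.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            same += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        cols += 1
    return same / cols if cols else 1.0

def distance_matrix(seqs: list) -> list:
    """Symmetric q x q matrix of pairwise distances (1 - identity)."""
    q = len(seqs)
    dist = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i + 1, q):
            dist[i][j] = dist[j][i] = 1.0 - global_identity(seqs[i], seqs[j])
    return dist

for row in distance_matrix(["ACGT", "ACGT", "ACGA"]):
    print(row)
```

A matrix like this is the input from which a similarity (guide) tree can be built in the later steps.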
THANK YOU