
Computational Methods in Bioinformatics

Dr. Moustafa Elshafei


Systems Engineering Department

March 16, 2004

Topics
What is Bioinformatics?
Introduction to molecular genetics
Some challenging problems
Review of the current computational techniques
Future approaches
Conclusion


What is Bioinformatics?
Bioinformatics is a management information system for molecular biology.
Organization of a huge amount of information in gene banks and protein banks.
Data mining and analysis tools.
Modeling, interpreting, and predicting biological activities.

Introduction to molecular genetics

Molecules Lipids Proteins


Nucleus and Nucleolus


Plant cell. Note the large nucleus and nucleolus in the centre of the cell.


Chromosomes and Genes


How long is DNA?


The DNA helix (2 nm wide) is wound around histones into a fibre of diameter 11 nm, then compacted into a 30 nm chromatin fibre, then coiled to a diameter of 700 nm, and finally formed into chromosomes of 1400 nm diameter.
If the human DNA strand had a diameter of 1 mm, it would stretch to 25 km. It would be wound, twisted, and coiled until it became a chromosome 50 cm in diameter and 4 meters long.


Chromosomes
Chromosomes are the cellular components that contain genes; in animals and plants they are located in the nucleus. Genes are the functional units of inheritance: specific segments of DNA that code for specific proteins, which control cell structure and function.


The number of chromosomes varies from one organism to another: human 46, chicken 78, mouse 40, wheat 42, corn 20, fruit fly 8, scorpion 4.


Genes & Genetics


Deoxyribonucleic acid (DNA)


DNA is a pair of strands, each a sequence of four nucleotides: cytosine (C), guanine (G), adenine (A), and thymine (T). A pairs with T, and C pairs with G; the pairs are held together by hydrogen bonds.


TCTCGGCATTAGGGCCT
AGAGCCGTAATCCCGGA
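The pairing rule (A with T, C with G) can be checked directly on the two strands shown above. The following is a minimal sketch; the names `PAIR` and `complement` are illustrative and not part of the original slides.

```python
# Watson-Crick pairing rule from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complementary strand."""
    return "".join(PAIR[base] for base in strand)


top = "TCTCGGCATTAGGGCCT"
bottom = complement(top)
print(top)
print(bottom)  # AGAGCCGTAATCCCGGA, the second strand on the slide
```

Applying `complement` twice returns the original strand, since the pairing rule is symmetric.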


Genome length in nucleotide pairs


Virus: 5k
E. coli: 4,700k
Human being: 3,000,000k
Corn: 4,500,000k
Salamander: 72,500,000k


Genes and proteins


Genes are segments of DNA that code for proteins.
A segment of DNA that codes for a specific protein is a structural gene.
Protein synthesis is also governed by a genetic code.
Every function in a cell is controlled by some kind of protein.
Proteins are chains built from 20 amino acids.
Each group of three nucleotides is called a codon.
The 64 possible codons map to Start, Stop, or one of the 20 amino acids.


Protein Mapping
A protein consists of a chain of amino acids.
There are 20 amino acids.
Each amino acid is coded by three bases.
During protein synthesis DNA is transcribed to mRNA, with T replaced by U (T->U; DNA->mRNA).
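The mapping from codons to amino acids can be sketched in a few lines. The example below builds the standard genetic code table (64 codons, one-letter amino acid codes, `*` for Stop) and translates a short sequence. It assumes the input DNA is the coding strand read from its first base; the names `transcribe`, `translate`, and `CODON_TABLE` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in UCAG order for each codon position.
# One-letter amino acid codes; "*" marks a Stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def transcribe(dna):
    """DNA -> mRNA on the coding strand: replace T with U."""
    return dna.replace("T", "U")


def translate(mrna):
    """Read codons from the first base and stop at the first Stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna)             # AUGGCCUAA
print(translate(mrna))  # MA  (AUG = Met/Start, GCC = Ala, UAA = Stop)
```

The table has exactly 64 entries covering 20 distinct amino acids plus Stop, which matches the counts given on the slides.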


Protein Expression


Gene lengths range between 30k and 250k; exon region 693106 bp.
Introns can be as large as 32k.
The mean internal coding exon is 150 bp.
Eukaryotes have only 10% of their DNA coding for proteins.
Humans may have as little as 1% coding for proteins.
Viruses and prokaryotes use a great deal more of their DNA.
The Human Genome Project was completed in 2003: 3 billion bp and about 30,000 genes, compared to 13,600 for the fruit fly, over 14,000 genes in mosquitoes, and 50,000 in rice.

If the number of genes really turns out to be about 30,000, then this can be a testament to the marvellous design of life. Only a genius could create us with so few genes performing so many functions
A famous scientist in genetics.



An RNA gene is any gene that is not translated into a protein.
Commonly used synonyms of "RNA gene" are noncoding RNA and ncRNA.
RNA genes code for certain regulatory functions.
RNA genes are not predictable by current algorithms.
It is not clear how many of these are hidden in the human genome.


Gene Banks


Challenges
1- Gene finding: identify potential gene regions in DNA. This is hard because only 1-3% of the human genome is translated into proteins.
2- Finding a region of interest: raw sequencing is performed on pieces of random length, between 500 and 5000 bp, with possibly large overlaps at both ends and 6 possible interpretations of each strand. Algorithms are needed to align the fragments.
3- Multiple alignment of a set of genes to reveal regions of similarity and cross-species changes.
4- Local alignment and similarity search: statistical grouping, clustering, and statistical similarity measures for coarse classification.
5- Protein structure prediction: given a protein sequence, how would it fold into a specific complex 3D shape?
6- Locating the non-coding (RNA) genes.
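One common reading of the "6 possible interpretations" in challenge 2 is the six reading frames of a double-stranded fragment: three codon offsets on the given strand and three on its reverse complement. That reading is an assumption here, and the sketch below simply enumerates the frames; the function names are illustrative.

```python
# Translation table for base-by-base complementation of a DNA string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse to read the opposite strand 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return the codon lists for the six reading frames of a fragment.

    Keys "+1".."+3" are offsets 0..2 on the given strand; "-1".."-3" are
    the same offsets on the reverse complement.
    """
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [
                strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)
            ]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGGCCTAA").items():
    print(name, codons)
```

A gene finder has to consider all six frames because a raw fragment carries no information about which strand, or which offset, the coding region uses.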


Methods
Similarity search
Content search
Signal search


Common Software Uses


Similarity analysis
Sequence analysis
Sequence alignment
Population genetics statistical analysis
Format conversion
Database maintenance and searching


Fast Database Search


BLAST & FASTA
Query a database for DNA sequences similar to a given sequence.
Rely on identification of brief subsequences (k-tuples).
Multiple k-tuples serve as seeds for extended alignment.
Versions exist for DNA and protein sequences.
Limited capability to handle gaps in coding regions.
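The k-tuple seeding idea can be illustrated with a toy example. This is a simplified sketch of the general seed-and-extend principle, not the actual BLAST or FASTA implementation: it indexes the k-tuples of a subject sequence, finds exact k-tuple matches with a query, and extends one seed without gaps. All function names and the choice k = 4 are assumptions for illustration.

```python
from collections import defaultdict


def index_ktuples(seq, k):
    """Map each k-tuple in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=4):
    """Return (query_pos, subject_pos) pairs where a k-tuple matches exactly."""
    index = index_ktuples(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


def extend_seed(query, subject, q, s, k):
    """Extend a seed left and right without gaps while the bases match.

    Returns (query_start, subject_start, length) of the extended match.
    """
    left = 0
    while (q - left - 1 >= 0 and s - left - 1 >= 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = k
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    return q - left, s - left, left + right


subject = "TCTCGGCATTAGGGCCT"
query = "GGCATTAG"
seeds = find_seeds(query, subject, k=4)
print(seeds)
print(extend_seed(query, subject, *seeds[0], 4))
```

Because the extension here is ungapped, a single insertion or deletion ends the match, which is one way to see the limited gap handling noted above.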


Gene Prediction/Gene analysis


The most common:
GRAIL*
FGENEH/FGENES
MZEF
GENSCAN*
Procrustes
GeneID
GeneParser
HMMgene

GRAIL
Gene Recognition and Analysis Link
There are multiple versions:
GRAIL 1, GRAIL 1a, GRAIL 2, GRAIL III, etc.

GRAIL II uses neural networks to classify introns and exons.
GRAIL III uses dynamic programming to find the optimal combination of introns and exons.
Refinements: consideration of contextual information, and linguistic methods.

GenScan
Predicts complete gene structures.
The input sequence may represent more than one gene.
It follows a probabilistic model.
Uses Markov models, including a generalized hidden Markov model.
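To give a feel for hidden Markov model decoding, here is a minimal sketch of the Viterbi algorithm on an ordinary two-state HMM. It is not GENSCAN's generalized HMM, and every state name and probability below is invented purely for illustration.

```python
import math

# Toy two-state HMM. All values are illustrative assumptions, not
# parameters from GENSCAN or any trained gene model.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # score[s] is the best log-probability of any path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = (prev[best] + math.log(TRANS[best][s])
                        + math.log(EMIT[s][base]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("GCGCGCGCATATATAT"))
```

With these toy parameters the GC-rich half is labeled "coding" and the AT-rich half "noncoding". A generalized HMM additionally lets each state emit a whole segment with an explicit length distribution, which this sketch does not model.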


Multiple Sequence Alignment Programs


Discover the commonalities and evolutionary relations among a set of genes or proteins.
Examples:
ClustalW, DiAlign, MAP

Alignment editors:
BioEdit


ClustalW
Finds the best global alignment for a set of input sequences (nucleic acid or protein).
A global alignment refers to the best match over the total length of the sequences.
Produces a similarity tree with scores.


CLUSTALW
Step 1: Pairwise alignment, distance matrix
Calculates distance scores between pairs
Cost: O(q^2 l^2), where q is the number of sequences and l their mean length

Step 2: Guide tree
Group nearest first
Build tree sequentially
Cost: O(q^3)

Step 3: Progressive alignment
Align, starting at leaves of tree
Cost: O(q l^2)

Other programs (MAP) use dynamic programming (DP) to find the most likely evolutionary sequence.
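Steps 1 and 2 can be sketched as follows. This is a simplified stand-in, not the ClustalW implementation: it uses plain edit distance (an O(l^2) dynamic program per pair, hence O(q^2 l^2) for all pairs) in place of alignment scores, and average-linkage grouping of the nearest clusters in place of ClustalW's own tree-building method. Step 3, the progressive alignment itself, is not shown.

```python
from itertools import combinations


def edit_distance(a, b):
    """Edit distance by dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Step 1: distances for all q(q-1)/2 pairs, keyed by (i, j) with i < j."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


def guide_tree(seqs):
    """Step 2: repeatedly merge the two nearest clusters (average linkage).

    Returns a nested tuple of sequence indices, e.g. (2, (0, 1)).
    """
    dist = distance_matrix(seqs)
    clusters = {i: (i,) for i in range(len(seqs))}  # cluster id -> members
    trees = {i: i for i in range(len(seqs))}        # cluster id -> subtree

    def linkage(a, b):
        pairs = [(min(x, y), max(x, y))
                 for x in clusters[a] for y in clusters[b]]
        return sum(dist[p] for p in pairs) / len(pairs)

    next_id = len(seqs)
    while len(clusters) > 1:
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda ab: linkage(*ab))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return trees[next_id - 1]


seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
print(distance_matrix(seqs))
print(guide_tree(seqs))  # (2, (0, 1)): sequences 0 and 1 are grouped first
```

A progressive aligner would then walk this tree from the leaves upward, aligning sequences 0 and 1 first and adding sequence 2 to the result.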

Protein Structure Prediction


Neural networks (NNs) are the basis of many well-known software packages for predicting protein structures. The main software packages:
nnPredict, PredictProtein, Predator, PSIPRED, SOPMA


POSSIBLE RESEARCH DIRECTIONS


Neuro-fuzzy techniques
Genetic algorithms
Theory of error-correcting codes
Wavelets
Spectrum analysis
Dynamic modeling of protein expression


THANK YOU
