
IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions, July 2013, IIT Kanpur, India

Various Coding Based Frameworks Suitable for Error Control in Genetic Sequence Analysis
Bhawani Sankar Biswal
Department of Computer Sc. & Engg.
International Institute of Information Technology
Bhubaneswar, India
A111010@iiit-bh.ac.in, bhawani.biswal@gmail.com

Anjali Mohapatra
Department of Computer Sc. & Engg.
International Institute of Information Technology
Bhubaneswar, India
anjali@iiit-bh.ac.in

Abstract: DNA sequencing is one of the first steps towards understanding the genome of a species and towards gene prediction. The accurate transmission of DNA sequence information during the transformation of DNA into protein is an active area of research in computational biology. In the biological domain, information about a species, an organism or the proper functioning of cells is treated as valuable data, so transmitting that information over a noisy channel from one end to the other (source to receiver) is always challenging. Introducing error-correcting codes into molecular biology gives a better understanding of the biological communication process. By treating the DNA nucleotide bases [Adenine, Cytosine, Guanine and Thymine] as digital codes, we can prepare a theoretical model of transmitting genetic information over a noisy channel. In this paper we propose frameworks that support the error-free transmission of genomic sequences. These frameworks can detect errors that occur during transmission over a noisy channel using the Cyclic Redundancy Check (CRC), and can correct errors using the Hamming coding technique.

Keywords: Evolutionary Computation; Computational Biology; DNA sequencing; Error Detection & Correction; Coding Theory; EDC & EDA.

I.

Introduction

When data is transferred from source to destination over a noisy channel, errors can occur during transmission. Various coding theory mechanisms have therefore been developed to control errors in communication systems. These coding theory practices provide modulation and demodulation techniques, which are key aspects of error correction methodology. The availability of large genome datasets in public databases [4]-[6] helps us to explore the error handling capabilities of channel codes in computational biology as well.

In a biological communication system, the information in DNA passes through two phases, transcription and translation, to be transformed into protein. DNA replication and some environmental agents are primarily responsible for the mutations that occur in DNA sequences. DNA replication refers to the splitting of a DNA molecule into two strands (usually separated by DNA helicase). A cellular polymerase enzyme, usually known as DNA polymerase, acts in pairs to generate two double-stranded DNA molecules from one double-stranded DNA molecule. Any fault during this process can introduce an error. The chance of such an error is very low, perhaps once in 100,000,000 bases. In addition, environmental agents such as ultraviolet light, nuclear radiation and various chemicals can damage DNA sequences by altering the nucleotide bases.

Figure.1: Schematic representation of DNA transformation into protein (double-stranded DNA molecule, messenger RNA (mRNA), chained amino acids, proteins).

Copyrights All rights reserved.

Figure.2:A-T-C-G DNA code binding


Since coding theory mechanisms are known to handle errors in digital information, we can also apply them to repair the errors generated during biological transmission, which can in turn increase the fidelity of DNA replication.

For example, let us take some DNA sequences obtained by matching start and stop codons:
CTGGGCTAA
ATGGGCCATTAG
TTGCAAGGAAGAACCATTCGTGA
CTGAGCTTCTTTTGA
The protein sequences generated from the above DNA sequences are:
LWGGAL
MWGGAPHIL
LCAQKRGEKRGEKRENT
LCAQKRGEKRENTPHIFSRV
If a single nucleotide polymorphism (SNP) occurs here, T can be replaced by C, G or A.
For instance:
DNA sequence: ACCCGTCTT
Generated protein sequence: TPPRVSL
After a mutation occurs:
DNA sequence: ACCGGTCTT
Generated protein sequence: TPRGVSL (wrong protein sequence)

In this paper, we propose two new algorithms that are specially designed for error-free information transmission in molecular biology for the synthesis of proteins from DNA. Both models can detect errors that occur during data transfer, and one of them can also correct the detected errors.
II.

Related Work

The application of coding theory mechanisms to the biological domain began in the 1950s [7]-[8]. Since then, coding methods have been applied to genetic data classification and analysis, the regulatory processes of different types of genes and their classification, biological chip design technology and many other areas.

Andrade et al. [12] presented a model built on a modulator and an encoder. Yockey [13] developed the first blueprint for gene expression, introducing the notion of encoding and decoding. Battail [14] argued that nested codes may exist in the DNA structure. Rosen [15] proposed an approach to uncover linear block codes and thereby characterise insertions and deletions in DNA sequences. Leibovitch et al. designed a methodology that determines the presence or absence of an error-correcting code in genomic sequences. Schneider et al. [16] presented an algorithm that helps differentiate coding regions from their non-coding neighborhoods in a DNA structure. Finally, May et al. [1] were the first to recommend the use of different codes (such as block and convolutional codes) in the process of translation initiation in prokaryotic organisms. P. P. Debata et al. [9] suggested a theoretical model based on the Hamming algorithm to detect errors during the transmission of genomic sequences. That work, however, is limited to error detection; correcting the detected errors remained a challenging task.

III.

Proposed Model

Here we propose two new models: the Error Detection & Correction (EDC) algorithm and the Error Detection Algorithm (EDA). Both models take original DNA sequences collected from large databases as input and produce error-free DNA sequences as output. In both models, the genetic information in DNA is transformed into digital form, using the four DNA bases adenine, cytosine, guanine and thymine. Using those digital codes, we can transmit the information over a noisy channel. When the transmission is over, we translate the digital values back into their corresponding analog (nucleotide) values to obtain the proteins that carry the properties and functions of the living cell. In the EDA model, the data is checked for loss or damage during transmission. The EDA model is based on the Cyclic Redundancy Check (CRC) methodology, which lets us retain only uncorrupted data. The EDC model, on the other hand, is based on the Hamming coding technique, with which we can not only detect errors but also correct, and so reconstruct, the DNA sequences corrupted during the communication process.


Figure.3: Block Diagram of proposed models

ERROR DETECTION & CORRECTION ALGORITHM (EDC MODEL APPROACH)


Input: Original DNA sequences.
Output: Error-free sequences.
Step 1: Generate nucleotide sequences from the original DNA sequences.
Step 2: Select any sequence from the generated DNA sequences.
Step 3: Map the sequence using binary mapping.
Step 4: Convert the DNA sequence into binary data bits of length K.
Step 5: Add parity bits to the data bits such that 2^P ≥ K + P + 1, where P = number of parity bits and K = number of data bits.
Step 6: Represent each data bit and parity bit with a column vector.
Step 7: Create a Hamming matrix (H) by arranging the column vectors.
Step 8: Repeat Steps 3 to 7 for all generated DNA sequences.
Step 9: For decoding, construct a parity check matrix identical to H.
Step 10: Calculate the syndrome vector s by checking each parity relation in turn, from the first up to the Pth.
Step 11: If s = 0, assume there is no error and store the sequence for generating amino acid sequences.
Step 12: If s ≠ 0, record the frame number and the error position within that frame.
Step 13: Correct the erroneous bit at the traced location and generate the error-free DNA sequence.
Step 14: Stop.

ERROR DETECTION ALGORITHM (EDA MODEL APPROACH)

Input: Original DNA sequences.
Output: Error-free sequences.
Step 1: Generate nucleotide sequences from the original DNA sequences.
Step 2: Select any sequence from the generated DNA sequences.
Step 3: Map the sequence using binary mapping.
Step 4: Convert the DNA sequence into binary data bits of length K.
Step 5: Append the CRC bits to the data bits to form K + C bits, with C = D - 1, where C = number of CRC bits, D = length of the divisor and K = number of data bits.
Step 6: Represent each data bit and CRC bit with a column vector.
Step 7: Generate the CRC bits from the data bits using the divisor polynomial.
Step 8: Repeat Steps 3 to 7 for all generated DNA sequences.
Step 9: For decoding, regenerate the CRC bits from the encoded DNA sequence.
Step 10: Calculate the syndrome s by comparing the regenerated CRC bits with the appended CRC bits.
Step 11: If s = 0, assume there is no error and store the sequence for generating amino acid sequences.
Step 12: If s ≠ 0, the sequence is corrupted.
Step 13: Discard the corrupted sequence.
Step 14: Stop.
IV.

Results & Discussions

The EDA was implemented in the MATLAB environment and run on a Windows 7 machine. A number of DNA sequences were created from the available data sets and tested. For example, a selected DNA sequence "CTGGGCTAAAATCCGG" from the dataset with GI no. ACU08131 is mapped into binary bits using [A=00, T=01, G=10, C=11]. The corresponding data bits in binary form are "11011010101101000000000111111010".
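The binary mapping described here can be sketched in a few lines of Python. Only the code table [A=00, T=01, G=10, C=11] and the example sequence come from the text; the function names `dna_to_bits` and `bits_to_dna` are illustrative and are not taken from the authors' MATLAB implementation.

```python
# Code table from the text: A=00, T=01, G=10, C=11.
BASE_TO_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def dna_to_bits(sequence: str) -> str:
    """Encode a DNA string as a bit string, two bits per nucleotide."""
    return "".join(BASE_TO_BITS[base] for base in sequence.upper())


def bits_to_dna(bits: str) -> str:
    """Decode a bit string back into DNA; the length must be even."""
    if len(bits) % 2:
        raise ValueError("bit string length must be a multiple of 2")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


encoded = dna_to_bits("CTGGGCTAAAATCCGG")
print(encoded)               # 11011010101101000000000111111010
print(bits_to_dna(encoded))  # CTGGGCTAAAATCCGG
```

Because every nucleotide occupies exactly two bits, a K-bit data word always carries K/2 bases, and the reverse mapping is unambiguous.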
If, during transmission, a nucleotide base of the DNA sequence changes (for example C = 11 to G = 10), often called a single nucleotide polymorphism (SNP), the EDC detects the frame number and the position of the error within that frame. The exact location of the error is traced and then rectified by the EDC, so the output generated on the decoding side is an error-free DNA sequence.
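A minimal Python sketch of this detect-locate-correct cycle is given below. It assumes a Hamming(7,4) code, that is, frames of K = 4 data bits (two nucleotides) with P = 3 parity bits, the smallest P satisfying 2^P ≥ K + P + 1, and places the parity bits at positions 1, 2 and 4 of each frame. The frame length, the parity layout and all function names are assumptions made for illustration; the paper does not state them.

```python
from typing import List, Tuple

# Code table from the text: A=00, T=01, G=10, C=11.
BASE_TO_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_frame(data: str) -> str:
    """Encode 4 data bits as a 7-bit Hamming codeword (assumed layout:
    parity bits at positions 1, 2 and 4, data bits at 3, 5, 6 and 7)."""
    d3, d5, d6, d7 = (int(b) for b in data)
    p1 = d3 ^ d5 ^ d7   # covers positions 1, 3, 5, 7
    p2 = d3 ^ d6 ^ d7   # covers positions 2, 3, 6, 7
    p4 = d5 ^ d6 ^ d7   # covers positions 4, 5, 6, 7
    return "".join(str(b) for b in (p1, p2, d3, p4, d5, d6, d7))


def syndrome(frame: str) -> int:
    """XOR of the positions holding a 1: 0 for a valid frame, otherwise
    the 1-based position of a single flipped bit."""
    s = 0
    for position, bit in enumerate(frame, start=1):
        if bit == "1":
            s ^= position
    return s


def flip_bit(frame: str, position: int) -> str:
    """Return the frame with the bit at the 1-based position inverted."""
    i = position - 1
    return frame[:i] + ("0" if frame[i] == "1" else "1") + frame[i + 1:]


def encode_dna(sequence: str) -> List[str]:
    """Map DNA to bits and Hamming-encode it in frames of two nucleotides."""
    if len(sequence) % 2:
        raise ValueError("this sketch needs an even number of nucleotides")
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return [encode_frame(bits[i:i + 4]) for i in range(0, len(bits), 4)]


def decode_dna(frames: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Correct one bit per frame where needed; return the DNA string and
    a list of (frame number, bit position) for every correction made."""
    data, fixes = "", []
    for number, frame in enumerate(frames, start=1):
        s = syndrome(frame)
        if s != 0:
            frame = flip_bit(frame, s)
            fixes.append((number, s))
        data += frame[2] + frame[4] + frame[5] + frame[6]
    dna = "".join(BITS_TO_BASE[data[i:i + 2]] for i in range(0, len(data), 2))
    return dna, fixes


sent = encode_dna("CTGGGCTAAAATCCGG")
received = list(sent)
# Simulate the SNP C (11) -> G (10) in the first nucleotide: its second
# bit sits at position 5 of frame 1.
received[0] = flip_bit(received[0], 5)
print(decode_dna(received))  # ('CTGGGCTAAAATCCGG', [(1, 5)])
```

With this layout the syndrome is itself the position of the flipped bit, which is why the decoder can report both the frame number and the error position before correcting it.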
The EDA model, on the other hand, is designed mainly for error detection. Here we verify the transmitted DNA sequences against any disorder or corruption. If corruption is found, the model simply discards the sequence; if the sequence is uncorrupted, it is stored for generating amino acid sequences.
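The detect-and-discard behaviour of the EDA model can be sketched in Python as follows. The divisor "1101" (so D = 4 and C = D - 1 = 3 CRC bits) and the function names are assumptions chosen for illustration; the paper does not state which divisor polynomial was used.

```python
# Code table from the text: A=00, T=01, G=10, C=11.
BASE_TO_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
DIVISOR = "1101"  # assumed divisor (x^3 + x^2 + 1); not stated in the paper


def crc_bits(data: str, divisor: str = DIVISOR) -> str:
    """Return the C = D - 1 CRC bits of the data (modulo-2 long division
    of the data followed by C zero bits)."""
    c = len(divisor) - 1
    work = [int(b) for b in data] + [0] * c
    div = [int(b) for b in divisor]
    for i in range(len(data)):
        if work[i]:                      # leading bit set: subtract (XOR) divisor
            for j, d in enumerate(div):
                work[i + j] ^= d
    return "".join(str(b) for b in work[-c:])


def crc_encode(data: str, divisor: str = DIVISOR) -> str:
    """Append the CRC bits to the K data bits, giving K + C bits."""
    return data + crc_bits(data, divisor)


def is_uncorrupted(codeword: str, divisor: str = DIVISOR) -> bool:
    """Regenerate the CRC from the data part and compare it with the
    appended CRC; a mismatch corresponds to a non-zero syndrome."""
    c = len(divisor) - 1
    data, appended = codeword[:-c], codeword[-c:]
    return crc_bits(data, divisor) == appended


data = "".join(BASE_TO_BITS[base] for base in "CTGGGCTAAAAGGTCC")
codeword = crc_encode(data)
# Flip the last transmitted bit to simulate a channel error.
corrupted = codeword[:-1] + ("0" if codeword[-1] == "1" else "1")

print(is_uncorrupted(codeword))   # True  -> store the sequence
print(is_uncorrupted(corrupted))  # False -> discard the sequence
```

The check only answers whether a sequence arrived intact; it gives no error position, which is why the EDA model discards a corrupted sequence instead of repairing it.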

Step 4 (EDA model): Convert the DNA sequence into binary data bits of length K.


Datasets used
The EDA's performance was examined on various publicly available data sets used for gene structure prediction, e.g. ACU08131, AGGGLINE, AGU04852 and ALOEGLOBIM. These data sets are available at http://genome.crg.es/datasets/genomics96/ or directly from the NCBI database.

TABLE I. PERFORMANCE ANALYSIS OF EDC
Experimental Evaluation Using EDC Model

Generated DNA sequence: CTGGGCTAAAAGGTCC
Encoded codewords (using binary mapping and CRC codes): 10101011 01101001 10011100 11000000 00001010 10001000 11111111
Received bits: 10101011 01101001 10011100 11000000 00001010 10001000 11111111
Calculated syndrome (s): s ≠ 0
Error detected & resolved: Yes
Error-free DNA sequence: CTGGGCTAAAAGGTCC

TABLE II. PERFORMANCE ANALYSIS OF EDA
Experimental Evaluation Using EDA Model

Row 1
Generated DNA sequence: CTGGGCTAAAAGGTCC
Encoded codewords (using binary mapping and CRC codes): 11010001 01000110 11100010 00110000 00000101 11100101 11011111
Received bits: 11010001 01000110 11100010 00110000 00000101 11100101 11011110
Calculated syndrome (s): s ≠ 0
Error detected: Yes

Row 2
Generated DNA sequence: CTGGGCTAAAAGGTCC
Encoded codewords (using binary mapping and CRC codes): 11010001 01000110 11100010 00110000 00000101 11100101 11011111
Received bits: 11010001 01000110 11100010 00110000 00000101 11100101 11011111
Calculated syndrome (s): s = 0
Error detected: No

Graphs Showing Results: EDC Model
Figure.4: A snapshot of dataset
Figure.4: Binary Coded DNA Sequences
Figure.5: Hamming Coded DNA Sequences
Figure.6: DNA Sequences with Error Bit


Figure.7: Error free DNA Sequence

Graphs Showing Results: EDA Model

V.

Conclusion

This paper addresses the detection and correction of errors generated during the transmission of biological information. The Hamming coding technique provides single-bit error correction, so the EDC model can correct a single-bit error in a given frame but fails when a frame contains multiple errors. CRC coding, in turn, is ill-suited to correcting errors, so the EDA model provides error detection only and no error correction.

This work can be extended to handle multi-bit errors generated during transmission, and to adopt other efficient coding techniques in the communication process of molecular biology to give a better view of genomic sequence analysis.
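The multi-bit limitation can be illustrated with a short Python sketch. It assumes a Hamming(7,4) frame with parity bits at positions 1, 2 and 4 and the code table [A=00, T=01, G=10, C=11]; the frame layout and the function names are illustrative assumptions, not the authors' implementation. Under this mapping a substitution such as C (11) to A (00) flips two bits of the same frame, which a single-error-correcting decoder cannot repair.

```python
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}


def encode_frame(data: str) -> str:
    """Hamming(7,4) encoder, parity at positions 1, 2, 4 (assumed layout)."""
    d3, d5, d6, d7 = (int(b) for b in data)
    p1, p2, p4 = d3 ^ d5 ^ d7, d3 ^ d6 ^ d7, d5 ^ d6 ^ d7
    return "".join(str(b) for b in (p1, p2, d3, p4, d5, d6, d7))


def flip_bit(frame: str, position: int) -> str:
    """Invert the bit at the given 1-based position."""
    i = position - 1
    return frame[:i] + ("0" if frame[i] == "1" else "1") + frame[i + 1:]


def correct_frame(frame: str) -> str:
    """Single-error correction: flip the bit the syndrome points at."""
    s = 0
    for position, bit in enumerate(frame, start=1):
        if bit == "1":
            s ^= position
    return flip_bit(frame, s) if s else frame


def frame_to_dna(frame: str) -> str:
    """Extract the 4 data bits (positions 3, 5, 6, 7) and map them to bases."""
    data = frame[2] + frame[4] + frame[5] + frame[6]
    return BITS_TO_BASE[data[:2]] + BITS_TO_BASE[data[2:]]


sent = encode_frame("1101")                  # nucleotides "CT"
# C (11) -> A (00): both bits of the first nucleotide flip (positions 3 and 5).
received = flip_bit(flip_bit(sent, 3), 5)

print(frame_to_dna(sent))                     # CT
print(frame_to_dna(received))                 # AT
print(frame_to_dna(correct_frame(received)))  # AC
```

The decoder sees a non-zero syndrome, flips a third bit and returns a valid but wrong codeword, so the output is neither the transmitted nor the received pair of bases.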

Figure.8: Binary Coded DNA Sequences
Figure.9: CRC Coded DNA Sequences
Figure.10: DNA Sequences with Error Bit

References
[1] E. May, M. Vouk, D. Bitzer and D. Rosnick, "An error-correcting code framework for genetic sequence analysis", Journal of Franklin Institute, 2004; 34: 89-109.
[2] R. Dawkins, The Blind Watchmaker, Longman, New York, 1986.
[3] T. D. Schneider, "Information Content of Individual Genetic Sequences", Journal of Theoretical Biology, 1997; 189: 427-441.
[4] http://genome.crg.es/datasets/genomics96/
[5] http://www.ncbi.nlm.nih.gov/nuccore/
[6] http://www.ncbi.nlm.nih.gov/protein
[7] B. Hayes, "The invention of the genetic code", Am. Sci. 86 (1) (1998) 8-14.
[8] S. W. Golomb, "Efficient coding for the desoxyribonucleic channel", Proceedings of the Symposia in Applied Mathematics, New York, NY, USA, Mathematical Problems in the Biological Sciences, Vol. 14, American Mathematical Society, Providence, RI, 5-8 April 1961, pp. 87-100.
[9] P. P. Debata, D. Mishra, K. Shaw, S. Mishra, "A coding theoretic model for error-detecting in DNA sequences", International Conference on Modeling Optimization and Computing (ICMOC-2012), Volume 38, 2012, Pages 1773-1777.
[10] Bhawani Sankar Biswal, Anjali Mohapatra, "A Coding Based Framework for Error Control in DNA Sequences", International Journal of Emerging Technologies in Computational and Applied Sciences, Issue 4, Vol. 1, 2 & 3, March-May, Pages 61-65.
[11] http://web.expasy.org/translate/
[12] A. Andrade and R. Palazzo Jr, "DNA Sequences Generated by Z4-linear codes", ISIT 2010, Austin, Texas, U.S.A., June 13-18, 2010.
[13] H. Yockey, Information Theory and Molecular Biology, Cambridge University Press: Cambridge, 1992.
[14] G. Battail, "Information Theory and error correcting codes in genetics and biological evolution", Introduction to Biosemiotics, Springer, November 2006.
[15] G. L. Rosen, "Examining Coding Structure and Redundancy in DNA", IEEE Engineering in Medicine and Biology, 2006; 25: 62-68.
[16] T. D. Schneider, "Information Content of Individual Genetic Sequences", Journal of Theoretical Biology, 1997; 189: 427-441.
