
Finding disease associated proteins: a bioinformatics solution for the discovery of sequences with high probability of mutation

POPOV I.1, NENOV A.2, PETROV P.3, VASSILEV D.1*


Background

Mutations and natural selection are the driving forces behind evolution. Mutations in some regions are better tolerated, because a change at the respective locus is not lethal to the cell. However, research on various mutant proteins has shown that some nucleotide sequences are also more easily changed than others, leading to a number of genetic diseases. Whether there is a connection between a sequence being linked to a disease and its actual content (codons, nucleotide combinations, etc.) can be investigated by analysing the codon usage bias among different sequence groups. The bias is examined through several simple calculations that determine the codon frequency and the fraction of cases in which each synonymous codon is used to code its common amino acid. These numeric properties of a sequence can be used to compare it to other sequences. The same can be done with whole groups of sequences, giving us a chance to examine the properties of a group and possibly use them to classify other sequences.

Objective

The major goal of the study is to determine the association between codon usage and the inherent mutability of the sequence. The search for a model that can classify sequences into groups based on their codon content, and associate them with the other sequences in the group, is also within the scope of the work.

Data set

For the purpose of our research five groups of coding sequences (mRNAs) were selected from the latest available version of the EMBL Nucleotide Sequence database. The groups were built based on the data for genetic variations from Swiss-Prot, which includes only missense changes. This data was entered into a MySQL database and the following two tables were generated:

- Gene: the number of different variations and diseases connected to the specific gene name.
- Disease: the number of different genes and variations connected to the respective disease name.

These tables were used to select genes with a certain number of variations into the following five groups:

- Null group: genes that were not present in the variations dataset, meaning that there were no registered/annotated variations for them. This is of course not a group of genes with no mutations, but it still serves as another reference for the other four groups.
- Genes with fewer than 6 variations.
- Genes with 6 to 15 variations.
- Genes with more than 15 variations.
- Every gene with recorded variations.

These groups were selected in order to show that there is a correlation between the number of variations occurring in a sequence and its codon usage. This is why the groups were selected based on the total number of variations, and not based on the diseases they were connected to. To determine the connection between genes linked to a disease and codon usage, a sixth group was selected. It encompassed all the genes with variations that were connected to a certain kind of cancer, or were found in a cancer sample (and so could be connected to cancer). The number of sequences in all the groups can be found in Table 1.

Group                   Sequences
Null                    18742
With                    8457
<6                      6909
6 to 15                 1124
>15                     424
Cancer related genes    868

Table 1. Number of sequences in each group.

Methods and algorithms

For each of the above groups the EMBOSS application CUSP was used to calculate the synonymous codon fractions and the relative frequency of every codon per 1000 codons in the sequence. CUSP works by simply counting the codons in the sequence and calculating statistics for each of them based on the amino acid and the total number of occurrences. This data was generated for each sequence and for the group as a whole. The results for the whole groups were used to measure the difference between them. When calculating the distance, the aim was to preserve the difference imposed by the frequency of every codon. This is why we used the Euclidean distance in 64D space as a distance function. The frequency and fraction numbers for each codon were taken as coordinates in this 64D space, and the distance between the points so defined, representing the groups, and a reference was calculated (Table 2). The CUSP output for the whole dataset was used as the reference point for the distance calculation. A graphical representation of the distances can be seen in Figure 1.

Group                   Distance
Null                    2.1971714
With                    2.4203601
<6                      2.4840020
6 to 15                 3.5784836
>15                     8.5619555
Cancer related genes    7.3694174

Table 2. Results from the distance calculations between the gene groups and the reference (whole dataset). Null: genes without any variation data in Swiss-Prot; With: all genes with recorded variations; <6, 6 to 15, >15: groups with the respective number of variations recorded.
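The calculation described in the methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CUSP implementation: the function names (count_codons, codon_usage, euclidean_distance) are hypothetical, the standard genetic code is assumed, and the distance is computed here on the per-1000 frequency vector only, since the text does not specify how the frequency and fraction values were combined.

```python
import math
from collections import Counter
from itertools import product

# The 64 codons in TCAG order, paired with the standard genetic code
# ('*' marks stop codons). Using the standard code is an assumption.
CODONS = ["".join(p) for p in product("TCAG", repeat=3)]
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))


def count_codons(sequences):
    """Count in-frame codons over a group of coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TO_AA:  # skip codons with ambiguous bases
                counts[codon] += 1
    return counts


def codon_usage(counts):
    """Return (frequency per 1000 codons, synonymous fraction) as 64-vectors."""
    total = sum(counts.values())
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TO_AA[codon]] += n
    frequency = [1000.0 * counts[c] / total if total else 0.0 for c in CODONS]
    fraction = [
        counts[c] / aa_totals[CODON_TO_AA[c]] if aa_totals[CODON_TO_AA[c]] else 0.0
        for c in CODONS
    ]
    return frequency, fraction


def euclidean_distance(u, v):
    """Euclidean distance between two points in 64D codon space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy sequences for illustration only, not data from the study.
group = ["ATGTTTTTTGCA"]
reference = ["ATGTTCTTTGCC"]
group_freq, _ = codon_usage(count_codons(group))
ref_freq, _ = codon_usage(count_codons(reference))
print(euclidean_distance(group_freq, ref_freq))
```

In the study the reference point would be the usage vector of the whole dataset, and each group would contribute one point built from the pooled codon counts of all its sequences.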
1 - AgroBioInstitute; 2 - Dynomica Ltd; 3 - Sofia University, FMI. * Corresponding author, e-mail: jim6329@gmail.com

[Figure 1 plot: distance from the reference on the vertical axis (0 to 9); groups on the horizontal axis: NONE, With, <6, 6 to 15, >15, cancer.]

Figure 1. Graphical representation of the distance results for all groups.

To determine whether the observed relation was persistent in the data, a second round of calculations was carried out, this time limiting the number of sampled members per group to the number of members in the smallest group (logically, the group with the largest number of variations). The base group from which the distances were calculated was limited too. The trend is shown in Figure 2.
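Limiting every group to the size of the smallest one can be sketched as follows. This is a minimal illustration: the function name downsample_groups is hypothetical, and drawing the members at random without replacement is an assumption, since the text does not say how the limited members were chosen.

```python
import random


def downsample_groups(groups, seed=0):
    """Limit every group to the size of the smallest group.

    `groups` maps a group name to a list of sequences. Members are drawn
    at random without replacement (an assumption, see the note above).
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    smallest = min(len(members) for members in groups.values())
    return {
        name: rng.sample(members, smallest)
        for name, members in groups.items()
    }


# Toy group sizes for illustration only; the reference group is limited too.
toy = {
    "reference": [f"ref{i}" for i in range(50)],
    "<6": [f"a{i}" for i in range(30)],
    ">15": [f"b{i}" for i in range(4)],
}
limited = downsample_groups(toy)
print({name: len(members) for name, members in limited.items()})
```

Including the reference group in the same dictionary limits it in the same pass, which matches the statement that the base group was limited as well.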
[Figure 2 plot: distance from the reference on the vertical axis (0 to 12); groups on the horizontal axis: NONE, With, <6, 6 to 15, >15, cancer.]

Figure 2. Graphical representation of the distance results for the limited groups.

A second test of the results was done using five randomly picked groups of 400 sequences each (close to the number of sequences in the smallest group). The CUSP results for these groups, compared to a reference group of another 400 randomly picked sequences, are shown in Figure 3.
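The random control can be sketched in the same way. The function name random_groups is hypothetical, and making the six groups disjoint is an assumption: the text states only that the groups and the reference were picked at random.

```python
import random


def random_groups(pool, n_groups=5, size=400, seed=0):
    """Draw a random reference group plus `n_groups` random groups.

    All groups have `size` members and are disjoint (an assumption).
    Returns (reference, list_of_groups).
    """
    rng = random.Random(seed)
    picked = rng.sample(pool, (n_groups + 1) * size)
    chunks = [picked[i * size:(i + 1) * size] for i in range(n_groups + 1)]
    return chunks[0], chunks[1:]


# Placeholder identifiers stand in for the sequences of the whole dataset.
pool = [f"seq{i}" for i in range(5000)]
reference, groups = random_groups(pool)
print(len(reference), [len(g) for g in groups])
```

If codon usage did not depend on the number of variations, the distances of such random groups from the random reference would be expected to show no trend across groups.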
[Figure 3 plot: distance from the reference on the vertical axis (0 to 10); horizontal axis labels: NONE, With, <6, 6 to 15, >15, cancer.]

Figure 3. Distance results for the random groups.
