Simulation can be an alternative to experiment execution and an important tool to unveil some aspects of the association between phenotypic/genetic values and markers. The problem is: how to generate a genome with total length (t) randomly distributed among c chromosomes of different lengths, each of them comprising m markers.
8th World Congress on Genetics Applied to Livestock Production, August 13-18, 2006, Belo Horizonte, MG, Brasil
SIMULATING A GENOME WITH MARKERS 1
R. K. Ono 1, R. da Fonseca 2, M. P. Pires 2, A. T. H. Utsunomyia 2, A. V. Pires 3.
1 Funded by CNPq, Brazil, and Fapesp, São Paulo, Brazil
2 Universidade Estadual Paulista (UNESP), Faculdade de Zootecnia, Dracena, São Paulo, Brazil. e-mail: ricardo@dracena.unesp.br
3 Universidade Federal dos Vales do Jequitinhonha e Mucuri (UFVJM), Instituto de Ciências Agrárias, Diamantina, MG, Brazil
INTRODUCTION
Markers have become important in recent years. Since the seminal paper of Fernando and Grossman (1989), several studies relating molecular markers to methods in animal breeding have been conducted. However, experiments that produce marker and phenotypic data are expensive, and in some cases the costs are prohibitive.
Simulation can be an alternative to experiment execution and an important tool to unveil some aspects of the association between phenotypic/genetic values and markers. Therefore, an algorithm to generate markers and distribute them among chromosomes should be useful.
ALGORITHM
The problem is: how to generate a genome with total length (t) randomly distributed among c chromosomes of different lengths, each of them comprising m markers (not necessarily equal for all chromosomes).
1 Create a vector to store the marker positions on the chromosome (vp);
2 Create a vector control; This vector prevents two markers from occupying the same position on the chromosome.
3 Calculate t/c and store it in cmean; The result is the average length of each chromosome in the genome. This value is the base for generating the final lengths of the chromosomes.
4 Define a standard deviation value (csd); The csd value is used to introduce variation in chromosome length.
5 For each chromosome up to c-1 do;
  5.1 Sample an integer uniform random number (rnc) in the interval [-csd, +csd];
  5.2 cl = cmean + rnc; The value stored in cl is the chromosome length.
  5.3 sc = sc + cl; sc is a variable that stores the partial sums of the chromosome lengths. It is needed to avoid creating chromosomes of length 0. At first use, sc has value zero.
  5.4 If sc >= (t x constant in the interval [0,1]), return to step 5.1 and restart from the first chromosome; This check prevents the last chromosomes from having length 0. It also prevents the first chromosomes from being too long in comparison with the other chromosomes. For instance, if the constant is 0.99 and the first chromosome is longer than 99% of the genome, the process restarts in search of a better solution. On the other hand, if one is generating 10 chromosomes and the first eight sum to more than 99% of the genome, the last two chromosomes will be too small compared with the others (with the risk that one of them has length zero or close to zero). Thus, the process restarts in search of a new scenario.
  5.5 Sample an integer uniform random number (rnm) in the interval [0, chromosome length]; If one does not want too many markers in the genome, it is enough to multiply the chromosome length by a factor, so that the result is smaller than the chromosome length.
  5.6 am = rnm; The value in am is the number of markers in the chromosome.
  5.7 For each marker up to am do;
    5.7.1 Sample a random number (rnp) in the interval [0, cl]; The number generated represents the position of the marker on the chromosome.
    5.7.2 If rnp is not in control, store rnp in vp and control; otherwise return to step 5.7.1; The vector control prevents the same position from being assigned to two markers. If the value is not in control, it can be stored in vp.
    5.7.3 Clear control;
6 For the remaining chromosome do;
  6.1 cl = t - sc;
  6.2 Repeat steps 5.5 and 5.6;
  6.3 For each marker up to am do;
    6.3.1 Sample a random number (rnp) in the interval [0, cl];
    6.3.2 If rnp is not in control, store rnp in vp and control; otherwise return to step 6.3.1;
    6.3.3 Clear control;
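Steps 1 to 6 can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: lengths and positions are integers, csd is smaller than t/c so that no sampled length is zero or negative, the constant of step 5.4 is passed as k, and sc is reset to zero on every restart. The names Chromosome, Genome, place_markers and simulate_genome, as well as the max_restarts guard, are hypothetical. For brevity the sketch draws all chromosome lengths first and places markers only once the lengths pass the check, and it uses a fresh control set for each chromosome.

```cpp
#include <random>
#include <set>
#include <stdexcept>
#include <vector>

// One simulated chromosome: its length and its sorted marker positions (vp).
struct Chromosome {
    int length;
    std::vector<int> markers;
};

// Result of one run: the chromosomes and how often step 5.4 forced a restart.
struct Genome {
    std::vector<Chromosome> chromosomes;
    long restarts;
};

// Steps 5.5 to 5.7 (and 6.2 to 6.3): draw the number of markers, then draw
// distinct integer positions in [0, cl]. Since am <= cl and there are cl + 1
// possible positions, the rejection loop always terminates.
std::vector<int> place_markers(int cl, std::mt19937& rng) {
    std::uniform_int_distribution<int> amount(0, cl);    // step 5.5 (rnm)
    std::uniform_int_distribution<int> position(0, cl);  // step 5.7.1 (rnp)
    const int am = amount(rng);                          // step 5.6
    std::set<int> control;  // step 2; a fresh set per chromosome (step 5.7.3)
    while (static_cast<int>(control.size()) < am) {
        control.insert(position(rng));  // step 5.7.2: a repeated value is ignored
    }
    return std::vector<int>(control.begin(), control.end());
}

// Assumed preconditions: c >= 1, 0 <= csd < t / c, 0 < k <= 1.
Genome simulate_genome(int t, int c, int csd, double k, std::mt19937& rng,
                       long max_restarts = 1000000) {
    if (c < 1 || csd < 0 || csd >= t / c || k <= 0.0 || k > 1.0) {
        throw std::invalid_argument("need c >= 1, 0 <= csd < t/c, 0 < k <= 1");
    }
    const int cmean = t / c;                                  // step 3
    std::uniform_int_distribution<int> deviation(-csd, csd);  // step 5.1 (rnc)
    Genome g;
    g.restarts = 0;
    std::vector<int> lengths;
    for (;;) {
        lengths.clear();
        int sc = 0;  // partial sum of lengths, reset on every restart
        bool ok = true;
        for (int i = 0; i < c - 1 && ok; ++i) {
            const int cl = cmean + deviation(rng);  // step 5.2
            sc += cl;                               // step 5.3
            ok = sc < t * k;                        // step 5.4
            lengths.push_back(cl);
        }
        if (ok) {
            lengths.push_back(t - sc);  // step 6.1: the last chromosome
            break;
        }
        if (++g.restarts > max_restarts) {
            throw std::runtime_error("no acceptable set of lengths found");
        }
    }
    for (int cl : lengths) {
        Chromosome ch;
        ch.length = cl;
        ch.markers = place_markers(cl, rng);
        g.chromosomes.push_back(ch);
    }
    return g;
}
```

Calling simulate_genome(100, 3, 16, 0.99, rng) with a seeded std::mt19937 produces three chromosomes whose lengths sum to 100; the value csd = 16 is an assumption chosen for illustration. Because sc < t x k <= t holds whenever the check passes, the last chromosome always has length of at least 1.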
ILLUSTRATION
To demonstrate the performance of the algorithm, it was coded in C++ and compiled with g++ under SUSE Linux 9.3.
The critical step in this algorithm is step 5.4. The check, if not properly configured, can make the algorithm extremely inefficient. A constant much smaller than 1, say 0.5, makes the task harder to complete, i.e., more restarts are needed. A similar check could be applied to markers if one intends to limit the total number of markers. However, this check makes the algorithm extremely inefficient in most cases and should be avoided. Table 1 illustrates a sample run of the implemented algorithm with the checks for both chromosome length and number of markers implemented.
Table 1. Number of restarts of the implemented algorithm for two different values of the constant in the checks. The parameters provided were 100 chromosomes, 500 markers and a genome size of 3,000 cM.
Constant in checks    CLC A    AMC B
0.899                 112      1008
0.990                 0        132
A Chromosome length check, step 5.4.
B Amount of markers check (not numbered among the steps above).
The result shows that a small change in the constant has a significant effect on the algorithm's performance. It is also clear that it is harder to achieve a suitable solution for markers than for chromosomes. If many replicates are to be performed with both checks configured, the time consumed is much larger than with only the check for chromosome length implemented. In the worst-case scenario, the time consumed by the implemented algorithm with the two checks can be several minutes.
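The sensitivity of step 5.4 to its constant can be explored with a small counting routine. The sketch below is only an illustration: count_restarts and max_restarts are hypothetical names, lengths are assumed to be integers, and csd must be supplied by the user, since its value in the reported runs is not stated. It will therefore not reproduce the counts in Table 1.

```cpp
#include <random>

// Counts how many times the chromosome length check (step 5.4) forces a
// restart before one acceptable set of c-1 lengths is drawn.
// Returns -1 if no acceptable set is found within max_restarts restarts.
// Assumed preconditions: c >= 1, 0 <= csd < t / c, 0 < k <= 1.
long count_restarts(int t, int c, int csd, double k, std::mt19937& rng,
                    long max_restarts) {
    const int cmean = t / c;                                  // step 3
    std::uniform_int_distribution<int> deviation(-csd, csd);  // step 5.1 (rnc)
    for (long restarts = 0; restarts <= max_restarts; ++restarts) {
        int sc = 0;  // partial sum of lengths, reset on every restart
        bool ok = true;
        for (int i = 0; i < c - 1 && ok; ++i) {
            sc += cmean + deviation(rng);  // steps 5.2 and 5.3
            ok = sc < t * k;               // step 5.4
        }
        if (ok) {
            return restarts;
        }
    }
    return -1;  // gave up: the constant is too strict for these parameters
}
```

With t = 3000 and c = 100, the first 99 chromosomes sum to about 2970 on average, so a constant of 0.99 (threshold 2970) is passed roughly every other attempt, while a lower constant requires the sum to fall well below its mean and is passed far less often. A constant such as 0.5 is effectively unreachable for these parameters, which is why the sketch caps the number of restarts.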
If the parameters are altered, the algorithm's performance also changes. For instance, keeping the number of chromosomes and genome size constant and increasing the number of markers reduces the effort to complete the task, because the check condition is less likely to be reached when the number of markers is greater. The same rationale applies to the number of chromosomes. Table 2 illustrates the behavior of the algorithm when the number of markers changes and makes clear the overhead that a check for markers can cause in the program. The genome length has no significant effect on performance, since there are no checks associated with that parameter.
Table 2. Number of restarts (approximate time spent in seconds) of the implemented algorithm for two different values of the number of markers. The number of chromosomes and the genome size were kept constant at 100 and 3,000 cM, respectively. The constant in the checks was 0.99 for both the chromosome length and the number of markers per chromosome.
Number of markers    Number of restarts for markers (time spent in seconds)
200                  930,667 (18)
500                  132 (< 1)
The chromosome lengths, the distribution of markers among chromosomes and the marker positions are all assigned at random. Changes to the algorithm to cover more elaborate schemes should be relatively easy using the idea presented. A different approach can be found in Euclydes (1996).
Table 3 shows an example output provided by the implemented algorithm.
Table 3. Output of the implemented algorithm considering 3 chromosomes and a genome size of 100 cM. The constant in the check for chromosome length was set to 0.99.
CI A    CL B    NM C    MP D
1       17      3       5, 11 and 13
2       23      2       12 and 19
3       60      1       10
A CI = Chromosome Id
B CL = Chromosome length
C NM = Number of markers in chromosome
D MP = Marker positions in chromosome
REFERENCES
Fernando, R.L. and Grossman, M. (1989) Genet. Sel. Evol. 21: 467-477.
Euclydes, R.F. (1996) Doctoral Thesis, Universidade Federal de Viçosa, Viçosa, MG, Brasil.