
Chemosphere 58 (2005) 559–570
www.elsevier.com/locate/chemosphere

Ranking of aquatic toxicity of esters modelled by QSAR


Ester Papa, Francesca Battaini, Paola Gramatica *

Department of Structural and Functional Biology, QSAR and Environmental Chemistry Research Unit, University of Insubria, via Dunant 3, 21100 Varese, Italy

Received 22 March 2004

Abstract

Alternative methods, such as predictions based on Quantitative Structure–Activity Relationships (QSARs), are now accepted to fill data gaps and define priority lists for more expensive and time-consuming assessments. A heterogeneous data set of 74 esters was studied for aquatic toxicity, and the available experimental toxicity data on algae, Daphnia and fish were used to develop statistically validated QSAR models, obtained using multiple linear regression (MLR) by the OLS (Ordinary Least Squares) method and GA-VSS (Variable Subset Selection by Genetic Algorithms), to predict missing values. An ESter Aquatic Toxicity INdex (ESATIN) was then obtained by combining, by PCA, experimental and predicted toxicity data, from which model outliers and esters highly influential due to their structure had been eliminated. Finally, this integrated aquatic toxicity index, defined by the PC1 score, was modelled using only a few theoretical molecular descriptors. This last QSAR model, statistically validated for its predictive power, could be proposed as a preliminary evaluative method for screening/prioritising esters according to their integrated aquatic toxicity, starting only from their molecular structure.
© 2004 Elsevier Ltd. All rights reserved.
Keywords: QSAR; Esters; Principal Component Analysis (PCA); Aquatic toxicity index; Theoretical molecular descriptors

1. Introduction

The use of chemicals in commerce, medicine and other aspects of daily life is generally acknowledged to bring considerable benefit; however, there is continuing concern about their negative impact on human health and the environment (Gough and Hall, 1999). More than 100 000 chemical substances are produced and used

* Corresponding author. Tel.: +39 0332 421573; fax: +39 0332 421554. E-mail address: paola.gramatica@uninsubria.it (P. Gramatica). URL: http://dipbsf.uninsubria.it/qsar/.

on a commercial scale, and about 2000 new ones are introduced onto the market each year. Many of these substances have little or no adverse effects, but some may be harmful to human health and the natural environment (Sabljic and Piver, 1992). This dichotomy in social concern has caused both regulatory agencies and chemical industries to take an interest in the potential environmental impact of a particular chemical prior to its release into an ecosystem. The limited availability of the experimental data necessary for the risk assessment of chemicals, and the general lack of knowledge of the properties and activities of existing substances, have led the European Commission to adopt a White Paper on a strategy for a future Community Policy for Chemicals (White Paper, 2001).

0045-6535/$ - see front matter 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.chemosphere.2004.08.003


E. Papa et al. / Chemosphere 58 (2005) 559–570

The main objective of this new Chemical Strategy is to ensure a high level of protection for human health and the environment, while ensuring efficient functioning of the internal market and stimulating innovation and competitiveness in the chemical industry. In this context the REACH (Registration, Evaluation and Authorisation of CHemicals) legislation has decreed that basic information for all chemicals marketed in Europe in volumes greater than 1 tonne per year must be available before the end of 2012, and physico-chemical and toxicity data for High Production Volume (HPV) compounds should be available by the end of 2005. However, the production of such a large quantity of experimental data is time-consuming, expensive, and obviously restricted to existing substances (Cronin, 2002; Cronin et al., 2003a,b; Walker, 2003; Walker et al., 2003). In addition, public resistance to animal testing is another important hurdle for the development of laboratory programs (Worth, 2002). The expected high cost of REACH has increased interest in the development and validation of alternative methods to fill existing data gaps. Preference is now being given to the use of rapid, sensitive and low-cost computational models, or short-term toxicity assays, rather than to traditional and expensive experimental approaches (Netzeva et al., 2004). Modelling allows the simulation of a chemical's environmental fate and adverse effects (Devillers, 2001). Among such approaches, QSAR (Quantitative Structure–Activity Relationship) models are a valid choice as they model molecular activity from chemical structure and/or properties (Sabljic, 1991; Basak, 1994; Yin et al., 2002; Gramatica, 2001). QSARs have been applied widely throughout the United States to prioritise untested chemicals for more intensive and costly experimental evaluations.
The EU White Paper also recommends the implementation of QSAR models, and intense work is now under way in regulatory and regulated communities to define which criteria QSAR models must meet to gain regulatory acceptance all over the world (OECD, 2004; Worth et al., 2004; ECVAM, 2004). Esters represent one of the most important classes among the HPV compounds. They are commonly used in the manufacture of plastics, and are found in plastic tubing, floor tiles, furniture, automobile upholstery, insect repellents and cosmetics. The widespread production of these compounds, combined with the fact that some of them (for instance phthalates) are able to migrate from plastic, makes their environmental effects and fate of interest: they tend to be omnipresent in the environment (Parkerton and Konkel, 2000). Some ecotoxicity end-points, such as those for fish, Daphnia and algae, have been widely employed, and have been recommended by the EU-TGD (TGD, 1996) to highlight the hazardous effects esters have on the aquatic environment, where

toxicity is of high concern not just for the individual chemical but mainly for chemical mixtures in the environment. Our group has been involved in two European Projects on this subject (PREDICT, 1999: Prediction and assessment of the aquatic toxicity of mixtures of chemicals (1996–1999) and BEAM, 2003: Bridging Effects assessment of Mixtures to Ecosystem Situation and Regulation (1999–2003)), dealing with QSAR modelling of aquatic toxicity end-points (Gramatica et al., 2001; Vighi et al., 2001; Walter et al., 2002; Vighi et al., 2003). The first aim of the present work was to develop validated QSAR models to predict the toxicity of esters from structure alone. The modelled eco-toxicological end-points were LC50 in fish and EC50 in Daphnia and algae for a selected set of esters. It is reasonable to suppose that most of the studied esters should be narcotic-type chemicals (with a non-specific mode of action). The final, and most important, aim has been to propose a simple and fast procedure for a preliminary ranking, and prioritisation, of esters according to their integrated aquatic toxicity, by modelling their global tendency of toxicity for aquatic organisms using only information on chemical structure. This tendency has been obtained by condensing the data into a Principal Component Analysis-based index. The basic idea in this screening approach is to propose a tool oriented towards concentrating experimental effort onto esters of major concern, or towards the synthesis of esters that are not toxic to the aquatic environment.

2. Methods

2.1. Data set

A list of 74 esters of specific interest to Italian chemical companies was considered in this study. The data set, reported in Table 1, is structurally highly heterogeneous and includes acetates, acrylates, phthalates and more complex esters. The ecotoxicological end-points selected to model the aquatic toxicity of esters are LC50 in Pimephales promelas and EC50 in Daphnia magna and algae (Selenastrum capricornutum and Scenedesmus subspicatus); the data selected are for the same end-points under the same experimental conditions for all the studied esters. Experimental toxicity data (log(1/EC50) or log(1/LC50)) were taken from the literature (Cash and Clements, 1996; Staples et al., 1997; IUCLID, 2000). The data selected in IUCLID relate to tests performed according to OECD and GLP norms. A final matrix of 61 esters with toxicity data (experimental values and predictions verified for reliability) for each studied end-point was used to perform the final ranking. All data are reported in mmol/l and transformed into logarithmic units.
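The unit handling described above can be sketched in a few lines of Python. This is only an illustration, assuming the reported effect concentrations start out in mg/l and are converted with the molecular weight; the function name and the numbers in the usage note are hypothetical and are not taken from Table 1.

```python
import math


def log_inverse_conc(conc_mg_per_l: float, mol_weight_g_per_mol: float) -> float:
    """Return log10(1/C), with C the effect concentration in mmol/l.

    Hypothetical helper: the starting unit (mg/l) is an assumption.
    """
    if conc_mg_per_l <= 0 or mol_weight_g_per_mol <= 0:
        raise ValueError("concentration and molecular weight must be positive")
    # (mg/l) / (g/mol) = mmol/l
    conc_mmol_per_l = conc_mg_per_l / mol_weight_g_per_mol
    # log10(1/C) = -log10(C): higher values mean higher toxicity
    return -math.log10(conc_mmol_per_l)
```

For example, an illustrative ester with a molecular weight of 100 g/mol and an LC50 of 10 mg/l corresponds to 0.1 mmol/l, so `log_inverse_conc(10.0, 100.0)` gives a value of about 1.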

Table 1
Experimental and predicted ecotoxicity data for 74 esters (mmol/l)
Columns: ID; ID PCA; CAS; Name; Exp. Fish (a); Pred. Fish; Exp. Algae (b); Pred. Algae; Exp. Daphnia (b); Pred. Daphnia; ESATIN from PCA; Pred. ESATIN
ID PCA: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Exp. Fish (a): 2.6 0.64 2.4 1.12 2.49 0.67
Pred. Fish: 2.3 0.9 1.5 2.6 1.2 2.6 2 2 1 0.5 2.2 1.9 1.2 1.6 0.8 0.9 0.8 0.9 1 2.1 2 1 0.3 0.8 0.6 1.1 1.1 1.7 2.2 2.3 1.9 0.4 0.3 1.3 2.1 2.3 0.9 0.8
Pred. Algae: 0.44 0.68 0.26 0.75 1.05 1.48 1.45 2.4 0.3 0.1 0.14 0.19 0.72 0.54 0.88 0.28 0.76 0.56 0.69 0.09 0.39 0.37 0.61 0.62 0.81 0.5 0.68 0.68 0.06 0.07 0.15 0.26 0.93 0.27 1.96 0.69 1.04 0.57
Pred. Daphnia: 1.2 0.99 0.23 2.59 0.47 0.45 1.2 2.12 0.89 0.87 0.07 0.76 0.32 1.25 0.62 1.06 0.57 0.04 0.45 1.21 0.58 0.37 0.87 1.45 0.45 1.11 0.62 1.27 0.52 1.89 1.21 0.28 0.24 0.08 0.9 2.55 0.69 0.49
ESATIN from PCA: 2.14 2.96* 0.04* 0.85* 2.06* 3.42* 0.26 0.99* 1.22 0.83 0.94* 1.36* 1.71 0.43 1.54* 0.81 1.21 1.12 1.05* 1.34 1.97* 1.66 0.96*
Pred. ESATIN: 1.64 2.3 0.14 1.67 2.42 2.77 0.01 0.68 0.7 0.77 0.04 2.29 2.04 0.41 1.95 0.77 1.76 0.88 1.08 0.16 2.31 1.8 1.16

ID PCA (continued): 24 25 26 27 28 29 30 31 32 33

ID    CAS           Name
1     00010-06-0    Di-n-butyl isophthalate
2     00079-20-9    Methyl acetate
3     00080-62-6    Methyl methacrylate
4     00084-62-8    Diphenyl phthalate
5     00084-66-2    Diethyl phthalate
6     00084-69-5    Diisobutyl phthalate
7     00084-74-2    Di-n-butyl phthalate
8     00085-68-7    Butyl benzyl phthalate
9     00094-09-7    Ethyl p-aminobenzoate
10    00096-33-3    Methyl acrylate
11    00097-86-9    Isobutyl methacrylate
12    00097-88-1    Butyl methacrylate
13    00102-76-1    Triacetin
14    00103-11-7    2-Ethylhexyl acrylate
15    00105-37-3    Ethyl propionate
16    00105-38-4    Vinyl propionate
17    00105-45-3    Methyl acetoacetate
18    00105-53-3    Diethyl malonate
19    00105-54-4    Ethyl butyrate
20    00105-99-7    Dibutyl adipate
21    00106-63-8    Isobutyl acrylate
22    00107-31-3    Methyl formate
23    00108-05-4    Vinyl acetate
24    00108-21-4    Isopropyl acetate
25    00108-59-8    Dimethyl malonate
26    00108-65-6    1-Methoxy-2-propyl acetate
27    00109-60-4    n-Propyl acetate
28    00110-19-0    Isobutyl acetate
29    00110-27-0    Isopropyl myristate
30    00110-33-8    Dihexyl adipate
31    00110-40-7    Diethyl sebacate
32    00111-15-9    2-Ethoxyethyl acetate
33    00111-55-7    Ethylene glycol diacetate
34    00112-07-2    2-Butoxyethyl acetate
35    00117-81-7    Di-sec-octyl phthalate
36    00117-84-0    Bis(n-octyl) phthalate
37    00118-61-6    Ethyl salicylate
38    00119-36-8    Methyl salicylate

1.14 0.16 0.41 1.91 2.51 1.59 0.65 0.65 0.24 1.02 0.34 0.54 0.1 1.33

0.87

0.04 0.59 0.25 0.5

1.03 1.85 1.79

0.6

0.49

0.23

0.43

1.97 0.5

0.64 1.55 2.34

1.16 0.83* 1.97* 1.09* 1.12* 1.85* 0.07

2.07 0.75 1.77 0.75 0.76 1.52 0.7

0.93

2.86* 2.69 1.03 0.75 0.4* 0.13 (continued on next page)


Table 1 (continued)
Columns: ID; ID PCA; CAS; Name; Exp. Fish (a); Pred. Fish; Exp. Algae (b); Pred. Algae; Exp. Daphnia (b); Pred. Daphnia; ESATIN from PCA; Pred. ESATIN
ID PCA: 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
Pred. Fish: 1.4 1 1.3 0.9 1.1 2.1 0.6 2 1.5 1.4 0.4 1 2.3 1.4 1.2 0.5 1.1 2.4 1.1 1.1 1.1 3.1 1 1 1.8 1.2 1 0.8 1.6 1.8
Pred. Algae: 0.45 0.62 0.64 0.59 0.63 0.23 0.17 0.47 0.28 0.4 0.45 0.12 0.36 0.53 0.58 0.86 0.1 0.33 0.81 0.45 0.59 1.09 0.48 0.64 0.28 0.97 3.29 0.14 0.13 0.51
Pred. Daphnia: 0.08 0.33 0.11 0.45 0.08 4.14 1.06 0.87 0.51 1.41 0.8 0.38 4.66 0.11 2.1 0.8 0.86 2.39 0.64 0.28 0.47 6.17 0.1 2.02 1.39 0.04 0.57 0.62 1.75 0.3
ESATIN from PCA: 0.33* 0.46* 0.8 1.44 0.49 0.54 0.37* 0.23* 2.02* 2.43 0.87 0.44* 1.96* 2.02 0.05*
Pred. ESATIN: 0.37 0.4 0.8 1.4 0.87 0.99 0.15 0.03 1.35 1.85 1.41 1.06 1.34 2.28 0.33

ID PCA (continued): 49 50 51 52 53 54 55 56 57 58

ID    CAS           Name
39    00120-61-6    Dimethyl terephthalate
40    00122-79-2    Phenyl acetate
41    00123-66-0    Ethyl hexanoate
42    00123-86-4    n-Butyl acetate
43    00131-11-3    Dimethyl phthalate
44    00131-17-9    Diallyl phthalate
45    00140-88-5    Ethyl acrylate
46    00141-03-7    Dibutyl succinate
47    00141-28-6    Diethyl adipate
48    00141-32-2    Butyl acrylate
49    00141-78-6    Ethyl acetate
50    00141-97-9    Ethyl acetoacetate
51    00142-22-3    Allyl diglycol carbonate
52    00142-92-7    Hexyl acetate
53    00540-88-5    Tert-butyl acetate
54    00554-12-1    Methyl propionate
55    00619-50-1    Methyl 4-nitrobenzoate
56    00620-67-7    1,2,3-Propanetriyl heptanoate
57    00622-45-7    Cyclohexyl acetate
58    00628-63-7    n-Amyl acetate
59    00687-47-8    Ethyl lactate
60    00693-36-7    Distearyl thiodipropionate
61    00763-69-9    Ethyl 3-ethoxypropionate
62    00818-61-1    2-Hydroxyethyl acrylate
63    00999-61-1    2-Hydroxypropyl acrylate
64    01119-40-0    Dimethyl glutarate
65    01126-46-1    Methyl 4-chlorobenzoate
66    02150-47-2    Methyl 2,4 dihydrobenzoate
67    02499-95-8    Hexyl acrylate
68    02867-47-2    2-Dimethylaminoethyl methacrylate

1.21 0.81

0.76 0.69 0.32


1.36

1.71 1.02 0.42

1.19 0.91 0.7

1.52

0.89

2.04

0.76

0.6* 0.79 1.27

0.55 0.91 1.99

1.06

1.38 1.59

0.82 2.17

1.23 1.94* 1.29 1.04 0.34 1.82 1.06*

0.59 1.63 1.29 1.33 0.75 1.75 1.48

0.05

1.19 0.64 2.14

0.47

The 61 esters used for the definition of the ESATIN (ESter Aquatic Toxicity INdex) are highlighted in bold and numbered in the second column with the ID of PCA. Esters not included in the ESATIN because of unreliable predicted data (outliers or influential chemicals) are shown in italics. Esters used as the training set in the ESATIN model are marked with an asterisk in the "ESATIN from PCA" column.
a LC50 in fish.
b EC50 in Daphnia and algae.

1.79 0.25 0.11

2.2. Molecular descriptors

A total of 1150 molecular descriptors of different kinds were used as input to describe the chemical diversity of the compounds. Molecular descriptors were computed using the software DRAGON (Todeschini et al., 2004). The descriptor typology is:
0D: constitutional (atom and group counts);
1D: functional groups, atom-centred fragments and molecular properties;
2D: topological, walk and path counts, connectivity indices, information indices, various autocorrelations from the molecular graph, BCUTs and eigenvalue-based indices;
3D: Randic molecular profiles from the geometry matrix, geometrical, WHIM (Todeschini and Gramatica, 1997a; Todeschini and Gramatica, 1997b; Todeschini and Gramatica, 1997c) and GETAWAY descriptors (Consonni et al., 2002a; Consonni et al., 2002b).
The 3D structures necessary for descriptor calculation, containing information on atom and bond types, connectivity, partial charges and atomic spatial coordinates relative to the minimum energy conformation of the molecule, were minimised by geometry optimisation using the molecular mechanics method of Allinger (MM+) in HYPERCHEM (2002). Molecular descriptor meanings and their calculation procedure are summarised in the software DRAGON, and explained in detail, with related literature references, in the Handbook of Molecular Descriptors by Todeschini and Consonni (2000).

0.68 2.49 0.58 0.47 2.87 2.15 0.34 0.73 0.02 1.17 1 1.4 1.4 1.3 2.8 1.17 2.02 1.56 1.4 1.2

1.75 0.17* 0.33

Methyl 2,5-dichlorobenzoate
Cyclohexyl acrylate
Dimethyl nitroterephthalate
Dimethyl 2-aminoterephthalate
2,2'-Thiodiethyl bis[3-(3,5-di-tert-butyl-4-hydroxyphenyl) propionate]
Methyl 4-chloro-2-nitrobenzoate

0.9

0.9

2.74

1.08

2.3. Chemometric methods

Data exploration and ranking by Principal Component Analysis were performed in SCAN (1995) on autoscaled data. QSARs were developed by multiple linear regression (MLR) using the Ordinary Least Squares (OLS) method; variable selection was performed by the GA-VSS (Genetic Algorithm-Variable Subset Selection) method (Leardi et al., 2003) in order to identify the variables most relevant in modelling the different responses. Both procedures were implemented in the MOBY DIGS package (Todeschini, 2002). The quality of the MLR models was defined by maximising the cross-validated $R^2$ ($Q^2$, leave-one-out) and applying the QUIK rule (Todeschini et al., 1999): only models with a global correlation of the [X + y] block ($K_{XY}$) greater than the global correlation of the X block ($K_{XX}$) were accepted (X being the molecular descriptors and y the response variable). The robustness of the proposed models and their predictivity are supported by the stability of $Q^2$ in the leave-many-out (LMO) procedure, strongly recommended for QSAR modelling (Wold and Eriksson, 1995), and by response permutation testing (Y scrambling). Both procedures were performed by MOBY DIGS. The leave-many-out cross-validation was obtained by randomly

02905-69-3 03066-71-5 05292-45-5 05372-81-6 41484-35-9 69 70 71 72 73 59 60 61

74

42087-80-9


leaving out 40–50% of the training compounds, with a maximum of 5000 iterations. Y scrambling was performed by response scrambling with a maximum of 300 iterations. We provided evidence that the proposed models are well founded, and not just the result of chance correlation, by obtaining new models on randomised responses with significantly lower $R^2$ and $Q^2$ than the original models. A final validation procedure was performed, except for the algae model, by evaluating each model's external predictive power on a selected external test set. This feature is assessed by $Q^2_{\mathrm{EXT}} = 1 - \mathrm{PRESS}/\mathrm{SD}$, where PRESS is the sum of the squared differences between the measured response and the predicted values for each molecule in the test set, and SD is the sum of the squared deviations between the measured response for each molecule in the test set and the mean measured value of the training set. External validations were performed on a validation set derived from the splitting of the original data set by the D-optimal Experimental Design procedure, applying the software DOLPHIN (Marengo and Todeschini, 1992; Todeschini and Mauri, 2000). The Standard Deviation Error in Prediction (SDEP), Standard Deviation Error in Calculation (SDEC), Standard Error of Estimation ($s$), $K_{XX}$ and $K_{XY}$ are also reported for each model, together with the coefficient of determination ($R^2$) and the coefficients of internal and external validation ($Q^2$, $Q^2_{\mathrm{LMO}}$ and $Q^2_{\mathrm{EXT}}$). The presence of outliers (i.e. compounds with cross-validated standardised residuals greater than 2.5 standard deviation units) and of chemicals very influential in determining model parameters (i.e. compounds with a high leverage value ($h$), greater than $3p'/n$ (Atkinson, 1985), where $p'$ is the number of model variables plus one and $n$ is the number of objects used to calculate the model) was verified by the Williams plot in SCAN.
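The validation quantities defined above can be illustrated with a minimal NumPy sketch. It is not the MOBY DIGS or SCAN implementation: the function names are hypothetical, the leave-one-out $Q^2$ uses the standard OLS shortcut (residual divided by one minus leverage) as an assumption about how it may be computed, and $Q^2_{\mathrm{EXT}}$ and the $3p'/n$ leverage cut-off follow the definitions given in the text.

```python
import numpy as np


def design_matrix(X):
    """Prepend an intercept column to the descriptor matrix."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])


def fit_ols(X, y):
    """Ordinary least squares coefficients (intercept first)."""
    coef, *_ = np.linalg.lstsq(design_matrix(X), np.asarray(y, dtype=float), rcond=None)
    return coef


def predict(coef, X):
    """Predicted responses for the descriptor matrix X."""
    return design_matrix(X) @ coef


def leverages(X_train, X_query):
    """Leverage h of each query object with respect to the training set."""
    A = design_matrix(X_train)
    Q = design_matrix(X_query)
    core = np.linalg.pinv(A.T @ A)
    # h_i = q_i^T (A^T A)^-1 q_i for every row q_i of Q
    return np.einsum("ij,jk,ik->i", Q, core, Q)


def leverage_cutoff(n_train, n_descriptors):
    """Cut-off 3p'/n, with p' the number of model variables plus one."""
    return 3.0 * (n_descriptors + 1) / n_train


def q2_loo(X, y):
    """Leave-one-out Q2 via the exact OLS identity e_i / (1 - h_i)."""
    y = np.asarray(y, dtype=float)
    residuals = y - predict(fit_ols(X, y), X)
    h = leverages(X, X)
    press = np.sum((residuals / (1.0 - h)) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def q2_ext(y_test, y_pred_test, y_train_mean):
    """Q2_EXT = 1 - PRESS/SD, with SD taken around the training-set mean."""
    y_test = np.asarray(y_test, dtype=float)
    press = np.sum((y_test - np.asarray(y_pred_test, dtype=float)) ** 2)
    sd = np.sum((y_test - y_train_mean) ** 2)
    return 1.0 - press / sd
```

With 24 training esters and two descriptors, `leverage_cutoff(24, 2)` gives 0.375; a compound whose leverage exceeds this value would be flagged as influential in a Williams plot.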

3. Results and discussion

The first step of this study was to develop aquatic toxicity QSARs to fill data gaps in the studied data set of 74 esters, of specific interest to Italian chemical companies, allowing the subsequent application of multivariate analysis by Principal Component Analysis (PCA). The next step, and the principal aim of this work, is the proposal of a simple approach that uses only theoretical molecular descriptors to screen/rank and prioritise esters according to their integrated aquatic toxicity, by modelling the score of the first Principal Component (PC1).

3.1. Toxicity QSARs

Owing to the very limited availability of experimental data for the studied esters (just 30

data found in the literature for fish, 29 for Daphnia and 11 for algae), and the evident difficulty in obtaining new data in reasonable time, new QSAR models have been developed. The input data are selected mainly from the IUCLID database, which should be a fairly readily available and complete database for the EINECS chemicals. All the QSAR models proposed here, calculated using Multiple Linear Regression by OLS, are based on a statistical approach starting from a wide set of different molecular descriptors, theoretically derived from the molecular structure itself and taking into account its various features. Genetic Algorithms were applied as a variable selection strategy in order to select, from among all the calculated descriptors, only the combinations of descriptors that yield the models with the highest predictive power. An advantage of the exclusive use of theoretical descriptors is that they are free of the uncertainty of experimental measurements, are calculable in a homogeneous manner by a defined software, and are available for not yet synthesised compounds. The models were always evaluated for performance: stability and internal predictivity were verified by internal validation (leave-one-out and leave-many-out) (Golbraikh and Tropsha, 2002), by permutation of the response (Y-scrambling) and by external validation (Tropsha et al., 2003) when the size of the original data set allowed it to be split into training and validation sets. In fact, because of the reduced number of experimental data available for algae toxicity (just 11 experimental data), external validation was done only for the Daphnia and fish end-points. The splitting of the original data sets was performed by the D-optimal Experimental Design procedure, on the basis of complete structural similarity information obtained from all the molecular descriptors used, and also taking into account the toxicity responses.
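The permutation test mentioned above (Y-scrambling) can be sketched as follows. This is a generic illustration rather than the MOBY DIGS procedure: the function names are hypothetical, the default of 300 iterations mirrors the maximum number of iterations stated in Section 2.3, and only $R^2$ is tracked here for brevity.

```python
import numpy as np


def r2_ols(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    residuals = y - Xd @ coef
    return 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)


def y_scrambling(X, y, n_iter=300, seed=0):
    """R2 values obtained after randomly permuting the response vector."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Each iteration refits the same descriptors to a shuffled response.
    return np.array([r2_ols(X, rng.permutation(y)) for _ in range(n_iter)])
```

If the scrambled $R^2$ values are clearly lower than the $R^2$ of the model fitted on the true responses, the original correlation is unlikely to be due to chance.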
Experimental design provides a strategy for selecting the most informative molecular structures in a data set, and therefore ensures that the training and validation sets have well-balanced structural diversity and are also representative of the entire range of biological responses. The influence of all the studied compounds on defining model parameters is assessed by leverage values, also calculated to check their distances from the model experimental space (Atkinson, 1985). The leverage approach can also be applied to verify each model's applicability to new chemicals with regard to the chemical domain of the training chemicals (Eriksson et al., 2003; Tropsha et al., 2003). The best models obtained to estimate LC50 in fish and EC50 in Daphnia and algae are reported in Eqs. (1)–(3). Descriptors are written in decreasing order of significance, based on their standardised regression coefficients:


Log 1/LC50 in Pimephales promelas: intercept 4.16, MATS4v 2.03, REIG 3.35 (coefficient magnitudes; signs not legible)   (1)
$n_{\mathrm{training}} = 24$; $n_{\mathrm{test}} = 6$; $R^2 = 83.9\%$; $R^2_{\mathrm{adj}} = 82.3\%$; $Q^2_{\mathrm{LOO}} = 78.5\%$; $Q^2_{\mathrm{LMO50\%}} = 73.8\%$; $Q^2_{\mathrm{EXT}} = 71.5\%$; SDEC = 0.3; SDEP = 0.4; $K_{XX} = 41\%$; $K_{XY} = 65.6\%$; $s = 0.35$

Molecular descriptors, selected by Genetic Algorithms, are MATS4v, an autocorrelation index weighted by van der Waals volumes, and REIG, a GETAWAY index with 3D dimensional features of selected chemicals. Thus, the dimensional aspects, condensed differently in the selected descriptors, appear the most relevant in modelling fish toxicity.

Log 1/EC50 in Daphnia: intercept 0.193, TIC0 5.39E 02, nCp 0.82, n=CH2 0.94 (coefficient magnitudes; signs, including that of the TIC0 exponent, not legible)   (2)
$n_{\mathrm{training}} = 24$; $n_{\mathrm{test}} = 5$; $R^2 = 87.9\%$; $R^2_{\mathrm{adj}} = 86.0\%$; $Q^2_{\mathrm{LOO}} = 83.1\%$; $Q^2_{\mathrm{LMO50\%}} = 79.4\%$; $Q^2_{\mathrm{EXT}} = 79\%$; SDEC = 0.4; SDEP = 0.4; $K_{XX} = 39.4\%$; $K_{XY} = 47.5\%$; $s = 0.47$

TIC0 is a topological index counting the total information about neighbourhood symmetry. The other descriptors give information about functional groups heavily involved in response modelling: in particular, nCp (number of total primary C (sp3)) could be an indicator of both molecule dimension and shape; n=CH2 is a counter of double bonds and could show the relevance of reactivity sites in the structure. The regression line of the Daphnia toxicity model (2) is reported in Fig. 1: the 2.5σ interval is reported as dotted lines.

Log 1/EC50 in algae: intercept 1.04, DISPp 3.5, H8u 0.65 (coefficient magnitudes; signs not legible)   (3)
$n_{\mathrm{obj}} = 11$; $R^2 = 95.9\%$; $R^2_{\mathrm{adj}} = 94.9\%$; $Q^2_{\mathrm{LOO}} = 92.3\%$; $Q^2_{\mathrm{LMO40\%}} = 84.5\%$; SDEC = 0.13; SDEP = 0.18; $K_{XX} = 33.1\%$; $K_{XY} = 49.1\%$; $s = 0.16$

The most important descriptor is a geometrical descriptor weighted by atomic polarizability (DISPp); the second is a 3D-GETAWAY descriptor (H8u) representing structural three-dimensional features. In general, these models indicate the importance of structural descriptors related to molecule size, shape profile, symmetry and orientation in the x, y, z space, together with intramolecular long-distance interactions and reactivity parameters such as the polarizability or the presence of double bonds. The robustness and internal predictivity of these three models were checked by applying the leave-many-out procedure, leaving 40–50% of the training objects out. The 40% internal perturbation, applied for the $Q^2_{\mathrm{LMO}}$ calculation in the algae toxicity model (3), highlights that the use of highly reduced structural information in the experimental training set could influence the robustness and internal predictivity of the model ($\Delta(Q^2_{\mathrm{LOO}} - Q^2_{\mathrm{LMO}}) = 7.78\%$). In general, in this study, the presence of relatively low internal stability, or not excellent performance, could be explained by the heterogeneous information

Fig. 1. Regression line for the externally validated model for log 1/EC50 in Daphnia (in mmol/l): predicted values (y-axis) plotted against experimental values (x-axis). The response values for training and test set chemicals are labelled differently. The dotted lines indicate the 2.5σ interval.


of the experimental data set derived from IUCLID: it is well known that data included in IUCLID, derived from different sources and of different quality, cannot be considered gold data (Cronin and Schultz, 2003). However, even if their variability could affect the final performance of the derived QSAR models, in this paper we hope to verify whether the collection of data in IUCLID could be usefully applied at least to prioritise the studied esters. Satisfactory internal stability can be verified for the other two models, calculated on less reduced, but still small, sets of 24 experimental data ($\Delta(Q^2_{\mathrm{LOO}} - Q^2_{\mathrm{LMO}})$ ranges from 3.68% in Daphnia to 4.67% in fish). The experimental information in these two data sets is sufficient to allow an external validation, the only way to confirm the real predictive power of these models: in fact, it is possible that QSARs with apparently good internal stability and predictive power present very low, or in some cases negative, $Q^2_{\mathrm{EXT}}$ values (Tropsha et al., 2003). Models (1) and (2) also show satisfactory performance in external validation ($Q^2_{\mathrm{EXT}} = 71.5\%$ for fish and $Q^2_{\mathrm{EXT}} = 79\%$ for Daphnia), confirming their stability and prediction ability also for new chemicals. An analysis of the Williams plots of these three models reveals that the training sets contain no chemicals that are particularly influential in determining the model parameters, while there is just one evident outlier (n-propyl acetate) to be highlighted in the fish model (Eq. (1)). In this case n-propyl acetate presents a cross-validated standardised residual of three standard deviation units. Two explanations are possible: either the experimental input data were wrong, or the descriptors selected in the model failed to capture some peculiar feature present in this molecule and absent in the others. At this stage, owing to the simple structure of the outlier, the first hypothesis could be the right one. Therefore, this chemical will not be included in the following analysis.
Another interesting point to highlight is the comparison with the EPIWIN (EPI Suite™, 2000) model results. The RMS (Residual Mean Square) value of our linear model for fish is 0.31, a value lower than, but still comparable with, that of the EPIWIN model (0.36). In contrast, on comparing the residuals for the algae models, the EPIWIN model shows decidedly poorer predictive performance (RMS = 3.47) than our ad hoc model (RMS = 0.12). The high value of RMS for the EPIWIN model could be interpreted as a difficulty for this general model to include some of the studied esters in its chemical domain; in this case a local model, like the ad hoc one developed here, appears a better solution. Our local model, with a clear definition of chemical domain, produces only interpolated, and not extrapolated, predictions for the esters selected in this study. These results appear particularly satisfactory for the quality of our ad hoc models, especially considering that the EPIWIN models were obtained on considerably

bigger training sets than ours, and thus with much more information available. No comparisons can be provided for toxicity in Daphnia, since the EPIWIN software does not consider these EC50 values.

3.2. ESter Aquatic Toxicity INdex (ESATIN)

Data obtained by applying the QSARs developed above (Eqs. (1)–(3)) were used to fill experimental data gaps in the original matrix. The final data set is reduced to 61 esters (from the original 74 chemicals), since only compounds with reliable predicted values (neither outliers nor highly influential chemicals) for all three selected end-points were included. In Table 1 the selected esters with all experimental data available or reliable predictions are highlighted in bold. Principal Component Analysis (PCA) was applied to the three sets of ecotoxicity data of this filled matrix of 61 esters, allowing a fast ranking of the studied chemicals according to their integrated aquatic toxicity. PCA is a multivariate explorative technique that, by linear combination of the studied properties, condenses their information into new informative axes named Principal Components. The score plot (coordinates of objects on the new variables, PC1 and PC2) gives information about similarity of chemical behaviour, while the loading plot (weights of the original variables in the PCs) shows correlation among the original variables. A biplot (a combined plot of scores and loadings) gives condensed information. The integrated toxicity trend highlighted in this study is represented by the order of chemicals along the first component, expressed numerically by PC1 scores. Fig. 2 shows the projections of the 61 compounds (each represented by a point) in the space defined by the first two Principal Components (PC1 and PC2). These two principal components synthesise most of the information contained in the data: the cumulative explained variance is 85% and the first component alone (PC1 on the x-axis) provides most of the information (65.3%).
The loading plot (the lines in the figure) reveals the relevance of each variable in each of the first two principal components. It is interesting to note that the first component, along with the variables grouped in the same direction, tends to discriminate between the relatively more toxic (on the right) and the less toxic (on the left) chemicals for all the considered aquatic organisms, while PC2 appears to differentiate between the compounds more toxic for both Daphnia and algae, in the lower part, and those more toxic for fish, in the upper part of the graph. All the variables (the studied end-points) are oriented in the same direction along the most informative principal component; this is evidence of their correlation, as was previously highlighted in more detail for some species (Dimitrov et al., 2000; Dimitrov et al., 2003; Dimitrov et al., 2004).
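The construction of a PC1-based index described above can be sketched with NumPy. This is a generic PCA on autoscaled data, not the SCAN implementation: the function names are hypothetical, and the sign convention (flipping PC1 so that its loadings point towards higher toxicity) is an assumption added here because the sign of a principal component is arbitrary.

```python
import numpy as np


def autoscale(data):
    """Centre each column and scale it to unit (sample) standard deviation."""
    data = np.asarray(data, dtype=float)
    return (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)


def pca_scores(data):
    """Return scores, loadings and explained-variance fractions of a PCA."""
    Z = autoscale(data)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U * S          # coordinates of the objects on the PCs
    loadings = Vt.T         # weights of the original variables in the PCs
    explained = S ** 2 / np.sum(S ** 2)
    # Assumed convention: orient PC1 so that its loadings sum to a positive
    # value, i.e. a higher PC1 score means higher overall toxicity.
    if loadings[:, 0].sum() < 0:
        scores[:, 0] *= -1.0
        loadings[:, 0] *= -1.0
    return scores, loadings, explained
```

Given a matrix with one row per ester and one column per end-point (log 1/LC50 in fish, log 1/EC50 in Daphnia and in algae), `scores[:, 0]` would play the role of the integrated index, and `explained[0]` the fraction of variance carried by PC1.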

Fig. 2. ESATIN (ESter Aquatic Toxicity INdex) calculated by Principal Component Analysis for 61 esters: biplot of PC1 (65.3%, x-axis, ESATIN) against PC2 (19.7%, y-axis). All the original variables (toxicity in fish, Daphnia and algae, in mmol/l) are represented by the loadings (the lines). The cumulative explained variance of this PCA is 85% of the total variance; the first Principal Component (PC1: ESATIN) accounts for 65.3% of data variability.

Therefore, since the first principal component alone synthesises most of the information included in the toxicity data, and all the loadings are oriented along the same direction, this PC1 score is proposed as an ESter Aquatic Toxicity INdex (ESATIN) that ranks the esters according to their global aquatic toxicity tendency. Some chemicals show extreme behaviour, lying towards the extreme sides of the graph: in particular, on the right, di-n-butyl phthalate (6) and bis-n-octyl phthalate (31) appear as the most toxic chemicals, simultaneously showing the highest values of toxicity in Daphnia, fish and algae; in contrast, on the left side of Fig. 2, methyl acetate (2) and ethyl acetate (43) show the lowest global toxicity and, in particular, low toxicity in fish. In spite of the high structural heterogeneity characterising this data set, it is possible to identify three structurally dominant groups: the phthalates generally present high toxic potential, since they are mainly located on the right side of the graph; the acrylates show medium–high aquatic toxicity, since they are located predominantly in the middle-right side of the graph; the acetates, in contrast, are the group of least concern, lying mostly on the left side of the graph, in accordance with their low global aquatic toxicity tendency. In order to make this integrated toxicity index applicable also to new chemicals, even those not yet synthesised, starting only from their molecular structure, the ranking proposed above, derived from the linear combination by PCA of the three ecotoxicity end-points, was also modelled by theoretical molecular descriptors. Such a model was calculated by the same procedure as described above. The best selected model was validated both

internally (by cross-validation, LOO and LMO) and externally. Also in this case, the splitting of the data set was realised by D-optimal Experimental design. The most signicant three principal components of each group of DRAGON molecular descriptors were used to select the subset. The relatively high number of chemicals in this case allows a stronger splitting: 31 chemicals were chosen as the training set, the remaining ones (30 chemicals) were used as the test set for external validation. The best model for the prediction of this ESter Aquatic Toxicity INdex is: ESATIN 5:60 13:87 SHP2 1:60 n@CH2 3:06 DISPp ntraining 31; Q2 LOO 87:3%; Q2 EXT 86:6%; K XX 32:5%; ntest 30; R2 adj: 89:8%; SDEC 0:45; K XY 51:4%; R2 90:8%; Q2 LMO50% 84:9%; SDEP 0:5; s 0:48 4
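The validation statistics quoted above can be illustrated with a small sketch. It fits an MLR model by ordinary least squares and computes R², leave-one-out Q², SDEC and SDEP. The descriptor matrix and the response are synthetic, and the formulas used (Q² = 1 − PRESS/TSS, with SDEC and SDEP as root-mean-square errors in fitting and in prediction) are the common chemometric definitions, assumed here rather than taken from this section. The function names are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares coefficients for y = b0 + X b (intercept first)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def ols_predict(coef, X):
    """Predicted response for one object (1-D) or several objects (2-D)."""
    return coef[0] + np.asarray(X) @ coef[1:]

def validation_stats(X, y):
    """R2, Q2_LOO, SDEC and SDEP for an OLS model."""
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)
    # Fitting: residuals of the model built on all objects.
    rss = np.sum((y - ols_predict(ols_fit(X, y), X)) ** 2)
    # Leave-one-out: predict each object from a model built without it.
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = ols_fit(X[keep], y[keep])
        press += (y[i] - ols_predict(coef, X[i])) ** 2
    return {
        "R2": 1 - rss / tss,
        "Q2_LOO": 1 - press / tss,
        "SDEC": np.sqrt(rss / n),
        "SDEP": np.sqrt(press / n),
    }

# Synthetic example: 31 objects and 3 descriptors (placeholders only).
rng = np.random.default_rng(0)
X = rng.normal(size=(31, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=31)
stats = validation_stats(X, y)
```

Because each left-out object is predicted by a model that has not seen it, Q²(LOO) cannot exceed R² and SDEP cannot fall below SDEC; a small gap between the two pairs suggests that the model is not overfitted.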

The fitting ($R^2$) and predictive parameters ($Q^2_{\mathrm{LOO}}$, $Q^2_{\mathrm{LMO}}$) of the model appear good. The external predictive power is confirmed by a high $Q^2_{\mathrm{EXT}}$ value (86.6%), which indicates that the model can also be applied to predict the global aquatic toxicity of unknown esters belonging to its chemical domain. This result is even more relevant considering that the model was externally validated on a number of chemicals nearly equal to that of the training set. Analysis of the Williams plot reveals no outliers or highly influential chemicals in either the training set or the test set.
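A Williams plot combines leverages (hat values) with standardised residuals. The following minimal sketch computes both quantities for an OLS model. The warning leverage h* = 3p'/n (with p' the number of model parameters including the intercept) and the 2.5σ residual limit are assumed cut-offs, the data are synthetic, and the function name `williams_points` is hypothetical.

```python
import numpy as np

def williams_points(X_train, y_train, X_new=None, y_new=None):
    """Leverages and standardised residuals for a Williams plot.

    If X_new and y_new are given, they are evaluated against the
    model fitted on the training data.
    """
    A = np.column_stack([np.ones(len(y_train)), X_train])
    n, p = A.shape                      # p counts the intercept
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    XtX_inv = np.linalg.inv(A.T @ A)
    # Residual standard deviation of the training fit.
    s = np.sqrt(np.sum((y_train - A @ coef) ** 2) / (n - p))
    if X_new is None:
        B, yb = A, y_train
    else:
        B = np.column_stack([np.ones(len(y_new)), X_new])
        yb = y_new
    # Leverage h_i = x_i (X'X)^-1 x_i' for each row of B.
    h = np.einsum("ij,jk,ik->i", B, XtX_inv, B)
    std_res = (yb - B @ coef) / s
    h_star = 3 * p / n                  # assumed warning leverage
    return h, std_res, h_star

# Synthetic training set: 31 objects and 3 descriptors.
rng = np.random.default_rng(1)
X = rng.normal(size=(31, 3))
y = 0.5 + X @ np.array([1.5, 0.8, -1.2]) + rng.normal(scale=0.4, size=31)
h, std_res, h_star = williams_points(X, y)
outliers = np.abs(std_res) > 2.5        # response outliers
influential = h > h_star                # structurally influential objects
```

Objects flagged in `outliers` are poorly predicted in the response, while those flagged in `influential` lie far from the bulk of the training set in descriptor space; the two checks are independent.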

Fig. 3. Regression line for the externally validated ESATIN model (x-axis: ESATIN; y-axis: predicted ESATIN). The ESATIN values for training and test set chemicals are differently labelled. The dotted lines indicate the 2.5σ interval.

Fig. 3 shows the regression line of the model proposed above; the 2.5σ interval is reported as dotted lines. The molecular descriptors selected for this model are SHP2, DISPp and n=CH2. The first and most influential descriptor is the Randic index SHP2: it describes the border geometry of a chemical and reveals the influence of the shape profile on the trend modelling. The remaining descriptors, already selected in the two previous models (2) and (3), show the relevance of the reactivity sites (such as double bonds) and, again, of the geometrical information in modelling the aquatic toxicity of esters. This also confirms the effectiveness of Genetic Algorithms in selecting variables: among the hundreds of molecular descriptors, it is always the same or similar variables that are selected as being the most related to the response.

The practical output of this ESATIN model for new chemicals is the possibility of inserting them into the PCA graph of Fig. 2 and thus of obtaining some indication of their tendency towards cumulative toxicity to aquatic organisms. In fact, the principal aim of this paper is the proposal of an integrated index of ester aquatic toxicity that could be usefully applied, mainly in a screening approach, for the prioritisation of esters of greater environmental concern. The distinctive value of this approach lies in its applicability to new chemicals: even before the synthesis of esters, it could highlight potential toxicity for the aquatic environment, since the proposed ESATIN model is applicable starting only from designed chemical structures. In practice, if an ester is predicted to have a high ESATIN value (at this stage it is really difficult to propose a threshold value), its synthesis must be avoided if it has not yet been synthesised or, if it already exists, its use must be strongly discouraged. Certainly, chemicals with a high ESATIN value must be prioritised for experimental tests, while QSAR predictions of the proposed models on single end-points could be applied to chemicals with a low ESATIN value, with reasonable reliability when the domain of applicability is verified for the chemical of interest.
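The screening use described above can be sketched as a simple triage step: predictions outside the applicability domain are set aside, and the remaining chemicals are ranked by predicted index value. Everything in this example is hypothetical, including the class and function names, the candidate names and values, the leverage cut-off `h_star` and the choice of testing the top `top_k` chemicals; no threshold on the index value is proposed in the text.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    name: str
    esatin: float      # predicted index value
    leverage: float    # hat value of the chemical in the model space

def prioritise(preds, h_star, top_k):
    """Rank in-domain predictions by index value, highest first.

    Chemicals with leverage above h_star are set aside as outside the
    applicability domain, because their predictions are extrapolations.
    """
    in_domain = [p for p in preds if p.leverage <= h_star]
    out_domain = [p for p in preds if p.leverage > h_star]
    ranked = sorted(in_domain, key=lambda p: p.esatin, reverse=True)
    return ranked[:top_k], ranked[top_k:], out_domain

# Hypothetical candidates; names and numbers are placeholders.
candidates = [
    Prediction("ester A", 2.4, 0.10),
    Prediction("ester B", -1.3, 0.08),
    Prediction("ester C", 0.7, 0.55),
    Prediction("ester D", 1.1, 0.21),
]
to_test, lower_concern, unreliable = prioritise(candidates, h_star=0.39, top_k=2)
```

Here `to_test` holds the chemicals to prioritise for experimental testing, `lower_concern` those for which single end-point QSAR predictions might suffice, and `unreliable` those whose predictions should not be used without further checks.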

4. Conclusions

For the production of priority lists of dangerous chemicals it is necessary to have reliable toxicity and ecotoxicity data. Unfortunately, the availability of experimentally obtained data is too limited (even for the most common chemicals) to be useful for screening purposes. QSAR modelling is an alternative method (now also requested by the EU White Paper) applicable for filling data gaps, ranking chemicals and thus producing priority lists. Validated QSAR models are developed here on three of the most common aquatic toxicity end-points in order to fill the huge data gaps of a highly structurally heterogeneous ester data set. These predicted data are used to fill the original limited experimental matrix, and then to rank the studied esters according to their aquatic toxicity tendency by Principal Component Analysis. The integrated aquatic toxicity trend obtained, represented by the PC1 scores and proposed here as an ESter Aquatic Toxicity INdex (ESATIN), is modelled by a few theoretical molecular descriptors, with the aim of applying it also to unknown esters, starting only from their structural features.

All the proposed QSARs are reliable, having been strongly validated and built with theoretical molecular descriptors calculated by defined software: reliable predicted toxicity data can thus be obtained for esters in the aquatic environment. The ESATIN model (Eq. (4)) allows fast screening and ranking of esters according to their global aquatic toxicity, considering only their molecular structure. Certainly, care must be taken to use only reliable predictions: it is necessary to verify that a new chemical belongs to the chemical domain of the training set used for model development. In addition, it must be remembered that risk-based prioritisation and risk assessment tools can introduce marked uncertainty in the data. However, this approach is particularly useful for first-stage prioritisation concerning the global aquatic toxicity of chemicals that already exist but lack experimental data, or that have not even been synthesised (clearly this is the only method applicable to the latter compounds!). This simple and rapid priority-setting process can be used as a basis for selecting existing esters of higher concern that need more detailed and urgent assessment, or to orient the synthesis of new, safer chemicals.

Acknowledgements

We wish to thank Prof. Roberto Todeschini for the software and Federchimica for the fellowship granted to Francesca Battaini. Financial support by the BEAM Program of the Commission of the European Communities (EVK1-CT-1999-00012) is gratefully acknowledged.

References
Atkinson, A.C., 1985. Plots, Transformations and Regression. Clarendon Press, Oxford (UK), p. 282.
Basak, S.C., 1994. Molecular similarity and risk assessment: analog selection and property estimation using graph invariant. SAR QSAR Environ. Res. 2, 289–307.
BEAM (Bridging Effects Assessment of Mixtures to Ecosystem Situation and Regulation), European Research Project, 2003, Contract N. EVK1-CT-99-00012.
Cash, G.G., Clements, R.G., 1996. Comparison of structure–activity relationships derived from two methods for estimating octanol–water partition coefficients. SAR QSAR Environ. Res. 5, 113–124.
Consonni, V., Todeschini, R., Pavan, M., 2002a. Structure/response correlation and similarity/diversity analysis by GETAWAY descriptors. Part 1. Theory of the novel 3D molecular descriptor. J. Chem. Inf. Comput. Sci. 42, 693–705.
Consonni, V., Todeschini, R., Pavan, M., Gramatica, P., 2002b. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. Part 2. Application of the novel 3D-molecular descriptors to QSAR/QSPR studies. J. Chem. Inf. Comput. Sci. 42, 693–705.
Cronin, M.T.D., 2002. The current status and future applicability of quantitative structure–activity relationships (QSARs) in predicting toxicity. ATLA 30 (Suppl. 2), 81–84.
Cronin, M.T.D., Schultz, W.T., 2003. Pitfalls in QSAR. J. Mol. Struct. 622, 39–51.
Cronin, M.T.D., Walker, J.D., Jaworska, J.S., Comber, M.H.I., Watts, C.D., Worth, A.P., 2003a. Use of QSARs in international decision making frameworks to predict ecologic effects and environmental fate of chemical substances. Environ. Health. Perspect. 111, 1376–1390.
Cronin, M.T.D., Jaworska, J.S., Walker, J.D., Comber, M.H.I., Watts, C.D., Worth, A.P., 2003b. Use of QSARs in international decision making frameworks to predict health effects of chemical substances. Environ. Health. Perspect. 111, 1391–1401.
Devillers, J., 2001. QSAR modeling of large heterogeneous sets of molecules. SAR QSAR Environ. Res. 12, 515–528.
Dimitrov, S.D., Mekenyan, O.G., Schultz, T.W., 2000. Interspecies modeling of narcotics toxicity to aquatic animals. Bull. Environ. Contam. Toxicol. 65, 399–406.
Dimitrov, S.D., Mekenyan, O.G., Sinks, G.D., Schultz, T.W., 2003. Global modelling of narcotic chemicals: ciliate and fish toxicity. J. Mol. Struct. (THEOCHEM) 622, 63–70.
Dimitrov, S.D., Koleva, Y., Schultz, T.W., Walker, J.D., Mekenyan, O.G., 2004. Interspecies quantitative structure–activity relationship model for aldehydes: aquatic toxicity. Environ. Toxicol. Chem. 23, 463–470.
ECVAM (European Center for the Validation of Alternative Methods), 2004: http://ecvam.jrc.it/qsar.
EPI Suite™ version 3.10, 2000. U.S. Environmental Protection Agency, http://www.epa.gov/opptintr/exposure/docs/episuitedl.htm.
Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., Gramatica, P., 2003. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ. Health Perspect. 111 (10), 1361–1375.
Golbraikh, A., Tropsha, A., 2002. Beware of q²! J. Mol. Graph. Mod. 20, 269–276.
Gough, J.D., Hall, L.H., 1999. Modeling the toxicity of amide herbicides using the electrotopological state. Environ. Toxicol. Chem. 18, 1069–1075.
Gramatica, P., 2001. QSAR approach to the evaluation of chemicals. Chimica Oggi 9, 18–24.
Gramatica, P., Vighi, M., Consolaro, F., Todeschini, R., Finizio, A., Faust, M., 2001. QSAR approach for the selection of congeneric compounds with a similar toxicological mode of action. Chemosphere 42, 873–883.
HYPERCHEM, 2002. Rel. 7.03 for Windows, Autodesk, Inc., Sausalito, CA, USA.
IUCLID CD-ROM, 2000. European Commission Joint Research Centre.
Leardi, R., Boggia, R., Terrile, M., 2003. Genetic algorithms as a strategy for feature selection. J. Chemom. 6, 267–281.
Marengo, E., Todeschini, R., 1992. A new algorithm for optimal distance based experimental design. Chemom. Int. Lab. Sys. 16, 37–44.
Netzeva, T., Dearden, J., Edwards, R., Worgan, A.D.P., Cronin, M.T.D., 2004. QSAR analysis of the toxicity of aromatic compounds to Chlorella vulgaris in a novel short term assay. J. Chem. Inf. Comput. Sci. 44, 258–265.
OECD (Organisation for Economic Co-operation and Development), 2004: http://www.oecd.org/env.
Parkerton, T.F., Konkel, W.J., 2000. Application of quantitative structure–activity relationships for assessing the aquatic toxicity of phthalate esters. Ecotox. Environ. Safety 45, 61–78.
PREDICT (Prediction and assessment of the aquatic toxicity of mixtures of chemicals), European Research Project, 1999. Contract N. ENV4-CT96-0319.
Sabljic, A., 1991. Chemical topology and ecotoxicology. Sci. Total Environ. 109/110, 197–220.
Sabljic, A., Piver, W.T., 1992. Quantitative modelling of environmental fate and impact of commercial chemicals. Environ. Toxicol. Chem. 11, 961–972.
SCAN (Software for Chemometric Analysis), 1995. Rel. 1.1 for Windows, Minitab, USA.
Shao, J., 1993. Linear model selection by cross-validation. J. Am. Stat. Assoc. 88, 486–494.
Staples, C.A., Adams, W.J., Parkerton, T.F., Gorsuch, J.W., Biddinger, G.R., Reinert, K.H., 1997. Aquatic toxicity of eighteen phthalate esters. Environ. Toxicol. Chem. 16, 875–891.
TGD (Technical Guidance Document in support of Commission Directive 93/67/ECC on risk assessment for new notified substances and Commission Regulation (EC) No. 1488/94 on risk assessment for existing substances). Part III, 1996.
Todeschini, R., Gramatica, P., 1997a. The WHIM theory: new 3D molecular descriptors for QSAR in environmental modelling. SAR QSAR Environ. Res. 7, 89–115.
Todeschini, R., Gramatica, P., 1997b. 3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of the WHIM descriptors. Quant. Struct.-Act. Relat. 16, 113–119.
Todeschini, R., Gramatica, P., 1997c. 3D-modelling and prediction by WHIM descriptors. Part 6. Applications of WHIM descriptors in QSAR studies. Quant. Struct.-Act. Relat. 16, 120–125.
Todeschini, R., Maiocchi, A., Consonni, V., 1999. The K correlation index: theory development and its application in chemometrics. Chemom. Intell. Lab. Syst. 46, 13–29.
Todeschini, R., Consonni, V., 2000. Handbook of Molecular Descriptors. Wiley-VCH, Weinheim, Germany, p. 667.
Todeschini, R., Mauri, A., 2000. DOLPHIN, Software for experimental Design. Rel. 2.1 for Windows, Milano Chemometrics and QSAR Research Group.
Todeschini, R., 2002. MOBY DIGS, software for Multilinear Regression Analysis and Variable Subset Selection by Genetic Algorithm, rel. 1.2 for Windows, Talete srl, Milan, Italy.
Todeschini, R., Consonni, V., Mauri, A., Pavan, M., 2004. DRAGON, Software for the calculation of molecular descriptors. Ver. 5.0 for Windows.
Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being Earnest: validation as the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22, 69–76.
Vighi, M., Gramatica, P., Consolaro, F., Todeschini, R., 2001. QSAR and chemometric approaches for setting water quality objectives for dangerous chemicals. Ecotoxicol. Environ. Safety 49, 206–220.
Vighi, M., Altenburger, R., Arrhenius, A., Backaus, T., Bodeker, W., Blanck, H., Consolaro, F., Faust, M., Finizio, A., Froehner, K., Gramatica, P., Grimme, L.H., Gronvall, F., Hamer, V., Scholze, M., Walter, H., 2003. Water quality objectives for mixtures of toxic chemicals: problems and perspectives. Ecotox. Environ. Safety 54, 139–150.
Walker, J.D., 2003. Applications of QSARs in toxicology: a US Government perspective. J. Mol. Struct. 622, 167–184.
Walker, J.D., Carlsen, L., Jaworska, J., 2003. Improving opportunities for regulatory acceptance of QSARs: the importance of model domain, uncertainty, validity and predictability. QSAR Comb. Sci. 22, 346–350.
Walter, H., Consolaro, F., Gramatica, P., Scholze, M., Altenburger, R., 2002. Mixture toxicity of priority pollutants at No Observed Effect Concentrations (NOECs). Ecotoxicology 11, 299–310.
White Paper: http://europa.eu.int/comm/environmental/chemicals/whitepaper.htm.
Wold, S., Eriksson, L., 1995. Statistical validation of QSAR results. In: Mannhold, R., Krogsgaard-Larsen, P., Timmerman, H. (Eds.), Chemometric Methods in Molecular Design. VCH, Germany, pp. 309–318.
Worth, A.P., 2002. ECVAM's activities on computer modelling and integrated testing. ATLA 30 (Suppl. 2), 133–137.
Worth, A.P., Cronin, M.T.D., Van Leeuwen, C.J., 2004. A framework for promoting acceptance and regulatory use of (quantitative) structure–activity relationships. In: Cronin, M.T.D., Livingstone, D.J. (Eds.), Predicting Chemical Toxicity and Fate. CRC Press, Boca Raton, FL, USA, pp. 429–440.
Yin, C., Xinhui, L., Weimin, G., Teng, L., Xiaodong, W., Liansheng, W., 2002. Prediction and application in QSPR of aqueous solubility of sulfur containing aromatic esters using GA-based MLR with quantum descriptors. Water Res. 36, 2975–2982.
