Ansgar Schuffenhauer,
Switzerland
doi: 10.1002/9780470048672.wecb072
Advanced Article
The screening of chemical libraries is one of the major sources of new leads in drug discovery. Because chemistry space is vastly larger than any library that is feasible to screen, the compounds for the screening library must be selected carefully to maximize screening success. Besides issues of technology compatibility and chemical tractability of the compounds, the main objective is to increase the probability of obtaining hits for the screened targets. Diversity selection approaches have often shown only limited success. In the absence of any knowledge about the target, it is proposed to screen smaller, lead-like ligands preferentially. When knowledge about the target is available, it can be used for target-focused compound selection or for library design. In the screening process, physical high-throughput screening (HTS) can be combined with virtual screening either to avoid the high-throughput primary screen of the whole library or to limit false negatives by combining primary HTS and virtual screening results. In sequential screening, an initial subset is screened and the results are used to predict likely hits for subsequent screening rounds; this can reduce the number of compounds to be screened, but it requires a greater logistics effort and carries the risk of missing compounds that are not well represented structurally by the initial set. Data analysis and visualization of the screening results are a necessary final step of a screening campaign to ensure that the prioritization of compounds for follow-up is based on all available relevant information.
When in vitro biologic assays replaced in vivo animal models as the first tool to assess the biologic activity of molecules in drug discovery, it became possible to test many more compounds than before. This triggered the hope that the slow process of lead discovery, which relies to a large extent on medicinal chemists' intuition and serendipity, could be accelerated by systematic brute-force screening of large collections of chemical compounds, for which the term chemical libraries has been introduced. Consequently, the pharmaceutical industry has built up high-throughput screening (HTS) facilities (see the article High Throughput Screening (HTS) Techniques: Overview of Applications in Chemical Biology), in which in vitro assays can be performed in a highly parallel, miniaturized, and automated way. With HTS available, it became possible to screen not only the historically accumulated compound collections of pharmaceutical companies, but also a much greater number of compounds. This finding triggered the demand for highly parallelized and automated synthesis of compounds to feed the HTS machinery (see the articles Combinatorial Libraries: Overview of Applications in Chemical Biology and Small Molecule Combinatorial Libraries).
Although large pharmaceutical companies screen compound libraries on the order of $10^{6}$ molecules, this is far from a systematic brute-force approach because the chemistry space is estimated to contain $10^{13}$ to $10^{60}$ small molecules (1). The conservative estimate of $10^{13}$ molecules is based on well-established chemical reactions and commercially available reagents (2). Extrapolation from a systematic enumeration of all theoretically viable organic molecules with up to 11 non-H atoms toward 25 non-H atoms (the average size of drug-like molecules) suggests the existence of $10^{27}$ unique structures (3). Because only a small subset of the chemistry space can be screened, the compounds must be chosen appropriately to maximize the success of the screen. Three groups of criteria for this exist.
WILEY ENCYCLOPEDIA OF CHEMICAL BIOLOGY 2008, John Wiley & Sons, Inc.
First, the compounds must be compatible with the compound handling and screening technology used and should not cause assay artifacts. Second, the library must contain molecules with the desired activity. Last, once a hit is identified, the molecule must be optimizable into a drug candidate with suitable efficacy, bioavailability, therapeutic window, and, in the case of industrial drug discovery, patentability (see the article Lead Optimization in Drug Discovery).
Physical-chemical properties
Generally, biologic assays are performed in aqueous solution, typically at a concentration of up to 50–100 µM. These solutions are produced by diluting a stock solution of the compound in dimethylsulfoxide (DMSO) in the millimolar concentration range with buffer. Therefore, the compounds must be soluble in water and in DMSO under the respective conditions; otherwise, a potential activity of the compound remains undetected or is largely underestimated. Water solubility is equally important for the bioavailability of the drug, for which sufficiently high blood plasma levels must be achieved for efficacy. Unfortunately, neither the experimental determination of water solubility nor its prediction by computational methods is straightforward, because solubility depends not only on the hydrophilicity of the compound, but also on the lattice energy of the crystal (4). Thus, the logarithm of the octanol–water partition coefficient (logP) has frequently been used as a guideline to estimate water solubility, based on Yalkowsky's equation (5). It can be predicted by summing up fragment contributions that have been fitted to experimental data (6), reflecting the ratio between hydrophobic and hydrophilic fragments. From a high lipophilicity, as indicated by a high computed logP (ClogP), it can be concluded that the water solubility of the neutral compound is low; however, a low ClogP does not guarantee high water solubility. Protonation of basic groups or deprotonation of acidic groups leads to ionic species that frequently have higher solubility than neutral compounds. In this context, it is noteworthy that lipophilicity is related not only to low aqueous solubility, but also to the tendency of compounds to form aggregates. Such aggregates can sequester the enzyme in biochemical assays in an unspecific way and can lead to the detection of false-positive hits. The exact cause of the aggregate formation and the mechanism and conditions of the enzyme sequestration are not understood completely; however, experimental procedures have been suggested to detect false positives caused by aggregation (7).
The second property of importance for bioavailability is the polar surface area (PSA), which is associated with intestinal absorption and cell membrane penetration by passive transport. Compounds with a high polar surface area are less likely to penetrate the lipophilic environment of the cell membranes by passive transport. Like logP, PSA can be computed by summing up fragment contributions (8), with H-bonding fragments as the main contributors. The role of the physical-chemical properties discussed so far is the rationale behind two popular rules of thumb to estimate drug-likeness: Lipinski's rule of five (9), in which counts of hydrogen bond donors and acceptors take the place of the PSA, and the Egan Egg (10) (see Table 1).
Table 1 Empirical rules of thumb to estimate the suitability of compounds at different stages of drug discovery based on structural properties
| Name | Rule | Purpose | Reference |
| --- | --- | --- | --- |
| Rule of five | Fails if two or more of the following conditions are violated: MW ≤ 500 Da; ClogP ≤ 5; HBD ≤ 5; HBA ≤ 10 | Estimate whether a compound's absorption and membrane permeation are good enough to be orally bioavailable | (9) |
| Egan Egg | Ellipse defined in the ClogP and PSA space | Estimate whether a compound's absorption and membrane permeation are good enough to be orally bioavailable | (10) |
| Lead-likeness | MW ≤ 460 Da and −4 ≤ ClogP ≤ 4.2 and LogSw ≥ −5 and RTB ≤ 5 and RNG ≤ 4 and HBD ≤ 5 and HBA ≤ 9 | Identify compounds that have the potential to be successful leads | (11) |
| Rule of three | MW ≤ 300 Da and ClogP ≤ 3 and HBD ≤ 3 and HBA ≤ 3 and RTB ≤ 3 and PSA ≤ 60 Å² | Identify compounds that have the potential to be successful fragment screening hits | (12) |
MW, molecular weight; ClogP, computed logP; LogSw, logarithm of aqueous solubility; RTB, number of rotatable bonds; RNG, number of rings; HBD, number of H-bond donors; HBA, number of H-bond acceptors; PSA, polar surface area.
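The rules in Table 1 are simple threshold checks on precomputed properties, so they are easy to express as filters. The following is a minimal sketch, not a reference implementation: the `Props` container and the function names are hypothetical, the property values are assumed to come from whatever property calculator is in use, and the thresholds follow the usual reading of the rule of five (HBA limit of 10) and of the fragment criteria in the last row of Table 1.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed properties of one compound (hypothetical container)."""
    mw: float         # molecular weight in Da
    clogp: float      # computed logP
    hbd: int          # number of H-bond donors
    hba: int          # number of H-bond acceptors
    rtb: int = 0      # number of rotatable bonds
    psa: float = 0.0  # polar surface area in square angstroms


def rule_of_five_violations(p: Props) -> int:
    """Count how many of the four rule-of-five conditions are violated."""
    return sum([p.mw > 500, p.clogp > 5, p.hbd > 5, p.hba > 10])


def passes_rule_of_five(p: Props) -> bool:
    """A compound is flagged only when two or more conditions are violated."""
    return rule_of_five_violations(p) < 2


def passes_fragment_rule(p: Props) -> bool:
    """All fragment criteria (assumed thresholds, see lead-in) must hold."""
    return (p.mw <= 300 and p.clogp <= 3 and p.hbd <= 3
            and p.hba <= 3 and p.rtb <= 3 and p.psa <= 60)


if __name__ == "__main__":
    # Illustrative, made-up property values.
    candidate = Props(mw=420.5, clogp=3.1, hbd=2, hba=6, rtb=5, psa=85.0)
    print(passes_rule_of_five(candidate), passes_fragment_rule(candidate))
```

Keeping the violation count separate from the pass/fail decision makes it straightforward to apply a stricter or looser cut-off than the "two or more violations" convention.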
Diversity-based strategies
The central hypothesis for all diversity-based strategies of compound selection is the similarity property principle, which states that molecules with similar structures can be expected to have similar properties and to bind to the same target proteins (15). Following this principle, it is only necessary to screen one representative out of a group of molecules with similar structures because the other molecules of the group should have the same binding behavior as the representative. Consequently, many algorithms exist to select diverse subsets of molecules from a database that represent the groups of unselected molecules. These algorithms have been reviewed elsewhere (16, 17), and only a short overview is provided here. Most methods encode the molecular structures as a descriptor vector, from which similarity coefficients for pairs of molecules can be calculated without aligning the molecules. These similarity coefficients are then used in diversity selection or clustering algorithms (18). From a clustering solution, a diversity selection is obtained by choosing one or more representative molecules from each cluster. Each molecule in a diverse subset is expected to represent the nonselected molecules, and it can be interpreted as the center of a cluster formed by its similar neighbors. For the sake of a more descriptive discussion, the clustering viewpoint is assumed in the following paragraphs; however, the arguments made generalize to other diversity selection procedures. As an alternative to clustering, rule-based methods, typically based on the molecular scaffold, can be used to create partitions of molecules from which the representatives are selected (19–21).
Despite initially high expectations, diversity-based strategies for compound selection have shown only limited success. Diversity selection from the MDL Drug Data Report (MDDR), a database that contains only molecules with documented pharmacological properties, led to an enrichment of covered activity classes (22). However, diversity selections from a compilation of screening data that includes inactive molecules did not lead to an enrichment of targets covered by selected compounds (23). In a clustering experiment, the intracluster similarity of the IC50 vectors of the compounds measured in a uniform panel of assays was not much greater than the intragroup IC50 similarity of compounds grouped randomly (24).
How can these results be understood? First, the similarity property principle is only valid over rather short similarity ranges. According to a popular rule of thumb, molecules that have a Tanimoto similarity coefficient of at least 0.85 calculated over Daylight fingerprints are supposed to be very similar; however, they often differ significantly in their protein binding properties (25). Also, the inversion of the similarity property principle, that dissimilar molecules should have dissimilar protein binding properties, is not generally true (26). Second, the theory that the screening of only one representative per structural cluster is sufficient to determine the activity of the cluster assumes that the screening procedure is error free.
Any error in the screening of the representative molecule is extrapolated to the whole cluster and leads to its misclassification. To compensate for screening errors and to determine the activity of a cluster reliably, it is necessary to screen several representative molecules from a cluster of molecules with assumed common biologic activity. A statistical model to determine the number of representatives that need to be screened per cluster, based on empirically estimated false-positive and false-negative rates, has been published by Harper et al. (27). In this model, the probability that a compound is active is the product of the probability $\pi_i$ that cluster $i$ contains active compounds (variable from cluster to cluster) and a probability $\alpha$ that an individual compound is active, provided the cluster is active. Here, $\alpha$ accounts for the average error of the screening process that leads to an erroneous determination of an individual compound's activity and for errors of the clustering procedure that lead to the erroneous grouping of a compound into a cluster with different activity. The activity related to the common chemotype or pharmacophore of cluster $i$ is described by $\pi_i$, and even an active cluster with a high $\pi_i$ is likely to be missed if $\alpha$ is low. If $n$ compounds per cluster are screened, the probability of finding at least one hit if the cluster is active equals $1-(1-\alpha)^{n}$. The third reason for the limited success of diversity-based strategies is the low baseline probability for bioactivity, with hit rates of 0.1% as the typical order of magnitude. Even if a clustering were highly predictive of biologic activity, with active clusters that show a 100-fold enrichment of active compounds, this would still mean that, on average, only 10% of the compounds in any given active cluster are active. If such a cluster were sampled with only one compound, the probability that the cluster is identified correctly as active would be only 10%.
This theory is illustrated qualitatively in Fig. 1.
Figure 1 Qualitative illustration of a cluster-based sampling of a compound data set for screening and its impact on finding active compounds. In the clustering stage, illustrated in the left picture, the compounds are distributed into clusters and one (centroid) representative from each cluster is selected for testing. After testing these compounds, each cluster is classified as active or inactive, depending on the testing outcome of the representative. In cluster A, this works well, and a cluster with several active compounds (29%) is identified correctly. It is important to note in this context that even active clusters often contain only 20% of active compounds, which is still a large fraction compared with the overall baseline probability for bioactivity (19). In the case of cluster B, the representative was inactive, which leads to the misclassification of the cluster as inactive although 29% of the compounds are active. For most of the compounds in this cluster, the prediction that they would be inactive is correct; the overall outcome is more important than a false-negative active cluster.
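The sampling argument above can be checked numerically. The sketch below assumes that each screened cluster member is an independent draw with the same per-compound hit probability (called `alpha` here), which is a simplification; the function names are hypothetical and are not taken from the cited model.

```python
def p_cluster_detected(alpha: float, n: int) -> float:
    """Probability of at least one hit among n screened members of an
    active cluster, assuming independent draws with hit probability alpha."""
    return 1.0 - (1.0 - alpha) ** n


def members_needed(alpha: float, target: float) -> int:
    """Smallest n for which p_cluster_detected(alpha, n) >= target."""
    if not (0.0 < alpha < 1.0 and 0.0 < target < 1.0):
        raise ValueError("alpha and target must lie strictly between 0 and 1")
    n = 1
    while p_cluster_detected(alpha, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Baseline hit rate of 0.1% with 100-fold enrichment in active clusters.
    alpha = 0.001 * 100
    for n in (1, 5, 10):
        print(n, round(p_cluster_detected(alpha, n), 3))
    print(members_needed(alpha, 0.9))
```

Under these assumptions, a single representative detects an active cluster in only about one case out of ten, and more than twenty members would have to be screened to reach a 90% detection probability.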
molecule. This relationship is described by the qualitative model of Hann et al. (28). After an initial increase, the probability that a ligand matches the binding site in exactly one orientation decreases with the ligand size. In addition, the number of potential molecules increases exponentially with the number of atoms; therefore, with an increasing cutoff for the molecular size, it becomes more and more difficult to sample the chemistry space (3, 11). On the other hand, the larger the molecule, the more binding contacts it makes if it fits the binding site perfectly, which leads to a higher binding affinity (29). The minimum binding affinity required for detection depends on the sensitivity of the assay. An increase in assay sensitivity leads to a decrease in the minimum size required for ligands that have the potential to bind with a detectable affinity. For the reasons stated above, it makes sense to screen molecules in the size range that is large enough to allow for a detectable affinity, but not larger. Lead-likeness criteria have been formulated based on this finding (see Table 1). In fragment-based screening (FBS), highly sensitive biophysical assay technologies are used to detect the binding events of small molecular fragments to proteins (12). In the molecular size range used in FBS, the hit rate is, in accordance with the Hann model, much higher than in conventional biologic assays. The observed affinity is much lower (30), which requires the fragments to be amenable to chemical transformation so that they can be evolved into molecules whose activity can be validated and optimized using biochemical assays. Binding or ligand efficiency (LE) metrics (31), in which the activity is normalized by the molecular size, have been introduced to prioritize screening hits. It has been observed by Hajduk (32) that, during the optimization of a chemical series, the ligand efficiency of the best compound after each optimization step is in most cases constant, which indicates that an increase in affinity coincides with an increase in molecular weight (MW). To achieve a final drug candidate with a potency of 10 nM or better and a MW of at most 500 Da to comply with Lipinski's rule, a LE of 0.016 pKi Da⁻¹ (a pKi of 8 divided by 500 Da) is the minimum requirement. Conventional HTS assays are typically sensitive enough to detect compounds with a Ki in the range of 1 µM. Assuming that the ligand efficiency is indeed constant for a chemical series, and that only ligands with LE greater than or equal to 0.016 pKi Da⁻¹ have the potential to be optimized into a suitable drug candidate, it would be sufficient to screen compounds with a maximum MW of 375 Da. Likewise, for a biophysical fragment screen that can detect a KD in the range of 1 mmol/L, a maximum MW of 188 Da would be sufficient if the binding constant translates into an inhibition constant of the same order of magnitude. However, the criterion of LE greater than or equal to 0.016 pKi Da⁻¹ may not be achievable for every target. Reference 32 includes chemical series with lower LE values that nevertheless went into preclinical development. It must also be taken into account that a screening hit with a low LE can still be a suitable tool compound and may serve as a starting point to design a scaffold with a higher LE that was not present in the screening library. The definition of optimization of a chemical series used by Hajduk is very narrow: it allows an initial hit fragment to grow but not to have parts of it removed. In the exploration of HTS hits, pruning operations are frequent, although an ideal library would have contained the compound that resulted from the pruning in the first place. Therefore, such LE-derived cut-off criteria for screening hits can rarely be applied stringently.
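The size cut-offs quoted above follow from simple arithmetic on the ligand efficiency, taken here as pKi divided by molecular weight. The sketch below reproduces the numbers under the stated assumption of a constant LE within a chemical series; the function names are illustrative only.

```python
import math


def pki(ki_molar: float) -> float:
    """Negative decadic logarithm of an inhibition constant given in mol/L."""
    return -math.log10(ki_molar)


def ligand_efficiency(ki_molar: float, mw_da: float) -> float:
    """Ligand efficiency as pKi per Da of molecular weight."""
    return pki(ki_molar) / mw_da


def max_mw_for_detection(detection_limit_molar: float, min_le: float) -> float:
    """Largest MW at which a hit that is just detectable still reaches min_le."""
    return pki(detection_limit_molar) / min_le


if __name__ == "__main__":
    # Target profile: Ki of 10 nM at 500 Da.
    min_le = ligand_efficiency(10e-9, 500.0)
    print(f"minimum LE:          {min_le:.3f} pKi/Da")
    # HTS detecting Ki around 1 micromolar.
    print(f"HTS MW cut-off:      {max_mw_for_detection(1e-6, min_le):.1f} Da")
    # Biophysical fragment screen detecting KD around 1 millimolar.
    print(f"fragment MW cut-off: {max_mw_for_detection(1e-3, min_le):.1f} Da")
```

The fragment cut-off evaluates to 187.5 Da, which the text rounds to 188 Da.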
similarity coefficients (37). Although similarity searching with the individual ligands and combination of the results by data fusion can be highly successful (38), the numerical or binary nature of the descriptor vectors allows a whole range of machine learning techniques from other areas of multivariate statistics to be applied. Examples of these techniques are binary kernel discriminators (38), support vector machines (39), emerging patterns (40), naïve Bayesian classifiers (41), and self-organizing maps (42). Self-organizing maps have become especially popular because of the intuitive visualization of their results (43, 44). Often, the results depend strongly on the target class chosen and the available data. For this reason, one key success factor is to compile as comprehensive and accurate a reference set as possible, which requires bioactivity databases that are well integrated with both bioinformatics databases that describe protein family membership and chemical databases that characterize the ligands. Although considerable room for improvement still exists in this sector, a wide range of databases has become available, and they were reviewed recently (45) (see the article Small Molecule, Drug-Target Databases). Structure-based virtual screening technologies use the complementarity between the structural features of a ligand and its target protein binding site. Docking, which tries to predict explicitly the orientation of a ligand within the binding site (pose) and to estimate its binding energy (score), is the most frequently used method and has now become well established (46). However, a newer method uses the target–ligand complementarity for the generation of predictive models without generating binding poses (47).
To make docking suitable for identifying the ligands of a whole target family, it is necessary to address how to deal with family members without an available protein structure and how to overcome the inaccuracy of the scoring functions in the analysis of the docking results. A substantial improvement of the results can be obtained by using not only the docking score as a decision criterion to retain or reject a pose, but also the key interactions between ligand and protein. Based on the binding affinity of known ligands, the scoring weights of different interactions can be adjusted to reflect the SAR of the ligands (48). Ligand–protein interactions can be described by discrete bit vectors comparable with chemical fingerprint descriptors, which allow the efficient and fast filtering of poses as described for the design of a kinase-focused library (49). Recently, a method has been described to train the weights of interactions in the scoring functions automatically based on a set of known ligands. This method allowed the authors to predict not only the activity for the kinase into which the ligand was actually docked, but also, by using a training set of activity data for a second kinase different from the one used for docking, the activity for this second kinase. This method allows the prediction of activity for kinases without a known structure (50). It is an example of a method in which classic docking is combined with statistical learning. To apply these methods, both protein structures and activity information over a large set of ligands are required. Even further generalizations of the binding sites are schematic descriptions of kinase binding sites that summarize the features of several kinase inhibitor complexes and the variations in their binding pockets between different kinases (51, 52). Such qualitative models have been used successfully to design kinase-focused libraries. In the case of G-protein coupled receptors (GPCRs), the structural information is rather sparse; bovine rhodopsin is the only GPCR for which a crystal structure is currently available. However, this structure could be used to determine which residues are positioned within the binding cavity. A qualitative model has been derived to visualize the molecular recognition of ligands in different GPCRs, which depends on the amino acid side chains exposed in the binding site (52).
Although many methods described above can be used to screen individual structures or enumerated libraries for potential biologic activity, they offer, with the exception of the generalized binding site models, no guidance for the de novo design of new scaffolds that have an increased potential for biologic activity on a range of targets. One of the earliest concepts to offer such guidance is based on the observation that common core scaffolds exist that can be differentiated, by modification of their side chains, into ligands that are individually selective for different target proteins. Evans et al. (53), who made this observation in the case of the benzodiazepine core contained in selective ligands for different cholecystokinin receptors, called such scaffolds privileged structures. This concept has been elaborated further (54) and studied systematically on the ligands in the MDDR. For each target family in the MDDR, ligands were exported and maximum common substructures (MCS) were extracted. These substructures were then assumed to be privileged structures and were checked for their presence in the ligands of other target families, with the result that many of these structures were indeed present in the ligands of more than one target family (55). Targets that have similar ligand sets are not necessarily members of the same target family (56).
It has been proposed that the limited number of protein folds in the proteome also leads to a limited number of ligand-sensing cores. A ligand-sensing core is defined as the folding pattern of the protein in a sphere with a diameter of 20–30 Å around the binding site, without taking the individual protein side chains into account (57). Different side chains in the ligand-sensing core can lead to a variety of diversely functionalized binding cavities, which may fulfill different functions and may occur in more than one target family. A privileged structure might be a suitable scaffold that orients its side chains in different regions of the binding pocket that is defined by the ligand-sensing core. Depending on the amino acid side chains that are exposed in the binding pocket by the individual protein, different functional groups are required to be present on the privileged scaffold. The concept of biology-oriented synthesis (BIOS) (58) suggests that, in the absence of detailed knowledge of an individual target protein's structure, it is necessary to screen a diversely functionalized library around a privileged or biologically prevalidated scaffold for the ligand-sensing core to identify the correct substitution pattern for the individual target.
Natural products
Natural products are a traditional source of biologically active compounds, which are either used as drugs themselves or have inspired the discovery of synthetic drugs (59) (see the articles Natural Products: An Overview and Natural Products to Probe Biosynthetic Pathways). They cover a wide range of chemical classes (59), and they are expected to be fine-tuned by evolution to fulfill a purpose that is often still unknown but is likely to involve interaction with biomolecules. The capability of an organism to produce many variations of a metabolite at low effort is considered a beneficial evolutionary trait of a species, which allows it to adapt its range of produced metabolites quickly to a changing environment. For this reason, the metabolic pathways involved in the synthesis of natural products are often highly branched and lead to high chemical diversity in collections of natural products (60). The synthetic complexity of natural products often makes them difficult to optimize. It has been shown, however, that not only natural products themselves but also synthetic libraries of simplified analogs that retain only the key features of the original natural product can be applied successfully in screens for biologically active compounds. It has been demonstrated that selective ligands for different targets can be derived from such a simplified natural product core, which may therefore be regarded as a privileged structure (58).
HTS processes
Typically, the screening process (see the article High Throughput Screening (HTS) Techniques: Overview of Applications in Chemical Biology) begins with the production of stock solutions by dissolving powder samples and reformatting the solution samples into a uniform deck of stock solution plates. These samples are then stored under controlled conditions, and from these samples the screening plates are produced by plate replication systems. Perhaps the most direct approach to screening is to measure the dose-response curves first, with prefabricated assay plates that contain the compounds at the different concentrations (Fig. 2a). This technique has been shown to be feasible with a high level of automation for libraries of up to 100,000 samples. Because the same number of data points is measured for active and inactive compounds, the absence and the presence of activity are determined with the same degree of reliability. This reliability is an advantage for building SAR models. In addition, the analysis of the dose-response curve shapes allows some conclusion as to whether the interaction between ligand and protein is specific (61).
However, in the pharmaceutical industry, larger libraries of a million or more compounds are often screened, and it is desirable to have a lower consumption of protein and compounds on those compounds that are inactive. Therefore, HTS begins with a single-concentration screen: the primary screen. Then, for the compounds found active in the primary screen, dose-response curves are determined in a validation phase. Between the primary screen and validation, a confirmation screen can be performed in which the single-concentration experiment is repeated for the primary hits and only hits with confirmed activity are validated (Fig. 2b). In any case, it is necessary to confirm the chemical identity and purity of the samples found active to avoid misleading SAR information. For the same reasons, counter screens or secondary assays that use different read-out methods are performed to exclude an unspecific interaction of the compound with the assay system.
This process requires the capability to access large subsets of the screening library. In this process step, called cherry picking, individual samples must be taken from the mother plates with stock solution and dispensed into plates for the confirmation or validation screen. In addition, dilution series of the cherry-picked samples must be produced for the dose-response curve measurement in validation. Technically, cherry picking is a nontrivial task, and if all compounds with significant primary activity are to be confirmed, which can be several thousand compounds, then not only the screening capacity for confirmation screening but also the cherry-picking capacity for these samples must be available. For large screening libraries, these processes can only be run with a high degree of automation in sample storage, cherry picking, screening, and chemical analytics.
Figure 2 Workflows for experimental physical HTS and virtual screening of compound libraries and their combination (panels a–e).
These automation systems must be driven by an informatics platform that tracks the contents of plates; collects the results of the different readers used for screening; and performs normalization, curve fitting, and detection of errors that may result from spillage and carryover of compounds in the pipetting process or from edge effects (62, 63). The results of these automated preprocessing steps must be presented to the screener in an appropriate visualization after each screening step for quality control and final decision making. If the primary screening and its results justify the follow-up of more compounds than can be processed, then chemoinformatics techniques such as clustering can be used to ensure appropriate representation of all chemical classes in the validation set. Also in this step, compounds can be removed that interfere with the assay technology and are unlikely to interact specifically with the target (64). The decisions taken at these steps must be captured, and the lists of the selected compounds must be handed over to the cherry-picking system for automated processing. The software tools used for these different tasks must be well integrated to achieve a process that runs smoothly (65).
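As an illustration of the normalization step mentioned above, the following sketch converts raw readouts of a single-concentration plate into percent inhibition using on-plate controls and then lists the wells above a cut-off. It is a simplified assumption of how such preprocessing might look: the control layout, the 50% threshold, and the function names are all hypothetical, and real platforms add curve fitting and error detection on top.

```python
from statistics import mean


def percent_inhibition(raw, neutral_ctrl, inhibitor_ctrl):
    """Scale a raw signal so that the mean neutral control maps to 0%
    and the mean fully inhibited control maps to 100%."""
    high = mean(neutral_ctrl)    # uninhibited signal
    low = mean(inhibitor_ctrl)   # fully inhibited signal
    if high == low:
        raise ValueError("controls do not separate")
    return 100.0 * (high - raw) / (high - low)


def primary_hits(plate, neutral_ctrl, inhibitor_ctrl, threshold=50.0):
    """Return {well: percent inhibition} for wells at or above the threshold.

    `plate` maps well identifiers to raw readouts; the threshold is an
    arbitrary illustrative choice, not a recommended value.
    """
    hits = {}
    for well, raw in plate.items():
        value = percent_inhibition(raw, neutral_ctrl, inhibitor_ctrl)
        if value >= threshold:
            hits[well] = value
    return hits


if __name__ == "__main__":
    # Made-up readouts for three sample wells and two sets of control wells.
    plate = {"A1": 1000.0, "A2": 400.0, "A3": 120.0}
    print(primary_hits(plate, [1000, 980, 1020], [100, 90, 110]))
```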
compounds from vendor catalogs and even enumerated virtual libraries, from which the hit compounds are then purchased or synthesized. Compilations of screening compound catalogs exist both publicly, such as ZINC (compiled by the Shoichet laboratory at UCSF, San Francisco, http://zinc.docking.org ), which contains docking-ready 3D structures (66), and in the commercial sector, such as ChemNavigator (ChemNavigator, San Diego, http://www.chemnavigator.com ), which is linked to a sample procurement service. Several cases have been reported in which active ligands have been discovered successfully using such processes (67). However, if automated high-throughput experimentation is abandoned, then only small numbers of compounds can be validated (typically below 1000), whereas typical HTS setups allow the validation of a couple of thousand compounds. Similar to physical HTS, but to a greater extent, virtual screening is affected by false positives and false negatives. In typical virtual screening, accumulation of 90% of the true positives in the top-ranking 10% of compounds is an excellent result that is almost never reached in practice (68, 69). Assuming an industrial HTS library of a million compounds, validating a virtual screening hit list that consists of only 1% of this library (10,000 compounds), despite the inevitably high false-negative rate this will cause, still requires HT experimentation. Data fusion of HT experimentation and virtual screening can be expected to compensate for the errors of each method and to allow the validation of a significant number of hits. Virtual screening in this setup no longer has the purpose of saving investment in HT experimentation, but of maximizing the number of positively validated compounds over the whole process, so that as many true-positive hits in the collection as possible are identified to feed the drug discovery process (Fig. 2d).
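The "90% of the true positives in the top 10%" benchmark is a statement about recall at a given fraction of the ranked library. A minimal sketch of how this could be measured for any virtual screening score is given below; the function names and the score dictionary are illustrative assumptions, and the data in the example are synthetic.

```python
def recall_at_fraction(scores, actives, fraction):
    """Share of the known actives found in the top-scoring `fraction`
    of the library. `scores` maps compound id -> score (higher is better)."""
    if not actives:
        raise ValueError("at least one active is required")
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_top = max(1, round(fraction * len(ranked)))
    top = set(ranked[:n_top])
    return len(top & set(actives)) / len(actives)


def enrichment_factor(scores, actives, fraction):
    """Recall in the top fraction relative to random selection."""
    return recall_at_fraction(scores, actives, fraction) / fraction


if __name__ == "__main__":
    # Synthetic example: 20 compounds, score decreases with the index.
    scores = {f"c{i}": 20 - i for i in range(20)}
    actives = {"c0", "c1", "c10", "c15"}
    print(recall_at_fraction(scores, actives, 0.10))
    print(enrichment_factor(scores, actives, 0.10))
```

In this synthetic case, half of the actives are recovered in the top 10%, a five-fold enrichment over random picking; the actives that rank low are the false negatives that a combined HTS and virtual screening workflow is meant to recover.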
Sequential screening
Instead of screening the whole library in one batch, it has been proposed to screen an initial subset and to use the screening results from this subset to train a statistical model that predicts and prioritizes the remaining library. A selection from the remaining library is then screened, and the cycle of model building, prediction, and screening can be executed several times, which is referred to as sequential or iterative screening (70). Although this approach seems very attractive because it reduces the number of compounds that require screening, the multiple selection cycles lead to a longer overall screening time and increase the assay logistics effort. Together with the multiple cherry-picking and data processing cycles, this may cause more effort than is saved by screening fewer compounds. In sequential screening it is necessary to choose an initial set. In the absence of reasonable knowledge for the selection of a focused subset, the initial set must be selected by diversity selection, whose limitations have been discussed above. In a compound collection that has been designed to avoid unnecessary redundancy by applying reasonable diversity selection, little can be gained by additional diversity selection. Any active compound class not represented reasonably by the initial screening set is unlikely to be recovered in the additional screening cycles, because the statistical models built on the screening results cannot make valid predictions for it. However, one can expect to identify additional actives in the series covered by the initial set. Recently, it has been demonstrated that screening 25% of a one-million-compound library, selected as a diversity set based on full plates, followed by one prediction and screening cycle offers a reasonable compromise between logistical effort, number of compounds screened, and hit series covered (Fig. 2e) (71). When the screening cost per compound is high and outweighs the logistics effort, sequential screening can be expected to be beneficial, provided it is acceptable to identify only a limited number of tool compounds instead of as many hit series out of the library as possible.
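One cycle of sequential screening can be sketched as follows. This is a toy illustration under stated assumptions: compounds are represented by small hypothetical feature sets, the "statistical model" is replaced by a simple nearest-neighbor Tanimoto similarity to the hits of the initial set, and the assay is simulated by a lookup of known actives. None of this reproduces the models used in the cited work (70, 71).

```python
def tanimoto(a, b):
    """Tanimoto similarity of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sequential_screen(library, assay, initial_ids, n_follow_up):
    """Screen an initial set, rank the rest by similarity to its hits, screen the top picks."""
    initial_set = set(initial_ids)
    initial_hits = [cid for cid in initial_ids if assay(cid)]

    def score(cid):
        # Stand-in model: similarity to the closest hit from the initial set.
        return max((tanimoto(library[cid], library[h]) for h in initial_hits),
                   default=0.0)

    remaining = [cid for cid in library if cid not in initial_set]
    follow_up = sorted(remaining, key=score, reverse=True)[:n_follow_up]
    follow_up_hits = [cid for cid in follow_up if assay(cid)]
    return initial_hits, follow_up_hits, len(initial_ids) + len(follow_up)


# Hypothetical library: two active series (X and Y) and five inactives (N).
library = {
    "X1": frozenset({1, 2, 3, 4}),
    "X2": frozenset({1, 2, 3, 5}),
    "X3": frozenset({1, 2, 3, 6}),
    "Y1": frozenset({10, 11, 12, 13}),
    "Y2": frozenset({10, 11, 12, 14}),
    "N1": frozenset({20, 21, 22}),
    "N2": frozenset({23, 24, 25}),
    "N3": frozenset({26, 27, 28}),
    "N4": frozenset({1, 29, 30}),
    "N5": frozenset({31, 32, 33}),
}
true_actives = {"X1", "X2", "X3", "Y1", "Y2"}


def assay(cid):
    """Simulated assay readout: True if the compound is a known active."""
    return cid in true_actives


# The initial set covers series X but contains no member of series Y.
initial_ids = ["X1", "N1", "N2"]
initial_hits, follow_up_hits, n_screened = sequential_screen(
    library, assay, initial_ids, n_follow_up=2)
missed = sorted(true_actives - set(initial_hits) - set(follow_up_hits))
print(initial_hits, follow_up_hits, n_screened, missed)
```

Only half of the toy library is screened and all members of series X are found, but series Y, which has no representative in the initial set, is missed completely. This mirrors the limitation discussed in the text.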
References
1. Gorse AD. Diversity in medicinal chemical space. Curr. Top. Med. Chem. 2006;6:3–18.
2. Andrews KM, Cramer RD. Toward general methods of targeted library design: topomer shape similarity searching with diverse structures as queries. J. Med. Chem. 2000;43:1723–1740.
3. Fink T, Reymond J-L. Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. J. Chem. Inf. Comput. Sci. 2007;47:342–353.
4. Delaney JS. Predicting aqueous solubility from structure. Drug Disc. Today 2005;10:289–295.
5. Ran Y, Yalkowsky SH. Prediction of drug solubility by the general solubility equation (GSE). J. Chem. Inf. Comput. Sci. 2001;41:354–357.
6. Leo AJ. Calculating log Poct from structures. Chem. Rev. 1993;93:1281–1306.
7. McGovern SL, Helfand BT, Feng B, Shoichet BK. A specific mechanism of nonspecific inhibition. J. Med. Chem. 2003;46:4265–4272.
8. Ertl P, Rohde B, Selzer P. Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem. 2000;43:3714–3717.
9. Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug. Deliv. Rev. 1997;23:3–25.
10. Egan WJ, Merz KM, Baldwin JJ. Prediction of drug absorption using multivariate statistics. J. Med. Chem. 2000;43:3867–3877.
11. Hann MM, Oprea TI. Pursuing the leadlikeness concept in pharmaceutical research. Curr. Opin. Chem. Biol. 2004;8:255–263.
12. Rees DC, Congreve M, Murray CW, Carr R. Fragment based lead discovery. Nature Rev. Drug. Disc. 2004;3:660–672.
13. Rishton GM. Nonleadlikeness and leadlikeness in biochemical screening. Drug Disc. Today 2002;8:86–96.
14. Charifson PS, Walters WP. Filtering databases and chemical libraries. J. Comput.-Aid. Mol. Design 2002;16:311–323.
15. Dean PM. Molecular recognition: the measurement and search for molecular similarity in ligand-receptor interaction. In: Concepts and Applications of Molecular Similarity. Maggiora GM, Johnson MA, eds. 1990. John Wiley & Sons, New York. pp. 99–117.
16. Schuffenhauer A, Brown N. Chemical diversity and biological activity. Drug Disc. Today Technol. 2006;3:387–395.
17. Gibbs AC, Agrafiotis DK. Chemical diversity: definition and quantification. In: Exploiting Chemical Diversity for Drug Discovery. Bartlett PA, Entzeroth M, eds. 2006. RSC Publishing, Cambridge. pp. 137–159.
18. Downs GM, Barnard JM. Clustering methods and their uses in computational chemistry. Rev. Comput. Chem. 2002;18:1–40.
19. Bemis GW, Murcko MA. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996;39:2887–2893.
20. Xu YJ, Johnson M. Using molecular equivalence numbers to visually explore structural features that distinguish chemical libraries. J. Chem. Inf. Comput. Sci. 2002;42:912–926.
21. Schuffenhauer A, Ertl P, Wetzel S, Koch MA, Waldmann H. The scaffold tree: visualization of the scaffold universe by hierarchical scaffold classification. J. Chem. Inf. Model. 2007;47:47–58.
22. Snarey M, Terrett NK, Willett P. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graph. Mod. 1998;15:372–385.
23. Schuffenhauer A, Brown N, Selzer P, Ertl P, Jacoby E. Relationships between molecular complexity, biological activity, and structural diversity. J. Chem. Inf. Model. 2006;46:525–535.
24. Schuffenhauer A, Brown N, Ertl P, Jenkins JL, Selzer P, Hamon J. Clustering and rule-based classifications of chemical structures evaluated in the biological activity space. J. Chem. Inf. Model. 2007;47:325–336.
25. Martin YC, Kofron JL, Traphagen LM. Do structurally similar molecules have similar biological activity? J. Med. Chem. 2002;45:4350–4358.
26. Barbosa F, Horvath D. Molecular similarity and property similarity. Curr. Top. Med. Chem. 2004;4:589–600.
27. Harper G, Pickett SD, Green DVS. Design of a compound screening collection for use in high throughput screening. Comb. Chem. HTS 2004;7:63–70.
28. Hann MM, Leach AR, Harper G. Molecular complexity and its impact on the probability of finding leads for drug discovery. J. Chem. Inf. Comput. Sci. 2001;41:856–864.
29. Selzer P, Roth HJ, Ertl P, Schuffenhauer A. Complex molecules: do they add value? Curr. Opin. Chem. Biol. 2005;9:310–316.
30. Schuffenhauer A, Ruedisser S, Marzinzik A, Jahnke W, Selzer P, Jacoby E. Library design for fragment based screening. Curr. Top. Med. Chem. 2005;5:751–762.
31. Hopkins AL, Groom CR, Alex A. Ligand efficiency: a useful metric for lead selection. Drug Disc. Today 2004;9:430–431.
32. Hajduk PJ. Fragment-based drug design: how big is too big? J. Med. Chem. 2006;49:6972–6976.
33. Bajorath J. Integration of virtual and high-throughput screening. Nature Rev. Drug. Disc. 2002;1:882–894.
34. Paolini GV, Shapland RHB, van Hoorn WP, Mason JS, Hopkins AL. Global mapping of pharmacological space. Nature Biotechnol. 2006;24:805–815.
35. Schuffenhauer A, Floersheim P, Acklin P, Jacoby E. Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci. 2003;43:391–405.
36. Frye SV. Structure-activity relationship homology (SARAH): a conceptual framework for drug discovery in the genomic era. Chem. Biol. 1999;6:R3–R7.
37. Willett P, Barnard JM, Downs GM. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 1998;38:983–996.
38. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J. Chem. Inf. Comput. Sci. 2004;44:1177–1185.
39. Byvatov E, Schneider G. SVM-based feature selection for characterization of focused compound collections. J. Chem. Inf. Comput. Sci. 2004;44:993–999.
40. Auer J, Bajorath J. Emerging chemical patterns: a new methodology for molecular classification and compound selection. J. Chem. Inf. Model. 2006;46:2502–2514.
41. Xia X, Maliski EG, Galliant P, Rogers D. Classification of kinase inhibitors using a Bayesian model. J. Med. Chem. 2004;47:4463–4470.
42. Zupan J, Gasteiger J. Neural Networks in Chemistry and Drug Design. 1999. Wiley-VCH, Weinheim.
43. von Korff M, Hilpert K. Assessing the predictive power of unsupervised visualization techniques to improve the identification of GPCR-focused compound libraries. J. Chem. Inf. Model. 2006;46:1580–1587.
44. Selzer P, Ertl P. Applications of self-organizing neural networks in virtual screening and diversity selection. J. Chem. Inf. Model. 2006;46:2319–2323.
45. Oprea TI, Tropsha A. Target, chemical and bioactivity databases: integration is key. Drug Disc. Today: Technol. 2006;3:357–365.
46. Kitchen DB, Decornez H, Furr JR, Bajorath J. Docking and scoring in virtual screening for drug discovery: methods and applications. Nature Rev. Drug. Disc. 2004;3:935–949.
47. Oloff S, Zhang S, Sukumar N, Breneman C, Tropsha A. Chemometric analysis of ligand receptor complementarity: identifying complementary ligands based on receptor information (CoLiBRI). J. Chem. Inf. Model. 2006;46:844–851.
48. Jansen JM, Martin EJ. Target-biased scoring approaches and expert systems in structure-based virtual screening. Curr. Opin. Chem. Biol. 2004;8:359–364.
49. Sun D, Chuaqui C, Deng Z, Bowes S, Chin D, Singh J, Cullen P, Hankins G, Lee WC, Donelly J, Friedmann J, Josiah S. A kinase-focused compound collection: compilation and screening strategy. Chem. Biol. Drug. Des. 2006;67:385–394.
50. Martin E, Sullivan D. Surrogate docking with AUTOSHIM ensembles: using PLS/MAGNET to customize scoring functions for an ensemble of diverse kinases to predict the activity of new kinases, even without crystal structures or homology models. 2006. 232nd ACS National Meeting, San Francisco, CA.
51. Liao JJL. Molecular recognition of protein kinase binding pockets for design of potent and selective kinase inhibitors. J. Med. Chem. 2007;50:1–16.
52. Harris JC, Stevens AP. Chemogenomics: structuring the drug discovery process to gene families. Drug Disc. Today 2006;11:880–888.
53. Evans BE, Rittle KE, Bock MG, DiPardo RM, Freidinger RM, Whitter WL, Lundell GF, Veber DF, Anderson PS, Chang RSL, Lotti VJ, Cerino DJ, Chen TB, Kling PJ, Kunkel KA, Springer JP, Hirshfield J. Methods for drug discovery: development of potent, selective, orally effective cholecystokinin antagonists. J. Med. Chem. 1988;31:2235–2246.
54. Müller G. Medicinal chemistry of target family-directed masterkeys. Drug Disc. Today 2003;8:681–691.
55. Schnur DM, Hermsmeier MA, Tebben AJ. Are target-family-privileged substructures truly privileged? J. Med. Chem. 2006;49:2000–2009.
56. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nature Biotechnol. 2007;25:197–206.
57. Koch MA, Wittenberg L-O, Basu S, Jeyaraj DA, Gourzoulidou E, Reinecke K, Odermatt A, Waldmann H. Compound library development guided by protein structure similarity clustering and natural product structure. Proc. Nat. Acad. Sci. U.S.A. 2004;101:16721–16726.
58. Nören-Müller A, Reis-Corrêa I, Prinz H, Rosenbaum C, Saxena K, Schwalbe HJ, Vestweber D, Cagna G, Schunk S, Schwarz O, Schiewe H, Waldmann H. Discovery of protein phosphatase inhibitor classes by biology-oriented synthesis. Proc. Nat. Acad. Sci. U.S.A. 2006;103:10606–10611.
59. Newman DJ, Cragg GM, Snader KM. Natural products as sources of new drugs over the period 1981-2002. J. Nat. Prod. 2003;66:1022–1037.
60. Firn RD, Jones CG. Natural products: a simple model to explain chemical diversity. Nat. Prod. Rep. 2003;20:382–391.
61. Inglese J, Auld DS, Jadhav A, Johnson RL, Simeonov A, Yasgar A, Zheng W, Austin CP. Quantitative high-throughput screening: a titration-based approach that efficiently identifies biological activities in large chemical libraries. Proc. Nat. Acad. Sci. U.S.A. 2006;103:11473–11478.
62. Harper G, Pickett SD. Methods for mining HTS data. Drug Disc. Today 2006;11:694–699.
63. Heyse S. Comprehensive analysis of high-throughput screening data. Proc. SPIE 2002;4626:535–547.
64. Davies JW, Glick M, Jenkins JL. Streamlining lead discovery by aligning in silico and high-throughput screening. Curr. Opin. Chem. Biol. 2006;10:343–351.
65. Fay N. The role of the discovery informatics framework in early lead discovery. Drug Disc. Today 2006;11:1075–1084.
66. Irwin JJ, Shoichet B. ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005;45:177–182.
67. Fara DC, Oprea TI, Prossnitz ER, Bologa CG, Edwards BS, Sklar LA. Integration of virtual and physical screening. Drug Disc. Today: Technol. 2006;3:377–385.
68. Hert J, Willett P, Wilton DJ, Acklin P, Azzaoui K, Jacoby E, Schuffenhauer A. Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Org. Biomol. Chem. 2004;2:3256–3266.
69. Warren GL, Andrews CW, Capelli A-M, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. A critical assessment of docking programs and scoring functions. J. Med. Chem. 2006;49:5912–5931.
70. Engels MFM, Venkatarangan P. Smart screening: approaches to efficient HTS. Curr. Opin. Drug. Disc. Dev. 2001;4:275–283.
71. Crisman TJ, Jenkins JL, Parker CN, Hill WAG, Bender A, Deng Z, Nettles JH, Davies JW, Glick M. Plate cherry picking: a novel semi-sequential screening paradigm for cheaper, faster, information-rich compound selection. J. Biomol. Screen. 2007;12:320–327.
72. Raymond JW, Kibbey CE. An automated method for exploring targeted substructural diversity within sets of chemical structures. J. Chem. Inf. Model. 2005;45:1195–1204.
73. Tamura SY, Bacha PA, Gruver HS, Nutt RF. Data analysis of high-throughput screening results: application of multidomain clustering to the NCI anti-HIV data set. J. Med. Chem. 2002;45:3082–3093.
74. Roberts G, Myatt GJ, Johnson WP, Cross KP, Blower PE Jr. LeadScope: software for exploring large sets of screening data. J. Chem. Inf. Comput. Sci. 2000;40:1302–1314.
75. Howe TJ, Mahieu G, Marichal P, Tabruyn T, Vugts P. Data reduction and presentation in drug discovery. Drug Disc. Today 2007;12:45–53.
76. Agrafiotis DK, Bandyopadhyay D, Farnum M. Radial clustergrams: visualizing the aggregate properties of hierarchical clusters. J. Chem. Inf. Model. 2007;47:69–75.
See Also
Combinatorial Libraries: Overview of Applications in Chemical Biology
Compound Handling
Computational Approaches in Drug Discovery and Development
High Throughput Screening (HTS) Techniques: Overview of Applications in Chemical Biology
Lead Optimization in Drug Discovery
Natural Products to Probe Biosynthetic Pathways
Natural Products: An Overview
Small Molecule Combinatorial Libraries
Small Molecule, Drug-Target Databases
Target Family-Biased Compound Library: Optimization, Target Selection and Validation
Further Reading
Bartlett PA, Entzeroth M, eds. Exploiting Chemical Diversity for Drug Discovery. 2006. RSC Publishing, Cambridge.
Jacoby E, ed. Chemogenomics: Knowledge-based Approaches to Drug Discovery. 2006. Imperial College Press, London.
Lipinski C, Hopkins A. Navigating chemical space for biology and medicine. Nature 2004;432:855–861.
Oprea TI, Davis AM, Teague SJ. Is there a difference between leads and drugs? A historical perspective. J. Chem. Inf. Comput. Sci. 2001;41:1308–1315.
WILEY ENCYCLOPEDIA OF CHEMICAL BIOLOGY 2008, John Wiley & Sons, Inc.