
Production system

A production system (or production rule system) is a computer program typically used to provide some form of artificial intelligence, which consists primarily of a set of rules about behavior. These rules, termed productions, are a basic representation found useful in automated planning, expert systems and action selection. A production system provides the mechanism necessary to execute productions in order to achieve some goal for the system. Productions consist of two parts: a sensory precondition (or "IF" statement) and an action (or "THEN"). If a production's precondition matches the current state of the world, then the production is said to be triggered. If a production's action is executed, it is said to have fired. A production system also contains a database, sometimes called working memory, which maintains data about current state or knowledge, and a rule interpreter. The rule interpreter must provide a mechanism for prioritizing productions when more than one is triggered.

Basic operation


Rule interpreters generally execute a forward chaining algorithm for selecting productions to execute to meet current goals, which can include updating the system's data or beliefs. The condition portion of each rule (left-hand side or LHS) is tested against the current state of the working memory. In idealized or data-oriented production systems, there is an assumption that any triggered conditions should be executed: the consequent actions (right-hand side or RHS) will update the agent's knowledge, removing or adding data to the working memory. The system stops processing either when the user interrupts the forward chaining loop; when a given number of cycles has been performed; when a "halt" RHS is executed, or when no rules have true LHSs. Real-time and expert systems, in contrast, often have to choose between mutually exclusive productions --- since actions take time, only one action can be taken, or (in the case of an expert system) recommended. In such systems, the rule interpreter, or inference engine, cycles through two steps: matching production rules against the database, followed by selecting which of the matched rules to apply and executing the selected actions.
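To make the match-select-act cycle concrete, the following minimal Python sketch implements the loop just described. It is an illustration only; the rule format (a condition function and an action function over a set of facts) and the example rules are assumptions chosen for brevity, not the conventions of any particular production system.

# Minimal forward-chaining production system sketch (illustrative only).
# Working memory is a set of facts (tuples); each production has a name,
# a condition over working memory (the LHS), and an action (the RHS).

def run(productions, working_memory, max_cycles=100):
    for _ in range(max_cycles):
        # Match: collect all productions whose LHS is satisfied (the conflict set).
        conflict_set = [p for p in productions if p["lhs"](working_memory)]
        if not conflict_set:
            break  # no rule has a true LHS: halt
        # Conflict resolution: here, simply take the first matching rule.
        chosen = conflict_set[0]
        # Act: fire the chosen production, which updates working memory.
        chosen["rhs"](working_memory)
    return working_memory

# Example: two toy rules operating on simple facts.
productions = [
    {"name": "thirsty-drink",
     "lhs": lambda wm: ("thirsty",) in wm and ("has-water",) in wm,
     "rhs": lambda wm: (wm.discard(("thirsty",)), wm.add(("drank",)))},
    {"name": "fetch-water",
     "lhs": lambda wm: ("thirsty",) in wm and ("has-water",) not in wm,
     "rhs": lambda wm: wm.add(("has-water",))},
]

wm = {("thirsty",)}
print(run(productions, wm))   # {('drank',), ('has-water',)} in some order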

Matching production rules against working memory

Production systems may vary on the expressive power of conditions in production rules. Accordingly, the pattern matching algorithm which collects production rules

with matched conditions may range from the naive -- trying all rules in sequence, stopping at the first match -- to the optimized, in which rules are "compiled" into a network of inter-related conditions. The latter is illustrated by the RETE algorithm, designed by Charles L. Forgy in 1983, which is used in a series of production systems, called OPS and originally developed at Carnegie Mellon University culminating in OPS5 in the early eighties. OPS5 may be viewed as a full-fledged programming language for production system programming.

Choosing which rules to evaluate


Production systems may also differ in the final selection of production rules to execute, or fire. The collection of rules resulting from the previous matching algorithm is called the conflict set, and the selection process is also called a conflict resolution strategy. Here again, such strategies may vary from the simple -- use the order in which production rules were written; assign weights or priorities to production rules and sort the conflict set accordingly -- to the complex -- sort the conflict set according to the times at which production rules were previously fired; or according to the extent of the modifications induced by their RHSs. Whichever conflict resolution strategy is implemented, the method is indeed crucial to the efficiency and correctness of the production system.
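As a small illustration of one such strategy, the sketch below orders a conflict set first by an explicit rule priority and then by the recency of the matched data, firing only the winner. The rule and instance formats are invented for the example.

# Illustrative conflict resolution: sort matched rule instances by
# (priority, recency), highest first, and fire only the winner.

def resolve(conflict_set):
    # Each entry: (rule, timestamps of the matched facts)
    return max(
        conflict_set,
        key=lambda inst: (inst[0].get("priority", 0), max(inst[1], default=0)),
    )

rule_a = {"name": "A", "priority": 1}
rule_b = {"name": "B", "priority": 5}
conflict_set = [(rule_a, [3, 7]), (rule_b, [2])]
print(resolve(conflict_set)[0]["name"])  # "B": higher priority wins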

Using production systems


The use of production systems varies from simple string rewriting rules to the modeling of human cognitive processes, from term rewriting and reduction systems to expert systems.

A simple string rewriting production system example

This example shows a set of production rules for reversing a string from an alphabet that does not contain the symbols "$" and "*" (which are used as marker symbols).

In this example, production rules are chosen for testing according to their order in this production list. For each rule, the input string is examined from left to right with a moving window to find a match with the LHS of the production rule. When a match is found, the matched substring in the input string is replaced with the RHS of the

production rule. In this production system, x and y are variables matching any character of the input string alphabet. Matching resumes with P1 once the replacement has been made. The string "ABC", for instance, undergoes the following sequence of transformations under these production rules:

In such a simple system, the ordering of the production rules is crucial. Often, the lack of control structure makes production systems difficult to design. It is, of course, possible to add control structure to the production systems model, namely in the inference engine, or in the working memory.
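The original rule table and transformation trace for the string-reversal example are not reproduced above. The sketch below therefore uses a different, deliberately trivial rule set; what it does illustrate is the control regime just described: ordered rules, a left-to-right moving window over the string, and a restart at the first rule after every replacement.

# Moving-window string rewriting: try rules in order, scan the input
# left to right, replace the first match, then restart from rule P1.

def rewrite(rules, s, max_steps=1000):
    for _ in range(max_steps):
        for lhs, rhs in rules:          # rule order matters
            i = s.find(lhs)             # left-to-right window scan
            if i != -1:
                s = s[:i] + rhs + s[i + len(lhs):]
                break                   # restart with the first rule
        else:
            return s                    # no rule matched: halt
    return s

# Illustrative rules (not the article's reversal rules): move every
# 'A' to the right of every 'B', i.e. sort a string over {A, B}.
rules = [("AB", "BA")]
print(rewrite(rules, "BABA"))  # "BBAA"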

An OPS5 production rule example


In a toy simulation world where a monkey in a room can grab different objects and climb on others, an example production rule to grab an object suspended from the ceiling would look like:

In this example, data in working memory is structured and variables appear between angle brackets. The name of the data structure, such as "goal" and "physical-object", is the first literal in conditions; the fields of a structure are prefixed with "^". The "-" indicates a negative condition.

Production rules in OPS5 apply to all instances of data structures that match conditions and conform to variable bindings. In this example, should several objects be suspended from the ceiling, each with a different ladder nearby supporting an empty-handed monkey, the conflict set would contain as many instances of the same production "Holds::Object-Ceiling" as there are such objects. The conflict resolution step would later select which production instances to fire. Note that the binding of variables resulting from the pattern matching in the LHS is used in the RHS to refer to the data to be modified. Note also that the working memory contains explicit control structure data in the form of "goal" data structure instances. In the example, once a monkey holds the suspended object, the status of the goal is set to "satisfied" and the same production rule can no longer apply as its first condition fails.

Ensemble learning
In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models.[1][2][3] Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble refers only to a concrete finite set of alternative models.

Overview
Supervised learning algorithms are commonly described as performing the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem. Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. In other words, an ensemble is a technique for combining many weak learners in an attempt to produce a strong learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. Fast algorithms such as decision trees are commonly used with ensembles, although slower algorithms can benefit from ensemble techniques as well.
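As a toy illustration of the combination step, the sketch below lets three stand-in "models" vote and returns the majority label. The models are hard-coded placeholders; a real ensemble would combine independently trained learners.

from collections import Counter

# Toy ensemble: each "model" is just a function mapping an input to a label.
# In practice these would be independently trained classifiers.
def model_a(x): return "spam" if "free" in x else "ham"
def model_b(x): return "spam" if "win" in x else "ham"
def model_c(x): return "spam" if len(x) > 40 else "ham"

def ensemble_predict(models, x):
    votes = Counter(m(x) for m in models)      # one vote per model
    return votes.most_common(1)[0][0]          # majority label wins

print(ensemble_predict([model_a, model_b, model_c], "win a free phone"))  # "spam"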

Ensemble theory
An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data. Empirically, ensembles tend to yield better results when there is a significant diversity among the models. [4] [5] Many ensemble methods, therefore, seek to promote diversity among the models they combine. [6][7] Although perhaps nonintuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[8] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity.[9]

Common types of ensembles


Bayes optimal classifier

The Bayes Optimal Classifier is an optimal classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it, so it is the ideal ensemble.[10] Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes Optimal Classifier can be expressed with the following equation:
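The equation does not survive in this text. Reconstructed from the definitions given immediately below, the usual statement of the Bayes optimal classifier is

y = argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j | h_i) P(T | h_i) P(h_i)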

where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a probability, and T is the training data. As an ensemble, the Bayes Optimal Classifier represents a hypothesis that is not

necessarily in H. The hypothesis represented by the Bayes Optimal Classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in H). Unfortunately, the Bayes Optimal Classifier cannot be practically implemented for any but the most simple of problems. There are several reasons why it cannot be practically implemented:
- Most interesting hypothesis spaces are too large to iterate over, as required by the argmax.
- Many hypotheses yield only a predicted class, rather than a probability for each class as required by the term P(c_j | h_i).
- Computing an unbiased estimate of the probability of the training set given a hypothesis, P(T | h_i), is non-trivial.
- Estimating the prior probability for each hypothesis, P(h_i), is rarely feasible.

Bayesian model averaging


Bayesian model averaging is an ensemble technique that seeks to approximate the Bayes Optimal Classifier by sampling hypotheses from the hypothesis space and combining them using Bayes' law.[11] Unlike the Bayes optimal classifier, Bayesian model averaging can be practically implemented. Hypotheses are typically sampled using a Monte Carlo sampling technique such as MCMC. For example, Gibbs sampling may be used to draw hypotheses that are representative of the distribution P(T | H). It has been shown that under certain circumstances, when hypotheses are drawn in this manner and averaged according to Bayes' law, this technique has an expected error that is bounded to be at most twice the expected error of the Bayes optimal classifier.[12] Despite the theoretical correctness of this technique, however, it has a tendency to promote over-fitting, and does not perform as well empirically as simpler ensemble techniques such as bagging.[13]

Bootstrap aggregating (bagging)

Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly-drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.[14]
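The core loop of bagging can be sketched as follows. The base learner here (predict the majority label of the bootstrap sample) is a deliberately trivial stand-in so that the example runs without any learning library; only the bootstrap sampling and the equal-weight vote are the point.

import random
from collections import Counter

# Bagging sketch: train each model on a bootstrap sample (drawn with
# replacement) from the training set, then combine predictions by an
# equal-weight vote.

def train_base_model(sample):
    # Toy base learner: always predict the sample's majority label.
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bag(train_set, n_models=25):
    models = []
    for _ in range(n_models):
        boot = [random.choice(train_set) for _ in train_set]  # bootstrap sample
        models.append(train_base_model(boot))
    def predict(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]
    return predict

train_set = [(1, "a"), (2, "a"), (3, "b")]
predict = bag(train_set)
print(predict(4))  # most likely "a", since "a" dominates the training labels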

Boosting

Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.

Bucket of models
A "bucket of models" is an ensemble in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set. The most common approach used for model-selection is cross-validation selection. It is described with the following pseudo-code:

Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that works best".[15] Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket. When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.[16]

Stacking
The crucial prior belief underlying the scientific method is that one can judge among a set of models by comparing them on data that was not used to create any of them. This same prior belief underlies the use in machine learning of bake-off contests to judge which of a set of competitor learning algorithms is actually best.

This prior belief can also be used by a single practitioner, to choose among a set of models based on a single data set. This is done by partitioning the data set into a held-in data set and a held-out data set; training the models on the held-in data; and then choosing whichever of those trained models performs best on the held-out data. This is the cross-validation technique, mentioned above. Stacking (sometimes called stacked generalization) exploits this prior belief further. It does this by using performance on the held-out data to combine the models rather than choose among them, thereby typically getting performance better than any single one of the trained models.[17] It has been successfully used on both supervised learning tasks (regression)[18] and unsupervised learning (density estimation).[19] It has also been used to estimate Bagging's error rate.[20][3] Because the prior belief concerning held-out data is so powerful, stacking often outperforms Bayesian model-averaging.[21] Indeed, renamed blending, stacking was extensively used in the two top performers in the recent Netflix competition.[22]
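A minimal sketch of the stacking idea, under the same caveats as the other examples here: two crude base models are fit on held-in data, and a least-squares combiner is fit on their held-out predictions. The base models, the linear combiner, and the synthetic data are all assumptions; stacking implementations vary widely.

import numpy as np

# Stacking sketch (regression): base models are fit on held-in data, and a
# linear combiner is fit on their held-out predictions.

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)    # toy target

held_in, held_out = (x[:100], y[:100]), (x[100:], y[100:])

# Two deliberately crude base "models", fit on held-in data only.
slope = np.sum(held_in[0] * held_in[1]) / np.sum(held_in[0] ** 2)
base_1 = lambda t: slope * t                            # least-squares line through the origin
base_2 = lambda t: np.full_like(t, held_in[1].mean())   # constant predictor

# Combiner: least-squares weights on the base models' held-out predictions.
preds = np.column_stack([base_1(held_out[0]), base_2(held_out[0])])
weights, *_ = np.linalg.lstsq(preds, held_out[1], rcond=None)

stacked = lambda t: np.column_stack([base_1(t), base_2(t)]) @ weights
print(weights)                    # roughly [1., 1.] for this synthetic data
print(stacked(np.array([0.5])))   # about 3.5, i.e. 3 * 0.5 + 2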

Boosting
Boosting is a machine learning meta-algorithm for performing supervised learning. Boosting is based on the question posed by Kearns[1]: can a set of weak learners create a single strong learner? A weak learner is defined to be a classifier which is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification. Schapire's affirmative answer[2] to Kearns' question has had significant ramifications in machine learning and statistics, most notably leading to the development of boosting.

Boosting algorithms
While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority[3] and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified. There are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation) and Yoav Freund (boost by majority), were not adaptive and could not take full advantage of the weak learners.
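The reweighting step can be made concrete with an AdaBoost-style sketch using one-feature threshold "stumps" as weak learners. The stump search, the weight-update formula shown, and the toy data are illustrative; they follow the standard AdaBoost recipe rather than any particular implementation.

import numpy as np

# AdaBoost-style sketch: labels are +/-1, weak learners are threshold stumps
# on a single feature, and misclassified examples gain weight each round.

def best_stump(x, y, w):
    # Pick the (threshold, polarity) pair with the lowest weighted error.
    best = None
    for thr in np.unique(x):
        for polarity in (+1, -1):
            pred = np.where(x >= thr, polarity, -polarity)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(x, y, rounds=10):
    n = len(x)
    w = np.full(n, 1.0 / n)                 # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        err, thr, polarity = best_stump(x, y, w)
        err = max(err, 1e-10)               # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(x >= thr, polarity, -polarity)
        w *= np.exp(-alpha * y * pred)      # misclassified examples gain weight
        w /= w.sum()
        ensemble.append((alpha, thr, polarity))
    def predict(t):
        score = sum(a * np.where(t >= thr, pol, -pol) for a, thr, pol in ensemble)
        return np.sign(score)
    return predict

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([-1, -1, -1, 1, 1, 1])
predict = adaboost(x, y)
print(predict(np.array([2.5, 5.5])))        # expect [-1.  1.]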

Only algorithms that are provable boosting algorithms in the probably approximately correct learning formulation are called boosting algorithms. Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms.[4]

Examples of boosting algorithms


The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and perhaps the most significant historically as it was the first algorithm that could adapt to the weak learners. However, there are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, MadaBoost, LogitBoost, and others. Many boosting algorithms fit into the AnyBoost framework,[5] which shows that boosting performs gradient descent in function space using a convex cost function. On October 7th, 2010 Phillip Long (at Google) and Rocco A. Servedio (Columbia University) released a paper suggesting that these algorithms are provably flawed in that "convex potential boosters cannot withstand random classification noise," thus making the applicability of such algorithms for real world, noisy data sets questionable.

Implementations
- Orange, a free data mining software suite, module orngEnsemble
- Weka, a set of machine learning tools that offers various implementations of boosting algorithms such as AdaBoost and LogitBoost

Bayesian probability
Bayesian probability is one of the different interpretations of the concept of probability and belongs to the category of evidential probabilities. The Bayesian interpretation of probability can be seen as an extension of logic that enables reasoning with uncertain statements. To evaluate the probability of a hypothesis, the Bayesian probabilist specifies some prior probability, which is then updated in the light of new relevant data.[1] The Bayesian interpretation provides a standard set of procedures and formulae to perform this calculation. Bayesian probability interprets the concept of probability

as "a measure of a state of knowledge", in contrast to interpreting it as a frequency or a "propensity" of some phenomenon. The term "Bayesian" refers to the 18th century mathematician and theologian Thomas Bayes (17021761), who provided the first mathematical treatment of a non-trivial problem of Bayesian inference. Nevertheless, it was the French mathematician Pierre-Simon Laplace (17491827) who pioneered and popularized what is now called Bayesian probability. Broadly speaking, there are two views on Bayesian probability that interpret the state of knowledge concept in different ways. According to the objectivist view, the rules of Bayesian statistics can be justified by requirements of rationality and consistency and interpreted as an extension of logic. According to the subjectivist view, the state of knowledge measures a "personal belief". Many modern machine learning methods are based on objectivist Bayesian principles. In the Bayesian view, a probability is assigned to a hypothesis, whereas under the frequentist view, a hypothesis is typically tested without being assigned a probability.

Bayesian methodology
In general, Bayesian methods are characterized by the following concepts and procedures:
- The use of hierarchical models, and marginalization over the values of nuisance parameters. In most cases, the computation is intractable, but good approximations can be obtained using Markov chain Monte Carlo methods.
- The sequential use of Bayes' formula: when more data becomes available after calculating a posterior distribution, the posterior becomes the next prior (see the sketch after this list).
- In frequentist statistics, a hypothesis is a proposition (which must be either true or false), so that the (frequentist) probability of a frequentist hypothesis is either one or zero. In Bayesian statistics, a probability can be assigned to a hypothesis.
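A toy numerical sketch of the sequential step, with an invented two-hypothesis coin example:

# Sequential Bayesian updating sketch: after each observation, the
# posterior is used as the prior for the next one.

def update(prior, likelihoods):
    # prior: {hypothesis: probability}; likelihoods: {hypothesis: P(data | hypothesis)}
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

belief = {"fair coin": 0.5, "biased coin": 0.5}
# Observe three heads in a row; P(heads) is 0.5 for the fair coin, 0.9 for the biased one.
for _ in range(3):
    belief = update(belief, {"fair coin": 0.5, "biased coin": 0.9})
print(belief)  # the biased-coin hypothesis now carries most of the probability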

Objective and subjective Bayesian probabilities


Broadly speaking, there are two views on Bayesian probability that interpret the 'state of knowledge' concept in different ways. For objectivists, the rules of Bayesian statistics can be justified by requirements of rationality and consistency. Such requirements of rationality and consistency are also important for subjectivists, for which the state of knowledge corresponds to a 'personal belief' (rather than the objective state of knowledge in the world). For subjectivists however, rationality and consistency constrain the probabilities a subject may have, but allow for substantial variation within those constraints. The objective and subjective variants of Bayesian

probability differ mainly in their interpretation and construction of the prior probability.

History
The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence. Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics. In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objectivist stream, the statistical analysis depends on only the model assumed and the data analyzed. No subjective decisions need to be involved. In contrast, "subjectivist" statisticians deny the possibility of fully objective analysis for the general case. In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications. Despite the growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics. Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning and talent analytics.

Justification of Bayesian probabilities


The use of Bayesian probabilities as the basis of Bayesian inference has been supported by several arguments, such as the Cox axioms, the Dutch book argument, arguments based on decision theory and de Finetti's theorem.

Axiomatic approach
Richard T. Cox showed that Bayesian updating follows from several axioms, including two functional equations and the controversial hypothesis that probability should be treated as a continuous function. Here "continuity" is equivalent to countable additivity, as proved in measure-theoretic probability books. The countable additivity requirement is rejected (e.g. for being non-falsifiable) by Bruno de Finetti, for example.

Dutch book approach


The Dutch book argument was proposed by de Finetti, and is based on betting. A Dutch book is made when a clever gambler places a set of bets that guarantee a profit, no matter what the outcome is of the bets. If a bookmaker follows the rules of the Bayesian calculus in the construction of his odds, a Dutch book cannot be made. However, Ian Hacking noted that traditional Dutch book arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. For example, Hacking writes "And neither the Dutch book argument, nor any other in the personalist arsenal of proofs of the probability axioms, entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour." In fact, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics" following the publication of Richard C. Jeffrey's rule). The additional hypotheses sufficient to (uniquely) specify Bayesian updating are substantial, complicated, and unsatisfactory.

Decision theory approach


A decision-theoretic justification of the use of Bayesian inference (and hence of Bayesian probabilities) was given by Abraham Wald, who proved that every Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.

Personal probabilities and objective methods for constructing priors


Following the work on expected utility theory of Ramsey and von Neumann, decision-theorists have accounted for rational behavior using a probability distribution for the agent. Johann Pfanzagl completed the Theory of Games and Economic Behavior by providing an axiomatization of subjective probability and utility, a task left uncompleted by von Neumann and Oskar Morgenstern: their original theory supposed that all the agents had the same probability distribution, as a convenience. Pfanzagl's axiomatization was endorsed by Oskar Morgenstern: "Von Neumann and I have anticipated" the question whether probabilities "might, perhaps more typically, be subjective and have stated specifically that in the latter case axioms could be found from which could derive the desired numerical utility together with a number for the probabilities (cf. p. 19 of The Theory of Games and Economic Behavior). We did not carry this out; it was demonstrated by Pfanzagl with all the necessary rigor".

Ramsey and Savage noted that the individual agent's probability distribution could be objectively studied in experiments. The role of judgment and disagreement in science has been recognized since Aristotle and even more clearly with Francis Bacon. The objectivity of science lies not in the psychology of individual scientists, but in the process of science and especially in statistical methods, as noted by C. S. Peirce. Recall that the objective methods for falsifying propositions about personal probabilities have been used for a half century, as noted previously. Procedures for testing hypotheses about probabilities (using finite samples) are due to Ramsey (1931) and de Finetti (1931, 1937, 1964, 1970). Both Bruno de Finetti and Frank P. Ramsey acknowledge their debts to pragmatic philosophy, particularly (for Ramsey) to Charles S. Peirce. The "Ramsey test" for evaluating probability distributions is implementable in theory, and has kept experimental psychologists occupied for a half century.[19] This work demonstrates that Bayesian-probability propositions can be falsified, and so meet an empirical criterion of Charles S. Peirce, whose work inspired Ramsey. (This falsifiability criterion was popularized by Karl Popper.) Modern work on the experimental evaluation of personal probabilities uses the randomization, blinding, and Boolean-decision procedures of the Peirce-Jastrow experiment. Since individuals act according to different probability judgements, these agents' probabilities are "personal" (but amenable to objective study). Personal probabilities are problematic for science and for some applications where decision-makers lack the knowledge or time to specify an informed probability distribution (on which they are prepared to act). To meet the needs of science and of human limitations, Bayesian statisticians have developed "objective" methods for specifying prior probabilities. Indeed, some Bayesians have argued that the prior state of knowledge defines the (unique) prior probability distribution for "regular" statistical problems; cf. well-posed problems. Finding the right method for constructing such "objective" priors (for appropriate classes of regular problems) has been the quest of statistical theorists from Laplace to John Maynard Keynes, Harold Jeffreys, and Edwin Thompson Jaynes. These theorists and their successors have suggested several methods for constructing "objective" priors:
- Maximum entropy
- Transformation group analysis
- Reference analysis

Each of these methods contributes useful priors for "regular" one-parameter problems, and each prior can handle some challenging statistical models (with "irregularity" or several parameters). Each of these methods has been useful in

Bayesian practice. Indeed, methods for constructing "objective" (alternatively, "default" or "ignorance") priors have been developed by avowed subjective (or "personal") Bayesians like James Berger (Duke University) and José-Miguel Bernardo (Universitat de València), simply because such priors are needed for Bayesian practice, particularly in science. Each of these methods gives implausible priors for some problems, and so the quest for "the universal method for constructing priors" continues to attract statistical theorists. Thus, the Bayesian statistician needs either to use informed priors (using relevant expertise or previous data) or to choose among the competing methods for constructing "objective" priors.

See also
- Bertrand's paradox: a paradox in classical probability, resolved in the context of Bayesian probability
- De Finetti's game: a procedure for evaluating someone's subjective probability
- Uncertainty

Bayesian inference
Bayesian inference is a method of statistical inference in which evidence is used to update the uncertainty of parameters and predictions in a probability model. The term "Bayesian" comes from the use of the Bayesian interpretation of probability. Coming to a conclusion about uncertain inferences involves collecting evidence. As evidence accumulates, the degree of confidence in a hypothesis typically tends toward either a high or a low value. Hypotheses whose confidence level becomes high can be accepted, while those whose confidence level becomes low can be rejected. In Bayesian inference, this process is quantified using the Bayesian interpretation of probability as confidence in the value of a variable. The variable under test may be a single hypothesis, but more generally is the joint probability distribution of parameters and predictions in a probability model. Each new piece of evidence may be used to update the confidence with an application of Bayes' theorem. In each application, the initial belief is called the prior, whereas the modified belief is called the posterior.

Method

Let θ be the set of uncertain parameters and predictions in the model, and let E be the new evidence.

Before the evidence is taken into account, one starts with some belief about θ, expressed as an initial prior probability distribution. To take evidence into account, Bayes' theorem is applied:

P(θ | E) = P(E | θ) P(θ) / P(E)

P(θ | E) is the probability distribution of the uncertain quantities after the evidence is taken into account, the posterior probability. P(θ) is the probability distribution representing uncertainty about the parameters and predictions before the evidence is taken into account, the prior probability.

P(E | θ) / P(E) is a factor representing the impact of the evidence on inferences about θ. The numerator, P(E | θ), is called the likelihood. To take further evidence into account, Bayes' theorem may be applied repeatedly. In each application, the previous posterior becomes the new prior.

Properties and Interpretation


Interpretation of the factor P(E | H) / P(E)

To interpret this factor, consider a special case in which θ can take on a discrete set of values. Let H be one of these possible values. (H stands for "hypothesis", but in general it can represent any parameter or uncertain quantity in a model. For example, in a study of a treatment effect, H might correspond to the hypothesis that the treatment effect is positive.)

If P(E | H) > P(E), then P(H | E) > P(H). That is, confidence increases if the evidence is more likely when the hypothesis is true. The reverse argument applies for a decrease in confidence. In the case that confidence does not change, P(E | H) = P(E): the evidence is equally likely whether or not the hypothesis is true. Note that the prior probability of the hypothesis does not affect the likelihood of the evidence.

Cromwell's rule
If the prior probability P(H) = 1, then P(H | E) = 1. Similarly, if P(H) = 0 then P(H | E) = 0. This can be interpreted to mean that hard convictions are insensitive to fact, or that certainty is insensitive to new evidence.

Examples
Testing a hypothesis
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1? Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 correspond to bowl #1, and H2 to bowl #2. It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 0.5. The event E is the observation of a plain cookie. From the contents of the bowls, we know that P(E | H1) = 30 / 40 = 0.75 and P(E | H2) = 20 / 40 = 0.5. Bayes' formula then yields

P(H1 | E) = P(E | H1) P(H1) / (P(E | H1) P(H1) + P(E | H2) P(H2)) = (0.75 × 0.5) / (0.75 × 0.5 + 0.5 × 0.5) = 0.6

Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, P(H1), which was 0.5. After observing the cookie, we must revise the probability to P(H1 | E), which is 0.6.
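The same arithmetic as a quick check, using the counts given above:

# Bayes' theorem for the cookie example: P(H1 | plain cookie).
p_h1, p_h2 = 0.5, 0.5            # Fred picks either bowl with equal probability
p_e_h1 = 30 / 40                 # fraction of plain cookies in bowl #1
p_e_h2 = 20 / 40                 # fraction of plain cookies in bowl #2
p_e = p_e_h1 * p_h1 + p_e_h2 * p_h2
print(p_e_h1 * p_h1 / p_e)       # 0.6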

Applications
Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques, since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like Gibbs sampling and other Metropolis-Hastings schemes. Recently, Bayesian inference has gained popularity amongst the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously. In the areas of population genetics and dynamical systems theory, approximate Bayesian computation (ABC) is also becoming increasingly popular. As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying e-mail spam. Applications which make use of Bayesian inference for spam filtering include DSPAM, Bogofilter, SpamAssassin, SpamBayes, and Mozilla. Spam classification is treated in more detail in the article on the naive Bayes classifier. In some applications fuzzy logic is an alternative to Bayesian inference. Fuzzy logic and Bayesian inference, however, are mathematically and semantically not compatible. You cannot, in general, understand the degree of truth in fuzzy logic as probability and vice versa; fuzziness measures "the degree to which an event occurs, not whether it occurs".

In the courtroom
Bayesian inference can be used in a court setting by an individual juror to coherently accumulate the evidence for and against the guilt of the defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.[2][3][4] The benefit of adopting a Bayesian approach is that it gives the juror a formal mechanism for combining the evidence presented. The approach can be applied successively to all the pieces of evidence presented in court, with the posterior from one stage becoming the prior for the next. The juror would still have to have a prior estimate for the guilt probability before the first piece of evidence is considered. It has been suggested that this could reasonably be the guilt probability of a random person taken from the qualifying population. Thus, for a crime known to have been committed by an adult male living in a town containing 50,000 adult males, the appropriate initial prior probability might be 1/50,000. For the purpose of explaining Bayes' theorem to jurors, it will usually be appropriate to give it in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach which replaces multiplication with addition and reduces the range of the numbers involved might be easier for a

jury to handle. This approach, developed by Alan Turing during World War II and later promoted by I. J. Good and E. T. Jaynes among others, amounts to the use of information entropy. In the United Kingdom, Bayes' theorem was explained to the jury in the odds form by a statistician expert witness in the rape case of Regina versus Denis John Adams. A conviction was secured but the case went to appeal, since no means of accumulating evidence had been provided for those jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave their opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task." No further appeal was allowed and the issue of Bayesian assessment of forensic DNA data remains controversial.

Gardner-Medwin[5] argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:
A - The known facts and testimony could have arisen if the defendant is guilty
B - The known facts and testimony could have arisen if the defendant is innocent
C - The defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox. Other court cases in which probabilistic arguments played some role were the Howland will forgery trial, the Sally Clark case, and the Lucia de Berk case.

Other
The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.[6] In March 2011, English Heritage reported the successful outcome of a research project by archaeologists at Cardiff University, which demonstrated the possibility of using Bayesian inference to more accurately date prehistoric remains.[7] Bayesian search theory is used to search for lost objects. Other applications include:
- Bayesian inference in phylogeny
- Bayesian tool for methylation analysis

Distribution of a parameter of the hypergeometric distribution

Consider a sample of n marbles drawn without replacement from an urn containing N marbles. If the number of white marbles in the urn is known to be equal to K, then the probability that the number of white marbles in the sample is equal to k is

p(k | K) = C(K, k) C(N − K, n − k) / C(N, n),

where C(a, b) denotes the binomial coefficient "a choose b".

The mean number of white marbles in the sample is nK/N, and the standard deviation is √( n (K/N) (1 − K/N) (N − n)/(N − 1) ).

An interesting situation is when the number of white marbles in the sample is known, and the number of white marbles in the urn is unknown. If the number of white marbles in the sample is equal to k, then the degree of confidence that the number of white marbles in the urn is equal to K is

p(K | k) = p(k | K) p(K) / p(k),

where p(K) is the prior probability that the number of white marbles in the urn is equal to K (that is, before observing the number of white marbles in the sample), and p(k) is the probability that the number of white marbles in the sample is equal to k, without knowing the number of white marbles in the urn. Assume now that all possible values of K are considered equally likely in advance, p(K) = 1/(N + 1) for K = 0, 1, ..., N. Then the degree of confidence that the number of white marbles in the urn is equal to K is

p(K | k) = C(K, k) C(N − K, n − k) / C(N + 1, n + 1).

The mean number of white marbles in the urn is

and the standard deviation is

These two formulas regarding the number of white marbles in the urn emerge from the simpler formulas regarding the number of white marbles in the sample by the substitution

The limiting cases, as N becomes large, are the binomial distribution and the beta distribution; see below.

Posterior distribution of the binomial parameter

The problem considered by Bayes in Proposition 9 of his essay[8] is the posterior distribution for the parameter of the binomial distribution. Consider n Bernoulli trials. If the success probability is equal to a, then the conditional probability of observing k successes is the (discrete) binomial distribution function

p(k | a) = C(n, k) a^k (1 − a)^(n − k).

The mean value of k is na and the standard deviation is √( na(1 − a) ). The mean value of k/n is a and its standard deviation is √( a(1 − a)/n ).

In the more realistic situation when k is known and a is unknown, p(k | a) is a likelihood function of a. The posterior probability distribution function of a, after observing k, is

p(a | k) = p(k | a) p(a) / ∫ p(k | a') p(a') da',

where a prior probability distribution function, p(a), is available to express what was known about a before observing k. Assume now that the prior distribution is the continuous uniform distribution, p(a) = 1 for 0 ≤ a ≤ 1. Then the posterior distribution is a beta distribution,

p(a | k) = ( (n + 1)! / (k! (n − k)!) ) a^k (1 − a)^(n − k).

The mean value of a is (k + 1)/(n + 2) rather than the sample proportion k/n, and the standard deviation is

√( (k + 1)(n − k + 1) / ((n + 2)^2 (n + 3)) )

rather than the plug-in value √( a(1 − a)/n ) with a = k/n. More generally, if the prior distribution is a beta distribution, Beta(α, β), then the posterior distribution is again a beta distribution, Beta(α + k, β + n − k).

So the beta distribution is a conjugate prior. What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter a. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter depend on a random event, he cleverly escapes a philosophical quagmire that was an issue he most likely was not even aware of.
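A small numerical sketch of the beta posterior described above; the observed counts n and k are invented, and only the closed-form mean and standard deviation are used.

import math

# Posterior for the binomial parameter a after observing k successes in n
# trials, with a uniform prior: Beta(k + 1, n - k + 1).
n, k = 10, 7                              # invented counts
alpha, beta = k + 1, n - k + 1
posterior_mean = alpha / (alpha + beta)   # (k + 1) / (n + 2)
posterior_std = math.sqrt(alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))
print(posterior_mean, posterior_std)      # 0.666..., about 0.131
print(k / n)                              # 0.7, the plain sample proportion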

History
The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was Pierre-Simon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence.[9] Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes[10]). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.[10] In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objectivist stream, the statistical analysis depends on only the model assumed

and the data analysed.[11] No subjective decisions need to be involved. In contrast, "subjectivist" statisticians deny the possibility of fully objective analysis for the general case. In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.[12] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.[13] Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.[14]

Multivariate analysis
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Uses for multivariate analysis include:
- Design for capability (also known as capability-based design)
- Inverse design, where any variable can be treated as an independent variable
- Analysis of Alternatives (AoA), the selection of concepts to fulfill a customer need
- Analysis of concepts with respect to changing scenarios
- Identification of critical design drivers and correlations across hierarchical levels.

Multivariate analysis can be complicated by the desire to include physics-based analysis to calculate the effects of variables for a hierarchical "system-of-systems." Often, studies that wish to use multivariate analysis are stalled by the dimensionality of the problem. These concerns are often eased through the use of surrogate models, highly accurate approximations of the physics-based code. Since surrogate models take the form of an equation, they can be evaluated very quickly. This becomes an enabler for large-scale MVA studies: while a Monte Carlo simulation across the design space is difficult with physics-based codes, it becomes trivial when evaluating surrogate models, which often take the form of response surface equations.

Factor analysis


Overview: Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces attribute space from a larger number of variables to a smaller number of factors. Factor analysis originated a century ago with Charles Spearman's attempts to show that a wide variety of mental tests could be explained by a single underlying intelligence factor.

Applications:
- To reduce a large number of variables to a smaller number of factors for data modeling
- To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor
- To select a subset of variables from a larger set, based on which original variables have the highest correlations with the principal component factors
- To create a set of factors to be treated as uncorrelated variables as one approach to handling multi-collinearity in such procedures as multiple regression

Factor analysis is part of the general linear model (GLM) family of procedures and makes many of the same assumptions as multiple regression.

Non-parametric statistics
In statistics, the term non-parametric statistics has at least two different meanings:

1. The first meaning of non-parametric covers techniques that do not rely on data belonging to any particular distribution. These include, among others:
- distribution-free methods, which do not rely on assumptions that the data are drawn from a given probability distribution. As such it is the opposite of parametric statistics. It includes non-parametric statistical models, inference and statistical tests.

- non-parametric statistics (in the sense of a statistic over data, which is defined to be a function on a sample that has no dependency on a parameter), whose interpretation does not depend on the population fitting any parametrized distributions. Statistics based on the ranks of observations are one example of such statistics and these play a central role in many non-parametric approaches.

2. The second meaning of non-parametric covers techniques that do not assume that the structure of a model is fixed. Typically, the model grows in size to accommodate the complexity of the data. In these techniques, individual variables are typically assumed to belong to parametric distributions, and assumptions about the types of connections among variables are also made. These techniques include, among others:
- non-parametric regression, which refers to modeling where the structure of the relationship between variables is treated non-parametrically, but where nevertheless there may be parametric assumptions about the distribution of model residuals.
- non-parametric hierarchical Bayesian models, such as models based on the Dirichlet process, which allow the number of latent variables to grow as necessary to fit the data, but where individual variables still follow parametric distributions and even the process controlling the rate of growth of latent variables follows a parametric distribution.

Applications and purpose


Non-parametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of nonparametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences; in terms of levels of measurement, for data on an ordinal scale. As non-parametric methods make fewer assumptions, their applicability is much wider than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more robust. Another justification for the use of non-parametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, non-parametric methods may be easier to use. Due both to this simplicity and to their greater

robustness, non-parametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding. The wider applicability and increased robustness of non-parametric tests comes at a cost: in cases where a parametric test would be appropriate, non-parametric tests have less power. In other words, a larger sample size can be required to draw conclusions with the same degree of confidence.

Non-parametric models


Non-parametric models differ from parametric models in that the model structure is not specified a priori but is instead determined from data. The term non-parametric is not meant to imply that such models completely lack parameters but that the number and nature of the parameters are flexible and not fixed in advance.
- A histogram is a simple nonparametric estimate of a probability distribution.
- Kernel density estimation provides better estimates of the density than histograms.
- Nonparametric regression and semiparametric regression methods have been developed based on kernels, splines, and wavelets.
- Data envelopment analysis provides efficiency coefficients similar to those obtained by multivariate analysis without any distributional assumption.

Methods
Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed. The most frequently used tests include:
- Anderson–Darling test
- statistical bootstrap methods
- Cochran's Q
- Cohen's kappa
- Friedman two-way analysis of variance by ranks
- Kaplan–Meier
- Kendall's tau
- Kendall's W
- Kolmogorov–Smirnov test
- Kruskal–Wallis one-way analysis of variance by ranks
- Kuiper's test
- logrank test
- Mann–Whitney U or Wilcoxon rank-sum test
- median test
- Pitman's permutation test
- rank products
- Siegel–Tukey test
- Spearman's rank correlation coefficient
- Wald–Wolfowitz runs test
- Wilcoxon signed-rank test

Parametric statistics
Parametric statistics is a branch of statistics that assumes that data have come from a type of probability distribution and makes inferences about the parameters of the distribution.[1] Most well-known elementary statistical methods are parametric.[2] Generally speaking parametric methods make more assumptions than nonparametric methods.[3] If those extra assumptions are correct, parametric methods can produce more accurate and precise estimates. They are said to have more statistical power. However, if those assumptions are incorrect, parametric methods can be very misleading. For that reason they are often not considered robust. On the other hand, parametric formulae are often simpler to write down and faster to compute. In some, but definitely not all cases, their simplicity makes up for their non-robustness, especially if care is taken to examine diagnostic statistics.[4] Because parametric statistics require a probability distribution, they are not distribution-free.[5]

Example
Suppose we have a sample of 99 test scores with a mean of 100 and a standard deviation of 10. If we assume all 99 test scores are random samples from a normal distribution, we predict there is a 1% chance that the 100th test score will be higher than 123.65 (that is, the mean plus 2.365 standard deviations), assuming that the 100th test score comes from the same distribution as the others. The normal family of distributions all have the same shape and are parameterized by mean and standard deviation. That means if you know the mean and standard deviation, and that the distribution is normal, you know the probability of any future observation. Parametric statistical methods are used to compute the 2.365 value above, given 99 independent observations from the same normal distribution. A non-parametric estimate of the same thing is the maximum of the first 99 scores. We don't need to assume anything about the distribution of test scores to reason that before we gave the test it was equally likely that the highest score would be any of the first 100. Thus there is a 1% chance that the 100th is higher than any of the 99 that preceded it.

History
Statistician Jacob Wolfowitz coined the statistical term "parametric" in 1942 in order to define its opposite:

"Most of these developments have this feature in common, that the distribution functions of the various stochastic variables which enter into their problems are assumed to be of known functional form, and the theories of estimation and of testing hypotheses are theories of estimation of and of testing hypotheses about, one or more parameters . . ., the knowledge of which would completely determine the various distribution functions involved. We shall refer to this situation . . , as the parametric case, and denote the opposite case, where the functional forms of the distributions are unknown, as the non-parametric case."[6]

Chi-square test
"Chi-square test" is often shorthand for Pearson's chi-square test. A chi-square test, also written chi-squared test or χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true, or any in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-square distribution as closely as desired by making the sample size large enough. Some examples of chi-squared tests where the chi-square distribution is only approximately valid: Pearson's chi-square test, also known as the chi-square goodness-of-fit test or chi-square test for independence (when mentioned without any modifiers or other precluding context, "chi-square test" is usually understood to mean this test; for an exact test used in place of χ², see Fisher's exact test); Yates' chi-square test, also known as Yates' correction for continuity; the Cochran–Mantel–Haenszel chi-square test; the linear-by-linear association chi-square test; the portmanteau test in time-series analysis, testing for the presence of autocorrelation; and likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).

One case where the distribution of the test statistic is an exact chi-square distribution is the test that the variance of a normally-distributed population has a given value based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.

Chi-square test for variance in a normal population

If a sample of size n is taken from a population having a normal distribution, then there is a well-known result (see distribution of the sample variance) which allows a test to be made of whether the variance of the population has a pre-determined value. For example, a manufacturing process might have been in a stable condition for a long period, allowing a value for the variance to be determined essentially without error. Suppose that a variant of the process is being tested, giving rise to a small sample of product items whose variation is to be tested. The test statistic T in this instance could be set to be the sum of squares about the sample mean, divided by the nominal value for the variance (i.e. the value to be tested as holding). Then T has a chi-square distribution with n − 1 degrees of freedom. For example, if the sample size is 21, the acceptance region for T at a significance level of 5% is the interval 9.59 to 34.17.
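A minimal sketch of how the acceptance region quoted above can be reproduced, assuming a two-sided test at the 5% level with n = 21 (20 degrees of freedom); the SciPy/NumPy usage and the helper function are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

n, alpha = 21, 0.05
lower = chi2.ppf(alpha / 2, df=n - 1)      # about 9.59
upper = chi2.ppf(1 - alpha / 2, df=n - 1)  # about 34.17
print(f"Accept H0 if {lower:.2f} <= T <= {upper:.2f}")

def chi_square_variance_statistic(sample, nominal_variance):
    """Sum of squares about the sample mean, divided by the nominal variance."""
    x = np.asarray(sample, dtype=float)
    return np.sum((x - x.mean()) ** 2) / nominal_variance
```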

A/B testing
A/B testing, split testing or bucket testing is a method of marketing testing by which a baseline control sample is compared to a variety of single-variable test samples in order to improve response rates. A classic direct mail tactic, this method has been recently adopted within the interactive space to test tactics such as banner ads, emails and landing pages. Significant improvements can be seen through testing elements like copy text, layouts, images and colors. However, not all elements produce the same improvements, and by looking at the results from different tests, it is possible to identify those elements that consistently tend to produce the greatest improvements. Employers of this A/B testing method will distribute multiple samples of a test, including the control, to see which single variable is most effective in increasing a response rate or other desired outcome. The test, in order to be effective, must reach an audience of a sufficient size that there is a reasonable chance of detecting a meaningful difference between the control and other tactics: see Statistical power. As a simple example, a company with a customer database of 2000 people decides to create an email campaign with a discount code in order to generate sales through its website. It creates an email and then modifies the Call To Action (the part of the copy which encourages customers to do something - in the case of a sales campaign, make a purchase). To 1000 people it sends the email with the Call To Action stating "Offer ends this Saturday! Use code A1", and to another 1000 people it sends the email with the Call To Action stating "Limited time offer! Use
code B1". All other elements of the email's copy and layout are identical. The company then monitors which campaign has the higher success rate by analysing the use of the promotional codes. The email using the code A1 has a 5% response rate (50 of the 1000 people emailed used the code to buy a product), and the email using the code B1 has a 3% response rate (30 of the recipients used the code to buy a product). The company therefore determines that, in this instance, the first Call To Action is more effective and will use it in future sales campaigns. In the example above, the purpose of the test is to determine which is the more effective way to prompt customers into making a purchase. If, however, the aim of the test was to see which email would generate the higher click-rate - i.e., the number of people who actually click through to the website after receiving the email - then the results might have been different. More of the customers receiving the code B1 may have accessed the website after receiving the email, but because the Call To Action didn't state the end-date of the promotion, there was less incentive for them to make an immediate purchase. If the purpose of the test was simply to see which email would bring more traffic to the website, then the email containing code B1 may have been more successful. An A/B test should have a defined outcome that is measurable, e.g. number of sales made, click-rate conversion, number of people signing up/registering etc. This method differs from multivariate testing, which applies statistical modeling by which a tester can try multiple variables within the samples distributed.
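A minimal sketch of how the two response rates in the example could be compared for statistical significance, here with a chi-square test on the 2×2 table of conversions; the choice of test and of SciPy are illustrative assumptions, since the text does not specify how significance would be assessed.

```python
from scipy.stats import chi2_contingency

# Rows: Call To Action A1, Call To Action B1; columns: bought, did not buy.
table = [[50, 950],
         [30, 970]]
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")
```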

Companies well-known for using A/B testing


Many companies use the "designed experiment" approach to making marketing decisions. It is an increasingly common practice as the tools and expertise grow in this area. There are many A/B testing case studies which show that the practice of testing is becoming increasingly popular with small and medium-sized businesses as well.[1] While it is widely used behind the scenes to maximize profits, the practice occasionally makes it into the spotlight:
Amazon.com - pioneered its use within the web e-commerce space.[2]
BBC[3]
eBay
Google - one of its top designers, Douglas Bowman, left and spoke out against excessive use of the practice.[4]
Microsoft[5]
Netflix[6]
Playdom (Disney Interactive)
Zynga[7]

Other terms used


A/B/N Testing: A/B testing with more than two options ("N" cells).
A/B/..Z Testing: same as above.
A/B/A: only two alternatives, but one is repeated. This enables a quick visual of when the test reaches significance.
Multivariate Testing: a designed experiment where effects from two or more potential causal factors can be isolated from one another.
Multivariant Testing: same as above.

Scientific Advertising
"Scientific Advertising" was written by Claude C. Hopkins in 1923 and is cited by many advertising and marketing personalities (such as David Ogilvy, Gary Halbert and Jay Abraham) as a "must-read" book. David Ogilvy is widely quoted as saying that "Nobody, at any level, should be allowed to have anything to do with advertising until he has read this book seven times". The book is cited as the original description of the process of split testing and of coupon-based customer tracking and loyalty schemes. In the book, Hopkins outlines an advertising approach based on testing and measuring: losses from unsuccessful ads are kept to a safe level while gains from profitable ads are multiplied. Or, as Hopkins wrote, the advertiser is "playing on the safe side of a hundred to one shot". The book also contains information on how to write advertising that sells: "salesmanship in print".

Expert system
In artificial intelligence, an expert system is a computer system that emulates the decision-making ability of a human expert.[1] Expert systems are designed to solve complex problems by reasoning about knowledge, like an expert, and not by following the procedure of a developer as is the case in conventional programming.[2][3][4] The first expert systems were created in the 1970s and then proliferated in the 1980s.[5] Expert systems were among the first truly successful forms of AI software.[6][7][8][9][10][11] An expert system has a unique structure, different from traditional programs. It is divided into two parts, one fixed and independent of the particular expert system, the inference engine, and one variable, the knowledge base. To run an expert system, the engine reasons about the knowledge base like a human.[12] In the 1980s a third part appeared: a dialog interface to communicate with users.[13] This ability to conduct a conversation with users was later called "conversational".[14][15]

Software architecture
The rule base or knowledge base

In expert system technology, the knowledge base is expressed with natural-language rules of the form IF ... THEN ... For example:
"IF it is living THEN it is mortal"
"IF his age = known THEN his date of birth = date of today - his age in years"
"IF the identity of the germ is not known with certainty AND the germ is gram-positive AND the morphology of the organism is "rod" AND the germ is aerobic THEN there is a strong probability (0.8) that the germ is of type enterobacteriacae"[16]

This formulation has the advantage of speaking in everyday language, which is very rare in computer science (a classical program is written in code). Rules express the knowledge to be exploited by the expert system. There exist other formulations of rules, which are not in everyday language and are understandable only to computer scientists; each rule style is adapted to an engine style. The whole problem of expert systems is to collect this knowledge, which is usually unconscious, from the experts. There are methods for doing so, but almost all of them are usable only by computer scientists.

The inference engine


The inference engine is a computer program designed to produce reasoning on rules. In order to produce reasoning, it is based on logic. There are several kinds of logic: propositional logic, predicate logic of first or higher order, epistemic logic, modal logic, temporal logic, fuzzy logic, etc. Except for propositional logic, all are complex and can only be understood by mathematicians, logicians or computer scientists. Propositional logic is the basic human logic, the one expressed in the syllogism; an expert system that uses this logic is also called a zeroth-order expert system. With logic, the engine is able to generate new information from the knowledge contained in the rule base and the data to be processed. The engine has two ways to run: batch or conversational. In batch mode, the expert system has all the necessary data to process from the beginning. For the user, the program works as a classical program: the user provides data and receives results immediately; the reasoning is invisible. The conversational mode becomes necessary when the developer knows he cannot ask the user for all the necessary data at the start, the problem being too complex. The software must "invent" the way to solve the problem, requesting missing data from the user gradually, approaching the goal as quickly as possible. The result gives the impression of a dialogue led by an expert. To guide such a dialogue, the engine may have several levels of sophistication: "forward chaining", "backward chaining" and "mixed chaining". Forward chaining is the questioning of an expert who has no idea of the solution and investigates progressively (e.g. fault diagnosis). In backward chaining, the engine has an idea of the target (e.g. is it okay or not? or: there is danger, but what is the level?); it starts from the goal in the hope of finding the solution as soon as possible. In mixed chaining, the engine has an idea of the goal but it is
not enough: it deduces, in forward chaining, from previous user responses, all that is possible before asking the next question. So, quite often, it deduces the answer to the next question before asking it. A strong interest in using logic is that this kind of software is able to give the user a clear explanation of what it is doing (the "Why?") and of what it has deduced (the "How?"). Better yet, thanks to logic, the most sophisticated expert systems are able to detect contradictions[17] in the user's information or in the knowledge base and can explain them clearly, revealing at the same time the expert's knowledge and way of thinking.
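To make the mechanics above concrete, here is a minimal sketch of a propositional (zeroth-order) rule base and a forward-chaining loop over it; the rule representation, the fact wording and the function are illustrative assumptions, not a real expert system shell.

```python
# Rules are condition sets paired with a single conclusion, echoing the
# IF ... THEN ... examples given earlier in the text.
rules = [
    {"if": {"it is living"}, "then": "it is mortal"},
    {"if": {"the germ is gram-positive",
            "the morphology of the organism is rod",
            "the germ is aerobic"},
     "then": "the germ is probably of type enterobacteriacae"},
]

def forward_chain(initial_facts, rules):
    """Fire every rule whose conditions hold, until no new fact can be added."""
    facts = set(initial_facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule["if"] <= facts and rule["then"] not in facts:
                facts.add(rule["then"])   # the rule is triggered and fires
                changed = True
    return facts

print(forward_chain({"it is living"}, rules))
```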

Advantages
The expert system offers many advantages for users compared to traditional programs because it operates like a human brain:
Conversational. For users, being conversational is the first quality of an expert system, because it allows them to interact with the computer as with a human. To enter data into the computer, traditional computing uses a specific human-machine interface, the data entry screen, a non-interactive medium. The data entry screen forces the user to adapt, because he must prepare a dataset without necessarily understanding why, and it usually requires preliminary training before using the software. A conversational expert system, in contrast, asks the user about the problem step by step, identifying the solution gradually. The user only has to think about one question at a time, a natural way of working to which humans are well adapted.
Quick availability and opportunity to program it oneself. As the rule base is in everyday language (the engine itself is untouched), an expert system can be written much faster than a conventional program, by users or experts themselves, bypassing professional developers to whom everything must be explained.
Ability to exploit a considerable amount of knowledge. The expert system turns software into something like a database: a rule base. So, unlike conventional programs, the volume of knowledge to program is not a major concern, just as it is not for a database. Whether the rule base has 10 rules or 10,000, the engine operates in the same way.
Reliability. The reliability of an expert system is the same as the reliability of a database, i.e. good, and higher than that of a classical program.
Scalability. Evolving an expert system means adding, modifying or deleting rules. Since the rules are written in plain language, those to be removed or modified are easily identified.
Pedagogy. Engines driven by a true logic are able to explain to the user, in plain language, why they ask a question and how they arrived at each deduction. In doing so, they show the expert knowledge contained in the expert system, so the user can learn this knowledge in its context. Moreover, they can communicate their deductions step by step, so the user has information about his problem even before the final answer of the expert system.

Preservation and improvement of knowledge. Valuable knowledge can disappear with the death, resignation or retirement of an expert; recorded in an expert system, it becomes permanent. Developing an expert system means interviewing an expert and making him aware of his own knowledge; in doing so, he reflects on it and enhances it.
New areas neglected by conventional computing. When automating a vast body of knowledge, the developer may meet a classic problem, "combinatorial explosion", which greatly complicates his work and results in a complex and time-consuming program. A reasoning expert system does not encounter that problem, since the engine automatically takes care of the combinatorics between rules. This ability makes it possible to address areas where combinatorics is enormous: highly interactive or conversational applications, fault diagnosis, decision support in complex systems, educational software, logic simulation of machines or systems, and constantly changing software.

Disadvantages
The expert system has a major flaw which explains its low rate of success even though the principle has existed for 70 years: knowledge collection and its interpretation into rules, i.e. knowledge engineering. Most developers have no method to perform this task; they work "manually", which opens the door to many errors. Expert
knowledge is not well understood; there is a lack of rules, rules are contradictory, and some are poorly written and unusable. Worse, developers most often use an engine incapable of true reasoning. The result: the expert system works badly and the project is abandoned.[18] This problem does not exist with the right development method. There exists software which interviews the expert step by step and automatically writes the rules, simultaneously running the expert system before his eyes and performing a consistency check of the rules[19][20][21], so that the expert and users can check the quality of the software before it is finished. Many expert systems are also penalized by the logic used. Most logics operate on "variable" facts, i.e. facts whose value changes several times during one reasoning, a property belonging to more powerful logics. This is exactly like classical computing, a way of programming with which developers are in fact comfortable; it is the case for Mycin, Dendral, fuzzy logic, predicate logic (Prolog), symbolic logic, mathematical logic, etc. Propositional logic uses only invariable facts[22]. It turns out that in the human mind, the facts used must remain invariable as long as the brain reasons on them; this makes possible the detection of contradictions and the production of explanations, two ways of controlling the consistency of the knowledge[23][24]. That is why expert systems using variable facts, which are more understandable for IT developers and therefore the most numerous, are less easy to develop, less clear to users and less reliable, and why they do not produce explanations or detect contradictions.

Application field


Expert systems address areas where combinatorics is enormous:
highly interactive or conversational applications, IVR, voice servers, chatterbots
fault diagnosis, medical diagnosis
decision support in complex systems, process control, interactive user guides
educational and tutorial software
logic simulation of machines or systems
knowledge management
constantly changing software.

They can also be used in software engineering for rapid prototyping of applications (RAD). Indeed, an expert system developed quickly in front of the expert shows him whether the future application is worth programming. Indeed, any program contains expert knowledge, and classic programming always begins with an expert interview. A program written in the form of an expert system receives all the specific benefits of expert systems; among other things, it can be developed by anyone without computer training and without a programming language. But this solution has a defect: an expert system runs more slowly than a traditional program because it constantly "reasons", whereas a classic program simply follows paths traced by the programmer.

Examples of applications


Expert systems are designed to facilitate tasks in the fields of accounting, medicine, process control, financial services, production, and human resources, among others. Typically, the problem area is complex enough that a simpler traditional algorithm cannot provide a proper solution. The foundation of a successful expert system depends on a series of technical procedures and development that may be designed by technicians and related experts. As such, expert systems do not typically provide a definitive answer, but provide probabilistic recommendations. An example of the application of expert systems in the financial field is expert systems for mortgages. Loan departments are interested in expert systems for mortgages because of the growing cost of labour, which makes the handling and acceptance of relatively small loans less profitable. They also see a possibility for standardised, efficient handling of mortgage loans by applying expert systems, appreciating that for the acceptance of mortgages there are hard and fast rules which do not always exist with other types of loans. Another common application of expert systems in the financial area is in trading recommendations in various marketplaces. These markets involve numerous variables and human emotions which may be impossible to characterize deterministically, so expert systems based on the rules of thumb from experts and on simulation data are used. Expert systems of this type can range from ones providing regional retail recommendations, like Wishabi, to ones used to assist monetary decisions by financial institutions and governments. Another 1970s and 1980s application of expert systems, which we would today simply call AI, was in computer games. For example, the computer baseball games Earl Weaver Baseball and Tony La Russa Baseball each had highly detailed simulations of the game strategies of those two baseball managers. When a human played the game against the computer, the computer queried the Earl Weaver or Tony La Russa Expert System for a decision on what strategy to follow. Even those
choices where some randomness was part of the natural system (such as when to throw a surprise pitch-out to try to trick a runner trying to steal a base) were decided based on probabilities supplied by Weaver or La Russa. Today we would simply say that "the game's AI provided the opposing manager's strategy."

Knowledge engineering


The building, maintenance and development of expert systems is known as knowledge engineering.[25] Knowledge engineering is a "discipline that involves integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise".[26] There are generally three individuals interacting with an expert system. Primary among these is the end-user, the individual who uses the system for its problem-solving assistance. In the construction and maintenance of the system there are two other roles: the problem domain expert, who builds the system and supplies the knowledge base, and the knowledge engineer, who assists the experts in determining the representation of their knowledge, enters this knowledge into an explanation module and defines the inference technique required to solve the problem. Usually the knowledge engineer will represent the problem-solving activity in the form of rules. When these rules are created from domain expertise, the knowledge base stores the rules of the expert system.

History
Expert systems were introduced by researchers in the Stanford Heuristic Programming Project, including the "father of expert systems" Edward Feigenbaum, with the Dendral and Mycin systems. Principal contributors to the technology were Bruce Buchanan, Edward Shortliffe, Randall Davis, William van Melle, Carli Scott, and others at Stanford. Expert systems were among the first truly successful forms of AI software.[6][7][8][9][10][11] Research was also very active in France, where researchers focused on the automation of reasoning and logic engines. The French Prolog computer language, designed in 1972, marks a real advance over expert systems like Dendral or Mycin: it is a shell[27], that is to say a software structure ready to receive any expert system and to run it. It integrates an engine using first-order logic, with rules and facts. It is also a declarative language, the first operational one[28], and a tool for the mass production of expert systems. It later became, arguably, the best-selling AI language in the world[29]. However, Prolog is not very user friendly and uses a first-order logic far removed from human logic.[30][31][32]

In the 1980s, expert systems proliferated as they were recognized as a practical tool for solving real-world problems. Universities offered expert system courses and two thirds of the Fortune 1000 companies applied the technology in daily business activities.[5][33] Interest was international, with the Fifth Generation Computer Systems project in Japan and increased research funding in Europe. Growth in the field continued into the 1990s. The development of expert systems was aided by the development of the symbolic processing languages Lisp and Prolog. To avoid re-inventing the wheel, expert system shells were created that had more specialised features for building large expert systems.[34] In 1981 the first IBM PC was introduced, with the MS-DOS operating system. Its low price started to multiply users and opened a new market for computing and for expert systems. In the 1980s the image of AI was very good and people believed it would succeed within a short time.[15] Many companies began to market expert system shells from universities, renamed "generators" because they added to the shell a tool for writing rules in plain language and thus, theoretically, allowed expert systems to be written without a programming language or any other software.[16] The best known were: Guru (USA), inspired by Mycin[17][18]; Personal Consultant Plus (USA)[19][20]; Nexpert Object (developed by Neuron Data, a company founded in California by three Frenchmen)[21][22]; Genesia (developed by the French public company Électricité de France and marketed by Steria)[23]; and VP Expert (USA)[24]. But in the end those tools were only used in research projects; they did not penetrate the business market, showing that AI technology was not yet mature. In 1986, a new expert system generator for PCs appeared on the market, derived from French academic research: Intelligence Service[35][36], sold by the GSI-TECSI software company. This software showed a radical innovation: it used propositional logic ("zeroth-order logic") to execute the expert system, reasoning on a knowledge base written with everyday-language rules, producing explanations and detecting logical contradictions between the facts. It was the first tool showing the AI defined by Edward Feigenbaum in his book about the Japanese Fifth Generation project, Artificial Intelligence and Japan's Computer Challenge to the World (1983): "The machines will have reasoning power: they will automatically engineer vast amounts of knowledge to serve whatever purpose humans propose, from medical diagnosis to product design, from management decisions to education", "The reasoning animal has, perhaps inevitably, fashioned the reasoning machine", "the reasoning power of these machines matches or exceeds the reasoning power of the humans who instructed them and, in some cases, the reasoning power of any human performing such tasks". This generator was "Pandora" (1985)[37], software developed for their thesis by two academic students of Jean-Louis Laurière[38], one of the most famous and prolific French AI researchers[39]. Unfortunately, as this software was not developed by its own IT developers, GSI-TECSI was unable to make it evolve. Sales became scarce and its marketing was stopped after a few years.

Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as data from sensors or databases. Machine learning is concerned with the development of algorithms allowing the machine to learn via inductive inference based on observed data that represent incomplete information about a statistical phenomenon. Classification, which is also referred to as pattern recognition, is an important task in machine learning, by which machines learn to automatically recognize complex patterns, to distinguish between exemplars based on their different patterns, and to make intelligent decisions.

Definition
Tom M. Mitchell provided a widely quoted definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.[1]

Generalization
The core objective of a learner is to generalize from its experience.[2] The training examples from its experience come from some generally unknown probability distribution and the learner has to extract from them something more general, something about that distribution, that allows it to produce useful answers in new cases.

Human interaction


Some machine learning systems attempt to eliminate the need for human intuition in data analysis, while others adopt a collaborative approach between human and machine. Human intuition cannot, however, be entirely eliminated, since the system's designer must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data.

Algorithm types


Machine learning algorithms can be organized into a taxonomy based on the desired outcome of the algorithm.
Supervised learning generates a function that maps inputs to desired outputs (also called labels, because they are often provided by human experts labeling the training examples). For example, in a classification problem, the learner approximates a function mapping a vector into classes by looking at input-output examples of the function.
Unsupervised learning models a set of inputs, like clustering.
Semi-supervised learning combines both labeled and unlabeled examples to generate an appropriate function or classifier.
Reinforcement learning learns how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback in the form of rewards that guides the learning algorithm.
Transduction tries to predict new outputs based on training inputs, training outputs, and test inputs.
Learning to learn learns its own inductive bias based on previous experience.
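As a minimal sketch contrasting the first two categories above, the following uses scikit-learn (an illustrative assumption; the text names no particular library) to fit a supervised classifier on labeled points and an unsupervised clustering on the same points without labels.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Four toy 2-D points; the labels below would come from a human expert.
X = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
y = [0, 0, 1, 1]

# Supervised learning: labeled examples in, a predictive function out.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[0.1, 0.1], [0.9, 0.9]]))   # expected: [0 1]

# Unsupervised learning: the same inputs without labels, grouped into clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```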

Theory
The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. In addition to performance bounds, computational learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.

There are many similarities between machine learning theory and statistics, although they use different terms.

Approaches
Decision tree learning
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item's target value.
Association rule learning
Association rule learning is a method for discovering interesting relations between variables in large databases.
Artificial neural networks
An artificial neural network (ANN) learning algorithm, usually called "neural network" (NN), is a learning algorithm that is inspired by the structure and/or functional aspects of biological neural networks. Computations are structured in terms of an interconnected group of artificial neurons, processing information using a connectionist approach to computation. Modern neural networks are non-linear statistical data modeling tools. They are usually used to model complex relationships between inputs and outputs, to find patterns in data, or to capture the statistical structure in an unknown joint probability distribution between observed variables.
Genetic programming
Genetic programming (GP) is an evolutionary algorithm-based methodology inspired by biological evolution to find computer programs that perform a user-defined task. It is a specialization of genetic algorithms (GA) where each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task.
Inductive logic programming
Inductive logic programming (ILP) is an approach to rule learning using logic programming as a uniform representation for examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program which entails all the positive and none of the negative examples.
Support vector machines
Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
Clustering
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis.
Bayesian networks
A Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independencies via a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning.
Reinforcement learning
Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states. Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected.
Representation learning
Several learning algorithms, mostly unsupervised learning algorithms, aim at discovering better representations of the inputs provided during training. Classical examples include principal components analysis and clustering. Representation learning algorithms often attempt to preserve the information in their input but transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions, allowing the inputs coming from the unknown data-generating distribution to be reconstructed, while not being necessarily faithful for configurations that are implausible under that distribution. Manifold learning algorithms attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse (has many zeros). Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.[3]
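A minimal sketch of representation learning via principal components analysis, one of the classical examples mentioned above; the use of scikit-learn/NumPy and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 noisy 3-D points that in fact vary mostly along a single direction.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

pca = PCA(n_components=1)
Z = pca.fit_transform(X)              # the learned 1-D representation
X_reconstructed = pca.inverse_transform(Z)
print("mean reconstruction error:", np.mean((X - X_reconstructed) ** 2))
```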

Applications
Applications for machine learning include: machine perception, computer vision, natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, brain-machine interfaces, cheminformatics, detecting credit card fraud, stock market analysis, classifying DNA sequences, speech and handwriting recognition, object recognition in computer vision, game playing, software engineering, adaptive websites, robot locomotion, structural health monitoring, and sentiment analysis (or opinion mining).

In 2006, the on-line movie company Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and beat its existing Netflix movie recommendation system by at least 10%. The AT&T Research Team BellKor beat out several other teams with their machine learning program "Pragmatic Chaos". After winning several minor prizes, it won the grand prize competition in 2009 for $1 million.[4]

Software
RapidMiner, KNIME, Weka, ODM, Shogun toolbox, Orange, Apache Mahout and MCMLL are software suites containing a variety of machine learning algorithms.

Journals and conferences


Machine Learning (journal)
Journal of Machine Learning Research
Neural Computation (journal)
International Conference on Machine Learning (ICML) (conference)
Neural Information Processing Systems (NIPS) (conference)
List of upcoming conferences in Machine Learning and Artificial Intelligence (conference)

Business intelligence
Business intelligence (BI) mainly refers to computer-based techniques used in identifying, extracting, and analyzing business data, such as sales revenue by products and/or departments, or by associated costs and incomes.[1] BI technologies provide historical, current and predictive views of business operations. Common functions of business intelligence technologies are reporting, online analytical processing, analytics, data mining, process mining, complex event processing, business performance management, benchmarking, text mining and predictive analytics. Business intelligence aims to support better business decision-making; thus a BI system can be called a decision support system (DSS).[2] Though the term business intelligence is sometimes used as a synonym for competitive intelligence, because they both support decision making, BI uses technologies, processes, and applications to analyze mostly internal, structured data and business processes, while competitive intelligence gathers, analyzes and disseminates information with a topical focus on company competitors. Business intelligence understood broadly can include the subset of competitive intelligence.[3]

History
In a 1958 article, IBM researcher Hans Peter Luhn used the term business intelligence. He defined intelligence as: "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."[4] Business intelligence as it is understood today is said to have evolved from the decision support systems which began in the 1960s and developed throughout the mid-80s. DSS originated in the computer-aided models created to assist with decision making and planning. From DSS, data warehouses, Executive Information Systems, OLAP and business intelligence came into focus beginning in the late 80s. In 1989 Howard Dresner (later a Gartner Group analyst) proposed "business intelligence" as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems."[2] It was not until the late 1990s that this usage was widespread.[5]

Business intelligence and data warehousing
Often BI applications use data gathered from a data warehouse or a data mart. However, not all data warehouses are used for business intelligence, nor do all business intelligence applications require a data warehouse. In order to distinguish between concepts of business intelligence and data warehouses, Forrester Research often defines business intelligence in one of two ways: Using a broad definition: "Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making."[6] When using this definition, business intelligence also includes technologies such as data integration, data quality, data warehousing, master data management, text and content analytics, and many others that the market sometimes lumps into the Information Management segment. Therefore, Forrester refers to data preparation and data usage as two separate, but closely linked segments of the business intelligence architectural stack. Forrester defines the latter, narrower business intelligence market as "referring to just the top layers of the BI architectural stack such as reporting, analytics and dashboards."[7]

Business intelligence and business analytics

Thomas Davenport has argued that business intelligence should be divided into querying, reporting, OLAP, an "alerts" tool, and business analytics. In this definition, business analytics is the subset of BI based on statistics, prediction, and optimization.[8]

Applications in an enterprise
Business intelligence can be applied to the following business purposes (MARCKM), in order to drive business value:
1. Measurement: a program that creates a hierarchy of performance metrics (see also Metrics Reference Model) and benchmarking that informs business leaders about progress towards business goals (also known as business process management).
2. Analytics: a program that builds quantitative processes for a business to arrive at optimal decisions and to perform business knowledge discovery. Frequently involves: data mining, process mining, statistical analysis, predictive analytics, predictive modeling, business process modeling, complex event processing.
3. Reporting/Enterprise Reporting: a program that builds infrastructure for strategic reporting to serve the strategic management of a business, not operational reporting. Frequently involves: data visualization, executive information systems, OLAP.
4. Collaboration/Collaboration platform: a program that gets different areas (both inside and outside the business) to work together through data sharing and electronic data interchange.
5. Knowledge Management: a program to make the company data-driven through strategies and practices to identify, create, represent, distribute, and enable adoption of insights and experiences that are true business knowledge. Knowledge management leads to learning management and regulatory compliance.

Requirements gathering


According to Ralph Kimball[9], the requirements of business users impact nearly every decision made throughout the design and implementation of a DW/BI system. The business requirements relate to all aspects of the daily business processes and hence are critical to successful data warehousing. Business requirements analysis occurs at two distinct levels:
Macro level: understand business needs and priorities relative to the overall business strategy.
Micro level: understand user needs and desires in the context of a single, relatively narrowly defined project.

Approach
There are two basic interactive techniques for gathering requirements:
Conducting interviews: speaking with users about their jobs, their objectives, and their challenges. This is either done with individuals or small groups.
Facilitated sessions and seminars that encourage creative mind-mapping.

Preparation
Identify the interview team:
Lead interviewer: directs the questioning.
Scribe: takes notes during the interview. A tape recorder may be used to supplement the scribe.
Observers (optional): watch but do not contribute. This may be for the purpose of training the observers in the interview approach, or so that the observers can comment on the interview after the event.

Research the organization
Reports, a review of business operations, and the relevant part of the annual report can give insight into the organizational structure. If applicable, obtain a copy of the resulting documentation from the latest internal business/IT strategy and planning meeting.
Select the interviewees
Select a cross section of representatives. Study the organization to get a good idea of all the stakeholders in the project. These include:

Business interviewees (to understand the key business processes)
IT and Compliance/Security interviewees (to assess the preliminary feasibility of the underlying operational source systems to support the requirements emerging from the business side of the house)

Develop the interview questionnaires
Multiple questionnaires should be developed, because the questioning will vary by job function and level. The questionnaires for the data audit sessions will differ from the business requirements questionnaires. Be structured: this will help the interview flow and help organize your thoughts before the interview.

Schedule and sequence the interviews
Scheduling and rescheduling takes time; prepare these well in advance. Sequence the interviews by beginning with the business driver, followed by the business sponsor, in order to understand the playing field from their perspective. The optimal sequence would be:
Business driver
Business sponsor
An interviewee from the middle of the organizational hierarchy
Bottom of the organizational hierarchy

The bottom is a disastrous place to begin because you have no idea where you are headed. The top is great for overall vision, but you need the business background, confidence, and credibility to converse at those levels. If you are not adequately prepared with in-depth business familiarity, the safest route is to begin in the middle of the organization.
Prepare the interviewees
Make sure the interviewees are appropriately briefed and prepared to participate. As a minimum, a letter should be emailed to all interview participants to inform them about the process and the importance of their participation and contribution. The letter should explain that the goal is to understand their job responsibilities and business objectives, which then translate into the information and analyses required to get their job done. In addition, they should be asked to bring copies of frequently used reports or spreadsheet analyses.

The letter should be signed by a high-ranking sponsor, someone well respected by the interviewees. It is advisable not to attach a list of the fifty questions you might ask in the hope that the interviewees will come prepared with answers. The odds are that they won't take the time to prepare responses and may even be intimidated by the volume of your questions.
Issues with requirements gathering and interviews
The process of conducting an interview may seem exhaustive at first, but the ground rule is to be well prepared in all steps. It may be a good idea to investigate questioning techniques before conducting the interview. Ask open-ended questions such as why, how, what-if, and what-then questions. Ask unbiased questions: wrongly asked questions can lead to wrong answers and, in the worst case, to gathering the wrong requirements. The whole process is costly in time and resources, and the wrong data can slow down the development of the whole BI installation. Be sure that everyone in the interview team is aware of their role, to ensure that everything goes as planned. The next step is to synthesize the findings around the business processes.

Prioritization of business intelligence projects
It is often difficult to provide a positive business case for business intelligence (BI) initiatives, and often such projects will need to be prioritized through strategic initiatives. Here are some ways to increase the benefits of a BI project. As described by Kimball,[10] you must determine the tangible benefits, such as the eliminated cost of producing legacy reports. Enforce access to data for the entire organization; in this way even a small benefit, such as a few minutes saved, will make a difference when it is multiplied by the number of employees in the entire organization. As described by Ross, Weill & Robertson for enterprise architecture,[11] consider letting the BI project be driven by other business initiatives with excellent business cases. To support this approach, the organization must have enterprise architects who will be able to detect suitable business projects.

Success factors of implementation

Before implementing a BI solution, it is worth taking different factors into consideration before proceeding. According to Kimball et al., these are the three critical areas that you need to assess within your organization before getting ready to do a BI project:[12]
1. The level of commitment and sponsorship of the project from senior management
2. The level of business need for creating a BI implementation
3. The amount and quality of business data available.
Business sponsorship
The commitment and sponsorship of senior management is, according to Kimball et al., the most important criterion for assessment.[13] This is because having strong management backing will help overcome shortcomings elsewhere in the project. But as Kimball et al. state: "even the most elegantly designed DW/BI system cannot overcome a lack of business [management] sponsorship".[14] It is very important that the management personnel who participate in the project have a vision and an idea of the benefits and drawbacks of implementing a BI system. The best business sponsor should have organizational clout and should be well connected within the organization. It is ideal that the business sponsor is demanding but also able to be realistic and supportive if the implementation runs into delays or drawbacks. The management sponsor also needs to be able to assume accountability and to take responsibility for failures and setbacks on the project. It is imperative that there is support from multiple members of the management, so the project will not fail if one person leaves the steering group. However, having many managers working together on the project can also mean that there are several different interests that attempt to pull the project in different directions, for instance if different departments want to put more emphasis on their own usage of the implementation. This issue can be countered by an early and specific analysis of the business areas that will benefit most from the implementation. All stakeholders in the project should participate in this analysis in order for them to feel ownership of the project and to find common ground between them. Another management problem that should be addressed before the start of implementation is an overly aggressive business sponsor: a management individual who gets carried away by the possibilities of BI and starts wanting the DW or BI implementation to include several different sets of data that were not included in the original planning phase. Since such extra implementations of extra data will most likely add many months to the original plan, it is probably a good idea to make sure that the person from management is aware of the consequences of these requests. Implementation should be driven by clear business needs.

Because of the close relationship with senior management, another critical thing that needs to be assessed before the project is implemented is whether or not there actually is a business need and whether there is a clear business benefit to doing the implementation.[15] The needs and benefits of the implementation are sometimes driven by competition and the need to gain an advantage in the market. Another reason for a business-driven approach to implementation of BI is the acquisition of other organizations that enlarge the original organization: it can sometimes be beneficial to implement DW or BI in order to create more oversight.
The amount and quality of the available data
This ought to be the most important factor, since without good data it does not really matter how good your management sponsorship or your business-driven motivation is. If you do not have the data, or the data does not have sufficient quality, any BI implementation will fail. Before implementation it is a very good idea to do data profiling; this analysis will be able to describe the "content, consistency and structure [..]"[15] of the data. This should be done as early as possible in the process, and if the analysis shows that your data is lacking, it is a good idea to put the project on the shelf temporarily while the IT department figures out how to do proper data collection. Other scholars have added more factors to this list. In his thesis Critical Success Factors of BI Implementation,[16] Naveen Vodapalli researches different factors that can impact the final BI product. He lists 7 crucial success factors for the implementation of a BI project:
1. Business-driven methodology and project management
2. Clear vision and planning
3. Committed management support & sponsorship
4. Data management and quality
5. Mapping solutions to user requirements
6. Performance considerations of the BI system
7. Robust and expandable framework

User aspect


Some considerations must be made in order to successfully integrate the usage of business intelligence systems in a company. Ultimately the BI system must be accepted and utilized by the users in order for it to add value to the organization.[17][18]
If the usability of the system is poor, the users may become frustrated and spend a considerable amount of time figuring out how to use the system, or may not be able to really use the system. If the system does not add value to the users' mission, they will simply not use it.[18] In order to increase user acceptance of a BI system, it may be advisable to consult the business users at an early stage of the DW/BI lifecycle, for example at the requirements gathering phase.[17] This can provide insight into the business process and what the users need from the BI system. There are several methods for gathering this information, such as questionnaires and interview sessions. When gathering the requirements from the business users, the local IT department should also be consulted in order to determine to what degree it is possible to fulfill the business's needs based on the available data.[17] Taking a user-centered approach throughout the design and development stage may further increase the chance of rapid user adoption of the BI system.[18] Besides focusing on the user experience offered by the BI applications, it may also be possible to motivate the users to utilize the system by adding an element of competition. Kimball[17] suggests implementing a function on the business intelligence portal website where reports on system usage can be found. By doing so, managers can see how well their departments are doing and compare themselves to others, and this may spur them to encourage their staff to utilize the BI system even more. In a 2007 article, H. J. Watson gives an example of how the competitive element can act as an incentive.[19] Watson describes how a large call centre implemented performance dashboards for all the call agents, with monthly incentive bonuses tied to the performance metrics. Furthermore, the agents could see how their own performance compared to that of the other team members. The implementation of this type of performance measurement and competition significantly improved the performance of the agents. Other elements which may increase the success of BI include involving senior management, in order to make BI a part of the organizational culture, and providing the users with the necessary tools, training and support.[19] By offering user training, more people may actually use the BI application.[17] Providing user support is necessary in order to maintain the BI system and assist users who run into problems.[18] User support can be incorporated in many ways, for example by creating a website. The website should contain great content and tools for finding the necessary information. Furthermore, helpdesk support can be used. The helpdesk can be manned by, for example, power users or the DW/BI project team.[17]

Marketplace__________________________________________________________ _______________________
There are a number of business intelligence vendors, often categorized into the remaining independent "pure-play" vendors and the consolidated "megavendors" which have entered the market through a recent trend of acquisitions in the BI industry.[20] Some companies adopting BI software decide to pick and choose from different product offerings (best-of-breed) rather than purchase one comprehensive integrated solution (full-service).[21]

Industry-specific
Specific considerations for business intelligence systems have to be taken in some sectors, such as governmental banking regulation. The information collected by banking institutions and analyzed with BI software must be protected from some groups or individuals, while being fully available to other groups or individuals. Therefore, BI solutions must be sensitive to those needs and be flexible enough to adapt to new regulations and changes to existing laws.

Semi-structured or unstructured data_________________________________________________

Businesses create a huge amount of valuable information in the form of e-mails, memos, notes from call-centers, news, user groups, chats, reports, web-pages, presentations, image-files, video-files, and marketing material. According to Merrill Lynch, more than 85 percent of all business information exists in these forms. These information types are called either semi-structured or unstructured data. However, organizations often use these documents only once.[22]

The management of semi-structured data is recognized as a major unsolved problem in the information technology industry.[23] According to projections from Gartner (2003), white-collar workers will spend anywhere from 30 to 40 percent of their time searching, finding and assessing unstructured data. BI uses both structured and unstructured data; the former is easy to search, while the latter contains a large quantity of the information needed for analysis and decision making.[23][24] Because of the difficulty of properly searching, finding and assessing unstructured or semi-structured data, organizations may not draw upon these vast reservoirs of information, which could influence a particular decision, task or project. This can ultimately lead to poorly informed decision making.[22] Therefore, when designing a business intelligence/DW solution, the specific problems associated with semi-structured and unstructured data must be accommodated, as well as those associated with structured data.[24]

Unstructured data vs. semi-structured data
Unstructured and semi-structured data have different meanings depending on their context. In the context of relational database systems, unstructured data refers to data that cannot be stored in columns and rows. It must be stored in a BLOB (binary large object), a catch-all data type available in most relational database management systems. But many of these data types, such as e-mails, word-processing text files, PPTs, image-files, and video-files, conform to a standard that offers the possibility of metadata. Metadata can include information such as author and time of creation, and this can be stored in a relational database. Therefore it may be more accurate to talk about this as semi-structured documents or data,[23] but no specific consensus seems to have been reached.

Problems with semi-structured or unstructured data
There are several challenges to developing BI with semi-structured data. According to Inmon & Nesavich,[25] some of those are:
1. Physically accessing unstructured textual data - unstructured data is stored in a huge variety of formats.
2. Terminology - among researchers and analysts, there is a need to develop a standardized terminology.
3. Volume of data - as stated earlier, up to 85% of all data exists as semi-structured data. Couple that with the need for word-to-word and semantic analysis.
4. Searchability of unstructured textual data - a simple search on some data, e.g. apple, results in links where there is a reference to that precise search term. Inmon & Nesavich (2008)[25] give an example: a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude: it does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies. (A small sketch of this search problem follows below.)

The use of metadata
To solve problems with searchability and assessment of data, it is necessary to know something about the content. This can be done by adding context through the use of metadata.[22] Many systems already capture some metadata (e.g. filename, author, size, etc.), but more useful would be metadata about the actual content, e.g. summaries, topics, or the people or companies mentioned. Two technologies designed for generating metadata about content are automatic categorization and information extraction.
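The felony example above can be made concrete with a short sketch: a naive keyword search only matches the literal term, while a search expanded through a small taxonomy (here a hand-built, purely illustrative dictionary) also finds documents that mention specific felonies.

```python
# Naive keyword search vs. taxonomy-expanded search over unstructured text.
# The documents and the tiny taxonomy are illustrative assumptions.
documents = [
    "The suspect was charged with arson after the warehouse fire.",
    "An embezzlement scheme was uncovered in the accounting department.",
    "The felony charge was later reduced to a misdemeanor.",
]

# Hypothetical taxonomy: a broad term and the narrower terms it subsumes.
taxonomy = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}

def naive_search(term, docs):
    """Return documents containing the literal term only."""
    return [d for d in docs if term.lower() in d.lower()]

def expanded_search(term, docs):
    """Return documents containing the term or any narrower term."""
    terms = [term.lower()] + taxonomy.get(term.lower(), [])
    return [d for d in docs if any(t in d.lower() for t in terms)]

print(len(naive_search("felony", documents)))     # 1 hit: only the literal term
print(len(expanded_search("felony", documents)))  # 3 hits: specific felonies too
```

In practice, automatic categorization and information extraction tools aim to build this kind of context from the content itself rather than from a hand-written dictionary.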

Future________________________________________________________________ ________________________
A 2009 Gartner paper predicted[26] these developments in the business intelligence market:
Because of lack of information, processes, and tools, through 2012 more than 35 percent of the top 5,000 global companies will regularly fail to make insightful decisions about significant changes in their business and markets.
By 2012, business units will control at least 40 percent of the total budget for business intelligence.
By 2012, one-third of analytic applications applied to business processes will be delivered through coarse-grained application mashups.

A 2009 Information Management special report predicted the top BI trends: "green computing, social networking, data visualization, mobile BI, predictive analytics, composite applications, cloud computing and multitouch."[27] Other lines of research include the combined study of business intelligence and uncertain data.[28] In this context, the data used is not assumed to be precise, accurate and complete; instead, data is considered uncertain, and this uncertainty is propagated to the results produced by BI (a minimal sketch of this idea follows below). According to a study by the Aberdeen Group, there has been increasing interest in Software-as-a-Service (SaaS) business intelligence over the past years, with twice as many organizations using this deployment approach as one year earlier (15% in 2009 compared to 7% in 2008).[citation needed] An article by InfoWorld's Chris Kanaracus points out similar growth data from research firm IDC, which predicts the SaaS BI market will grow 22 percent each year through 2013 thanks to increased product sophistication, strained IT budgets, and other factors.[29]
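As a minimal sketch of what propagating uncertainty into BI results can mean, the example below treats each reported figure as noisy and carries that noise through a simple aggregate with a Monte Carlo simulation; the revenue numbers and the 10% error model are purely illustrative assumptions.

```python
# Propagating input uncertainty into a BI-style aggregate via Monte Carlo.
# The revenue figures and the ~10% error model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

reported = np.array([120_000.0, 95_000.0, 143_000.0])  # uncertain point estimates
relative_error = 0.10                                   # assumed ~10% noise per figure

# Draw 10,000 plausible scenarios and aggregate each one.
samples = reported * rng.normal(1.0, relative_error, size=(10_000, reported.size))
totals = samples.sum(axis=1)

print(f"total revenue: {totals.mean():,.0f} "
      f"(90% interval {np.percentile(totals, 5):,.0f} to {np.percentile(totals, 95):,.0f})")
```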

Decision tree
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

General_______________________________________________________________ _____________________
In decision analysis, a "decision tree" and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated. A decision tree consists of three types of nodes:
1. Decision nodes - commonly represented by squares
2. Chance nodes - represented by circles
3. End nodes - represented by triangles
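Under these conventions the expected-value calculation can be sketched in a few lines: decision nodes take the best child, chance nodes take the probability-weighted average, and end nodes return their payoff. The tree, payoffs and probabilities below are made up for illustration.

```python
# Expected-value rollback over a small decision tree.
# Decision nodes pick the best child, chance nodes average their children by
# probability, end nodes return a payoff. All figures are illustrative.

def rollback(node):
    kind = node["type"]
    if kind == "end":
        return node["payoff"]
    if kind == "chance":
        return sum(p * rollback(child) for p, child in node["children"])
    if kind == "decision":
        return max(rollback(child) for _, child in node["children"])
    raise ValueError(f"unknown node type: {kind}")

tree = {
    "type": "decision",
    "children": [
        ("A", {"type": "chance", "children": [
            (0.4, {"type": "end", "payoff": 500_000}),
            (0.6, {"type": "end", "payoff": -100_000}),
        ]}),
        ("B", {"type": "chance", "children": [
            (0.8, {"type": "end", "payoff": 150_000}),
            (0.2, {"type": "end", "payoff": 50_000}),
        ]}),
    ],
}

print(rollback(tree))  # 140000.0: option A has the higher expected value here
```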

Drawn from left to right, a decision tree has only burst nodes (splitting paths) but no sink nodes (converging paths). Therefore, used manually, they can grow very big and are then often hard to draw fully by hand. Traditionally, decision trees have been created manually, although increasingly specialized software is employed. Analysis can take into account the decision maker's (e.g., the company's) preference or utility function, for example:

The basic interpretation in this situation is that the company prefers B's risk and payoffs under realistic risk preference coefficients (greater than $400K; in that range of risk aversion, the company would need to model a third strategy, "Neither A nor B").
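To illustrate how a risk-preference (utility) function can change which strategy is preferred, the sketch below compares the certainty equivalents of two hypothetical strategies under an exponential utility function; the payoffs, probabilities and risk-tolerance values are assumptions and are unrelated to the $400K figure above.

```python
# Certainty equivalents under an exponential utility u(x) = 1 - exp(-x / R),
# where R is the risk tolerance (larger R behaves closer to plain expected
# value). Payoffs, probabilities and R values are illustrative assumptions.
import math

def certainty_equivalent(outcomes, R):
    """outcomes: list of (probability, payoff); returns the sure amount that is
    equally attractive as the gamble for a decision maker with tolerance R."""
    expected_utility = sum(p * (1 - math.exp(-x / R)) for p, x in outcomes)
    return -R * math.log(1 - expected_utility)

A = [(0.5, 400_000), (0.5, -50_000)]  # risky: expected value 175,000
B = [(1.0, 120_000)]                  # safe: a sure 120,000

for R in (100_000, 1_000_000):
    print(f"R={R:>9,}: CE(A)={certainty_equivalent(A, R):>10,.0f}  "
          f"CE(B)={certainty_equivalent(B, R):>10,.0f}")
# With the smaller risk tolerance the sure strategy B is preferred;
# with the larger one the riskier strategy A overtakes it.
```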

Influence diagram______________________________________________________________ ____________


A decision tree can be represented more compactly as an influence diagram, focusing attention on the issues and relationships between events.

The squares represent decisions, the ovals represent action, and the diamond represents results.

Uses in teaching______________________________________________________________ ___________

Decision trees, influence diagrams, utility functions, and other decision analysis tools and methods are taught to undergraduate students in schools of business, health economics, and public health, and are examples of operations research or management science methods.
