
Zero-sum: one's gain is the other's loss. Operators: legal moves. Utility function (final state); evaluation function [e(s) > 0: good for Max, used at non-terminal states]. Horizon effect [hidden pitfalls beyond the search depth]. Alpha-beta: Max maximizes alpha [start = −∞]; Max > beta ∨ Min < alpha ↦ prune. Expectiminimax: weighted average at chance nodes.
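A minimal sketch of alpha-beta pruning over a game tree given as nested lists (inner lists = Max/Min nodes, numbers = utilities); the tree encoding and the name alphabeta are assumptions, not from the notes.

import math

def alphabeta(node, alpha=-math.inf, beta=math.inf, maximizing=True):
    """Minimax with alpha-beta pruning; leaves are utilities, inner nodes are lists of children."""
    if not isinstance(node, list):               # terminal state: return its utility
        return node
    if maximizing:                               # MAX node raises alpha
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if value >= beta:                    # MAX already at least beta: prune remaining children
                break
        return value
    else:                                        # MIN node lowers beta
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if value <= alpha:                   # MIN already at most alpha: prune
                break
        return value

# e.g. alphabeta([[3, 12, 8], [2, 4, 6], [14, 5, 2]]) == 3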
1R Each attribute value makes a rule using its majority class. Calculate the error rate of each attribute's rules; choose the attribute with the smallest error rate.
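A rough 1R sketch, assuming examples are dicts and ties are broken arbitrarily; one_r and its (attribute, rules, errors) return shape are made up for illustration.

from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """1R: for each attribute, build one rule per value (its majority class);
    keep the attribute whose rule set makes the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:                                   # count classes per attribute value
            by_value[row[attr]][row[target]] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best                                            # (attribute, value -> class rules, error count)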
KNN sensitive to k; heuristic: k ≤ √t [weighted version exists]. Minkowski d(a, b) = (Σ_i |a_i − b_i|^q)^(1/q) (Manhattan: q = 1). Lazy. Normalization. Curse of dimensionality [sol. feature selection]. Produces an arbitrarily shaped decision boundary defined by a subset of the Voronoi edges. Sensitive to noise. Discretization [aggregation: class changes define the intervals].
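A small sketch of lazy k-NN with the Minkowski distance above; the (feature_vector, label) training layout is an assumption.

from collections import Counter

def minkowski(a, b, q=2):
    # d(a, b) = (sum_i |a_i - b_i|^q)^(1/q); q = 1 is Manhattan, q = 2 Euclidean
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

def knn_predict(train, query, k=3, q=2):
    """train: list of (feature_vector, label) pairs; lazy: all work happens at query time.
    Heuristic from the notes: k <= sqrt(len(train))."""
    neighbours = sorted(train, key=lambda ex: minkowski(ex[0], query, q))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]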
Naive Bayes P(yes|E) = P(E1|yes)·P(E2|yes)···P(yes) / P(E). Probability density function: f(x) = 1/(sd(x)·√(2π)) · e^(−(x − av(x))² / (2·sd(x)²)), with sd(x)² = Σ_{i=1}^n (x_i − av(x))² / (n − 1).
Bad: independence assumption [sol. feature selection]. Many numeric features are not normally distributed [sol. use a different distribution]. Good: robust to isolated noise, simple, fast. Laplace/M-estimate handle zero numerators ((0 + 1) / (x1 + len(attr))). M is proportional to the importance of the prior probability ((0 + m·p_i) / (x1 + m)); Laplace: m = len(attr).
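A sketch of the Gaussian density and the Laplace/m-estimate corrections above; the function and argument names are assumptions.

import math

def gaussian_pdf(x, av, sd):
    # f(x) = 1 / (sd * sqrt(2*pi)) * e^(-(x - av)^2 / (2 * sd^2))
    return math.exp(-(x - av) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

def laplace(count, class_total, n_values):
    # zero-numerator fix: (count + 1) / (class_total + len(attr))
    return (count + 1) / (class_total + n_values)

def m_estimate(count, class_total, m, p):
    # (count + m * p_i) / (class_total + m); Laplace is the case m = len(attr), p = 1/len(attr)
    return (count + m * p) / (class_total + m)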
Evaluating and comparing classifiers Stratification [All classes equally represented in test-set]. Holdout Procedure:
split data into 2 independent sets. Repeated holdout method (cross-validation with possible overlaps). Validation set
[for parameter tuning]. Comparing classifiers [compare mean s-fold CV and run statistical test (ex. t-test) to check significance]. Precision [P = tp/(tp+fp)], Recall [R = tp/(tp+fn)], combo [F1 = 2PR/(P+R)], accuracy [(tp+tn)/(tp+tn+fp+fn)]. Realizable problem [H contains true function].
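The four measures above as a quick sketch over binary confusion-matrix counts.

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)                              # P = tp / (tp + fp)
    recall = tp / (tp + fn)                                 # R = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)      # F1 = 2PR / (P + R)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy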
Entropy H(S) = I(S) = −Σ_{i=1}^n P(s_i) log2(P(s_i)) ∈ [0, 1] [how surprising! the smallest possible number of bits per symbol. Low: more predictable]. Information gain [reduction in entropy, how good the current state is]: Gain(S|A) = H(S) − Σ_{v∈A} (|S_v|/|S|) H(S_v) [S_v: subset containing v]. [SplitInformation(S|A) = −Σ_{i=1}^n (|S_i|/|S|) log2(|S_i|/|S|)]. [GainRatio(S|A) = Gain(S|A) / SplitInformation(S|A): punish high branching!]
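A sketch of the three formulas above; the parallel-list layout (values[i] is example i's value for attribute A, labels[i] its class) is an assumption.

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i P(s_i) * log2(P(s_i))
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_and_ratio(values, labels):
    """Information gain and gain ratio of attribute A given its column of values."""
    total = len(labels)
    gain = entropy(labels)                                   # Gain(S|A) = H(S) - sum_v |S_v|/|S| * H(S_v)
    split_info = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        weight = len(subset) / total
        gain -= weight * entropy(subset)
        split_info -= weight * math.log2(weight)             # SplitInformation(S|A)
    ratio = gain / split_info if split_info else 0.0         # GainRatio punishes high branching
    return gain, ratio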
Decision trees Optimization problem [hill climbing with information gain as evaluation function]. Prone to over-fitting
[due to small training set, noise in data]. Simple, fast, easy to interpret. Pre-pruning [stop early], post-pruning [prune
at the end: we can use validation set], by sub-tree replacement: [start: leaf, for node: replace it with a leaf with majority
class, if the accuracy (of this new tree) is greater or equal: keep the new tree]. Sub-tree raising (combine rules and then
reduce them).
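A rough sketch of post-pruning by sub-tree replacement, assuming a dict-based tree ({"attr", "children", "majority"}; leaves are plain class labels). Comparing accuracy only on the validation examples routed to a node is equivalent to comparing whole-tree accuracy, since the other examples are unaffected by the replacement.

def predict(tree, example):
    # descend until a leaf; unseen values fall back to the node's majority class
    while isinstance(tree, dict):
        tree = tree["children"].get(example[tree["attr"]], tree["majority"])
    return tree

def accuracy(tree, examples):
    if not examples:
        return 0.0
    return sum(predict(tree, x) == y for x, y in examples) / len(examples)

def prune(node, validation):
    """Sub-tree replacement, bottom-up: replace a node by a leaf with its majority class
    if accuracy on the validation examples reaching it is greater or equal."""
    if not isinstance(node, dict):
        return node
    for value, child in node["children"].items():
        reaching = [(x, y) for x, y in validation if x[node["attr"]] == value]
        node["children"][value] = prune(child, reaching)
    leaf = node["majority"]
    if accuracy(leaf, validation) >= accuracy(node, validation):
        return leaf
    return node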

Multilayer NN sigmoid [λn. 1/(1+e^(−n)); f′(x) = f(x)(1 − f(x))]. Sum of squared error [E = ½ Σ_i e_i², where e_i = d_i − a_i: desired output − actual]. Can be viewed as an optimization problem (hill climbing with steepest gradient descent). Gradient descent [batch: after all examples; incremental/stochastic (faster and better)]. Cybenko [any function can be approximated with two hidden layers (continuous functions with 1)]. Early stopping method [prevent over-fitting: monitor error on a validation set; if it begins to rise: stop]; we can also do cross-validation. Speeding up the convergence (momentum term: w_qp(t) − w_qp(t−1)). Deep learning (many layers / automated feature extraction / unlabeled data for pre-training). BackProp is slow! Vanishing gradient problem, local minima. (Stacked) autoencoder networks: set the target outputs equal to the inputs. Has a middle hidden layer with fewer neurons [idea: compressed version of the input]. Can be used for encryption. Initialization for deep NN [for each layer: an autoencoder layer]: train the autoencoders, train the last layer, train the whole network (backprop). Stacking autoencoders: stacked autoencoder [using several autoencoders]. Convolutional NN [not fully connected: restrict connections]. Local connectivity, each neuron's input: [its receptive field]. Sharing weights [further reduction in connections: same input to many neurons, convolution layer]. Pooling [max-pooling layer: take the maximum value of a selected set of neurons from the convolutional layer; also called subsampling layer]. Local Contrast Normalization [(LCN) layer: normalize the output of each max-pooling neuron by taking x_i = (x_i − mean(x)) / sd(x)]. We can have multiple channels [filters]. We can have
many filters [output of filter: map]. Dropout We randomly set neurons to 0 during backprop (during test: random
weight) [forces NN to be less dependent on individual neurons].
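A minimal sketch of one incremental/stochastic update for a single sigmoid neuron with a momentum term; the learning rate eta, momentum factor alpha and function names are assumptions, and a full multilayer backprop pass is omitted.

import math

def sigmoid(n):
    return 1.0 / (1.0 + math.exp(-n))                        # f(n) = 1/(1+e^-n); f'(x) = f(x)(1 - f(x))

def sgd_step(w, x, d, prev_delta, eta=0.1, alpha=0.9):
    """One stochastic update of a single sigmoid neuron, with a momentum term
    (alpha times the previous weight change) to speed up convergence."""
    a = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))        # actual output
    e = d - a                                                # error e = desired - actual
    grad = [e * a * (1 - a) * xi for xi in x]                # -dE/dw_i for E = 1/2 * e^2
    delta = [eta * g + alpha * pd for g, pd in zip(grad, prev_delta)]
    return [wi + di for wi, di in zip(w, delta)], delta      # new weights, weight change to reuse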
Support Vector Machines Support vectors [touch the margin]. We want maximal margin. Classify example x [by calculating f = w·x + b and determining the sign]. Kernel trick [when moving to a higher dimension: φ(x, y) = (x·y)²] [margin = 2/‖w‖; maximizing it is equivalent to minimizing ‖w‖²/2].
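The decision rule and the kernel above as a tiny sketch (function names assumed).

def svm_classify(w, b, x):
    # f = w.x + b; the sign gives the class
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if f >= 0 else -1

def poly_kernel(x, y):
    # kernel trick: K(x, y) = (x . y)^2 is a dot product in a higher-dimensional space
    return sum(xi * yi for xi, yi in zip(x, y)) ** 2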
Ensemble (committee) of Classifiers [combining base classifiers (highly correct and diverse), enlarges H]. Bagging (bootstrap aggregation): bootstrap sample [choose len(data) elements with replacement; on average 63/100 will be included], bagging [create M bootstrap samples, each ↦ a classifier]. New example: voting. Boosting typically more accurate than bagging, but more sensitive to noise [make classifiers complement each other: train the next using "difficult" examples]. AdaBoost boosting theorem (if the base learning algo is a weak learning algo, then AdaBoost will return an ensemble that classifies perfectly for large enough K, i.e. AdaBoost boosts a weak algo into a strong algo). Random forest Create M
decision trees by: Bagging and random feature selection (based on k = log2 m randomly selected features) (without
pruning). Classify: [Majority voting].
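A bagging sketch around any train(data) -> predict(x) base learner; that interface is an assumption.

import random
from collections import Counter

def bootstrap(data):
    # sample len(data) elements with replacement; on average ~63/100 of distinct examples get in
    return [random.choice(data) for _ in data]

def bagging(data, train, m=10):
    """Create m bootstrap samples, train one base classifier on each, classify by majority vote."""
    models = [train(bootstrap(data)) for _ in range(m)]
    def vote(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]
    return vote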

Bayesian networks P(x_1, …, x_n) = Π_{i=1}^n P(x_i | Parents(x_i)) [derived from the chain rule and conditional independence]. Node [cond. indep. of its non-descendants given its parents; (Markov blanket): cond. indep. of all nodes given its parents, children and children's parents]. CPT [conditional probability tables].
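A sketch of the factored joint probability read off CPTs stored as dicts; the tiny Rain -> WetGrass network and its numbers are made up.

def joint(assignment, parents, cpt):
    """P(x1..xn) = product over i of P(xi | Parents(xi)), read off the CPTs."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][parent_values][value]
    return p

# made-up example: Rain -> WetGrass
parents = {"Rain": (), "WetGrass": ("Rain",)}
cpt = {
    "Rain": {(): {True: 0.2, False: 0.8}},
    "WetGrass": {(True,): {True: 0.9, False: 0.1}, (False,): {True: 0.1, False: 0.9}},
}
# joint({"Rain": True, "WetGrass": True}, parents, cpt) == 0.2 * 0.9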
Clustering Partial (one set), Hierarchical [nested set (no need for k)]. Centroid (middle of the cluster). Medoid
[centrally located point]. d(v1 , v2 ) [can be between: centroids, medoids, single link [MIN] (smallest pairwise d) sensitive
to noise (chain effect), complete link [MAX] (more compact clusters, less sensitive to noise), average link]. Good clustering [high cohesion (high similarity within a cluster), high separation (low similarity between clusters). Both measured by d]. Combining them: Davies-Bouldin index DB = (1/k) Σ_{i=1}^k max_{j≠i} (dist(x, c_i) + dist(x, c_j)) / d(c_i, c_j) [dist: mean square distance of x to the centroid]. We want DB small. K-means [choose k centroids. Form k clusters by d. End of epoch: re-compute centroids; if no change or max epochs: stop]. Data should be normalized. Sensitive to seeds. Time-expensive. Interpret as optimization: [minimize sum-squared error]. May get stuck in a local minimum. Voronoi. Simple. Doesn't like outliers.
Nearest Neighbor clustering [first item creates a cluster. A new item either merges into an existing cluster or starts a new one, based on d. Often
single link (MIN)]. Hierarchical Clustering algorithms [create tree: dendrogram, leaf: single example/cluster, root:
all in one cluster, node: by merging] Agglomerative (bottom-up) [start: each in own cluster, merge based on distance
between clusters (increment the threshold), stop: when root is reached]. Divisive clustering [place all in one, iter:
split if distance ≤ threshold, initial threshold: big]
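A k-means sketch matching the notes (assign by distance, re-compute centroids, stop when nothing changes); seeding with the first k points is an assumption, and points are numeric tuples.

def kmeans(points, k, max_epochs=100):
    """Choose k centroids, assign points to the nearest one, recompute centroids; stop on no change."""
    centroids = points[:k]                                   # naive seeding: first k points
    for _ in range(max_epochs):
        clusters = [[] for _ in range(k)]
        for p in points:                                     # form k clusters by distance d
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:                       # end of epoch: no change -> stop
            break
        centroids = new_centroids
    return clusters, centroids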
