
Machine Learning
Study online at quizlet.com/_298y5j
1. 1st order Markov restricted to encoding sequential 10. ATTRIBUTE SUBSET Attribute subset selection = feature
models correlation on previous element only SELECTION selection
Feature selection is a form of
2. ABSTRACT Representation + Evaluation +
dimensionality reduction
ESSENCE OF ML Optimisation
in ML, hence the DM term
3. ACQUIRING Wide range of options to model ****** 'dimensionality reduction'
EMISSIONS probabilities: for manifold projection is
Discrete tables problematic.
Gaussians Approaches:
Mixture of Gaussians • Exact solution infeasible
Neural Networks/RVMs etc to mode • Greedy forward selection
4. All probability Product rule • Backward elimination
theory can be Sum rule • Forward-backward
expressed in terms • Decision tree induction
of two rules 11. BACKPROPAGATION Used to calculate derivatives of error
5. ANN FEATURE Artificial Neural Networks can function efficiently
SELECTION implicitly perform feature Errors propagate backwards layer by
selection layer
A multi-layer neural network where the 12. Backprop is for: Arbitrary feed-forward topology
first hidden layer Differentiable nonlinear activation
has fewer units (nodes) than the input functions
layer Broad class of error function
Called 'Auto-associative' networks
13. BACKWARD 1.Start with complete SF set (contains
6. APRIORI Apriori algorithm is a fast way of ELIMINATION all
ALGORITHM finding original features)
frequent itemsets 2. Find feature that, when removed,
7. ARFF Attribute-Relation File Format reduces
the filter score least
8. ARTIFICIAL NEURAL Feed-forward neural
3.Remove feature from SF set
NETS network/Multilayer Perceptron one of
4.Repeat steps 2-3 until convergence
many ANNs
We focus on the Multilayer Perceptron 14. BASIC DECISION Decision trees apply a series of linear
Really multiple layers of logistic TREES decisions, that often
regression depend on only a single variable at a
models time. Such trees partition
the input space into cuboid regions,
9. ASSOCIATION reflect items that are frequently
gradually refining the level
RULES found (purchased) together, i.e. they
of detail of a decision until a leaf
are frequent
node has been reached,
itemsets
which provides the final predicted
• Information that customers who buy
label.
beer also buy
crisps is e.g. encoded as: 15. Bayes' Error The Bayes Error rate is the theoretical
beer -> crisps [support = 2%, confidence lowest possible error rate
= 75%] for any classifier on a given
problem (dataset).
For real data, it is not possible to
calculate the Bayes Error rate,
although upper bounds can be given
when certain assumptions on
the data are made.
The Bayes Error functions mostly as a
theoretical device in
Machine Learning and Pattern
Recognition research.
16. BAYES' posterior =(likelihood x prior)/evidence 27. CFS Correlation based feature selection (CFS)
THEOREM selects features in a
forward-selection manner.
17. A Bernoulli is a trial with a binary outcome, for which
Looks at each step at both correlation
trial the
with target variable and
probability that the outcome is 1 equals p
already selected features.
(think of a coin toss of an
old warped coin with the probability of 28. CHOOSING K Elbow method
throwing heads being p). • Visual inspection
A Bernoulli experiment is a number of • 'Downstream' Analysis
Bernoulli trials performed after
29. Classification ML task where T has a discrete set of
each other. These trials are i.i.d. by
outcomes
definition.
• Often classification is binary
18. BINOMIAL In probability theory and statistics, the • Examples:
DISTRIBUTION binomial distribution with parameters n and • face detection
p is the discrete probability distribution of • smile detection
the number of successes in a sequence of n • spam classification
independent yes/no experiments, each of • hot/cold
which yields success with probability p.
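As a rough illustration of the Binomial distribution card, the sketch below computes the probability of exactly k successes in n i.i.d. Bernoulli trials, P(X = k) = C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` and the coin example are illustrative choices, not part of the original deck.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Distribution of the number of heads in 4 tosses of a fair coin (p = 0.5).
pmf = [binomial_pmf(k, 4, 0.5) for k in range(5)]
print(pmf)
```

The probabilities over k = 0..n sum to 1, which is a quick sanity check for any choice of n and p.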
30. CLASSIFICATION Common performance measure for
19. BLOCKING When a path is blocked, no information can MEASURES - classification problems
PATHS flow ERROR RATE Success: instance's class is predicted
through it correctly (True Positives (TP) /
This means that observing C, if it blocks a Negatives (TN))
path A-C-B, it Error: instance's class is predicted
means there is no added value in observing incorrectly (False Positives (FP) /
A, and B is Negatives (FN))
fully determined by C False positives - Type I error. False
Negative - Type II error
20. Bootstrapping: Randomly select a subset to be training set
Classification error rate: proportion of
Randomly select a subset to be test set
instances misclassified over the
Can be repeated many times
whole set of instances
Has theoretical problems of statistical
significance when 31. CLUSTERING Market segmentation
repeated APPLICATIONS Social Network Analysis
Vector quantisation
21. Branching Branching factor of node at level L is equal
Facial Point detection
Factor to the
number of branches it has to nodes at level 32. Commonly used Linear kernel:
L+1 kernels Polynomial kernel
Gaussian kernel
22. C4.5 Successor of ID3
(Gaussian kernel is probably the most
Multiway splits are used
frequently used kernel out there -
Statistically significant split pruning
Gaussian kernel maps to infinite feature
23. CART Classification And Regression Trees space)
24. •Causes of •"Not applicable" data value when collected 33. Comparing t-test
incomplete •Different considerations between the time Hypotheses Analysis of Variance (ANOVA) test
data: when the data
34. COMPLEX ITEM The general rule procedure for finding
was collected and when it is analysed.
SETS frequent
•Human/hardware/software problems
item sets would be:
25. •Causes of •Different data sources 1. Find all frequent itemsets
inconsistent •Functional dependency violation (e.g., 2. Generate strong association rules
data: modify linked data) However, this is terribly costly, with the
26. •Causes of •Faulty data collection instruments total
noisy data •Human or computer error at data entry number of item sets to be checked for
(incorrect •Errors in data transmission 100 items
values): being:
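The figure for the COMPLEX ITEM SETS card did not survive in the text, but it can be worked out under one assumption: that every non-empty subset of the items counts as a candidate itemset. With n items there are then 2^n - 1 candidates. The helper name `candidate_itemsets` is hypothetical.

```python
def candidate_itemsets(n_items: int) -> int:
    """Number of non-empty subsets of n_items distinct items (assumed reading)."""
    return 2**n_items - 1


count_100 = candidate_itemsets(100)
print(f"{count_100:.3e}")  # on the order of 10^30 candidates
```

This growth is what motivates pruning strategies such as the Apriori algorithm mentioned earlier in the deck.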
35. Conditional is the PGN mechanism to 45. Cross-validation Randomly split data into n folds and
independence show information in terms of interesting iteratively use one as
in PGN aspects of test set
probability distributions All data used to test, and almost all to
train
36. CONFUSION Easy to see if the system is commonly
Good for small sets
MATRIX mislabelling one class as another
46. CROSS- Split training data in a number of folds
37. Convolutional is a type of feed-forward artificial neural
VALIDATION For each fold, train on all other folds
Neural network in which the connectivity pattern
CRITERION and make predictions
Networks between its neurons is inspired by the
on the held-out test fold
organization of the animal visual cortex,
Combine all predictions and calculate
whose individual neurons are arranged in
error
such a way that they respond to
If error has gone down, continue
overlapping regions tiling the visual field.[1]
splitting nodes, otherwise,
38. COST Squared error cost function stop
FUNCTION
47. Curse of When D becomes large, learning
39. cost function ... Dimensionality problems can become very
- City block difficult. For example:
distance when dividing a space (∈ R^D) into
40. cost function ... regular cells, the number of
- Euclidean cells grows exponentially with D.
distance in linear regression a polynomial model
of order M has on the order of D^M coefficients
41. cost function ...
a sphere in high dimension has most of
- L1 norm
its volume in an
42. cost function ... infinitesimally thin slice near the surface
- L2 norm
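Two of the Curse of Dimensionality examples can be checked numerically. This is a minimal sketch, assuming a grid with the same number of bins on every axis and a unit-radius ball; the volume of a D-dimensional ball scales with r^D, so the fraction of volume within distance eps of the surface is 1 - (1 - eps)^D. The function names are illustrative.

```python
def num_cells(bins_per_axis: int, d: int) -> int:
    """Number of regular cells when each of d axes is split into bins_per_axis bins."""
    return bins_per_axis**d


def shell_fraction(eps: float, d: int) -> float:
    """Fraction of a unit d-ball's volume lying within eps of its surface."""
    return 1.0 - (1.0 - eps) ** d


for d in (1, 3, 10, 100):
    print(d, num_cells(10, d), round(shell_fraction(0.01, d), 4))
```

Even a shell only 1% of the radius thick ends up holding most of the volume once D reaches the hundreds.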
48. DATA Grouping a possibly infinite space to a
43. cost function ... DISCRETISATION discrete set of possible values
- Manhattan For categorical data:
44. Cross- In n-fold cross-validation, a dataset is split • Super-categories
validation into n roughly equally For real numbers:
sized partitions, such that each example is • Binning
assigned to one and only • Histogram analysis
one fold. At each iteration a hypothesis is • Clustering
learned using n-1 folds as 49. DATA • Entity identification problem
the training set and predictions are made on INTEGRATION • Redundancy detection
the n'th fold. This is • correlation analysis
repeated until a prediction is made for all n • Detection and resolution of data value
examples, and an error conflicts
rate for the entire dataset is obtained. • e.g. weight units, in/exclusion of taxes
Cross-validation maximises the amount of
50. Data Mining is the quest to extract knowledge and/
data available to train
or unknown interesting patterns from
and test on, at cost of increased time to
apparently
perform the evaluation.
unstructured data
• Training Data segments between different
AKA Knowledge Discovery from Data
folds should never overlap!
(KDD)
• Training and test data in the same fold
• Data mining bit of a misnomer -
should never overlap!
information/
Error estimation can either be done per fold
knowledge is mined, not data.
separately (as
shown above), or delayed by collating all
predictions per fold.
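The n-fold procedure described on the Cross-validation card can be sketched as follows. This is a minimal illustration, not a reference implementation: the majority-class learner is a stand-in hypothesis chosen only to keep the example self-contained, and all function names are hypothetical. Predictions are collated across folds before the error rate is computed.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, n_folds, seed=0):
    """Shuffle indices and deal them into n_folds roughly equal, disjoint folds."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def majority_class_learner(train_labels):
    """Stand-in learner: always predicts the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda x: most_common


def cross_validated_error(features, labels, n_folds=5):
    """Train on n-1 folds, predict the held-out fold, collate all errors."""
    errors = 0
    for held_out in k_fold_indices(len(labels), n_folds):
        held = set(held_out)
        train_labels = [labels[i] for i in range(len(labels)) if i not in held]
        h = majority_class_learner(train_labels)
        errors += sum(h(features[i]) != labels[i] for i in held_out)
    return errors / len(labels)


print(cross_validated_error(list(range(10)), [0] * 8 + [1] * 2, n_folds=5))
```

Each example lands in exactly one fold, so every example is predicted exactly once and never by a hypothesis that was trained on it.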
51. DATA QUALITY Multi-Dimensional Measure of Data 56. Deep learning (Deep) Neural Networks
MEASURES Quality methods • Convolutional Neural Networks
• A well-accepted multidimensional • Restricted Boltzmann
view: Machines/Deep Belief Networks
• Accuracy • Recurrent Neural Networks
• Completeness
57. Degrees of freedom Number of ways data can
• Consistency
of variability: change/number of separate
• Timeliness
transformations possible
• Believability
• Value added 58. Directed Acyclic are Bayesian Networks
• Interpretability Graphs (DAGs) Meaning there are no cyclic paths
• Accessibility from any node back to itself
• Broad categories: 59. Directed PGN: edges have direction (Bayesian
• Intrinsic, contextual, representational, Network)
and accessibility
60. DIRTY DATA incomplete:
52. DATA REDUCTION Should remove what's unnecessary, yet noisy:
otherwise inconsistent:
maintain the distribution and
61. Discrete latent are hidden variables that can take
properties of the
variables only
original data
a limited number of discrete values
• Data cube aggregation
(e.g. gender or basic
• Attribute subset selection (feature
emotion).
selection)
• Dimensionality reduction (manifold 62. DM Concept/Class description
projection) FUNCTIONALITIES • Characterisation
• Numerosity reduction • Discrimination
• Discretisation • Frequent
patterns/Associations/Correlations
53. DATA Data transformation alters original data
• Classification and Regression
TRANSFORMATION to make it more
(Prediction)
suitable for data mining.
• Cluster analysis
• Smoothing (noise cancellation)
• Outlier analysis
• Aggregation
• Evolution analysis
• Generalisation
• Normalisation 63. DM INTEGRATION •No coupling - DMS will not utilise
• Attribute/feature construction WITH DBS/ any DB/DW
DATA system functionality
54. Decision surfaces • linear functions of x
WAREHOUSES•Tight •Loose coupling - uses some DB/DW
are • defined by (D-1)-dimensional
coupling functionality, in particular data
hyperplanes in the D-dimensional
fetching/storing,
input space
•Semi-tight coupling - in addition to
55. DEEP LEARNING Basically a Neural Network loose
Many hidden layers coupling use sorting, indexing,
Major breakthrough in pre-training aggregation,
Treat each layer first as an histogram analysis, multiway join, and
unsupervised statistics
restricted Boltzmann machine to primitives available in DB/DW
initialise weights systems
Then do standard supervised •Tight coupling
backpropagation
Can be used for unsupervised learning
and
dimensionality reduction
64. DM OBJECTIVE INTEREST Support: P(X ∪ Y) 69. DM TYPES OF DATA • Relational databases
Percentage of transactions that • Data warehouses
a rule satisfies • Transactional databases
• Object-relational databases
Confidence: P(Y | X) • Temporal/sequence/time-series
Degree of certainty of a databases
detected association, i.e. • Spatial and Spatio-temporal
the probability that a databases
transaction containing X also • Text & Multimedia databases
contains Y • Heterogeneous & Legacy databases
• Data streams
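The support and confidence measures from the DM OBJECTIVE INTEREST card can be computed directly from a list of transactions. In this sketch, support of a rule X -> Y is read as the fraction of transactions containing every item of both X and Y, and confidence as support(X and Y) / support(X). The toy transactions and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "bread"},
    {"beer", "bread"},
    {"milk", "bread"},
]
print(support(transactions, {"beer", "crisps"}))
print(confidence(transactions, {"beer"}, {"crisps"}))
```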
65. DM QUERY LANGUAGES DM query language
incorporates primitives 70. EIGENVALUES Given an invertible matrix M, an
Allows flexible interaction with eigenvalue equation can be found in
DM systems terms of a set of orthogonal vectors u_i,
Provides foundation for building and scalars λ_i such that:
user-friendly GUIs M u_i = λ_i u_i
Example: DMQL
71. EM Algorithm • Takes a long time
66. DM SUBJECTIVE Subjective measures require a ISSUES • Often initialised using k-Means
INTEREST human with domain
72. EMISSION Probabilities of observed variables
knowledge to provide
PROBABILITIES
measures:
•Unexpected results 73. ERROR 1. Apply input vector to network and
contradicting a priori beliefs BACKPROPAGATION propagate
•Actionable forward
•Expected results confirming 2. Evaluate δ_k for all output units
hypothesis 3. Backpropagate the δ's to obtain δ_j
for all hidden units
67. DM systems can be •Kinds of databases
4. Evaluate error derivatives as:
divided into types based •Kinds of knowledge
on a •Kinds of techniques 74. Estimating Sample Error vs. True Error
number of variables: •Target applications hypothesis accuracy Confidence Intervals
68. DM TASK PRIMITIVES DM task primitives form the 75. EVALUATING RULES • A good rule should not make
basis for DM queries. mistakes and should
DM primitives specify: cover as many examples as possible
•Set of task-relevant data to be Complexity: Favour rules with simple
mined predicates
•Kind of knowledge to be mined (Occam's Razor)
•Background knowledge to be 76. EVALUATING RULE A complete rule set should be good
used SETS at classifying all the
•Interestingness measures and training examples
thresholds for • Complexity: Favour rule sets with
pattern evaluation the minimal number
•Representation for visualising of rules
discovered patterns
77. EVALUATION For large datasets, a single split is
PROCEDURE usually sufficient.
For smaller datasets, rely on cross
validation
78. EVALUATION • For large datasets, a single split is
PROCEDURES usually sufficient.
• For smaller datasets, rely on cross
validation
79. EXTREME In an extreme, degenerate case, if D 87. FORWARD SELECTION 1.Start with empty SF set and
Dimensionality Case > n, each example can be candidate
uniquely described by a set of set being all original features
feature values. 2. Find feature with highest filter
score
80. Features/Attributes measurable values of variables that
3.Remove feature from candidate
correlate with the label y
set
Examples:
4.Add feature to SF set
• Sender domain in spam detection
5.Repeat steps 2-4 until
• Mouth corner location in smile
convergence
detection
• Temperature in forest fire 88. FREQUENT ITEMSET • Absolute support of an itemset
prediction is its frequency count
• Pixel value in face detection • Relative support is the
frequency count of the itemset
81. FEATURE SELECTION Feature Selection returns a subset
divided by the total size of the
of original feature set.
dataset
It does not extract new features.
Benefits: 89. General classification If classes are disjoint, i.e. each
Features retain original meaning problem pattern belongs to one and
After determining selected features, only one class then input space
selection process is fast is divided into decision
Disadvantages: regions separated by decision
Cannot extract new features which boundaries or surfaces
have stronger
90. Generalisation Generalisation is the desired
correlation with target variable
property of a classifier to be
82. FILTER SCORES Correlation able to
Mutual information predict the labels of unseen
Entropy examples correctly. A hypothesis
Classification rate generalises well if it can predict
Regression score an example coming from the
same distribution as the training
83. FISHER LDA Fisher's idea is to adjust/constrain so
examples well.
that class
separation in 1D is maximised 91. The goal of data mining find interesting patterns!
is to An interesting pattern is:
84. Fixed train, Randomly split data into training,
1. Easily understood
development and test development, and test sets.
2. Valid on new data with some
sets: Does not make use of all data to
degree of
train or test
certainty
Good for large datasets
3. Potentially useful
85. F-MEASURE Comparing different approaches is 4. Novel
difficult when using
92. Gradient descent is a Conjugate gradients
multiple evaluation measures (e.g.
poor algorithm itself. Quasi Newton
Recall and Precision)
Better Online gradient descent
F-measure combines recall and
variants exist:
precision into a single
measure: 93. Gradient point in the Steepest Ascent
direction of:
86. Forward-backward first applies Forward
algorithm selection and then filters redundant 94. HDF5 Much more complex file format
elements using designed
backward elimination for scientific data handling
It can store heterogeneous and
hierarchical organised data
• It has been designed for
efficiency
95. HESSIAN, Second-order Can be used to determine if a 103. IMPORTANCE • If you have good data, the rest will
partial derivative of stationary point is a (local) OF CLEANING follow
functions : y = f(x) minimum or maximum.
104. In order to an error function on the training set must
96. Hidden layer(s) can have arbitrary number of optimise the be minimised
nodes/units performance of This is done by adjusting:
have arbitrary number of links ANN *weights connecting nodes
from input nodes and to *parameters of non-linear functions h(a)
output nodes (or to next
105. Insight: LDA is a y = w^T x
hidden layer)
form of High-dimensional input is mapped by w
there can be multiple hidden
dimensionality onto a single dimension y
layers
reduction.
(Default is a fully
interconnected graph, i.e. 106. INSTANCE Reduces the number of instances rather
every input node is linked REDUCTION than attributes.
to every hidden node, and Much more dangerous, as it risks
every hidden node to every changing the data
output node) distribution properties!
• Duplicate removal
97. HIDDEN UNIT ACTIVATION Common functions for h(·) are the
• Random sampling
sigmoid or tanh:
• Cluster sample
98. How is deep neural network Optimised through gradient • Stratified sampling
optimised? descent! (Forward-Backward
107. Intrinsic Subspace of data space that captures
algorithm)
dimensionality degrees of variability
only, and is thus the most compact
Penalise complex solutions
possible representation
to avoid overfitting
108. ITEMSETS simply a set of items (cf set theory)
99. How to make trees To do so, we will seek to
compact? minimise impurity of data 109. JACOBIAN, First- Can be used to determine if a point is a
reaching order partial (local) extreme.
descendent nodes derivative of
functions :
100. HYPOTHESIS QUALITY We want to know how well a
machine learner, which 110. JOINT Joint probability is the probability of
learned the hypothesis as the PROBABILITY encountering a particular
approximation of the target class while simultaneously observing the
function , performs in terms of value of one or more
correctly classifying novel, other random variables
unseen examples Can be generalised for the state of any
We want to assess the combination of random
confidence that we can have variables
in this 111. Kernel is a shortcut that helps us do certain
classification measure calculation faster which otherwise would
101. ID3 Iterative Dichotomiser involve computations in higher
version 3 dimensional space.
Used for nominal, unordered, 112. Kernel methods Kernel methods map a non-linearly
input data only. separable input
Every split has branching space to another space which hopefully
factor , where is the number is linearly
of values a variable can take separable
(e.g. bins of discretised • This space is usually higher-dimensional,
variable) possibly
Has as many levels as input infinitely
variables • Even the 'non-linear' kernel methods
102. IID Independent and identically essentially solve a linear
distributed random variables optimisation problem!!!!
113. • Kernel map a non-linearly separable input 119. KNOWLEDGE 1. Data cleaning - remove noise and
methods space to another space which hopefully is DISCOVERY inconsistencies
linearly PROCESS 2. Data integration - combine data
separable sources
• This space is usually higher-dimensional, 3. Data selection - retrieve relevant
possibly data from db
infinitely 4. Data transformation - aggregation
The key element is that they do etc. (cf. feature extraction)
not actually map features to this space, 5. Data mining - machine learning
instead they 6. Pattern Evaluation - identify truly
return the distance between elements in this interesting patterns
space 7. Knowledge representation -
• This implicit mapping is called the visualise and transfer new
(Definition) Trick knowledge
114. Kernel trick The key element of kernel methods is that 120. Labels the values that h aims to predict
they do not actually map features to this Example:
space, instead they •Facial expressions of pain
return the distance between elements in this •Impact of diet on astronauts in space
space This implicit mapping is called the •Predictions of house prices
(definition)
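A small numerical check can make the Kernel trick card concrete. The sketch below uses the homogeneous polynomial kernel of degree 2 on 2-D inputs, for which an explicit feature map is known, and shows that the kernel value equals the inner product in the mapped space without ever constructing that space. Note that the quantity returned here is an inner product, from which distances in the mapped space can be derived. The function names are illustrative.

```python
import math


def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def explicit_map(x):
    """Explicit feature map phi for 2-D input with phi(x) . phi(y) = (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly_kernel(x, y)
via_mapping = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))
print(via_kernel, via_mapping)
```

For the Gaussian kernel the corresponding feature space is infinite-dimensional, so only the kernel route is available in practice.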
121. LATENT VARIABLES Latent variables are variables that are
115. K-MEANS 1. Initialise µk randomly 'hidden' in the data. They
ALGORITHM 2. Minimise J with respect to rnk, keeping µk are not directly observed, but must
I fixed be inferred.
3. Minimise J with respect to µk, keeping rnk Clustering is one way of finding
fixed discrete latent variables in data.
4. Repeat until convergence
122. A Lattice/Trellis state transitions over time
diagram visualises Also a good tool to visualise optimal
Step 2 is called Expectation step, step 3 the
path through states
Maximisation step
(Viterbi Algorithm)
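The four steps of the K-MEANS ALGORITHM card can be sketched in plain Python. This is a minimal illustration under a few assumptions not stated on the card: centres are initialised from randomly chosen data points, squared Euclidean distance is the cost, and an empty cluster simply keeps its previous centre. The function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # step 1: initialise the centres randomly
    assignment = None
    for _ in range(n_iter):
        # Step 2: with centres fixed, assign each point to its nearest centre.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:  # step 4: stop once nothing changes
            break
        assignment = new_assignment
        # Step 3: with assignments fixed, move each centre to its cluster mean.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its old centre (assumption)
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres, assignment = k_means(points, k=2)
print(centres, assignment)
```

Because the result depends on the initial centres, running with several seeds and keeping the solution with the lowest cost is a common safeguard.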
116. K-MEANS Informally, goal is to find groups of points
123. LEARNING RULE Learning rules sequentially, one at a
CLUSTERING that are close to each other but far from
SETS time
points in other groups
• Also known as separate-and-
• Each cluster is defined entirely and only by
conquer
its centre, or mean value µk
Learning all rules together
117. K-MEANS Convergence is guaranteed but not • Direct rule learning
ISSUES necessarily optimal - local • Deriving rules from decision trees
minima likely to occur
124. LEAST SQUARED Benefits:
• Depends largely on initial values of µk
LDA Exact closed form solution
• Hard to define optimal number K
Disadvantages:
• K-means algorithm is expensive: requires
• Not probabilistic
Euclidean
• Sensitive to outliers
distance computations per iteration
• Problems with multiple classes and/
• Each instance is discretely assigned to one
or unbalanced data
cluster
• Euclidean distance is sensitive to outliers 125. LINEAR BASIS ...
FUNCTIONS
118. K-MEDOIDS Addresses issue with quadratic error function
CLUSTERING (L2-norm, 126. linearly-separable Satisfising solution (e.g. perceptron
Euclidean norm) SVM algorithm): finds a
Replace L2 norm with any other dissimilarity solution, not necessarily the 'best'
measure (V...) Best is that solution that promises
maximum generalisability
127. LINEAR Linearly separable data: 137. Memory-based Uses all training data in every prediction
SEPARABILITY • datasets whose classes can be separated methods: (e.g. kNN)
by linear Becomes a kernel method if using a non-
decision surfaces linear example
• implies no class-overlap comparison/metric:
• classes can be divided by e.g. lines for 2D
138. MINIMUM Goal is to minimise error rate
data or planes
ERROR RATE
in 3D data
139. MINING • One approach to data mining is to find
128. Local maxima The largest values of the function within their own neighbourhoods; not necessarily the global maximum.
FREQUENT sets of
129. Local minima is the smallest value of the function within a neighbourhood, but it PATTERNS items that appear together frequently:
might not be the only one. frequent
itemsets
130. Machine Learning from experience. It's also called
• To be frequent some minimum threshold
Learning supervised learning, where E is the
of
supervision.
occurrence must be exceeded
131. Machine Field of study that • Other frequent patterns of interest:
Learning gives computers the ability to learn without • frequent sequential patterns
being explicitly • frequent structured patterns
programmed.
140. Min-max • Enables cost-function minimisation
132. Machine Field of study that normalisation: techniques to
Learning gives computers the ability to learn without function properly, taking all attributes into
broad being explicitly equal
definition programmed account
133. MAJOR DM *Data cleaning • Transforms all attributes to lie on the
PREP TASKS • Fill in missing values, smooth noisy data, range [0, 1]
identify or remove outliers, or [-1, 1]
and resolve inconsistencies • Linear transformation of all data
*Data integration 141. Misclassification is the minimum probability that training
• Integration of multiple databases, data Impurity example will be misclassified at node N
cubes, or files
142. MISSING It is common to have examples in your
*Data transformation
ATTRIBUTES dataset with missing
• Normalisation and aggregation
attributes/variables.
* Data reduction
One way of training a tree in the presence
• Obtains reduced representation in volume
of missing
but produces the same
attributes is removing all data points with
or similar analytical results
any missing attributes.
*Data discretisation
A better method is to only remove data
• Part of data reduction but with particular
points that miss a required attribute when
importance, especially for
considering the test for a given node
numerical data
for a given attribute.
134. Maximum is a classifier which is able to give an This is a great benefit of trees (and in
margin associated distance from the decision general of combined
classifier boundary for each example. models, see later slide)
135. MAXIMUM This turns out to be a solution where 143. MIXTURE OF • Simple formulation: density model with
MARGIN decision boundary is GAUSSIANS richer representation
CLASSIFIERS determined by nearest points only than single Gaussian
Minimal set of points spanning decision
boundary sought
These points are called Support Vectors
136. Measures of Classification Error Rate
classification Cross Validation
accuracy Recall, Precision, Confusion Matrix
Receiver Operating Characteristic (ROC) curves, two-alternative
forced choice
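Several of the measures listed under "Measures of classification accuracy" follow directly from the four confusion-matrix counts. The sketch below assumes a binary problem and takes the F-measure to be the balanced F1 score, the harmonic mean of precision and recall; that choice and the example counts are assumptions for illustration.

```python
def classification_measures(tp, fp, tn, fn):
    """Error rate, precision, recall and balanced F-measure from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)  # of predicted positives, fraction correct
    recall = tp / (tp + fn)     # of actual positives, fraction found
    return {
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


m = classification_measures(tp=40, fp=10, tn=45, fn=5)
print(m)
```

The sketch does not guard against zero denominators, which occur when no positives are predicted or none are present.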
144. MODEL • Decision Trees combine a set of models 153. NOISY DATA - Cancelling noise by clustering:
COMBINATION (the nodes) clustering: 1. Cluster data into N groups
VIEW • In any given point in space, only one 2. Replace original values by
model (node) is responsible means of clusters
for making predictions 3. OR: use to detect outlierst
• Process of selecting which model to
154. NOISY DATA - Cancelling noise by regression:
apply can be described as a
REGRESSION 1. Fit a parametric function to
sequential decision making process
the data using minimisation of
corresponding to the traversal of a binary
e.g. least squares error
tree
2. Replace original values by the
145. MULTICLASS SVM is an inherently binary classifier parametric function value
SVM Two strategies to use SVMs for multiclass
155. NON-LINEARLY Usually problems aren't linearly
classification:
SEPARABLE separable (not even in
One-vs-all
PROBLEMS feature space)
One-vs-one
'Perfect' separation of training data
Problems:
classes would cause
Ambiguities (both strategies)
poor generalisation due to massive
Imbalanced training sets (one-vs-all)
overfitting
146. MULTIVARIATE Instead of monothetic decisions at each
156. NORMAL By far the most (ab)used
TREES I node, we can learn
DENSITY density function is the
polythetic decisions.
Normal or Gaussian
This can be done using many linear
density
classifiers, but keep it simple!
157. NORMAL The Normal distribution has many
147. MUTUAL gives a measure of how 'close' two
DISTRIBUTION useful properties. It
INFORMATION components
is fully described by its mean and
of a joint distribution are to being
variance and is easy
independent:
to use in calculations.
148. Ancestral is a simple sampling method well The good thing: given enough
sampling suited to PGNs experiments, a Binomial
distribution converges to a Normal
149. NETWORK Variations include:
distribution.
TOPOLOGY Arbitrary number of layers
Fewer hidden units than input units (causes 158. NUMEROSITY Reduces the number of instances rather
in effect REDUCTION than attributes.
dimensionality reduction, equivalent to Much more dangerous, as it risks
PCA) changing the data
Skip-layer connections (see below) distribution properties!
Fully/sparsely interconnected networks • Parametrisation
• Discretisation
150. NN ERROR *Regression:
• Sampling
FUNCTIONS *Binary classification
*Multiple independent binary classification: 159. Occam's Razor All things being equal - the simplest
*Multi-class classification explanation is the best
(mutually exclusive):
160. ONE-CLASS SVM Unsupervised learning problem
151. NOISY DATA Noise is a random error or variance in a Similar to probability density estimation
measured variable Instead of a pdf, goal is to find smooth
boundary enclosing a
152. NOISY DATA - Cancelling noise by binning:
region of high density of a class
BINNING 1. Sort data
2. Create local groups of data
3. Replace original values by:
3.1. The bin mean
3.2. The closest min/max value
of the bin
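The binning steps from the NOISY DATA - BINNING card can be sketched as below, assuming equal-frequency bins of a fixed size (the card does not say how the local groups are formed). Both replacement options are shown: the bin mean (3.1) and the closest bin boundary (3.2). The function names and sample values are illustrative, and the results are returned in sorted order rather than original order.

```python
def smooth_by_bin_means(values, bin_size):
    data = sorted(values)                       # step 1: sort data
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_ = data[start:start + bin_size]     # step 2: local group
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))     # step 3.1: bin mean
    return smoothed


def smooth_by_bin_boundaries(values, bin_size):
    data = sorted(values)
    smoothed = []
    for start in range(0, len(data), bin_size):
        bin_ = data[start:start + bin_size]
        lo, hi = bin_[0], bin_[-1]
        # Step 3.2: snap each value to the closer of the bin's min and max.
        smoothed.extend(lo if v - lo <= hi - v else hi for v in bin_)
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
print(smooth_by_bin_boundaries(prices, 3))
```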
161. ONLINE On-line gradient descent updates 168. PCA Manifold projection
GRADIENT weight vector one Assumes Gaussian latent variables and
DESCENT data point at a time Gaussian observed
Maximum Likelihood-based error variable distribution
functions use a sum Linear-Gaussian dependence of the
of terms for independent data points: observed variables on
*Handles redundancy better the latent variables
*Can deal with new data better Also known as Karhunen-Loève transform
*Good chance of escaping local
169. PCA requires mean of observed variables
minima
calculation of: covariance of observed variables
162. ORTHOGONALITY Two vectors a and b are orthogonal eigenvalue/eigenvector computation of
if they're perpendicular covariance matrix
170. PGNs are allow us to sample from
In this case, their inner product is 0:
generative the probability distribution it defines
a·b=0
models
163. Output layer can single node for binary classification
171. PREDICATES A logic statement, generally as boolean
be: single node for regression
logic
n nodes for multi-class classification
(One network can also cover multiple 172. The Principle of It is pointless to do with more
output variables, thus Parsimony what is done with less
increasing the number of nodes.) 173. The Principle of Plurality should not be posited
164. Overfitting A hypothesis is said to be overfit if its Plurality without necessity
prediction performance on 174. PRIOR is the probability of encountering a class
the training data is overoptimistic PROBABILITY without observing any evidence
compared to that on unseen Can be generalised for the state of any
data. random variable
It presents itself in complicated Easily obtained from training data (i.e.
decision boundaries that depend counting)
strongly on individual training
175. PROBABILITY p(x) = marginal distribution
examples.
THEORY RECAP p(x,y) = joint distribution
165. Overfitting can Learning is performed for too long (e.g. p(x|y) = conditional distribution
occur when: in Neural Networks)
176. PRUNING First fully train a tree, without stopping
The examples in the training set are not
criterion
representative of all
After training, prune tree by eliminating
possible situations (is usually the case! )
pairs of leaf nodes
Model parameters are adjusted to
for which the impurity penalty is small
uninformative features in
the training set that have no causal 177. RANDOM Very good performance (speed,
relation to the true FORESTS accuracy) when abundant data is
underlying target function! available
Use bootstrapping/bagging to initialise
166. Parametric Many methods learn parameters of
each tree with
methods: prediction
different data
function (e.g. linear regression, ANNs)
Use only a subset of variables at each
After training, training set is discarded.
node
Prediction purely based on learned
Use a random optimisation criterion at
parameters and new data
each node
167. Pattern finding patterns without experience. It's Project features on a random different
Recognition also called unsupervised learning. manifold at each node
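Two of the RANDOM FORESTS ingredients (card 177), a bootstrap sample per tree and a random subset of variables per node, can be sketched in a few lines. This is only an illustration under assumptions: the helper names, the toy rows and the fixed seed are hypothetical, and no tree is actually grown.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at a single node."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed, only so the sketch is repeatable

# Hypothetical toy data: three numeric features followed by a class label.
rows = [
    (5.1, 3.5, 1.4, "a"),
    (6.2, 2.9, 4.3, "b"),
    (5.9, 3.0, 5.1, "c"),
    (4.7, 3.2, 1.3, "a"),
]

tree_data = [bootstrap_sample(rows, rng) for _ in range(3)]  # data for 3 trees
node_features = random_feature_subset(n_features=3, k=2, rng=rng)
```

Each tree then sees a slightly different data set, and each node a different pair of candidate variables, which is what decorrelates the trees.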
178. RANDOM INITIALISATION Randomly Initialise K-Means clusters using actual instances as cluster centres
Run K-Means and store centres and final Cost function
Pick clusters of iteration with lowest Cost function as optimal solution
Most useful if K < 10
179. Regression ML task where T has a real-valued outcome on some continuous sub-space
Examples:
• age estimation
• stock value prediction
• temperature prediction
• energy consumption prediction
180. REGRESSION TREES Trained in a very similar way
Leaf nodes are now continuous values - the value at a leaf node is that assigned to a test example if it reaches it
Leaf node label assignment is e.g. mean value of its data sample
Problem: nodes make hard decisions, which is particularly undesired in a regression problem, where a smooth function is sought.
181. REGULARIZATION Regularization is a technique used in an attempt to solve the overfitting problem in statistical models.
182. REGULARIZATION Maximum likelihood generalisation error (i.e. cross-validation)
Regularised error (penalise large weights)
Early stopping
183. Relevance Vector Machines model the typical points of a data set, rather than atypical (a la density estimation) while remaining a (very) sparse (like heat map) representation
Returns a true posterior
Naturally extends to multi-class classification
Fewer parameters to tune
184. RELU Rectified Linear Unit
New trend, responsible for a great deal of Deep Learning success:
• No 'vanishing gradient' problem
• Can model any positive real value
• Can stimulate sparseness
185. ROBBINS-MONRO • Addresses the slow update speed of the M-step in K-means
• Uses linear regression (see lecture 1)
186. ROC CURVES Receiver Operating Characteristic (ROC) curves plot TP vs FP rates
187. RULE BASED LEARNING Equivalent in expressive power to traditional (mono-thetic) decision trees, but with more flexibility
• They produce rule sets as solutions, in the form of a set of IF... THEN rules
188. Rule confidence • confidence(A -> B) = P(B|A)
189. RULE SETS • Single rules are not the solution of the problem, they are members of rule sets
• Rules in a rule set cooperate to solve the problem. Together they should cover the whole search space
190. Rule support: support(A -> B) = P(A ∪ B)
191. SAMPLE ERROR In statistics, sampling error is incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population.
192. Sampling Theory Basics Binomial and Normal Distributions
Mean and Variance
193. SEARCH METHODS Exhaustive
Greedy forward selection
Greedy backward elimination
Forward-backward approach
194. SEQUENCE DATA Sequence data is data that comes in a particular order
Opposite of independent, identically distributed (i.i.d.)
Strong correlation between subsequent elements
DNA
Time series
Facial Expressions
Speech Recognition
Weather Prediction
Action planning
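The RANDOM INITIALISATION recipe (card 178) reads naturally as a loop: start K-Means from actual instances, store the final cost, repeat, keep the cheapest run. The sketch below assumes 1-D points, a fixed number of iterations and a sum-of-squared-distances cost; these choices and all function names are illustrative assumptions, not taken from the cards.

```python
import random


def kmeans(points, k, rng, iters=20):
    """One K-Means run on 1-D points; centres start at k actual instances."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centres[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centre to its cluster mean
        # (an empty cluster keeps its old centre).
        centres = [
            sum(c) / len(c) if c else centres[j] for j, c in enumerate(clusters)
        ]
    # Cost: sum of squared distances to the nearest centre (assumed here).
    cost = sum(min((p - c) ** 2 for c in centres) for p in points)
    return centres, cost


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-Means several times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    runs = [kmeans(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1]  # hypothetical data
centres, cost = best_of_restarts(points, k=3)
```

Restarts help most when K is small, which matches the card's "most useful if K < 10" remark: with few clusters a single unlucky initialisation can dominate the result.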
195. SIGNIFICANCE LEVEL Significance level α%: α times out of 100 you would find a statistically significant difference between the distributions even if there was none. It essentially defines our tolerance level.
If the calculated t value is above the threshold chosen for statistical significance then the null hypothesis that the two groups do not differ is rejected in favour of the alternative hypothesis: the groups do differ.
196. SIMPLE DATA SPLITS Fixed train, development and test sets:
Bootstrapping:
Cross-validation
197. SIMPLER METHOD for complex item sets • Closed frequent itemset: X is closed if there exists no super-set Y such that Y has the same support count as X
• Maximal frequent itemset: X is frequent, and there exist no supersets Y of X that are also frequent
198. The simplest ANNs consist of • A layer of D input nodes
• A layer of hidden nodes
• A layer of output nodes
• Fully connected between layers
199. Six general questions to decide on decision tree algorithm: 1. How many splits per node (properties binary or multi valued)?
2. Which property to test at each node?
3. When to declare a node to be leaf?
4. How to prune a tree that has become too large (and when is a tree too large)?
5. If a leaf node is impure, how to assign a category label?
6. How to deal with missing data?
200. Slack variable Slack variables introduced to solve optimisation problem by allowing some training data to be misclassified
Slack variables $\xi_n \geq 0$ give a linear penalty to examples lying on the wrong side of the d.b.:
$\xi_n = 0$ for a point on the correct side of the d.b., $\xi_n = |t_n - y(x_n)|$ otherwise
201. Small set of SVs means that our solution is now sparse
202. SOFT MARGIN We have effectively replaced the hard margin with a soft margin
New optimisation goal is maximising the margin while penalising points on the wrong side of d.b.:
203. Some variables are observed, others are hidden/latent Example observed: labels of a training set
Example hidden: learned weights of a model
204. SPARSE KERNEL METHODS Must be evaluated on all training examples during testing
Must be evaluated on all pairs of patterns during training
* Training takes a long time
* Testing too
* Memory intensive (both disk/RAM)
Solution: sparse methods
205. Sparse kernel methods Must be evaluated on all training examples during testing
Must be evaluated on all pairs of patterns during training
Training takes a long time
Testing too
Memory intensive (both disk/RAM)
Solution: sparse methods
206. STOPPING CRITERIA Reaching a node with a pure sample is always possible but usually not desirable as it usually causes over-fitting.
207. SUPPORT AND CONFIDENCE are measures of pattern interestingness
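As a small worked example of the support and confidence measures (cards 188, 190 and 207), the sketch below counts them over a handful of transactions. Here support(A -> B) is read as the fraction of transactions containing all items of A and B together, and confidence(A -> B) as P(B|A). The transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, a, b):
    """P(B|A): support of A and B together divided by support of A."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of 3 bread baskets
```

For the rule bread -> milk this gives a support of 0.5 and a confidence of 2/3: half of all baskets contain both items, and two of the three bread baskets also contain milk.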
208. SVM Margin is defined as the minimum distance between the decision boundary and any sample of a class
209. SVMs seek a decision boundary that maximises the margin
210. Techniques for cancelling out noise: 1. Binning - first sort data, then distribute over local bins
2. Regression - fit a parametric function to the data (e.g. linear or quadratic function)
3. Clustering
211. Tests for comparing distributions T-TEST compares two distributions
ANOVA compares multiple distributions
If the NULL-hypothesis is refuted, there are at least two distributions with significantly different means
Does NOT tell you which they are!
212. There are three reasons to reduce the dimensionality of a feature set: 1. Remove features that have no correlation with the target distribution
2. Remove/combine features that have redundant correlation with target distribution
3. Extract new features with a more direct correlation with target distribution
213. Three common ways to decide when to stop splitting decision tree: Validation set
Cross-validation
Hypothesis testing (chi-squared statistic)
214. Three ways of constructing new kernels: Direct from feature space mappings
Proposing kernels directly
Combination of existing (valid) kernels
* multiplication by a positive constant
* exponential of a kernel
* sum of two kernels
* product of two kernels
* left/right multiplication by any function of x/x'
215. TRAINING ALGORITHM Given a model h with solution space S and a training set {X,Y}, a learning algorithm finds the solution that minimises the cost function J(S)
216. TRAINING A NETWORK Choice of output activation function depends on the output variables:
* identity for regression
* logistic sigmoid for (multiple independent) binary classification
* softmax for exclusive (1-of-K) multiclass classification
(Training a network involves finding the weights that minimise some error function)
217. TRAINING LDA objective: Find (i.e. learn) the weights that minimise some error function on the training set.
Significant approaches:
• Least squares
• Fisher
• Perceptron
218. Tree components Root node, branch, node, leaf node.
219. Tree varieties Trees are called monothetic if one property/variable is considered at each node, polythetic otherwise
220. The true error of hypothesis h is the probability that it will misclassify a randomly drawn example from distribution D
However, we cannot measure the true error.
We can only estimate it by observing the sample error e_S
221. T-TEST Assesses whether the means of two distributions are statistically different from each other
222. Undirected PGN no edge direction (Markov Random Field)
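To make the T-TEST card (221) concrete, here is a sketch of a t statistic for two samples. The cards do not say which variant is meant; this uses Welch's unequal-variance form as an assumption, and it stops at the statistic: comparing it against a threshold for a chosen significance level (card 195) is left out. The samples are made up.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """t statistic for the difference in means (Welch's unequal-variance form)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error


# Hypothetical measurements from two groups.
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 2.0, 3.0]
t = welch_t(group_a, group_b)
```

A larger absolute value of t means the difference in means is large relative to the spread within the groups.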
223. VALIDATION SET CRITERION Split training data into a training set and a validation set (e.g.
66% training data and 34% validation data)
Keep splitting nodes, using only the training data to learn
decisions, until the error on the validation set stops going
down.
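The validation set criterion can be sketched as two small steps: hold out part of the data, then stop at the last split before the validation error stops going down. The 66/34 fraction comes from the card; the function names and the error values are hypothetical, and the tree-growing itself is not shown.

```python
def split_train_validation(rows, train_fraction=0.66):
    """Hold out the tail of the (already shuffled) rows for validation."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def stop_point(validation_errors):
    """Index of the last split before the validation error stops decreasing."""
    for i in range(1, len(validation_errors)):
        if validation_errors[i] >= validation_errors[i - 1]:
            return i - 1
    return len(validation_errors) - 1


train, validation = split_train_validation(list(range(100)))

# Hypothetical validation error measured after 0, 1, 2, ... node splits.
errors = [0.40, 0.31, 0.25, 0.22, 0.23, 0.27]
best = stop_point(errors)
```

With these numbers the error falls until the fourth measurement and rises afterwards, so splitting would stop there.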
224. Well-posed Learning Problem A computer program is said to learn from experience E with
respect to some task T and some performance measure P, if its
performance on T, as measured by P, improves with experience
E.
226. What is Deep Learning? Definition:
• Hierarchical organisation with more than one (non-linear)
hidden layer in-between the input and the output
variables
• Output of one layer is the input of the next layer
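The "output of one layer is the input of the next layer" point can be illustrated with a tiny forward pass through two non-linear hidden layers. This is a sketch under assumptions: the weights are made up, ReLU is chosen as the hidden non-linearity, and the output unit is linear; none of this is prescribed by the card.

```python
def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j] holds the incoming weights of unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(x, layers):
    """Feed x through the layers; each hidden output is the next layer's input."""
    *hidden, (w_out, b_out) = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
    return dense(x, w_out, b_out)  # linear output unit


# Hypothetical weights: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 2.0]], [0.5]),
]
y = forward([2.0, 1.0], layers)
```

In this run the second hidden layer outputs a zero for one of its units, a small instance of the sparseness that ReLU units can produce.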
227. What trees are preferable? We prefer simple, compact trees, following Occam's Razor
228. z-score normalisation Better terminology is zero-mean normalisation
• min-max normalisation cannot cope with
outliers, z-score normalisation can
• Transforms all attributes to have zero mean and
unit standard deviation
• Outliers are in the heavy-tail of the Gaussian
• Still a linear transformation of all data:
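The formula itself is missing here. The standard z-score form, which is presumably what is meant, is x' = (x - μ) / σ, with μ the attribute mean and σ its standard deviation. A minimal sketch follows, assuming the population standard deviation (the sample version is the other common choice) and made-up data:

```python
import statistics


def z_score(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    return [(v - mean) / std for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical attribute values
z = z_score(data)
```

Because every value goes through the same shift and scale, the ordering and relative spacing of the data are preserved, which is the sense in which the transformation is still linear.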
