
Exam Advanced Data Mining

Date: 5-11-2009
Time: 14.00-17.00

General Remarks
1. You are allowed to consult 1 A4 sheet with notes written on both sides.
2. Always show how you arrived at the result of your calculations.
3. If you are a native Dutch speaker, answers in Dutch are preferred.
4. There are six questions, for which you can score a total of 100 points.

Question 1 Multiple Choice (16 points)


For the following questions, zero or more answers may be true.
a) Which of the following statements about classification trees are true?
1. The resubstitution error of a tree always goes down when we split
one of its leaf nodes.
2. In growing a tree, it is always possible to continue splitting until each
leaf node contains examples of a single class.
3. When used to compute the impurity reduction of a split, the gini-index and resubstitution error always prefer the same split.
4. For classification problems with two classes, in order to determine the optimal split for a categorical attribute with L distinct values, we have to compute 2^(L-1) − 1 possible splits.
b) Which of the following statements about frequent pattern mining are true?
1. If all the subsets of an itemset are frequent, then the itemset itself
must also be frequent.
2. All maximal frequent itemsets are closed.
3. In the A-close algorithm, an itemset that has a subset with the same
support is called a generator.

4. For an association rule, if we move one item from the right-hand side to the left-hand side of the rule, then the confidence will never go down.
c) Which of the following statements about subgroup discovery are true?
1. PRIM stands for Patient Rule Induction Method. The term Patient
here refers to the fact that PRIM was originally developed for medical
applications.
2. PRIM guards against overfitting by requiring that a subbox must
have a significantly higher target mean than its parent box.
3. To construct the K-th box, PRIM does not use the data points that fall into one of the first K − 1 boxes.
4. In box construction, Data Surveyor tends to reduce the support of
the subgroups faster than PRIM does.
d) Which of the following statements about clustering are true?
1. We don't want the clusters that are found by a clustering algorithm
to depend on the unit of measurement of a variable. For numeric
data, we can prevent this from happening by subtracting the mean
from each variable, so we get a new variable with zero mean.
2. In model based clustering, the specification of an appropriate dissimilarity measure is essential.
3. In agglomerative hierarchical clustering, we can use single-linkage,
complete-linkage or average-linkage to compute the dissimilarity between clusters. The first step of the algorithm (the first merging of
clusters) is the same regardless of the method we use to compute the
dissimilarity between clusters.
4. The average silhouette value will always increase as we increase the
number of clusters. Therefore it is not suited as a method to determine the appropriate number of clusters present in the data.

Question 2 Frequent Itemset Mining (20 points)


Given are the following five transactions on items {A, B, C, D, E}:
tid   items
 1    {A, B}
 2    {A, B, D}
 3    {B, D, E}
 4    {B, C, D, E}
 5    {A, B, C}

a) Use the Apriori algorithm to compute all frequent itemsets, and their support, with minimum support of 2. It is important that you clearly indicate the steps of the algorithm. (An illustrative sketch of the algorithm is given after this question.)
b) Give all closed frequent itemsets.
c) Use the frequent itemsets to construct a krimp codetable, and compute
how often each itemset in the codetable is used in covering the database.
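
For illustration only (not part of the exam): the sketch below is a minimal Python implementation of the level-wise Apriori procedure, run on the five transactions above. The item encoding, variable names and the brute-force support counting are our own choices; an answer to a) should still show the candidate-generation and pruning steps by hand.

    from itertools import combinations

    # Transactions from Question 2, encoded as sets of single-letter items.
    transactions = [
        frozenset("AB"), frozenset("ABD"), frozenset("BDE"),
        frozenset("BCDE"), frozenset("ABC"),
    ]
    minsup = 2

    def support(itemset):
        """Number of transactions containing the itemset."""
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    frequent = {}
    level = []
    for i in sorted({i for t in transactions for i in t}):
        s = support(frozenset([i]))
        if s >= minsup:
            frequent[frozenset([i])] = s
            level.append(frozenset([i]))

    k = 2
    while level:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        level = []
        for c in candidates:
            s = support(c)
            if s >= minsup:
                frequent[c] = s
                level.append(c)
        k += 1

    for itemset, s in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)

Running the sketch prints every frequent itemset with its support, which can be checked against the hand computation.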

Question 3 Undirected Graphical Models (15 points)


Let M be an (undirected) graphical model on discrete variables (X1, X2, X3, X4) with independence graph:

[Figure: the independence graph of M on the nodes 1, 2, 3, 4; its edge structure is not legible in this transcription.]

a) Express the independence properties of M in words.


b) Give the margin constraints satisfied by the maximum likelihood estimates of M.
c) Use the constraints of b) together with the conditional independence properties of M to find an expression for the fitted counts n̂(x1, x2, x3, x4) in terms of margins of the observed counts n(x1, x2, x3, x4).
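
Purely as an illustration of the form of answer asked for in c) — the graph below is an assumed example, not necessarily the exam's graph: if the independence graph were the chain 1 − 2 − 3 − 4, the model would be decomposable and the maximum likelihood fitted counts would be the product of the clique margins divided by the separator margins,

    \hat{n}(x_1, x_2, x_3, x_4) = \frac{n(x_1, x_2)\, n(x_2, x_3)\, n(x_3, x_4)}{n(x_2)\, n(x_3)}.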

Question 4 Bayesian Network Classifiers (20 points)


The Naive Bayes classifier makes the fundamental assumption that the attributes are independent given the class label.
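As a reminder (a standard identity, not stated on the exam sheet): for attributes x_1, ..., x_p and class label c, this assumption gives

    P(c \mid x_1, \dots, x_p) \;\propto\; P(c) \prod_{i=1}^{p} P(x_i \mid c),

and an object is assigned to the class that maximizes the right-hand side.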
a) Explain why the Naive Bayes Classifier often performs quite well (in terms
of the error-rate on a test sample), even when its independence assumption
is not satisfied.
To relax the independence assumption, one can allow some (restricted) dependencies between the attributes. A well-studied example is the class of Tree-Augmented Naive Bayes (TAN) classifiers.
b) From a computational viewpoint, what is the most important advantage
of restricting the structure on the attributes to trees?

c) Explain why a TAN classifier has the same independence properties as the undirected graph obtained by simply dropping the direction of all its edges.
d) In the article by Friedman et al., many different methods for using a Bayesian network for classification are studied. One of them is to use a standard structure-learning algorithm with BIC (MDL) as the score function, and then to use the Markov blanket of the class variable in the resulting network for classification. It is shown that for datasets with many attributes, this method tends to produce poor results. Explain why.

Question 5 Clustering (15 points)


We are given the following data on 4 objects:
object   x1   x2
  1       2    2
  2       8    6
  3       6    8
  4       2    4

a) Cluster this data into two clusters, using the k-means algorithm. To initialize the algorithm, put objects 1 and 3 in one cluster, and objects 2 and 4 in the other cluster. Show the steps of the algorithm clearly. Give the value of the k-means error function after convergence. (An illustrative sketch of this procedure is given after this question.)
b) What is the value of the error function in the optimal solution for k = 4?
c) The k-means algorithm can be viewed as a special case of model based
clustering with normal components. Which constraints have to be imposed
on the cluster covariance matrices to get a distance measure that is similar
to the one used by k-means?
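
The sketch below illustrates part a) with a minimal NumPy implementation of k-means on the four objects, using the prescribed initial partition. It assumes the error function is the within-cluster sum of squared Euclidean distances to the cluster centroids; variable names and the convergence test are our own choices, not part of the exam.

    import numpy as np

    # Data from the table above (objects 1-4, columns x1 and x2).
    X = np.array([[2.0, 2.0], [8.0, 6.0], [6.0, 8.0], [2.0, 4.0]])

    # Initial partition: objects 1 and 3 in cluster 0, objects 2 and 4 in cluster 1.
    labels = np.array([0, 1, 0, 1])

    for _ in range(100):
        # Update step: each centroid is the mean of the objects in its cluster.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
        # Assignment step: move each object to its nearest centroid
        # (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

    # k-means error: within-cluster sum of squared distances to the centroids.
    error = sum(((X[labels == k] - centroids[k]) ** 2).sum() for k in range(2))
    print(labels, centroids, error, sep="\n")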

Question 6 Classification Trees (14 points)


In learning classification trees, determination of the appropriate size of the tree is an important problem. Most algorithms use a training sample to construct an oversized tree that is subsequently pruned back to the right size. CART uses cost-complexity pruning to do this. Cost-complexity pruning uses a complexity parameter denoted by α. The tree given below, denoted by Tmax, has been constructed on the training sample:


t1: 60 | 40
    t2: 30 | 10
        t4: 30 | 5    (leaf)
        t5: 0 | 5     (leaf)
    t3: 30 | 30
        t6: 0 | 20    (leaf)
        t7: 30 | 10
            t8: 25 | 5    (leaf)
            t9: 5 | 5     (leaf)

In each node, the number of observations with class 0 is given to the left of the bar, and the number of observations with class 1 to the right. Child nodes are shown indented under their parent; the leaf nodes (t4, t5, t6, t8 and t9) are marked (leaf).
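
As a reminder of the standard CART definitions behind the complexity parameter α (they are not restated on the exam sheet): for a subtree T of Tmax with leaf set \tilde{T}, cost-complexity pruning selects, for each α ≥ 0, the subtree that minimizes

    R_\alpha(T) = R(T) + \alpha \, |\tilde{T}|,

where R(T) is the resubstitution error of T. The pruning sequence asked for in b) is obtained by repeatedly collapsing the internal node t with the smallest weakest-link value g(t) = (R(t) − R(T_t)) / (|\tilde{T}_t| − 1), where T_t denotes the branch of T rooted at t.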
a) Compute the impurity of nodes t1, t2 and t3 according to the resubstitution error. Give the impurity reduction achieved by the first split.
b) Compute the sequence T1 > T2 > . . . > {t1}, where T1 is the smallest minimizing subtree of Tmax for α = 0. For each tree in the sequence, give the interval of α values for which it is the smallest minimizing subtree of Tmax.
