
GIFT University Gujranwala
BSCS Honors Program
Course: Data Mining (CS-402)
Resource Person: Nadeem Qaisar Mehmood

(Fall 2012) MIDTERM EXAMINATION
Date: 14-December-2012
Time Allowed: 70 Minutes
Max. Marks: 50

Question 01: Short Questions (Marks 5x5)

1. In the context of rule-based classification, discuss direct and indirect methods of building rules. Which method gives you robust and exhaustive rules, and why?
2. Discuss the hypothesis space of decision-tree learning based on the ID3 algorithm.
3. Which proximity measure would you prefer to use if you are finding distances between class instances but the record data matrix to be mined for clusters is very sparse?
4. Find the following proximity measures for the two binary vectors p and q: Jaccard, Cosine, Hamming (a computational sketch follows this list).
   p = (0 1 1 0 0 0 0 0 0 1), q = (0 1 0 0 0 0 1 0 0 1)
5. In rule-based classification, what are the three different methods to evaluate a rule?
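For illustration, a minimal Python sketch of the three measures in item 4, applied to the vectors p and q given above (the helper names jaccard, cosine, and hamming are illustrative, not prescribed by the question):

```python
from math import sqrt

p = (0, 1, 1, 0, 0, 0, 0, 0, 0, 1)
q = (0, 1, 0, 0, 0, 0, 1, 0, 0, 1)

def jaccard(a, b):
    """|a AND b| / |a OR b| for binary vectors (0-0 matches are ignored)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

print(jaccard(p, q), cosine(p, q), hamming(p, q))  # 0.5, ~0.667, 2
```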
Question 02: (Marks 15)

AutoRash Traders has a broad automobile business with millions of customers. To market their new product in Texas City, they plan to target only a limited number of likely customers and advertise to them directly. To achieve this goal, they have gathered the following dataset about previous customers. Looking at the dataset, suggest a predictive modelling technique that AutoRash Traders could use to predict whether a new customer will purchase a car or not.
Serial No | Marital Status | Home Owner | Annual Income | Bought Car
    1     | Single         | Y          | Normal        | Yes
    2     | Married        | Y          | Normal        | Yes
    3     | Divorced       | N          | Low           | No
    4     | Divorced       | Y          | High          | Yes
    5     | Single         | N          | Low           | No
    6     | Single         | N          | Normal        | No

Some Entropy Measurements:
[2+, 2-] = 1   [2+, 0-] = 0   [2+, 3-] = 0.971   [3+, 2-] = 0.971   [3+, 1-] = 0.8115   [3+, 2-] = 0.971   [5+, 1-] = 0.650
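For illustration, a minimal Python sketch of how entropy measurements like those above feed into choosing a decision-tree split on this dataset (the record encoding and function names are illustrative assumptions, not part of the question):

```python
from collections import Counter
from math import log2

# (Marital Status, Home Owner, Annual Income, Bought Car) for serials 1-6
records = [
    ("Single",   "Y", "Normal", "Yes"),
    ("Married",  "Y", "Normal", "Yes"),
    ("Divorced", "N", "Low",    "No"),
    ("Divorced", "Y", "High",   "Yes"),
    ("Single",   "N", "Low",    "No"),
    ("Single",   "N", "Normal", "No"),
]
ATTRS = {"Marital Status": 0, "Home Owner": 1, "Annual Income": 2}
LABEL = 3

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i) over the class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(records, attr_idx):
    """Gain = E(S) - sum_i (n_i / n) * E(partition_i)."""
    labels = [r[LABEL] for r in records]
    gain = entropy(labels)
    for value in set(r[attr_idx] for r in records):
        subset = [r[LABEL] for r in records if r[attr_idx] == value]
        gain -= (len(subset) / len(records)) * entropy(subset)
    return gain

for name, idx in ATTRS.items():
    print(f"{name}: gain = {info_gain(records, idx):.4f}")
# Home Owner splits the six records into pure partitions, so it yields the
# largest information gain and would be chosen as the root split.
```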

Question 03: (Marks 15)

Consider a training set that contains 100 positive examples and 400 negative examples. For each of the following candidate rules:
R1: A → + (covers 4 positive and 1 negative examples)
R2: B → + (covers 30 positive and 10 negative examples)
R3: C → + (covers 100 positive and 90 negative examples)

The rule accuracies are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore, according to rule accuracy, R1 is the best candidate and R3 is the worst. However, this is not always the obvious conclusion. Why? Explain the situation with the help of FOIL's information gain.
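For illustration, a minimal Python sketch applying FOIL's information gain to the three candidate rules, using the coverage figures stated in the question (the function name foil_gain is an illustrative assumption):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL's gain = p1 * (log2(p1/(p1+n1)) - log2(p0/(p0+n0)))."""
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

p0, n0 = 100, 400  # positives / negatives in the full training set
for name, p1, n1 in [("R1", 4, 1), ("R2", 30, 10), ("R3", 100, 90)]:
    acc = p1 / (p1 + n1)
    print(f"{name}: accuracy = {acc:.1%}, FOIL's gain = {foil_gain(p0, n0, p1, n1):.1f}")
# Unlike plain accuracy, FOIL's gain also rewards coverage, so the ranking it
# produces can differ from the ranking by accuracy alone.
```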
Entropy: $E(S) = -(P_+)\log_2(P_+) - (P_-)\log_2(P_-)$

Gini: $Gini(t) = 1 - \sum_{i=0}^{c-1} \left[ p(i \mid t) \right]^2$

Gain: $Gain = Entropy(S) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$

FOIL's Information Gain: $p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)$

END OF EXAM

