CS-402
14-December-2012
Max. Marks: 50
Question 01: Short Questions
1. In the context of rule-based classification, discuss the direct and indirect methods of building rules. Which method produces robust and exhaustive rules, and why?
2. Discuss the hypothesis space of ID3-based decision tree learning.
3. Which proximity measure would you prefer for computing distances between class instances when the record data matrix to be clustered is very sparse?
4. Compute the following proximity measures for the two binary vectors p and q: Jaccard, Cosine, Hamming.
   p = (0 1 1 0 0 0 0 0 0 1), q = (0 1 0 0 0 0 1 0 0 1)
5. In rule-based classification, what are the three different methods for evaluating a rule?
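For reference, the three measures asked for in short question 4 can be checked with a minimal Python sketch. The function names are illustrative and not part of the exam. Hamming is taken here as the count of differing positions, and Jaccard as the number of 1-1 matches divided by the number of positions that are not 0-0; returning 0.0 for two all-zero vectors is an assumed convention.

```python
import math

# The two binary vectors from short question 4.
p = [0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
q = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]

def jaccard(a, b):
    """1-1 matches divided by the positions where at least one vector is 1."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    not_both_zero = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumed convention: similarity of two all-zero vectors is 0.0.
    return f11 / not_both_zero if not_both_zero else 0.0

def cosine(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

print(jaccard(p, q), cosine(p, q), hamming(p, q))
```

Here p and q share two 1s and differ in two positions, so Jaccard is 2/4, cosine is 2/3, and the Hamming distance is 2.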
Question 02: (Marks 15)
AutoRash Traders runs a large automobile business and has millions of customers. To market their new product in Texas City, they plan to target only a limited number of likely customers and advertise to them directly. To achieve this goal, they have collected data from previous advertising, shown in the dataset below. Looking at this dataset, which predictive modeling technique would you suggest to AutoRash Traders so that they can predict whether a new customer will purchase a car or not?
Serial No    1  2  3  4  5  6
Home Owner   Y  Y  N  Y  N  N

Some Entropy Measurements:
[2+, 2-] = 1
[2+, 0-] = 0
[2+, 3-] = 0.971
[3+, 2-] = 0.971
[3+, 1-] = 0.8115
[5+, 1-] = 0.650
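The tabulated entropy values can be reproduced with a short Python sketch of the two-class entropy formula. The function name `entropy` and its signature are illustrative, not taken from the exam; a term with zero count is treated as contributing 0, the usual convention.

```python
import math

def entropy(pos, neg):
    """Two-class entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:  # a class with zero examples contributes nothing
            frac = count / total
            result -= frac * math.log2(frac)
    return result

# Class distributions listed in the entropy measurements above.
for pos, neg in [(2, 2), (2, 0), (2, 3), (3, 2), (3, 1), (5, 1)]:
    print(f"[{pos}+, {neg}-] = {entropy(pos, neg):.4f}")
```

Computed this way, [3+, 1-] comes out at about 0.8113, marginally below the tabulated 0.8115; the other values agree with the list to three decimal places.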
Question 03: (Marks 15)
Consider a training set that contains 100 positive examples and 400 negative examples, and the following candidate rules:
R1: A → + (covers 4 positive and 1 negative examples),
R2: B → + (covers 30 positive and 10 negative examples),
R3: C → + (covers 100 positive and 90 negative examples).
The rule accuracies are 80% (for R1), 75% (for R2), and 52.6% (for R3), respectively. Therefore, R1 is the best candidate and R3 is the worst candidate according to rule accuracy. However, this is not always the obvious choice. Why? Explain the situation with the help of FOIL's information gain.
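The comparison can be made concrete with a small Python sketch of FOIL's information gain, p1 × (log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0))). It assumes that the starting rule is the empty rule covering the whole training set, so p0 = 100 and n0 = 400; the names `foil_gain`, `rules`, and `gains` are illustrative.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain for a rule covering p1 positive and n1 negative
    examples, relative to a starting rule covering p0 positive and n0 negative."""
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Assumption: the starting rule is the empty rule, which covers all examples.
P0, N0 = 100, 400

# (positive, negative) coverage of each candidate rule from the question.
rules = {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}

gains = {name: foil_gain(P0, N0, p1, n1) for name, (p1, n1) in rules.items()}
for name, gain in gains.items():
    print(f"{name}: FOIL gain = {gain:.1f}")
```

Under this assumption the gains are roughly 8.0 for R1, 57.2 for R2, and 139.6 for R3, so the ranking is the reverse of the one given by rule accuracy: the measure rewards R3 for covering many positive examples, whereas R1's high accuracy rests on only five covered examples.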
Entropy = E(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}

Gini(t) = 1 - \sum_{i=0}^{c-1} [p(i \mid t)]^2

GAIN = Entropy(S) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

FOIL's Information Gain = p_1 \left( \log_2 \frac{p_1}{p_1 + n_1} - \log_2 \frac{p_0}{p_0 + n_0} \right)

END OF EXAM