
CSC 5800: Intelligent Systems Homework 1

Due Date: September 24th, 2012


Total: 100 Points

Problem 1. Discuss whether or not each of the following activities is a data mining task. If the answer is yes, also specify which of the following categories it belongs to: (i) classification, (ii) association analysis, (iii) clustering, or (iv) anomaly detection. (20 Points; 2.5 Points each)

(a) Sorting a student database based on student identification numbers.
(b) By looking at a CT scan, a doctor wants to identify whether or not a patient has cancer. There are many labeled CT scans that the doctor will use for making the decision.
(c) An image analyst obtains some new images and wants to automatically detect the number of distinct objects in each image. He doesn't have any prior information about these objects.
(d) Predicting the outcomes of tossing a (fair) pair of dice.
(e) Finding the items that are bought together in a Walmart store.
(f) In an Internet search engine company, there is a need to find potential users who will click a particular advertisement on the webpage.
(g) Monitoring the heart rate of a patient for abnormalities.
(h) Extracting the frequencies of a sound wave.

Problem 2. Classify the following attributes as binary, discrete, or continuous. Also, classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity. For example: Age in years. Answer: Discrete, quantitative, ratio. (10 Points; 2 Points each)

(a) Favorite movie of each person.
(b) Number of days since the start of the Fall 2012 semester.
(c) Category of a hurricane (the Hurricane Wind Scale ranges from category 1 to category 5).
(d) Number of students enrolled in a class.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.

Problem 3. This problem is a MATLAB exercise. (15 Points; 5 Points each)

(a) Load the iris.dat file (available at the course website). Give a basic description of the data matrix: number of data points, number of features, and number of classes.
(b) Give some basic statistics (such as mean, median, standard deviation, min, max) for each of these features.
(c) Plot the first two features of the data. Classes must be distinguished by using different symbols. Please label the figure.
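As an illustration of the kind of summary parts (a) and (b) ask for, here is a minimal sketch in Python rather than MATLAB. It does not read iris.dat: the rows below are made-up placeholder values, and the layout (feature columns followed by a class label in the last column) is an assumption about the file, not something the assignment states. The function name `describe` is hypothetical.

```python
import statistics

# Placeholder data, NOT taken from iris.dat.
# Assumed layout: feature values first, class label in the last column.
rows = [
    [1.0, 2.0, 0],
    [2.0, 4.0, 0],
    [3.0, 9.0, 1],
    [6.0, 5.0, 1],
]


def describe(rows):
    """Return (n_points, n_features, n_classes, per-feature statistics)."""
    features = [r[:-1] for r in rows]
    labels = [r[-1] for r in rows]
    columns = list(zip(*features))  # transpose rows into feature columns
    stats = [
        {
            "mean": statistics.mean(col),
            "median": statistics.median(col),
            "std": statistics.stdev(col),  # sample standard deviation (n - 1)
            "min": min(col),
            "max": max(col),
        }
        for col in columns
    ]
    return len(rows), len(columns), len(set(labels)), stats


n_points, n_features, n_classes, stats = describe(rows)
print(n_points, n_features, n_classes)
for i, s in enumerate(stats, start=1):
    print(f"feature {i}: {s}")
```

`statistics.stdev` divides by n - 1, which matches the default normalization of MATLAB's `std`; the population form would use `statistics.pstdev` instead.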

Problem 4. For the vector A = [14, -12, 30.2, 20.0, 56], give the transformed values after applying each of the following normalization methods. (12 Points; 4 Points each)

(a) z-score normalization
(b) Decimal scaling
(c) Min-max normalization to [0, 1]

Problem 5. Which similarity or distance measure is most effective for each of the domains given below? Explain your reasoning. (16 Points; 4 Points each)

(a) Which measure, Jaccard or Simple Matching Coefficient, is most appropriate for comparing how similar the answers provided by students in an exam are? Assume that the answers to all the questions in the exam are either True or False.
(b) Which measure, Jaccard or Simple Matching Coefficient, is most appropriate for comparing how similar the locations visited by tourists at an amusement park are? Assume the location information is stored as binary yes/no attributes (yes means a location was visited by the tourist and no means a location has not been visited).
(c) Which measure, Euclidean distance or cosine similarity, is most appropriate for comparing the coordinates of a moving object in a 2-dimensional space? For example, using GPS data, the object may be located at (31.4° West, 12.4° North) at time T1 and (29.4° West, 12.5° North) at another time T2. Note: we may use +/- to indicate East/West or North/South directions when computing the similarity or distance measures.
(d) Which measure, Euclidean distance or cosine similarity, is most appropriate for comparing the similarity of items bought by customers at a grocery store? Assume each customer is represented by a 0/1 binary vector of items (where a 1 means the customer had previously bought the item).

Problem 6. Solve the following two problems. (15 Points; 6 + 9 Points)

(a) Two vectors x and y have zero mean. What is the relationship between the cosine measure and the correlation between them?
(b) Derive the mathematical relationship between cosine similarity and Euclidean distance when each data object vector has an L2 length (magnitude) of 1. (NOTE: your final answer should be independent of the original vectors.)

Problem 7. A jogger completes one round on a (circular) athletic track of radius 1 mile. During this run, we would like to compute the minimum and maximum possible values for the following distance measures (from the center of the track): Manhattan, Euclidean, and Chebyshev distance. (12 Points)
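For reference, the three distance measures named in Problem 7 can be written out as a minimal Python sketch. It assumes the center of the track is the origin of a 2-dimensional coordinate system and that the jogger's position is parameterized by an angle; the function names and the `point_on_track` helper are hypothetical and only illustrate the definitions. The sketch evaluates a single position and does not compute the minima or maxima the problem asks for.

```python
import math


def manhattan(x, y):
    """L1 distance from the origin: |x| + |y|."""
    return abs(x) + abs(y)


def euclidean(x, y):
    """L2 distance from the origin: sqrt(x^2 + y^2)."""
    return math.hypot(x, y)


def chebyshev(x, y):
    """L-infinity distance from the origin: max(|x|, |y|)."""
    return max(abs(x), abs(y))


def point_on_track(theta, radius=1.0):
    """Position on a circle centered at the origin, at angle theta (radians)."""
    return radius * math.cos(theta), radius * math.sin(theta)


# One example position on the track, 30 degrees from the x-axis.
x, y = point_on_track(math.radians(30))
print(manhattan(x, y), euclidean(x, y), chebyshev(x, y))
```

Evaluating the three functions over a range of angles shows how each measure varies (or does not vary) as the jogger moves around the track.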
