
Monica Nusskern 1

CSIS 5420 – Data Mining


Week 1 Assignment
June 3, 2005

Total Score 103 out of 110

Score 10 out of 10

1. (Question #2, page 30) For each of the following problem scenarios,
decide if a solution would best be addressed with supervised learning,
unsupervised clustering, or database query. As appropriate, state any
initial hypothesis you would like to test. If you decide that supervised
learning or unsupervised clustering is the best answer, list several input
attributes you believe to be relevant for solving the problem.

a. What characteristics differentiate people who have had
back surgery and have returned to work from those who
have had back surgery and have not returned to their jobs?

The best solution for this would be supervised learning
because this scenario clearly offers an attribute whose value
represents a set of predefined output classes. An initial
hypothesis could be: if a person is over 40 years of age and
performs a job with a high percentage of physical labor, then they
are less likely to return to work than someone who is under 40 and
performs a low amount of physical labor in their job. Good

Input Attributes: Worker ID, Job Type, Age, Sex

b. A major automotive manufacturer recently initiated a tire
recall for one of their top-selling vehicles. The automotive
company blames the tires for the unusually high accident
rate seen with their top-seller. The company producing the
tires claims the high accident rate only occurs when their
tires are on the vehicle in question. Who is to blame?

The best solution would be supervised learning, and a decision
tree could be utilized to reach a conclusion. The initial hypothesis
could be: if the vehicle has an accident without these tires, then the
automotive manufacturer is to blame. The opposite could also be
asked: if different types of vehicles use the tires in question and
have a high accident rate, then the tire maker is to blame. Good

Input Attributes: Car ID, Questionable Tires on Vehicle, Type of Vehicle

c. When customers visit my web site, what products are they
most likely to buy together?

The best solution would be unsupervised clustering because it
is possible to use data mining with a company’s data to gain
insight into possible patterns in the database. An initial hypothesis
could be: if a customer purchases Product A, then Product B will
likely be purchased as well. Good

Input Attributes: Customer ID, Transaction Method, Item Purchased

d. What percent of my employees miss one or more days of
work per month?

A database query could be performed on a database with
human resource data stored there. Once the data for each
employee is extracted, simple calculations could be performed to
determine what percentage of employees miss one or more days of
work per month. Good

e. What relationships can I find between an individual's
height, weight, age, and favorite spectator sport?

Unsupervised clustering would be the best method for this
scenario because it is necessary to build models without predefined
classes. The initial hypothesis would be that a relationship
exists between individual demographics and favorite
spectator sport. It is quite possible that no valid relationship
exists. Good

Input Attributes: Person ID, Height, Weight, Age, Favorite Sport

Score 10 out of 10

2. (Question #3, page 30) Medical doctors are experts at disease diagnosis
and surgery. Explain how medical doctors use induction to help develop
their skills.

Induction-based learning is the process of forming a general concept
definition by observing specific examples of the concept to be learned. Induction
allows doctors to determine a diagnosis based on observations of particular
patterns and to establish the necessary treatment based on these recurring
patterns. Once these patterns are learned, doctors can use this
experience to treat similar patterns in their future patients. Very good

Score 10 out of 10

3. (Question #6, page 31) What happens when you try to build a decision
tree for the data in Table 1.1 without employing the attributes Swollen
Glands and Fever?

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold

Patients could be misdiagnosed without using the symptoms of swollen
glands and fever. The patient could be diagnosed with strep throat and actually
have a cold. This could lead to taking unnecessary medication. Good

Let's pick sore throat as the top-level node. The only possibilities are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. The no path shows instances
2, 5, 6, 7, and 9. The path for sore throat = yes has representatives from all three
classes, as does sore throat = no.

Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4
(strep throat).

Next, follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, so the tree is unable to differentiate them
further. Following headache = no, congestion does separate instance 3 (cold) from
instance 4 (strep throat), but the headache = yes branch still leaves three of the five
sore throat = yes instances unresolved. The problem repeats itself for the path
sore throat = no. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
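
The walk-through can be checked mechanically. The sketch below is only an illustration, not part of the assignment: it groups the ten instances of Table 1.1 by the three remaining attributes and reports which groups still mix diagnoses. The variable names are assumptions made for the example.

```python
from collections import Counter, defaultdict

# (patient_id, sore_throat, congestion, headache, diagnosis) from Table 1.1,
# with Fever and Swollen Glands deliberately left out.
rows = [
    (1, "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Cold"),
    (4, "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Cold"),
    (6, "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "No", "Strep throat"),
    (8, "Yes", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "Yes", "Cold"),
    (10, "Yes", "Yes", "Yes", "Cold"),
]

# Instances with identical attribute values end up in the same leaf of any tree.
groups = defaultdict(list)
for pid, sore, cong, head, diag in rows:
    groups[(sore, cong, head)].append((pid, diag))

# A group is ambiguous when it contains more than one diagnosis.
ambiguous_ids = sorted(
    pid
    for members in groups.values()
    if len({diag for _, diag in members}) > 1
    for pid, _ in members
)

# Best case: every leaf predicts its most frequent diagnosis.
best_correct = sum(
    Counter(diag for _, diag in members).most_common(1)[0][1]
    for members in groups.values()
)

print("Instances in mixed groups:", ambiguous_ids)
print("Best possible training accuracy:", best_correct / len(rows))
```

With the table as given, seven of the ten instances fall in groups that mix diagnoses, so no tree built from these three attributes can classify more than six of the training instances correctly.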

Score 10 out of 10

4. (Question #6, page 63) Suppose you have used data mining to develop
two alternative models designed to accept or reject home mortgage
applications. Both models show an 85% test set classification correctness.
The majority of errors made by model A are false accepts whereas the
majority of errors made by model B are false rejects. Which model should
you choose? Justify your answer.

The model that I would choose would be model B because false rejects
would cost the firm much less money than false accepts. If the majority of the
errors for model A were false accepts, that means that people who were not
qualified candidates for home loans would be accepted regardless. This could
be detrimental to the company, as these applicants would not pay their mortgage
bills, resulting in less income for the company while large expenses would be
incurred. OK, but consider this perspective: since a mortgage is secured
credit, is there much risk in false accepts?

Score 10 out of 10

5. (Question #7, page 63) Suppose you have used data mining to develop
two alternative models designed to decide whether or not to drill for oil.
Both models show an 85% test set classification correctness. The majority
of errors made by model A are false accepts whereas the majority of errors
made by model B are false rejects. Which model should you choose?
Justify your answer.

The model that I would choose would be model A because the company
could be missing out on large income by not drilling in a certain area when in
actuality, they should. If a company drilled for oil where none existed, the
result could be used as knowledge to refine the models that were developed.
However, while I chose model A for this question, I do see the benefits of utilizing
model B, such as avoiding the environmental impacts of drilling where no oil exists. OK,
but consider that if the cost of drilling for oil is very high, Model B is the best
choice.
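
The trade-off in questions 4 and 5 can be made concrete with a small cost calculation. This is a sketch under stated assumptions: the split of the 15 errors per 100 instances and the unit costs are hypothetical numbers chosen only to illustrate the reasoning, and `total_cost` is a helper invented for the example.

```python
def total_cost(false_accepts, false_rejects, cost_false_accept, cost_false_reject):
    """Total misclassification cost for one model under assumed unit costs."""
    return false_accepts * cost_false_accept + false_rejects * cost_false_reject


# Hypothetical error split per 100 test instances; both models are 85% correct.
model_a = {"false_accepts": 12, "false_rejects": 3}  # mostly false accepts
model_b = {"false_accepts": 3, "false_rejects": 12}  # mostly false rejects

# Hypothetical unit costs (false accept, false reject) for two situations.
scenarios = {
    "false accepts are expensive": (10, 1),
    "false rejects are expensive": (1, 10),
}

for name, (cost_fa, cost_fr) in scenarios.items():
    cost_a = total_cost(**model_a, cost_false_accept=cost_fa, cost_false_reject=cost_fr)
    cost_b = total_cost(**model_b, cost_false_accept=cost_fa, cost_false_reject=cost_fr)
    better = "A" if cost_a < cost_b else "B"
    print(f"{name}: model A costs {cost_a}, model B costs {cost_b}, prefer model {better}")
```

Equal accuracy does not settle the choice. Whichever error type is more expensive in the application decides which model is preferable.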

Score 10 out of 10

6. (Question #8, page 63) Explain how unsupervised clustering can be used
to evaluate the likely success of a supervised learner model.

Unsupervised clustering can be used to evaluate the likely success of a
supervised learner model by:

➢ Utilizing a confusion matrix to compute model accuracy by summing the
values found on the main diagonal and dividing this sum by the total number of
test set instances.
➢ Using two-class error analysis to denote false accepts and false rejects.
➢ Utilizing mean absolute error and mean squared error to evaluate supervised
models having numeric output.

OK, but let me suggest a simpler answer.

In a supervised learner model, we pre-determine which attributes will be
used to classify our data and what specific clusters we will accept. In other
words, we assume that a chosen set of attributes will classify our data
under a chosen output attribute.

If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.

Score 10 out of 10

7. (Question #9, page 63) Explain how supervised learning can be used to
help evaluate the results of an unsupervised clustering model.

Supervised learning can be used to help evaluate the results of an
unsupervised clustering model by the following technique:

➢ Perform an unsupervised clustering. Designate each cluster as a class and
assign each an arbitrary name such as C1, C2, and C3.
➢ Choose a random sample of instances from each of the classes produced by
the clustering. Each class should be represented in the random
sample in the same ratio as it is represented in the entire dataset. A good
sample is two-thirds of all instances.
➢ Build a supervised learner model with the class name as the output attribute
using the randomly sampled instances as training data. Employ the
remaining instances to test the supervised model for classification
correctness. Very good
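
The three steps can be sketched in plain Python. Everything below is an assumption for illustration: the 1-D data values are made up, the tiny k-means stands in for whatever clustering method is actually used, and a nearest-centroid classifier stands in for the supervised learner.

```python
import random
from collections import defaultdict


def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means (k >= 2); initial centroids are spread over the sorted data."""
    ordered = sorted(points)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Hypothetical, well-separated data.
points = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3,
          5.0, 5.2, 4.8, 5.1, 4.9, 5.3,
          9.0, 9.2, 8.8, 9.1, 8.9, 9.3]

# Step 1: cluster, then name each cluster C1, C2, C3.
labels = kmeans_1d(points, k=3)
by_cluster = defaultdict(list)
for p, lab in zip(points, labels):
    by_cluster[f"C{lab + 1}"].append(p)

# Step 2: sample two-thirds of each cluster for training (same ratio per class).
rng = random.Random(0)
train, test = [], []
for name, members in by_cluster.items():
    rng.shuffle(members)
    cut = round(len(members) * 2 / 3)
    train += [(p, name) for p in members[:cut]]
    test += [(p, name) for p in members[cut:]]

# Step 3: train a nearest-centroid classifier and test it on the held-out third.
per_class = defaultdict(list)
for p, name in train:
    per_class[name].append(p)
model = {name: sum(vals) / len(vals) for name, vals in per_class.items()}

correct = sum(min(model, key=lambda n: abs(p - model[n])) == name for p, name in test)
accuracy = correct / len(test)
print(f"Test set classification correctness: {accuracy:.0%}")
```

High test set correctness suggests the clusters are well defined. Low correctness suggests the clustering did not find structure that a supervised model can recover.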

Score 7 out of 10

8. (Computational Question #1, page 63) Consider the following three-class
confusion matrix. The matrix shows the classification results of a
supervised model that uses previous voting records to determine the
political party affiliation (Republican, Democrat, or Independent) of
members of the United States Senate.

          Computed Decision
          Rep    Dem    Ind
Rep       42     2      1
Dem       5      40     3
Ind       0      3      4

a. What percent of the instances were correctly classified?

86% Good

b. According to the confusion matrix, how many Democrats are in the
Senate? How many Republicans? How many Independents?

40 Democrats, 42 Republicans, 4 Independents

48 Democrats, 45 Republicans, 7 Independents.
Add across the rows. There are 100 total senators.

c. How many Republicans were classified as belonging to the
Democratic Party?

2 Republicans Good

d. How many Independents were classified as Republicans?

0 Independents Good
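
The arithmetic behind parts a and b can be reproduced with a short sketch. It assumes, as the correction to part b implies, that rows are the actual parties and columns are the computed decisions. The variable names are illustrative.

```python
labels = ["Rep", "Dem", "Ind"]

# Rows are actual parties, columns are the model's computed decisions.
matrix = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]

total = sum(sum(row) for row in matrix)

# Accuracy: sum of the main diagonal divided by the number of instances.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Actual class sizes come from adding across each row.
actual_counts = {lab: sum(row) for lab, row in zip(labels, matrix)}

# Column sums give how many senators the model assigned to each party.
computed_counts = {lab: sum(row[j] for row in matrix) for j, lab in enumerate(labels)}

print(f"Accuracy: {accuracy:.0%}")
print("Actual counts:", actual_counts)
print("Computed counts:", computed_counts)
```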

Score 7 out of 10

9. (Computational Question #2, page 64) Suppose we have two
classes each with 100 instances. The instances in one class
contain information about individuals who currently have
credit card insurance. The instances in the second class
include information about individuals who have at least one
credit card but are without credit card insurance. Use the
following to answer the questions below:

IF Life Insurance = Yes & Income > $50K

THEN Credit Card Insurance = Yes

Rule Accuracy = 80%

Rule Coverage = 40%

a. How many individuals represented by the
instances in the class of credit card insurance
holders have life insurance and make more
than $50,000 per year?

80 individuals

40 instances

b. How many instances representing individuals
who do not have credit card insurance have
life insurance and make more than $50,000 per
year?

80 individuals

10 instances

Score 10 out of 10

10. (Computational Question #3, page 64) Consider the
confusion matrices shown below.

a. Compute the lift for Model X.

Lift = 2.00785 → 2.008

b. Compute the lift for Model Y.

Lift = 2.25 Good

Model X    Computed Accept    Computed Reject
Accept     46                 54
Reject     2,245              7,655

Model Y    Computed Accept    Computed Reject
Accept     45                 55
Reject     1,955              7,945
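
Both lift values can be reproduced from the matrices. This is a minimal sketch that takes lift to be P(accept | sample) divided by P(accept | population), where the sample is the set of instances the model computes as accept. The function name and argument order are assumptions for the example.

```python
def lift(c11, c12, c21, c22):
    """Lift from a two-class confusion matrix (rows actual, columns computed)."""
    population = c11 + c12 + c21 + c22
    p_accept_in_sample = c11 / (c11 + c21)               # among computed accepts
    p_accept_in_population = (c11 + c12) / population    # among all instances
    return p_accept_in_sample / p_accept_in_population


lift_x = lift(46, 54, 2245, 7655)
lift_y = lift(45, 55, 1955, 7945)

print(f"Model X lift: {lift_x:.3f}")
print(f"Model Y lift: {lift_y:.3f}")
```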

Score 9 out of 10

11. (Computational Question #4, page 65) A certain mailing list
consists of P names. Suppose a model has been built to
determine a select group of individuals from the list who will
receive a special flyer. As a second option, the flyer can be
sent to all individuals on the list. Use the notation given in the
confusion matrix below to show that the lift for choosing the
model over sending out the flyer to the entire population can be computed
with the equation:

Send Flyer?    Computed Send    Computed Don't Send
Send           C11              C12
Don't Send     C21              C22

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?    Computed Send         Computed Don't Send
Send           c11                   c12                         Sum(Send)
Don't Send     c21                   c22                         Sum(Don't Send)
               Sum(Computed Send)    Sum(Computed Don't Send)    Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )

So Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / (C11 + C12 + C21 + C22) )

and we know that (C11 + C12 + C21 + C22) = the total number of names P.
Therefore, using substitution …

Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / P )

Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )

Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )
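
As a sanity check on the algebra, the closed form can be compared with the ratio-of-probabilities definition using exact fractions. The sketch below reuses the Model X and Model Y counts from question 10. The function names are illustrative.

```python
from fractions import Fraction


def lift_definition(c11, c12, c21, c22):
    """Lift as P(C11 | Sample) / P(C11 | Population)."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """Lift as (C11 * P) / ((C11 + C12) * (C11 + C21))."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


for name, counts in {"X": (46, 54, 2245, 7655), "Y": (45, 55, 1955, 7945)}.items():
    by_definition = lift_definition(*counts)
    by_formula = lift_closed_form(*counts)
    print(f"Model {name}: {by_definition} vs {by_formula}, equal: {by_definition == by_formula}")
```

Agreement on two matrices is not a proof, but it is a quick way to catch a slip in the substitution steps.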
