Monica Nusskern Week 1 Assignment

Monica Nusskern 1
CSIS 5420 – Data Mining

Week 1 Assignment
June 3, 2005
Total Score 103 out of 110
Score 10 out of 10
1. (Question #2, page 30) For each of the following problem scenarios,
decide if a solution would best be addressed with supervised learning,
unsupervised clustering, or database query. As appropriate, state any
initial hypothesis you would like to test. If you decide that supervised
learning or unsupervised clustering is the best answer, list several input
attributes you believe to be relevant for solving the problem.
a. What characteristics differentiate people who have had
back surgery and have returned to work from those who
have had back surgery and have not returned to their jobs?
The best solution for this would be supervised learning
because this scenario clearly offers an attribute whose value
represents a set of predefined output classes (returned to work
or did not return to work). The initial
hypothesis could be, if a person is over 40 years of age and
performs a job with a high percentage of physical labor, then they
are less likely to return to work than someone who is under 40 and
performs a low amount of physical labor with their job. Good
Input Attributes: Worker ID, Job Type, Age, Sex
b. A major automotive manufacturer recently initiated a tire
recall for one of their top-selling vehicles. The automotive
company blames the tires for the unusually high accident
rate seen with their top-seller. The company producing the
tires claims the high accident rate only occurs when their
tires are on the vehicle in question. Who is to blame?
The best solution would be supervised learning, and a decision
tree could be utilized to reach a conclusion. The initial hypothesis
could be, if the vehicle has an accident without these tires, then the
automotive manufacturer is to blame. The opposite could also be
asked: if different types of vehicles use the tires in question and
have a high accident rate, then the tire maker is to blame. Good
Input Attributes: Car ID, Questionable Tires on Vehicle, Type of
Vehicle
c. When customers visit my web site, what products are they
most likely to buy together?
The best solution would be unsupervised clustering because it
is possible to use data mining with a company's data to gain
insight into possible patterns in the database. An initial hypothesis
could be, if a customer purchases Product A, then Product B will
likely be purchased as well. Good
Input Attributes: Customer ID, Transaction Method, Item
Purchased
d. What percent of my employees miss one or more days of
work per month?
A database query could be performed on a database with
human resource data stored there. Once the data for each
employee is extracted, simple calculations could be performed to
determine what percentage of employees miss one or more days of
work per month. Good
e. What relationships can I find between an individual's
height, weight, age, and favorite spectator sport?
Unsupervised clustering would be the best method for this
scenario because it is necessary to build models without predefined
classes. The initial hypothesis would be that a relationship
exists between individual demographics and favorite
spectator sport. It is quite possible that no valid relationship
exists. Good
Input Attributes: Person ID, Height, Weight, Age, Favorite Sport
Score 10 out of 10
2. (Question #3, page 30) Medical doctors are experts at disease diagnosis
and surgery. Explain how medical doctors use induction to help develop
their skills.
Induction-based learning is the process of forming a general concept
definition by observing specific examples of the concept to be learned. Induction
allows doctors to determine a diagnosis based on observations of particular
patterns and to establish the necessary treatment based on these observations of
recurring patterns. Once these patterns are learned, doctors can use this
experience to treat similar patterns in their future patients. Very good
Score 10 out of 10
3. (Question #6, page 31) What happens when you try to build a decision
tree for the data in Table 1.1 without employing the attributes Swollen
Glands and Fever?
Table 1.1 Hypothetical Training Data for Disease Diagnosis
Patient ID  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1           Yes          Yes    Yes             Yes         Yes       Strep throat
2           No           No     No              Yes         Yes       Allergy
3           Yes          Yes    No              Yes         No        Cold
4           Yes          No     Yes             No          No        Strep throat
5           No           Yes    No              Yes         No        Cold
6           No           No     No              Yes         No        Allergy
7           No           No     Yes             No          No        Strep throat
8           Yes          No     No              Yes         Yes       Allergy
9           No           Yes    No              Yes         Yes       Cold
10          Yes          Yes    No              Yes         Yes       Cold
Patients could be misdiagnosed without using the symptoms of swollen
glands and fever. The patient could be diagnosed with strep throat and actually
have a cold. This could lead to taking unnecessary medication. Good
Let's pick sore throat as the top-level node. The only possibilities are yes and no.
Instances 1, 3, 4, 8, and 10 follow the yes path. The no path contains instances
2, 5, 6, 7, and 9. The path for sore throat = yes has representatives from all three classes, as
does sore throat = no.
Next we follow the sore throat = yes path and choose headache. We need only concern
ourselves with instances 1, 3, 4, 8, and 10. For headache = yes we have instances 1 (strep
throat), 8 (allergy), and 10 (cold). For headache = no we have instances 3 (cold) and 4 (strep
throat).
Next follow headache = yes and choose congestion, the only remaining attribute. All
three instances show congestion = yes, therefore the tree is unable to further differentiate
the three instances. Following headache = no, congestion does separate instance 3 (cold,
congestion = yes) from instance 4 (strep throat, congestion = no), but that does not help the
headache = yes branch. Therefore, the path following sore throat = yes cannot classify all
five instances correctly. The problem repeats itself for the path sore throat = no, where
instances 2 and 9, and likewise instances 5 and 6, share the same attribute values but carry
different diagnoses. In general, any top-level node choice
of sore throat, congestion, or headache gives a similar result.
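The walk-through above can be checked mechanically. The following is a minimal sketch, not part of the original assignment: it groups the ten instances of Table 1.1 by the three remaining attributes and reports every combination that carries more than one diagnosis. The function name `conflicting_groups` is a hypothetical helper introduced here for illustration.

```python
from collections import defaultdict

# Table 1.1 restricted to the attributes that remain after dropping
# Swollen Glands and Fever:
# (patient_id, sore_throat, congestion, headache, diagnosis)
ROWS = [
    (1, "Yes", "Yes", "Yes", "Strep throat"),
    (2, "No", "Yes", "Yes", "Allergy"),
    (3, "Yes", "Yes", "No", "Cold"),
    (4, "Yes", "No", "No", "Strep throat"),
    (5, "No", "Yes", "No", "Cold"),
    (6, "No", "Yes", "No", "Allergy"),
    (7, "No", "No", "No", "Strep throat"),
    (8, "Yes", "Yes", "Yes", "Allergy"),
    (9, "No", "Yes", "Yes", "Cold"),
    (10, "Yes", "Yes", "Yes", "Cold"),
]


def conflicting_groups(rows):
    """Return attribute combinations that map to more than one diagnosis."""
    groups = defaultdict(list)
    for pid, sore, congestion, headache, diagnosis in rows:
        groups[(sore, congestion, headache)].append((pid, diagnosis))
    # Keep only the groups whose members disagree on the diagnosis.
    return {
        key: members
        for key, members in groups.items()
        if len({diagnosis for _, diagnosis in members}) > 1
    }


conflicts = conflicting_groups(ROWS)
for key, members in sorted(conflicts.items()):
    print(key, members)
```

Any group printed here contains instances that no decision tree built on these three attributes can tell apart, regardless of which attribute is chosen as the top-level node.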
Score 10 out of 10
4. (Question #6, page 63) Suppose you have used data mining to develop
two alternative models designed to accept or reject home mortgage
applications. Both models show an 85% test set classification correctness.
The majority of errors made by model A are false accepts whereas the
majority of errors made by model B are false rejects. Which model should
you choose? Justify your answer.
The model that I would choose would be model B because false rejects
would cost the firm much less money than false accepts. If the majority of the
errors for model A were false accepts, that means that people who were not
qualified candidates for home loans would be accepted regardless. This could
be detrimental to the company, as these applicants would not pay their mortgage
bills, resulting in less income for the company while large expenses would be
incurred. OK, but consider this perspective: since a mortgage is secured
credit, is there much risk in false accepts?
Score 10 out of 10
5. (Question #7, page 63) Suppose you have used data mining to develop
two alternative models designed to decide whether or not to drill for oil.
Both models show an 85% test set classification correctness. The majority
of errors made by model A are false accepts whereas the majority of errors
made by model B are false rejects. Which model should you choose?
Justify your answer.
The model that I would choose would be model A because the company
could be missing out on large income by not drilling in a certain area when in
actuality, they should. If a company drilled for oil where none existed, this could
be used for future knowledge to apply to the models that were developed.
However, while I chose model A for this question, I do see the benefits of utilizing
model B, such as the environmental impacts of drilling where no oil exists. OK,
but consider if the cost of drilling for oil is very high, Model B is the best
choice.
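Both the mortgage and the oil-drilling questions turn on the same point: two models with equal accuracy can differ in total cost when false accepts and false rejects are priced differently. The sketch below illustrates this with entirely hypothetical numbers. The error counts (1,000 test instances, 150 errors per model) and the per-error costs are assumptions made for illustration only; neither appears in the assignment.

```python
def total_error_cost(false_accepts, false_rejects, cost_fa, cost_fr):
    """Total cost of a model's errors given a cost per error type."""
    return false_accepts * cost_fa + false_rejects * cost_fr


# Hypothetical error split for two models that are both 85% correct
# on an assumed test set of 1,000 instances (150 errors each).
MODEL_A = {"false_accepts": 100, "false_rejects": 50}   # mostly false accepts
MODEL_B = {"false_accepts": 50, "false_rejects": 100}   # mostly false rejects


def cheaper_model(cost_fa, cost_fr):
    """Return which model has the lower total error cost, or 'tie'."""
    cost_a = total_error_cost(cost_fa=cost_fa, cost_fr=cost_fr, **MODEL_A)
    cost_b = total_error_cost(cost_fa=cost_fa, cost_fr=cost_fr, **MODEL_B)
    if cost_a < cost_b:
        return "A"
    if cost_b < cost_a:
        return "B"
    return "tie"


# Assumed cost ratios, chosen only to show both outcomes.
print(cheaper_model(cost_fa=10, cost_fr=1))  # false accepts expensive
print(cheaper_model(cost_fa=1, cost_fr=10))  # false rejects expensive
```

Under these assumptions the preferred model flips as the cost ratio flips, which is why the justification for either question depends on how expensive each kind of error really is.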
Score 10 out of 10
6. (Question #8, page 63) Explain how unsupervised clustering can be used
to evaluate the likely success of a supervised learner model.
Unsupervised clustering can be used to evaluate the likely success of a
supervised learner model by:
• Utilizing a confusion matrix to compute model accuracy by summing the
values found on the main diagonal and dividing this sum by the total number of
test set instances.
• Using two-class error analysis to denote false accepts and false rejects.
• Utilizing mean absolute error and mean squared error to evaluate
supervised models having numeric output.
OK, but let me suggest a simpler answer.
In a supervised learner model, we pre-determine which attributes will be
used to classify our data and what specific clusters we will accept. In other
words, we assume that a chosen set of attributes will classify our data
under a chosen output attribute.
If our unsupervised learner determines that the same input attributes will
form clusters that differentiate the values of the output attribute, then the
complementary results verify the supervised learner assumptions.
Score 10 out of 10
7. (Question #9, page 63) Explain how supervised learning can be used to
help evaluate the results of an unsupervised clustering model.
Supervised learning can be used to help evaluate the results of an
unsupervised clustering model by the following technique:
• Perform an unsupervised clustering. Designate each cluster as a class and
assign each an arbitrary name such as C1, C2, and C3.
• Choose a random sample of instances from each of the classes as a result of
the instance clustering. Each class should be represented in the random
sample in the same ratio as it is represented in the entire dataset. A good
sample is two-thirds of all instances.
• Build a supervised learner model with the class name as the output attribute
using the randomly sampled instances as training data. Employ the
remaining instances to test the supervised model for classification
correctness. Very good
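The sampling step in the procedure above can be sketched in a few lines. This is a minimal illustration using only the standard library, assuming the clustering has already produced one label per instance; the function name `stratified_split`, the example cluster sizes, and the fixed seed are all hypothetical choices made here, and the supervised learner itself is left out.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split instance indices so each cluster keeps its share in both parts."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)

    train, test = [], []
    for label in sorted(by_cluster):
        members = by_cluster[label][:]
        rng.shuffle(members)
        # Take roughly two-thirds of every cluster for training.
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


# Hypothetical clustering result: 30 instances in C1, 15 in C2, 6 in C3.
cluster_labels = ["C1"] * 30 + ["C2"] * 15 + ["C3"] * 6
train_idx, test_idx = stratified_split(cluster_labels)
print(len(train_idx), len(test_idx))
```

The training indices would then feed a supervised learner with the cluster name as the output attribute, and the held-out indices would be used to measure classification correctness.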
Score 7 out of 10
8. (Computational Question #1, page 63) Consider the following three-class
confusion matrix. The matrix shows the classification results of a
supervised model that uses previous voting records to determine the
political party affiliation (Republican, Democrat, or Independent) of
members of the United States Senate.

       Computed Decision
       Rep   Dem   Ind
Rep     42     2     1
Dem      5    40     3
Ind      0     3     4

a. What percent of the instances were correctly
classified?
86% Good
b. According to the confusion matrix, how many
Democrats are in the Senate? How many
Republicans? How many Independents?
40 Democrats, 42 Republicans, 4 Independents
48 Democrats, 45 Republicans, 7 Independents.
Add across the rows. There are 100 total senators.
c. How many Republicans were classified as
belonging to the Democratic Party?
2 Republicans Good
d. How many Independents were classified as
Republicans?
0 Independents Good
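The answers to parts a and b follow directly from the matrix: accuracy is the main diagonal divided by the total, and the actual class sizes are the row sums. A minimal sketch, assuming rows are actual classes and columns are computed classes as in the matrix above (the function names are introduced here for illustration):

```python
LABELS = ["Rep", "Dem", "Ind"]

# Rows are actual classes, columns are computed classes.
MATRIX = [
    [42, 2, 1],
    [5, 40, 3],
    [0, 3, 4],
]


def accuracy(matrix):
    """Sum of the main diagonal divided by the total number of instances."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def actual_counts(matrix, labels):
    """Actual class sizes are obtained by adding across each row."""
    return {label: sum(row) for label, row in zip(labels, matrix)}


print(f"accuracy = {accuracy(MATRIX):.0%}")
print(actual_counts(MATRIX, LABELS))
```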
Score 7 out of 10
9. (Computational Question #2, page 64) Suppose we have two
classes each with 100 instances. The instances in one class
contain information about individuals who currently have
credit card insurance. The instances in the second class
include information about individuals who have at least one
credit card but are without credit card insurance. Use the
following to answer the questions below:
IF Life Insurance = Yes & Income > $50K
THEN Credit Card Insurance = Yes
Rule Accuracy = 80%
Rule Coverage = 40%
a. How many individuals represented by the
instances in the class of credit card insurance
holders have life insurance and make more
than $50,000 per year?
80 individuals 40 instances
b. How many instances representing individuals
who do not have credit card insurance have
life insurance and make more than $50,000 per
year?
80 individuals 10 instances
Score 10 out of 10
10. (Computational Question #3, page 64) Consider the
confusion matrices shown below.

Model X   Computed Accept   Computed Reject
Accept    46                54
Reject    2,245             7,655

Model Y   Computed Accept   Computed Reject
Accept    45                55
Reject    1,955             7,945

a. Compute the lift for Model X.
Lift = 2.00785 ≈ 2.008
b. Compute the lift for Model Y.
Lift = 2.25 Good
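Both lift values can be recomputed from the matrices. This is a minimal sketch that takes lift as the acceptance rate among the instances the model selects divided by the acceptance rate in the whole population; the cell names c11, c12, c21, c22 follow the notation used in the next question, and the function name `lift` is chosen here for illustration.

```python
def lift(c11, c12, c21, c22):
    """Lift = P(accept | model selects) / P(accept | whole population)."""
    population = c11 + c12 + c21 + c22
    p_sample = c11 / (c11 + c21)          # actual accepts among computed accepts
    p_population = (c11 + c12) / population
    return p_sample / p_population


model_x = lift(46, 54, 2245, 7655)
model_y = lift(45, 55, 1955, 7945)
print(f"Model X lift = {model_x:.3f}")
print(f"Model Y lift = {model_y:.3f}")
```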
Score 9 out of 10
11. (Computational Question #4, page 65) A certain mailing list
consists of P names. Suppose a model has been built to
determine a select group of individuals from the list who will
receive a special flyer. As a second option, the flyer can be
sent to all individuals on the list. Use the notation given in the
confusion matrix below to show that the lift for choosing the
model over sending out the flyer to the entire population can be computed
with the equation:

Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )

Send Flyer?   Computed Send   Computed Don't Send
Send          C11             C12
Don't Send    C21             C22

Lift = P(C11 | Sample) / P(C11 | Population)

Send Flyer?   Computed Send        Computed Don't Send
Send          c11                  c12                        Sum(Send)
Don't Send    c21                  c22                        Sum(Don't Send)
              Sum(Computed Send)   Sum(Computed Don't Send)   Sum(Total)

Lift = ( c11 / Sum(Computed Send) ) / ( Sum(Send) / Sum(Total) )

So Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / (C11 + C12 + C21 + C22) )
and we know that (C11 + C12 + C21 + C22) = the total number of names P.
Therefore, using substitution …
Lift = ( C11 / (C11 + C21) ) / ( (C11 + C12) / P )
Lift = ( C11 / (C11 + C21) ) * ( P / (C11 + C12) )
Lift = ( C11 * P ) / ( (C11 + C12) * (C11 + C21) )
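As a numerical sanity check on the algebra, the closed form can be compared with the definition using exact rational arithmetic. The sketch below reuses the Model X and Model Y cell counts from Question 10 as test data; the two function names are introduced here for illustration only.

```python
from fractions import Fraction


def lift_by_definition(c11, c12, c21, c22):
    """P(C11 | Sample) / P(C11 | Population), computed exactly."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11, c11 + c21) / Fraction(c11 + c12, p)


def lift_closed_form(c11, c12, c21, c22):
    """(C11 * P) / ((C11 + C12) * (C11 + C21)), computed exactly."""
    p = c11 + c12 + c21 + c22
    return Fraction(c11 * p, (c11 + c12) * (c11 + c21))


# Cell counts taken from the Model X and Model Y matrices in Question 10.
for cells in [(46, 54, 2245, 7655), (45, 55, 1955, 7945)]:
    assert lift_by_definition(*cells) == lift_closed_form(*cells)
    print(cells, float(lift_closed_form(*cells)))
```

Agreement on two examples does not replace the derivation, but it does catch sign or index slips in the final formula.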
