Final Project
12/10/14
Group 2:
Brian Leap
Matthew Murphy
Troy McCrum
Introduction
When people receive coupons in the mail or see advertisements in their favorite magazines,
they often interpret such events as random or maybe they think that companies are just trying to cast a
wide net to drum up business. In some cases, these things could be true; however, randomness usually
has nothing to do with it at all. What prospective customers may not know, or perhaps do not want to
know, is that companies know a lot more about them than they think. And random appearances of
coupons in the mailbox are in fact conscious, targeted efforts from marketing teams to persuade a
certain type of consumer to purchase a certain type of product.
So how do companies know so much about their customers? Any time a customer uses a credit
card, completes a survey, uses a coupon, subscribes to a magazine, etc., information about that
customer is being recorded and saved into a database. All sorts of customer information are valuable to
a company. A customer's transactions, race, income, family size, pets, age, and occupation are all
examples of useful characteristics that companies can utilize in designing a marketing strategy. The
more information a company can gain about a customer, the better that company can market to that
customer. Additionally, as companies accumulate more information about more customers, insights
can be drawn about customers as a group rather than individually. In other words, companies can assign
individuals to a group and then simply market to that group, instead of marketing from customer to
customer. This process of grouping similar customers together is known as clustering. Clustering is an
essential technique because the time and resources spent in marketing to each customer individually
would far outweigh any sort of potential monetary gains.
As companies accrue more and more customers, the idea of poring over customer data
by hand becomes impractical, and even somewhat laughable. The solution to this problem is a database.
A database of customer information to a marketing team is similar to the Periodic Table of Elements to a
chemist. Both a database and the periodic table are simply collections of information that are organized
in a certain way. To the average person, they are nothing more than that. However, for someone with
the necessary knowledge and tools, the information can be manipulated to create exciting new ideas.
With the help of analytics, marketing teams are able to query, or intelligently extract, data contained in
a database in order to make inferences about customers. Extracted data can then be converted into
tables, charts, graphs, or other types of graphical visualizations to be used in customer and marketing
analysis. Without a well-organized database, a technique like clustering would be impossible. Therefore,
using analytics to gain insight into customer and market behavior all starts with a good database.
Problem Description
The problem presented in this project is that a supermarket has an enormous amount of
transaction and demographic data pertaining to its customers, but the marketing team needs help in
finding ways to use this data to develop marketing and advertising strategies.
The project is divided into two parts:
The first part of the project deals with predictive analytics. This requires substantial use of
Microsoft Excel. Using Excel the team must calculate moving averages and exponential smoothing,
construct time series graphs, utilize regression lines, use k-means clustering to group similar customers
together, and use demographics data to form insights about the customer base. Using this information,
sales quantities of top-selling items must be predicted and methods of promoting and advertising
products must be developed.
The second part of the project deals primarily with databases. Part 2 requires the team to use
Microsoft Access to create database tables for analysis and perform queries to find interesting trends in
the data. Also, additional graphical visuals must be created outside of Access or Excel by using R. Again,
the team must make inferences and marketing recommendations based on the analysis.
Project Objectives
Part 1 (Excel)
o Calculate prediction error, find:
Mean Squared Error,
Mean Absolute Deviation,
Tracking Signal.
o Construct time series graphs
o Use OLS Regression to predict future sales
o Use k-means clustering to:
Group customers with similar buying habits,
Use analytics to find demographic commonalities
within those groups,
Make marketing recommendations based on analysis.
o Use demographic data to derive insights into the customer base
Part 2 (Access)
o Create database tables for analysis
o Identify relationships between tables
o Perform queries on various topics to gain insights
o Create advanced graphical visuals to display data
Part 1: Analytics
Moving Average and Exponential Smoothing
1)
Item Type 3, Item Type 17
4 period moving average, 2 period weighted moving average, Exponential Smoothing
868.05, 801.37, 783.55
25.18, 23.29, 25.35
Figure 1: Prediction error for weeks 10 through 20 for item types 3 and 17. The graph shows
the Mean Squared Error, Mean Absolute Deviation, and Tracking Signal.
Regression
2)
Figure 2: Time series for the sales of snacks (Item 17) over the 104 week period
Also included are the trend line, regression equation, and R2 value
5.58
4.59
6.63
Figure 3: Time series for the sales of eggs (Item 12) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 4: Time series for the sales of butter (Item 3) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 5: Time series for the sales of cookies (Item 8) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 6: Time series for the sales of cereal (Item 5) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 7: Time series for the sales of BBQ (Item 2) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 8: Time series for the sales of cat food (Item 4) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 9: Time series for the sales of ice cream (Item 13) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 10: Time series for the sales of crackers (Item 9) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Figure 11: Time series for the sales of hot dogs (Item 11) over the 104 week period
Also included are the trend line, regression equation, and R2 value
Table 1: Regression Equations and Coefficients of Determination (R²) for the Ten Highest-Selling Items
Item Type   Item Description   Regression Equation      R²
17          Snacks             y = -0.0593x + 84.136    0.006
12          Eggs               y = -0.0168x + 72.545    0.002
3           Butter             y = -0.1308x + 72.062    0.0133
8           Cookies            y = -0.3014x + 69.736    0.2358
5           Cereal             y = -0.1813x + 61.246    0.1156
2           BBQ                y = -0.1055x + 47.305    0.0646
4           Cat Food           y = -0.0358x + 42.883    0.0039
13          Ice Cream          y = -0.1145x + 42.911    0.0531
9           Crackers           y = -0.1825x + 44.876    0.2043
11          Hot Dogs           y = -0.0429x + 26.906    0.0092
3)
Table 2: Regression Equations for Top 3 Selling Items Using Only First 60 Weeks of Data
Item Type   Item Description   Regression Equation      R²
17          Snacks             y = 0.0423x + 85.113     0.0011
12          Eggs               y = -0.1012x + 76.178    0.0027
3           Butter             y = -0.4956x + 81.59     0.048
Figure 12: Time Series for the Sales of Snacks (Item 17) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks
Figure 13: Time Series for the Sales of Eggs (Item 12) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks
Figure 14: Time Series for the Sales of Butter (Item 3) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks
4)
Table 3: Projected Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80 using
Regression Equations obtained from first 60 Weeks of Data
Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     59                 104              38
62     44                 36               24
63     49                 36               41
64     71                 39               37
65     75                 47               35
66     64                 46               41
67     52                 34               41
68     54                 45               79
69     47                 38               40
70     72                 58               24
71     53                 43               39
72     46                 71               74
73     55                 95               49
74     48                 27               53
75     77                 50               128
76     91                 63               68
77     71                 147              71
78     43                 61               43
79     62                 62               82
80     52                 94               167
Table 4: Actual Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80
Clustering
5)
Figure 15: Graphical Representation of K-Means Clustering using K=3 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 16: Graphical Representation of K-Means Clustering using K=4 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 17: Graphical Representation of K-Means Clustering using K=5 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 18: Graphical Representation of K-Means Clustering using K=6 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span
Figure 19: Graphical Representation of K-Means Clustering using K=3 for Units of Eggs
Purchased vs. Units of Cereal Purchased by Each Customer over the 104 Week Span
Figure 20: Graphical Representation of K-Means Clustering using K=3 for Units of Hotdogs
Purchased vs. Units of Pizza Purchased by Each Customer over the 104 Week Span
General
6)
Race (%): White, Black, Hispanic, Oriental
Customer Income (brackets: 0 - 10k, 10 - 11.9k, 12 - 14.9k, 15 - 19.9k, 20 - 24.9k, 25 - 34.9k, 35 - 44.9k, 45 - 54.9k, 55 - 64.9k, 65 - 74.9k, 75+k)
Figure 22: Chart representing total number of customers falling into each income bracket.
Legend: Male, Female
2)
Problem #2 provides some interesting insight into the buying patterns of certain items. With the help of
a pivot table, it was easy to condense the data to determine the ten most frequently purchased
items. After these ten items were identified, a time series graph was created for each
individual item. The last element added to each time series was the regression line (trend
line). The equation of this trend line allows the user to plug in any week number to get an
estimate of how many units of a particular item will be purchased during that week. To rate how
well this trend line fits the actual number of units sold, the coefficient of determination (R²) was
also computed.
No item by itself showed a particularly interesting trend; however, there were a couple of
interesting aspects that all ten plots had in common. The first noticeable common
occurrence was the sharp decrease in sales of each item during week 57. This could have been caused
by a variety of factors. For example, week 57 may have been a holiday week in which customers
spent their time at home with family instead of at the store. Another possibility is that the
store was closed for a few days that week, thus decreasing its total sales. No conclusive answer can be
obtained from the data given, but something likely happened during week 57 to cause this
decrease in sales.
Another interesting feature that all ten items had in common was a negatively sloping regression line.
This means that the sales of every one of these items are projected to
decrease continually as time goes on. Basically, this means that the store has a problem. If sales of each of its ten most
popular products continue to decrease as the regression model suggests, the store will
bring in less and less money. In the long run, if this trend goes on for long enough, the
projected decrease in sales could even lead to the store going out of business.
Finally, one last notable feature that all ten items shared was a very small R² value. The
coefficient of determination (R²) essentially shows how well the regression line fits the observed
data. This value falls between 0 and 1, with 0 meaning the line explains none of the variation in
the data and 1 representing a perfect fit. Out of all ten items, the largest
R² value was 0.2358. One of the values was even as low as 0.002. These extremely low
R² values indicate that the regression line is not very accurate in predicting the
actual number of units of an item sold given the week number.
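To make the role of the trend line and its R² value concrete, the following is a minimal sketch of how both could be computed outside of Excel. The weekly values are hypothetical and made up for illustration only; they are not the supermarket data, and the function names are our own.

```python
def fit_trend_line(weeks, units):
    """Ordinary least squares fit of units = slope * week + intercept."""
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(units) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, units))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(weeks, units, slope, intercept):
    """Share of the variation in units that the trend line explains."""
    mean_y = sum(units) / len(units)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(weeks, units))
    ss_tot = sum((y - mean_y) ** 2 for y in units)
    return 1 - ss_res / ss_tot


# Hypothetical weekly unit sales (NOT the project data).
weeks = [1, 2, 3, 4, 5, 6, 7, 8]
units = [80, 95, 60, 88, 72, 91, 65, 79]

slope, intercept = fit_trend_line(weeks, units)
r2 = r_squared(weeks, units, slope, intercept)
print(f"y = {slope:.4f}x + {intercept:.3f}, R^2 = {r2:.4f}")
```

When weekly sales bounce around as much as in this sample, most of the variation is left unexplained by the week number, which is the situation a low R² describes.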
3)
Problem #3 dives further into the concept of regression and using a regression line to predict future
response. In this problem, only the top three most frequently purchased items were considered. A
regression line was fitted using only the sales data from the first 60 weeks. This regression line
was then plotted on top of the time series plot of all
104 weeks of data for its corresponding item type. This created a visual that made it easy to see just
how accurate each individual regression line was in predicting future response.
After all of these steps were carried out, the most notable occurrence was the great change in the
equations for the regression lines from what they were when all 104 weeks of data were considered. The
initial equations were y = -0.0593x + 84.136 for item 17, y = -0.0168x + 72.545 for item 12, and
y = -0.1308x + 72.062 for item 3. However, once the parameters were changed to only include the first 60
weeks, the new regression equations were y = 0.0423x + 85.113 for item 17, y = -0.1012x + 76.178 for
item 12, and y = -0.4956x + 81.59 for item 3. One notable occurrence was the change in slope for item
17. When all 104 weeks of data were used, the slope of the regression line was negative. However,
when only the first 60 weeks of data were considered, the slope of the regression line became positive.
This means that sales are projected to gradually increase when only the first 60 weeks of data are used,
but when all 104 weeks are considered, sales of item 17 are projected to decrease as time goes on.
Another noticeable trend was the much steeper negative slope for items 12 and 3 when only the first 60 weeks
of data are considered. Projected sales decrease about six times as fast for item 12 and almost four times
as fast for item 3 when only the first 60 weeks are used in forming the regression line as opposed to all
104 weeks. However, it should be noted that in both of these cases, the intercept starts at a higher
value for the regression line from only the first 60 weeks, thus somewhat compensating for the
steeper negative slope. Overall, the great changes that take place in the equations for the
regression lines when more data is used indicate the lack of accuracy of the regression lines in
predicting the response (units sold).
4)
The goal of problem #4 is to predict the number of units that will sell each week, starting from
week 61 and stopping at week 80. These predictions were made using the regression lines found from
considering only the first 60 weeks of data. To obtain these predictions, the week number was
plugged into the regression equation as the x value. The predicted values and the actual values
can be seen in the Results Section.
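As an illustration of the plug-in step, here is a short sketch that evaluates the three 60-week regression equations (coefficients copied from Table 2 of the results) for weeks 61 through 80. The dictionary and function names are hypothetical, chosen only for this example.

```python
# Slope and intercept of each 60-week regression line, copied from Table 2.
REGRESSION_LINES = {
    17: (0.0423, 85.113),   # Snacks
    12: (-0.1012, 76.178),  # Eggs
    3: (-0.4956, 81.59),    # Butter
}


def project_units(item_type, week):
    """Plug the week number into the item's regression equation as x."""
    slope, intercept = REGRESSION_LINES[item_type]
    return slope * week + intercept


# One row per week: projected units for items 17, 12, and 3.
for week in range(61, 81):
    row = [f"{project_units(item, week):7.2f}" for item in (17, 12, 3)]
    print(week, *row)
```

Because each projection is a straight line in the week number, the projected values change by the same amount every week, which is worth keeping in mind when comparing them with the much noisier weekly sales.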
It is evident that the predictions obtained through regression are not very accurate for predicting future
sales. For item 17, the average sales from weeks 61 through 80 are significantly lower than the predicted
values indicate. The predicted values are also noticeably larger than the actual sales for item 12. There are a few instances
where actual units sold exceed the projections, but for the most part, the actual
sales are less than the predicted values. Like the other two items, the projections for item 3 are not
accurate either. The regression line indicates a quick decrease in units sold while in actuality, the
values seem to be gradually increasing.
Overall, after comparing the projections from the regression lines to the actual amount of each item
sold each week, it is clear that there is a lot of error in this prediction method. If the owner of the store
wants to better predict the number of items that will be sold in the future, forecasting methods such as
Moving Average or Exponential Smoothing may produce more accurate results.
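For comparison, the sketch below shows one way a moving average and exponential smoothing forecast could be scored with Mean Squared Error, Mean Absolute Deviation, and Tracking Signal. The sample series is the Item 17 column for weeks 61 through 80 from the results tables. The window of 4, the smoothing constant alpha = 0.3, the choice to seed the first smoothed forecast with the first observation, and all function names are assumptions made for illustration, not the settings used in the Excel workbook.

```python
def moving_average_forecasts(actuals, window):
    """Forecast for period t is the mean of the previous `window` actuals."""
    return {
        t: sum(actuals[t - window:t]) / window
        for t in range(window, len(actuals))
    }


def exponential_smoothing_forecasts(actuals, alpha):
    """F[t] = alpha * A[t-1] + (1 - alpha) * F[t-1], seeded with F[1] = A[0]."""
    forecasts = {1: actuals[0]}
    for t in range(2, len(actuals)):
        forecasts[t] = alpha * actuals[t - 1] + (1 - alpha) * forecasts[t - 1]
    return forecasts


def error_metrics(actuals, forecasts):
    """Return (MSE, MAD, tracking signal) over the forecasted periods."""
    errors = [actuals[t] - f for t, f in forecasts.items()]
    mse = sum(e ** 2 for e in errors) / len(errors)
    mad = sum(abs(e) for e in errors) / len(errors)
    tracking_signal = sum(errors) / mad  # cumulative error divided by MAD
    return mse, mad, tracking_signal


# Item 17 units for weeks 61-80, as listed in the results tables.
snack_units = [59, 44, 49, 71, 75, 64, 52, 54, 47, 72,
               53, 46, 55, 48, 77, 91, 71, 43, 62, 52]

for name, forecasts in [
    ("4 period moving average", moving_average_forecasts(snack_units, 4)),
    ("exponential smoothing (alpha=0.3)", exponential_smoothing_forecasts(snack_units, 0.3)),
]:
    mse, mad, ts = error_metrics(snack_units, forecasts)
    print(f"{name}: MSE={mse:.2f}, MAD={mad:.2f}, TS={ts:.2f}")
```

Both methods react to recent sales rather than extrapolating a fixed line, which is why they may track a noisy series more closely than the regression projections.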
5)
Problem #5 dealt with clustering customers by their purchases of two different items. Clustering is a very
helpful strategy when it comes to breaking down and studying data. Customers are an example of
something that is often clustered. In this case, clustering allows the user to separate the customers into
groups so that different recommendations can be given to different groups based on the common
preferences of the customers in that group. K-means clustering is a very widely used clustering
technique. This method breaks the data down into k separate groups, with each group having its own
centroid. The defining property is that every point in a cluster is closer to its own
centroid than to the centroid of any other group.
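The idea can be sketched in a few lines of code. The following is a minimal version of the standard k-means procedure (Lloyd's algorithm), not the Excel implementation used in the project. The customer points are hypothetical (units of crackers, units of cookies), and the random seed and function names are assumptions for the example.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, seed=0, max_iterations=100):
    """Alternate between assigning points and updating centroids."""
    centroids = random.Random(seed).sample(points, k)  # start from k data points
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:  # nothing moved, so the clusters are stable
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers: (units of crackers, units of cookies).
points = [(1, 2), (2, 1), (3, 2), (20, 5), (22, 7), (21, 6), (4, 25), (6, 27), (5, 24)]
centroids, labels = k_means(points, 3)
for point, label in zip(points, labels):
    print(point, "-> Series", label + 1)
```

The result depends on the starting centroids, so different seeds can produce different groupings; this is one reason the choice of k and the final clusters are usually judged by inspecting the plots.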
For problem 5, the k-means clustering algorithm was used to break the data down into k separate
groups. The first two items that were compared were items 8 and 9 (cookies and crackers). These items
were clustered using k values of 3, 4, 5, and 6. The graphs were created and then judged to determine
which value of k produced the best fit for the data. All of the graphs that were created can be seen in
the Results Section. K=6 appeared to create the best fit for the data on the graph because it creates the
most precise groups, with points that seem to have the smallest deviation from their centroid. After this,
two more pairs of items were clustered; however, k=3 was the only value of k that was used in those
cases. The first graph compared cereal and eggs while the second graph
focused on hotdogs and pizza.
This clustering procedure was carried out to help the store manager promote
products based on specific customer attributes. After the customers were assigned to a specific cluster,
the background information on each customer was examined to determine if there were any noticeable
similarities in the characteristics of customers within clusters. This background information was given in
the Demographics Data.
The cluster analysis containing k=6 groups of crackers (Item 9) purchased vs. cookies (Item 8)
purchased was examined first to determine the buying patterns of the customers in each of the clusters.
The other three graphs for crackers vs. cookies using k=3, 4, and 5 were not considered here because it
was decided that k=6 created the best and most accurate results. In each of the graphs, clusters are
referred to as Series 1, Series 2, etc. Series 1 contains the customers who buy very little of both
products, Series 2 contains customers who buy a lot of crackers and a substantial amount of cookies,
Series 3 contains customers who buy some cookies but not a lot of crackers, Series 4 contains customers
who buy some crackers but not a lot of cookies, Series 5 contains customers who buy a lot of cookies
and a substantial amount of crackers, and Series 6 contains the customers who buy a substantial
amount of cookies and some crackers. To get a better understanding of the customers who belonged to
each of these groups, Family Size was useful in differentiating between clusters. The two groups that
were focused on were Series 2 and Series 5 because those were the groups that contained the
customers who bought a large amount of crackers and a large amount of cookies, respectively. In Series 2,
the most common family size by far is 2 people, accounting for 63% of the people in that group.
However, in Series 5, there are three family sizes that make up the majority of this group. A family size
of 2 makes up 33%, a family size of 3 makes up 25%, and a family size of 4 makes up 25%. Note that all
of these Demographic Data statistics can be found in the Appendix in table form. Now that the
company has some background on the customers belonging to each of the clusters, it must
decide how to allocate its advertising resources to try to increase the sales of cookies and crackers.
Based on this data, the type of customer who buys a lot of crackers belongs to Series 2 and, more
often than not, has a family size of 2 people. The customers who buy a lot of cookies are those
who have families of size 2, 3, or 4. So, in order to maximize profit from sales of both crackers and
cookies, crackers should be marketed mainly towards people who have family sizes of 2 people and
cookies should be marketed towards people who have families ranging from 2-4 people. In order to
advertise to these groups of people, TV would probably be the best way to get the word out. Out of the
272 people who responded, 264 owned at least one television. However, if a
television advertisement is not economically feasible, other methods, such as a newspaper or magazine
advertisement, could work well. 50% of the people in Series 2 and 58.33% of the people in Series 5 read
the newspaper. If the store wants to go with a magazine advertisement, its best course of
action would be to put an ad for crackers in Better H&G because 37.5% of people who buy a lot of
crackers also subscribe to Better H&G. To reach the customers who buy a lot of cookies, an
advertisement should be placed in Good House because one third of the people who buy a lot of cookies
also subscribe to Good House.
The next cluster analysis that was carried out compared the amount of eggs (Item 12) purchased
vs. the amount of cereal (Item 5) purchased by each customer. This cluster analysis only used k=3
groups. Series 1 contained the customers who did not buy a lot of either product, Series 2 contained the
customers who bought a lot of cereal, and Series 3 contained the customers who bought a lot of eggs.
When the demographics data of all of the customers was studied by cluster, there were some
interesting results. Annual Income was one factor that stood out.
Out of all of the people in Series 2, over 60% make more than $35,000 per year. In
Series 3, on the other hand, less than 19% of people make more than $35,000 per year. Since there
appears to be some relationship between annual income and product purchased, the store should take this into
account for advertising purposes. To increase sales of both of these items, the best course
of action would be to advertise cereal to people who make more than $35,000 per year and
advertise eggs to people who make less than $35,000 per year. Once
again, the most effective yet most costly way to do so would be to create a television commercial. If the
store would like to advertise in the newspaper instead, 57.89% of the people who buy a lot of cereal and
60.53% of the people who buy a lot of eggs subscribe to the newspaper. Finally, if the store chooses
a magazine ad, in order to reach the people who buy a lot of cereal, an ad should
be placed in Reader's Digest because 26.32% of the people who buy a lot of cereal subscribe. To reach
the people who buy a lot of eggs, a magazine ad should be placed in either Reader's Digest or Better
H&G because 26.32% of people who buy a lot of eggs subscribe to each of these magazines.
The final cluster analysis that was carried out compared the number of units of pizza (Item 16)
purchased vs. the number of units of hotdogs (Item 11) purchased by each customer. Again, k=3 was
the value used for the number of groups. Series 1 contained people who purchased very few hotdogs and
pizzas, Series 3 contained the people who purchased a lot of hotdogs, and Series 2 contained
everyone who did not fit either of those criteria. In this case, race was a factor that led to
some interesting results. Though white customers were the most common in the
study, the data shows a notable pattern among African American customers. In
Series 1 and 2, the percentage of African Americans in each cluster is just above 10%.
However, in Series 3, the percentage of African Americans jumps to just above 23%. The
percentage of white customers in Series 3 also drops by more than 10 percentage points from both Series 1 and 2.
Because of this relationship between race and product purchased, the store should take it into account
when trying to advertise pizza and hotdogs. This data suggests that pizza advertising aimed
specifically at African Americans would reach few buyers, because they make up only a little more than
10% of the customers in Series 1 and 2. White customers would be a better marketing target when it
comes to advertising pizza. If an item should be marketed to African Americans, it should be hot dogs,
because they make up a much larger share of the heavy hot dog buyers in Series 3.
Once again, the statistics show that the most efficient way to reach consumers is
through a television advertisement. However, in this case a magazine or newspaper ad may not be a
good way to advertise. In Series 3, which would be the main group that the store
would want to advertise to because they buy such a large amount of hotdogs, only 38.46% of people
subscribe to the newspaper, which is small compared to the other groups. Also, out of this group,
no magazine is subscribed to by more than 16% of the people, making this advertisement possibility also
somewhat ineffective. So, in order to reach the people in Series 3, the best method may be
to save up money to produce a TV commercial because that seems to be the only really effective way
to reach them.
6)
When developing a marketing strategy it is essential to know and understand one's customer base.
Using the Demographics data, some statistics were pulled in order to
develop such an understanding of the area around the store's location and the people who live there. The
percentages of ethnicities of the customers were found, as seen in Figure 21. This graph indicates that
the population of the town is predominantly white, and has very low minority population
percentages. This suggests that the store is located in a more rural area, as opposed to a larger
city. More potential evidence for a rural, blue-collar town is that 34% of customers own at least one
dog. This number would likely not be so high in a city with people living mostly in apartment buildings. The
fact that so many people own dogs leads the team to recommend that the store
promote its dog food and other dog-related products to its customers. This area also seems like a rural
town with blue-collar people because of the annual income. The vast majority of customers fall into the
middle-class range, whereas only a few customers earn more than $75k per year. Surprisingly, the most
common occupation category among all the customers is retired: there are
more customers who are retired than in any other category. This suggests that the store has a
customer base with a large share of elderly citizens. By using clustering and other analytic techniques, the
store has the capability to use all aforementioned statistics to gain an advantage with its sales and
marketing strategies.
Part 2: Databases
1)
2)
Query 1: Crosstab query showing all female occupations, in addition to family size and
coupon origin.
SQL code:
TRANSFORM Count([AllOccFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
Query 3: Total coupons used with a value greater than $0.99, by item type.
SQL code:
TRANSFORM Count(CouponItemInteraction.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponItemInteraction.Description, Count(CouponItemInteraction.[Customer ID]) AS [Total Of Customer ID]
FROM CouponItemInteraction
WHERE (((CouponItemInteraction.[Coupon Value (Cents)])>99))
GROUP BY CouponItemInteraction.Description
PIVOT CouponItemInteraction.[Coupon Value (Cents)];
Query 9: Item types bought with subscription to cable tv in regards to male occupation.
SQL code:
TRANSFORM Count([TVAds-MenOcc].[Cable TV]) AS [CountOfCable TV]
Query 13: Which customers bought snacks from which vendor on day 5.
SQL code:
SELECT Transaction.[Customer ID], Product.Description, Vendor.Vendor, Transaction.Day
FROM Vendor INNER JOIN (Product INNER JOIN [Transaction] ON Product.[Item Type] =
Transaction.[Item Type]) ON Vendor.ItemType = Transaction.[Item Type]
WHERE (((Product.Description)="snack") AND ((Transaction.Day)=5));
Query 14: Amount of coupons from origin 23, item type 10, with a coupon value greater than
$0.99.
SQL code:
SELECT Transaction.[Coupon Origin], Product.[Item Type], Transaction.[Coupon Value (Cents)]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction]
ON Customer.[Customer ID] = Transaction.[Customer ID])
ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Transaction.[Coupon Origin])=23) AND ((Product.[Item Type])=10)
AND ((Transaction.[Coupon Value (Cents)])>99));
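For readers without Access, the join and filter logic of Query 14 can be tried with Python's built-in sqlite3 module. This is only a sketch: the table and column names follow the query above, but the column types, the rows, and the product description are hypothetical, and SQLite syntax (double-quoted identifiers, plain JOIN) differs from Access SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product ("Item Type" INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Customer ("Customer ID" INTEGER PRIMARY KEY);
CREATE TABLE "Transaction" (
    "Customer ID" INTEGER,
    "Item Type" INTEGER,
    "Coupon Origin" INTEGER,
    "Coupon Value (Cents)" INTEGER
);
-- Hypothetical rows, for illustration only.
INSERT INTO Product VALUES (10, 'hypothetical item');
INSERT INTO Customer VALUES (1), (2);
INSERT INTO "Transaction" VALUES
    (1, 10, 23, 150),
    (1, 10, 23, 50),
    (2, 10, 7, 200);
""")

# Same three conditions as Query 14: origin 23, item type 10, value above 99 cents.
rows = conn.execute("""
SELECT t."Coupon Origin", p."Item Type", t."Coupon Value (Cents)"
FROM Product AS p
JOIN "Transaction" AS t ON p."Item Type" = t."Item Type"
JOIN Customer AS c ON c."Customer ID" = t."Customer ID"
WHERE t."Coupon Origin" = 23
  AND p."Item Type" = 10
  AND t."Coupon Value (Cents)" > 99
""").fetchall()
print(rows)
```

Only the first sample transaction satisfies all three conditions, so it is the only row returned; counting the returned rows gives the number of qualifying coupons.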
Query 15: Hot Dog purchases >1 in regards to family size, number of dogs, day and week.
SQL code:
SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Dogs,
Product.Description, Transaction.Week, Transaction.[Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID]
= Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
Query 16: Amount of female cleaners that bought cleansers and amount of units bought.
SQL code:
SELECT Customer.[Customer ID], Customer.[Female Occupation], Transaction.[Item Type],
Transaction.[Units Bought]
FROM Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =
Transaction.[Customer ID]
WHERE (((Customer.[Female Occupation])=8) AND ((Transaction.[Item Type])=6));
Query 18: Females subscribed to Better Home & Garden, and their occupations.
SQL code:
SELECT Customer.[Customer ID], Customer.[Subscription to Better H&G],
FemaleOccLookup.Desc
FROM FemaleOccLookup INNER JOIN Customer ON FemaleOccLookup.FemaleOccID =
Customer.[Female Occupation]
WHERE (((Customer.[Subscription to Better H&G])=Yes));
Query 21: How often customers purchased hot dogs along with family size and units bought.
SQL code:
SELECT Customer.[Customer ID], First(Customer.[Family Size]) AS [Family Size],
First(Customer.Dogs) AS Dogs, Count(Transaction.Week) AS CountOfWeek,
First(Transaction.[Units Bought]) AS [Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction]
ON Customer.[Customer ID] = Transaction.[Customer ID])
ON Product.[Item Type] = Transaction.[Item Type]
GROUP BY Customer.[Customer ID]
HAVING (((First(Customer.Dogs))>1))
3)
Graphics:
Figure 2: The total coupons used and the origin of those coupons.
advantage when sorting through large amounts of data to make inferences. The ability to call
certain criteria from a large selection can return a small, comprehensible set of data. In the transaction
data, we were able to draw strong evidence of the demographic prevalence in the store region. From
this demographic data, buying tendencies made it easy to form assumptions about what items sell
and what items do not sell. This project proved to be very beneficial in learning hands-on the
importance of data analytics and databases. This experience will prove to be extremely useful in our
future fields.
Group Roles
Matt Murphy Part 1 - #(2,3,4,5)
Brian Leap Part 2 - #(1,2,3)
Troy McCrum Part 1 - #(1,6) and Part 2 - #(2)
Appendix
Part 1:
Figure 1: Equations used for question 1, prediction errors.
Table 1: Percentage of people from each series (or cluster) grouped by how many family members they
have (see key in Data Dictionary)
Family Size (key)   C1     C2     C3     C4     C5     C6
0                   0.00   0.00   0.00   0.00   0.00   0.00
1                   0.12   0.06   0.14   0.07   0.17   0.14
2                   0.36   0.63   0.34   0.30   0.33   0.46
3                   0.24   0.00   0.34   0.23   0.25   0.03
4                   0.15   0.19   0.06   0.33   0.25   0.19
5                   0.08   0.00   0.09   0.03   0.00   0.05
6                   0.05   0.13   0.02   0.03   0.00   0.14
Table 2: Percentage of people from each series (cluster) grouped by how much they make annually (see
key in Data Dictionary)
Annual Income (key)   C1         C2         C3
0                     0.018315   0          0
1                     0.087912   0.105263   0.078947
2                     0.069597   0          0.052632
3                     0.058608   0.078947   0.105263
4                     0.120879   0.052632   0.184211
5                     0.087912   0.105263   0.157895
6                     0.164835   0.052632   0.236842
7                     0.153846   0.210526   0.078947
8                     0.128205   0.184211   0.078947
9                     0.03663    0.052632   0.026316
10                    0.029304   0.052632   0
11                    0.043956   0.105263   0
Table 3: Percentage of people from each series (cluster) grouped by Race (see key in Data Dictionary)
Race (key)   C1         C2         C3
0            0          0          0
1            0.876364   0.885246   0.769231
2            0.105455   0.114754   0.230769
3            0.010909   0          0
4            0          0          0
5            0.007273   0          0
# of TVs       # of Customers
0              8
1              63
2              126
3              75
No Response    77
Table 5: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from
Crackers vs. Cookies
                                  C1       C2       C3       C4       C5       C6
Newspaper                         0.4895   0.5000   0.5000   0.5667   0.5833   0.4054
Subscription to Better H&G        0.2632   0.3750   0.1875   0.1667   0.1667   0.2432
Subscription to Good House        0.1158   0.1250   0.1250   0.1000   0.3333   0.0541
Subscription to Ladies HJ         0.1211   0.1875   0.1406   0.1000   0.0833   0.1351
Subscription to McCalls           0.1263   0.1250   0.1719   0.0333   0.2500   0.1622
Subscription to Redbook           0.0895   0.1250   0.0313   0.0333   0.1667   0.0270
Subscription to Reader's Digest   0.2263   0.1875   0.3281   0.3667   0.0833   0.3243
Subscription to Cosmopolitan      0.0211   0.0625   0.0000   0.0000   0.0833   0.0541
Subscription to TV Guide          0.1684   0.0625   0.1563   0.1333   0.0833   0.1622
Subscription to People            0.0105   0.0000   0.0781   0.0333   0.1667   0.0000
Subscription to Glamour           0.0158   0.0625   0.0000   0.0333   0.0833   0.0000
Subscription to Time              0.0789   0.0625   0.0625   0.0000   0.0833   0.0811
Subscription to Newsweek          0.0579   0.0000   0.0469   0.0667   0.0833   0.0270
Table 6: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Eggs
vs. Cereal
                                  C1       C2       C3
Newspaper subscriber              0.4652   0.5789   0.6053
Subscription to Better H&G        0.2491   0.1579   0.2632
Subscription to Good House        0.1062   0.1842   0.1316
Subscription to Ladies HJ         0.1136   0.1316   0.2105
Subscription to McCalls           0.1465   0.0526   0.1316
Subscription to Redbook           0.0733   0.0789   0.0526
Subscription to Reader's Digest   0.2601   0.2632   0.2632
Subscription to Cosmopolitan      0.0256   0.0000   0.0263
Subscription to TV Guide          0.1575   0.1053   0.1842
Subscription to People            0.0293   0.0000   0.0526
Subscription to Glamour           0.0183   0.0000   0.0263
Subscription to Time              0.0806   0.0526   0.0000
Subscription to Newsweek          0.0623   0.0263   0.0000
Table 7: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Pizza
vs. Hotdogs
                                  C1       C2       C3
Newspaper                         0.4909   0.5246   0.3846
Subscription to Better H&G        0.2436   0.2459   0.1538
Subscription to Good House        0.0982   0.2295   0.0000
Subscription to Ladies HJ         0.1091   0.2295   0.0000
Subscription to McCalls           0.1200   0.1967   0.1538
Subscription to Redbook           0.0655   0.0984   0.0769
Subscription to Reader's Digest   0.2582   0.3115   0.0769
Subscription to Cosmopolitan      0.0218   0.0328   0.0000
Subscription to TV Guide          0.1455   0.1967   0.1538
Subscription to People            0.0218   0.0656   0.0000
Subscription to Glamour           0.0182   0.0164   0.0000
Subscription to Time              0.0764   0.0492   0.0000
Subscription to Newsweek          0.0545   0.0492   0.0000