IE 330

Final Project

12/10/14

Group 2:
Brian Leap
Matthew Murphy
Troy McCrum

Introduction
When people receive coupons in the mail or see advertisements in their favorite magazines,
they often interpret such events as random or maybe they think that companies are just trying to cast a
wide net to drum up business. In some cases, these things could be true; however, randomness usually
has nothing to do with it at all. What prospective customers may not know, or perhaps do not want to
know, is that companies know a lot more about them than they think. And random appearances of
coupons in the mailbox are in fact conscious, targeted efforts from marketing teams to persuade a
certain type of consumer to purchase a certain type of product.
So how do companies know so much about their customers? Any time a customer uses a credit
card, completes a survey, uses a coupon, subscribes to a magazine, etc., information about that
customer is being recorded and saved into a database. All sorts of customer information are valuable to
a company. A customer's transactions, race, income, family size, pets, age, and occupation are all
examples of useful characteristics that companies can utilize in designing a marketing strategy. The
more information a company can gain about a customer, the better that company can market to that
customer. Additionally, as companies accumulate MORE information about MORE customers, insights
can be made about customers as a group rather than individually. In other words, companies can assign
individuals to a group and then simply market to that group, instead of marketing from customer to
customer. This process of grouping similar customers together is known as clustering. Clustering is an
essential technique because the time and resources spent in marketing to each customer individually
would far outweigh any sort of potential monetary gains.
As companies accrue more and more customers, the thought of poring over customer data
by hand becomes impossible, and even somewhat laughable. The solution to this problem is a database.
A database of customer information to a marketing team is similar to the Periodic Table of Elements to a
chemist. Both a database and the periodic table are simply collections of information that are organized
in a certain way. To the average person, they are nothing more than that. However, for someone with
the necessary knowledge and tools, the information can be manipulated to create exciting new ideas.
With the help of analytics, marketing teams are able to query, or intelligently extract, data contained in
a database in order to make inferences about customers. Extracted data can then be converted into
tables, charts, graphs, or other types of graphical visualizations to be used in customer and marketing
analysis. Without a well-organized database, a technique like clustering would be impossible. Therefore,
using analytics to gain insight into customer and market behavior all starts with a good database.

Problem Description
The problem presented in this project is that a supermarket has an enormous amount of
transaction and demographic data pertaining to its customers, but the marketing team needs help in
finding ways to use this data to develop marketing and advertising strategies.
The project is divided into two parts:

The first part of the project deals with predictive analytics. This requires substantial use of
Microsoft Excel. Using Excel the team must calculate moving averages and exponential smoothing,
construct time series graphs, utilize regression lines, use k-means clustering to group similar customers
together, and use demographics data to form insights about the customer base. Using this information,
sales quantities of top-selling items must be predicted and methods of promoting and advertising
products must be developed.
The second part of the project deals primarily with databases. Part 2 requires the team to use
Microsoft Access to create database tables for analysis and perform queries to find interesting trends in
the data. Also, additional graphical visuals must be created outside of Access or Excel by using R. Again,
the team must make inferences and marketing recommendations based on the analysis.

Project Objectives

Part 1 (Excel)
o Calculate prediction error, find:
Mean Squared Error,
Mean Absolute Deviation,
Tracking Signal.
o Construct time series graphs
o Use OLS Regression to predict future sales
o Use k-means clustering to:
Group customers with similar buying habits,
Use analytics to find demographic commonalities
within those groups,
Make marketing recommendations based on analysis.
o Use demographic data to derive insights into the customer base
Part 2 (Access)
o Create database tables for analysis
o Identify relationships between tables
o Perform queries on various topics to gain insights
o Create advanced graphical visuals to display data

Part 1: Analytics
Moving Average and Exponential Smoothing
1)
Item Type 3
Method                             Mean Squared Error   Mean Absolute Deviation   Tracking Signal
4 period moving average            1235.22              24.32                     -1.03
2 period weighted moving average   1481.34              28.71                     -0.56
Exponential Smoothing              1192.31              23.9                      -0.83

Item Type 17
Method                             Mean Squared Error   Mean Absolute Deviation   Tracking Signal
4 period moving average            868.05               25.18                     5.58
2 period weighted moving average   801.37               23.29                     4.59
Exponential Smoothing              783.55               25.35                     6.63

Figure 1: Prediction error for weeks 10 through 20 for item types 3 and 17. The table shows
the Mean Squared Error, Mean Absolute Deviation, and Tracking Signal.

Regression
2)

Figure 2: Time series for the sales of snacks (Item 17) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 3: Time series for the sales of eggs (Item 12) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 4: Time series for the sales of butter (Item 3) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 5: Time series for the sales of cookies (Item 8) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 6: Time series for the sales of cereal (Item 5) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 7: Time series for the sales of BBQ (Item 2) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 8: Time series for the sales of cat food (Item 4) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 9: Time series for the sales of ice cream (Item 13) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 10: Time series for the sales of crackers (Item 9) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Figure 11: Time series for the sales of hot dogs (Item 11) over the 104 week period
Also included are the trend line, regression equation, and R2 value

Table 1: Regression Equations and Correlation Coefficients (R2) for Ten Highest Sold Items

Item Type   Item Description   Total Units Sold (104 Weeks)   Regression Equation     Correlation Coefficient
17          Snacks             7776                           y = -0.0593x + 84.136   0.006
12          Eggs               6879                           y = -0.0168x + 72.545   0.002
3           Butter             6254                           y = -0.1308x + 72.062   0.0133
8           Cookies            5165                           y = -0.3014x + 69.736   0.2358
5           Cereal             4961                           y = -0.1813x + 61.246   0.1156
2           BBQ                4006                           y = -0.1055x + 47.305   0.0646
4           Cat Food           3930                           y = -0.0358x + 42.883   0.0039
13          Ice Cream          3538                           y = -0.1145x + 42.911   0.0531
9           Crackers           3382                           y = -0.1825x + 44.876   0.2043
11          Hot Dogs           2365                           y = -0.0429x + 26.906   0.0092

3)
Table 2: Regression Equations for Top 3 Selling Items Using Only First 60 Weeks of Data

Item Type   Item Description   Regression Equation     Correlation Coefficient
17          Snacks             y = 0.0423x + 85.113    0.0011
12          Eggs               y = -0.1012x + 76.178   0.0027
3           Butter             y = -0.4956x + 81.59    0.048

Figure 12: Time Series for the Sales of Snacks (Item 17) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks

Figure 13: Time Series for the Sales of Eggs (Item 12) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks

Figure 14: Time Series for the Sales of Butter (Item 3) over the 104 Week Period
Regression line displayed is based off of the results from the first 60 weeks

4)

Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     87.6933            70.0048          51.3584
62     87.7356            69.9036          50.8628
63     87.7779            69.8024          50.3672
64     87.8202            69.7012          49.8716
65     87.8625            69.6000          49.3760
66     87.9048            69.4988          48.8804
67     87.9471            69.3976          48.3848
68     87.9894            69.2964          47.8892
69     88.0317            69.1952          47.3936
70     88.0740            69.0940          46.8980
71     88.1163            68.9928          46.4024
72     88.1586            68.8916          45.9068
73     88.2009            68.7904          45.4112
74     88.2432            68.6892          44.9156
75     88.2855            68.5880          44.4200
76     88.3278            68.4868          43.9244
77     88.3701            68.3856          43.4288
78     88.4124            68.2844          42.9332
79     88.4547            68.1832          42.4376
80     88.4970            68.0820          41.9420

Table 3: Projected Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80 using
Regression Equations obtained from first 60 Weeks of Data
Week   Item 17 - Snacks   Item 12 - Eggs   Item 3 - Butter
61     59                 104              38
62     44                 36               24
63     49                 36               41
64     71                 39               37
65     75                 47               35
66     64                 46               41
67     52                 34               41
68     54                 45               79
69     47                 38               40
70     72                 58               24
71     53                 43               39
72     46                 71               74
73     55                 95               49
74     48                 27               53
75     77                 50               128
76     91                 63               68
77     71                 147              71
78     43                 61               43
79     62                 62               82
80     52                 94               167

Table 4: Actual Units of Snacks, Eggs, and Butter Sold for Weeks 61 through 80

Clustering
5)

Figure 15: Graphical Representation of K-Means Clustering using K=3 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

Figure 16: Graphical Representation of K-Means Clustering using K=4 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

Figure 17: Graphical Representation of K-Means Clustering using K=5 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

Figure 18: Graphical Representation of K-Means Clustering using K=6 for Units of Crackers
Purchased vs. Units of Cookies Purchased by Each Customer over the 104 Week Span

Figure 19: Graphical Representation of K-Means Clustering using K=3 for Units of Eggs
Purchased vs. Units of Cereal Purchased by Each Customer over the 104 Week Span

Figure 20: Graphical Representation of K-Means Clustering using K=3 for Units of Hotdogs
Purchased vs. Units of Pizza Purchased by Each Customer over the 104 Week Span

General
6)

[Chart: Race (%), with categories White, Black, Hispanic, and Oriental]

Figure 21: Graphical representation of race of store customers.

[Chart: Customer Income. Vertical axis: number of customers, 0 to 90. Income brackets: 0 - 10k,
10 - 11.9k, 12 - 14.9k, 15 - 19.9k, 20 - 24.9k, 25 - 34.9k, 35 - 44.9k, 45 - 54.9k, 55 - 64.9k,
65 - 74.9k, 75+k]

Figure 22: Chart representing total number of customers falling into each income bracket.

[Chart: number of customers by occupation field, with separate Male and Female series.
Vertical axis: 0 to 140]

Figure 23: Chart representing total number of customers in occupation by field.

Part 1 Discussion and Overall Recommendations:


1)
Problem #1 used the solution to question 5 from homework 7, which asked for sales forecasts of the
two highest selling items, item types 3 and 17. Three forecasts were completed in homework 7: a
4-period moving average, a 2-period weighted moving average, and exponential smoothing. From these
forecasts, the mean squared error, mean absolute deviation, and tracking signal were each calculated
using the equations shown in Figure 1A in the Appendix.
According to the results, exponential smoothing produced the lowest Mean Squared Error for both item
types (1192.31 for item 3 and 783.55 for item 17). The error values are nevertheless rather large,
which suggests that none of these methods predicts sales very reliably. The team would recommend that
the store owner avoid relying on these sales forecasting methods, but choose exponential smoothing if
a forecast is needed.
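The three forecasting methods and the three error measures can be sketched in a few lines of Python. This is a minimal illustration only: the project itself used Excel, the sales figures below are hypothetical, and the error formulas are the common textbook definitions (error = actual - forecast, tracking signal = sum of errors divided by MAD), which are assumed to match the equations in Figure 1A of the Appendix.

```python
def moving_average(sales, n):
    """Forecast each period as the mean of the previous n actual values."""
    return [sum(sales[t - n:t]) / n for t in range(n, len(sales))]


def weighted_moving_average(sales, weights):
    """Weights are ordered oldest to newest and are assumed to sum to 1."""
    n = len(weights)
    return [
        sum(w * s for w, s in zip(weights, sales[t - n:t]))
        for t in range(n, len(sales))
    ]


def exponential_smoothing(sales, alpha):
    """F[t+1] = alpha * A[t] + (1 - alpha) * F[t], seeded with F[0] = A[0]."""
    forecasts = [sales[0]]
    for actual in sales[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


def error_metrics(actuals, forecasts):
    """Return (MSE, MAD, tracking signal) for paired actuals and forecasts."""
    errors = [a - f for a, f in zip(actuals, forecasts)]
    mse = sum(e * e for e in errors) / len(errors)
    mad = sum(abs(e) for e in errors) / len(errors)
    tracking_signal = sum(errors) / mad if mad else 0.0
    return mse, mad, tracking_signal


# Hypothetical weekly unit sales, not taken from the project data.
sales = [60, 72, 55, 80, 66, 74, 58, 90]

ma4 = moving_average(sales, 4)
wma2 = weighted_moving_average(sales, [0.4, 0.6])  # assumed weights
es = exponential_smoothing(sales, alpha=0.3)       # assumed alpha

print("4-period MA :", error_metrics(sales[4:], ma4))
print("2-period WMA:", error_metrics(sales[2:], wma2))
print("Exp. smooth :", error_metrics(sales[1:], es[1:]))
```

A tracking signal far from zero indicates that a method is consistently forecasting too high or too low, which MSE and MAD alone do not reveal.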

2)
Problem #2 provides some interesting insight into the buying patterns of certain items. With the help of
a pivot table, it was easy to condense the data to determine the ten most frequently purchased
items. A time series graph was then created for each of these items, and a regression line (trend
line) was added to each graph. The equation of this trend line allows the user to plug in any week
number to get an estimate of how many units of that item will be purchased during that week. To rate
the accuracy of this trend line in predicting the actual number of units sold, a correlation
coefficient (R2) was also computed.
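As an illustration of what Excel's trend line computes, the sketch below fits an ordinary least squares line to weekly sales and reports the slope, intercept, and R2. The function name and the sample data are hypothetical and are not part of the project workbook.

```python
def fit_trend_line(weeks, units):
    """Ordinary least squares fit of units = slope * week + intercept."""
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(units) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, units))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R2 = 1 - (residual sum of squares / total sum of squares)
    ss_tot = sum((y - mean_y) ** 2 for y in units)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(weeks, units))
    r_squared = 1 - ss_res / ss_tot if ss_tot else 0.0
    return slope, intercept, r_squared


# Hypothetical weekly unit sales for one item.
weeks = list(range(1, 11))
units = [82, 75, 91, 64, 88, 70, 79, 85, 61, 77]

slope, intercept, r_squared = fit_trend_line(weeks, units)
print(f"y = {slope:.4f}x + {intercept:.3f}, R2 = {r_squared:.4f}")
print("Estimate for week 12:", slope * 12 + intercept)
```

When weekly sales bounce around as much as they do here, the fitted slope is small relative to the scatter and R2 comes out close to zero.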
No item by itself showed a particularly interesting trend; however, all ten plots had several
features in common. The first was a sharp decrease in sales of each item during week 57. This could
have been caused by a variety of factors. For example, week 57 may have been a holiday week, so
customers were spending their time at home with family instead of at the store. Another possibility
is that the store was closed for a few days that week, thus decreasing its total sales. No conclusive
answer can be obtained from the data given, but something clearly happened during week 57 to cause
this decrease in sales.
Another feature that all ten items had in common was a negatively sloping regression line. This means
that the sales of every one of these items are projected to decrease continually over time. If sales
of each of its ten most popular products keep decreasing as the regression model suggests, the store
will bring in less and less money, and if the trend goes on for long enough it could even lead to the
store going out of business.
Finally, one last notable feature that all ten items shared was a very small correlation coefficient.
The R2 value (strictly the coefficient of determination, the square of the correlation coefficient)
measures the share of the variation in sales that the regression line explains. It falls between 0
and 1, with 0 meaning the line explains none of the variation and 1 meaning a perfect fit. Out of all
ten items, the largest value was 0.2358, and one of the values was as low as 0.002. These extremely
low values indicate that the regression line is not very accurate in predicting the actual number of
units of an item sold given the week number.
3)
Problem #3 dives further into the concept of regression and using a regression line to predict future
response. In this problem, only the three most frequently purchased items were considered. A
regression line was formed using only the sales data from the first 60 weeks. This line was then
plotted on top of the time series plot of all 104 weeks of data for its corresponding item type. This
created a visual that made it easy to see just how accurate each individual regression line was in
predicting future response.
The most notable result was how much the regression equations changed from what they were when all
104 weeks of data were considered. The initial equations were y = -0.0593x + 84.136 for item 17,
y = -0.0168x + 72.545 for item 12, and y = -0.1308x + 72.062 for item 3. Once the data were restricted
to the first 60 weeks, the new regression equations were y = 0.0423x + 85.113 for item 17,
y = -0.1012x + 76.178 for item 12, and y = -0.4956x + 81.59 for item 3. One notable change was the
sign of the slope for item 17. When all 104 weeks of data were used, the slope of the regression line
was negative, but when only the first 60 weeks were considered, the slope became positive. This means
that sales of item 17 are projected to gradually increase when only the first 60 weeks of data are
used, but to decrease over time when all 104 weeks are considered.
Another noticeable change was the much steeper decline for items 12 and 3 when only the first 60 weeks
of data are considered. The slope is about six times as steep for item 12 and almost four times as
steep for item 3 as it is when all 104 weeks are used. However, it should be noted that in both of
these cases the intercept is higher for the regression line from only the first 60 weeks, which
somewhat compensates for the more negative slope. Overall, the large changes in the regression
equations when more data are used indicate the lack of accuracy of the regression lines in predicting
the response (units sold).
4)
The goal of problem #4 was to predict the number of units sold each week from week 61 through week 80.
These predictions were made using the regression lines fitted to only the first 60 weeks of data, by
plugging each week number into the regression equation as the x value. The predicted values and the
actual values can be seen in the Results Section.
It is evident that the predictions obtained through regression are not very accurate for predicting
future sales. For item 17, the actual sales from weeks 61 through 80 are significantly lower than the
predicted values. The predicted values are also noticeably too high for item 12: there are a few weeks
where actual units sold exceed the projections, but for the most part actual sales are below the
predicted values. The projections for item 3 are not accurate either. The regression line indicates a
quick decrease in units sold, while in actuality the values seem to be gradually increasing.
Overall, after comparing the projections from the regression lines to the actual amount of each item
sold each week, it is clear that there is a lot of error in this prediction method. If the owner of
the store wants to better predict the number of items that will be sold in the future, forecasting
methods such as Moving Average or Exponential Smoothing may create more accurate results.
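The comparison between projected and actual sales can be reproduced with a short script. The sketch below uses the 60-week regression equations from Table 2 and the actual weekly units from Table 4, with item 12 labeled Eggs as in Table 1. The summary measures (mean error and mean absolute deviation) are an added illustration and were not part of the original analysis.

```python
# (slope, intercept) of the regression lines fitted to the first 60 weeks (Table 2).
EQUATIONS = {
    "Snacks (17)": (0.0423, 85.113),
    "Eggs (12)": (-0.1012, 76.178),
    "Butter (3)": (-0.4956, 81.59),
}

# Actual units sold in weeks 61 through 80 (Table 4).
ACTUAL = {
    "Snacks (17)": [59, 44, 49, 71, 75, 64, 52, 54, 47, 72,
                    53, 46, 55, 48, 77, 91, 71, 43, 62, 52],
    "Eggs (12)": [104, 36, 36, 39, 47, 46, 34, 45, 38, 58,
                  43, 71, 95, 27, 50, 63, 147, 61, 62, 94],
    "Butter (3)": [38, 24, 41, 37, 35, 41, 41, 79, 40, 24,
                   39, 74, 49, 53, 128, 68, 71, 43, 82, 167],
}

WEEKS = list(range(61, 81))


def summarize(item):
    """Return (predicted values, mean error, mean absolute deviation)."""
    slope, intercept = EQUATIONS[item]
    predicted = [slope * week + intercept for week in WEEKS]
    errors = [a - p for a, p in zip(ACTUAL[item], predicted)]
    mean_error = sum(errors) / len(errors)            # negative: over-prediction
    mad = sum(abs(e) for e in errors) / len(errors)
    return predicted, mean_error, mad


for item in EQUATIONS:
    predicted, mean_error, mad = summarize(item)
    print(f"{item}: week 61 prediction {predicted[0]:.4f}, "
          f"mean error {mean_error:.2f}, MAD {mad:.2f}")
```

A negative mean error means the regression line over-predicted on average, and a positive mean error means it under-predicted.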
5)
Problem #5 dealt with clustering customers by their purchases of two different items. Clustering is a
very helpful strategy when it comes to breaking down and studying data, and customers are an example
of something that is often clustered. In this case, clustering allows the user to separate the
customers into groups so that different recommendations can be given to different groups based on the
common preferences of the customers in each group. K-means clustering is a very widely used
clustering technique. This method breaks the data down into k separate groups, with each group having
its own centroid. The defining property is that every point in a cluster is closer to its own
centroid than to the centroid of any other cluster.
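The k-means procedure described above can be sketched as follows. This is a plain-Python version of the standard iterative (Lloyd's) algorithm, not the Excel workflow used in the project, and the customer points are hypothetical. The function also returns the within-cluster sum of squared distances, which is one way to quantify deviation from the centroids. That quantity generally shrinks as k increases, so it should not be the only basis for choosing k.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, assignments, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the algorithm has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave the centroid in place if its cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    wcss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, wcss


# Hypothetical (units of crackers, units of cookies) per customer.
customers = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
centroids, labels, wcss = kmeans(customers, k=2)
print(centroids, labels, wcss)
```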

For problem 5, the k-means clustering algorithm was used to break the data down into k separate
groups. The first two items that were compared were items 8 and 9 (cookies and crackers). These items
were clustered using k values of 3, 4, 5, and 6. The graphs were created and then judged to determine
which value of k produced the best fit for the data. All of the graphs that were created can be seen
in the Results Section. K=6 appeared to create the best fit for the data on the graph because it
creates the most precise groups, with points that seem to have the smallest deviation from their
centroid. After this, two more pairs of items were clustered; however, k=3 was the only value of k
used in those cases. The first graph compared cereal and eggs, while the second graph compared
hotdogs and pizza.
The reason that this clustering procedure was carried out is to help the store manager promote
products based on specific customer attributes. After the customers were assigned to a specific
cluster, the background information on each customer was examined to determine if there were any
noticeable similarities in the characteristics of customers within clusters. This background
information was given in the Demographics Data.
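The demographic comparison within clusters amounts to computing, for each cluster, the percentage of its customers that fall into each category of an attribute. A minimal sketch is shown below. The record layout, the field names `cluster` and `family_size`, and the sample customers are all hypothetical and do not reflect the actual structure of the Demographics Data.

```python
from collections import Counter, defaultdict


def profile_clusters(customers, attribute):
    """Percentage breakdown of one attribute within each cluster."""
    counts = defaultdict(Counter)
    for customer in customers:
        counts[customer["cluster"]][customer[attribute]] += 1
    return {
        cluster: {
            value: 100 * n / sum(counter.values())
            for value, n in counter.items()
        }
        for cluster, counter in counts.items()
    }


# Hypothetical customers already assigned to clusters ("Series").
customers = [
    {"cluster": 2, "family_size": 2},
    {"cluster": 2, "family_size": 2},
    {"cluster": 2, "family_size": 4},
    {"cluster": 5, "family_size": 2},
    {"cluster": 5, "family_size": 3},
    {"cluster": 5, "family_size": 3},
    {"cluster": 5, "family_size": 4},
]

for cluster, shares in sorted(profile_clusters(customers, "family_size").items()):
    print(f"Series {cluster}:", {k: round(v, 1) for k, v in sorted(shares.items())})
```

The same function could be applied to any other attribute, such as income bracket, race, or magazine subscription, by changing the `attribute` argument.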
The cluster analysis with k=6 groups of crackers (Item 9) purchased vs. cookies (Item 8) purchased
was examined first to determine the buying patterns of the customers in each of the clusters. The
other three graphs for crackers vs. cookies, using k=3, 4, and 5, were not considered here because it
was decided that k=6 created the best and most accurate results. In each of the graphs, clusters are
referred to as Series 1, Series 2, etc. Series 1 contains the customers who buy very little of both
products, Series 2 contains customers who buy a lot of crackers and a substantial amount of cookies,
Series 3 contains customers who buy some cookies but not a lot of crackers, Series 4 contains
customers who buy some crackers but not a lot of cookies, Series 5 contains customers who buy a lot
of cookies and a substantial amount of crackers, and Series 6 contains the customers who buy a
substantial amount of cookies and some crackers. Family Size proved useful in differentiating between
clusters. The analysis focused on Series 2 and Series 5 because those groups contain the customers
who bought a large amount of crackers and a large amount of cookies, respectively. In Series 2, the
most common family size by far is 2 people, accounting for 63% of the people in that group. In
Series 5, however, three family sizes make up the majority of the group: a family size of 2 makes up
33%, a family size of 3 makes up 25%, and a family size of 4 makes up 25%. All of these Demographic
Data statistics can be found in the Appendix in table form.
Now that the company has some background on the customers belonging to each of the clusters, it must
decide how to allocate its advertising resources to try to increase the sales of cookies and
crackers. Based on this data, the type of customer who buys a lot of crackers belongs to Series 2
and, more often than not, has a family size of 2 people. The customers who buy a lot of cookies are
those who have families of size 2, 3, or 4. In order to maximize profit from sales of both products,
crackers should be marketed mainly towards people who have family sizes of 2 people, and cookies
should be marketed towards people who have families ranging from 2-4 people. TV would probably be the
best way to reach these groups: out of the 272 people who responded, 264 owned at least one
television. However, if a television advertisement is not economically feasible, other methods, such
as a newspaper or magazine advertisement, could work well. 50% of the people in Series 2 and 58.33%
of the people in Series 5 read the newspaper. If the store prefers a magazine advertisement, its best
course of action would be to put an ad for crackers in Better H&G, because 37.5% of people that buy a
lot of crackers also subscribe to Better H&G. To reach the customers who buy a lot of cookies, an
advertisement should be placed in Good House, because one third of the people who buy a lot of
cookies also subscribe to Good House.
The next cluster analysis compared the amount of eggs (Item 12) purchased vs. the amount of cereal
(Item 5) purchased by each customer. This cluster analysis only used k=3 groups. Series 1 contained
the customers who did not buy a lot of either product, Series 2 contained the customers who bought a
lot of cereal, and Series 3 contained the customers who bought a lot of eggs. When the demographics
data of the customers was studied by cluster, Annual Income produced some interesting statistics. Out
of all of the people in Series 2, over 60% make more than $35,000 per year. In Series 3, on the other
hand, less than 19% of people make more than $35,000 per year. Since there is clearly some
correlation between annual income and product purchased, the store should take this into account for
advertising purposes. To increase sales of both of these items, the best course of action would be to
advertise cereal to people who make more than $35,000 per year and advertise eggs to people who make
less than $35,000 per year. Once again, the most effective yet most costly way to do this advertising
would be to create a television commercial. If the store would like to advertise in the newspaper
instead, 57.89% of the people who buy a lot of cereal and 60.53% of the people who buy a lot of eggs
subscribe to the newspaper. Finally, if the store chooses a magazine ad, an ad aimed at the people
who buy a lot of cereal should be placed in Reader's Digest, because 26.32% of them subscribe. To
reach the people who buy a lot of eggs, a magazine ad should be placed in either Reader's Digest or
Better H&G, because 26.32% of people who buy a lot of eggs subscribe to each of these magazines.
The final cluster analysis compared the number of units of pizza (Item 16) purchased vs. the number
of units of hotdogs (Item 11) purchased by each customer. Again, k=3 was the value used for the
number of groups. Series 1 contained people who purchased very few hotdogs and pizzas, Series 3
contained the people who purchased a lot of hotdogs, and Series 2 contained everyone who did not fit
either of those criteria. In this case, race was a factor that led to some interesting results.
Though white people were the most common race in the study, the data shows some notable evidence
pertaining to the preferences of African Americans. In Series 1 and 2, the percentage of African
Americans belonging to each cluster is just above 10%. However, in Series 3, the percentage of
African Americans jumps all of the way up above 23%. The percentage of white people belonging to
Series 3 also drops more than 10% from both Series 1 and 2. The store should take this correlation
between race and product purchased into account when trying to advertise pizza and hotdogs. The data
gives little support for marketing pizza specifically to African Americans, who make up only a little
more than 10% of the customers in Series 1 and 2; white people would be a better marketing target
when it comes to advertising pizza. If an item should be marketed to African Americans, it should be
hot dogs, because they make up a much higher percentage of the customers who purchase a lot of
hotdogs. Once again, the statistics show that the most effective way to reach consumers is through a
television advertisement. However, in this case a magazine or newspaper ad may not be a good way to
advertise. In Series 3, which would be the main group that the store would want to advertise to
because they buy such a large amount of hotdogs, only 38.46% of people subscribe to the newspaper,
which is very small compared to the other groups. Also, out of this group, no magazine is subscribed
to by more than 16% of the people, making this advertisement possibility also somewhat ineffective.
In order to reach the people in Series 3, the best method may be to save up money to produce a TV
commercial, because that seems to be the only really effective way to reach them.

6)
When developing a marketing strategy it is essential to know and understand one's customer base.
Using the Demographics data, some statistics were pulled in order to develop such an understanding of
the area of the store's location and the people who live there. The percentages of ethnicities of the
customers were found, as seen in Figure 21. This graph indicates that the population of the town is
predominantly white, with very low minority population percentages. This leads the team to believe
that the store is located in a more rural area, as opposed to a larger city. More potential evidence
for a rural, blue-collar town is that 34% of customers own at least one dog. This number would
probably not be so high in a city with people living mostly in apartment buildings. Because so many
people own dogs, the team recommends that the store promote its dog food and other dog-related
products to its customers. The annual income data also suggests a rural town with blue-collar people.
The vast majority of customers fall into the middle-class range, whereas only a few customers earn
more than $75k per year. Surprisingly, the most common occupation category among all the customers is
retired: there are more customers who are retired than in any other category. This leads the team to
believe that a large share of the store's customer base is elderly. By using clustering and other
analytic techniques, the store has the capability to use all of the aforementioned statistics to gain
an advantage with its sales and marketing strategies.

Part 2: Databases
1)

Figure 1: Access Database Table Schema

Table 1: Transaction: Transaction Information Table

Table 2: CouponLookup: Coupon Description Table

Table 3: Product: Item Description Table

Table 4: Vendor: Vendor Description Table

Table 5: Customer: Customer Demographic Table

Table 6: RaceLookup: Race Description Table

Table 7: IncomeLookup: Income Table

Table 8: FemaleOccLookup: Female Occupation Table

Table 9: MaleOccLookup: Male Occupation Table

Table 10: MaleAgeLookup: Male Age Table

2)

Query 1: Crosstab query showing all female occupations, along with family size and
coupon origin.
SQL code:
TRANSFORM Count([AllOccFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
SELECT [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc,
Count([AllOccFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]
FROM [AllOccFemale-CouponOrig]
GROUP BY [AllOccFemale-CouponOrig].[Family Size], [AllOccFemale-CouponOrig].Desc
PIVOT [AllOccFemale-CouponOrig].CouponDesc;

Query 2: Origins of all coupon transactions.
SQL code:
TRANSFORM Count(CouponAmounts.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponAmounts.CouponDesc, Count(CouponAmounts.[Customer ID]) AS [Total Of Customer ID]
FROM CouponAmounts
GROUP BY CouponAmounts.CouponDesc
PIVOT CouponAmounts.CouponID;

Query 3: Total coupons used with a value greater than $0.99, by item type.
SQL code:
TRANSFORM Count(CouponItemInteraction.[Customer ID]) AS [CountOfCustomer ID]
SELECT CouponItemInteraction.Description, Count(CouponItemInteraction.[Customer ID]) AS [Total Of Customer ID]
FROM CouponItemInteraction
WHERE (((CouponItemInteraction.[Coupon Value (Cents)])>99))
GROUP BY CouponItemInteraction.Description
PIVOT CouponItemInteraction.[Coupon Value (Cents)];

Query 4: Total items bought by ethnicity.
SQL code:
TRANSFORM Count(EthnicItemType.[Customer ID]) AS [CountOfCustomer ID]
SELECT EthnicItemType.Desc, Count(EthnicItemType.[Customer ID]) AS [Total Of Customer ID]
FROM EthnicItemType
GROUP BY EthnicItemType.Desc
PIVOT EthnicItemType.Description;

Query 5: Units bought compared to family size.
SQL code:
TRANSFORM Count([Family-Units Bought].[Customer ID]) AS [CountOfCustomer ID]
SELECT [Family-Units Bought].[Units Bought], Count([Family-Units Bought].[Customer ID]) AS [Total Of Customer ID]
FROM [Family-Units Bought]
GROUP BY [Family-Units Bought].[Units Bought]
PIVOT [Family-Units Bought].[Family Size];

Query 6: Coupon origins based on female occupations.
SQL code:
TRANSFORM Count([FemaleOcc-Coupons].[Customer ID]) AS [CountOfCustomer ID]
SELECT [FemaleOcc-Coupons].Desc, Count([FemaleOcc-Coupons].[Customer ID]) AS [Total Of Customer ID]
FROM [FemaleOcc-Coupons]
GROUP BY [FemaleOcc-Coupons].Desc
PIVOT [FemaleOcc-Coupons].[Coupon Origin];

Query 7: Ice cream purchases by day of week.
SQL code:
TRANSFORM Count([IceCream-ByDay].[Item Type]) AS [CountOfItem Type]
SELECT [IceCream-ByDay].Description, Count([IceCream-ByDay].[Item Type]) AS [Total Of Item Type]
FROM [IceCream-ByDay]
GROUP BY [IceCream-ByDay].Description
PIVOT [IceCream-ByDay].Day;

Query 8: Coupon origin and family size of unemployed females.
SQL code:
TRANSFORM Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [CountOfCustomer ID]
SELECT [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc,
Count([NonEmployFemale-CouponOrig].[Customer ID]) AS [Total Of Customer ID]
FROM [NonEmployFemale-CouponOrig]
GROUP BY [NonEmployFemale-CouponOrig].[Family Size], [NonEmployFemale-CouponOrig].Desc
PIVOT [NonEmployFemale-CouponOrig].CouponDesc;

Query 9: Item types bought by customers with a cable TV subscription, by male occupation.
SQL code:
TRANSFORM Count([TVAds-MenOcc].[Cable TV]) AS [CountOfCable TV]
SELECT [TVAds-MenOcc].Description, Count([TVAds-MenOcc].[Cable TV]) AS [Total Of Cable TV]
FROM [TVAds-MenOcc]
GROUP BY [TVAds-MenOcc].Description
PIVOT [TVAds-MenOcc].Desc;

Query 10: Item types bought by income amounts.
SQL code:
TRANSFORM Count([VolumeBought-Income].[Customer ID]) AS [CountOfCustomer ID]
SELECT [VolumeBought-Income].Incomeamt, Count([VolumeBought-Income].[Customer ID]) AS [Total Of Customer ID]
FROM [VolumeBought-Income]
GROUP BY [VolumeBought-Income].Incomeamt
PIVOT [VolumeBought-Income].Description;

Query 11: Item types bought by week.
SQL code:
TRANSFORM Count(WeeklyProducts.[Customer ID]) AS [CountOfCustomer ID]
SELECT WeeklyProducts.Description, Count(WeeklyProducts.[Customer ID]) AS [Total Of Customer ID]
FROM WeeklyProducts
GROUP BY WeeklyProducts.Description
PIVOT WeeklyProducts.Week;

Query 12: Family size, income, and children of Newsweek subscribers.
SQL code:
SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Income,
Customer.Children
FROM Customer
WHERE (((Customer.[Subscription to Newsweek])=Yes))
ORDER BY Customer.Children DESC, Customer.[Family Size] DESC;

Query 13: Which customers bought snacks from which vendor on day 5.
SQL code:
SELECT Transaction.[Customer ID], Product.Description, Vendor.Vendor, Transaction.Day
FROM Vendor INNER JOIN (Product INNER JOIN [Transaction] ON Product.[Item Type] =
Transaction.[Item Type]) ON Vendor.ItemType = Transaction.[Item Type]
WHERE (((Product.Description)="snack") AND ((Transaction.Day)=5));

Query 14: Coupons from origin 23 for item type 10 with a coupon value greater than $0.99.
SQL code:
SELECT Transaction.[Coupon Origin], Product.[Item Type], Transaction.[Coupon Value (Cents)]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID]
= Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Transaction.[Coupon Origin])=23) AND ((Product.[Item Type])=10) AND
((Transaction.[Coupon Value (Cents)])>99));

Query 15: Hot dog purchases by customers with more than one dog, in regards to family size, number
of dogs, week, and units bought.
SQL code:
SELECT Customer.[Customer ID], Customer.[Family Size], Customer.Dogs,
Product.Description, Transaction.Week, Transaction.[Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID]
= Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
WHERE (((Customer.Dogs)>1) AND ((Product.Description)="dogs"))
ORDER BY Customer.[Customer ID], Transaction.Week;

Query 16: Female cleaners who bought cleansers and the number of units bought.
SQL code:
SELECT Customer.[Customer ID], Customer.[Female Occupation], Transaction.[Item Type],
Transaction.[Units Bought]
FROM Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =
Transaction.[Customer ID]
WHERE (((Customer.[Female Occupation])=8) AND ((Transaction.[Item Type])=6));

Query 17: Number of duplicate vendor and item number pairs.
SQL code:
SELECT First(Vendor.[Vendor]) AS [Vendor Field], First(Vendor.[ItemNumber]) AS
[ItemNumber Field], Count(Vendor.[Vendor]) AS NumberOfDups
FROM Vendor
GROUP BY Vendor.[Vendor], Vendor.[ItemNumber]
HAVING (((Count(Vendor.Vendor))>1) And ((Count(Vendor.ItemNumber))>1));

Query 18: Females subscribed to Better Homes & Gardens, and their occupations.
SQL code:
SELECT Customer.[Customer ID], Customer.[Subscription to Better H&G],
FemaleOccLookup.Desc
FROM FemaleOccLookup INNER JOIN Customer ON FemaleOccLookup.FemaleOccID =
Customer.[Female Occupation]
WHERE (((Customer.[Subscription to Better H&G])=Yes));

Query 19: Male and female education in regards to income.
SQL code:
SELECT Customer.[Customer ID], Customer.Income, Customer.[Male Education],
Customer.[Female Education]
FROM IncomeLookup INNER JOIN Customer ON IncomeLookup.IncomeID = Customer.Income
WHERE (((Customer.Income)=6));

Query 20: Coupon origin of snacks bought by African Americans.
SQL code:
SELECT Customer.[Customer ID], RaceLookup.Desc, Product.Description, CouponLookup.CouponDesc
FROM CouponLookup INNER JOIN (RaceLookup INNER JOIN (Product INNER JOIN (Customer INNER
JOIN [Transaction] ON Customer.[Customer ID] = Transaction.[Customer ID]) ON Product.[Item Type] =
Transaction.[Item Type]) ON RaceLookup.RaceID = Customer.Ethnicity) ON CouponLookup.CouponOrig
= Transaction.[Coupon Origin]
WHERE (((RaceLookup.Desc)="black") AND ((Product.Description)="snack"));

Query 21: How often customers purchased hot dogs, along with family size and units bought.
SQL code:
SELECT Customer.[Customer ID], First(Customer.[Family Size]) AS [Family Size], First(Customer.Dogs)
AS Dogs, Count(Transaction.Week) AS CountOfWeek, First(Transaction.[Units Bought]) AS [Units Bought]
FROM Product INNER JOIN (Customer INNER JOIN [Transaction] ON Customer.[Customer ID] =
Transaction.[Customer ID]) ON Product.[Item Type] = Transaction.[Item Type]
GROUP BY Customer.[Customer ID]
HAVING (((First(Customer.Dogs))>1));

3)

Graphics:

Figure 1: Ice cream units sold by day of week.

Figure 2: The total coupons used and the origin of those coupons.

Figure 3: Sum of total items bought by ethnicity.

Figure 4: Distribution of coupons used valued greater than $0.99.

Figure 5: Total coupons used by female occupation.

Part 2 Discussion and Overall Recommendations:


Building the database in Access revealed many important insights in the transaction data. Through
tables and queries, information buried in the bulk of the data can be brought to the surface. The
database was set up by first creating the necessary tables. The demographic data was imported as the
Customer table. A Transaction table was also created, and it is related to the Customer table through
the shared Customer ID field. To make results easier to read, a Product table was created that lists
the description of each item type in place of its numerical ID. As queries were written, many of the
numerical codes used for the demographics proved difficult to interpret. The following lookup tables
were therefore created so that queries could return descriptions instead: CouponLookup, IncomeLookup,
RaceLookup, FemaleOccLookup, MaleAgeLookup, and MaleOccLookup. With the actual descriptions from the
demographic data in place, the store owner can easily identify whatever he or she is looking for when
analyzing the data.
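
The role of a lookup table can be sketched outside Access as well. The example below uses Python's built-in sqlite3 module to join a numeric occupation code to its description. The column names are simplified, the function name is hypothetical, and the sample rows are made up for illustration (code 8 is taken as "cleaner" as in Query 16, while the code used for "clerical" is an assumption).

```python
import sqlite3


def describe_female_occupations(customers, lookup):
    """Return (customer id, occupation description) pairs via a lookup join.

    customers -- iterable of (customer_id, female_occupation_code)
    lookup    -- iterable of (female_occupation_code, description)
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE Customer (CustomerID INTEGER, FemaleOccupation INTEGER)"
    )
    con.execute(
        "CREATE TABLE FemaleOccLookup (FemaleOccID INTEGER, Description TEXT)"
    )
    con.executemany("INSERT INTO Customer VALUES (?, ?)", customers)
    con.executemany("INSERT INTO FemaleOccLookup VALUES (?, ?)", lookup)
    # The join replaces each numeric code with its readable description.
    rows = con.execute(
        "SELECT c.CustomerID, l.Description "
        "FROM Customer AS c "
        "INNER JOIN FemaleOccLookup AS l "
        "ON l.FemaleOccID = c.FemaleOccupation "
        "ORDER BY c.CustomerID"
    ).fetchall()
    con.close()
    return rows


# Made-up sample rows; only the pairing of code 8 with cleaners follows Query 16.
SAMPLE_CUSTOMERS = [(1, 8), (2, 3)]
SAMPLE_LOOKUP = [(3, "clerical"), (8, "cleaner")]

for customer_id, description in describe_female_occupations(
    SAMPLE_CUSTOMERS, SAMPLE_LOOKUP
):
    print(customer_id, description)
```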
As the queries were written, certain trends became apparent. A grocery store depends on people
buying its products, so finding relationships between the types of customers and the types of products
they buy can be very beneficial. Linking transaction data to customer demographics is crucial to the
marketing and sales of any company.
To analyze some of the findings from the queries, we start with the most general query: the quantity
of items bought over the 102-week period. A total of 49,576 items were bought. Of these, 213 purchases
were made by Hispanic customers, 255 by customers listed as other, 4,373 by African American customers,
and 44,735 by Caucasian customers. Across these purchases, the top three items bought were snacks at
6,379, eggs at 5,059, and cereal at 4,859. These figures already suggest that the store is located in
an area where the primary customers are Caucasian.
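
As a quick check on the ethnicity breakdown, the four counts quoted above can be summed and converted to shares. This is a minimal sketch: the counts come from the paragraph above, while the function and variable names are illustrative.

```python
# Purchase counts by ethnicity as quoted in the discussion.
PURCHASES_BY_ETHNICITY = {
    "Caucasian": 44735,
    "African American": 4373,
    "Other": 255,
    "Hispanic": 213,
}


def purchase_shares(counts):
    """Return the grand total and each group's fraction of that total."""
    total = sum(counts.values())
    return total, {group: n / total for group, n in counts.items()}


total, shares = purchase_shares(PURCHASES_BY_ETHNICITY)
print(f"Total purchases: {total}")
for group, share in shares.items():
    print(f"{group}: {share:.1%}")
```

The four counts add up to the 49,576 total, and Caucasian customers account for roughly 90 percent of purchases.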
Coupons can be an integral part of a grocery store's sales. Coupon data can show which types of items
coupons are most often used with, where the coupons come from, and what types of people use them. The
queries CouponItemInteraction, CouponAmounts, and FemaleOcc-Coupons answer a few of these questions.
The top three items bought with coupons valued at more than $0.99 were cereal with 397 instances,
detergents with 147, and coffee with 139. The Sunday supplement accounted for about 60 percent of all
coupons used at this store, with 2,014 of the 3,376 coupons used, almost five times as many as any
other origin. Newspapers were the next largest source of coupons with 424 transactions. Females
accounted for the bulk of coupon use, and the most common occupation among these coupon users was
retired. The next two female occupations with the highest coupon use were unemployed and clerical,
respectively. From this data, it is likely that a large population of retirees shops at this store.
For targeted marketing, we would recommend reaching Caucasian females through the Sunday supplement
for items the store wants to sell more of.
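
The coupon origin figures can be checked the same way. The sketch below takes the three counts quoted above (3,376 coupons in total, 2,014 from the Sunday supplement, 424 from newspapers) and computes the leading source's share and its ratio to the runner-up. The function name is an illustrative assumption.

```python
def top_source_summary(total, top, runner_up):
    """Return (share of total held by the top source, top / runner-up ratio)."""
    return top / total, top / runner_up


# Counts quoted in the discussion.
TOTAL_COUPONS = 3376
SUNDAY_SUPPLEMENT = 2014
NEWSPAPERS = 424

share, ratio = top_source_summary(TOTAL_COUPONS, SUNDAY_SUPPLEMENT, NEWSPAPERS)
print(f"Sunday supplement share of all coupons: {share:.1%}")
print(f"Sunday supplement vs. newspapers: {ratio:.2f}x")
```

With these inputs the share comes out just under 60 percent and the ratio to newspapers is 4.75.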
This is just one example of what can be pulled from the transaction data. Many more inferences can
certainly be made by using different queries. We found that there were many lessons to be learned
from this project. Databases can be a huge advantage when sorting through large amounts of data in
search of inferences. The ability to select on specific criteria can reduce a large data set to a
small, comprehensible one. In the transaction data, we found strong evidence of the demographic
makeup of the store's region. Combined with this demographic data, buying tendencies made it easy to
form assumptions about which items sell and which do not. This project proved very beneficial for
learning hands-on the importance of data analytics and databases. This experience will be extremely
useful in our future fields.

Group Roles
Matt Murphy Part 1 - #(2,3,4,5)
Brian Leap Part 2 - #(1,2,3)
Troy McCrum Part 1 - #(1,6) and Part 2 - #(2)

Appendix
Part 1:
Figure 1: Equations used for question 1, prediction errors.

Table 1: Percentage of people from each series (or cluster) grouped by how many family members they
have (see key in Data Dictionary)

Family size   C1     C2     C3     C4     C5     C6
0             0.00   0.00   0.00   0.00   0.00   0.00
1             0.12   0.06   0.14   0.07   0.17   0.14
2             0.36   0.63   0.34   0.30   0.33   0.46
3             0.24   0.00   0.34   0.23   0.25   0.03
4             0.15   0.19   0.06   0.33   0.25   0.19
5             0.08   0.00   0.09   0.03   0.00   0.05
6             0.05   0.13   0.02   0.03   0.00   0.14

Table 2: Percentage of people from each series (cluster) grouped by how much they make annually (see
key in Data Dictionary)

Income   C1         C2         C3
0        0.018315   0          0
1        0.087912   0.105263   0.078947
2        0.069597   0          0.052632
3        0.058608   0.078947   0.105263
4        0.120879   0.052632   0.184211
5        0.087912   0.105263   0.157895
6        0.164835   0.052632   0.236842
7        0.153846   0.210526   0.078947
8        0.128205   0.184211   0.078947
9        0.03663    0.052632   0.026316
10       0.029304   0.052632   0
11       0.043956   0.105263   0

Table 3: Percentage of people from each series (cluster) grouped by Race (see key in Data Dictionary)

Race   C1         C2         C3
0      0          0          0
1      0.876364   0.885246   0.769231
2      0.105455   0.114754   0.230769
3      0.010909   0          0
4      0          0          0
5      0.007273   0          0

Table 4: Amount of TVs owned by Customers

# of TVs       # of Customers
0              8
1              63
2              126
3              75
No Response    77

Table 5: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from
Crackers vs. Cookies

                                 C1      C2      C3      C4      C5      C6
Newspaper                        0.4895  0.5000  0.5000  0.5667  0.5833  0.4054
Subscription to Better H&G       0.2632  0.3750  0.1875  0.1667  0.1667  0.2432
Subscription to Good House       0.1158  0.1250  0.1250  0.1000  0.3333  0.0541
Subscription to Ladies HJ        0.1211  0.1875  0.1406  0.1000  0.0833  0.1351
Subscription to McCalls          0.1263  0.1250  0.1719  0.0333  0.2500  0.1622
Subscription to Redbook          0.0895  0.1250  0.0313  0.0333  0.1667  0.0270
Subscription to Reader's Digest  0.2263  0.1875  0.3281  0.3667  0.0833  0.3243
Subscription to Cosmopolitan     0.0211  0.0625  0.0000  0.0000  0.0833  0.0541
Subscription to TV Guide         0.1684  0.0625  0.1563  0.1333  0.0833  0.1622
Subscription to People           0.0105  0.0000  0.0781  0.0333  0.1667  0.0000
Subscription to Glamour          0.0158  0.0625  0.0000  0.0333  0.0833  0.0000
Subscription to Time             0.0789  0.0625  0.0625  0.0000  0.0833  0.0811
Subscription to Newsweek         0.0579  0.0000  0.0469  0.0667  0.0833  0.0270

Table 6: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Eggs
vs. Cereal

                                 C1      C2      C3
Newspaper subscriber             0.4652  0.5789  0.6053
Subscription to Better H&G       0.2491  0.1579  0.2632
Subscription to Good House       0.1062  0.1842  0.1316
Subscription to Ladies HJ        0.1136  0.1316  0.2105
Subscription to McCalls          0.1465  0.0526  0.1316
Subscription to Redbook          0.0733  0.0789  0.0526
Subscription to Reader's Digest  0.2601  0.2632  0.2632
Subscription to Cosmopolitan     0.0256  0.0000  0.0263
Subscription to TV Guide         0.1575  0.1053  0.1842
Subscription to People           0.0293  0.0000  0.0526
Subscription to Glamour          0.0183  0.0000  0.0263
Subscription to Time             0.0806  0.0526  0.0000
Subscription to Newsweek         0.0623  0.0263  0.0000

Table 7: Percentage of people who subscribe to Newspaper/Magazines from Clusters formed from Pizza
vs. Hotdogs

                                 C1      C2      C3
Newspaper                        0.4909  0.5246  0.3846
Subscription to Better H&G       0.2436  0.2459  0.1538
Subscription to Good House       0.0982  0.2295  0.0000
Subscription to Ladies HJ        0.1091  0.2295  0.0000
Subscription to McCalls          0.1200  0.1967  0.1538
Subscription to Redbook          0.0655  0.0984  0.0769
Subscription to Reader's Digest  0.2582  0.3115  0.0769
Subscription to Cosmopolitan     0.0218  0.0328  0.0000
Subscription to TV Guide         0.1455  0.1967  0.1538
Subscription to People           0.0218  0.0656  0.0000
Subscription to Glamour          0.0182  0.0164  0.0000
Subscription to Time             0.0764  0.0492  0.0000
Subscription to Newsweek         0.0545  0.0492  0.0000
