
Available online at www.sciencedirect.com

Expert Systems with Applications 34 (2008) 1200–1209
www.elsevier.com/locate/eswa
doi:10.1016/j.eswa.2006.12.025

A recommender system using GA K-means clustering in an online shopping market

Kyoung-jae Kim a,*, Hyunchul Ahn b

a Department of Management Information Systems, Dongguk University, 3-26 Pil-Dong, Jung-Gu, Seoul 100-715, South Korea
b Graduate School of Management, Korea Advanced Institute of Science and Technology, 207-43 Cheongrangri-Dong, Dongdaemun-Gu, Seoul 130-722, South Korea

Abstract

The Internet is emerging as a new marketing channel, so understanding the characteristics of online customers' needs and expectations is considered a prerequisite for activating the consumer-oriented electronic commerce market. In this study, we propose a novel clustering algorithm based on genetic algorithms (GAs) to effectively segment the online shopping market. In general, GAs are believed to be effective on NP-complete global optimization problems, and they can provide good near-optimal solutions in reasonable time. Thus, we believe that a clustering technique with GA can provide a way of finding the relevant clusters more effectively. The research in this paper applied K-means clustering whose initial seeds are optimized by GA, which is called GA K-means, to a real-world online shopping market segmentation case. In this study, we compared the results of GA K-means to those of a simple K-means algorithm and self-organizing maps (SOM). The results showed that GA K-means clustering may improve segmentation performance in comparison to other typical clustering algorithms. In addition, our study validated the usefulness of the proposed model as a preprocessing tool for recommendation systems.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Recommender system; Genetic algorithms; Self-organizing maps; Market segmentation; Case-based reasoning

1. Introduction

Since the Internet became popular, the consumer-oriented electronic commerce market has grown so large that companies are now convinced of the importance of understanding this new, emerging market. It is becoming more important for companies to analyze and understand the needs and expectations of their online users or customers, because the Internet is one of the most effective media for gathering, disseminating and utilizing information about its users or customers. Thus, it is easier to extract knowledge from the shopping process and create new business opportunities in the Internet environment.

Market segmentation is one of the ways in which such knowledge can be represented. It attempts to discover the classes into which consumers can be naturally grouped, according to the information available (Velido, Lisboa, & Meehan, 1999). It can be the basis for effective targeting and predicting prospects through the identification of the proper segments. Although much of the marketing literature has proposed various market segmentation techniques, clustering techniques are the ones most frequently used in practice (Wedel & Kamakura, 1998), and K-means clustering is the most frequently used market segmentation technique among them (Gehrt & Shim, 1998; Kuo, Chang, & Chien, 2004). However, the major drawback of K-means clustering is that it often falls into local optima, and the result largely depends on the initial cluster centers. Prior studies pointed out this limitation and tried to integrate K-means clustering with global search techniques, including genetic algorithms (see Babu & Murty, 1993; Kuo, Liao, & Tu, 2005; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena, Lozano, & Larranaga, 1999).

* Corresponding author. Tel.: +82 2 2260 3324; fax: +82 2 2260 3684. E-mail address: kjkim@dongguk.edu (K.-j. Kim).


In this paper, we apply a hybrid of K-means clustering and genetic algorithms to carry out an exploratory segmentation of an online shopping market. To find the most effective clustering method for this kind of data, we adopt a number of clustering methods and compare the performance of each by using our suggested performance criteria. In addition, we validate the usefulness of our proposed model in a real-world application.

The rest of this paper is organized as follows: the next section reviews two traditional clustering algorithms, K-means and the self-organizing map (SOM), along with the performance criteria. Section 3 proposes the GA approach to optimizing K-means clustering, and Section 4 describes the data and the experiments; the empirical results are also summarized and discussed there. In the final section, conclusions and the limitations of this study are presented.

2. Clustering algorithms

Cluster analysis is an effective tool in scientific or managerial inquiry. It groups a set of data in d-dimensional feature space so as to maximize the similarity within clusters and minimize the similarity between different clusters. There are various clustering methods in wide use. Among them, we apply two popular methods, K-means and SOM, and a novel hybrid method to market segmentation. Before providing a brief description of each method, the following assumptions are necessary: a given population consists of n elements described by m attributes, and it is partitioned into K clusters. $X_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ represents the vector of the m attributes of element i.

2.1. K-means clustering algorithm

The K-means method is a widely used clustering procedure that searches for a nearly optimal partition with a fixed number of clusters, using an iterative hill-climbing algorithm. The process of K-means clustering is as follows:

(1) The initial seeds with the chosen number of clusters, K, are selected, and an initial partition is built by using the seeds as the centroids of the initial clusters.
(2) Each record is assigned to the nearest centroid, thus forming a cluster.
(3) Keeping the same number of clusters, the new centroid of each cluster is calculated.
(4) Steps (2) and (3) are iterated until the clusters stop changing or the stop conditions are satisfied.

The K-means algorithm has been popular because of its ease and simplicity of application. However, it also has some drawbacks. First, it does not deal well with overlapping clusters, and the clusters can be pulled out of center by outliers. Also, the clustering result may depend on the initial seeds, but there is no mechanism to optimize them. In addition, it may converge to a local minimum under certain conditions because it works as a hill-climbing strategy. As described in the next section, GA can find optimal clusters by minimizing the extrinsic clustering metric.
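For concreteness, the four steps above translate directly into code. The fragment below is a minimal illustrative Python sketch, not the authors' implementation (the paper's K-means was written in VBA; see Section 4.1); the data matrix, the seed matrix and the iteration cap are assumed inputs.

```python
import numpy as np

def k_means(X, seeds, max_iter=100):
    """Plain K-means. X: (n, m) data matrix; seeds: (K, m) initial centroids."""
    centroids = seeds.copy()
    for _ in range(max_iter):
        # Step (2): assign each record to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step (3): recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        # Step (4): stop once the clusters no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that the result depends entirely on the seeds passed in; this is exactly the weakness the GA of Section 3 addresses.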
2.2. Self-organizing map (SOM)

SOM is a clustering algorithm based on the unsupervised neural network model. Since it was suggested by Kohonen (1982), it has been applied in many studies because of its good performance. The basic SOM model consists of two layers, an input layer and an output layer. When the training set is presented to the network, the values flow forward through the network to the units in the output layer. The neurons in the output layer are arranged in a grid; the units in the output layer compete with each other, and the one with the highest value wins. The process of SOM is as follows:

(1) Initialize the weights as small random numbers.
(2) Put an input sample, $X_i$, into the SOM network, and calculate the distances between the weight vectors, $W_j = (w_{j1}, w_{j2}, \ldots, w_{jm})$, and the input sample, $X_i$. Then select the neuron whose distance to $X_i$ is the shortest, as in Eq. (1). The selected neuron is called the 'winner':

$$\min D(j) = \sum_i (w_{ji} - x_i)^2 \qquad (1)$$

(3) Update the weights of the winner as well as its neighbors by the following equation, where $\alpha$ is the learning rate:

$$W_j^{NEW} = W_j + \alpha \, \| X_i - W_j \| \qquad (2)$$

(4) Iterate Steps (2) and (3) until the stop criterion is satisfied.

In spite of several excellent applications, SOM has some limitations that hinder its performance. In particular, just like other neural-network-based algorithms, SOM has no mechanism to determine the number of clusters, the initial weights, or the stopping conditions.
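The two equations amount to one retrieval step and one update step. The sketch below is a minimal illustration of a single SOM training step on a two-dimensional output grid; note that Eq. (2) is printed with a norm, while the conventional SOM update (used here) moves the winner along the vector difference between the input and the weight. The grid shape, learning rate and neighborhood radius are assumptions.

```python
import numpy as np

def som_step(weights, x, alpha=0.5, radius=1):
    """One SOM update. weights: (rows, cols, m) grid of weight vectors; x: m-vector."""
    # Eq. (1): the winner is the unit with the smallest squared distance to x.
    d = ((weights - x) ** 2).sum(axis=2)
    row, col = np.unravel_index(d.argmin(), d.shape)
    # Eq. (2): move the winner and its grid neighbors toward the input.
    for r in range(weights.shape[0]):
        for c in range(weights.shape[1]):
            if max(abs(r - row), abs(c - col)) <= radius:
                weights[r, c] += alpha * (x - weights[r, c])
    return weights
```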
tances between the means and the observations in each
The K-means algorithm has been popular because of its cluster. Eq. (3) represents the equation for the intraclass
easiness and simplicity for application. However, it also has inertia for the given K clusters (Michaud, 1997)
some drawbacks. First, it does not deal well with overlap- 1X 1XXX m
ping clusters and the clusters can be pulled out of center by F ðkÞ ¼ nK I K ¼ ðX iP  X KP Þ ð3Þ
n K n K i2CK P ¼1
outliers. Also, the clustering result may depend on the ini-
1202 K.-j. Kim, H. Ahn / Expert Systems with Applications 34 (2008) 1200–1209

where n is the number of total observations, CK is set of the adjacent genes on the chromosome of a parent. The two-
Kth cluster, XiP is the value of the attribute P for observa- point crossover involves two cuts in each chromosome.
tion i, and X KP is the mean
P of the attribute P in the Kth The uniform crossover allows two parent strings to
cluster that is X KP ¼ n1K i2CK X iP . produce two children. It permits great flexibility in the
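As a worked illustration, Eq. (3) averages the squared deviations of every observation from its own cluster mean. A minimal Python sketch, assuming the labels and the cluster means are already available:

```python
import numpy as np

def intraclass_inertia(X, labels, centroids):
    """Eq. (3): per-observation average of squared deviations from the cluster
    means, summed over the m attributes. Lower values mean tighter clusters."""
    n = len(X)
    return sum(((X[labels == k] - centroids[k]) ** 2).sum()
               for k in range(len(centroids))) / n
```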
3. GA K-means clustering algorithm

As indicated in Section 2.1, the K-means algorithm has no mechanism for choosing appropriate initial seeds, yet selecting different initial seeds may generate huge differences in clustering results, especially when the target sample contains many outliers. In addition, random selection of initial seeds often causes the clustering quality to fall into local optima (Bradley & Fayyad, 1998). So it is very important to select appropriate initial seeds in the traditional K-means clustering method. To overcome this critical limitation, we propose GA as the optimization tool for the initial seeds in the K-means algorithm. We call our proposed model the GA K-means clustering method.

Genetic algorithms are stochastic search techniques that can search large and complicated spaces. They are based on biology, including natural genetics and evolutionary principles. In particular, GAs are suitable for parameter optimization problems with an objective function subject to various hard and soft constraints (Shin & Han, 1999). The GA basically explores a complex space in an adaptive way, guided by the biological mechanisms of selection, crossover, and mutation. The algorithm uses natural selection – survival of the fittest – to solve optimization problems (Kim, 2004).

The first step of the GA is problem representation. The problem must be represented in a suitable form to be handled by the GA; thus, it is described in terms of genetic code, like DNA chromosomes. The GA often works with a form of binary coding. Once the problem is coded as chromosomes, a population – a set of chromosomes – is initialized, and each chromosome within the population gradually evolves through biological operations. There are no general rules for determining the population size, but population sizes of 100–200 are commonly used in GA research. Once the population size is chosen, the initial population is randomly generated (Bauer, 1994). After the initialization step, each chromosome is evaluated by a fitness function. According to the value of the fitness function, the chromosomes associated with the fittest individuals will be reproduced more often than those associated with unfit individuals (Davis, 1994).

The GA works with three operators that are iteratively applied. The selection operator determines which individuals may survive (Hertz & Kobler, 2000). The crossover operator allows the search to fan out in diverse directions looking for attractive solutions, and permits chromosomal material from different parents to be combined in a single child. There are three popular crossover methods: single-point, two-point, and uniform. The single-point crossover makes only one cut in each chromosome and selects two adjacent genes on the chromosome of a parent; the two-point crossover involves two cuts in each chromosome; the uniform crossover allows two parent strings to produce two children and permits great flexibility in the way strings are combined. In addition, the mutation operator arbitrarily alters one or more components of a selected chromosome: mutation randomly changes a gene on a chromosome, providing the means for introducing new information into the population. Through these operators, the GA tends to converge on an optimal or near-optimal solution (Wong & Tan, 1994).

The GA is popularly used to improve the performance of artificial intelligence techniques. Recently, the GA has been employed to mitigate the limitations of typical clustering algorithms, including the K-means algorithm. Lletí, Ortiz, Sarabia, and Sánchez (2004) proposed using an integrated genetic algorithm and K-means clustering to select relevant input variables; their method was based on the use of GA to select those variables important for clarifying the presence of groups in data. Other studies tried to integrate GA and the K-means algorithm to overcome the possibility of convergence to a local minimum solution (see Babu & Murty, 1993; Kuo, An, Wang, & Chung, 2006; Maulik & Bandyopadhyay, 2000; Murthy & Chowdhury, 1996; Pena et al., 1999).

3.1. The process of the GA K-means algorithm

GA K-means uses GA to select optimal or sub-optimal initial seeds for K-means clustering. Fig. 1 shows the overall framework of GA K-means clustering. The detailed process is as follows:

(1) In the first step, the system generates the initial population that will be used to find globally optimal initial seeds. The values of the chromosomes in the population are initialized to random values before the search process. To enable the GA to find the optimal initial seeds, the structure of a chromosome must be designed properly as a form of binary strings. A detailed explanation of the chromosome design is presented in Section 3.2.
(2) The second step is the K-means clustering process described in Section 2.1. In our study, the maximum number of iterations for centroid adjustment in K-means is set at 10. After the K-means process, the value of the fitness function is updated. To find optimal initial seeds, we apply 'intraclass inertia' as the fitness function; that is, the objective of our GA K-means is to find the initial seeds for which intraclass inertia is minimized after K-means clustering finishes.
(3) In this stage, the GA performs genetic operations such as selection, crossover and mutation on the current population, producing a new generation. After that, Steps (2) and (3) are repeated until the stopping conditions are satisfied.

[Fig. 1. The overall framework of GA K-means. The flowchart reads: (1) design the structure of the chromosomes; (2) generate the initial population and define the fitness function; (3) perform the K-means process according to the given initial seeds, drawing cases from the user database; (4) calculate the intraclass inertia as the fitness function for each chromosome; (5) apply genetic operators and produce a new generation; (6) if the stopping criteria are not satisfied, return to the K-means step; otherwise (7) finish.]
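The loop in Fig. 1 can be sketched end to end. The fragment below is an illustrative reconstruction that reuses the k_means and intraclass_inertia sketches above; for brevity it evolves real-valued seeds in [0, 1] directly, whereas the paper's actual encoding is the binary one of Section 3.2 and its implementation used Palisade's Evolver in Excel. The population size (200), generation limit (20) and operator rates match Section 4.1; the selection scheme (keep the better half) is an assumption.

```python
import numpy as np

def ga_kmeans(X, K, pop_size=200, generations=20,
              crossover_rate=0.7, mutation_rate=0.1, seed=0):
    """GA K-means sketch: each individual encodes K candidate initial centroids,
    and fitness is the intraclass inertia after a 10-iteration K-means run."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]                                # X assumed min-max normalized to [0, 1]
    pop = rng.random((pop_size, K, m))            # Steps 1-2: random initial population

    def fitness(individual):
        labels, centroids = k_means(X, individual, max_iter=10)   # Step 3
        return intraclass_inertia(X, labels, centroids)           # Step 4

    for _ in range(generations):                  # Steps 5-6: evolve generation by generation
        scores = np.array([fitness(ind) for ind in pop])
        parents = pop[scores.argsort()[: pop_size // 2]]   # lower inertia = fitter
        children = []
        while len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            if rng.random() < crossover_rate:     # uniform crossover, gene by gene
                mask = rng.random(a.shape) < 0.5
                child = np.where(mask, a, b)
            else:
                child = a.copy()
            mutated = rng.random(child.shape) < mutation_rate
            child[mutated] = rng.random(mutated.sum())     # mutation: redraw the gene
            children.append(child)
        pop = np.array(children)
    best = min(pop, key=fitness)                  # Step 7: best seeds found
    return k_means(X, best)                       # final clustering from those seeds
```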

3.2. Chromosome encoding

The system searches the space to find optimal or near-optimal values of the features that make up the centroids of each cluster. To apply GA to the search for these optimal values, they have to be coded on a chromosome as a form of binary strings. The structure of the chromosomes for GA K-means is presented in Fig. 2.

As shown in Fig. 2, the length of each chromosome for GA K-means depends on the types of the features. In the case of the binary selection variables whose value is 0 or 1, the length of each feature is just 1 bit. But the features that have continuous values require more bits to express them in a chromosome.

[Fig. 2. The structures of the chromosomes for GA K-means. The data set contains two types of features: Type 1, binary selection features (0 or 1), each encoded in a single bit; and Type 2, continuous features ranging from 0 to 1, each encoded in 14 bits. Each cluster (Cluster 1 through Cluster n) is represented by one concatenated bit string covering the feature values V1 through V42.]



Here, we set the continuous feature values to a precision of 1/10,000. When we apply min–max normalization to the continuous features, they become values ranging from 0 to 1. In this case, 14 binary bits are required to express them with 1/10,000th precision because 8192 = 2^13 < 10,000 ≤ 2^14 = 16,384. These 14-bit binary numbers are transformed into decimal floating-point numbers ranging from 0 to 1 by applying Eq. (4):

$$x' = \frac{x}{2^{14} - 1} = \frac{x}{16383} \qquad (4)$$

where x is the decimal number of the binary code for each feature weight (Michalewicz, 1996).

For example, the binary code for the value of the 37th feature of the first sample chromosome in Fig. 2 (the chromosome that represents Cluster 1) is (10110010010110)₂. Its decimal value is (11414)₁₀, which is interpreted as 11414/16383 = 0.696697796 ≈ 0.6967.

In this study, we use a real-world data set from an Internet shopping mall that consists of 36 selection variables and 6 continuous variables. For the design of the chromosome for this data set, (36 × 1 + 6 × 14) × K bits are required, where K is the number of clusters.
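To make the encoding concrete, the following sketch decodes one cluster's bit string under the layout just described (36 one-bit selection features followed by 6 continuous features of 14 bits each) and reproduces the worked example; the function name and input format are illustrative assumptions.

```python
def decode_centroid(bits, n_binary=36, n_continuous=6):
    """Decode one cluster centroid from its bit string (Section 3.2)."""
    values = [float(b) for b in bits[:n_binary]]            # binary features: 0 or 1
    for i in range(n_continuous):
        gene = bits[n_binary + 14 * i : n_binary + 14 * (i + 1)]
        x = int("".join(str(b) for b in gene), 2)           # 14-bit binary -> decimal
        values.append(x / 16383)                            # Eq. (4): x / (2**14 - 1)
    return values

# The paper's example: (10110010010110)2 = 11414, and 11414/16383 ~= 0.6967
assert abs(int("10110010010110", 2) / 16383 - 0.6967) < 1e-4
```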
4. Experimental design and results

4.1. Experimental design

We apply three clustering algorithms – simple K-means, SOM and GA K-means – to our data, and try to segment the Internet users into 5 clusters (that is, K = 5). In the case of SOM, we set the learning rate (α) at 0.5.

For the controlling parameters of the GA search, the population size is set at 200 organisms. The crossover rate is set at 0.7 while the mutation rate is set at 0.1. This study performs the crossover using a uniform crossover routine: the uniform crossover method is considered better at preserving the schema, and can generate any schema from the two parents, whereas the single-point and two-point crossover methods may bias the search with the irrelevant position of the features. For the mutation method, this study generates a random number between 0 and 1 for each of the features in the organism; if a gene receives a number that is less than or equal to the mutation rate, that gene is mutated. As the stopping condition, only 4000 trials (20 generations) are permitted.

Simple K-means was conducted with SPSS for Windows 11.0 and SOM with NeuroShell 2 R4.0. GA K-means was developed using Microsoft Excel 2002 and Palisade Software's Evolver Version 4.06, and the K-means algorithm within GA K-means was implemented in VBA (Visual Basic for Applications) in Microsoft Excel 2002.
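The operators just described translate into short routines. The sketch below illustrates a uniform crossover and the gene-wise mutation rule (draw a number in [0, 1] per gene and mutate when the draw does not exceed the rate) for the binary chromosomes of Section 3.2; treating mutation as a bit flip is an assumption.

```python
import random

def uniform_crossover(parent_a, parent_b):
    """Uniform crossover: each gene comes from either parent with equal
    probability, producing two children."""
    child_a, child_b = [], []
    for ga, gb in zip(parent_a, parent_b):
        if random.random() < 0.5:
            ga, gb = gb, ga
        child_a.append(ga)
        child_b.append(gb)
    return child_a, child_b

def mutate(chromosome, mutation_rate=0.1):
    """Gene-wise mutation: flip a bit when its draw is at most the rate."""
    return [1 - g if random.random() <= mutation_rate else g for g in chromosome]
```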
4.2. Experimental results

In this study, we used real-world Internet shopping mall data to assess the performance of the proposed model. The research data were collected from an online diet portal site in Korea which provides all kinds of services for online dieting, such as information provision, community services and a shopping mall. In the case of a diet shopping mall, customers generally have a clear objective, and they have a strong demand for personalized care and services. Thus, they are usually open-minded about providing their personal information in order to get appropriate service. As a result, the target company of our research possesses detailed and accurate customer information, and it has a desire to use that information as a source of direct marketing for selling its products. Consequently, we tried to build a recommendation model for the users of this web site.

Fig. 3. The system architecture of the recommendation system.



There are two prevalent approaches for recommendation – collaborative filtering (CF) and content-based (CB) methods. Both approaches have weaknesses as well as strengths, but CF has been used more frequently in real-world applications.

CF works by collecting user feedback in the form of ratings for items in a given domain, and exploiting similarities and differences among the profiles of several users in determining how to recommend items. In general, CF can make accurate recommendations in homogeneous domains, and it requires lower computational complexity when the collected data set is small. Moreover, it can recommend items in novel territories because it determines the items to be recommended by considering only the similarity between users, not the characteristics of products (Konstan et al., 1997; Pazzani, 1999).

Nonetheless, it also has many critical limitations. Usually, it cannot recommend newly added items, and it requires high computational complexity as the user base increases. Moreover, it mainly relies on annotations by users and produces meaningless results in heterogeneous domains such as food. The most critical problem, however, is that it cannot make any recommendations when customers have not bought items repeatedly, which is generally called the 'sparsity problem' (Cho & Kim, 2004; Cho, Kim, & Kim, 2002; Good et al., 1999; Herlocker, Konstan, & Riedl, 2000; Kim, Cho, Kim, Kim, & Suh, 2002; Resnick, Iacovou, Suchak, & Bergstrom, 1994).

These limitations are especially critical for Internet shopping malls with specialized items. In the case of our target shopping mall, which is specialized for dieters, 80.44% of all buyers had purchased just one time at the time of this research. Thus, it is impossible to apply CF in our target shopping mall because the buying behavior of the customers cannot be patternized. Consequently, we applied a simple case-based reasoning (CBR) model in order to predict the items that are likely to be purchased by users. However, CBR also has a limitation similar to that of CF: the requirement of high computational complexity as the user base increases. To resolve this problem, clustering methods have been applied as a preprocessing tool for CBR (Kim & Han, 2001). In this study, we used our proposed clustering algorithm – GA K-means – as a preprocessing tool for CBR. Fig. 3 shows the system architecture of the recommendation system using GA K-means and CBR.
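The division of labor between the two components can be sketched as follows: GA K-means partitions the case base offline, and at recommendation time CBR retrieval is confined to the single nearest cluster. The fragment below is a minimal illustration; the data structures, the Euclidean similarity measure and the frequency-based ranking are assumptions, not the paper's specification.

```python
import numpy as np

def recommend(user_vec, centroids, clusters, top_n=5, k_neighbors=10):
    """Cluster-indexed CBR sketch. centroids: (K, m) from GA K-means;
    clusters: per-cluster dicts with a 'features' matrix and an 'items'
    list of purchase lists, one entry per stored case."""
    # 1. Route the new user to the nearest cluster (the preprocessing payoff:
    #    retrieval now scans one segment instead of the whole case base).
    c = int(np.linalg.norm(centroids - user_vec, axis=1).argmin())
    members = clusters[c]
    # 2. Retrieve the k most similar past cases within that cluster.
    dists = np.linalg.norm(members["features"] - user_vec, axis=1)
    neighbors = dists.argsort()[:k_neighbors]
    # 3. Rank items by how often those neighbors purchased them.
    counts = {}
    for i in neighbors:
        for item in members["items"][i]:
            counts[item] = counts.get(item, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:top_n]
```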

Table 1
Features and their descriptions
Code Description Range
AGE Age Continuous (yrs)
ADD0 Residences are located in Seoul (the capital of South Korea) 0: False/1: True
ADD1 Residences are located in big cities 0: False/1: True
ADD2 Residences are located in the places other than Seoul and big cities 0: False/1: True
OCCU0 Customers work for companies 0: False/1: True
OCCU1 Customers are housewives 0: False/1: True
OCCU2 Customers are students 0: False/1: True
OCCU3 Customers run their own businesses 0: False/1: True
OCCU4 Customers are jobless 0: False/1: True
SEX Gender 0: Male/1: Female
MARRIED Customers got married 0: False/1: True
LOSS1 Customers’ need to lose weight around their bellies 0: Not exist/1: Exist
LOSS2 Customers’ need to lose weight around their hips 0: Not exist/1: Exist
LOSS3 Customers’ need to lose weight around their waists 0: Not exist/1: Exist
LOSS4 Customers’ need to lose weight around their faces 0: Not exist/1: Exist
LOSS5 Customers’ need to lose weight around their backs 0: Not exist/1: Exist
LOSS6 Customers’ need to lose weight around their legs and thighs 0: Not exist/1: Exist
LOSS7 Customers’ need to lose weight around their arms 0: Not exist/1: Exist
LOSS8 Customers’ need to lose weight around their chests 0: Not exist/1: Exist
LOSS9 Customers’ need to lose weight around their shoulders 0: Not exist/1: Exist
PUR0 Customers diet for beauty 0: False/1: True
PUR1 Customers diet for urgent purpose such as employment, marriage 0: False/1: True
PUR2 Customers diet for the reasons other than PUR0, PUR1 0: False/1: True
HEIGHT Height Continuous (m)
WEIGHT Weight Continuous (kg)
WAIST Waist Continuous (inch)
OBJ Customers’ target weight to reduce Continuous (kg)
BMI Body mass index (BMI) is a measure of body fat based on height and weight that applies Continuous (kg/m²)
to both adult men and women. It is calculated as BMI (kg/m²) = weight (kg) / (height (m))²
D1 Customers experienced ‘edema’ 0: False/1: True
D2 Customers experienced ‘constipation’ 0: Not exist/1: Exist
D3 Customers experienced ‘hypertension’ or ‘high blood fat’ 0: Not exist/1: Exist
D4 Customers experienced ‘anemia’ or ‘malnutrition’ 0: Not exist/1: Exist
E01 Prior experience with ‘functional diet food’ 0: Not exist/1: Exist
E02 Prior experience with ‘diet drugs’ 0: Not exist/1: Exist
E03 Prior experience with ‘diuretics’ or the drugs generating ‘diarrhea’ 0: Not exist/1: Exist
E04 Prior experience with ‘hunger’ 0: Not exist/1: Exist
E05 Prior experience with ‘folk remedies’ such as ‘one food diet’ and ‘Denmark diet’ 0: Not exist/1: Exist
E06 Prior experience with ‘fasting treatment’ 0: Not exist/1: Exist
E07 Prior experience with ‘diet clinic’ 0: Not exist/1: Exist
E08 Prior experience with ‘oriental diet treatment’ 0: Not exist/1: Exist
E09 Prior experience with ‘toning system’ 0: Not exist/1: Exist
E10 Prior experience with other diet methods other than E01-E09 0: Not exist/1: Exist

The collected data in this case covered 3298 customers. They contained users' demographic variables and the items that the users had purchased. We collected 42 independent variables in total, including demographic and other personal information. Table 1 presents detailed information on the features.

In order to determine the most appropriate clustering algorithm for this data set, we applied simple K-means, SOM, and GA K-means all together, and focused on finding the algorithm which produced the best result. We used intraclass inertia as the performance measure. Table 2 shows the summarized performance of the three methods. As can be seen, GA K-means is the best of the three comparative methods for this data set.

Table 2
Intraclass inertia of each clustering method

Clustering method     K-means    SOM       GA K-means
Intraclass inertia    2.1467     2.1340    2.1288

To determine whether the differences in intraclass inertia between the proposed model and the other comparative models are statistically significant, we applied the t-test for paired samples. As explained in Section 2.3, intraclass inertia is the mean of the distances between the cluster center and each sample, so we can apply a statistical test for comparing the means of two samples. The paired-samples t-test is usually applied when the two sets of values come from the same sample, such as in a pre-test/post-test situation; it is sometimes called the t-test for correlated or dependent samples (Green, Salkind, & Akey, 2000). In this study, the null hypothesis is $H_0: IN_i - IN_j = 0$ and the alternative hypothesis is $H_a: IN_i - IN_j \neq 0$, where $i = 1, 2$ and $j = 2, 3$, and $IN_k$ denotes the intraclass inertia (i.e., the average distance between the cluster center and each sample) for clustering method k. Table 3 shows the results of the paired-samples t-test. As shown in Table 3, GA K-means outperforms all of the comparative models, including simple K-means and SOM, at the 1% statistical significance level.

Table 3
t-values of the paired-samples t-test

            SOM       GA K-means
K-means     5.534a    8.233a
SOM                   3.588a

a Significant at the 1% level.
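The paired test itself is routine to reproduce. A minimal sketch with placeholder numbers (not the study's data), assuming one distance value per observation for each method:

```python
from scipy import stats

# Placeholder per-observation distances for two clustering methods; the paper
# pairs such values across K-means, SOM and GA K-means on the same 3298 cases.
dist_kmeans = [2.31, 1.98, 2.45, 2.10, 2.27]
dist_ga     = [2.12, 1.90, 2.33, 2.05, 2.19]

t_stat, p_value = stats.ttest_rel(dist_kmeans, dist_ga)   # H0: mean difference is 0
print(f"t = {t_stat:.3f}, two-tailed p = {p_value:.4f}")
```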
In addition, Chi-square analysis and ANOVA (analysis of variance) were also applied to examine the discriminant power of the three clustering methods. Table 4 presents the results of the Chi-square analysis, and Table 5 presents the results of the ANOVA. These results show that the five segments produced by each of the three clustering methods differed significantly with respect to almost all of the independent variables. Thus, we can conclude that GA K-means may be the most appropriate preprocessing tool for this data set. Fig. 4 displays the clustering result of GA K-means on two variables (AGE and WAIST), indicating that there are clearly five clusters.

Table 4
The results of the Chi-square analysis (Chi-square value / Asy. Sig. per method)

Code       K-means              SOM                  GA K-means
ADD0       96.368 / 0.000       1348.831 / 0.000     10.235 / 0.037
ADD1       13.468 / 0.009       270.957 / 0.000      9.278 / 0.055
ADD2       148.332 / 0.000      516.466 / 0.000      14.492 / 0.006
OCCU0      57.055 / 0.000       12.248 / 0.016       72.685 / 0.000
OCCU1      450.737 / 0.000      83.418 / 0.000       167.636 / 0.000
OCCU2      1079.419 / 0.000     372.907 / 0.000      1458.239 / 0.000
OCCU3      1.937 / 0.747        13.663 / 0.008       20.005 / 0.000
OCCU4      603.448 / 0.000      202.115 / 0.000      727.388 / 0.000
SEX        167.612 / 0.000      214.696 / 0.000      184.098 / 0.000
MARRIED    661.327 / 0.000      216.406 / 0.000      351.134 / 0.000
LOSS1      757.606 / 0.000      478.591 / 0.000      593.393 / 0.000
LOSS2      445.300 / 0.000      390.210 / 0.000      416.854 / 0.000
LOSS3      1075.998 / 0.000     724.511 / 0.000      740.688 / 0.000
LOSS4      380.218 / 0.000      443.821 / 0.000      473.003 / 0.000
LOSS5      696.827 / 0.000      1305.146 / 0.000     1189.162 / 0.000
LOSS6      339.398 / 0.000      414.516 / 0.000      377.712 / 0.000
LOSS7      646.072 / 0.000      658.696 / 0.000      611.935 / 0.000
LOSS8      220.424 / 0.000      389.492 / 0.000      447.096 / 0.000
LOSS9      522.761 / 0.000      1145.626 / 0.000     1065.225 / 0.000
PUR0       1555.298 / 0.000     1666.533 / 0.000     1665.739 / 0.000
PUR1       141.533 / 0.000      89.095 / 0.000       134.972 / 0.000
PUR2       1772.369 / 0.000     1643.852 / 0.000     2162.927 / 0.000
D1         144.186 / 0.000      157.686 / 0.000      156.771 / 0.000
D2         96.115 / 0.000       169.692 / 0.000      99.773 / 0.000
D3         21.630 / 0.000       19.825 / 0.001       28.130 / 0.000
D4         20.424 / 0.000       30.589 / 0.000       37.742 / 0.000
E01        158.875 / 0.000      105.290 / 0.000      110.969 / 0.000
E02        237.783 / 0.000      113.755 / 0.000      97.106 / 0.000
E03        54.677 / 0.000       38.590 / 0.000       47.942 / 0.000
E04        153.452 / 0.000      116.398 / 0.000      144.575 / 0.000
E05        86.949 / 0.000       90.472 / 0.000       72.661 / 0.000
E06        27.896 / 0.000       15.372 / 0.004       17.544 / 0.002
E07        14.877 / 0.005       16.113 / 0.003       26.981 / 0.000
E08        51.238 / 0.000       25.048 / 0.000       32.696 / 0.000
E09        38.677 / 0.000       30.369 / 0.000       18.063 / 0.001
E10        27.819 / 0.000       18.089 / 0.001       21.357 / 0.000

Table 5
The results of the ANOVA (F-value / Asy. Sig. per method)

Code       K-means            SOM                GA K-means
AGE        93.444 / 0.000     209.880 / 0.000    217.018 / 0.000
HEIGHT     31.631 / 0.000     22.773 / 0.000     23.624 / 0.000
WEIGHT     23.181 / 0.000     26.752 / 0.000     29.990 / 0.000
OBJ        19.671 / 0.000     21.144 / 0.000     23.581 / 0.000
BMI        37.419 / 0.000     41.808 / 0.000     54.352 / 0.000
WAIST      25.189 / 0.000     31.245 / 0.000     40.390 / 0.000

[Fig. 4. Clustering result of GA K-means.]

Fig. 5. Sample screens of the prototype system (a) The input screen. (b) The result screen for recommendation.

Using the result of GA K-means clustering and the CBR algorithm, we constructed the recommendation system for the target shopping mall. The system was developed as a Web-based system using Microsoft ASP (Active Server Pages). Fig. 5 shows sample screens of our prototype system.

To validate the usefulness of the recommendation model using GA K-means and CBR, our prototype system generated two kinds of recommendation results: one generated randomly, and the other generated by our model combining GA K-means and CBR. The sequence of the visual presentation of these two results was changed randomly in order to offset the main testing effect, i.e., the effect of a prior observation on a later observation. To measure satisfaction with each recommendation result, we added a survey function that recorded the satisfaction level on a 7-point Likert scale. Using the prototype system, we conducted the survey for one month (April 2005) and collected 100 responses in total. Table 6 shows the results of the survey. As shown in Table 6, the respondents replied that they were quite satisfied (4.51 points) with the results recommended by our proposed model, although they felt the randomly recommended results were mediocre (3.76 points).

Table 6
The results of the survey

Recommendation method                   Average    Standard deviation
Randomly-chosen model                   3.76       1.342
Proposed model (GA K-means + CBR)       4.51       1.010

To examine whether the difference in the satisfaction levels of the two recommendation results is statistically significant, we applied the paired-samples t-test. In the test, the null hypothesis was $H_0: \mu_1 - \mu_2 = 0$ and the alternative hypothesis was $H_a: \mu_1 - \mu_2 \neq 0$, where $\mu_1$ was the satisfaction level of the proposed model and $\mu_2$ was that of the random model. Table 7 shows the results of the paired-samples t-test.

Table 7
The results of the paired-samples t-test

Paired differences: mean 0.75; standard deviation 1.604; standard error mean 0.1604;
95% confidence interval of the difference: [0.432, 1.068]
t-value: 4.675; degrees of freedom: 99; significance level (2-tailed): 0.0000

As shown in Table 7, the satisfaction level of the proposed model is higher than that of the random model at the 1% statistical significance level. This shows that GA K-means, as a preprocessing tool for a prediction model like CBR, may support satisfactory results and also reduce the search space for training.

5. Conclusions

This study suggests a new clustering algorithm, GA K-means. We applied it to a real-world case of market segmentation in electronic commerce, and found that GA K-means may result in better segmentation than other traditional clustering algorithms, including simple K-means and SOM, from the perspective of intraclass inertia. In addition, we empirically examined the usefulness of GA K-means as a preprocessing tool for a recommendation model.

However, this study has some limitations. Although we suggest intraclass inertia as a criterion for performance comparison, it is uncertain whether this is a complete measure for comparing the performance of clustering algorithms. Consequently, efforts to develop effective measures for comparing clustering algorithms should be made in future research.

Moreover, we arbitrarily set the number of clusters to five in this study. Unfortunately, few studies have proposed a mechanism to determine the optimal number of clusters, so it has usually been determined heuristically. Thus, attempts to adjust the number of clusters should be one of the focuses of future research. In addition, GA K-means needs to be applied to other domains in order to validate the generalizability of the proposed model.

References

Babu, G. P., & Murty, M. N. (1993). A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm. Pattern Recognition Letters, 14(10), 763–769.

Bauer, R. J. (1994). Genetic algorithms and investment strategies. New York: John Wiley & Sons.
Bradley, P. S., & Fayyad, U. M. (1998). Refining initial points for K-means clustering. In Proceedings of the 15th international conference on machine learning (pp. 91–99).
Cho, Y. H., & Kim, J. K. (2004). Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 233–246.
Cho, Y. H., Kim, J. K., & Kim, S. H. (2002). A personalized recommender system based on Web usage mining and decision tree induction. Expert Systems with Applications, 23(3), 329–342.
Davis, L. (1994). Handbook of genetic algorithms. New York: Van Nostrand Reinhold.
Gehrt, K. C., & Shim, S. (1998). A shopping orientation segmentation of French consumers: implications for catalog marketing. Journal of Interactive Marketing, 12(4), 34–46.
Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B., Herlocker, J., & Riedl, J. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the 16th national conference of the American Association of Artificial Intelligence (AAAI-99) (pp. 439–446). Orlando, FL, USA.
Green, S. B., Salkind, N. J., & Akey, T. M. (2000). Using SPSS for Windows (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2000). Explaining collaborative filtering recommendations. In Proceedings of the ACM conference on computer supported cooperative work (pp. 241–250). Philadelphia, USA.
Hertz, A., & Kobler, D. (2000). A framework for the description of evolutionary algorithms. European Journal of Operational Research, 126(1), 1–12.
Kim, K. (2004). Toward global optimization of case-based reasoning systems for financial forecasting. Applied Intelligence, 21(3), 239–249.
Kim, J. K., Cho, Y. H., Kim, W. J., Kim, J. R., & Suh, J. H. (2002). A personalized recommendation procedure for Internet shopping support. Electronic Commerce Research and Applications, 1(3–4), 301–313.
Kim, K.-S., & Han, I. (2001). The cluster-indexing method for case-based reasoning using self-organizing maps and learning vector quantization for bond rating cases. Expert Systems with Applications, 21(3), 147–156.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R., & Riedl, J. (1997). GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40(3), 77–87.
Kuo, R. J., An, Y. L., Wang, H. S., & Chung, W. J. (2006). Integration of self-organizing feature maps neural network and genetic K-means algorithm for market segmentation. Expert Systems with Applications, 30(2), 313–324.
Kuo, R. J., Chang, K., & Chien, S. Y. (2004). Integration of self-organizing feature maps and genetic-algorithm-based clustering method for market segmentation. Journal of Organizational Computing and Electronic Commerce, 14(1), 43–60.
Kuo, R. J., Liao, J. L., & Tu, C. (2005). Integration of ART2 neural network and genetic K-means algorithm for analyzing Web browsing paths in electronic commerce. Decision Support Systems, 40(2), 355–374.
Lletí, R., Ortiz, M. C., Sarabia, L. A., & Sánchez, M. S. (2004). Selecting variables for k-means cluster analysis by using a genetic algorithm that optimises the silhouettes. Analytica Chimica Acta, 515(1), 87–100.
Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455–1465.
Michalewicz, Z. (1996). Genetic algorithms + data structures = evolution programs (3rd ed.). Berlin: Springer-Verlag.
Michaud, P. (1997). Clustering techniques. Future Generation Computer Systems, 13(2–3), 135–147.
Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithms. Pattern Recognition Letters, 17(8), 825–832.
Pazzani, M. J. (1999). A framework for collaborative, content-based and demographic filtering. Artificial Intelligence Review, 13(5–6), 393–408.
Pena, J. M., Lozano, J. A., & Larranaga, P. (1999). An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recognition Letters, 20(10), 1027–1040.
Resnick, P., Iacovou, N., Suchak, M., & Bergstrom, P. (1994). GroupLens: an open architecture for collaborative filtering of netnews. In Proceedings of the ACM conference on computer supported cooperative work (pp. 175–186).
Shin, K. S., & Han, I. (1999). Case-based reasoning supported by genetic algorithms for corporate bond rating. Expert Systems with Applications, 16(2), 85–95.
Velido, A., Lisboa, P. J. G., & Meehan, K. (1999). Segmentation of the on-line shopping market using neural networks. Expert Systems with Applications, 17(4), 303–314.
Wedel, M., & Kamakura, W. A. (1998). Market segmentation: concepts and methodological foundations. Boston: Kluwer Academic Publishers.
Wong, F., & Tan, C. (1994). Hybrid neural, genetic and fuzzy systems. In G. J. Deboeck (Ed.), Trading on the edge (pp. 245–247). New York: John Wiley & Sons.
